Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open-source platforms such as GitHub, there have been few attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices followed in high-quality ML projects, and it can also serve as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.
Key use cases of our Large Language Model (LLM) Data:
- Text generation
- Chatbots and virtual assistants
- Machine translation
- Sentiment analysis
- Speech recognition
- Content summarization
Why choose FileMarket's data:
- Object Detection Data: Essential for training AI in image and video analysis.
- Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.
- Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.
- Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.
FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
I created this dataset as part of a data analysis project and concluded that it might be relevant for others interested in analyzing content on YouTube. The dataset is a collection of over 6,000 videos with the following columns:
Comments: comment count for the video
Using Python and the YouTube API, I collected data about videos from popular channels that provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular within the last couple of years (a minimal sketch of this collection step follows the channel list below). Featured in the dataset are the following creators:
Krish Naik
Nicholas Renotte
Sentdex
DeepLearningAI
Artificial Intelligence — All in One
Siraj Raval
Jeremy Howard
Applied AI Course
Daniel Bourke
Jeff Heaton
DeepLearning.TV
Arxiv Insights
These channels are featured in multiple "top AI channels to subscribe to" lists and have seen significant growth on YouTube over the last couple of years. All of them were created in or before 2018.
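For reference, below is a hypothetical sketch of such a collection step using the YouTube Data API v3 via the google-api-python-client package; the API key and example channel are placeholders, and the original collection script may differ.

    from googleapiclient.discovery import build

    # Hypothetical sketch: the API key and channel are placeholders,
    # not the author's actual collection script.
    youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

    # Each channel exposes an "uploads" playlist containing all of its videos.
    channel = youtube.channels().list(part="contentDetails", forUsername="sentdex").execute()
    uploads = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

    items = youtube.playlistItems().list(part="contentDetails", playlistId=uploads, maxResults=50).execute()
    video_ids = [i["contentDetails"]["videoId"] for i in items["items"]]

    # Per-video statistics such as view and comment counts.
    stats = youtube.videos().list(part="snippet,statistics", id=",".join(video_ids)).execute()
    for v in stats["items"]:
        print(v["snippet"]["title"], v["statistics"].get("commentCount"))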
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a minimal version of the DynaBench dataset, containing the first 5% of the data. The full dataset is available at https://professor-x.de/dynabench
Abstract:
Previous work on learning physical systems from data has focused on high-resolution grid-structured measurements. However, real-world knowledge of such systems (e.g. weather data) relies on sparsely scattered measuring stations. In this paper, we introduce a novel simulated benchmark dataset, DynaBench, for learning dynamical systems directly from sparsely scattered data without prior knowledge of the equations. The dataset focuses on predicting the evolution of a dynamical system from low-resolution, unstructured measurements. We simulate six different partial differential equations covering a variety of physical systems commonly used in the literature and evaluate several machine learning models, including traditional graph neural networks and point cloud processing models, with the task of predicting the evolution of the system. The proposed benchmark dataset is expected to advance the state of the art as an out-of-the-box, easy-to-use tool for evaluating models in a setting where only unstructured low-resolution observations are available. The benchmark is available at https://professor-x.de/dynabench.
Technical Info
The dataset is split into 42 parts (6 equations x 7 combinations of resolution/structure). Each part can be downloaded separately and contains 7,000 simulations of the given equation at the given resolution and structure. The simulations are grouped into chunks of 500 simulations saved in the HDF5 file format. Each chunk contains the variable "data", where the values of the simulated system are stored, as well as the variable "points", where the coordinates at which the system was observed are stored. For more details visit the DynaBench website at https://professor-x.de/dynabench/. The dataset is best used as part of the dynabench Python package available at https://pypi.org/project/dynabench/.
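For illustration, one chunk can be read directly with h5py as in the sketch below; the file name is hypothetical, and the dynabench package wraps this access in any case.

    import h5py

    # File name is hypothetical; each chunk stores 500 simulations.
    with h5py.File("advection_chunk_0.h5", "r") as f:
        data = f["data"][...]      # values of the simulated system
        points = f["points"][...]  # coordinates at which the system was observed
    print(data.shape, points.shape)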
This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data. See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Top 250 TV Shows: Ratings, Episodes, and Certifications
This dataset features the top 250 TV shows, providing comprehensive information on their titles, certifications, number of episodes, start years, IMDb ratings, and rating counts. This dataset is perfect for enthusiasts, researchers, and data scientists who are interested in analyzing trends and patterns in popular television shows.
This dataset was compiled to support data science projects and analyses related to television shows. The data is sourced from IMDb to ensure accuracy and comprehensiveness.
MIT License
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
Metadata includes
reviews
purchases, plays, recommends (likes)
product bundles
pricing information
Basic Statistics:
Reviews: 7,793,069
Users: 2,567,538
Items: 15,474
Bundles: 615
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset simulates various aspects of construction project monitoring over time, designed for time-series analysis and optimization studies. It contains 50,000 records representing data points collected at 1-minute intervals. The dataset includes diverse features related to project management, environmental conditions, resource utilization, safety, and performance evaluation.
Features:
- timestamp: The recorded time of the observation.
- temperature: Ambient temperature at the construction site (°C).
- humidity: Relative humidity at the construction site (%).
- vibration_level: Measured vibration level of machinery or equipment (Hz).
- material_usage: Quantity of materials utilized during the period (kg).
- machinery_status: Binary status indicating machinery activity (1 = Active, 0 = Idle).
- worker_count: Number of workers on-site during the period.
- energy_consumption: Energy consumption recorded for machinery and operations (kWh).
- task_progress: Cumulative percentage progress of tasks (%).
- cost_deviation: Financial deviation from the planned budget (USD).
- time_deviation: Schedule deviation from planned timelines (days).
- safety_incidents: Number of safety-related incidents reported.
- equipment_utilization_rate: Utilization rate of machinery and equipment (%).
- material_shortage_alert: Binary alert for material shortage (1 = Alert, 0 = No Alert).
- risk_score: Computed risk score for the project (%).
- simulation_deviation: Percentage deviation of simulated vs. actual outcomes (%).
- update_frequency: Suggested interval for project status updates (minutes).
- optimization_suggestion: Suggested optimization actions for the project.
- performance_score: Categorical performance evaluation of the project based on several metrics (Poor, Average, Good, Excellent).
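As a brief, hypothetical example of working with these features, the following sketch loads the CSV (file name assumed) and resamples a few of the 1-minute signals to hourly means:

    import pandas as pd

    # File name assumed. Parse the 1-minute timestamps and resample a few
    # signals to hourly means for trend analysis.
    df = pd.read_csv("construction_monitoring.csv", parse_dates=["timestamp"])
    hourly = (
        df.set_index("timestamp")[["temperature", "energy_consumption", "task_progress"]]
          .resample("1h")
          .mean()
    )
    print(hourly.head())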
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for this project consists of photos of the buildings of the University of Salford, taken with a mobile phone camera from different angles and at different distances. Even though this task sounds easy, it encountered some challenges, which are summarized below:
1. Obstacles.
a. Fixed or unremovable objects.
When taking several photos of a building or a landscape from different angles and directions, some of these angles are blocked by a fixed object such as trees and plants, light poles, signs, statues, cabins, bicycle shelters, scooter stands, generators/transformers, construction barriers, construction equipment, and other service equipment. It is therefore unavoidable that some photos include these objects, which raises three questions:
- Will these objects confuse the model/application we intend to create? That is, will an obstacle prevent the model/application from identifying the designated building?
- Or will the photos be more precise with these objects included, giving the model/application the capability to identify the buildings even with these obstacles present?
- What is the maximum detection distance? In other words, how far can the mobile device running the application be from the building and still detect the designated building?
b. Removable and moving objects.
- Any university is crowded with staff and students, especially during the rush hours of the day, so it is hard to take some photos without people appearing in them at certain times. Due to privacy concerns and out of respect for those individuals, such photos are better excluded.
- Parked vehicles, trolleys, and service equipment can be obstacles and may appear in these images; they can also block access to some areas, so an image from a certain angle cannot be obtained.
- Animals such as dogs, cats, birds, or even squirrels cannot be avoided in some photos, which raises the same questions as above.
2. Weather.
In a deep learning project, more data means more accuracy and less error. At this stage of our project it was agreed to have 50 photos per building; increasing the number of photos would give more accurate results, but due to the time limit of this project the number was fixed at 50 per building.
These photos were taken on cloudy days. To expand this work in the future (as future work and recommendations), photos on sunny, rainy, foggy, snowy, and other weather-condition days can be included.
Photos at different times of the day, such as night, dawn, and sunset, can also be included, to give our designated model every possibility of identifying these buildings in all available circumstances.
University House: 60 images. The Peel Building is an important landmark of the University of Salford due to its distinct and striking exterior design, but unfortunately it was excluded from the selection because of maintenance activities at the time the photos for this project were collected: it was partially covered with scaffolding, with a lot of movement of personnel and equipment. If the supervisor suggests this as another challenge to include in the project, then its photos must be collected. There are many other buildings at the University of Salford, and again, to expand the project in the future, all of the university's buildings could be included. The full list of the university's buildings can be reviewed via the interactive map at: www.salford.ac.uk/find-us
Expand Further. This project can be improved with many more capabilities; again, due to the time limit given to this project, these improvements can be implemented later as future work. In simple words, this project is to create an application that can display a building's name when a mobile device with a camera is pointed at that building. Future features to be added:
a. Address/location: this will require collecting additional data, namely the longitude and latitude of each building included, or the postcode (which may be shared between buildings, considering how close they appear on interactive map applications such as Google Maps, Google Earth, or iMaps).
b. Description of the building: what is the building for, which school occupies it, and what facilities does it contain?
c. Interior images: all the photos at this stage were taken of the exteriors of the buildings. Will interior photos make an impact on the model/application? For example, if the user is inside Newton or Chapman and opens the application, will the building be identified, especially since the interiors of these buildings have a high level of similarity across corridors, rooms, halls, and labs? Will furniture and assets act as obstacles or as identification marks?
d. Directions to a specific area/floor inside the building: if interior images succeed with the model/application, it would be a good idea to add a search option that can guide the user to a specific area, showing directions to it. For example, if the user is inside the Newton Building and searches for Lab 141, the application will direct them to the first floor with an interactive arrow that updates as the user approaches the destination. Alternatively, if the application can identify the building from its interior, a drop-down list could be activated for each floor of the building: for example, if the model/application identifies the Newton Building, pressing the drop-down list would present interactive tabs for each floor, and selecting a floor by clicking its tab would display the facilities on that floor. Furthermore, if the model/application identifies another building, it should activate a different number of floors, as buildings differ in their number of floors. This feature could be improved with a voice assistant that directs the user after a search (similar to the voice assistant in Google Maps, but applied to the interiors of the university's buildings).
e. Top view: if a drone with a camera can be afforded, it can provide aerial images and top views of the buildings to add to the model/application. However, these images may face the same situation as the interior images: the buildings can look similar to each other from the top, with other obstacles included, such as water tanks and AC units.
Other Questions:
Will the model/application be reproducible? The presumed answer should be yes, provided the model/application is fed the proper data (images), such as images of restaurants, schools, supermarkets, hospitals, government facilities, etc.
This is a multi-modal dataset for restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as metadata for each restaurant.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context
Travel is a diverse and vibrant industry, and India, with its rich cultural heritage and varied landscapes, offers a myriad of experiences for travelers. The India Travel Recommender System Dataset is designed to facilitate the development of personalized travel recommendation systems. This dataset provides an extensive compilation of travel destinations across India, along with user profiles, reviews, and historical travel data. It's an invaluable resource for anyone looking to create AI-powered travel applications focused on the Indian subcontinent.
Content
The dataset is divided into four primary components:
Destinations: Information about various travel destinations in India, including details like type of destination (beach, mountain, historical site, etc.), popularity, and best time to visit.
Users: Profiles of users including their preferences and demographic information. This dataset has been enriched with gender diversity and includes details on the number of adults and children for travel.
Reviews: User-generated reviews and ratings for the different destinations, offering insights into visitor experiences and satisfaction.
User History: Records of users' past travel experiences, including destinations visited and ratings provided.
Each of these components is presented in a separate CSV file, allowing for easy integration and manipulation in data processing and machine learning workflows.
Acknowledgements
This dataset was generated for educational and research purposes and is intended to be used in hackathons, academic projects, and by AI enthusiasts aiming to enhance the travel experience through technology.
Inspiration
The dataset is perfect for exploring a variety of questions and tasks, such as:
- Building a recommendation engine to suggest travel destinations based on user preferences (a minimal sketch follows this list).
- Analyzing travel trends in India.
- Understanding the relationship between user demographics and travel preferences.
- Sentiment analysis of travel destination reviews.
- Forecasting the popularity of travel destinations based on historical data.
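As a minimal illustration of the first task, the sketch below builds a simple user-based collaborative-filtering recommender from the review ratings; the file and column names (reviews.csv, UserID, DestinationID, Rating) are assumptions, so adjust them to the actual CSV headers.

    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity

    # File and column names are assumptions; adjust to the actual CSV headers.
    reviews = pd.read_csv("reviews.csv")
    ratings = reviews.pivot_table(index="UserID", columns="DestinationID", values="Rating").fillna(0)

    # User-user cosine similarity: recommend destinations liked by similar users.
    sim = pd.DataFrame(cosine_similarity(ratings), index=ratings.index, columns=ratings.index)

    def recommend(user_id, k=5):
        peers = sim[user_id].drop(user_id).nlargest(10).index      # ten most similar users
        scores = ratings.loc[peers].mean().sort_values(ascending=False)
        return scores[ratings.loc[user_id] == 0].head(k)           # skip already-rated places

    print(recommend(user_id=1))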
We encourage Kaggle users to explore this dataset to uncover unique insights and develop innovative solutions in the realm of travel technology. Whether you're a data scientist, a student, or a travel tech enthusiast, this dataset offers a wealth of opportunities for exploration and creativity.
This dataset is free to use for non-commercial purposes. For commercial use, please contact the dataset provider. Remember to cite the source when using this dataset in your projects.
CC0: Public Domain - The dataset is in the public domain and can be used without restrictions.
This submission contains shapefiles, geotiffs, and symbology for the revised-from-Play-Fairway potential structures/structural settings used in the Nevada Geothermal Machine Learning project. Layers include potential structural setting ellipses, centroids, and distance-to-centroid raster. A submission linking the full GitHub repository for our machine learning Jupyter Notebooks will appear in the related datasets section of this page once available.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global AI Dataset Search Platform market size reached USD 1.87 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 27.6% during the forecast period, reaching an estimated USD 16.17 billion by 2033. This remarkable growth is primarily attributed to the escalating demand for high-quality, diverse, and scalable datasets required to train advanced artificial intelligence and machine learning models across various industries. The proliferation of AI-driven applications and the increasing emphasis on data-centric AI development are key growth factors propelling the adoption of AI dataset search platforms globally.
The surge in AI adoption across sectors such as healthcare, BFSI, retail, automotive, and education is fueling the need for efficient and reliable dataset discovery solutions. Organizations are increasingly recognizing that the success of AI models hinges on the quality and relevance of the training data, leading to a surge in investments in dataset search platforms that offer advanced filtering, metadata tagging, and data governance capabilities. The integration of AI dataset search platforms with cloud infrastructures further streamlines data access, collaboration, and compliance, making them indispensable tools for enterprises aiming to accelerate AI innovation. The growing complexity of AI projects, coupled with the exponential growth in data volumes, is compelling organizations to seek platforms that can automate and optimize the process of dataset discovery and curation.
Another significant growth factor is the rapid evolution of AI regulations and data privacy frameworks worldwide. As data governance becomes a top priority, AI dataset search platforms are evolving to include robust features for data lineage tracking, access control, and compliance with regulations such as GDPR, HIPAA, and CCPA. The ability to ensure ethical sourcing and transparent usage of datasets is increasingly valued by enterprises and academic institutions alike. This regulatory landscape is driving the adoption of platforms that not only facilitate efficient dataset search but also enable organizations to demonstrate accountability and compliance in their AI initiatives.
The expanding ecosystem of AI developers, data scientists, and machine learning engineers is also contributing to the market's growth. The democratization of AI development, supported by open-source frameworks and cloud-based collaboration tools, has increased the demand for platforms that can aggregate, index, and provide easy access to diverse datasets. AI dataset search platforms are becoming central to fostering innovation, reducing development cycles, and enabling cross-domain research. As organizations strive to stay ahead in the competitive AI landscape, the ability to quickly identify and utilize optimal datasets is emerging as a critical differentiator.
From a regional perspective, North America currently dominates the AI dataset search platform market, accounting for over 38% of global revenue in 2024, driven by the strong presence of leading AI technology companies, active research communities, and significant investments in digital transformation. Europe and Asia Pacific are also witnessing rapid adoption, with Asia Pacific expected to exhibit the highest CAGR of 29.3% during the forecast period, fueled by government initiatives, burgeoning AI startups, and increasing digitalization across industries. Latin America and the Middle East & Africa are gradually embracing AI dataset search platforms, supported by growing awareness and investments in AI research and infrastructure.
The AI Dataset Search Platform market is segmented by component into Software and Services. Software solutions constitute the backbone of this market, providing the core functionalities required for dataset discovery, indexing, metadata management, and integration with existing AI workflows. The software segment is witnessing robust growth as organizations seek advanced platforms capable of handling large-scale, multi-source datasets with sophisticated search capabilities powered by natural language processing and machine learning algorithms. These platforms are increasingly incorporating features such as semantic search, automated data labeling, and customizable data pipelines, enabling users to eff
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Corpus Nummorum - Coin Image Dataset
This dataset is a collection of ancient coin images from three different sources: the Corpus Nummorum (CN) project, the Münzkabinett Berlin, and the Bibliothèque nationale de France, Département des Monnaies, médailles et antiques. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, the Troad, and Mysia. Due to copyright, this is a selection of the coins published on the CN portal.
The dataset contains 115,160 images of about 29,000 unique coins. The images are split into three main folders with different groupings of the coins, and each main folder is organized into subfolders that hold the coin images. The "dataset_coins" folder contains the coin photos divided into obverse and reverse and arranged by coin type. In the "dataset_types" folder, the obverse and reverse images of each coin are concatenated and transformed to a square format with black bars at the top and bottom; the images here are sorted by coin type. The last folder, "dataset_mints", contains the concatenated images sorted by mint. A "sources" CSV file holds the source for every image. Due to copyright, the image size is limited to 299×299 pixels; however, this should be sufficient for most ML approaches.
The main purpose of this dataset in the CN project is the training of machine-learning-based image recognition models. We use three different Convolutional Neural Network architectures: VGG16, VGG19, and ResNet50. Our best model (VGG16) achieves 79% Top-1 and 97% Top-5 accuracy on this dataset for coin type recognition; mint recognition achieves 79% Top-1 and 94% Top-5 accuracy. We have a Colab notebook with two models (trained on the whole CN dataset) online.
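For readers who want a starting point, the following is a minimal sketch of this kind of transfer-learning setup using torchvision's pre-trained VGG16 on the "dataset_types" folder; it is not the project's exact training code.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    # Assumed layout: dataset_types/<type_id>/<image>.jpg
    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_set = datasets.ImageFolder("dataset_types", transform=tfm)
    loader = DataLoader(train_set, batch_size=32, shuffle=True)

    # ImageNet-pretrained VGG16 with the last layer replaced by one output per coin type.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    model.classifier[6] = nn.Linear(4096, len(train_set.classes))

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for images, labels in loader:  # one epoch shown; train longer in practice
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()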
During the summer semester of 2023, we held the "Data Challenge" event at our Department of Computer Science at Goethe University. We gave our students this dataset with the task of achieving better results than ours. Here are their experiments:
Team 1: Voting and stacking of models
Team 4: Dockerized TIMM Computer Vision Backend & FastAPI
Now we would like to invite you to try out your own ideas and models on our coin data.
If you have any questions or suggestions, please, feel free to contact us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).
The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.
The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.
The dataset has a tabular structure and was initially stored in CSV format. It contains:
Rows: 7,043 customer records
Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).
Naming Convention:
The table in the database is named telco_customer_churn_data.
Software Requirements:
To open and work with the dataset, any standard database client or programming language supporting MariaDB connections can be used (e.g., Python).
For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
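A minimal sketch of this access path (the connection string is a placeholder; substitute the actual DBRepo credentials, host, and database name):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder DSN (requires the PyMySQL driver); substitute real
    # credentials, host, and database name for the DBRepo instance.
    engine = create_engine("mysql+pymysql://user:password@host:3306/database")
    df = pd.read_sql_table("telco_customer_churn_data", engine)
    print(df.shape)  # expected: (7043, 21)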
Additional Resources:
Source code for data loading, preprocessing, model training, and evaluation is available at the associated GitHub repository: https://github.com/nazerum/fair-ml-customer-churn
When reusing the dataset, users should be aware of the following:
Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).
Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is used to address the Research Questions in the study in Transactions on Software Engineering: "Within-Project Defect Prediction of Infrastructure-as-Code using Product and Process Metrics". See also: https://github.com/stefanodallapalma/TSE-2020-05-0217. It provides:
* repositories.json - a list of repositories selected from open-source GitHub repositories based on the Ansible language.
* fixing-commits.json - a list of defect-fixing commits extracted from those repositories.
* fixed-files.json - a list of Ansible files fixed in those defect-fixing commits and the respective bug-inducing commits.
* failure-prone-files.json - a list of failure-prone files through the repositories' commit histories.
* metrics.zip - CSV files consisting of releases (sets of files) and their IaC-oriented, delta, and process metrics extracted from each analyzed repository.
* projects.zip - for each analyzed project, the data (models, performance, and results of Recursive Feature Elimination) used to answer the Research Questions.
Context
Infrastructure-as-Code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools. On the one hand, although IaC is an ever more widely adopted practice nowadays, little is known about how to best maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion. On the other hand, source code measurements are often computed and analyzed to evaluate the different quality aspects of the software developed. In particular, Infrastructure-as-Code is simply "code", and as such it is prone to defects like any other programming language. This dataset targets the YAML-based Ansible language to devise within-project defect prediction approaches for IaC based on machine learning.
Content
The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:
* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 10% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yaml file;
* The repository has a comments ratio of at least 0.1%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.01 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 100 source lines of code.
Metrics are grouped into three categories:
* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts.
* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric.
* Process: metrics that capture aspects of the development process rather than aspects of the code itself.
In addition to the metrics, the dataset contains the pre-trained models (*.joblib) in the folders rq1 and rq2 of projects.zip.
You can load a model in Python as follows:

    from joblib import load

    model = load('projects/owner/repository/rq1/random_forest.joblib', mmap_mode='r')
    best_estimator = model['estimator']   # The estimator that maximized the AUC-PR
    cv_results = model['cv_results']      # The results of each step of the validation procedure
    best_index = model['best_index']      # The index to access the best cv_results
Acknowledgements
This work is supported by the European Commission grant no. 825040 (RADON H2020).
Inspiration
What source code properties, and what properties of the development process, are good predictors of defects in Infrastructure-as-Code scripts?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Gender Classification Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/elakiricoder/gender-classification-dataset on 12 November 2021.
--- Dataset description provided by original source is as follows ---
While I was practicing machine learning, I wanted to create a simple dataset that is closely aligned with a real-world scenario and gives good results, to whet my appetite for this domain. If you are a beginner who wants to try solving classification problems in machine learning, and you prefer achieving good results, try using this dataset in your projects; it is a great place to start.
This dataset contains 7 features and a label column.
- long_hair - This column contains 0's and 1's, where 1 is "long hair" and 0 is "not long hair".
- forehead_width_cm - The width of the forehead, in centimetres.
- forehead_height_cm - The height of the forehead, in centimetres.
- nose_wide - This column contains 0's and 1's, where 1 is "wide nose" and 0 is "not wide nose".
- nose_long - This column contains 0's and 1's, where 1 is "long nose" and 0 is "not long nose".
- lips_thin - This column contains 0's and 1's, where 1 represents "thin lips" and 0 is "not thin lips".
- distance_nose_to_lip_long - This column contains 0's and 1's, where 1 represents a "long distance between nose and lips" and 0 a "short distance between nose and lips".
- gender - This is either "Male" or "Female".
Nothing to acknowledge, as this is just made-up data.
It's painful to see bad results at the beginning. Don't begin with complicated datasets if you are a beginner. I'm sure that this dataset will encourage you to proceed further in the domain. Good luck.
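As a quick starting point (our sketch, not part of the original description), a first classifier can be trained as follows; the CSV file name is assumed:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # File name assumed; the CSV holds the 7 feature columns plus "gender".
    df = pd.read_csv("gender_classification_v7.csv")
    X, y = df.drop(columns=["gender"]), df["gender"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))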
--- Original source retains full ownership of the source dataset ---
This submission contains geotiffs, supporting shapefiles and readmes for the inputs and output models of algorithms explored in the Nevada Geothermal Machine Learning project, meant to accompany the final report. Layers include: Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk), input rasters of feature sets, and positive/negative training sites. See readme .txt files and final report for additional metadata. A submission linking the full codebase for generating machine learning output models is available under "related resources" on this page.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was compiled as part of a project to design a cold storage system to combat post-harvest food loss in developing regions by integrating IoT technology with predictive machine learning. The project, which was developed for use by smallholder farmers in Uganda, aims to monitor and proactively control the environmental conditions in cold storage units to extend the shelf life of perishable goods.
This dataset is specifically structured for machine learning applications, serving as training and validation data for machine learning models. It contains environmental data points collected in a controlled cold storage environment. The data is organized into a comma-separated values (CSV) file with a total of 10,996 entries in the following six columns:
Fruit: A categorical variable indicating the type of fruit being stored (e.g., Orange, Pineapple, Banana, Tomato).
Temp: The temperature inside the cold storage unit, measured in degrees Celsius (°C).
Humid: The relative humidity (RH) of the environment, measured as a percentage (%).
Light: The intensity of light exposure, measured in Lux.
CO2: The concentration of carbon dioxide (CO₂) in the air, measured in parts per million (ppm).
Class: A binary classification label (Good or Bad) that serves as the target variable for the predictive model, indicating whether the environmental conditions are optimal or suboptimal for spoilage prevention.
The data's primary purpose is to provide a basis for training predictive models to classify environmental conditions and assess spoilage risk.
The dataset is a valuable resource for researchers and practitioners in fields such as smart agriculture, food science, embedded systems, and machine learning. It can be used to:
Train, validate, and test new predictive models for food spoilage.
Analyze the correlation between specific environmental factors (temperature, humidity, CO2, and light) and fruit spoilage outcomes.
Support the development of low-cost, intelligent monitoring systems for cold chain logistics and food preservation.
This dataset and the associated project are intended to contribute to achieving the United Nations Sustainable Development Goals (SDGs), particularly those related to food security and sustainable agriculture.
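As a hypothetical example of the intended training use, the sketch below fits a classifier on the six columns described above; the file name is assumed:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # File name assumed; columns follow the description above.
    df = pd.read_csv("cold_storage_conditions.csv")
    X = pd.get_dummies(df[["Fruit", "Temp", "Humid", "Light", "CO2"]], columns=["Fruit"])
    y = df["Class"]  # "Good" or "Bad"

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))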
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: As part of the "Cucumber Disease Detection" project, a machine learning model for the automatic detection of diseases in cucumber plants is being developed. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visualizations such as scatter plots and histograms, and was checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development The CNN model's architecture consists of layers, units, and activation operations. On the basis of experimentation, hyperparameters including learning rate, batch size, and optimizer were chosen. To avoid overfitting, regularization methods like dropout and L2 regularization were used.
Model Training: During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
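A minimal sketch of such a training setup is shown below; the architecture, image size, and directory layout are illustrative assumptions rather than the project's exact configuration.

    import tensorflow as tf

    # Illustrative directory layout: images/<class_name>/<image>.jpg
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "images", validation_split=0.2, subset="training", seed=42, image_size=(128, 128))
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "images", validation_split=0.2, subset="validation", seed=42, image_size=(128, 128))

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),  # regularization, as described above
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Early stopping and checkpoints, as described above.
    callbacks = [
        tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
        tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
    ]
    model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=callbacks)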
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion: Key project findings include model performance and disease detection precision; a comparison of the various models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project and the methods used to solve them.
Conclusion: A recap of the project's key learnings. The project's importance for early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111