Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open-source platforms such as GitHub, there have been few attempts to filter those projects and curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices followed in high-quality ML projects, and it can also serve as a benchmark for classifiers designed to identify engineered ML projects.
GitHub page: https://github.com/soarsmu/NICHE
FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.
Key use cases of our Large Language Model (LLM) Data:
- Text generation
- Chatbots and virtual assistants
- Machine translation
- Sentiment analysis
- Speech recognition
- Content summarization
Why choose FileMarket's data:
- Object Detection Data: Essential for training AI in image and video analysis.
- Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.
- Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.
- Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.
FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
I created this dataset as part of a data analysis project and concluded that it might be relevant for others interested in analyzing content on YouTube. The dataset is a collection of over 6,000 videos with the following columns:
Comments: comment count for the video
Using Python and the YouTube API, I collected data about videos from popular channels that provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular within the last couple of years (a minimal sketch of this collection step follows the channel list below). Featured in the dataset are the following creators:
Krish Naik
Nicholas Renotte
Sentdex
DeepLearningAI
Artificial Intelligence — All in One
Siraj Raval
Jeremy Howard
Applied AI Course
Daniel Bourke
Jeff Heaton
DeepLearning.TV
Arxiv Insights
These channels are featured in multiple "top AI channels to subscribe to" lists and have seen significant growth on YouTube over the last couple of years. All of them were created in or before 2018.
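For reference, below is a hypothetical sketch of such a collection step using the YouTube Data API v3 via the google-api-python-client package; the API key and example channel are placeholders, and the original collection script may differ.

    from googleapiclient.discovery import build

    # Hypothetical sketch: the API key and channel are placeholders,
    # not the author's actual collection script.
    youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

    # Each channel exposes an "uploads" playlist containing all of its videos.
    channel = youtube.channels().list(part="contentDetails", forUsername="sentdex").execute()
    uploads = channel["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]

    items = youtube.playlistItems().list(part="contentDetails", playlistId=uploads, maxResults=50).execute()
    video_ids = [i["contentDetails"]["videoId"] for i in items["items"]]

    # Per-video statistics such as view and comment counts.
    stats = youtube.videos().list(part="snippet,statistics", id=",".join(video_ids)).execute()
    for v in stats["items"]:
        print(v["snippet"]["title"], v["statistics"].get("commentCount"))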
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is a minimal version of the DynaBench dataset, containing the first 5% of the data. The full dataset is available at https://professor-x.de/dynabench
Abstract:
Previous work on learning physical systems from data has focused on high-resolution grid-structured measurements. However, real-world knowledge of such systems (e.g. weather data) relies on sparsely scattered measuring stations. In this paper, we introduce a novel simulated benchmark dataset, DynaBench, for learning dynamical systems directly from sparsely scattered data without prior knowledge of the equations. The dataset focuses on predicting the evolution of a dynamical system from low-resolution, unstructured measurements. We simulate six different partial differential equations covering a variety of physical systems commonly used in the literature and evaluate several machine learning models, including traditional graph neural networks and point cloud processing models, with the task of predicting the evolution of the system. The proposed benchmark dataset is expected to advance the state of the art as an out-of-the-box, easy-to-use tool for evaluating models in a setting where only unstructured low-resolution observations are available. The benchmark is available at https://professor-x.de/dynabench.
Technical Info
The dataset is split into 42 parts (6 equations x 7 combinations of resolution/structure). Each part can be downloaded separately and contains 7,000 simulations of the given equation at the given resolution and structure. The simulations are grouped into chunks of 500 simulations saved in the HDF5 file format. Each chunk contains the variable "data", where the values of the simulated system are stored, as well as the variable "points", where the coordinates at which the system was observed are stored. For more details visit the DynaBench website at https://professor-x.de/dynabench/. The dataset is best used as part of the dynabench Python package available at https://pypi.org/project/dynabench/.
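For illustration, one chunk can be read directly with h5py as in the sketch below; the file name is hypothetical, and the dynabench package wraps this access in any case.

    import h5py

    # File name is hypothetical; each chunk stores 500 simulations.
    with h5py.File("advection_chunk_0.h5", "r") as f:
        data = f["data"][...]      # values of the simulated system
        points = f["points"][...]  # coordinates at which the system was observed
    print(data.shape, points.shape)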
This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data. See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Top 250 TV Shows: Ratings, Episodes, and Certifications
This dataset features the top 250 TV shows, providing comprehensive information on their titles, certifications, number of episodes, start years, IMDb ratings, and rating counts. This dataset is perfect for enthusiasts, researchers, and data scientists who are interested in analyzing trends and patterns in popular television shows.
This dataset was compiled to support data science projects and analyses related to television shows. The data is sourced from IMDb to ensure accuracy and comprehensiveness.
MIT License
These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.
Metadata includes
reviews
purchases, plays, recommends (likes)
product bundles
pricing information
Basic Statistics:
Reviews: 7,793,069
Users: 2,567,538
Items: 15,474
Bundles: 615
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset simulates various aspects of construction project monitoring over time, designed for time-series analysis and optimization studies. It contains 50,000 records representing data points collected at 1-minute intervals. The dataset includes diverse features related to project management, environmental conditions, resource utilization, safety, and performance evaluation.
Features:
- timestamp: The recorded time of the observation.
- temperature: Ambient temperature at the construction site (°C).
- humidity: Relative humidity at the construction site (%).
- vibration_level: Measured vibration level of machinery or equipment (Hz).
- material_usage: Quantity of materials utilized during the period (kg).
- machinery_status: Binary status indicating machinery activity (1 = Active, 0 = Idle).
- worker_count: Number of workers on-site during the period.
- energy_consumption: Energy consumption recorded for machinery and operations (kWh).
- task_progress: Cumulative percentage progress of tasks (%).
- cost_deviation: Financial deviation from the planned budget (USD).
- time_deviation: Schedule deviation from planned timelines (days).
- safety_incidents: Number of safety-related incidents reported.
- equipment_utilization_rate: Utilization rate of machinery and equipment (%).
- material_shortage_alert: Binary alert for material shortage (1 = Alert, 0 = No Alert).
- risk_score: Computed risk score for the project (%).
- simulation_deviation: Percentage deviation of simulated vs. actual outcomes (%).
- update_frequency: Suggested interval for project status updates (minutes).
- optimization_suggestion: Suggested optimization actions for the project.
- performance_score: Categorical performance evaluation of the project based on several metrics (Poor, Average, Good, Excellent).
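As a brief, hypothetical example of working with these features, the following sketch loads the CSV (file name assumed) and resamples a few of the 1-minute signals to hourly means:

    import pandas as pd

    # File name assumed. Parse the 1-minute timestamps and resample a few
    # signals to hourly means for trend analysis.
    df = pd.read_csv("construction_monitoring.csv", parse_dates=["timestamp"])
    hourly = (
        df.set_index("timestamp")[["temperature", "energy_consumption", "task_progress"]]
          .resample("1h")
          .mean()
    )
    print(hourly.head())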
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for this project consists of photos of the buildings of the University of Salford, taken with a mobile phone camera from different angles and at different distances. Even though this task sounds easy, it encountered some challenges, which are summarized below:
1. Obstacles.
a. Fixed or unremovable objects.
When taking several photos of a building or a landscape from different angles and directions, some of these angles are blocked by a fixed object such as trees and plants, light poles, signs, statues, cabins, bicycle shelters, scooter stands, generators/transformers, construction barriers, construction equipment, and other service equipment. It is therefore unavoidable that some photos include these objects, which raises three questions:
- Will these objects confuse the model/application we intend to create? That is, will an obstacle prevent the model/application from identifying the designated building?
- Or will the photos be more precise with these objects included, giving the model/application the capability to identify the buildings even with these obstacles present?
- What is the maximum detection distance? In other words, how far can the mobile device running the application be from the building and still detect the designated building?
b. Removable and moving objects.
- Any university is crowded with staff and students, especially during the rush hours of the day, so it is hard to take some photos without people appearing in them at certain times. Due to privacy concerns and out of respect for those individuals, such photos are better excluded.
- Parked vehicles, trolleys, and service equipment can be obstacles and may appear in these images; they can also block access to some areas, so an image from a certain angle cannot be obtained.
- Animals such as dogs, cats, birds, or even squirrels cannot be avoided in some photos, which raises the same questions as above.
2. Weather.
In a deep learning project, more data means more accuracy and less error. At this stage of our project it was agreed to have 50 photos per building; increasing the number of photos would give more accurate results, but due to the time limit of this project the number was fixed at 50 per building.
These photos were taken on cloudy days. To expand this work in the future (as future work and recommendations), photos on sunny, rainy, foggy, snowy, and other weather-condition days can be included.
Photos at different times of the day, such as night, dawn, and sunset, can also be included, to give our designated model every possibility of identifying these buildings in all available circumstances.
University House: 60 images. The Peel Building is an important landmark of the University of Salford due to its distinct and striking exterior design, but unfortunately it was excluded from the selection because of maintenance activities at the time the photos for this project were collected: it was partially covered with scaffolding, with a lot of movement of personnel and equipment. If the supervisor suggests this as another challenge to include in the project, then its photos must be collected. There are many other buildings at the University of Salford, and again, to expand the project in the future, all of the university's buildings could be included. The full list of the university's buildings can be reviewed via the interactive map at: www.salford.ac.uk/find-us
Expand Further. This project can be improved with many more capabilities; again, due to the time limit given to this project, these improvements can be implemented later as future work. In simple words, this project is to create an application that can display a building's name when a mobile device with a camera is pointed at that building. Future features to be added:
a. Address/location: this will require collecting additional data, namely the longitude and latitude of each building included, or the postcode (which may be shared between buildings, considering how close they appear on interactive map applications such as Google Maps, Google Earth, or iMaps).
b. Description of the building: what is the building for, which school occupies it, and what facilities does it contain?
c. Interior images: all the photos at this stage were taken of the exteriors of the buildings. Will interior photos make an impact on the model/application? For example, if the user is inside Newton or Chapman and opens the application, will the building be identified, especially since the interiors of these buildings have a high level of similarity across corridors, rooms, halls, and labs? Will furniture and assets act as obstacles or as identification marks?
d. Directions to a specific area/floor inside the building: if interior images succeed with the model/application, it would be a good idea to add a search option that can guide the user to a specific area, showing directions to it. For example, if the user is inside the Newton Building and searches for Lab 141, the application will direct them to the first floor with an interactive arrow that updates as the user approaches the destination. Alternatively, if the application can identify the building from its interior, a drop-down list could be activated for each floor of the building: for example, if the model/application identifies the Newton Building, pressing the drop-down list would present interactive tabs for each floor, and selecting a floor by clicking its tab would display the facilities on that floor. Furthermore, if the model/application identifies another building, it should activate a different number of floors, as buildings differ in their number of floors. This feature could be improved with a voice assistant that directs the user after a search (similar to the voice assistant in Google Maps, but applied to the interiors of the university's buildings).
e. Top view: if a drone with a camera can be afforded, it can provide aerial images and top views of the buildings to add to the model/application. However, these images may face the same situation as the interior images: the buildings can look similar to each other from the top, with other obstacles included, such as water tanks and AC units.
Other Questions:
Will the model/application be reproducible? The presumed answer should be yes, provided the model/application is fed the proper data (images), such as images of restaurants, schools, supermarkets, hospitals, government facilities, etc.
This is a multi-modal dataset for restaurants from Google Local (Google Maps). Data includes images and reviews posted by users, as well as metadata for each restaurant.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Context
Travel is a diverse and vibrant industry, and India, with its rich cultural heritage and varied landscapes, offers a myriad of experiences for travelers. The India Travel Recommender System Dataset is designed to facilitate the development of personalized travel recommendation systems. This dataset provides an extensive compilation of travel destinations across India, along with user profiles, reviews, and historical travel data. It's an invaluable resource for anyone looking to create AI-powered travel applications focused on the Indian subcontinent.
Content
The dataset is divided into four primary components:
Destinations: Information about various travel destinations in India, including details like type of destination (beach, mountain, historical site, etc.), popularity, and best time to visit.
Users: Profiles of users including their preferences and demographic information. This dataset has been enriched with gender diversity and includes details on the number of adults and children for travel.
Reviews: User-generated reviews and ratings for the different destinations, offering insights into visitor experiences and satisfaction.
User History: Records of users' past travel experiences, including destinations visited and ratings provided.
Each of these components is presented in a separate CSV file, allowing for easy integration and manipulation in data processing and machine learning workflows.
Acknowledgements
This dataset was generated for educational and research purposes and is intended to be used in hackathons, academic projects, and by AI enthusiasts aiming to enhance the travel experience through technology.
Inspiration
The dataset is perfect for exploring a variety of questions and tasks, such as:
- Building a recommendation engine to suggest travel destinations based on user preferences (a minimal sketch follows this list).
- Analyzing travel trends in India.
- Understanding the relationship between user demographics and travel preferences.
- Sentiment analysis of travel destination reviews.
- Forecasting the popularity of travel destinations based on historical data.
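As a minimal illustration of the first task, the sketch below builds a simple user-based collaborative-filtering recommender from the review ratings; the file and column names (reviews.csv, UserID, DestinationID, Rating) are assumptions, so adjust them to the actual CSV headers.

    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity

    # File and column names are assumptions; adjust to the actual CSV headers.
    reviews = pd.read_csv("reviews.csv")
    ratings = reviews.pivot_table(index="UserID", columns="DestinationID", values="Rating").fillna(0)

    # User-user cosine similarity: recommend destinations liked by similar users.
    sim = pd.DataFrame(cosine_similarity(ratings), index=ratings.index, columns=ratings.index)

    def recommend(user_id, k=5):
        peers = sim[user_id].drop(user_id).nlargest(10).index      # ten most similar users
        scores = ratings.loc[peers].mean().sort_values(ascending=False)
        return scores[ratings.loc[user_id] == 0].head(k)           # skip already-rated places

    print(recommend(user_id=1))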
We encourage Kaggle users to explore this dataset to uncover unique insights and develop innovative solutions in the realm of travel technology. Whether you're a data scientist, a student, or a travel tech enthusiast, this dataset offers a wealth of opportunities for exploration and creativity.
This dataset is free to use for non-commercial purposes. For commercial use, please contact the dataset provider. Remember to cite the source when using this dataset in your projects.
CC0: Public Domain - The dataset is in the public domain and can be used without restrictions.
This submission contains shapefiles, geotiffs, and symbology for the revised-from-Play-Fairway potential structures/structural settings used in the Nevada Geothermal Machine Learning project. Layers include potential structural setting ellipses, centroids, and distance-to-centroid raster. A submission linking the full GitHub repository for our machine learning Jupyter Notebooks will appear in the related datasets section of this page once available.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global AI Dataset Search Platform market size reached USD 1.87 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 27.6% during the forecast period, reaching an estimated USD 16.17 billion by 2033. This remarkable growth is primarily attributed to the escalating demand for high-quality, diverse, and scalable datasets required to train advanced artificial intelligence and machine learning models across various industries. The proliferation of AI-driven applications and the increasing emphasis on data-centric AI development are key growth factors propelling the adoption of AI dataset search platforms globally.
The surge in AI adoption across sectors such as healthcare, BFSI, retail, automotive, and education is fueling the need for efficient and reliable dataset discovery solutions. Organizations are increasingly recognizing that the success of AI models hinges on the quality and relevance of the training data, leading to a surge in investments in dataset search platforms that offer advanced filtering, metadata tagging, and data governance capabilities. The integration of AI dataset search platforms with cloud infrastructures further streamlines data access, collaboration, and compliance, making them indispensable tools for enterprises aiming to accelerate AI innovation. The growing complexity of AI projects, coupled with the exponential growth in data volumes, is compelling organizations to seek platforms that can automate and optimize the process of dataset discovery and curation.
Another significant growth factor is the rapid evolution of AI regulations and data privacy frameworks worldwide. As data governance becomes a top priority, AI dataset search platforms are evolving to include robust features for data lineage tracking, access control, and compliance with regulations such as GDPR, HIPAA, and CCPA. The ability to ensure ethical sourcing and transparent usage of datasets is increasingly valued by enterprises and academic institutions alike. This regulatory landscape is driving the adoption of platforms that not only facilitate efficient dataset search but also enable organizations to demonstrate accountability and compliance in their AI initiatives.
The expanding ecosystem of AI developers, data scientists, and machine learning engineers is also contributing to the market's growth. The democratization of AI development, supported by open-source frameworks and cloud-based collaboration tools, has increased the demand for platforms that can aggregate, index, and provide easy access to diverse datasets. AI dataset search platforms are becoming central to fostering innovation, reducing development cycles, and enabling cross-domain research. As organizations strive to stay ahead in the competitive AI landscape, the ability to quickly identify and utilize optimal datasets is emerging as a critical differentiator.
From a regional perspective, North America currently dominates the AI dataset search platform market, accounting for over 38% of global revenue in 2024, driven by the strong presence of leading AI technology companies, active research communities, and significant investments in digital transformation. Europe and Asia Pacific are also witnessing rapid adoption, with Asia Pacific expected to exhibit the highest CAGR of 29.3% during the forecast period, fueled by government initiatives, burgeoning AI startups, and increasing digitalization across industries. Latin America and the Middle East & Africa are gradually embracing AI dataset search platforms, supported by growing awareness and investments in AI research and infrastructure.
The AI Dataset Search Platform market is segmented by component into Software and Services. Software solutions constitute the backbone of this market, providing the core functionalities required for dataset discovery, indexing, metadata management, and integration with existing AI workflows. The software segment is witnessing robust growth as organizations seek advanced platforms capable of handling large-scale, multi-source datasets with sophisticated search capabilities powered by natural language processing and machine learning algorithms. These platforms are increasingly incorporating features such as semantic search, automated data labeling, and customizable data pipelines, enabling users to eff
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Corpus Nummorum - Coin Image Dataset
This dataset is a collection of ancient coin images from three different sources: the Corpus Nummorum (CN) project, the Münzkabinett Berlin, and the Bibliothèque nationale de France, Département des Monnaies, médailles et antiques. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, the Troad, and Mysia. Due to copyright, this is a selection of the coins published on the CN portal.
The dataset contains 115,160 images of about 29,000 unique coins. The images are split into three main folders with different groupings of the coins, and each main folder is organized into subfolders that hold the coin images. The "dataset_coins" folder contains the coin photos divided into obverse and reverse and arranged by coin type. In the "dataset_types" folder, the obverse and reverse images of each coin are concatenated and transformed to a square format with black bars at the top and bottom; the images here are sorted by coin type. The last folder, "dataset_mints", contains the concatenated images sorted by mint. A "sources" CSV file holds the source for every image. Due to copyright, the image size is limited to 299×299 pixels; however, this should be sufficient for most ML approaches.
The main purpose of this dataset in the CN project is the training of machine-learning-based image recognition models. We use three different Convolutional Neural Network architectures: VGG16, VGG19, and ResNet50. Our best model (VGG16) achieves 79% Top-1 and 97% Top-5 accuracy on this dataset for coin type recognition; mint recognition achieves 79% Top-1 and 94% Top-5 accuracy. We have a Colab notebook with two models (trained on the whole CN dataset) online.
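For readers who want a starting point, the following is a minimal sketch of this kind of transfer-learning setup using torchvision's pre-trained VGG16 on the "dataset_types" folder; it is not the project's exact training code.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, models, transforms

    # Assumed layout: dataset_types/<type_id>/<image>.jpg
    tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    train_set = datasets.ImageFolder("dataset_types", transform=tfm)
    loader = DataLoader(train_set, batch_size=32, shuffle=True)

    # ImageNet-pretrained VGG16 with the last layer replaced by one output per coin type.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    model.classifier[6] = nn.Linear(4096, len(train_set.classes))

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for images, labels in loader:  # one epoch shown; train longer in practice
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()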
During the summer semester of 2023, we held the "Data Challenge" event at our Department of Computer Science at Goethe University. We gave our students this dataset with the task of achieving better results than ours. Here are their experiments:
Team 1: Voting and stacking of models
Team 4: Dockerized TIMM Computer Vision Backend & FastAPI
Now we would like to invite you to try out your own ideas and models on our coin data.
If you have any questions or suggestions, please, feel free to contact us.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).
The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.
The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.
The dataset has a tabular structure and was initially stored in CSV format. It contains:
Rows: 7,043 customer records
Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).
Naming Convention:
The table in the database is named telco_customer_churn_data.
Software Requirements:
To open and work with the dataset, any standard database client or programming language supporting MariaDB connections can be used (e.g., Python).
For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
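A minimal sketch of this access path (the connection string is a placeholder; substitute the actual DBRepo credentials, host, and database name):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder DSN (requires the PyMySQL driver); substitute real
    # credentials, host, and database name for the DBRepo instance.
    engine = create_engine("mysql+pymysql://user:password@host:3306/database")
    df = pd.read_sql_table("telco_customer_churn_data", engine)
    print(df.shape)  # expected: (7043, 21)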
Additional Resources:
Source code for data loading, preprocessing, model training, and evaluation is available at the associated GitHub repository: https://github.com/nazerum/fair-ml-customer-churn
When reusing the dataset, users should be aware of the following:
Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).
Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is used to address the Research Questions in the study in Transactions on Software Engineering: "Within-Project Defect Prediction of Infrastructure-as-Code using Product and Process Metrics". See also: https://github.com/stefanodallapalma/TSE-2020-05-0217. It provides:
* repositories.json - a list of repositories selected from open-source GitHub repositories based on the Ansible language.
* fixing-commits.json - a list of defect-fixing commits extracted from those repositories.
* fixed-files.json - a list of Ansible files fixed in those defect-fixing commits and the respective bug-inducing commits.
* failure-prone-files.json - a list of failure-prone files through the repositories' commit histories.
* metrics.zip - CSV files consisting of releases (sets of files) and their IaC-oriented, delta, and process metrics extracted from each analyzed repository.
* projects.zip - for each analyzed project, the data (models, performance, and results of Recursive Feature Elimination) used to answer the Research Questions.
Context
Infrastructure-as-Code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools. On the one hand, although IaC is an ever more widely adopted practice nowadays, little is known about how to best maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion. On the other hand, source code measurements are often computed and analyzed to evaluate the different quality aspects of the software developed. In particular, Infrastructure-as-Code is simply "code", and as such it is prone to defects like any other programming language. This dataset targets the YAML-based Ansible language to devise within-project defect prediction approaches for IaC based on machine learning.
Content
The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:
* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 10% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yaml file;
* The repository has a comments ratio of at least 0.1%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.01 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 100 source lines of code.
Metrics are grouped into three categories:
* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts.
* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric.
* Process: metrics that capture aspects of the development process rather than aspects of the code itself.
In addition to the metrics, the dataset contains the pre-trained models (*.joblib) in the folders rq1 and rq2 of projects.zip.
You can load a model in Python as follows:

    from joblib import load

    model = load('projects/owner/repository/rq1/random_forest.joblib', mmap_mode='r')
    best_estimator = model['estimator']   # The estimator that maximized the AUC-PR
    cv_results = model['cv_results']      # The results of each step of the validation procedure
    best_index = model['best_index']      # The index to access the best cv_results
Acknowledgements
This work is supported by the European Commission grant no. 825040 (RADON H2020).
Inspiration
What source code properties, and what properties of the development process, are good predictors of defects in Infrastructure-as-Code scripts?
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Gender Classification Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/elakiricoder/gender-classification-dataset on 12 November 2021.
--- Dataset description provided by original source is as follows ---
While I was practicing machine learning, I wanted to create a simple dataset that is closely aligned with a real-world scenario and gives good results, to whet my appetite for this domain. If you are a beginner who wants to try solving classification problems in machine learning, and you prefer achieving good results, try using this dataset in your projects; it is a great place to start.
This dataset contains 7 features and a label column.
- long_hair - This column contains 0's and 1's, where 1 is "long hair" and 0 is "not long hair".
- forehead_width_cm - The width of the forehead, in centimetres.
- forehead_height_cm - The height of the forehead, in centimetres.
- nose_wide - This column contains 0's and 1's, where 1 is "wide nose" and 0 is "not wide nose".
- nose_long - This column contains 0's and 1's, where 1 is "long nose" and 0 is "not long nose".
- lips_thin - This column contains 0's and 1's, where 1 represents "thin lips" and 0 is "not thin lips".
- distance_nose_to_lip_long - This column contains 0's and 1's, where 1 represents a "long distance between nose and lips" and 0 a "short distance between nose and lips".
- gender - This is either "Male" or "Female".
Nothing to acknowledge, as this is just made-up data.
It's painful to see bad results at the beginning. Don't begin with complicated datasets if you are a beginner. I'm sure that this dataset will encourage you to proceed further in the domain. Good luck.
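As a quick starting point (our sketch, not part of the original description), a first classifier can be trained as follows; the CSV file name is assumed:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # File name assumed; the CSV holds the 7 feature columns plus "gender".
    df = pd.read_csv("gender_classification_v7.csv")
    X, y = df.drop(columns=["gender"]), df["gender"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))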
--- Original source retains full ownership of the source dataset ---
This submission contains geotiffs, supporting shapefiles and readmes for the inputs and output models of algorithms explored in the Nevada Geothermal Machine Learning project, meant to accompany the final report. Layers include: Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk), input rasters of feature sets, and positive/negative training sites. See readme .txt files and final report for additional metadata. A submission linking the full codebase for generating machine learning output models is available under "related resources" on this page.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was compiled as part of a project to design a cold storage system to combat post-harvest food loss in developing regions by integrating IoT technology with predictive machine learning. The project, which was developed for use by smallholder farmers in Uganda, aims to monitor and proactively control the environmental conditions in cold storage units to extend the shelf life of perishable goods.
This dataset is specifically structured for machine learning applications, serving as training and validation data for machine learning models. It contains environmental data points collected in a controlled cold storage environment. The data is organized into a comma-separated values (CSV) file with a total of 10,996 entries in the following six columns:
Fruit: A categorical variable indicating the type of fruit being stored (e.g., Orange, Pineapple, Banana, Tomato).
Temp: The temperature inside the cold storage unit, measured in degrees Celsius (°C).
Humid: The relative humidity (RH) of the environment, measured as a percentage (%).
Light: The intensity of light exposure, measured in Lux.
CO2: The concentration of carbon dioxide (CO₂) in the air, measured in parts per million (ppm).
Class: A binary classification label (Good or Bad) that serves as the target variable for the predictive model, indicating whether the environmental conditions are optimal or suboptimal for spoilage prevention.
The data's primary purpose is to provide a basis for training predictive models to classify environmental conditions and assess spoilage risk.
The dataset is a valuable resource for researchers and practitioners in fields such as smart agriculture, food science, embedded systems, and machine learning. It can be used to:
Train, validate, and test new predictive models for food spoilage.
Analyze the correlation between specific environmental factors (temperature, humidity, CO2, and light) and fruit spoilage outcomes.
Support the development of low-cost, intelligent monitoring systems for cold chain logistics and food preservation.
This dataset and the associated project are intended to contribute to achieving the United Nations Sustainable Development Goals (SDGs), particularly those related to food security and sustainable agriculture.
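As a hypothetical example of the intended training use, the sketch below fits a classifier on the six columns described above; the file name is assumed:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # File name assumed; columns follow the description above.
    df = pd.read_csv("cold_storage_conditions.csv")
    X = pd.get_dummies(df[["Fruit", "Temp", "Humid", "Light", "CO2"]], columns=["Fruit"])
    y = df["Class"]  # "Good" or "Bad"

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))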
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Documentation: Cucumber Disease Detection
Introduction: As part of the "Cucumber Disease Detection" project, a machine learning model for the automatic detection of diseases in cucumber plants is being developed. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.
Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.
Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.
Data Collection: Using cameras and smartphones, images from agricultural areas were gathered.
Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.
Exploratory Data Analysis (EDA): The dataset was examined using visualizations such as scatter plots and histograms, and was checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.
Methodology Machine Learning Algorithms:
Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:
The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
Model Development The CNN model's architecture consists of layers, units, and activation operations. On the basis of experimentation, hyperparameters including learning rate, batch size, and optimizer were chosen. To avoid overfitting, regularization methods like dropout and L2 regularization were used.
Model Training: During training, the model was fed the prepared dataset over a number of epochs, and the loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.
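A minimal sketch of such a training setup is shown below; the architecture, image size, and directory layout are illustrative assumptions rather than the project's exact configuration.

    import tensorflow as tf

    # Illustrative directory layout: images/<class_name>/<image>.jpg
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "images", validation_split=0.2, subset="training", seed=42, image_size=(128, 128))
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "images", validation_split=0.2, subset="validation", seed=42, image_size=(128, 128))

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),  # regularization, as described above
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Early stopping and checkpoints, as described above.
    callbacks = [
        tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
        tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
    ]
    model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=callbacks)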
Model Evaluation Evaluation Metrics:
Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:
The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.
Results and Discussion: Key project findings include model performance and disease detection precision; a comparison of the various models employed, showing the benefits and drawbacks of each; and the challenges faced throughout the project and the methods used to solve them.
Conclusion: A recap of the project's key learnings. The project's importance for early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.
References Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib Datasets: https://data.mendeley.com/datasets/y6d3z6f8z9/1
Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g
Rafiur Rahman Rafit EWU 2018-3-60-111