78 datasets found
  1. Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects...

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO (2023). NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python [Dataset]. http://doi.org/10.6084/m9.figshare.21967265.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ratnadira Widyasari; Zhou YANG; Ferdian Thung; Sheng Qin Sim; Fiona Wee; Camellia Lok; Jack Phan; Haodi Qi; Constance Tan; Qijin Tay; David LO
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts at filtering those projects to curate ML projects of high quality. The limited availability of such high-quality datasets poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. In this repository we provide the "NICHE.csv" file, which contains the list of project names along with their labels, descriptive information for every dimension, and several basic statistics, such as the number of stars and commits. This dataset can help researchers understand the practices that are followed in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.

    GitHub page: https://github.com/soarsmu/NICHE
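    As a sketch of how the labels file might be consumed, using invented rows; the column names here are assumptions, not taken from the actual NICHE.csv, so check the repository README for the real schema:

```python
import csv
import io

# Hypothetical sample mirroring the described NICHE.csv layout
# (project name, engineered/non-engineered label, stars, commits).
sample = io.StringIO(
    "project,label,stars,commits\n"
    "owner-a/ml-pipeline,engineered,1200,3400\n"
    "owner-b/toy-notebook,non-engineered,15,12\n"
    "owner-c/serving-stack,engineered,560,980\n"
)

rows = list(csv.DictReader(sample))
engineered = [r for r in rows if r["label"] == "engineered"]
print(f"{len(engineered)} of {len(rows)} projects labelled engineered")
```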

  2. FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM)...

    • datarade.ai
    Updated Jun 28, 2024
    Cite
    FileMarket (2024). FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM) Data | Machine Learning (ML) Data | Deep Learning (DL) Data | [Dataset]. https://datarade.ai/data-products/filemarket-ai-training-data-large-language-model-llm-data-filemarket
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Jun 28, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Antigua and Barbuda, Saint Kitts and Nevis, French Southern Territories, Colombia, Benin, China, Brazil, Central African Republic, Papua New Guinea, Saudi Arabia
    Description

    FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.

    Key use cases of our Large Language Model (LLM) Data:

    • Text generation
    • Chatbots and virtual assistants
    • Machine translation
    • Sentiment analysis
    • Speech recognition
    • Content summarization

    Why choose FileMarket's data:

    • Object Detection Data: Essential for training AI in image and video analysis.
    • Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.
    • Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.
    • Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.

    FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.

  3. AI/ML Youtube Videos

    • kaggle.com
    Updated Oct 31, 2023
    Cite
    Asmaa Hadir (2023). AI/ML Youtube Videos [Dataset]. https://www.kaggle.com/datasets/asmaahadir/aiml-youtube-channels-content-2018-2019
    Explore at:
    Croissant (available download formats). Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Asmaa Hadir
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    YouTube
    Description

    I created this dataset as part of a data analysis project and concluded that it might be relevant for others who are interested in analyzing content on YouTube. This dataset is a collection of over 6,000 videos with the following columns:

    • Channel: video's channel
    • Title: video title
    • PublishedDate: date the video was uploaded
    • Likes: likes count for the video
    • Views: views count for the video
    • Comments: comments count for the video

      Through the YouTube API and using Python, I collected data about videos from some of these popular channels, which provide educational content about Machine Learning and Data Science, in order to extract insights about which topics have been popular within the last couple of years. Featured in the dataset are the following creators:

    • Krish Naik

    • Nicholas Renotte

    • Sentdex

    • DeepLearningAI

    • Artificial Intelligence — All in One

    • Siraj Raval

    • Jeremy Howard

    • Applied AI Course

    • Daniel Bourke

    • Jeff Heaton

    • DeepLearning.TV

    • Arxiv Insights

    These channels are featured in multiple "top AI channels to subscribe to" lists and have seen significant growth on YouTube over the last couple of years. All of them were created in or before 2018.
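    The column layout above lends itself to simple aggregations; a minimal sketch, using invented rows in the documented column format, that finds the channel with the most total views:

```python
from collections import defaultdict

# Hypothetical rows using the dataset's documented columns
# (Channel, Title, Views); the values here are made up.
videos = [
    {"Channel": "Sentdex", "Title": "NN from scratch", "Views": 500_000},
    {"Channel": "Sentdex", "Title": "Q-learning intro", "Views": 250_000},
    {"Channel": "Krish Naik", "Title": "ML roadmap", "Views": 400_000},
]

views_by_channel = defaultdict(int)
for v in videos:
    views_by_channel[v["Channel"]] += v["Views"]

top = max(views_by_channel, key=views_by_channel.get)
print(top, views_by_channel[top])
```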

  4. DynaBench: A benchmark dataset for learning dynamical systems from...

    • zenodo.org
    • data.niaid.nih.gov
    tar
    Updated Oct 31, 2023
    Cite
    Andrzej Dulny; Andreas Hotho; Anna Krause (2023). DynaBench: A benchmark dataset for learning dynamical systems from low-resolution data (minimal) [Dataset]. http://doi.org/10.1007/978-3-031-43412-9_26
    Explore at:
    tar (available download formats)
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Dulny; Andreas Hotho; Anna Krause
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is a minimal version of the DynaBench dataset, containing the first 5% of the data. The full dataset is available at https://professor-x.de/dynabench

    Abstract:

    Previous work on learning physical systems from data has focused on high-resolution grid-structured measurements. However, real-world knowledge of such systems (e.g. weather data) relies on sparsely scattered measuring stations. In this paper, we introduce a novel simulated benchmark dataset, DynaBench, for learning dynamical systems directly from sparsely scattered data without prior knowledge of the equations. The dataset focuses on predicting the evolution of a dynamical system from low-resolution, unstructured measurements. We simulate six different partial differential equations covering a variety of physical systems commonly used in the literature and evaluate several machine learning models, including traditional graph neural networks and point cloud processing models, on the task of predicting the evolution of the system. The proposed benchmark dataset is expected to advance the state of the art as an out-of-the-box, easy-to-use tool for evaluating models in a setting where only unstructured low-resolution observations are available. The benchmark is available at https://professor-x.de/dynabench.

    Technical Info

    The dataset is split into 42 parts (6 equations x 7 combinations of resolution/structure). Each part can be downloaded separately and contains 7000 simulations of the given equation at the given resolution and structure. The simulations are grouped into chunks of 500 simulations saved in the hdf5 file format. Each chunk contains the variable "data", where the values of the simulated system are stored, as well as the variable "points", where the coordinates at which the system has been observed are stored. For more details visit the DynaBench website at https://professor-x.de/dynabench/. The dataset is best used as part of the dynabench python package available at https://pypi.org/project/dynabench/.
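    A hedged sketch of reading one chunk's "data" and "points" variables with h5py. The shapes below are invented for illustration (the real chunks hold 500 simulations each, and the dynabench package is the recommended access path); an in-memory file stands in for a downloaded chunk:

```python
import h5py
import numpy as np

# Build a tiny in-memory stand-in for one DynaBench chunk; the real files
# hold 500 simulations per chunk, and the dimensions below are invented.
with h5py.File("chunk.h5", "w", driver="core", backing_store=False) as f:
    # "data": simulated field values, here (sims, timesteps, points, fields)
    f.create_dataset("data", data=np.random.rand(2, 16, 30, 1))
    # "points": observation coordinates, here (sims, points, xy)
    f.create_dataset("points", data=np.random.rand(2, 30, 2))

    data = f["data"][:]
    points = f["points"][:]

print(data.shape, points.shape)
```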

  5. Data from: GIS Resource Compilation Map Package - Applications of Machine...

    • catalog.data.gov
    • data.openei.org
    • +3more
    Updated Jan 20, 2025
    Cite
    Nevada Bureau of Mines and Geology (2025). GIS Resource Compilation Map Package - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. https://catalog.data.gov/dataset/gis-resource-compilation-map-package-applications-of-machine-learning-techniques-to-geothe-8f3ee
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Nevada Bureau of Mines and Geology
    Area covered
    Great Basin, Nevada
    Description

    This submission contains an ESRI map package (.mpk) with an embedded geodatabase for GIS resources used or derived in the Nevada Machine Learning project, meant to accompany the final report. The package includes layer descriptions, layer grouping, and symbology. Layer groups include: new/revised datasets (paleo-geothermal features, geochemistry, geophysics, heat flow, slip and dilation, potential structures, geothermal power plants, positive and negative test sites), machine learning model input grids, machine learning models (Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk) - supervised and unsupervised), original NV Play Fairway data and models, and NV cultural/reference data. See layer descriptions for additional metadata. Smaller GIS resource packages (by category) can be found in the related datasets section of this submission. A submission linking the full codebase for generating machine learning output models is available through the "Related Datasets" link on this page, and contains results beyond the top picks present in this compilation.

  6. IMDB top 250 tv shows dataset

    • kaggle.com
    Updated Aug 5, 2024
    Cite
    Sharukh Khan (2024). IMDB top 250 tv shows dataset [Dataset]. https://www.kaggle.com/datasets/sharukhkhan0101/imdb-top-250-tv-shows-dataset
    Explore at:
    Croissant (available download formats). Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sharukh Khan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Top 250 TV Shows

    Dataset Title:

    Top 250 TV Shows: Ratings, Episodes, and Certifications

    Description:

    This dataset features the top 250 TV shows, providing comprehensive information on their titles, certifications, number of episodes, start years, IMDb ratings, and rating counts. This dataset is perfect for enthusiasts, researchers, and data scientists who are interested in analyzing trends and patterns in popular television shows.

    Columns:

    • Title: The title of the TV show.
    • Certificate: The certification rating (e.g., TV-MA, TV-G).
    • Number of Episodes: The total number of episodes in the TV show.
    • Started: The year the TV show first aired.
    • Rating: The IMDb rating of the TV show.
    • Rating Count: The number of user ratings the TV show has received on IMDb.

    Key Features:

    • Diverse Range of TV Shows: Includes a variety of TV shows from different genres and time periods.
    • Detailed Ratings Information: Contains both IMDb ratings and rating counts, offering insights into the popularity and reception of each TV show.
    • Certification Information: Provides details on the content appropriateness for different audiences.

    Potential Uses:

    • Data Analysis and Visualization: Create visualizations to explore trends in TV show ratings, episode counts, and certifications over the years.
    • Machine Learning Projects: Use the dataset to build predictive models for TV show ratings or success.
    • Comparative Studies: Analyze the relationship between the number of episodes, certification, and ratings.
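    As a small illustration of the comparative-studies use case, a sketch that averages ratings per certificate; the rows are invented stand-ins using the documented column names:

```python
from collections import defaultdict

# Hypothetical rows using the dataset's documented columns; values invented.
shows = [
    {"Title": "Show A", "Certificate": "TV-MA", "Rating": 9.2},
    {"Title": "Show B", "Certificate": "TV-MA", "Rating": 8.6},
    {"Title": "Show C", "Certificate": "TV-G",  "Rating": 8.0},
]

ratings_by_cert = defaultdict(list)
for s in shows:
    ratings_by_cert[s["Certificate"]].append(s["Rating"])

avg_by_cert = {c: sum(r) / len(r) for c, r in ratings_by_cert.items()}
print(avg_by_cert)
```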

    Acknowledgments:

    This dataset was compiled to support data science projects and analyses related to television shows. The data is sourced from IMDb to ensure accuracy and comprehensiveness.

    License:

    MIT License

  7. Steam Video Game and Bundle Data

    • cseweb.ucsd.edu
    json
    + more versions
    Cite
    UCSD CSE Research Project, Steam Video Game and Bundle Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    json (available download formats)
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain reviews from the Steam video game platform, and information about which games were bundled together.

    Metadata includes

    • reviews

    • purchases, plays, recommends (likes)

    • product bundles

    • pricing information

    Basic Statistics:

    • Reviews: 7,793,069

    • Users: 2,567,538

    • Items: 15,474

    • Bundles: 615
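    A hedged note on loading: files from this research group are often distributed with one Python dict literal per line rather than strict JSON, in which case ast.literal_eval parses lines that json.loads rejects. A self-contained sketch with made-up review lines (the field names are assumptions):

```python
import ast

# Hypothetical review lines in the loose one-dict-per-line format;
# single quotes and Python booleans would break json.loads.
lines = [
    "{'username': 'player1', 'product_id': '70', 'hours': 12.5, 'recommend': True}",
    "{'username': 'player2', 'product_id': '440', 'hours': 3.0, 'recommend': False}",
]

reviews = [ast.literal_eval(line) for line in lines]
recommended = sum(r["recommend"] for r in reviews)
print(f"{recommended}/{len(reviews)} reviews recommend the game")
```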

  8. Building Performance Dataset

    • kaggle.com
    Updated Dec 13, 2024
    Cite
    Ziya (2024). Building Performance Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/construction-project-performance-dataset
    Explore at:
    Croissant (available download formats). Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ziya
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset simulates various aspects of construction project monitoring over time and is designed for time-series analysis and optimization studies. It contains 50,000 records representing data points collected at 1-minute intervals. The dataset includes diverse features related to project management, environmental conditions, resource utilization, safety, and performance evaluation.

    Features:

    • timestamp: The recorded time of the observation.
    • temperature: Ambient temperature at the construction site (°C).
    • humidity: Relative humidity at the construction site (%).
    • vibration_level: Measured vibration levels of machinery or equipment (Hz).
    • material_usage: Quantity of materials utilized during the period (kg).
    • machinery_status: Binary status indicating machinery activity (1 = Active, 0 = Idle).
    • worker_count: Number of workers on-site during the period.
    • energy_consumption: Energy consumption recorded for machinery and operations (kWh).
    • task_progress: Cumulative percentage progress of tasks (%).
    • cost_deviation: Financial deviation from the planned budget (USD).
    • time_deviation: Schedule deviation from planned timelines (days).
    • safety_incidents: Number of safety-related incidents reported.
    • equipment_utilization_rate: Utilization rate of machinery and equipment (%).
    • material_shortage_alert: Binary alert for material shortage (1 = Alert, 0 = No Alert).
    • risk_score: Computed risk score for the project (%).
    • simulation_deviation: Percentage deviation of simulated vs. actual outcomes (%).
    • update_frequency: Suggested interval for project status updates (minutes).
    • optimization_suggestion: Suggested optimization actions for the project.
    • performance_score: Categorical performance evaluation of the project based on several metrics (Poor, Average, Good, Excellent).
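    A minimal sketch of a typical first step with such records, comparing mean energy consumption between active and idle machinery; the rows are invented, using the feature names above:

```python
# Hypothetical 1-minute records using the documented feature names.
records = [
    {"timestamp": "2024-01-01 00:00", "machinery_status": 1, "energy_consumption": 42.0},
    {"timestamp": "2024-01-01 00:01", "machinery_status": 1, "energy_consumption": 38.0},
    {"timestamp": "2024-01-01 00:02", "machinery_status": 0, "energy_consumption": 5.0},
]

# Split consumption readings by machinery activity and compare the means.
active = [r["energy_consumption"] for r in records if r["machinery_status"] == 1]
idle = [r["energy_consumption"] for r in records if r["machinery_status"] == 0]
print(sum(active) / len(active), sum(idle) / len(idle))
```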

  9. UoS Buildings Image Dataset for Computer Vision Algorithms

    • salford.figshare.com
    application/x-gzip
    Updated Jan 23, 2025
    Cite
    Ali Alameer; Mazin Al-Mosawy (2025). UoS Buildings Image Dataset for Computer Vision Algorithms [Dataset]. http://doi.org/10.17866/rd.salford.20383155.v2
    Explore at:
    application/x-gzip (available download formats)
    Dataset updated
    Jan 23, 2025
    Dataset provided by
    University of Salford
    Authors
    Ali Alameer; Mazin Al-Mosawy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset for this project consists of photos of the buildings of the University of Salford, taken with a mobile phone camera from different angles and at different distances. Although this task sounds easy, it encountered some challenges, summarized below:

    1. Obstacles.

    a. Fixed or unremovable objects. When taking several photos of a building or a landscape from different angles and directions, some angles are blocked by a fixed object such as trees and plants, light poles, signs, statues, cabins, bicycle shades, scooter stands, generators/transformers, construction barriers, construction equipment, or other service equipment, so some photos unavoidably include these objects. This raises three questions:

    - Will these objects confuse the model/application we intend to create, i.e. will an obstacle prevent the model/application from identifying the designated building?
    - Or will the photos be more representative with these objects included, giving the model/application the capability to identify the buildings despite the obstacles?
    - What is the maximum detection distance? In other words, how far can the mobile device running the application be from the building and still detect it?

    b. Removable and moving objects.

    - A university is crowded with staff and students, especially during the rush hours of the day, so it is hard to take some photos without a person appearing in them at certain times. Due to privacy issues, and out of respect for those people, such photos are better excluded.
    - Parked vehicles, trolleys, and service equipment can be obstacles and might appear in the images; they can also block access to some areas, so an image from a certain angle cannot be obtained.
    - Animals such as dogs, cats, birds, or even squirrels cannot be avoided in some photos, which raises the same questions as above.

    2. Weather. In a deep learning project, more data means more accuracy and less error. At this stage of the project it was agreed to take 50 photos per building; more photos would give more accurate results, but the number was limited to 50 due to the time available. These photos were taken on cloudy days. As future work, photos on sunny, rainy, foggy, snowy, and other weather-condition days can be included, as can photos at different times of day, such as night, dawn, and sunset, to give the designated model every possibility of identifying these buildings in all available circumstances.

    3. The selected buildings. It was agreed to select only 10 of the University of Salford's buildings for this project, with at least 50 images per building. The selected buildings and the number of images taken are:

    - Chapman: 74 images
    - Clifford Whitworth Library: 60 images
    - Cockcroft: 67 images
    - Maxwell: 80 images
    - Media City Campus: 92 images
    - New Adelphi: 93 images
    - New Science, Engineering & Environment: 78 images
    - Newton: 92 images
    - Sports Centre: 55 images
    - University House: 60 images

    The Peel building is an important figure of the University of Salford due to its distinct exterior design, but it was unfortunately excluded from the selection because of maintenance activities at the time the photos were collected: it was partially covered with scaffolding, with a lot of movement of personnel and equipment. If the supervisor suggests that this would be another worthwhile challenge for the project, its photos will be collected. There are many other buildings at the University of Salford, and to expand the project in the future all of them could be included. The full list of university buildings can be reviewed on an interactive map at: www.salford.ac.uk/find-us

    4. Expand further. This project can be improved with many more capabilities; due to the time limit given to this project, these improvements can be implemented later as future work. In simple words, this project is to create an application that displays a building's name when a mobile device with a camera is pointed at that building. Future features to be added:

    a. Address/location: this will require collecting additional data, namely the longitude and latitude of each building, or its postcode (which may be shared between buildings, considering how close they appear on interactive map applications such as Google Maps, Google Earth, or iMaps).

    b. Description of the building: what is the building for, which school occupies it, and what facilities does it contain?

    c. Interior images: all the photos at this stage were taken of the buildings' exteriors. Would interior photos make an impact on the model/application? For example, if the user is inside Newton or Chapman and opens the application, will the building be identified, especially since the interiors of these buildings are highly similar in their corridors, rooms, halls, and labs? Will furniture and assets act as obstacles or as identification marks?

    d. Directions to a specific area/floor inside the building: if the interior images succeed with the model/application, it would be a good idea to add a search option that guides the user to a specific area with directions. For example, if the user is inside the Newton building and searches for lab 141, the application would direct them to the first floor with an interactive arrow that updates as they approach the destination. Alternatively, if the application can identify the building from its interior, a drop-down list could be activated with an interactive tab for each floor of that building; pressing a floor's tab would display the facilities on that floor. Since buildings differ in their number of floors, the application should activate a different number of tabs for each identified building. This feature could be improved with a voice assistant that directs the user after a search, similar to the voice assistant in Google Maps but applied to the interiors of the university's buildings.

    e. Top view: if a drone with a camera can be afforded, it could provide aerial images and top views of the buildings to add to the model/application. However, these images may face the same issue as the interior images: the buildings can look similar to each other from the top, with other obstacles included, such as water tanks and AC units.

    5. Other questions. Will the model/application be reproducible? The presumed answer should be yes, if the model/application is fed with the proper data (images), such as images of restaurants, schools, supermarkets, hospitals, government facilities, etc.

  10. Google Restaurants dataset

    • cseweb.ucsd.edu
    csv
    + more versions
    Cite
    UCSD CSE Research Project, Google Restaurants dataset [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    csv (available download formats)
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    This is a multi-modal dataset for restaurants from Google Local (Google Maps). The data includes images and reviews posted by users, as well as metadata for each restaurant.

  11. Travel Recommendation Dataset

    • kaggle.com
    Updated Jan 23, 2024
    Cite
    Aman Mehra (2024). Travel Recommendation Dataset [Dataset]. https://www.kaggle.com/datasets/amanmehra23/travel-recommendation-dataset
    Explore at:
    Croissant (available download formats). Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 23, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Mehra
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Title: India Travel Recommender System Dataset

    Description

    Context
    Travel is a diverse and vibrant industry, and India, with its rich cultural heritage and varied landscapes, offers a myriad of experiences for travelers. The India Travel Recommender System Dataset is designed to facilitate the development of personalized travel recommendation systems. This dataset provides an extensive compilation of travel destinations across India, along with user profiles, reviews, and historical travel data. It's an invaluable resource for anyone looking to create AI-powered travel applications focused on the Indian subcontinent.

    Content
    The dataset is divided into four primary components:

    1. Destinations: Information about various travel destinations in India, including details like type of destination (beach, mountain, historical site, etc.), popularity, and best time to visit.

    2. Users: Profiles of users including their preferences and demographic information. This dataset has been enriched with gender diversity and includes details on the number of adults and children for travel.

    3. Reviews: User-generated reviews and ratings for the different destinations, offering insights into visitor experiences and satisfaction.

    4. User History: Records of users' past travel experiences, including destinations visited and ratings provided.

    Each of these components is presented in a separate CSV file, allowing for easy integration and manipulation in data processing and machine learning workflows.
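    A sketch of the recommendation-engine use case, joining invented stand-ins for the Reviews and Destinations files; the file layouts and field names are assumptions, not the dataset's actual schema:

```python
# Hypothetical rows mirroring two of the four described CSVs.
destinations = {
    "D1": {"name": "Goa", "type": "beach"},
    "D2": {"name": "Manali", "type": "mountain"},
}
reviews = [
    {"dest_id": "D1", "rating": 4.5},
    {"dest_id": "D1", "rating": 4.0},
    {"dest_id": "D2", "rating": 5.0},
]
user = {"preference": "beach"}

# Average rating per destination, filtered to the user's preferred type.
ratings = {}
for r in reviews:
    ratings.setdefault(r["dest_id"], []).append(r["rating"])
ranked = sorted(
    ((d, sum(rs) / len(rs)) for d, rs in ratings.items()
     if destinations[d]["type"] == user["preference"]),
    key=lambda pair: pair[1], reverse=True,
)

best_id, best_rating = ranked[0]
print(destinations[best_id]["name"], best_rating)
```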

    Acknowledgements
    This dataset was generated for educational and research purposes and is intended to be used in hackathons, academic projects, and by AI enthusiasts aiming to enhance the travel experience through technology.

    Inspiration
    The dataset is perfect for exploring a variety of questions and tasks, such as:

    - Building a recommendation engine to suggest travel destinations based on user preferences.
    - Analyzing travel trends in India.
    - Understanding the relationship between user demographics and travel preferences.
    - Sentiment analysis of travel destination reviews.
    - Forecasting the popularity of travel destinations based on historical data.

    We encourage Kaggle users to explore this dataset to uncover unique insights and develop innovative solutions in the realm of travel technology. Whether you're a data scientist, a student, or a travel tech enthusiast, this dataset offers a wealth of opportunities for exploration and creativity.

    Usage

    This dataset is free to use for non-commercial purposes. For commercial use, please contact the dataset provider. Remember to cite the source when using this dataset in your projects.

    License

    CC0: Public Domain - The dataset is in the public domain and can be used without restrictions.

  12. Data from: Potential structures - Applications of Machine Learning...

    • catalog.data.gov
    • data.openei.org
    • +2more
    Updated Jan 20, 2025
    + more versions
    Cite
    Nevada Bureau of Mines and Geology (2025). Potential structures - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. https://catalog.data.gov/dataset/potential-structures-applications-of-machine-learning-techniques-to-geothermal-play-fairwa-98f4a
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Nevada Bureau of Mines and Geology
    Area covered
    Great Basin, Nevada
    Description

    This submission contains shapefiles, geotiffs, and symbology for the revised-from-Play-Fairway potential structures/structural settings used in the Nevada Geothermal Machine Learning project. Layers include potential structural setting ellipses, centroids, and distance-to-centroid raster. A submission linking the full GitHub repository for our machine learning Jupyter Notebooks will appear in the related datasets section of this page once available.
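    The distance-to-centroid raster mentioned above can be illustrated in a few lines: for each grid cell, take the distance to the nearest centroid. The grid size and centroid positions below are invented for illustration, not derived from the actual layers:

```python
import math

# Two made-up centroids standing in for the structural-setting centroids,
# on a tiny 4x4 grid in arbitrary units.
centroids = [(1.0, 1.0), (3.0, 2.0)]
rows, cols = 4, 4

# Each raster cell holds the Euclidean distance to the nearest centroid.
raster = [
    [min(math.hypot(x - cx, y - cy) for cx, cy in centroids)
     for x in range(cols)]
    for y in range(rows)
]

print(raster[1][1])  # the cell at (x=1, y=1) sits on the first centroid
```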

  13. AI Dataset Search Platform Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). AI Dataset Search Platform Market Research Report 2033 [Dataset]. https://dataintelo.com/report/ai-dataset-search-platform-market
    Explore at:
    csv, pdf, pptx (available download formats)
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI Dataset Search Platform Market Outlook



    According to our latest research, the global AI Dataset Search Platform market size reached USD 1.87 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 27.6% during the forecast period, reaching an estimated USD 16.17 billion by 2033. This remarkable growth is primarily attributed to the escalating demand for high-quality, diverse, and scalable datasets required to train advanced artificial intelligence and machine learning models across various industries. The proliferation of AI-driven applications and the increasing emphasis on data-centric AI development are key growth factors propelling the adoption of AI dataset search platforms globally.
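    Sanity-checking the projection arithmetic: compounding the 2024 base at the stated CAGR for the nine years to 2033 gives a figure slightly above the report's USD 16.17 billion, a gap plausibly due to rounding of the CAGR or mid-period conventions:

```python
# Compound-growth check using the figures quoted in the text above.
base_2024 = 1.87              # market size in 2024, USD billion
cagr = 0.276                  # stated compound annual growth rate
years = 2033 - 2024           # nine-year forecast horizon

projected_2033 = base_2024 * (1 + cagr) ** years
print(round(projected_2033, 2))  # roughly 16.8, near the report's 16.17
```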



    The surge in AI adoption across sectors such as healthcare, BFSI, retail, automotive, and education is fueling the need for efficient and reliable dataset discovery solutions. Organizations are increasingly recognizing that the success of AI models hinges on the quality and relevance of the training data, leading to a surge in investments in dataset search platforms that offer advanced filtering, metadata tagging, and data governance capabilities. The integration of AI dataset search platforms with cloud infrastructures further streamlines data access, collaboration, and compliance, making them indispensable tools for enterprises aiming to accelerate AI innovation. The growing complexity of AI projects, coupled with the exponential growth in data volumes, is compelling organizations to seek platforms that can automate and optimize the process of dataset discovery and curation.



    Another significant growth factor is the rapid evolution of AI regulations and data privacy frameworks worldwide. As data governance becomes a top priority, AI dataset search platforms are evolving to include robust features for data lineage tracking, access control, and compliance with regulations such as GDPR, HIPAA, and CCPA. The ability to ensure ethical sourcing and transparent usage of datasets is increasingly valued by enterprises and academic institutions alike. This regulatory landscape is driving the adoption of platforms that not only facilitate efficient dataset search but also enable organizations to demonstrate accountability and compliance in their AI initiatives.



    The expanding ecosystem of AI developers, data scientists, and machine learning engineers is also contributing to the market's growth. The democratization of AI development, supported by open-source frameworks and cloud-based collaboration tools, has increased the demand for platforms that can aggregate, index, and provide easy access to diverse datasets. AI dataset search platforms are becoming central to fostering innovation, reducing development cycles, and enabling cross-domain research. As organizations strive to stay ahead in the competitive AI landscape, the ability to quickly identify and utilize optimal datasets is emerging as a critical differentiator.



    From a regional perspective, North America currently dominates the AI dataset search platform market, accounting for over 38% of global revenue in 2024, driven by the strong presence of leading AI technology companies, active research communities, and significant investments in digital transformation. Europe and Asia Pacific are also witnessing rapid adoption, with Asia Pacific expected to exhibit the highest CAGR of 29.3% during the forecast period, fueled by government initiatives, burgeoning AI startups, and increasing digitalization across industries. Latin America and the Middle East & Africa are gradually embracing AI dataset search platforms, supported by growing awareness and investments in AI research and infrastructure.



    Component Analysis



    The AI Dataset Search Platform market is segmented by component into Software and Services. Software solutions constitute the backbone of this market, providing the core functionalities required for dataset discovery, indexing, metadata management, and integration with existing AI workflows. The software segment is witnessing robust growth as organizations seek advanced platforms capable of handling large-scale, multi-source datasets with sophisticated search capabilities powered by natural language processing and machine learning algorithms. These platforms increasingly incorporate features such as semantic search, automated data labeling, and customizable data pipelines, enabling users to efficiently discover and curate the datasets they need.

  14. Corpus Nummorum - Coin Image Dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Nov 7, 2023
    + more versions
    Cite
    Corpus_Nummorum (2023). Corpus Nummorum - Coin Image Dataset [Dataset]. http://doi.org/10.5281/zenodo.10033993
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Corpus_Nummorum
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Corpus Nummorum - Coin Image Dataset

    This dataset is a collection of ancient coin images from three different sources: the Corpus Nummorum (CN) project, the Münzkabinett Berlin, and the Bibliothèque nationale de France, Département des Monnaies, médailles et antiques. It covers Greek and Roman coins from ancient Thrace, Moesia Inferior, the Troad, and Mysia. Due to copyright restrictions, it contains only a selection of the coins published on the CN portal.

    The dataset contains 115,160 images of about 29,000 unique coins. The images are split into three main folders, each organized into subfolders that hold the coin images. The "dataset_coins" folder contains the coin photos divided into obverse and reverse and arranged by coin type. In the "dataset_types" folder, the obverse and reverse images of each coin are concatenated and padded to a square format with black bars at the top and bottom; the images here are sorted by coin type. The last folder, "dataset_mints", contains the same concatenated images sorted by mint. A "sources" CSV file holds the source for every image. Due to copyright restrictions the image size is limited to 299×299 pixels; however, this should be sufficient for most ML approaches.
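    A folder layout like the one described above (one subfolder per coin type or mint) can be indexed with a short standard-library script. This is a minimal sketch; the subfolder names (`type_001`, ...) and file names are illustrative stand-ins, not actual dataset identifiers:

    ```python
    from pathlib import Path
    from collections import defaultdict

    def index_by_subfolder(root):
        """Map each subfolder name (e.g. a coin-type or mint ID) to its image paths."""
        index = defaultdict(list)
        for img in sorted(Path(root).glob("*/*")):
            if img.suffix.lower() in {".jpg", ".jpeg", ".png"}:
                index[img.parent.name].append(img)
        return dict(index)

    # Demonstration on a tiny synthetic tree; with the real data, point `root`
    # at e.g. "dataset_types/" or "dataset_mints/".
    import tempfile
    root = Path(tempfile.mkdtemp())
    for type_id, names in {"type_001": ["a.jpg", "b.jpg"], "type_002": ["c.png"]}.items():
        d = root / type_id
        d.mkdir()
        for n in names:
            (d / n).touch()

    index = index_by_subfolder(root)
    print({k: len(v) for k, v in index.items()})  # {'type_001': 2, 'type_002': 1}
    ```

    Such an index is a convenient starting point for building the (image path, label) pairs most ML frameworks expect.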

    The main purpose of this dataset in the CN project is the training of machine-learning-based image recognition models. We use three different Convolutional Neural Network architectures: VGG16, VGG19 and ResNet50. Our best model (VGG16) achieves 79% Top-1 and 97% Top-5 accuracy for coin type recognition on this dataset; mint recognition achieves 79% Top-1 and 94% Top-5 accuracy. We have a Colab notebook with two models (trained on the whole CN dataset) online.

    During the summer semester 2023, we held the "Data Challenge" event at our Department of Computer Science at Goethe University. We gave our students this dataset with the task of achieving better results than ours. Here are their experiments:

    Team 1: Voting and stacking of models

    Team 2: Multimodal model

    Team 3: Transformer models

    Team 4: Dockerized TIMM Computer Vision Backend & FastAPI

    • Approach | Type Dataset | Mint Dataset
    • Ours | 79% | 79%
    • Team 1 | - | 86%
    • Team 2 | 86% | -
    • Team 3 | 88% | 58%
    • Team 4 | - | -

    Now we would like to invite you to try out your own ideas and models on our coin data.

    If you have any questions or suggestions, please, feel free to contact us.

  15. Telco_Customer_churn_Data

    • test.researchdata.tuwien.at
    bin, csv, png
    Updated Apr 28, 2025
    Cite
    Erum Naz (2025). Telco_Customer_churn_Data [Dataset]. http://doi.org/10.82556/b0ch-cn44
    Explore at:
    Available download formats: png, csv, bin
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    TU Wien
    Authors
    Erum Naz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Context and Methodology

    The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).

    The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
    The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.

    The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.

    Technical Details

    The dataset has a tabular structure and was initially stored in CSV format. It contains:

    • Rows: 7,043 customer records

    • Columns: 21 features including customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).

    Naming Convention:

    • The table in the database is named telco_customer_churn_data.

    Software Requirements:

    • To open and work with the dataset, any standard database client or programming language supporting MariaDB connections can be used (e.g., Python).

    • For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
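    As a minimal sketch of the binary churn-classification workflow this dataset supports, the following uses scikit-learn on synthetic rows that stand in for the real `telco_customer_churn_data` table; the two features and the toy churn rule are illustrative assumptions, not properties of the actual data:

    ```python
    # Hedged sketch: binary churn classification on synthetic stand-in data.
    import random
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    random.seed(0)
    rows = []
    for _ in range(500):
        tenure = random.randint(0, 72)          # months with the provider
        monthly = random.uniform(20.0, 120.0)   # monthly charges
        # Toy rule: short tenure plus high charges => churn (illustrative only).
        churn = 1 if (tenure < 12 and monthly > 70) else 0
        rows.append(([tenure, monthly], churn))

    X = [r[0] for r in rows]
    y = [r[1] for r in rows]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = clf.score(X_te, y_te)
    print(f"held-out accuracy: {acc:.2f}")
    ```

    With the real table, the feature matrix would instead come from the 21 columns described above (after encoding the categorical ones), with `Churn` as the target.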

    Additional Resources:

    Further Details

    When reusing the dataset, users should be aware:

    • Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    • Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).

    • Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.

  16. Defect Prediction Tool Validation Dataset 2

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Sep 20, 2024
    Cite
    Zenodo (2024). Defect Prediction Tool Validation Dataset 2 [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4299908?locale=mt
    Explore at:
    Available download formats: unknown (119740)
    Dataset updated
    Sep 20, 2024
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is used to address the Research Questions in the study "Within-Project Defect Prediction of Infrastructure-as-Code using Product and Process Metrics", published in Transactions on Software Engineering. See also: https://github.com/stefanodallapalma/TSE-2020-05-0217. It provides:

    • repositories.json - a list of repositories selected from open-source GitHub repositories based on the Ansible language.

    • fixing-commits.json - a list of defect-fixing commits extracted from those repositories.

    • fixed-files.json - a list of Ansible files fixed in those defect-fixing commits and the respective bug-inducing commits.

    • failure-prone-files.json - a list of failure-prone files through the repository's commit history.

    • metrics.zip - CSV files consisting of releases (sets of files) and their IaC-oriented, delta, and process metrics extracted from each analyzed repository.

    • projects.zip - for each analyzed project, the data (models, performance, and results of Recursive Feature Elimination) used to answer the Research Questions.

    Context

    Infrastructure-as-Code (IaC) is the DevOps strategy that manages and provisions infrastructure through machine-readable definition files and automation around them, rather than physical hardware configuration or interactive configuration tools. Although IaC is an increasingly widely adopted practice, little is known about how to best maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion. On the other hand, source code measurements are often computed and analyzed to evaluate different quality aspects of the software developed. Infrastructure-as-Code is simply "code", and as such it is prone to defects like any other programming language. This dataset targets the YAML-based Ansible language to devise within-project defect prediction approaches for IaC based on machine learning.

    Content

    The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:

    • The repository has at least one push event to its master branch in the last six months;

    • The repository has at least 2 releases;

    • At least 10% of the files in the repository are IaC scripts;

    • The repository has at least 2 core contributors;

    • The repository shows evidence of continuous integration practice, such as the presence of a .travis.yaml file;

    • The repository has a comments ratio of at least 0.1%;

    • The repository has a commit frequency of at least 2 per month on average;

    • The repository has an issue frequency of at least 0.01 events per month on average;

    • The repository shows evidence of a license, such as the presence of a LICENSE.md file;

    • The repository has at least 100 source lines of code.

    Metrics are grouped into three categories:

    • IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts.

    • Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric.

    • Process: metrics that capture aspects of the development process rather than aspects of the code itself.

    In addition to the metrics, the dataset contains the pre-trained models (*.joblib) in the rq1 and rq2 folders of projects.zip. A model can be loaded in Python as follows:

    from joblib import load

    model = load('projects/owner/repository/rq1/random_forest.joblib', mmap_mode='r')
    best_estimator = model['estimator']  # the estimator that maximized the AUC-PR
    cv_results = model['cv_results']     # the results of each step of the validation procedure
    best_index = model['best_index']     # the index to access the best cv_results

    Acknowledgements

    This work is supported by the European Commission grant no. 825040 (RADON H2020).

    Inspiration

    What source code properties and properties about the development process are good predictors of defects in Infrastructure-as-Code scripts?
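    The Recursive Feature Elimination step mentioned in the projects.zip description can be sketched as follows with scikit-learn; the synthetic data and feature count are illustrative assumptions, not the study's actual metrics or setup:

    ```python
    # Hedged sketch: random forest + Recursive Feature Elimination, in the spirit
    # of the study's defect-prediction pipeline. Data is synthetic; the class
    # imbalance (weights) loosely mimics rare failure-prone files.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                               weights=[0.8, 0.2], random_state=42)

    rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
              n_features_to_select=4)
    rfe.fit(X, y)

    # rfe.support_ marks which of the 10 candidate features survived elimination.
    print("selected feature indices:",
          [i for i, kept in enumerate(rfe.support_) if kept])
    ```

    With the real metrics.zip data, the columns of X would be the IaC-oriented, delta, and process metrics of each release.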

  17. ‘Gender Classification Dataset’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 12, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘Gender Classification Dataset’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-gender-classification-dataset-84dd/6662a783/?iid=010-319&v=presentation
    Explore at:
    Dataset updated
    Nov 12, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Gender Classification Dataset’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/elakiricoder/gender-classification-dataset on 12 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    While practicing machine learning, I wanted to create a simple dataset that is closely aligned with a real-world scenario and gives good results, to whet my appetite in this domain. If you are a beginner who wants to try solving classification problems in machine learning and prefers seeing good results early, this dataset is a great place to start.

    Content

    This dataset contains 7 features and a label column.

    • long_hair - 0/1 flag where 1 is "long hair" and 0 is "not long hair".
    • forehead_width_cm - width of the forehead, in centimetres.
    • forehead_height_cm - height of the forehead, in centimetres.
    • nose_wide - 0/1 flag where 1 is "wide nose" and 0 is "not wide nose".
    • nose_long - 0/1 flag where 1 is "long nose" and 0 is "not long nose".
    • lips_thin - 0/1 flag where 1 is "thin lips" and 0 is "not thin lips".
    • distance_nose_to_lip_long - 0/1 flag where 1 is "long distance between nose and lips" and 0 is "short distance between nose and lips".

    gender - This is either "Male" or "Female".

    Acknowledgements

    Nothing to acknowledge, as this is made-up data.

    Inspiration

    It's painful to see bad results at the beginning. Don't begin with complicated datasets if you are a beginner. I'm sure that this dataset will encourage you to proceed further in the domain. Good luck.

    --- Original source retains full ownership of the source dataset ---

  18. Data from: Machine Learning Model Geotiffs - Applications of Machine...

    • catalog.data.gov
    • data.openei.org
    • +2more
    Updated Jan 20, 2025
    Cite
    Nevada Bureau of Mines and Geology (2025). Machine Learning Model Geotiffs - Applications of Machine Learning Techniques to Geothermal Play Fairway Analysis in the Great Basin Region, Nevada [Dataset]. https://catalog.data.gov/dataset/machine-learning-model-geotiffs-applications-of-machine-learning-techniques-to-geothermal--a3551
    Explore at:
    Dataset updated
    Jan 20, 2025
    Dataset provided by
    Nevada Bureau of Mines and Geology
    Area covered
    Great Basin, Nevada
    Description

    This submission contains geotiffs, supporting shapefiles and readmes for the inputs and output models of algorithms explored in the Nevada Geothermal Machine Learning project, meant to accompany the final report. Layers include: Artificial Neural Network (ANN), Extreme Learning Machine (ELM), Bayesian Neural Network (BNN), Principal Component Analysis (PCA/PCAk), Non-negative Matrix Factorization (NMF/NMFk), input rasters of feature sets, and positive/negative training sites. See readme .txt files and final report for additional metadata. A submission linking the full codebase for generating machine learning output models is available under "related resources" on this page.

  19. A Multi-Parameter Dataset for Machine Learning Based Fruit Spoilage...

    • data.mendeley.com
    Updated Sep 29, 2025
    Cite
    Nshekanabo Marius (2025). A Multi-Parameter Dataset for Machine Learning Based Fruit Spoilage Prediction in an IoT-Enabled Cold Storage System [Dataset]. http://doi.org/10.17632/czz68d9fwj.1
    Explore at:
    Dataset updated
    Sep 29, 2025
    Authors
    Nshekanabo Marius
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was compiled as part of a project to design a cold storage system to combat post-harvest food loss in developing regions by integrating IoT technology with predictive machine learning. The project, which was developed for use by smallholder farmers in Uganda, aims to monitor and proactively control the environmental conditions in cold storage units to extend the shelf life of perishable goods.

    This dataset is specifically structured for machine learning applications, serving as the training and validation data for machine learning models. It contains environmental data points collected in a controlled cold storage environment. The data is organized into a comma-separated values (CSV) file with a total of 10,996 entries in the following six columns:

    Fruit: A categorical variable indicating the type of fruit being stored (e.g., Orange, Pineapple, Banana, Tomato).

    Temp: The temperature inside the cold storage unit, measured in degrees Celsius (°C).

    Humid: The relative humidity (RH) of the environment, measured as a percentage (%).

    Light: The intensity of light exposure, measured in Lux.

    CO2: The concentration of carbon dioxide (CO₂) in the air, measured in parts per million (ppm).

    Class: A binary classification label (Good or Bad) that serves as the target variable for the predictive model, indicating whether the environmental conditions are optimal or suboptimal for spoilage prevention.
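    Rows shaped like the six columns above can be parsed with the standard library alone. In this sketch the two sample rows are synthetic stand-ins, not actual dataset values:

    ```python
    # Parsing rows shaped like the described CSV (Fruit, Temp, Humid, Light, CO2, Class).
    import csv, io

    sample = """Fruit,Temp,Humid,Light,CO2,Class
    Orange,4.5,90,10,450,Good
    Banana,12.0,65,300,900,Bad
    """

    rows = []
    for rec in csv.DictReader(io.StringIO(sample)):
        rows.append({
            "fruit": rec["Fruit"].strip(),
            "temp_c": float(rec["Temp"]),
            "humidity_pct": float(rec["Humid"]),
            "light_lux": float(rec["Light"]),
            "co2_ppm": float(rec["CO2"]),
            "label": 1 if rec["Class"].strip() == "Bad" else 0,  # binary target
        })

    print(len(rows), rows[0]["fruit"], rows[1]["label"])  # 2 Orange 1
    ```

    Converting the `Class` column to 0/1 this way gives the target variable most classifiers expect.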

    The data's primary purpose is to provide a basis for training predictive models to classify environmental conditions and assess spoilage risk.

    The dataset is a valuable resource for researchers and practitioners in fields such as smart agriculture, food science, embedded systems, and machine learning. It can be used to:

    Train, validate, and test new predictive models for food spoilage.

    Analyze the correlation between specific environmental factors (temperature, humidity, CO2, and light) and fruit spoilage outcomes.

    Support the development of low-cost, intelligent monitoring systems for cold chain logistics and food preservation.

    This dataset and the associated project are intended to contribute to achieving the United Nations Sustainable Development Goals (SDGs), particularly those related to food security and sustainable agriculture.

  20. Cdd Dataset

    • universe.roboflow.com
    zip
    Updated Sep 5, 2023
    Cite
    hakuna matata (2023). Cdd Dataset [Dataset]. https://universe.roboflow.com/hakuna-matata/cdd-g8a6g/model/3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 5, 2023
    Dataset authored and provided by
    hakuna matata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Cucumber Disease Detection Bounding Boxes
    Description

    Project Documentation: Cucumber Disease Detection

    1. Title and Introduction Title: Cucumber Disease Detection

    Introduction: A machine learning model for the automatic detection of diseases in cucumber plants is to be developed as part of the "Cucumber Disease Detection" project. This research is crucial because it tackles the issue of early disease identification in agriculture, which can increase crop yield and cut down on financial losses. To train and test the model, we use a dataset of pictures of cucumber plants.

    2. Problem Statement Problem Definition: The research uses image analysis methods to address the issue of automating the identification of diseases, such as Downy Mildew, in cucumber plants. Effective disease management in agriculture depends on early disease identification.

    Importance: Early disease diagnosis helps minimize crop losses, stop the spread of diseases, and better allocate resources in farming. Agriculture is a real-world application of this concept.

    Goals and Objectives: Develop a machine learning model to classify cucumber plant images into healthy and diseased categories. Achieve a high level of accuracy in disease detection. Provide a tool for farmers to detect diseases early and take appropriate action.

    3. Data Collection and Preprocessing Data Sources: The dataset comprises pictures of cucumber plants from various sources, including both healthy and diseased specimens.

    Data Collection: Images were gathered from agricultural areas using cameras and smartphones.

    Data Preprocessing: Data cleaning to remove irrelevant or corrupted images. Handling missing values, if any, in the dataset. Removing outliers that may negatively impact model training. Data augmentation techniques applied to increase dataset diversity.

    4. Exploratory Data Analysis (EDA) The dataset was examined using visuals such as scatter plots and histograms and checked for patterns, trends, and correlations. EDA made it easier to understand the distribution of photos of healthy and diseased plants.

    5. Methodology Machine Learning Algorithms:

    Convolutional Neural Networks (CNNs) were chosen for image classification due to their effectiveness in handling image data. Transfer learning using pre-trained models such as ResNet or MobileNet may be considered. Train-Test Split:

    The dataset was split into training and testing sets with a suitable ratio. Cross-validation may be used to assess model performance robustly.
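    The train-test split step described above can be sketched with scikit-learn; a stratified split keeps the healthy/diseased class balance the same in both sets. The file names and class counts below are synthetic stand-ins:

    ```python
    # Sketch of a stratified train/test split for an image-classification dataset.
    from sklearn.model_selection import train_test_split

    paths = [f"img_{i:03d}.jpg" for i in range(100)]  # hypothetical file names
    labels = [0] * 70 + [1] * 30                      # 0 = healthy, 1 = diseased

    tr_p, te_p, tr_y, te_y = train_test_split(
        paths, labels, test_size=0.2, stratify=labels, random_state=42)

    print(len(tr_p), len(te_p), sum(te_y))  # 80 20 6
    ```

    Stratification matters here because diseased examples are the minority class; an unstratified split could leave the test set with too few of them to evaluate recall reliably.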

    6. Model Development The CNN model's architecture consists of layers, units, and activation functions. Hyperparameters including the learning rate, batch size, and optimizer were chosen through experimentation. To avoid overfitting, regularization methods such as dropout and L2 regularization were used.

    7. Model Training During training, the model was fed the prepared dataset over a number of epochs. The loss function was minimized using an optimization method. To ensure convergence, early stopping and model checkpoints were used.

    8. Model Evaluation Evaluation Metrics:

    Accuracy, precision, recall, F1-score, and confusion matrix were used to assess model performance. Results were computed for both training and test datasets. Performance Discussion:

    The model's performance was analyzed in the context of disease detection in cucumber plants. Strengths and weaknesses of the model were identified.

    9. Results and Discussion Key project findings include model performance and disease detection precision, a comparison of the models employed showing the benefits and drawbacks of each, and the challenges faced throughout the project along with the methods used to solve them.

    10. Conclusion Recap of the project's key learnings. The project's importance to early disease detection in agriculture is highlighted, and future enhancements and potential research directions are suggested.

    11. References Libraries: Pillow, Roboflow, YOLO, scikit-learn, matplotlib. Dataset: https://data.mendeley.com/datasets/y6d3z6f8z9/1

    12. Code Repository https://universe.roboflow.com/hakuna-matata/cdd-g8a6g

    Rafiur Rahman Rafit EWU 2018-3-60-111

Data from: NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

GitHub page: https://github.com/soarsmu/NICHE
