100+ datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHubhttps://github.com/
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. issues-kaggle-notebooks

    • huggingface.co
    Updated Aug 12, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
    Explore at:
    Dataset updated
    Aug 12, 2025
    Dataset provided by
    Hugging Facehttps://huggingface.co/
    Authors
    Hugging Face Smol Models Research
    Description

    GitHub Issues & Kaggle Notebooks

      Description
    

    GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language models training, they are sourced from GitHub issues and notebooks in Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, precisely the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.

  3. Github Indian users deep data

    • kaggle.com
    Updated Oct 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archit Tyagi (2024). Github Indian users deep data [Dataset]. https://www.kaggle.com/datasets/architty108/github-indian-users-deep-data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 22, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Archit Tyagi
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides a rich snapshot of GitHub users from India, capturing various aspects of their public profiles. It's a valuable resource for analyzing trends in coding activity, repository management, and user engagement within the Indian developer community. Whether you're interested in exploring how developers grow their followers, examining language preferences, or identifying patterns in contributions and achievements, this dataset offers multiple points of analysis.

    Key Features: - Username: GitHub usernames of the individuals. - Gender Pronoun: Preferred gender pronouns (if available). - Followings: Number of people each user follows. - Joining Year: The year they joined GitHub. - Contributions: Number of contributions made in the last year. - Achievements: Number of GitHub achievements unlocked by the user. - Stars: Total number of stars on their repositories. - Repositories: Number of repositories created. - Followers: Number of followers each user has. - Location: User location details, primarily from India. - Languages: Primary programming language used by the individual. - Social Links: Links to their other social platforms (LinkedIn, personal websites, etc.). - Sorting Type: Categorized based on followers, repositories, or recent joining.

    This dataset can be used for: - Profiling the Indian developer community. - Tracking open-source contributions and achievements. - Analyzing programming language preferences and repository management. - Exploring the relationship between social followings and coding contributions.

    Perfect for data science, social network analysis, and open-source research.

  4. Fruit and Vegetable Disease (Healthy vs Rotten)

    • kaggle.com
    Updated May 20, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Muhammad Subhan (2024). Fruit and Vegetable Disease (Healthy vs Rotten) [Dataset]. http://doi.org/10.34740/kaggle/dsv/8463025
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 20, 2024
    Dataset provided by
    Kaggle
    Authors
    Muhammad Subhan
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview The Fruit and Vegetable Diseases Dataset directory contains a comprehensive collection of images categorized to assist in the development of machine learning models for detecting diseases in various fruits and vegetables. This dataset is ideal for tasks such as image classification and deep learning.

    About Dataset Directories The Fruit and Vegetable Diseases Dataset consists of 28 directories, each representing a combination of healthy and rotten images for 14 different types of fruits and vegetables. This dataset is intended for training and evaluating machine learning models for disease detection. Images are categorized and organized to facilitate easy access and utilization for image classification and deep learning tasks.

    Data Collection This dataset has been compiled from various reputable sources, including Kaggle and GitHub repositories. Each image has been manually inspected and categorized to ensure high quality and relevance for the task of disease detection in fruits and vegetables.

    Usage The dataset is highly versatile and can be used for a variety of machine learning tasks, including: - Image Classification - Transfer Learning - Deep Learning It is particularly useful for training models that can identify and distinguish between healthy and rotten produce, making it ideal for applications in agricultural technology and food quality control.

    Metadata - Number of Classes: 28 (Healthy and Rotten categories for 14 different fruits and vegetables) - Image Format: JPEG/PNG - Image Size: Varies (recommended to resize to a standard size for consistent training)

    Recommended Practices To maximize the utility of this dataset, consider the following: - Preprocessing: Resize images to a uniform size suitable for your model. - Augmentation: Apply data augmentation techniques to enhance model robustness. - Validation: Use a separate validation set to tune and evaluate your models effectively.

    This dataset is an excellent resource for developing advanced machine learning models for detecting diseases in fruits and vegetables, contributing to improved food safety and quality assurance practices.

  5. R

    Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To Showcase custom Object Detection on the Given Dataset to train and Infer the Model using newly launched YoloV7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display
    
    display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this Notebook, I have processed the images with RoboFlow because in COCO formatted dataset was having different dimensions of image and Also data set was not splitted into different Format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG" width=800>

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading YOLOV7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except:
      wandb.login(anonymous='must')
      print('To use your W&B account,
    Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. 
    Get your W&B access token from here: https://wandb.ai/authorize')
      
      
      
    wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png" alt="">

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, We can choose between two paths:

    Version v2 Aug 12, 2022 Looks like this.

    https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG" alt="">

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training Custom pretrained YOLOv7 model

    Here, I am able to pass a number of arguments: - img: define input image size - batch: determine

  6. A

    ‘Machine Learning users on Github’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Machine Learning users on Github’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-machine-learning-users-on-github-c26c/b6d9c9f7/?iid=003-635&v=presentation
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Machine Learning users on Github’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/prosperchuks/machine-learning-users-on-github on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Data was scraped from Github's API.

    Columns

    LOGIN: shows the user's Github login ID: user's id URL: API link to the user's profile NAME: fullname of the user COMPANY: organization the user's affiliated with BLOG: link to the user's blog site LOCATION: location where the user resides EMAIL: user's email address BIO: about the user

    This dataset contains over 600 users from Lagos, Nigeria and Rwanda

    Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data

    --- Original source retains full ownership of the source dataset ---

  7. t

    Programming Language Ecosystem Project TU Wien

    • test.researchdata.tuwien.ac.at
    csv, text/markdown
    Updated Jun 25, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Valentin Futterer; Valentin Futterer; Valentin Futterer; Valentin Futterer (2024). Programming Language Ecosystem Project TU Wien [Dataset]. http://doi.org/10.70124/gnbse-ts649
    Explore at:
    text/markdown, csvAvailable download formats
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Valentin Futterer; Valentin Futterer; Valentin Futterer; Valentin Futterer
    License

    Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Time period covered
    Dec 12, 2023
    Area covered
    Vienna
    Description

    About Dataset

    This dataset was created during the Programming Language Ecosystem project from TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.

    The centerpiece of this repository is the usage_of_programming_languages_2011-2023.csv. This csv file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate on the popularity of programming languages, this dataset was created using 3 vastly different sources.

    About Data collection methodology

    The dataset was created using the github repository above. As input data, three public datasets where used.

    github_metadata

    Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It shows metadata information (no code) of all github repositories with more than 5 stars.

    PYPL_survey_2004-2023

    Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows from 2004 to 2023 for each month the share of programming related google searches per language.

    stack_overflow_developer_survey

    Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows from 2011 to 2023 the results of the yearly stackoverflow developer survey.

    All these datasets were downloaded on the 12.12.2023. The datasets are all in the github repository above

    Description of the data

    The dataset contains a column for the year and then many columns for the different languages, denoting their usage in percent. Additionally, vertical barcharts and piecharts for each year plus a line graph for each language over the whole timespan as png's are provided.

    The languages that are going to be considered for the project can be seen here:

    - Python

    - C

    - C++

    - Java

    - C#

    - JavaScript

    - PHP

    - SQL

    - Assembly

    - Scratch

    - Fortran

    - Go

    - Kotlin

    - Delphi

    - Swift

    - Rust

    - Ruby

    - R

    - COBOL

    - F#

    - Perl

    - TypeScript

    - Haskell

    - Scala

    License

    This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/ license.

    TLDR: You are free to share, adapt, and create derivative works from this dataser as long as you attribute me, keep the database open (if you redistribute it), and continue to share-alike any adapted database under the ODbl.

    Acknowledgments

    Thanks go out to

    - stackoverflow https://insights.stackoverflow.com/survey for providing the data from the yearly stackoverflow developer survey.

    - the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master for providing google search data.

    - Peter Elmers, for crawling metadata on github repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.

  8. Github Repo Languages DataSet

    • kaggle.com
    Updated Mar 8, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Saujan Tiwari (2023). Github Repo Languages DataSet [Dataset]. https://www.kaggle.com/datasets/saujantiwari/repo-language-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Saujan Tiwari
    Description

    Dataset

    This dataset was created by Saujan Tiwari

    Contents

  9. Google Landmarks Dataset v2

    • github.com
    • opendatalab.com
    Updated Sep 27, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
    Explore at:
    Dataset updated
    Sep 27, 2019
    Dataset provided by
    Googlehttp://google.com/
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated to two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.

  10. F

    Data from: A Neural Approach for Text Extraction from Scholarly Figures

    • data.uni-hannover.de
    zip
    Updated Jan 20, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
    Explore at:
    zip(798357692)Available download formats
    Dataset updated
    Jan 20, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    A Neural Approach for Text Extraction from Scholarly Figures

    This is the readme for the supplemental data for our ICDAR 2019 paper.

    You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

    If you found this dataset useful, please consider citing our paper:

    @inproceedings{DBLP:conf/icdar/MorrisTE19,
     author  = {David Morris and
            Peichen Tang and
            Ralph Ewerth},
     title   = {A Neural Approach for Text Extraction from Scholarly Figures},
     booktitle = {2019 International Conference on Document Analysis and Recognition,
            {ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
     pages   = {1438--1443},
     publisher = {{IEEE}},
     year   = {2019},
     url    = {https://doi.org/10.1109/ICDAR.2019.00231},
     doi    = {10.1109/ICDAR.2019.00231},
     timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
     biburl  = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
     bibsource = {dblp computer science bibliography, https://dblp.org}
    }
    

    This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

    Datasets

    We used different sources of data for testing, validation, and training. Our testing set was assembled by the work we cited by Böschen et al. We excluded the DeGruyter dataset, and use it as our validation dataset.

    Testing

    These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

    Validation

    The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

    Training

    We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

    Code

    We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

    Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

    We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.

    We used a java script provided by Falk Böschen and adapted to our file structure. We included this as evaluator.jar.

    Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.

  11. github-trending

    • kaggle.com
    Updated Jul 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mahesh Dhingra (2025). github-trending [Dataset]. https://www.kaggle.com/datasets/maheshdhingra/github-trending/data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Mahesh Dhingra
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Mahesh Dhingra

    Released under MIT

    Contents

  12. A

    ‘us-statewise-covid data’ analyzed by Analyst-2

    • analyst-2.ai
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘us-statewise-covid data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-statewise-covid-data-f6e4/95e686b2/?iid=004-991&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘us-statewise-covid data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/dhamur/usstatewisecovid-data on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    About the data

    This is a covid19 data set from United States. It includes date, Number of cases, Number of deaths. The other countries data are also available in my Kaggle and Github profile. The links are provided below - Github - Kaggle If you want to read more about the data Click here

    --- Original source retains full ownership of the source dataset ---

  13. C

    Community-Driven Model Service Platform Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 9, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Market Report Analytics (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.marketreportanalytics.com/reports/community-driven-model-service-platform-73124
    Explore at:
    pdf, doc, pptAvailable download formats
    Dataset updated
    Apr 9, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Community-Driven Model Service Platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 10.1% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing availability of open-source machine learning models and datasets, facilitated by platforms like Kaggle, GitHub, and Hugging Face, significantly lowers the barrier to entry for developers and researchers. Furthermore, the growing demand for specialized, niche models that cater to specific industry needs is propelling market growth. The rise of cloud-based solutions offers scalability and cost-effectiveness, attracting a wider user base. The market segmentation reveals strong growth in both adult and children's applications, reflecting the broadening use cases for these platforms across various age demographics. Cloud-based platforms dominate the market share due to their flexibility and accessibility, though on-premise solutions retain a significant presence in enterprises prioritizing data security and control. Geographic distribution showcases a strong presence in North America and Europe, driven by established tech ecosystems and high adoption rates of AI technologies. However, rapid growth is anticipated in the Asia-Pacific region, particularly in countries like China and India, due to increasing digitalization and investment in AI infrastructure. The market's restraints are primarily focused on the need for robust data governance and ethical considerations surrounding AI model development and deployment. Concerns about data bias, model explainability, and potential misuse of AI models need to be addressed to ensure responsible growth. Furthermore, the skill gap in AI development and deployment presents a challenge, requiring substantial investment in education and training initiatives to support the expanding market. Competition among platform providers is intense, necessitating continuous innovation and the development of unique value propositions to maintain market share. Despite these challenges, the long-term outlook for the Community-Driven Model Service Platform market remains exceptionally positive, driven by sustained technological advancements and the increasing reliance on AI across diverse industries. The platform's potential to foster collaboration and democratize access to powerful machine learning tools is poised to drive considerable market expansion in the coming years.

  14. R

    Cat Dog Spider Pumpkin Hooman Dataset

    • universe.roboflow.com
    zip
    Updated Jan 13, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Peter Guhl (2023). Cat Dog Spider Pumpkin Hooman Dataset [Dataset]. https://universe.roboflow.com/peter-guhl-de1vy/cat-dog-spider-pumpkin-hooman
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 13, 2023
    Dataset authored and provided by
    Peter Guhl
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Pumpkins Bounding Boxes
    Description

    Started out as a pumpkin detector to test training YOLOv5. Now suffering from extensive feature creep and probably ending up as a cat/dog/spider/pumpkin/randomobjects-detector. Or as a desaster.

    The dataset does not fit https://docs.ultralytics.com/tutorials/training-tips-best-results/ well. There are no background images and the labeling is often only partial. Especially in the humans and pumpkin category where there are often lots of objects in one photo people apparently (and understandably) got bored and did not labe everything. And of course the images from the cat-category don't have the humans in it labeled since they come from a cat-identification model which ignored humans. It will need a lot of time to fixt that.

    Dataset used: - Cat and Dog Data: Cat / Dog Tutorial NVIDIA Jetson https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md © 2016-2019 NVIDIA according to bottom of linked page - Spider Data: Kaggle Animal 10 image set https://www.kaggle.com/datasets/alessiocorrado99/animals10 Animal pictures of 10 different categories taken from google images Kaggle project licensed GPL 2 - Pumpkin Data: Kaggle "Vegetable Images" https://www.researchgate.net/publication/352846889_DCNN-Based_Vegetable_Image_Classification_Using_Transfer_Learning_A_Comparative_Study https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset Kaggle project licensed CC BY-SA 4.0 - Some pumpkin images manually copied from google image search - https://universe.roboflow.com/chess-project/chess-sample-rzbmc Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/steve-pamer-cvmbg/pumpkins-gfjw5 Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/nbduy/pumpkin-ryavl Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/homeworktest-wbx8v/cat_test-1x0bl/dataset/2 - https://universe.roboflow.com/220616nishikura/catdetector - https://universe.roboflow.com/atoany/cats-s4d4i/dataset/2 - https://universe.roboflow.com/personal-vruc2/agricultured-ioth22 - https://universe.roboflow.com/sreyoshiworkspace-radu9/pet_detection - https://universe.roboflow.com/artyom-hystt/my-dogs-lcpqe - license: Public Domain url: https://universe.roboflow.com/dolazy7-gmail-com-3vj05/sweetpumpkin/dataset/2 - https://universe.roboflow.com/tristram-dacayan/social-distancing-g4pbu - https://universe.roboflow.com/fyp-3edkl/social-distancing-2ygx5 License MIT - Spiders: https://universe.roboflow.com/lucas-lins-souza/animals-train-yruka

    Currently I can't guarantee it's all correctly licenced. Checks are in progress. Inform me if you see one of your pictures and want it to be removed!

  15. Weather prediction dataset

    • zenodo.org
    • kaggle.com
    csv, png, txt
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Florian Huber; Florian Huber (2024). Weather prediction dataset [Dataset]. http://doi.org/10.5281/zenodo.4770937
    Explore at:
    csv, png, txtAvailable download formats
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Florian Huber; Florian Huber
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset created for machine learning and deep learning training and teaching purposes.
    Can for instance be used for classification, regression, and forecasting tasks.
    Complex enough to demonstrate realistic issues such as overfitting and unbalanced data, while still remaining intuitively accessible.

    ORIGINAL DATA TAKEN FROM:

    EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on 22-04-2021
    THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:

    Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
    air temperature and precipitation series for the European Climate Assessment.
    Int. J. of Climatol., 22, 1441-1453.
    Data and metadata available at http://www.ecad.eu

    For more information see metadata.txt file.

    The Python code used to create the weather prediction dataset from the ECA&D data can be found on GitHub: https://github.com/florian-huber/weather_prediction_dataset
    (this repository also contains Jupyter notebooks with teaching examples)

  16. j

    Sample dataset

    • demo.jkan.io
    • gsa.github.io
    • +7more
    api, csv, shp
    Updated Mar 31, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2016). Sample dataset [Dataset]. https://demo.jkan.io/datasets/sample-dataset/
    Explore at:
    api, shp, csvAvailable download formats
    Dataset updated
    Mar 31, 2016
    Description

    This is an example dataset that comes with a new installation of JKAN

  17. Testing github actions for uploading datasets

    • kaggle.com
    Updated Feb 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ayush Thakur (2022). Testing github actions for uploading datasets [Dataset]. https://www.kaggle.com/datasets/ayuraj/test-dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 1, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Ayush Thakur
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Ayush Thakur

    Released under CC0: Public Domain

    Contents

  18. h

    hinglish_top

    • huggingface.co
    • kaggle.com
    Updated Jun 29, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Will Held (2023). hinglish_top [Dataset]. https://huggingface.co/datasets/WillHeld/hinglish_top
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2023
    Authors
    Will Held
    Description
  19. R

    Electric Pylon Detection In Rsi Dataset

    • universe.roboflow.com
    zip
    Updated Dec 24, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Robin Public (2022). Electric Pylon Detection In Rsi Dataset [Dataset]. https://universe.roboflow.com/robin-public/electric-pylon-detection-in-rsi/dataset/1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 24, 2022
    Dataset authored and provided by
    Robin Public
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Electric Pylons Bounding Boxes
    Description

    From the Authors:

    EPD dataset contains 1500 images in total: 720 images were captured by Pleiades satellite along Huimao Line in Guangdong Province, China, a main line of power network in south China, while the remaining images were collected from Google Earth to further improve the representativeness of the dataset by expanding the source of samples. The spatial resolution of images in EPD dataset is 1 m/pixel.

    Moreover, to test and evaluate the adaptability of the detectors in face of actual situations, we specially selected 50 relatively complex images from EPD dataset comprising a complex test subset called EPD-C ...

    1450 images in EPD dataset excluding EPD-C as a standard subset named EPD-S, which involves more than 3000 electric pylons. EPD-S subset was used to train detectors and perform random experiments.

    Dataset Source:

    This dataset is obtained from the listing in Robin Cole's satellite-image-deep-learning GitHub repository * https://www.satellite-image-deep-learning.com/ * https://twitter.com/robmarkcole

    Electric-Pylon-Detection-in-RSI -> a dataset which contains 1500 remote sensing images of electric pylons used to train ten deep learning models * GitHub Repository * Kaggle EPD Dataset: https://www.kaggle.com/datasets/qiaosijia/epd-dataset * Link to the research paper: Deep Learning Based Electric Pylon Detection in Remote Sensing Images

  20. g

    Ten Thousand German News Articles Dataset

    • tblock.github.io
    • kaggle.com
    csv
    Updated Mar 5, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    T. Block (2019). Ten Thousand German News Articles Dataset [Dataset]. https://tblock.github.io/10kGNAD/
    Explore at:
    csvAvailable download formats
    Dataset updated
    Mar 5, 2019
    Authors
    T. Block
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    10kGNAD - A german topic classification dataset. Visit the dataset page for more information: https://tblock.github.io/10kGNAD/

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
Organization logo

GitHub Repos

Code and comments from 2.8 million repos

Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset provided by
GitHubhttps://github.com/
Authors
Github
Description

GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

Inspiration

  • This is the perfect dataset for fighting language wars.
  • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
Search
Clear search
Close search
Google apps
Main menu