100+ datasets found

GitHub Repos
kaggle.com
zip
Updated Mar 20, 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Mar 20, 2019
Dataset provided by
GitHubhttps://github.com/
Authors
Github
Description
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

Inspiration

This is the perfect dataset for fighting language wars.

Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
issues-kaggle-notebooks
huggingface.co
Updated Aug 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Hugging Face Smol Models Research (2025). issues-kaggle-notebooks [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks
Explore at:
Dataset updated
Aug 12, 2025
Dataset provided by
Hugging Facehttps://huggingface.co/
Authors
Hugging Face Smol Models Research
Description
GitHub Issues & Kaggle Notebooks

Description

GitHub Issues & Kaggle Notebooks is a collection of two code datasets intended for language models training, they are sourced from GitHub issues and notebooks in Kaggle platform. These datasets are a modified part of the StarCoder2 model training corpus, precisely the bigcode/StarCoder2-Extras dataset. We reformat the samples to remove StarCoder2's special tokens and use natural text to delimit comments in issues and display… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/issues-kaggle-notebooks.
Github Indian users deep data
kaggle.com
Updated Oct 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archit Tyagi (2024). Github Indian users deep data [Dataset]. https://www.kaggle.com/datasets/architty108/github-indian-users-deep-data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 22, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Archit Tyagi
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset provides a rich snapshot of GitHub users from India, capturing various aspects of their public profiles. It's a valuable resource for analyzing trends in coding activity, repository management, and user engagement within the Indian developer community. Whether you're interested in exploring how developers grow their followers, examining language preferences, or identifying patterns in contributions and achievements, this dataset offers multiple points of analysis.

Key Features: - Username: GitHub usernames of the individuals. - Gender Pronoun: Preferred gender pronouns (if available). - Followings: Number of people each user follows. - Joining Year: The year they joined GitHub. - Contributions: Number of contributions made in the last year. - Achievements: Number of GitHub achievements unlocked by the user. - Stars: Total number of stars on their repositories. - Repositories: Number of repositories created. - Followers: Number of followers each user has. - Location: User location details, primarily from India. - Languages: Primary programming language used by the individual. - Social Links: Links to their other social platforms (LinkedIn, personal websites, etc.). - Sorting Type: Categorized based on followers, repositories, or recent joining.

This dataset can be used for: - Profiling the Indian developer community. - Tracking open-source contributions and achievements. - Analyzing programming language preferences and repository management. - Exploring the relationship between social followings and coding contributions.

Perfect for data science, social network analysis, and open-source research.
Fruit and Vegetable Disease (Healthy vs Rotten)
kaggle.com
Updated May 20, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Muhammad Subhan (2024). Fruit and Vegetable Disease (Healthy vs Rotten) [Dataset]. http://doi.org/10.34740/kaggle/dsv/8463025
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8463025
Dataset updated
May 20, 2024
Dataset provided by
Kaggle
Authors
Muhammad Subhan
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview The Fruit and Vegetable Diseases Dataset directory contains a comprehensive collection of images categorized to assist in the development of machine learning models for detecting diseases in various fruits and vegetables. This dataset is ideal for tasks such as image classification and deep learning.

About Dataset Directories The Fruit and Vegetable Diseases Dataset consists of 28 directories, each representing a combination of healthy and rotten images for 14 different types of fruits and vegetables. This dataset is intended for training and evaluating machine learning models for disease detection. Images are categorized and organized to facilitate easy access and utilization for image classification and deep learning tasks.

Data Collection This dataset has been compiled from various reputable sources, including Kaggle and GitHub repositories. Each image has been manually inspected and categorized to ensure high quality and relevance for the task of disease detection in fruits and vegetables.

Usage The dataset is highly versatile and can be used for a variety of machine learning tasks, including: - Image Classification - Transfer Learning - Deep Learning It is particularly useful for training models that can identify and distinguish between healthy and rotten produce, making it ideal for applications in agricultural technology and food quality control.

Metadata - Number of Classes: 28 (Healthy and Rotten categories for 14 different fruits and vegetables) - Image Format: JPEG/PNG - Image Size: Varies (recommended to resize to a standard size for consistent training)

Recommended Practices To maximize the utility of this dataset, consider the following: - Preprocessing: Resize images to a uniform size suitable for your model. - Augmentation: Apply data augmentation techniques to enhance model robustness. - Validation: Use a separate validation set to tune and evaluate your models effectively.

This dataset is an excellent resource for developing advanced machine learning models for detecting diseases in fruits and vegetables, contributing to improved food safety and quality assurance practices.
R
Custom Yolov7 On Kaggle On Custom Dataset
universe.roboflow.com
zip
Updated Jan 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
Explore at:
zipAvailable download formats
Dataset updated
Jan 29, 2023
Dataset authored and provided by
Owais Ahmad
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Person Car Bounding Boxes
Description
Custom Training with YOLOv7 🔥

Some Important links

Model Inference🤖

🚀Training Yolov7 on Kaggle

Weight and Biases 🐝

HuggingFace 🤗 Model Repo

Contact Information

Name - Owais Ahmad

Phone - +91-9515884381

Email - owaiskhan9654@gmail.com

Portfolio - https://owaiskhan9654.github.io/

Objective

To Showcase custom Object Detection on the Given Dataset to train and Infer the Model using newly launched YoloV7.

Data Acquisition

The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

Link to the Downloadable Dataset

from IPython.display import Markdown, display display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))

Custom Training with YOLOv7 🔥

In this Notebook, I have processed the images with RoboFlow because in COCO formatted dataset was having different dimensions of image and Also data set was not splitted into different Format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:

Export the dataset to YOLOv7

Train YOLOv7 to recognize the objects in our dataset

Evaluate our YOLOv7 model's performance

Run test inference to view performance of YOLOv7 model at work

📦 YOLOv7

https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG" width=800>

Image Credit - jinfagang

Step 1: Install Requirements

!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements %cd yolov7 !pip install -qr requirements.txt !pip install -q roboflow

Downloading YOLOV7 starting checkpoint

!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"

import os import glob import wandb import torch from roboflow import Roboflow from kaggle_secrets import UserSecretsClient from IPython.display import Image, clear_output, display # to display images print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")

https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">

I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

YOLOv7-Car-Person-Custom

try: user_secrets = UserSecretsClient() wandb_api_key = user_secrets.get_secret("wandb_api") wandb.login(key=wandb_api_key) anonymous = None except: wandb.login(anonymous='must') print('To use your W&B account, Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB. Get your W&B access token from here: https://wandb.ai/authorize') wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")

Step 2: Assemble Our Dataset

https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png" alt="">

In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

In Roboflow, We can choose between two paths:

Convert an existing Coco dataset to YOLOv7 format. In Roboflow it supports over 30 formats object detection formats for conversion.

Uploading only these raw images and annotate them in Roboflow with Roboflow Annotate.

Version v2 Aug 12, 2022 Looks like this.

https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG" alt="">

user_secrets = UserSecretsClient() roboflow_api_key = user_secrets.get_secret("roboflow_api")

rf = Roboflow(api_key=roboflow_api_key) project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq") dataset = project.version(2).download("yolov7")

Step 3: Training Custom pretrained YOLOv7 model

Here, I am able to pass a number of arguments: - img: define input image size - batch: determine
A
‘Machine Learning users on Github’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Machine Learning users on Github’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-machine-learning-users-on-github-c26c/b6d9c9f7/?iid=003-635&v=presentation
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Machine Learning users on Github’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/prosperchuks/machine-learning-users-on-github on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Data was scraped from Github's API.

Columns

LOGIN: shows the user's Github login ID: user's id URL: API link to the user's profile NAME: fullname of the user COMPANY: organization the user's affiliated with BLOG: link to the user's blog site LOCATION: location where the user resides EMAIL: user's email address BIO: about the user

This dataset contains over 600 users from Lagos, Nigeria and Rwanda

Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data

--- Original source retains full ownership of the source dataset ---
t
Programming Language Ecosystem Project TU Wien
test.researchdata.tuwien.ac.at
csv, text/markdown
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Valentin Futterer; Valentin Futterer; Valentin Futterer; Valentin Futterer (2024). Programming Language Ecosystem Project TU Wien [Dataset]. http://doi.org/10.70124/gnbse-ts649
Explore at:
text/markdown, csvAvailable download formats
Unique identifier
https://doi.org/10.70124/gnbse-ts649
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Valentin Futterer; Valentin Futterer; Valentin Futterer; Valentin Futterer
License
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Time period covered
Dec 12, 2023
Area covered
Vienna
Description
About Dataset
This dataset was created during the Programming Language Ecosystem project from TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.
The centerpiece of this repository is the usage_of_programming_languages_2011-2023.csv. This csv file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate on the popularity of programming languages, this dataset was created using 3 vastly different sources.

About Data collection methodology
The dataset was created using the github repository above. As input data, three public datasets where used.
github_metadata
Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It shows metadata information (no code) of all github repositories with more than 5 stars.
PYPL_survey_2004-2023
Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows from 2004 to 2023 for each month the share of programming related google searches per language.
stack_overflow_developer_survey
Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows from 2011 to 2023 the results of the yearly stackoverflow developer survey.
All these datasets were downloaded on the 12.12.2023. The datasets are all in the github repository above

Description of the data
The dataset contains a column for the year and then many columns for the different languages, denoting their usage in percent. Additionally, vertical barcharts and piecharts for each year plus a line graph for each language over the whole timespan as png's are provided.

The languages that are going to be considered for the project can be seen here:
- Python
- C
- C++
- Java
- C#
- JavaScript
- PHP
- SQL
- Assembly
- Scratch
- Fortran
- Go
- Kotlin
- Delphi
- Swift
- Rust
- Ruby
- R
- COBOL
- F#
- Perl
- TypeScript
- Haskell
- Scala

License
This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/ license.
TLDR: You are free to share, adapt, and create derivative works from this dataser as long as you attribute me, keep the database open (if you redistribute it), and continue to share-alike any adapted database under the ODbl.

Acknowledgments
Thanks go out to
- stackoverflow https://insights.stackoverflow.com/survey for providing the data from the yearly stackoverflow developer survey.
- the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master for providing google search data.
- Peter Elmers, for crawling metadata on github repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.
Github Repo Languages DataSet
kaggle.com
Updated Mar 8, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saujan Tiwari (2023). Github Repo Languages DataSet [Dataset]. https://www.kaggle.com/datasets/saujantiwari/repo-language-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 8, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Saujan Tiwari
Description
Dataset

This dataset was created by Saujan Tiwari

Contents
Google Landmarks Dataset v2
github.com
opendatalab.com
Updated Sep 27, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
Explore at:
Dataset updated
Sep 27, 2019
Dataset provided by
Googlehttp://google.com/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated to two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
F
Data from: A Neural Approach for Text Extraction from Scholarly Figures
data.uni-hannover.de
zip
Updated Jan 20, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TIB (2022). A Neural Approach for Text Extraction from Scholarly Figures [Dataset]. https://data.uni-hannover.de/dataset/a-neural-approach-for-text-extraction-from-scholarly-figures
Explore at:
zip(798357692)Available download formats
Dataset updated
Jan 20, 2022
Dataset authored and provided by
TIB
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
A Neural Approach for Text Extraction from Scholarly Figures

This is the readme for the supplemental data for our ICDAR 2019 paper.

You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202

If you found this dataset useful, please consider citing our paper:

@inproceedings{DBLP:conf/icdar/MorrisTE19, author = {David Morris and Peichen Tang and Ralph Ewerth}, title = {A Neural Approach for Text Extraction from Scholarly Figures}, booktitle = {2019 International Conference on Document Analysis and Recognition, {ICDAR} 2019, Sydney, Australia, September 20-25, 2019}, pages = {1438--1443}, publisher = {{IEEE}}, year = {2019}, url = {https://doi.org/10.1109/ICDAR.2019.00231}, doi = {10.1109/ICDAR.2019.00231}, timestamp = {Tue, 04 Feb 2020 13:28:39 +0100}, biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).

Datasets

We used different sources of data for testing, validation, and training. Our testing set was assembled by the work we cited by Böschen et al. We excluded the DeGruyter dataset, and use it as our validation dataset.

Testing

These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2

Validation

The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.

Training

We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.

Code

We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.

Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.

We used a tesseract script to run text extraction from detected text rows. This is inside our code code.tar as text_recognition_multipro.py.

We used a java script provided by Falk Böschen and adapted to our file structure. We included this as evaluator.jar.

Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
github-trending
kaggle.com
Updated Jul 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahesh Dhingra (2025). github-trending [Dataset]. https://www.kaggle.com/datasets/maheshdhingra/github-trending/data
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Mahesh Dhingra
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset

This dataset was created by Mahesh Dhingra

Released under MIT

Contents
A
‘us-statewise-covid data’ analyzed by Analyst-2
analyst-2.ai
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘us-statewise-covid data’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-us-statewise-covid-data-f6e4/95e686b2/?iid=004-991&v=presentation
Explore at:
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘us-statewise-covid data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/dhamur/usstatewisecovid-data on 28 January 2022.

--- Dataset description provided by original source is as follows ---

About the data

This is a covid19 data set from United States. It includes date, Number of cases, Number of deaths. The other countries data are also available in my Kaggle and Github profile. The links are provided below - Github - Kaggle If you want to read more about the data Click here

--- Original source retains full ownership of the source dataset ---
C
Community-Driven Model Service Platform Report
marketreportanalytics.com
doc, pdf, ppt
Updated Apr 9, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Report Analytics (2025). Community-Driven Model Service Platform Report [Dataset]. https://www.marketreportanalytics.com/reports/community-driven-model-service-platform-73124
Explore at:
pdf, doc, pptAvailable download formats
Dataset updated
Apr 9, 2025
Dataset authored and provided by
Market Report Analytics
License
https://www.marketreportanalytics.com/privacy-policyhttps://www.marketreportanalytics.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Community-Driven Model Service Platform market is experiencing robust growth, projected to reach $35.14 billion in 2025 and maintain a Compound Annual Growth Rate (CAGR) of 10.1% from 2025 to 2033. This expansion is fueled by several key drivers. The increasing availability of open-source machine learning models and datasets, facilitated by platforms like Kaggle, GitHub, and Hugging Face, significantly lowers the barrier to entry for developers and researchers. Furthermore, the growing demand for specialized, niche models that cater to specific industry needs is propelling market growth. The rise of cloud-based solutions offers scalability and cost-effectiveness, attracting a wider user base. The market segmentation reveals strong growth in both adult and children's applications, reflecting the broadening use cases for these platforms across various age demographics. Cloud-based platforms dominate the market share due to their flexibility and accessibility, though on-premise solutions retain a significant presence in enterprises prioritizing data security and control. Geographic distribution showcases a strong presence in North America and Europe, driven by established tech ecosystems and high adoption rates of AI technologies. However, rapid growth is anticipated in the Asia-Pacific region, particularly in countries like China and India, due to increasing digitalization and investment in AI infrastructure. The market's restraints are primarily focused on the need for robust data governance and ethical considerations surrounding AI model development and deployment. Concerns about data bias, model explainability, and potential misuse of AI models need to be addressed to ensure responsible growth. Furthermore, the skill gap in AI development and deployment presents a challenge, requiring substantial investment in education and training initiatives to support the expanding market. Competition among platform providers is intense, necessitating continuous innovation and the development of unique value propositions to maintain market share. Despite these challenges, the long-term outlook for the Community-Driven Model Service Platform market remains exceptionally positive, driven by sustained technological advancements and the increasing reliance on AI across diverse industries. The platform's potential to foster collaboration and democratize access to powerful machine learning tools is poised to drive considerable market expansion in the coming years.
R
Cat Dog Spider Pumpkin Hooman Dataset
universe.roboflow.com
zip
Updated Jan 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Peter Guhl (2023). Cat Dog Spider Pumpkin Hooman Dataset [Dataset]. https://universe.roboflow.com/peter-guhl-de1vy/cat-dog-spider-pumpkin-hooman
Explore at:
zipAvailable download formats
Dataset updated
Jan 13, 2023
Dataset authored and provided by
Peter Guhl
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Pumpkins Bounding Boxes
Description
Started out as a pumpkin detector to test training YOLOv5. Now suffering from extensive feature creep and probably ending up as a cat/dog/spider/pumpkin/randomobjects-detector. Or as a desaster.

The dataset does not fit https://docs.ultralytics.com/tutorials/training-tips-best-results/ well. There are no background images and the labeling is often only partial. Especially in the humans and pumpkin category where there are often lots of objects in one photo people apparently (and understandably) got bored and did not labe everything. And of course the images from the cat-category don't have the humans in it labeled since they come from a cat-identification model which ignored humans. It will need a lot of time to fixt that.

Dataset used: - Cat and Dog Data: Cat / Dog Tutorial NVIDIA Jetson https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md © 2016-2019 NVIDIA according to bottom of linked page - Spider Data: Kaggle Animal 10 image set https://www.kaggle.com/datasets/alessiocorrado99/animals10 Animal pictures of 10 different categories taken from google images Kaggle project licensed GPL 2 - Pumpkin Data: Kaggle "Vegetable Images" https://www.researchgate.net/publication/352846889_DCNN-Based_Vegetable_Image_Classification_Using_Transfer_Learning_A_Comparative_Study https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset Kaggle project licensed CC BY-SA 4.0 - Some pumpkin images manually copied from google image search - https://universe.roboflow.com/chess-project/chess-sample-rzbmc Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/steve-pamer-cvmbg/pumpkins-gfjw5 Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/nbduy/pumpkin-ryavl Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/homeworktest-wbx8v/cat_test-1x0bl/dataset/2 - https://universe.roboflow.com/220616nishikura/catdetector - https://universe.roboflow.com/atoany/cats-s4d4i/dataset/2 - https://universe.roboflow.com/personal-vruc2/agricultured-ioth22 - https://universe.roboflow.com/sreyoshiworkspace-radu9/pet_detection - https://universe.roboflow.com/artyom-hystt/my-dogs-lcpqe - license: Public Domain url: https://universe.roboflow.com/dolazy7-gmail-com-3vj05/sweetpumpkin/dataset/2 - https://universe.roboflow.com/tristram-dacayan/social-distancing-g4pbu - https://universe.roboflow.com/fyp-3edkl/social-distancing-2ygx5 License MIT - Spiders: https://universe.roboflow.com/lucas-lins-souza/animals-train-yruka

Currently I can't guarantee it's all correctly licenced. Checks are in progress. Inform me if you see one of your pictures and want it to be removed!
Weather prediction dataset
zenodo.org
kaggle.com
csv, png, txt
Updated Jul 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Florian Huber; Florian Huber (2024). Weather prediction dataset [Dataset]. http://doi.org/10.5281/zenodo.4770937
Explore at:
csv, png, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4770937
Dataset updated
Jul 19, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Florian Huber; Florian Huber
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset created for machine learning and deep learning training and teaching purposes.
Can for instance be used for classification, regression, and forecasting tasks.
Complex enough to demonstrate realistic issues such as overfitting and unbalanced data, while still remaining intuitively accessible.

ORIGINAL DATA TAKEN FROM:

EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on 22-04-2021
THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:

Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
air temperature and precipitation series for the European Climate Assessment.
Int. J. of Climatol., 22, 1441-1453.
Data and metadata available at http://www.ecad.eu

For more information see metadata.txt file.

The Python code used to create the weather prediction dataset from the ECA&D data can be found on GitHub: https://github.com/florian-huber/weather_prediction_dataset
(this repository also contains Jupyter notebooks with teaching examples)
j
Sample dataset
demo.jkan.io
gsa.github.io
+7more
api, csv, shp
Updated Mar 31, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2016). Sample dataset [Dataset]. https://demo.jkan.io/datasets/sample-dataset/
Explore at:
api, shp, csvAvailable download formats
Dataset updated
Mar 31, 2016
Description
This is an example dataset that comes with a new installation of JKAN
Testing github actions for uploading datasets
kaggle.com
Updated Feb 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayush Thakur (2022). Testing github actions for uploading datasets [Dataset]. https://www.kaggle.com/datasets/ayuraj/test-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 1, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Ayush Thakur
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Ayush Thakur

Released under CC0: Public Domain

Contents
h
hinglish_top
huggingface.co
kaggle.com
Updated Jun 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Will Held (2023). hinglish_top [Dataset]. https://huggingface.co/datasets/WillHeld/hinglish_top
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 29, 2023
Authors
Will Held
Description
Dataset Card for "hinglish_top"

License: https://github.com/google-research-datasets/Hinglish-TOP-Dataset/blob/main/LICENSE.md Original Repo: https://github.com/google-research-datasets/Hinglish-TOP-Dataset Paper Link For Citation: https://arxiv.org/pdf/2211.07514.pdf More Information needed
R
Electric Pylon Detection In Rsi Dataset
universe.roboflow.com
zip
Updated Dec 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Robin Public (2022). Electric Pylon Detection In Rsi Dataset [Dataset]. https://universe.roboflow.com/robin-public/electric-pylon-detection-in-rsi/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
Dec 24, 2022
Dataset authored and provided by
Robin Public
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Electric Pylons Bounding Boxes
Description
From the Authors:

EPD dataset contains 1500 images in total: 720 images were captured by Pleiades satellite along Huimao Line in Guangdong Province, China, a main line of power network in south China, while the remaining images were collected from Google Earth to further improve the representativeness of the dataset by expanding the source of samples. The spatial resolution of images in EPD dataset is 1 m/pixel.

Moreover, to test and evaluate the adaptability of the detectors in face of actual situations, we specially selected 50 relatively complex images from EPD dataset comprising a complex test subset called EPD-C ...

1450 images in EPD dataset excluding EPD-C as a standard subset named EPD-S, which involves more than 3000 electric pylons. EPD-S subset was used to train detectors and perform random experiments.

Dataset Source:

This dataset is obtained from the listing in Robin Cole's satellite-image-deep-learning GitHub repository * https://www.satellite-image-deep-learning.com/ * https://twitter.com/robmarkcole

Electric-Pylon-Detection-in-RSI -> a dataset which contains 1500 remote sensing images of electric pylons used to train ten deep learning models * GitHub Repository * Kaggle EPD Dataset: https://www.kaggle.com/datasets/qiaosijia/epd-dataset * Link to the research paper: Deep Learning Based Electric Pylon Detection in Remote Sensing Images
g
Ten Thousand German News Articles Dataset
tblock.github.io
kaggle.com
csv
Updated Mar 5, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
T. Block (2019). Ten Thousand German News Articles Dataset [Dataset]. https://tblock.github.io/10kGNAD/
Explore at:
csvAvailable download formats
Dataset updated
Mar 5, 2019
Authors
T. Block
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
10kGNAD - A german topic classification dataset. Visit the dataset page for more information: https://tblock.github.io/10kGNAD/

Facebook

Twitter

Click to copy link

Link copied

Cite

Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos

GitHub Repos

Code and comments from 2.8 million repos

Explore at:

zip(0 bytes)Available download formats

Dataset updated

Mar 20, 2019

Dataset provided by

GitHubhttps://github.com/

Authors

Github

Description

GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

Querying BigQuery tables

You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.

Acknowledgements

This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

Inspiration

This is the perfect dataset for fighting language wars.
Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?

Clear search

Close search

Google apps

Main menu

GitHub Repos

Querying BigQuery tables

Acknowledgements

Inspiration

issues-kaggle-notebooks

Github Indian users deep data

Fruit and Vegetable Disease (Healthy vs Rotten)

Custom Yolov7 On Kaggle On Custom Dataset

Custom Training with YOLOv7 🔥

Some Important links

Contact Information

Objective

To Showcase custom Object Detection on the Given Dataset to train and Infer the Model using newly launched YoloV7.

Data Acquisition

Custom Training with YOLOv7 🔥

📦 YOLOv7

Step 1: Install Requirements

Downloading YOLOV7 starting checkpoint

Step 2: Assemble Our Dataset

Version v2 Aug 12, 2022 Looks like this.

Step 3: Training Custom pretrained YOLOv7 model

‘Machine Learning users on Github’ analyzed by Analyst-2

Columns

Programming Language Ecosystem Project TU Wien

About Dataset

About Data collection methodology

github_metadata

PYPL_survey_2004-2023

stack_overflow_developer_survey

Description of the data

License

Acknowledgments

Github Repo Languages DataSet

Dataset

Contents

Google Landmarks Dataset v2

Data from: A Neural Approach for Text Extraction from Scholarly Figures

A Neural Approach for Text Extraction from Scholarly Figures

Datasets

Testing

Validation

Training

Code

github-trending

Dataset

Contents

‘us-statewise-covid data’ analyzed by Analyst-2

About the data

Community-Driven Model Service Platform Report

Cat Dog Spider Pumpkin Hooman Dataset

Weather prediction dataset

Sample dataset

Testing github actions for uploading datasets

Dataset

Contents

hinglish_top

Electric Pylon Detection In Rsi Dataset

From the Authors:

Dataset Source:

Ten Thousand German News Articles Dataset

GitHub ReposSee More Versions

Code and comments from 2.8 million repos

Querying BigQuery tables

Acknowledgements

Inspiration

GitHub Repos