100+ datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that the methods available in Kernels are limited to querying data. Tables live at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analyzing large BigQuery datasets.
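
    For example, a minimal query sketch: the licenses table and its license column follow this public dataset's documented schema, but treat the exact fields as an assumption, and the billing cap is just a safety net.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Cap the bytes scanned so a typo cannot accidentally bill a full 3 TB scan.
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)

    query = """
        SELECT license, COUNT(*) AS repo_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        ORDER BY repo_count DESC
    """
    df = client.query(query, job_config=job_config).to_dataframe()
    print(df.head())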

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. Custom Yolov7 On Kaggle On Custom Dataset

    • universe.roboflow.com
    zip
    Updated Jan 29, 2023
    Cite
    Owais Ahmad (2023). Custom Yolov7 On Kaggle On Custom Dataset [Dataset]. https://universe.roboflow.com/owais-ahmad/custom-yolov7-on-kaggle-on-custom-dataset-rakiq/dataset/2
    Explore at:
    zip
    Dataset updated
    Jan 29, 2023
    Dataset authored and provided by
    Owais Ahmad
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Person Car Bounding Boxes
    Description

    Custom Training with YOLOv7 🔥

    Some Important links

    Contact Information

    Objective

    To showcase custom object detection on the given dataset by training and running inference with the newly launched YOLOv7.

    Data Acquisition

    The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.

    from IPython.display import Markdown, display

    # Render the dataset's Roboflow README (filename= reads the file contents).
    display(Markdown(filename="../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
    

    Custom Training with YOLOv7 🔥

    In this notebook, I have preprocessed the images with Roboflow, because the COCO-formatted dataset had images of varying dimensions and was not split into the required subsets. To train a custom YOLOv7 model, we need to recognize the objects in the dataset. To do so, I have taken the following steps:

    • Export the dataset to YOLOv7
    • Train YOLOv7 to recognize the objects in our dataset
    • Evaluate our YOLOv7 model's performance
    • Run test inference to view performance of YOLOv7 model at work

    📦 YOLOv7

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/car-person-2.PNG]

    Image Credit - jinfagang

    Step 1: Install Requirements

    !git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
    %cd yolov7
    !pip install -qr requirements.txt
    !pip install -q roboflow
    

    Downloading the YOLOv7 starting checkpoint

    !wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
    
    import os
    import glob
    import wandb
    import torch
    from roboflow import Roboflow
    from kaggle_secrets import UserSecretsClient
    from IPython.display import Image, clear_output, display # to display images
    
    
    
    print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
    

    [Image: https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67]

    I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!

    YOLOv7-Car-Person-Custom

    try:
      user_secrets = UserSecretsClient()
      wandb_api_key = user_secrets.get_secret("wandb_api")
      wandb.login(key=wandb_api_key)
      anonymous = None
    except Exception:
      # Fall back to an anonymous login if no Kaggle secret is configured.
      anonymous = 'must'
      wandb.login(anonymous=anonymous)
      print('To use your W&B account, go to Add-ons -> Secrets and provide your '
            'W&B access token under the label WANDB. '
            'Get your W&B access token from here: https://wandb.ai/authorize')

    wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
    

    Step 2: Assemble Our Dataset

    [Image: https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png]

    In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.

    In Roboflow, we can choose between two paths:

    Version v2 (Aug 12, 2022) looks like this:

    [Image: https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG]

    user_secrets = UserSecretsClient()
    roboflow_api_key = user_secrets.get_secret("roboflow_api")
    
    rf = Roboflow(api_key=roboflow_api_key)
    project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
    dataset = project.version(2).download("yolov7")
    

    Step 3: Training a custom YOLOv7 model from pretrained weights

    Here, I am able to pass a number of arguments:

    • img: define input image size
    • batch: determine …
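
    The snippet is truncated here. For orientation, a hedged sketch of what such a training call typically looks like with this repository; the flag values below are illustrative assumptions rather than the notebook's actual settings, and --img/--batch abbreviate YOLOv7's --img-size/--batch-size via argparse prefix matching:

    # Illustrative values only; dataset.location comes from the Roboflow download above.
    !python train.py --img 640 --batch 16 --epochs 55 --data {dataset.location}/data.yaml --weights 'yolov7.pt'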

  3. resemblyzer Github repository

    • kaggle.com
    zip
    Updated Aug 18, 2021
    Cite
    Giovanni Cavallin (2021). resemblyzer Github repository [Dataset]. https://www.kaggle.com/mawanda/resemblyzer-github-repository
    Explore at:
    zip (106320043 bytes)
    Dataset updated
    Aug 18, 2021
    Authors
    Giovanni Cavallin
    Description

    Resemblyzer repository

    I added this repository to the Kaggle datasets since I found it very useful for audio feature extraction. All credit goes to the creator of this amazing package. For reference, please visit the repository on GitHub.
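
    For orientation, a minimal sketch of the package's typical usage (the wav path below is a placeholder):

    from pathlib import Path

    from resemblyzer import VoiceEncoder, preprocess_wav

    # Load and normalize an utterance, then embed it as a 256-dim speaker
    # embedding (d-vector) that can be compared with cosine similarity.
    wav = preprocess_wav(Path("speaker_utterance.wav"))  # placeholder path
    encoder = VoiceEncoder()
    embedding = encoder.embed_utterance(wav)
    print(embedding.shape)  # (256,)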

  4. Programming Language Ecosystem Project TU Wien

    • test.researchdata.tuwien.ac.at
    csv, text/markdown
    Updated Jun 25, 2024
    Cite
    Valentin Futterer (2024). Programming Language Ecosystem Project TU Wien [Dataset]. http://doi.org/10.70124/gnbse-ts649
    Explore at:
    text/markdown, csv
    Dataset updated
    Jun 25, 2024
    Dataset provided by
    TU Wien
    Authors
    Valentin Futterer
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Time period covered
    Dec 12, 2023
    Area covered
    Vienna
    Description

    About Dataset

    This dataset was created during the Programming Language Ecosystem project at TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.

    The centerpiece of this repository is usage_of_programming_languages_2011-2023.csv. This CSV file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate of the popularity of programming languages, this dataset was created using three vastly different sources.

    About Data collection methodology

    The dataset was created using the GitHub repository above. As input data, three public datasets were used.

    github_metadata

    Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). It contains metadata (no code) for all GitHub repositories with more than 5 stars.

    PYPL_survey_2004-2023

    Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 (https://creativecommons.org/licenses/by/3.0/). It shows, for each month from 2004 to 2023, each language's share of programming-related Google searches.

    stack_overflow_developer_survey

    Taken from https://insights.stackoverflow.com/survey. It is licensed under the Open Data Commons Open Database License (ODbL) v1.0 (https://opendatacommons.org/licenses/odbl/1-0/). It contains the results of the yearly Stack Overflow developer survey from 2011 to 2023.

    All three datasets were downloaded on 2023-12-12. They are all included in the GitHub repository above.

    Description of the data

    The dataset contains a column for the year and one column per language, denoting its usage in percent. Additionally, vertical bar charts and pie charts for each year, plus a line graph for each language over the whole timespan, are provided as PNGs.
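
    For example, a minimal sketch of reading the CSV; the "year" column name is an assumption based on the description above:

    import matplotlib.pyplot as plt
    import pandas as pd

    df = pd.read_csv("usage_of_programming_languages_2011-2023.csv")

    # Assumed layout: one "year" column plus one usage-percentage column per language.
    df = df.set_index("year")
    df["Python"].plot(marker="o")
    plt.ylabel("Usage (%)")
    plt.title("Python usage, 2011-2023")
    plt.show()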

    The languages that are going to be considered for the project can be seen here:

    - Python
    - C
    - C++
    - Java
    - C#
    - JavaScript
    - PHP
    - SQL
    - Assembly
    - Scratch
    - Fortran
    - Go
    - Kotlin
    - Delphi
    - Swift
    - Rust
    - Ruby
    - R
    - COBOL
    - F#
    - Perl
    - TypeScript
    - Haskell
    - Scala

    License

    This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/ license.

    TL;DR: You are free to share, adapt, and create derivative works from this dataset as long as you attribute me, keep the database open (if you redistribute it), and share-alike any adapted database under the ODbL.

    Acknowledgments

    Thanks go out to

    - Stack Overflow (https://insights.stackoverflow.com/survey) for providing the data from the yearly Stack Overflow developer survey.

    - the PYPL survey (https://github.com/pypl/pypl.github.io/tree/master) for providing Google search data.

    - Peter Elmers, for crawling metadata on GitHub repositories and providing the data (https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/).

  5. Space X to Y Data Analysis & Landing Prediction

    • kaggle.com
    Updated Jan 29, 2023
    Cite
    Britta Smith (2023). Space X to Y Data Analysis & Landing Prediction [Dataset]. https://www.kaggle.com/datasets/brittasmith/spacextoy-dataanalysis-launchprediction
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Britta Smith
    Description

    GitHub Project Link: Space X to Y

    Peer Audience Presentation Slides: See PDF uploaded below

    Tableau Dashboard Link: Space X to Y

    [Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10293677%2Fa6d81c06dc03412bfd063941bd1dfa18%2Fspacex-falcon9-reaching-orbit-wide.jpg?generation=1672337964521833&alt=media]

  6. Google Landmarks Dataset v2

    • github.com
    • opendatalab.com
    Updated Sep 27, 2019
    Cite
    Google (2019). Google Landmarks Dataset v2 [Dataset]. https://github.com/cvdfoundation/google-landmark
    Explore at:
    Dataset updated
    Sep 27, 2019
    Dataset provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper.

    In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.

  7. Stable Diffusion generated images - AIS-4SD dataset

    • zenodo.org
    zip
    Updated Apr 9, 2025
    + more versions
    Cite
    Zenodo (2025). Stable Diffusion generated images - AIS-4SD dataset [Dataset]. http://doi.org/10.5281/zenodo.15131117
    Explore at:
    zip
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Time period covered
    Feb 3, 2025
    Description

    AIS-4SD

    AIS-4SD (AI Summit - 4 Stable Diffusion models) is a collection of 4,000 images generated using a set of Stability AI text-to-image diffusion models.

    Context

    This dataset was developed as part of a collaborative project between PEReN and VIGINUM for the AI Summit held in Paris in February 2025. This open-source project aims at assessing the performance of generated-image detectors and their robustness to different models and transformations. The code is free and open source, and contributions to connect additional detectors are welcome.

    Official repository: https://code.peren.gouv.fr/open-source/ai-action-summit/generated-image-detection.

    Dataset summary

    This dataset can be used to assess detection model performance, and in particular robustness to successive updates of the generation model.

    Dataset description

    1,000 generated images for each of four versions of Stability AI's text-to-image diffusion models.

    For each model, we generated:

    Model                                           | Number of images
    ------------------------------------------------|-----------------
    stabilityai/stable-diffusion-xl-base-1.0        | 500 👨 + 500 🖼️
    stabilityai/stable-diffusion-2-1                | 500 👨 + 500 🖼️
    stabilityai/stable-diffusion-3-medium-diffusers | 500 👨 + 500 🖼️
    stabilityai/stable-diffusion-3.5-large          | 500 👨 + 500 🖼️

    Reproducibility

    The scripts used to generate these images can be found in our open-source repository (see this specific file). After setting up the project, you can run:

    $ poetry run python scripts/generate_images.py

    With minor updates to these scripts, you can extend this dataset to your specific needs.

    Dataset structure

    One zip file with the following structure, each directory containing the associated 500 images:

    AIS-4SD/
    ├── generation_metadata.csv
    ├── StableDiffusion-2.1-faces-20250203-1448
    ├── StableDiffusion-2.1-other-20250203-1548
    ├── StableDiffusion-3.5-faces-20250203-1012
    ├── StableDiffusion-3.5-other-20250203-1603
    ├── StableDiffusion-3-faces-20250203-1545
    ├── StableDiffusion-3-other-20250203-1433
    ├── StableDiffusion-XL-faces-20250203-0924
    └── StableDiffusion-XL-other-20250203-1727

    The metadata for generated images (see generation_metadata.csv) are:

    • model: model used for generation,
    • prompt: prompt used for generation (i.e. a Conceptual Captions caption or an sfhqt2i prompt, with some minor prompt engineering),
    • guidance_scale: guidance scale of diffusion process,
    • num_inference_steps: number of inference steps of diffusion process,
    • generated_img_relative_path: relative path to image in zip structure.
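
    For example, a minimal sketch for exploring the archive once extracted, using the metadata columns listed above:

    import pandas as pd
    from PIL import Image

    meta = pd.read_csv("AIS-4SD/generation_metadata.csv")

    # Images per generator version (500 faces + 500 other for each model).
    print(meta["model"].value_counts())

    # Open the first image referenced by the metadata.
    row = meta.iloc[0]
    Image.open("AIS-4SD/" + row["generated_img_relative_path"]).show()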

    Project status

    The project is under ongoing development. A preliminary blog post can be found here: https://www.peren.gouv.fr/en/perenlab/2025-02-11_ai_summit/.

  8. financial_data

    • huggingface.co
    Updated Mar 6, 2024
    + more versions
    Cite
    Jeong (2024). financial_data [Dataset]. https://huggingface.co/datasets/csujeong/financial_data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 6, 2024
    Authors
    Jeong
    Description

    This dataset is a combination of Stanford's Alpaca (https://github.com/tatsu-lab/stanford_alpaca) and FiQA (https://sites.google.com/view/fiqa/), with another 1.3k pairs custom-generated using GPT-3.5. Script for tuning through Kaggle's (https://www.kaggle.com) free resources using PEFT/LoRA: https://www.kaggle.com/code/gbhacker23/wealth-alpaca-lora. GitHub repo with performance analyses, training and data-generation scripts, and inference notebooks: https://github.com/gaurangbharti1/wealth-alpaca… See the full description on the dataset page: https://huggingface.co/datasets/csujeong/financial_data.
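
    For orientation, a minimal loading sketch with the Hugging Face datasets library; the "train" split name and the field layout are assumptions:

    from datasets import load_dataset

    ds = load_dataset("csujeong/financial_data", split="train")  # split name assumed
    print(ds)     # features and row count
    print(ds[0])  # first instruction/response pair (field names may differ)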

  9. Colon-Cancer-datasets

    • kaggle.com
    Updated Jun 20, 2025
    Cite
    Apn_Gupta (2025). Colon-Cancer-datasets [Dataset]. https://www.kaggle.com/datasets/apngupta/colon-cancer-datasets/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 20, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Apn_Gupta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🧬 Colon Cancer Histopathology Dataset

    This dataset contains histopathological image data for the identification of colon cancer using deep learning. It includes high-resolution images labeled as cancerous or non-cancerous, intended for training and validating computer vision models in medical imaging.

    📁 Dataset Structure

    The dataset is organised into two main image folders and two supporting CSV files:

    ├── train/        # 7,560 labeled images for training
    ├── test/         # 5,041 unlabeled images for inference/testing
    ├── train.csv     # Image filenames and corresponding labels (for train/ folder)
    ├── example.csv   # Sample format for custom data input
    

    📊 Description

    Folder/File  | Description
    -------------|------------
    train/       | Contains labeled histopathology images
    test/        | Contains images without labels for model inference
    train.csv    | CSV file with two columns: image_id, label
    example.csv  | A demonstration CSV with the expected structure
    • Label Encoding:

      • Id → The Id of the Image
      • Type → Cancer / Connective / Immune / Normal

    💡 Usage Example

    Load the training labels:

    import pandas as pd
    df = pd.read_csv("train.csv")
    print(df.head())
    

    Read an image:

    from PIL import Image
    img = Image.open("train/image_00123.jpg")
    img.show()
    

    📦 Intended Use

    • 🔍 Research in medical imaging and digital pathology
    • 🧠 Training deep learning models (CNNs, transfer learning)
    • 🧪 Educational purposes for learning supervised image classification

    ⚠️ Licensing & Ethics

    • Please ensure ethical use, especially in any clinical or diagnostic context.
    • Dataset is for educational and research purposes only.
    • Source data must be anonymised and not traceable to patients.

    🙋‍♂️ Contact & Attribution

    Uploaded by: Arpan Gupta
    Full project using this dataset: GitHub Repo
    Notebook Using Dataset: Kaggle

  10. Github Indian users deep data

    • kaggle.com
    Updated Oct 22, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Archit Tyagi (2024). Github Indian users deep data [Dataset]. https://www.kaggle.com/datasets/architty108/github-indian-users-deep-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 22, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Archit Tyagi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides a rich snapshot of GitHub users from India, capturing various aspects of their public profiles. It's a valuable resource for analyzing trends in coding activity, repository management, and user engagement within the Indian developer community. Whether you're interested in exploring how developers grow their followers, examining language preferences, or identifying patterns in contributions and achievements, this dataset offers multiple points of analysis.

    Key Features:
    • Username: GitHub usernames of the individuals.
    • Gender Pronoun: Preferred gender pronouns (if available).
    • Followings: Number of people each user follows.
    • Joining Year: The year they joined GitHub.
    • Contributions: Number of contributions made in the last year.
    • Achievements: Number of GitHub achievements unlocked by the user.
    • Stars: Total number of stars on their repositories.
    • Repositories: Number of repositories created.
    • Followers: Number of followers each user has.
    • Location: User location details, primarily from India.
    • Languages: Primary programming language used by the individual.
    • Social Links: Links to their other social platforms (LinkedIn, personal websites, etc.).
    • Sorting Type: Categorized based on followers, repositories, or recent joining.

    This dataset can be used for:
    • Profiling the Indian developer community.
    • Tracking open-source contributions and achievements.
    • Analyzing programming language preferences and repository management.
    • Exploring the relationship between social followings and coding contributions.

    Perfect for data science, social network analysis, and open-source research.
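
    For example, a hedged analysis sketch; the CSV filename and the exact column names are assumptions based on the feature list above:

    import pandas as pd

    df = pd.read_csv("github_indian_users.csv")  # filename assumed

    # Most common primary languages among the profiled users.
    print(df["Languages"].value_counts().head(10))  # column name assumed

    # Relationship between social following and coding activity.
    print(df[["Followers", "Contributions"]].corr())  # column names assumed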

  11. Cat Dog Spider Pumpkin Hooman Dataset

    • universe.roboflow.com
    zip
    Updated Jan 13, 2023
    Cite
    Peter Guhl (2023). Cat Dog Spider Pumpkin Hooman Dataset [Dataset]. https://universe.roboflow.com/peter-guhl-de1vy/cat-dog-spider-pumpkin-hooman
    Explore at:
    zip
    Dataset updated
    Jan 13, 2023
    Dataset authored and provided by
    Peter Guhl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Pumpkins Bounding Boxes
    Description

    Started out as a pumpkin detector to test training YOLOv5. Now suffering from extensive feature creep and probably ending up as a cat/dog/spider/pumpkin/random-objects detector. Or as a disaster.

    The dataset does not fit https://docs.ultralytics.com/tutorials/training-tips-best-results/ well. There are no background images, and the labeling is often only partial. Especially in the human and pumpkin categories, where there are often many objects in one photo, people apparently (and understandably) got bored and did not label everything. And of course the images from the cat category don't have the humans in them labeled, since they come from a cat-identification model which ignored humans. It will need a lot of time to fix that.

    Datasets used:
    • Cat and Dog Data: Cat / Dog Tutorial NVIDIA Jetson, https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md, © 2016-2019 NVIDIA according to the bottom of the linked page
    • Spider Data: Kaggle Animal 10 image set, https://www.kaggle.com/datasets/alessiocorrado99/animals10 (animal pictures of 10 different categories taken from Google Images; Kaggle project licensed GPL 2)
    • Pumpkin Data: Kaggle "Vegetable Images", https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset (Kaggle project licensed CC BY-SA 4.0; see https://www.researchgate.net/publication/352846889_DCNN-Based_Vegetable_Image_Classification_Using_Transfer_Learning_A_Comparative_Study)
    • Some pumpkin images manually copied from Google image search
    • https://universe.roboflow.com/chess-project/chess-sample-rzbmc (provided by a Roboflow user; license: CC BY 4.0)
    • https://universe.roboflow.com/steve-pamer-cvmbg/pumpkins-gfjw5 (provided by a Roboflow user; license: CC BY 4.0)
    • https://universe.roboflow.com/nbduy/pumpkin-ryavl (provided by a Roboflow user; license: CC BY 4.0)
    • https://universe.roboflow.com/homeworktest-wbx8v/cat_test-1x0bl/dataset/2
    • https://universe.roboflow.com/220616nishikura/catdetector
    • https://universe.roboflow.com/atoany/cats-s4d4i/dataset/2
    • https://universe.roboflow.com/personal-vruc2/agricultured-ioth22
    • https://universe.roboflow.com/sreyoshiworkspace-radu9/pet_detection
    • https://universe.roboflow.com/artyom-hystt/my-dogs-lcpqe
    • https://universe.roboflow.com/dolazy7-gmail-com-3vj05/sweetpumpkin/dataset/2 (license: Public Domain)
    • https://universe.roboflow.com/tristram-dacayan/social-distancing-g4pbu
    • https://universe.roboflow.com/fyp-3edkl/social-distancing-2ygx5 (license: MIT)
    • Spiders: https://universe.roboflow.com/lucas-lins-souza/animals-train-yruka

    Currently I can't guarantee it's all correctly licensed. Checks are in progress. Inform me if you see one of your pictures and want it to be removed!

  12. Github_repo_embedded

    • kaggle.com
    Updated Mar 7, 2025
    Cite
    Allaneee (2025). Github_repo_embedded [Dataset]. https://www.kaggle.com/datasets/allaneee/github-repo-embedded/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Allaneee
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Allaneee

    Released under MIT

    Contents

  13. Electric Pylon Detection In Rsi Dataset

    • universe.roboflow.com
    zip
    Updated Dec 24, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Robin Public (2022). Electric Pylon Detection In Rsi Dataset [Dataset]. https://universe.roboflow.com/robin-public/electric-pylon-detection-in-rsi/dataset/1
    Explore at:
    zip
    Dataset updated
    Dec 24, 2022
    Dataset authored and provided by
    Robin Public
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Electric Pylons Bounding Boxes
    Description

    From the Authors:

    The EPD dataset contains 1500 images in total: 720 images were captured by the Pleiades satellite along the Huimao Line in Guangdong Province, China, a main line of the power network in south China, while the remaining images were collected from Google Earth to further improve the representativeness of the dataset by expanding the source of samples. The spatial resolution of images in the EPD dataset is 1 m/pixel.

    Moreover, to test and evaluate the adaptability of the detectors in the face of actual situations, we specially selected 50 relatively complex images from the EPD dataset, comprising a complex test subset called EPD-C ...

    The 1450 images in the EPD dataset excluding EPD-C form a standard subset named EPD-S, which involves more than 3000 electric pylons. The EPD-S subset was used to train detectors and perform random experiments.

    Dataset Source:

    This dataset is obtained from the listing in Robin Cole's satellite-image-deep-learning GitHub repository:
    • https://www.satellite-image-deep-learning.com/
    • https://twitter.com/robmarkcole

    Electric-Pylon-Detection-in-RSI -> a dataset which contains 1500 remote sensing images of electric pylons used to train ten deep learning models:
    • GitHub Repository
    • Kaggle EPD Dataset: https://www.kaggle.com/datasets/qiaosijia/epd-dataset
    • Link to the research paper: Deep Learning Based Electric Pylon Detection in Remote Sensing Images

  14. Fast-Math-R1-SFT

    • huggingface.co
    Updated Jan 15, 2015
    Cite
    Hiroshi Yoshihara (2015). Fast-Math-R1-SFT [Dataset]. https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT
    Explore at:
    Dataset updated
    Jan 15, 2015
    Authors
    Hiroshi Yoshihara
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This repository contains the first-stage SFT dataset presented in the paper A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning. This dataset is used for the intensive Supervised Fine-Tuning (SFT) phase, which is crucial for pushing the model's mathematical accuracy. Project GitHub repository: https://github.com/RabotniKuma/Kaggle-AIMO-Progress-Prize-2-9th-Place-Solution

    Dataset Construction

    This dataset was… See the full description on the dataset page: https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT.

  15. Trackloaded Artist Dataset

    • figshare.com
    bin
    Updated Aug 13, 2025
    Cite
    Taofeek Aperoja (2025). Trackloaded Artist Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.29900933.v1
    Explore at:
    bin
    Dataset updated
    Aug 13, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Taofeek Aperoja
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Trackloaded Artist Dataset is a curated open data collection of Nigerian music artists and related metadata from Trackloaded.com. It includes artist names, biographies, birth dates, profile pages, social media handles, and external identity links such as Wikidata and Instagram. The dataset is published as Linked Open Data (LOD) and is available in RDF/Turtle, RDF/XML, and JSON-LD formats. It is accessible through a SPARQL endpoint, VoID metadata, and an RDF sitemap for programmatic access.

    Data access points:
    • SPARQL Endpoint: https://trackloaded.com/sparql-endpoint.php
    • SPARQL Browser UI: https://trackloaded.com/sparql-browser
    • VoID Metadata (Turtle): https://trackloaded.com/?void=1
    • RDF Sitemap (Turtle): https://trackloaded.com/?build_rdf_sitemap

    Per-Artist Linked Data examples:
    • https://trackloaded.com/tag/olamide/?rdf=ttl
    • https://trackloaded.com/tag/olamide/?format=rdf

    Related resources:
    • Zenodo DOI: https://doi.org/10.5281/zenodo.16777415
    • GitHub repository: https://github.com/trackloaded/data
    • Kaggle dataset: https://www.kaggle.com/datasets/trackloaded/artist-datasets
    • LOD Cloud profile: https://lod-cloud.net/dataset/trackloaded
    • Trackloaded dataset page: https://trackloaded.com/data
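
    For orientation, a hedged sketch of querying the SPARQL endpoint over HTTP. The triple pattern is generic, so no dataset-specific vocabulary is assumed; the query and format request parameters are common SPARQL-endpoint conventions, not confirmed for this endpoint:

    import requests

    endpoint = "https://trackloaded.com/sparql-endpoint.php"
    # Generic triple pattern; makes no assumption about the dataset's vocabulary.
    query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

    resp = requests.get(endpoint, params={"query": query, "format": "json"})  # params assumed
    resp.raise_for_status()
    print(resp.text)  # result shape depends on the endpoint's configuration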

  16. ‘My Uber Drives’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Nov 21, 2021
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2021). ‘My Uber Drives’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-my-uber-drives-e494/79b87b1c/?iid=006-189&v=presentation
    Explore at:
    Dataset updated
    Nov 21, 2021
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘My Uber Drives’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/uberdrives on 21 November 2021.

    --- Dataset description provided by original source is as follows ---

    Context

    My Uber Drives (2016)

    Here are the details of my Uber drives of 2016. I am sharing this dataset for the data science community to learn from the behavior of an ordinary Uber customer.

    Content

    Geography: USA, Sri Lanka and Pakistan

    Time period: January - December 2016

    Unit of analysis: Drives

    Total Drives: 1,155

    Total Miles: 12,204

    Dataset: The dataset contains Start Date, End Date, Start Location, End Location, Miles Driven and Purpose of drive (Business, Personal, Meals, Errands, Meetings, Customer Support etc.)

    Acknowledgements & References

    Users are allowed to use, download, copy, distribute and cite the dataset for their pet projects and training. Please cite it as follows: “Zeeshan-ul-hassan Usmani, My Uber Drives Dataset, Kaggle Dataset Repository, March 23, 2017.”

    Past Research

    Uber TLC FOIL Response - The dataset contains over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015: https://github.com/fivethirtyeight/uber-tlc-foil-response

    1.1 Billion Taxi Pickups from New York - http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

    What you can do with this data - a good example by Yao-Jen Kuo - https://yaojenkuo.github.io/uber.html

    Inspiration

    Some ideas worth exploring:

    • What is the average length of the trip?

    • Average number of rides per week or per month?

    • Total tax savings based on traveled business miles?

    • Percentage of business miles vs. personal vs. meals?

    • How much money can be saved by a typical customer using Uber, Careem, or Lyft versus regular cab service?
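
    For example, a minimal sketch for the first two questions; the filename and the starred column names are assumptions about the source CSV:

    import pandas as pd

    df = pd.read_csv("My Uber Drives - 2016.csv")  # filename assumed

    # Average trip length in miles.
    print(df["MILES*"].mean())  # column name assumed

    # Rides per month.
    dates = pd.to_datetime(df["START_DATE*"], errors="coerce")  # column name assumed
    print(dates.dt.month.value_counts().sort_index())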

    --- Original source retains full ownership of the source dataset ---

  17. Google-Github-Repository-Analsysis

    • kaggle.com
    Updated Feb 6, 2020
    Cite
    BİLAL YAŞAR (2020). Google-Github-Repository-Analsysis [Dataset]. https://www.kaggle.com/blalyasar/googlegithubrepositoryanalsysis/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 6, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    BİLAL YAŞAR
    Description

    Dataset

    This dataset was created by BİLAL YAŞAR

    Contents

  18. GitHub Repositories

    • kaggle.com
    zip
    Updated Nov 29, 2018
    Cite
    Diego Castro (2018). GitHub Repositories [Dataset]. https://www.kaggle.com/datasets/qopuir/github-repositories
    Explore at:
    zip (15363388 bytes)
    Dataset updated
    Nov 29, 2018
    Authors
    Diego Castro
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Diego Castro

    Released under CC BY-NC-SA 4.0

    Contents

  19. MeDAL Dataset

    • kaggle.com
    • opendatalab.com
    • +1 more
    zip
    Updated Nov 16, 2020
    Cite
    xhlulu (2020). MeDAL Dataset [Dataset]. https://www.kaggle.com/xhlulu/medal-emnlp
    Explore at:
    zip (7324382521 bytes)
    Dataset updated
    Nov 16, 2020
    Authors
    xhlulu
    Description

    [Image: MeDAL logo, https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2352583%2F868a18fb09d7a1d3da946d74a9857130%2FLogo.PNG?generation=1604973725053566&alt=media]

    Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

    💻 Code · 🤗 Dataset (Hugging Face) · 💾 Dataset (Kaggle) · 💽 Dataset (Zenodo) · 📜 Paper (ACL) · 📝 Paper (ArXiv) · Pre-trained ELECTRA (Hugging Face)

    Downloading the data

    We recommend downloading from Kaggle if you can authenticate through their API. The advantage of Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.

    First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:

    pip install kaggle

    Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:

    kaggle datasets download xhlulu/medal-emnlp

    Now, unzip everything and place the files inside the data directory:

    unzip -nq crawl-300d-2M-subword.zip -d data
    mv data/pretrain_sample/* data/

    Loading FastText Embeddings

    For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:

    wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
    unzip -nq data/crawl-300d-2M-subword.zip -d data/

    Model Quickstart

    Using Torch Hub

    You can directly load LSTM and LSTM-SA with torch.hub:

    import torch

    lstm = torch.hub.load("BruceWen120/medal", "lstm")
    lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa")

    If you want to use the Electra model, you need to first install transformers: pip install transformers

    Then, you can load it with torch.hub:

    import torch
    electra = torch.hub.load("BruceWen120/medal", "electra")

    Using Huggingface transformers

    If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:

    from transformers import AutoModel, AutoTokenizer
    
    model = AutoModel.from_pretrained("xhlu/electra-medal")
    tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
    

    Citation

    Download the bibtex here, or copy the text below:

    @inproceedings{wen-etal-2020-medal,
        title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
        author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
        booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
        month = nov,
        year = "2020",
        address = "Online",
        publisher = "Association for Computational Linguistics",
        url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
        pages = "130--135",
    }

    License, Terms and Conditions

    The ELECTRA model is licensed under Apache 2.0. The licenses for the libraries used in this project (transformers, pytorch, etc.) can be found in their respective GitHub repositories. Our model is released under an MIT license.

    The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:

    INTRODUCTION

    Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.

    MEDLINE/PUBMED SPECIFIC TERMS

    NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.

    GENERAL TERMS AND CONDITIONS

    • Users of the data agree to:

      • acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
      • properly use registration and/or trademark symbols when referring to NLM products, and
      • not indicate or imply that NLM has endorsed its products/services/applications.
    • Users who republish or redistribute the data (services, products or raw data) agree to:

      • maintain the most current version of all distributed data, or
      • make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
    • These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.

    • NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.

    • NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.

  20. Bank Account Fraud Dataset Suite (NeurIPS 2022)

    • kaggle.com
    Updated Nov 29, 2023
    Cite
    Sérgio Jesus (2023). Bank Account Fraud Dataset Suite (NeurIPS 2022) [Dataset]. https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sérgio Jesus
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Bank Account Fraud (BAF) suite of datasets was published at NeurIPS 2022 and comprises a total of six different synthetic bank account fraud tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and the first of its kind!

    This suite of datasets is:
    • Realistic, based on a present-day real-world dataset for fraud detection;
    • Biased, with distinct controlled types of bias in each dataset;
    • Imbalanced, with an extremely low prevalence of the positive class;
    • Dynamic, with temporal data and observed distribution shifts;
    • Privacy preserving: to protect the identity of potential applicants, we have applied differential privacy techniques (noise addition), feature encoding, and trained a generative model (CTGAN).


    Each dataset is composed of:
    • 1 million instances;
    • 30 realistic features used in the fraud detection use-case;
    • A column of "month", providing temporal information about the dataset;
    • Protected attributes (age group, employment status, and % income).
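
    For example, a hedged sketch of checking the imbalance and the temporal structure; the Base.csv filename and the fraud_bool label column are assumptions about the suite's files:

    import pandas as pd

    df = pd.read_csv("Base.csv")  # one of the six variants; filename assumed

    # Extremely low positive-class prevalence, as described above.
    print(df["fraud_bool"].mean())  # label column name assumed

    # Temporal dimension: fraud prevalence per month.
    print(df.groupby("month")["fraud_bool"].mean())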

    Detailed information (datasheet) on the suite: https://github.com/feedzai/bank-account-fraud/blob/main/documents/datasheet.pdf

    Check out the github repository for more resources and some example notebooks: https://github.com/feedzai/bank-account-fraud

    Read the NeurIPS 2022 paper here: https://arxiv.org/abs/2211.13358

    Learn more about Feedzai Research here: https://research.feedzai.com/

    Please use the following citation for the BAF dataset suite:

    @article{jesusTurningTablesBiased2022,
        title={Turning the {{Tables}}: {{Biased}}, {{Imbalanced}}, {{Dynamic Tabular Datasets}} for {{ML Evaluation}}},
        author={Jesus, S{\'e}rgio and Pombal, Jos{\'e} and Alves, Duarte and Cruz, Andr{\'e} and Saleiro, Pedro and Ribeiro, Rita P. and Gama, Jo{\~a}o and Bizarro, Pedro},
        journal={Advances in Neural Information Processing Systems},
        year={2022}
    }
