GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]
. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
display(Markdown("../input/Car-Person-v2-Roboflow/README.roboflow.txt"))
In this Notebook, I have processed the images with RoboFlow because in COCO formatted dataset was having different dimensions of image and Also data set was not splitted into different Format. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67">
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
user_secrets = UserSecretsClient()
wandb_api_key = user_secrets.get_secret("wandb_api")
wandb.login(key=wandb_api_key)
anonymous = None
except:
wandb.login(anonymous='must')
print('To use your W&B account,
Go to Add-ons -> Secrets and provide your W&B access token. Use the Label name as WANDB.
Get your W&B access token from here: https://wandb.ai/authorize')
wandb.init(project="YOLOvR",name=f"7. YOLOv7-Car-Person-Custom-Run-7")
https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png" alt="">
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, We can choose between two paths:
https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG" alt="">
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments: - img: define input image size - batch: determine
I added this repository to the Kaggle datasets since I found it very useful in the audio features extraction. All credits goes to the creator of this amazing package. For references please visit the repository in Github.
Open Data Commons Attribution License (ODC-By) v1.0https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset was created during the Programming Language Ecosystem project from TU Wien using the code inside the repository https://github.com/ValentinFutterer/UsageOfProgramminglanguages2011-2023?tab=readme-ov-file.
The centerpiece of this repository is the usage_of_programming_languages_2011-2023.csv. This csv file shows the popularity of programming languages over the last 12 years in yearly increments. The repository also contains graphs created with the dataset. To get an accurate estimate on the popularity of programming languages, this dataset was created using 3 vastly different sources.
The dataset was created using the github repository above. As input data, three public datasets where used.
Taken from https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/ by Peter Elmers. It is licensed under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/. It shows metadata information (no code) of all github repositories with more than 5 stars.
Taken from https://github.com/pypl/pypl.github.io/tree/master, put online by the user pcarbonn. It is licensed under CC BY 3.0 https://creativecommons.org/licenses/by/3.0/. It shows from 2004 to 2023 for each month the share of programming related google searches per language.
Taken from https://insights.stackoverflow.com/survey. It is licensed under Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/. It shows from 2011 to 2023 the results of the yearly stackoverflow developer survey.
All these datasets were downloaded on the 12.12.2023. The datasets are all in the github repository above
The dataset contains a column for the year and then many columns for the different languages, denoting their usage in percent. Additionally, vertical barcharts and piecharts for each year plus a line graph for each language over the whole timespan as png's are provided.
The languages that are going to be considered for the project can be seen here:
- Python
- C
- C++
- Java
- C#
- JavaScript
- PHP
- SQL
- Assembly
- Scratch
- Fortran
- Go
- Kotlin
- Delphi
- Swift
- Rust
- Ruby
- R
- COBOL
- F#
- Perl
- TypeScript
- Haskell
- Scala
This project is licensed under the Open Data Commons Open Database License (ODbL) v1.0 https://opendatacommons.org/licenses/odbl/1-0/ license.
TLDR: You are free to share, adapt, and create derivative works from this dataser as long as you attribute me, keep the database open (if you redistribute it), and continue to share-alike any adapted database under the ODbl.
Thanks go out to
- stackoverflow https://insights.stackoverflow.com/survey for providing the data from the yearly stackoverflow developer survey.
- the PYPL survey, https://github.com/pypl/pypl.github.io/tree/master for providing google search data.
- Peter Elmers, for crawling metadata on github repositories and providing the data https://www.kaggle.com/datasets/pelmers/github-repository-metadata-with-5-stars/.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F10293677%2Fa6d81c06dc03412bfd063941bd1dfa18%2Fspacex-falcon9-reaching-orbit-wide.jpg?generation=1672337964521833&alt=media" alt="">
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated to two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
AIS-4SD (AI Summit - 4 Stable Diffusion models) is a collection of 4.000 images, generated using a set of Stability AI text-to-image diffusion models
This dataset was developed during the development of a collaborative project between PEReN and VIGINUM for the AI Summit held in Paris in February 2025. This open-source project aims at assessing generated images detectors performances and their robustness to different models and transformations. The code is free and open source, and contributions to connect additional detectors are also welcome.
Official repository: https://code.peren.gouv.fr/open-source/ai-action-summit/generated-image-detection.
This dataset can be used to assess detection models performances, and in particular their robustness to successive updates of the generation model.
1.000 generated images with four different versions of stability AI text-to-image diffusion model.
For each models, we generated:
Model | Number of images |
---|---|
stabilityai/stable-diffusion-xl-base-1.0 | 500 👨 + 500 🖼️ |
stabilityai/stable-diffusion-2-1 | 500 👨 + 500 🖼️ |
stabilityai/stable-diffusion-3-medium-diffusers | 500 👨 + 500 🖼️ |
stabilityai/stable-diffusion-3.5-large | 500 👨 + 500 🖼️ |
The scripts used to generated these images can be found on our open-source repository (see this specific file). After setting-up our project, you can run:
$ poetry run python scripts/generate_images.py
With minor updates to these scripts you can enrich this dataset with your specific needs.
One zip file with the following structure, each directory containing the associated 500 images:
AIS-4SD/
├── generation_metadata.csv
├── StableDiffusion-2.1-faces-20250203-1448
├── StableDiffusion-2.1-other-20250203-1548
├── StableDiffusion-3.5-faces-20250203-1012
├── StableDiffusion-3.5-other-20250203-1603
├── StableDiffusion-3-faces-20250203-1545
├── StableDiffusion-3-other-20250203-1433
├── StableDiffusion-XL-faces-20250203-0924
└── StableDiffusion-XL-other-20250203-1727
The metadata for generated images (see generation_metadata.csv
) are:
Project is under ongoing development. A preliminary blog post can be found here: https://www.peren.gouv.fr/en/perenlab/2025-02-11_ai_summit/.
This dataset is a combination of Stanford's Alpaca (https://github.com/tatsu-lab/stanford_alpaca) and FiQA (https://sites.google.com/view/fiqa/) with another 1.3k pairs custom generated using GPT3.5 Script for tuning through Kaggle's (https://www.kaggle.com) free resources using PEFT/LoRa: https://www.kaggle.com/code/gbhacker23/wealth-alpaca-lora GitHub repo with performance analyses, training and data generation scripts, and inference notebooks: https://github.com/gaurangbharti1/wealth-alpaca… See the full description on the dataset page: https://huggingface.co/datasets/csujeong/financial_data.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains histopathological image data for the identification of colon cancer using deep learning. It includes high-resolution images labeled as cancerous or non-cancerous, intended for training and validating computer vision models in medical imaging.
The dataset is organised into two main image folders and two supporting CSV files:
├── train/ # 7,560 labelled images for training
├── test/ # 5,041 unlabeled images for inference/testing
├── train.csv # Contains image filenames and corresponding labels (for train/ folder)
├── example.csv # Sample format for custom data input
Folder/File | Description |
---|---|
train/ | Contains labeled histopathology images |
test/ | Contains images without labels for model inference |
train.csv | CSV file with two columns: image_id , label |
example.csv | A demonstration CSV with the expected structure |
Label Encoding:
Id
→ The Id of the ImageType
→ Cancer / Connective / Immune / NormalLoad the training labels:
import pandas as pd
df = pd.read_csv("train.csv")
print(df.head())
Read an image:
from PIL import Image
img = Image.open("train/image_00123.jpg")
img.show()
Uploaded by: Arpan Gupta
Full project using this dataset: GitHub Repo
Notebook Using Dataset: Kaggle
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a rich snapshot of GitHub users from India, capturing various aspects of their public profiles. It's a valuable resource for analyzing trends in coding activity, repository management, and user engagement within the Indian developer community. Whether you're interested in exploring how developers grow their followers, examining language preferences, or identifying patterns in contributions and achievements, this dataset offers multiple points of analysis.
Key Features: - Username: GitHub usernames of the individuals. - Gender Pronoun: Preferred gender pronouns (if available). - Followings: Number of people each user follows. - Joining Year: The year they joined GitHub. - Contributions: Number of contributions made in the last year. - Achievements: Number of GitHub achievements unlocked by the user. - Stars: Total number of stars on their repositories. - Repositories: Number of repositories created. - Followers: Number of followers each user has. - Location: User location details, primarily from India. - Languages: Primary programming language used by the individual. - Social Links: Links to their other social platforms (LinkedIn, personal websites, etc.). - Sorting Type: Categorized based on followers, repositories, or recent joining.
This dataset can be used for: - Profiling the Indian developer community. - Tracking open-source contributions and achievements. - Analyzing programming language preferences and repository management. - Exploring the relationship between social followings and coding contributions.
Perfect for data science, social network analysis, and open-source research.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Started out as a pumpkin detector to test training YOLOv5. Now suffering from extensive feature creep and probably ending up as a cat/dog/spider/pumpkin/randomobjects-detector. Or as a desaster.
The dataset does not fit https://docs.ultralytics.com/tutorials/training-tips-best-results/ well. There are no background images and the labeling is often only partial. Especially in the humans and pumpkin category where there are often lots of objects in one photo people apparently (and understandably) got bored and did not labe everything. And of course the images from the cat-category don't have the humans in it labeled since they come from a cat-identification model which ignored humans. It will need a lot of time to fixt that.
Dataset used: - Cat and Dog Data: Cat / Dog Tutorial NVIDIA Jetson https://github.com/dusty-nv/jetson-inference/blob/master/docs/pytorch-cat-dog.md © 2016-2019 NVIDIA according to bottom of linked page - Spider Data: Kaggle Animal 10 image set https://www.kaggle.com/datasets/alessiocorrado99/animals10 Animal pictures of 10 different categories taken from google images Kaggle project licensed GPL 2 - Pumpkin Data: Kaggle "Vegetable Images" https://www.researchgate.net/publication/352846889_DCNN-Based_Vegetable_Image_Classification_Using_Transfer_Learning_A_Comparative_Study https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset Kaggle project licensed CC BY-SA 4.0 - Some pumpkin images manually copied from google image search - https://universe.roboflow.com/chess-project/chess-sample-rzbmc Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/steve-pamer-cvmbg/pumpkins-gfjw5 Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/nbduy/pumpkin-ryavl Provided by a Roboflow user License: CC BY 4.0 - https://universe.roboflow.com/homeworktest-wbx8v/cat_test-1x0bl/dataset/2 - https://universe.roboflow.com/220616nishikura/catdetector - https://universe.roboflow.com/atoany/cats-s4d4i/dataset/2 - https://universe.roboflow.com/personal-vruc2/agricultured-ioth22 - https://universe.roboflow.com/sreyoshiworkspace-radu9/pet_detection - https://universe.roboflow.com/artyom-hystt/my-dogs-lcpqe - license: Public Domain url: https://universe.roboflow.com/dolazy7-gmail-com-3vj05/sweetpumpkin/dataset/2 - https://universe.roboflow.com/tristram-dacayan/social-distancing-g4pbu - https://universe.roboflow.com/fyp-3edkl/social-distancing-2ygx5 License MIT - Spiders: https://universe.roboflow.com/lucas-lins-souza/animals-train-yruka
Currently I can't guarantee it's all correctly licenced. Checks are in progress. Inform me if you see one of your pictures and want it to be removed!
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Allaneee
Released under MIT
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
EPD dataset contains 1500 images in total: 720 images were captured by Pleiades satellite along Huimao Line in Guangdong Province, China, a main line of power network in south China, while the remaining images were collected from Google Earth to further improve the representativeness of the dataset by expanding the source of samples. The spatial resolution of images in EPD dataset is 1 m/pixel.
Moreover, to test and evaluate the adaptability of the detectors in face of actual situations, we specially selected 50 relatively complex images from EPD dataset comprising a complex test subset called EPD-C ...
1450 images in EPD dataset excluding EPD-C as a standard subset named EPD-S, which involves more than 3000 electric pylons. EPD-S subset was used to train detectors and perform random experiments.
This dataset is obtained from the listing in Robin Cole's satellite-image-deep-learning GitHub repository * https://www.satellite-image-deep-learning.com/ * https://twitter.com/robmarkcole
Electric-Pylon-Detection-in-RSI -> a dataset which contains 1500 remote sensing images of electric pylons used to train ten deep learning models * GitHub Repository * Kaggle EPD Dataset: https://www.kaggle.com/datasets/qiaosijia/epd-dataset * Link to the research paper: Deep Learning Based Electric Pylon Detection in Remote Sensing Images
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the First stage SFT dataset as presented in the paper A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning. This dataset is used for the intensive Supervised Fine-Tuning (SFT) phase, crucial for pushing the model's mathematical accuracy. Project GitHub Repository: https://github.com/RabotniKuma/Kaggle-AIMO-Progress-Prize-2-9th-Place-Solution
Dataset Construction
This dataset was… See the full description on the dataset page: https://huggingface.co/datasets/RabotniKuma/Fast-Math-R1-SFT.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Trackloaded Artist Dataset is a curated open data collection of Nigerian music artists and related metadata from Trackloaded.com. It includes artist names, biographies, birth dates, profile pages, social media handles, and external identity links such as Wikidata and Instagram.The dataset is published as Linked Open Data (LOD) and is available in RDF/Turtle, RDF/XML, and JSON-LD formats. It is accessible through a SPARQL endpoint, VoID metadata, and an RDF sitemap for programmatic access.**Data access points:**- SPARQL Endpoint: https://trackloaded.com/sparql-endpoint.php- SPARQL Browser UI: https://trackloaded.com/sparql-browser- VoID Metadata (Turtle): https://trackloaded.com/?void=1- RDF Sitemap (Turtle): https://trackloaded.com/?build_rdf_sitemap**Per-Artist Linked Data examples:**- https://trackloaded.com/tag/olamide/?rdf=ttl- https://trackloaded.com/tag/olamide/?format=rdf**Related resources:**- Zenodo DOI: https://doi.org/10.5281/zenodo.16777415- GitHub repository: https://github.com/trackloaded/data- Kaggle dataset: https://www.kaggle.com/datasets/trackloaded/artist-datasets- LOD Cloud profile: https://lod-cloud.net/dataset/trackloaded- Trackloaded dataset page: https://trackloaded.com/data
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘My Uber Drives’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/zusmani/uberdrives on 21 November 2021.
--- Dataset description provided by original source is as follows ---
My Uber Drives (2016)
Here are the details of my Uber Drives of 2016. I am sharing this dataset for data science community to learn from the behavior of an ordinary Uber customer.
Geography: USA, Sri Lanka and Pakistan
Time period: January - December 2016
Unit of analysis: Drives
Total Drives: 1,155
Total Miles: 12,204
Dataset: The dataset contains Start Date, End Date, Start Location, End Location, Miles Driven and Purpose of drive (Business, Personal, Meals, Errands, Meetings, Customer Support etc.)
Users are allowed to use, download, copy, distribute and cite the dataset for their pet projects and training. Please cite it as follows: “Zeeshan-ul-hassan Usmani, My Uber Drives Dataset, Kaggle Dataset Repository, March 23, 2017.”
Uber TLC FOIL Response - The dataset contains over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015 https://github.com/fivethirtyeight/uber-tlc-foil-response
1.1 Billion Taxi Pickups from New York - http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/
What you can do with this data - a good example by Yao-Jen Kuo - https://yaojenkuo.github.io/uber.html
Some ideas worth exploring:
• What is the average length of the trip?
• Average number of rides per week or per month?
• Total tax savings based on traveled business miles?
• Percentage of business miles vs personal vs. Meals
• How much money can be saved by a typical customer using Uber, Careem, or Lyft versus regular cab service?
--- Original source retains full ownership of the source dataset ---
This dataset was created by BİLAL YAŞAR
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset was created by Diego Castro
Released under CC BY-NC-SA 4.0
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2352583%2F868a18fb09d7a1d3da946d74a9857130%2FLogo.PNG?generation=1604973725053566&alt=media" alt="">
Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.
💻 Code 🤗 Dataset (Hugging Face) 💾 Dataset (Kaggle) 💽 Dataset (Zenodo) 📜 Paper (ACL) 📝 Paper (Arxiv) ⚡ Pre-trained ELECTRA (Hugging Face)
We recommend downloading from Kaggle if you can authenticate through their API. The advantage to Kaggle is that the data is compressed, so it will be faster to download. Links to the data can be found at the top of the readme.
First, you will need to create an account on kaggle.com. Afterwards, you will need to install the kaggle API:
pip install kaggle
Then, you will need to follow the instructions here to add your username and key. Once that's done, you can run:
kaggle datasets download xhlulu/medal-emnlp
Now, unzip everything and place them inside the data
directory:
unzip -nq crawl-300d-2M-subword.zip -d data
mv data/pretrain_sample/* data/
For the LSTM models, we will need to use the fastText embeddings. To do so, first download and extract the weights:
wget -nc -P data/ https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip
unzip -nq data/crawl-300d-2M-subword.zip -d data/
You can directly load LSTM and LSTM-SA with torch.hub
:
```python
import torch
lstm = torch.hub.load("BruceWen120/medal", "lstm") lstm_sa = torch.hub.load("BruceWen120/medal", "lstm_sa") ```
If you want to use the Electra model, you need to first install transformers:
pip install transformers
Then, you can load it with torch.hub
:
python
import torch
electra = torch.hub.load("BruceWen120/medal", "electra")
transformers
If you are only interested in the pre-trained ELECTRA weights (without the disambiguation head), you can load it directly from the Hugging Face Repository:
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("xhlu/electra-medal")
tokenizer = AutoTokenizer.from_pretrained("xhlu/electra-medal")
Download the bibtex
here, or copy the text below:
@inproceedings{wen-etal-2020-medal,
title = "{M}e{DAL}: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining",
author = "Wen, Zhi and Lu, Xing Han and Reddy, Siva",
booktitle = "Proceedings of the 3rd Clinical Natural Language Processing Workshop",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.clinicalnlp-1.15",
pages = "130--135",
}
The ELECTRA model is licensed under Apache 2.0. The license for the libraries used in this project (transformers
, pytorch
, etc.) can be found in their respective GitHub repository. Our model is released under a MIT license.
The original dataset was retrieved and modified from the NLM website. By using this dataset, you are bound by the terms and conditions specified by NLM:
INTRODUCTION
Downloading data from the National Library of Medicine FTP servers indicates your acceptance of the following Terms and Conditions: No charges, usage fees or royalties are paid to NLM for this data.
MEDLINE/PUBMED SPECIFIC TERMS
NLM freely provides PubMed/MEDLINE data. Please note some PubMed/MEDLINE abstracts may be protected by copyright.
GENERAL TERMS AND CONDITIONS
Users of the data agree to:
- acknowledge NLM as the source of the data by including the phrase "Courtesy of the U.S. National Library of Medicine" in a clear and conspicuous manner,
- properly use registration and/or trademark symbols when referring to NLM products, and
- not indicate or imply that NLM has endorsed its products/services/applications.
Users who republish or redistribute the data (services, products or raw data) agree to:
- maintain the most current version of all distributed data, or
- make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.
These data are produced with a reasonable standard of care, but NLM makes no warranties express or implied, including no warranty of merchantability or fitness for particular purpose, regarding the accuracy or completeness of the data. Users agree to hold NLM and the U.S. Government harmless from any liability resulting from errors in the data. NLM disclaims any liability for any consequences due to use, misuse, or interpretation of information contained or not contained in the data.
NLM does not provide legal advice regarding copyright, fair use, or other aspects of intellectual property rights. See the NLM Copyright page.
NLM reserves the right to change the type and format of its machine-readable data. NLM will take reasonable steps to inform users of any changes to the format of the data before the data are distributed via the announcement section or subscription to email and RSS updates.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Bank Account Fraud (BAF) suite of datasets has been published at NeurIPS 2022 and it comprises a total of 6 different synthetic bank account fraud tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and the first of its kind!
This suite of datasets is:
- Realistic, based on a present-day real-world dataset for fraud detection;
- Biased, each dataset has distinct controlled types of bias;
- Imbalanced, this setting presents a extremely low prevalence of positive class;
- Dynamic, with temporal data and observed distribution shifts;
- Privacy preserving, to protect the identity of potential applicants we have applied differential privacy techniques (noise addition), feature encoding and trained a generative model (CTGAN).
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3349776%2F4271ec763b04362801df2660c6e2ec30%2FScreenshot%20from%202022-11-29%2017-42-41.png?generation=1669743799938811&alt=media" alt="">
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3349776%2Faf502caf5b9e370b869b85c9d4642c5c%2FScreenshot%20from%202022-12-15%2015-17-59.png?generation=1671117525527314&alt=media" alt="">
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3349776%2Ff3789bd484ee392d648b7809429134df%2FScreenshot%20from%202022-11-29%2017-40-58.png?generation=1669743681526133&alt=media" alt="">
Each dataset is composed of: - 1 million instances; - 30 realistic features used in the fraud detection use-case; - A column of “month”, providing temporal information about the dataset; - Protected attributes, (age group, employment status and % income).
Detailed information (datasheet) on the suite: https://github.com/feedzai/bank-account-fraud/blob/main/documents/datasheet.pdf
Check out the github repository for more resources and some example notebooks: https://github.com/feedzai/bank-account-fraud
Read the NeurIPS 2022 paper here: https://arxiv.org/abs/2211.13358
Learn more about Feedzai Research here: https://research.feedzai.com/
Please, use the following citation of BAF dataset suite
@article{jesusTurningTablesBiased2022,
title={Turning the {{Tables}}: {{Biased}}, {{Imbalanced}}, {{Dynamic Tabular Datasets}} for {{ML Evaluation}}},
author={Jesus, S{\'e}rgio and Pombal, Jos{\'e} and Alves, Duarte and Cruz, Andr{\'e} and Saleiro, Pedro and Ribeiro, Rita P. and Gama, Jo{\~a}o and Bizarro, Pedro},
journal={Advances in Neural Information Processing Systems},
year={2022}
}
GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]
. Fork this kernel to get started to learn how to safely manage analyzing large BigQuery datasets.
This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.