CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/
Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:
Over 8 million 311 service requests from 2012-2016
More than 1 million motor vehicle collisions 2012-present
Citi Bike stations and 30 million Citi Bike trips 2013-present
Over 1 billion Yellow and Green Taxi rides from 2009-present
Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015
This dataset is deprecated and not being updated.
Fork this kernel to get started with this dataset.
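As a starting point, the 311 requests can be queried directly from BigQuery with the Python client. This is a minimal sketch, assuming the google-cloud-bigquery package and GCP credentials are set up; the exact table name under bigquery-public-data is an assumption and may differ.
from google.cloud import bigquery

# Count the most common 311 complaint types (table path is an assumption).
client = bigquery.Client()
query = """
    SELECT complaint_type, COUNT(*) AS num_requests
    FROM `bigquery-public-data.new_york.311_service_requests`
    GROUP BY complaint_type
    ORDER BY num_requests DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.complaint_type, row.num_requests)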
https://opendata.cityofnewyork.us/
This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.
The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.
Banner photo by @bicadmedia from Unsplash.
On which New York City streets are you most likely to find a loud party?
Can you find the Virginia Pines in New York City?
Where was the only collision caused by an animal that injured a cyclist?
What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?
https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions CSV file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top-level folder contains up to 1 million files; e.g., folder 123 contains all versions from 123,000,000 to 123,999,999. Each subfolder contains up to 1 thousand files; e.g., 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
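For programmatic access, the mapping from a KernelVersions id to its folder can be computed directly. This is a minimal sketch of the layout described above; the file extension is an assumption (notebooks may be .ipynb, .py, or .R).
# Map a KernelVersions id to its expected folder, following the layout above.
def kernel_version_path(version_id: int, extension: str = "ipynb") -> str:
    top = version_id // 1_000_000        # e.g. 123 for ids 123,000,000-123,999,999
    sub = (version_id // 1_000) % 1_000  # e.g. 456 for ids 123,456,000-123,456,999
    return f"{top}/{sub}/{version_id}.{extension}"

print(kernel_version_path(123456789))  # -> 123/456/123456789.ipynb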
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.
The original dataset is organized into multiple CSV files, each containing structured data on different entities:
Table 1. code_blocks.csv structure
Column | Description |
code_blocks_index | Global index linking code blocks to markup_data.csv. |
kernel_id | Identifier for the Kaggle Jupyter notebook from which the code block was extracted. |
code_block_id | Position of the code block within the notebook. |
code_block | The actual machine learning code snippet. |
Table 2. kernels_meta.csv structure
Column | Description |
kernel_id | Identifier for the Kaggle Jupyter notebook. |
kaggle_score | Performance metric of the notebook. |
kaggle_comments | Number of comments on the notebook. |
kaggle_upvotes | Number of upvotes the notebook received. |
kernel_link | URL to the notebook. |
comp_name | Name of the associated Kaggle competition. |
Table 3. competitions_meta.csv structure
Column | Description |
comp_name | Name of the Kaggle competition. |
description | Overview of the competition task. |
data_type | Type of data used in the competition. |
comp_type | Classification of the competition. |
subtitle | Short description of the task. |
EvaluationAlgorithmAbbreviation | Metric used for assessing competition submissions. |
data_sources | Links to datasets used. |
metric type | Class label for the assessment metric. |
Table 4. markup_data.csv structure
Column | Description |
code_block | Machine learning code block. |
too_long | Flag indicating whether the block spans multiple semantic types. |
marks | Confidence level of the annotation. |
graph_vertex_id | ID of the semantic type. |
The dataset allows mapping between these tables. For example, code blocks in code_blocks.csv can be linked to notebook metadata in kernels_meta.csv via the kernel_id column, and notebooks can be linked to competition metadata in competitions_meta.csv via the comp_name column. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores. In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
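A minimal sketch of how these tables can be joined with pandas, assuming the CSV files are stored locally under the names given above:
import pandas as pd

# Join code blocks to their notebook and competition metadata.
code_blocks = pd.read_csv("code_blocks.csv")
kernels_meta = pd.read_csv("kernels_meta.csv")
competitions_meta = pd.read_csv("competitions_meta.csv")

blocks = code_blocks.merge(kernels_meta, on="kernel_id", how="left")
blocks = blocks.merge(competitions_meta, on="comp_name", how="left")

print(blocks[["kernel_id", "comp_name", "kaggle_score"]].head())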
The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.
Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.
competitions_meta_2.csv is enriched with data cards describing the data used in the competitions.
The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and offer customers suggestions on itemsets, so we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.
Association Rules are most often used when you want to build associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, which allows the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(computer mouse) = 0.08/0.10 = 0.80; lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
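The toy calculation above can be reproduced in a few lines (shown in Python for brevity, although the tutorial below uses R):
# 100 customers: 10 bought a computer mouse, 9 a mouse mat, 8 bought both.
n_customers, n_mouse, n_mat, n_both = 100, 10, 9, 8

support = n_both / n_customers             # P(mouse and mat) = 0.08
confidence = n_both / n_mouse              # P(mat | mouse)   = 0.80
lift = confidence / (n_mat / n_customers)  # 0.80 / 0.09      ≈ 8.9

print(support, confidence, round(lift, 1))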
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries; each library is briefly described below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame and remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply Association Rule mining, we need to convert the data frame into transaction data so that all items bought together in one invoice will be in ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Top Streamers on Twitch’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aayushmishra1512/twitchdata on 30 September 2021.
--- Dataset description provided by original source is as follows ---
Gaming is a very big industry now. Every year, millions of dollars are invested in esports, and many new companies want to invest in the esports scene. One of the biggest ever deals was when Mixer opened up and brought Ninja and Shroud to their platform from Twitch. But Twitch has been a home to streamers since day one, and now that Mixer has been shut down, streamers are returning to the platform. Millions, if not billions, watch Twitch streams every day, and I myself like to watch them. So I put together the top 1,000 streamers from the past year who were streaming on Twitch.
This data consists of attributes such as the number of viewers, the number of active viewers, followers gained, and many other relevant columns regarding a particular streamer. It has 11 columns with all the necessary information.
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test. The dataset was presented in our CVPR'20 paper. In this repository, we present download links for all dataset files and relevant code for metric computation. This dataset was associated with two Kaggle challenges, on landmark recognition and landmark retrieval. Results were discussed as part of a CVPR'19 workshop. In this repository, we also provide scores for the top 10 teams in the challenges, based on the latest ground-truth version. Please visit the challenge and workshop webpages for more details on the data, tasks and technical solutions from top teams.
U.S. Government Works: https://www.usa.gov/government-works/
This dataset is an update of Sharan Naribole's earlier dataset titled H-1B Visa Petitions 2011-2016. Inspired by his work and using a modified, updated version of his R script, I wrangled U.S. H-1B visa petition data for the years 2015-2019. The previous dataset can be found here: link to Sharan's dataset.
H-1B visas are the most common visa status applied for and held by international students once they begin working full-time in the U.S.
Please see the original dataset for more context information.
This dataset includes 5 years' worth of H-1B visa petitions in the U.S. The columns in the dataset include case status, employer name, worksite coordinates, job title, prevailing wage, occupation code, and year filed.
This file contains H-1B data from the LCA Program data files (H-1B, H-1B1, E-3). These datasets can be found on the U.S. Department of Labor site.
Shout out to Sharan Naribole for the original project idea and easy-to-update R script.
U.S. Department of Labor Data Source
Which states/cities/companies provide the most H-1B visas? For your job description, which city should you be in to have the most opportunities? Which companies should you apply to if you would like the best odds of obtaining an H-1B visa?
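A minimal pandas sketch as a starting point for these questions; the file name and the column names (CASE_STATUS, EMPLOYER_NAME, WORKSITE) are assumptions based on the column descriptions above and should be checked against the actual CSV header.
import pandas as pd

df = pd.read_csv("h1b_2015_2019.csv")
certified = df[df["CASE_STATUS"] == "CERTIFIED"]

print(certified["EMPLOYER_NAME"].value_counts().head(10))  # top employers
print(certified["WORKSITE"].value_counts().head(10))       # top worksites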
Wikipedia - Image/Caption Matching Kaggle Competition.
This competition is organized by the Research team at the Wikimedia Foundation in collaboration with Google Research and a few external collaborators. This competition is based on the WIT dataset published by Google Research as detailed in this SIGIR paper.
In this competition, you’ll build a model that automatically retrieves the text closest to an image. Specifically, you'll train your model to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images. If successful, you'll be contributing to the accessibility of the largest online encyclopedia. The millions of Wikipedia readers and editors will be able to more easily understand, search, and describe media at scale. As a result, you’ll contribute to an open model to improve learning for all.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the WIT Kaggle dataset
ds = tfds.load('wit_kaggle', split='train')

# Print the first four examples
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wit_kaggle-train_with_extended_features-1.0.2.png
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The MNAD corpus is a collection of over 1 million Moroccan news articles written in modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.
Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.
The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper:
"A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv1 - Huggingface Datasets: MNADv1
Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.
The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.
Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
This dataset is available for download from the following sources: - Kaggle Datasets : MNADv2 - Huggingface Datasets: MNADv2
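A minimal sketch for loading MNAD v2 with pandas; the column names (Title, Body, Category, Source) follow the field descriptions above, but verify them against the actual file.
import pandas as pd

mnad = pd.read_csv("MNADv2.csv")

print(mnad["Category"].value_counts())  # articles per category (19 classes)
print(mnad["Source"].value_counts())    # articles per news source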
If you use our data, please cite the following paper:
@inproceedings{MNAD2021,
author = {Mourad Jbene and
Smail Tigani and
Rachid Saadane and
Abdellah Chehri},
title = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
year = {2021},
publisher = {{IEEE}},
booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
doi = {10.1109/dasa53625.2021.9682402},
url = {https://doi.org/10.1109/dasa53625.2021.9682402},
}
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset is a processed version of the Sentiment140 corpus, containing 1.6 million tweets with binary sentiment labels. The original data has been cleaned, tokenized, and prepared for natural language processing (NLP) and machine learning tasks. It provides a rich resource for sentiment analysis, text classification, and other NLP applications. The dataset includes the full processed corpus (train-processed.csv) and a smaller sample of 10,000 tweets (train-processed-sample.csv) for quick experimentation and model prototyping.
Key features:
1.6 million labeled tweets
Binary sentiment classification (0 for negative, 1 for positive)
Preprocessed and tokenized text
Balanced class distribution
Suitable for various NLP tasks and model architectures
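A minimal baseline sketch on the 10,000-tweet sample, assuming scikit-learn is available; the column names ("text", "target") are assumptions and should be checked against the CSV header.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("train-processed-sample.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["target"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=20_000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print("accuracy:", accuracy_score(y_test, preds))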
Citation If you use this dataset in your research or project, please cite the original Sentiment140 dataset: Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009), p.12.
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of over 1 million images covering 340 classes of doodles. It contains grayscale images of doodles, organized by class, extracted from the Quick, Draw! dataset. Each image represents a hand-drawn sketch from various categories, processed to be ready for machine learning tasks.
This dataset is a clean, processed, and easy-to-use version of the original Quick, Draw! dataset by Google, which has approximately 50 million images.
This dataset is your playground for: - Training and evaluating machine learning models, especially for image classification tasks. - Conducting research and educational activities with a well-organized set of doodle images. - Benchmarking doodle recognition algorithms.
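If the images are organized into one folder per class as described, a loading sketch might look like the following; the root directory and the 28x28 image size are assumptions and should be adjusted to the actual layout.
import tensorflow as tf

ds = tf.keras.utils.image_dataset_from_directory(
    "quickdraw_images/",
    color_mode="grayscale",
    image_size=(28, 28),
    batch_size=256,
)
print(ds.class_names[:10])  # first few of the 340 doodle classes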
Special thanks to the original creators and contributors of the Quick, Draw! dataset.
This dataset is shared under the CC BY 4.0 license. Please attribute the source when using this dataset in your work.
We hope this dataset serves as a valuable resource for your projects. Happy coding and may your models achieve high accuracy!
This research study was conducted to analyze the (potential) relationship between hardware and data set sizes. 100 data scientists from France between Jan-2016 and Aug-2016 were interviewed in order to have exploitable data. Therefore, this sample might not be representative of the true population.
What can you do with the data?
I did not find any past research on a similar scale. You are free to play with this data set. For re-use of this data set outside of Kaggle, please contact the author directly on Kaggle (use "Contact User"). Please mention:
We arbitrarily chose the characteristics used to describe data scientists and data set sizes.
Data set size:
For the data, it uses the following fields (DS = Data Scientist, W = Workstation):
You should expect potential noise in the data set. It might not be "free" of internal contradictions, as with all research.
I am working in the area of privacy-preserving big data publishing. The state-of-the-art approaches were tested on the Adult dataset. I found that the Adult dataset is available at the UCI repository, but a synthetic version wasn't available anywhere. As I am working with big data, I need a large amount of data to justify my contribution. Therefore, I created my own synthetic versions of the dataset with 100 thousand, 1 million, 10 million, and 100 million records. Here I am sharing the original Adult dataset with approximately 33 thousand records, along with the synthesized versions Adult100k, Adult1m, Adult10m, and Adult100m.
Adult dataset contains census information.
I would like to thank the UCI repository for providing the base dataset, without which I would not have been able to synthesize the large datasets.
The datasets might be helpful to all those who want to work on big data privacy.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Grammar Error Correction dataset synthesized based on: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
This dataset contains roughly 185 million sentence pairs generated using the C4/en/3.0.1 dataset.
The data is stored in the format:
{
"input": "This is an grammatically wrong sentences.",
"output": "This is a grammatically correct sentence."
}
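A minimal reading sketch, assuming the pairs are stored one JSON object per line; the file name is a placeholder for the actual shard names in the dataset.
import json

with open("c4_200m_gec.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        pair = json.loads(line)
        print(pair["input"], "->", pair["output"])
        if i == 4:
            break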
The C4 dataset was downloaded from allenai: https://github.com/allenai/allennlp/discussions/5056 The modified scripts used to generate the sentence pairs were referenced from: https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction.
We hope this dataset will help others by saving them the trouble and time of generating it themselves.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides a detailed look into transactional behavior and financial activity patterns, ideal for exploring fraud detection and anomaly identification. It contains 2,512 samples of transaction data, covering various transaction attributes, customer demographics, and usage patterns. Each entry offers comprehensive insights into transaction behavior, enabling analysis for financial security and fraud detection applications.
Key Features:
This dataset is ideal for data scientists, financial analysts, and researchers looking to analyze transactional patterns, detect fraud, and build predictive models for financial security applications. The dataset was designed for machine learning and pattern analysis tasks and is not intended as a primary data source for academic publications.
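For the anomaly-detection use case, a minimal unsupervised sketch with scikit-learn; the file name is a placeholder, and using all numeric columns is a generic default rather than the dataset's documented schema.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("bank_transactions.csv")
features = df.select_dtypes("number").fillna(0)

model = IsolationForest(contamination=0.02, random_state=0)
df["anomaly"] = model.fit_predict(features)  # -1 = flagged as anomalous

print(df["anomaly"].value_counts())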
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains 250 million rows of information from the ~500 bike stations of the Barcelona public bicycle sharing service (Bicing). The data consists of time series information on the electric and mechanical bicycles available approximately every 4 minutes, from March 2019 to March 2024 (the latest available CSV file, with the idea of being updated with every new month's file). This data could inspire many different use cases, from geographical data analysis to hierarchical ML time series models or Graph Neural Networks, among others. Feel free to create a New Notebook from this page to use it and share your ideas with everyone!
Stations map: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3317928%2F64409b5bd3c220993e05f5e155fd8c25%2Fstations_map_2024.png?generation=1713725887609128&alt=media
Every month's information is stored in a separate file named {year}_{month}_STATIONS.csv. The metadata for every station has been simplified and compressed into the {year}_INFO.csv files, where there is a single entry for every station and day, in a separate file for every year.
The original data contains various errors. A few of them have already been corrected, but there are still some missing values, columns with wrong data types, and other minor artifacts or missing data. From time to time I may manually correct more of these.
The data is collected from the public BCN Open Data website, which is available to everyone (some resources require creating a free account and token):
Stations data: https://opendata-ajuntament.barcelona.cat/data/en/dataset/estat-estacions-bicing
Stations info: https://opendata-ajuntament.barcelona.cat/data/en/dataset/informacio-estacions-bicing
You can find more information in them.
Please, consider upvoting this dataset if you find it interesting! 🤗
Some observations:
The historical data for June '19 does not have data for the 20th between 7:40 am and 2:00 pm.
The historical data for July '19 does not have data from the 26th at 1:30 pm until the 29th at 10:40 am.
The historical data for November '19 may not have some data from 10:00 pm on the 26th to 11:00 am on the 27th.
The historical data for August '20 does not have data from the 7th at 2:25 am until the 10th at 10:40 am.
The historical data for November '20 does not have data on the following days/times: the 4th from 1:45 am to 11:05 am; the 20th at 7:50 pm to the 21st at 10:50 am; the 27th at 2:50 am to the 30th at 9:50 am.
The historical data for August '23 does not have data from the 22nd to the 31st due to a technical incident.
The historical data for September '23 does not have data from the 1st to the 5th due to a technical incident.
The historical data for February '24 does not have data on the 5th between 12:50 pm and 1:05 pm.
Others: due to COVID-19 measures, the Bicing service was temporarily stopped; this situation is reflected in the historical data.
Field Description:
Array of data for each station:
station_id: Identifier of the station
num_bikes_available: Number of available bikes
num_bikes_available_types: Array of types of available bikes
mechanical: Number of available mechanical bikes
ebike: Number of available electric bikes
num_docks_available: Number of available docks
is_installed: The station is properly installed (0-NO, 1-YES)
is_renting: The station is providing bikes correctly
is_returning: The station is docking bikes correctly
last_reported: Timestamp of the station information
is_charging_station: The station has electric bike charging capacity
status: Status of the station (IN_SERVICE=In service, CLOSED=Closed)
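A minimal loading sketch using the file naming pattern and fields described above; treating last_reported as a UNIX timestamp is an assumption, so verify it against the actual data.
import pandas as pd

df = pd.read_csv("2023_01_STATIONS.csv")

station = df[df["station_id"] == 1].copy()
station["last_reported"] = pd.to_datetime(station["last_reported"], unit="s")

# Hourly average of available bikes for this station
hourly = (station.set_index("last_reported")["num_bikes_available"]
          .resample("1h").mean())
print(hourly.head())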
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
Stock market data can be interesting to analyze and as a further incentive, strong predictive models can have large financial payoff. The amount of financial data on the web is seemingly endless. A large and well structured dataset on a wide array of companies can be hard to come by. Here I provide a dataset with historical stock prices (last 5 years) for all companies currently found on the S&P 500 index.
The script I used to acquire all of these .csv files can be found in this GitHub repository In the future if you wish for a more up to date dataset, this can be used to acquire new versions of the .csv files.
The data is presented in a couple of formats to suit different individual's needs or computational limitations. I have included files containing 5 years of stock data (in the all_stocks_5yr.csv and corresponding folder) and a smaller version of the dataset (all_stocks_1yr.csv) with only the past year's stock data for those wishing to use something more manageable in size.
The folder individual_stocks_5yr contains files of data for individual stocks, labelled by their stock ticker name. The all_stocks_5yr.csv and all_stocks_1yr.csv contain this same data, presented in merged .csv files. Depending on the intended use (graphing, modelling etc.) the user may prefer one of these given formats.
All the files have the following columns:
Date - in format yy-mm-dd
Open - price of the stock at market open (this is NYSE data, so all prices are in USD)
High - highest price reached that day
Low - lowest price reached that day
Close - closing price for that day
Volume - number of shares traded
Name - the stock's ticker name
I scraped this data from Google finance using the python library 'pandas_datareader'. Special thanks to Kaggle, Github and The Market.
This dataset lends itself to some very interesting visualizations. One can look at simple things like how prices change over time, graph and compare multiple stocks at once, or generate and graph new metrics from the data provided. From these data, informative stock statistics such as volatility and moving averages can easily be calculated. The million dollar question is: can you develop a model that can beat the market and allow you to make statistically informed trades?
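A minimal sketch of the moving-average and volatility calculations mentioned above, using the column names listed earlier; AAPL is just an example ticker.
import pandas as pd

df = pd.read_csv("all_stocks_5yr.csv", parse_dates=["Date"])

aapl = df[df["Name"] == "AAPL"].sort_values("Date").set_index("Date")
aapl["ma_20"] = aapl["Close"].rolling(20).mean()               # 20-day moving average
aapl["vol_20"] = aapl["Close"].pct_change().rolling(20).std()  # 20-day rolling volatility

print(aapl[["Close", "ma_20", "vol_20"]].tail())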
CC0 1.0 (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to simulate a market basket dataset, providing insights into customer purchasing behavior and store operations. The dataset facilitates market basket analysis, customer segmentation, and other retail analytics tasks. Here's more information about the context and inspiration behind this dataset:
Context:
Retail businesses, from supermarkets to convenience stores, are constantly seeking ways to better understand their customers and improve their operations. Market basket analysis, a technique used in retail analytics, explores customer purchase patterns to uncover associations between products, identify trends, and optimize pricing and promotions. Customer segmentation allows businesses to tailor their offerings to specific groups, enhancing the customer experience.
Inspiration:
The inspiration for this dataset comes from the need for accessible and customizable market basket datasets. While real-world retail data is sensitive and often restricted, synthetic datasets offer a safe and versatile alternative. Researchers, data scientists, and analysts can use this dataset to develop and test algorithms, models, and analytical tools.
Dataset Information:
The columns provide information about the transactions, customers, products, and purchasing behavior, making the dataset suitable for various analyses, including market basket analysis and customer segmentation. Here's a brief explanation of each column in the Dataset:
Use Cases:
Note: This dataset is entirely synthetic and was generated using the Python Faker library, which means it doesn't contain real customer data. It's designed for educational and research purposes.
This dataset was created by FORSEES WRITING
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
PLEASE UPVOTE IF YOU LIKE THIS CONTENT! 😍
Duolingo is an American educational technology company that produces learning apps and provides language certification. Its main app is considered the most popular language learning app in the world.
To progress in their learning journey, each user of the application needs to complete a set of lessons in which they are presented with the words of the language they want to learn. In an infinite set of lessons, each word is applied in a different context and, on top of that, Duolingo uses a spaced repetition approach, where the user sees an already known word again to reinforce their learning.
Each line in this file refers to a Duolingo lesson that had a target word to practice.
The columns are as follows:
p_recall - proportion of exercises from this lesson/practice where the word/lexeme was correctly recalled
timestamp - UNIX timestamp of the current lesson/practice
delta - time (in seconds) since the last lesson/practice that included this word/lexeme
user_id - student user ID who did the lesson/practice (anonymized)
learning_language - language being learned
ui_language - user interface language (presumably native to the student)
lexeme_id - system ID for the lexeme tag (i.e., word)
lexeme_string - lexeme tag (see below)
history_seen - total times user has seen the word/lexeme prior to this lesson/practice
history_correct - total times user has been correct for the word/lexeme prior to this lesson/practice
session_seen - times the user saw the word/lexeme during this lesson/practice
session_correct - times the user got the word/lexeme correct during this lesson/practice
The lexeme_string column contains a string representation of the "lexeme tag" used by Duolingo for each lesson/practice (data instance) in our experiments. The lexeme_string field uses the following format:
`surface-form/lemma
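A minimal sketch relating prior accuracy to in-session recall, using the columns above; the CSV file name is a placeholder for the actual file in this dataset.
import pandas as pd

df = pd.read_csv("duolingo_learning_traces.csv")
df = df[df["history_seen"] > 0]
df["history_accuracy"] = df["history_correct"] / df["history_seen"]

# Average in-session recall, bucketed by prior accuracy
print(df.groupby(pd.cut(df["history_accuracy"], bins=5))["p_recall"].mean())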