6 datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analysis of large BigQuery datasets.
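
    As a minimal sketch of such a query (assuming the standard google-cloud-bigquery client and a licenses table with license and repo_name columns; check the dataset's table list for the actual schema):

    from google.cloud import bigquery

    # Count repositories per license across the GitHub Repos dataset.
    # Table and column names here are assumptions; verify them in BigQuery first.
    client = bigquery.Client()
    query = """
        SELECT license, COUNT(repo_name) AS repo_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        ORDER BY repo_count DESC
    """
    job_config = bigquery.QueryJobConfig()
    job_config.maximum_bytes_billed = 10**10  # cap bytes billed to keep the query cheap
    for row in client.query(query, job_config=job_config).result():
        print(row.license, row.repo_count)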

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. Data from: Hacker News

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Hacker News (2019). Hacker News [Dataset]. https://www.kaggle.com/hacker-news/hacker-news
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset authored and provided by
    Hacker News (http://news.ycombinator.com/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains all stories and comments from Hacker News since its launch in 2006. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

    Content

    Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

    Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.hacker_news.[TABLENAME]. Fork this kernel to get started.
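
    As a minimal sketch (assuming a full table with type, title, and score columns; verify the actual table names and schema in the dataset):

    from google.cloud import bigquery

    # Ten highest-scoring Hacker News stories.
    # The table name "full" and its columns are assumptions; check the dataset's schema.
    client = bigquery.Client()
    query = """
        SELECT title, score
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story' AND score IS NOT NULL
        ORDER BY score DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.score, row.title)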

    Acknowledgements

    This dataset was kindly made publicly available by Hacker News under the MIT license.

    Inspiration

    • Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?

    • Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?

    • Is the amount of coverage by Hacker News predictive of a startup’s success?

  3. 3D MNIST

    • kaggle.com
    Updated Oct 18, 2019
    Cite
    David de la Iglesia Castro (2019). 3D MNIST [Dataset]. https://www.kaggle.com/daavoo/3d-mnist/Kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    David de la Iglesia Castro
    Description

    Context

    The aim of this dataset is to provide a simple way to get started with 3D computer vision problems such as 3D shape recognition.

    Accurate 3D point clouds can nowadays be acquired easily and cheaply from a variety of sources.

    However, there is a lack of large 3D datasets (you can find a good one here based on triangular meshes); it is especially hard to find datasets based on point clouds (which are the raw output of every 3D sensing device).

    This dataset contains 3D point clouds generated from the original images of the MNIST dataset, to give people who are used to working with 2D datasets (images) a familiar introduction to 3D.

    In the 3D_from_2D notebook you can find the code used to generate the dataset.

    You can use the code in the notebook to generate a bigger 3D dataset from the original.

    Content

    full_dataset_vectors.h5

    The entire dataset stored as 4096-D vectors obtained from the voxelization (x:16, y:16, z:16) of all the 3D point clouds.

    In addition to the original point clouds, it contains randomly rotated copies with noise.

    The full dataset is split into arrays:

    • X_train (10000, 4096)
    • y_train (10000)
    • X_test (2000, 4096)
    • y_test (2000)

    Example Python code for reading the full dataset:

    import h5py

    # read the voxelized vectors and labels from full_dataset_vectors.h5
    with h5py.File("../input/full_dataset_vectors.h5", "r") as hf:
        X_train = hf["X_train"][:]
        y_train = hf["y_train"][:]
        X_test = hf["X_test"][:]
        y_test = hf["y_test"][:]

    train_point_clouds.h5 & test_point_clouds.h5

    5000 (train) and 1000 (test) 3D point clouds stored in HDF5 file format. The point clouds have zero mean and a maximum dimension range of 1.

    Each file is divided into HDF5 groups.

    Each group is named after its corresponding array index in the original MNIST dataset, and it contains:

    • "points" dataset: x, y, z coordinates of each 3D point in the point cloud.
    • "normals" dataset: nx, ny, nz components of the unit normal associate to each point.
    • "img" dataset: the original mnist image.
    • "label" attribute: the original mnist label.

    Example Python code for reading two digits and storing some of the group content in tuples:

    import h5py

    with h5py.File("../input/train_point_clouds.h5", "r") as hf:
        a = hf["0"]
        b = hf["1"]
        digit_a = (a["img"][:], a["points"][:], a.attrs["label"])
        digit_b = (b["img"][:], b["points"][:], b.attrs["label"])

    voxelgrid.py

    A simple Python class that generates a grid of voxels from a 3D point cloud. Check the kernel for usage; a rough sketch of the idea follows.
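
    As a minimal sketch of what such a voxelization does (not the actual voxelgrid.py implementation; the 16x16x16 resolution matches the 4096-D vectors described above):

    import numpy as np

    def voxelize(points, n=16):
        # points: (N, 3) array of x, y, z coordinates of one point cloud
        mins, maxs = points.min(axis=0), points.max(axis=0)
        # map each coordinate to a bin index in [0, n-1]
        idx = np.floor((points - mins) / (maxs - mins + 1e-9) * n).astype(int)
        idx = np.clip(idx, 0, n - 1)
        grid = np.zeros((n, n, n))
        for x, y, z in idx:
            grid[x, y, z] += 1              # count points per voxel
        return (grid / grid.max()).ravel()  # normalized 4096-D vector for n=16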

    plot3D.py

    Module with functions to plot point clouds and voxel grids inside a Jupyter notebook. You have to run this locally because Kaggle notebooks do not support rendering iframes. See the GitHub issue here.

    Functions included:

    • array_to_color: converts a 1D array to RGB values for use as the colors kwarg in plot_points()

    • plot_points(xyz, colors=None, size=0.1, axis=False)

    • plot_voxelgrid(v_grid, cmap="Oranges", axis=False)
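
    A hypothetical local usage example, assuming only the signatures listed above and the HDF5 group layout described earlier:

    import h5py
    from plot3D import array_to_color, plot_points

    # load the point cloud of the first training digit
    with h5py.File("train_point_clouds.h5", "r") as hf:
        points = hf["0"]["points"][:]

    # color each point by its z coordinate and render the interactive plot
    colors = array_to_color(points[:, 2])
    plot_points(points, colors=colors, size=0.1)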

    Acknowledgements

    Have fun!

  4. Rare Pepes Stock Prices

    • kaggle.com
    Updated Jun 6, 2020
    Cite
    Larxel (2020). Rare Pepes Stock Prices [Dataset]. https://www.kaggle.com/andrewmvd/rare-pepes/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Larxel
    Description


    About this dataset

    Rare Pepes are sold within blockchain networks (the internet at its finest). This market has sold over $1.2 million worth of images; that's about 100 million in Pepe Cash, a cryptocurrency.

    How to use this dataset

    Acknowledgments

    This data was organized by fivethirtyeight for this story and collected and published by the Rare Pepe Foundation.

    Recommended links

    License

    License was not specified at the source

    Splash banner

    Splash banner by unknown (just like the origin of so many pepes).

  5. Reddit Sci/Tech Acronyms

    • kaggle.com
    Updated Jun 10, 2019
    Cite
    salbaroudi (2019). Reddit Sci/Tech Acronyms [Dataset]. https://www.kaggle.com/salbaroudi/reddit-scitech-acronyms/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 10, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    salbaroudi
    License

    https://www.reddit.com/wiki/api

    Description

    Introduction:

    140k+ acronyms were mined from science, tech, bio, and future-leaning subreddits. This was done with PRAW and compiled into a .csv file. This dataset was originally mined to be a learning tool, used to illustrate pandas groupings, visualizations, and count-based time series. If the dataset is refined enough, it might be possible to use it for prediction.

    Data Acquisition (Codebook):

    A PRAW (Python 3 library) script was used to mine the data from a hand-selected list of subreddits. Science- and tech-themed subreddits were the focus, as they tend to have higher-quality content. To expand the list, a subreddit graph explorer was used to get a better view of the Sci/Tech subreddit network. Subreddits were excluded according to the following criteria:

    (1) Too few submissions and/or users.

    (2) Too esoteric, niche, or a subset of a much larger subreddit (example: pennystocks is a subset of stocks, in terms of content scope).

    (3) Satirical, politicized, or highly valenced in content (example: pcmasterrace).

    Some of these points depend on human interpretation, which may introduce bias into the data. See the subreddit.txt file for the list of subreddits selected. For each subreddit, up to 1000 submissions had their comment trees fully populated, and each comment was scanned for acronyms 3 to 7 letters in length. The associated information was then compiled and written to a CSV file (a rough sketch of this procedure follows the field list below). The format of the data table is below:

    commID: Reddit comment ID (base-36 integer) (primary key)

    time: Unix timestamp of the comment the acronym appears in. (float)

    user: username of the person making the comment. (string)

    subreddit: name of the subreddit the acronym appears in. (string)

    acronym: the term itself. (string)
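
    As a rough, hypothetical sketch of the mining procedure described above (placeholder credentials and subreddit names; the real script and acronym rules may differ):

    import csv
    import re
    import praw

    ACRONYM = re.compile(r"\b[A-Z]{3,7}\b")  # crude rule: runs of 3-7 capital letters

    reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="acronym-miner")

    with open("acronyms.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["commID", "time", "user", "subreddit", "acronym"])
        for sub_name in ["science", "technology"]:             # see subreddit.txt for the real list
            for submission in reddit.subreddit(sub_name).hot(limit=1000):
                submission.comments.replace_more(limit=None)   # fully populate the comment tree
                for comment in submission.comments.list():
                    for acronym in ACRONYM.findall(comment.body or ""):
                        writer.writerow([comment.id, comment.created_utc,
                                         str(comment.author), sub_name, acronym])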

    Data Statistics and Facts:

    See the kernel for more details.

    References:

    To reference this data set, use the following information: al-Baroudi, S. (2019, June). Reddit Sci/Tech Acronyms Dataset, Version 1. Retrieved (current date)

  6. Montreal Street Parking

    • kaggle.com
    Updated Nov 18, 2019
    Cite
    Mahdy Nabaee (2019). Montreal Street Parking [Dataset]. https://www.kaggle.com/mnabaee/mtlstreetparking/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mahdy Nabaee
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Montreal
    Description

    This dataset contains information about the streets and street parking signals in the City of Montreal. They are obtained from the City's portal at http://donnees.ville.montreal.qc.ca/group/transport.

    In this database, you will see three different files relevant to our problem of interest.

    gbdouble.json: This is a GeoJSON file which contains the geographical coordinates of each side of the streets in the City. Each street side is described by a number of line segments (the coordinates of a number of points). To open and read the coordinates of the street segments, you can have a look at https://www.kaggle.com/mnabaee/d/mnabaee/mtlstreetparking/load-street-side-coordinates/
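
    As a minimal sketch for reading the street-side coordinates (assuming a standard GeoJSON FeatureCollection of LineString features; verify against the actual file):

    import json

    with open("../input/gbdouble.json") as f:
        streets = json.load(f)

    # print the first few street sides; each is a list of (longitude, latitude) points
    for feature in streets.get("features", [])[:5]:
        geometry = feature.get("geometry", {})
        if geometry.get("type") == "LineString":
            print(geometry["coordinates"])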

    signalisation.csv: This CSV file includes all of the parking signals in the City. Each row has the latitude, longitude, and a text description, as well as a number of other fields. One may need to parse/digest the text description field to understand what the signal means.

    signalisation.pdf: This PDF contains a picture of each signal and its corresponding signal code (which can be matched against the codes provided in signalisation.csv).
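
    A minimal, hypothetical sketch for getting started with signalisation.csv (the column names are not documented above, so they are inspected at runtime rather than assumed):

    import pandas as pd

    # load the parking-signal table and inspect its schema before parsing the text field
    signals = pd.read_csv("../input/signalisation.csv")
    print(signals.columns.tolist())
    print(signals.head())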

    The main goal in analyzing this dataset is to create an interactive map of the street parking spots at a given time (interval). The available parking spots should be found by analyzing the signals in a street and digesting their meaning.

    Preliminary work on the dataset is available on GitHub; the resulting map is linked below.

    Final result image: https://raw.githubusercontent.com/mnabaee/kernels/master/mtl-street-parking/finalres.png

