GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.
This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to learn how to safely manage analyzing large BigQuery datasets.
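As a minimal sketch of such a query (the sample_commits table and its repo_name column are assumptions about the public schema; substitute any table from the dataset):

import h5py  # noqa: placeholder removed
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT repo_name, COUNT(*) AS n_commits
    FROM `bigquery-public-data.github_repos.sample_commits`  -- assumed table name
    GROUP BY repo_name
    ORDER BY n_commits DESC
    LIMIT 10
"""
top_repos = client.query(query).to_dataframe()
print(top_repos)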
This dataset was made available per GitHub's terms of service. It is also available as GitHub Activity Data on Google Cloud Platform's Marketplace, as part of GCP Public Datasets.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains all stories and comments from Hacker News since its launch in 2006. Each story contains a story ID, the author who made the post, when it was written, and the number of points the story received. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".
Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.hacker_news.[TABLENAME]. Fork this kernel to get started.
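As a hedged sketch (the full table and its by/score/type columns are assumptions about this dataset's schema), a query counting stories and points per author might look like this:

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT `by` AS author, COUNT(*) AS n_stories, SUM(score) AS total_score
    FROM `bigquery-public-data.hacker_news.full`  -- assumed table/column names
    WHERE type = 'story' AND `by` IS NOT NULL
    GROUP BY author
    ORDER BY n_stories DESC
    LIMIT 20
"""
top_posters = client.query(query).to_dataframe()
print(top_posters)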
This dataset was kindly made publicly available by Hacker News under the MIT license.
Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?
Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?
Is the amount of coverage by Hacker News predictive of a startup’s success?
The aim of this dataset is to provide a simple way to get started with 3D computer vision problems such as 3D shape recognition.
Accurate 3D point clouds can nowadays be acquired easily and cheaply from different sources:
However, there is a lack of large 3D datasets (you can find a good one here, based on triangular meshes); it is especially hard to find datasets based on point clouds (which are the raw output of every 3D sensing device).
This dataset contains 3D point clouds generated from the original images of the MNIST dataset, to give people used to working with 2D datasets (images) a familiar introduction to 3D.
In the 3D_from_2D notebook you can find the code used to generate the dataset.
You can use the code in the notebook to generate a bigger 3D dataset from the original.
The entire dataset is stored as 4096-D vectors obtained from the voxelization (x:16, y:16, z:16) of all the 3D point clouds.
In addition to the original point clouds, it contains randomly rotated copies with added noise.
The full dataset is split into arrays:
Example python code reading the full dataset:
import h5py

# Load the voxelized 4096-D vectors and their labels.
with h5py.File("../input/full_dataset_vectors.h5", "r") as hf:
    X_train = hf["X_train"][:]
    y_train = hf["y_train"][:]
    X_test = hf["X_test"][:]
    y_test = hf["y_test"][:]
5000 (train), and 1000 (test) 3D point clouds stored in HDF5 file format. The point clouds have zero mean and a maximum dimension range of 1.
Each file is divided into HDF5 groups. Each group is named after its corresponding array index in the original MNIST dataset and contains:
x, y, z: coordinates of each 3D point in the point cloud.
nx, ny, nz: components of the unit normal associated with each point.
Example python code reading 2 digits and storing some of the group content in tuples:
import h5py

# Each group is named after its index in the original MNIST arrays.
with h5py.File("../input/train_point_clouds.h5", "r") as hf:
    a = hf["0"]
    b = hf["1"]
    # (original 2D image, point cloud, digit label)
    digit_a = (a["img"][:], a["points"][:], a.attrs["label"])
    digit_b = (b["img"][:], b["points"][:], b.attrs["label"])
A simple Python class that generates a grid of voxels from a 3D point cloud. Check the kernel for usage.
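This is not the class from the kernel; a minimal sketch of the idea (binning an (N, 3) array of x, y, z coordinates into a 16x16x16 grid and flattening it to a 4096-D vector, matching the vectors described above) could look like this:

import numpy as np

class SimpleVoxelGrid:
    # Minimal sketch (not the kernel's class): bins an (N, 3) point cloud into an
    # x_res * y_res * z_res grid and flattens it (16 x 16 x 16 -> 4096-D vector).
    def __init__(self, points, x_res=16, y_res=16, z_res=16):
        self.points = np.asarray(points, dtype=np.float64)
        resolution = (x_res, y_res, z_res)
        mins = self.points.min(axis=0)
        maxs = self.points.max(axis=0)
        # Bin edges spanning the bounding box of the cloud along each axis.
        self.edges = [np.linspace(mins[i], maxs[i], resolution[i] + 1)
                      for i in range(3)]

    def vector(self, normalize=True):
        # Count how many points fall into each voxel, then flatten.
        grid, _ = np.histogramdd(self.points, bins=self.edges)
        if normalize and grid.max() > 0:
            grid /= grid.max()
        return grid.ravel()

For example, vec = SimpleVoxelGrid(digit_a[1][:, :3]).vector() voxelizes the first digit read above (the slice keeps only the x, y, z columns in case the normals are stored alongside the coordinates).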
A module with functions to plot point clouds and voxel grids inside a Jupyter notebook. You have to run this locally because Kaggle's notebook environment does not support rendering IFrames; see the GitHub issue here.
Functions included:
array_to_color: converts a 1D array to RGB values for use as the color kwarg in plot_points().
plot_points(xyz, colors=None, size=0.1, axis=False)
plot_voxelgrid(v_grid, cmap="Oranges", axis=False)
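Assuming the module has been downloaded and is importable locally (plot3D below is only a placeholder for the module's actual name, which is not given here), usage might look like this:

import numpy as np
from plot3D import array_to_color, plot_points  # placeholder module name

xyz = np.random.rand(500, 3)        # placeholder (N, 3) point cloud
colors = array_to_color(xyz[:, 2])  # color the points by their z value
plot_points(xyz, colors=colors, size=0.1, axis=False)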
Splash banner image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F793761%2F38c53e7b60068a9b81d56c9abd962c62%2Fpepe_card.png?generation=1591424357125362&alt=media
Rare Pepes are sold on blockchain networks: the internet at its finest. This market has sold over $1.2 million worth of images, which is about 100 million in Pepe Cash, a cryptocurrency.
This data was organized by fivethirtyeight for this story and collected and published by the Rare Pepe Foundation.
Recommended links
License
License was not specified at the source
Splash banner
Splash banner by unknown (just like the origin of so many pepes).
https://www.reddit.com/wiki/api
140k+ acronyms were mined from science, tech, bio, and future-leaning subreddits. This was done with PRAW and compiled into a .csv file. This dataset was originally mined to be a learning tool, used to illustrate pandas groupings, visualizations, and count-based time series. If the dataset is refined enough, it might be possible to use it for prediction.
A PRAW (Python 3 library) script was used to mine the data from a hand-selected list of subreddits. Science- and tech-themed subreddits were focused on, as they tend to have higher quality content. To expand the list, a subreddit graph explorer was used to get a better view of the Sci/Tech subreddit network. Subreddits were excluded according to the following criteria:
(1) Too few submissions and/or users.
(2) Too esoteric, niche, or a subset of a much larger subreddit (example: pennystocks is a subset of stocks, in terms of content scope).
(3) Satirical, politicized, or highly valenced in content (example: pcmasterrace).
Some of these points depend on human interpretation, which may introduce bias into the data. See the subreddit.txt file for the list of selected subreddits. For each subreddit, up to 1000 submissions had their comment trees fully populated, and each comment was scanned for acronyms 3 to 7 letters in length. Associated information was then compiled and written to a csv file. The format of the data table is below:
commID: Reddit Comment ID (base 36 integer) (primary key)
time: Unix timestamp of the comment the acronym appears in. (float)
user: username of the person who made the comment. (string)
subreddit: name of the subreddit the acronym appears in. (string)
acronym: The term itself. (string)
See the kernel for more details.
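This is not the author's script; a rough sketch of the mining approach described above (placeholder credentials and subreddit list, and a simple all-caps regex for 3 to 7 letter acronyms) might look like this:

import csv
import re
import praw  # Python Reddit API Wrapper

# Placeholder credentials; see https://www.reddit.com/wiki/api for access.
reddit = praw.Reddit(client_id="YOUR_ID",
                     client_secret="YOUR_SECRET",
                     user_agent="acronym-miner-sketch")

ACRONYM_RE = re.compile(r"\b[A-Z]{3,7}\b")  # 3-7 letter all-caps tokens

with open("acronyms.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["commID", "time", "user", "subreddit", "acronym"])
    for sub_name in ["science", "technology"]:  # placeholder subreddit list
        for submission in reddit.subreddit(sub_name).top(limit=1000):
            submission.comments.replace_more(limit=None)  # fully populate the comment tree
            for comment in submission.comments.list():
                for acronym in ACRONYM_RE.findall(comment.body or ""):
                    writer.writerow([comment.id, comment.created_utc,
                                     str(comment.author), sub_name, acronym])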
To reference this data set, use the following information: al-Baroudi, S. (2019, June). Reddit Sci/Tech Acronyms Dataset, Version 1. Retrieved (current date)
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains information about the streets and street parking signals in the City of Montreal. They are obtained from the City's portal at http://donnees.ville.montreal.qc.ca/group/transport.
In this dataset, you will see three different files relevant to our problem of interest.
gbdouble.json: a GeoJSON file that contains the geographical coordinates of each side of every street in the City. Each street side is described by a number of line segments (the coordinates of a number of points). To see how to open and read the coordinates of the street segments, have a look at https://www.kaggle.com/mnabaee/d/mnabaee/mtlstreetparking/load-street-side-coordinates/
signalisation.csv: a CSV file that includes all of the parking signals in the City. Each row has the latitude, longitude, and a text description, as well as a number of other fields. The text field may need to be parsed/digested to understand what the signal means.
signalisation.pdf: a PDF that contains a picture of each signal and its corresponding signal code (which can be matched with the codes provided in signalisation.csv).
The main goal in analyzing this dataset is to create an interactive map of the street parking spots at a given time (interval). The available parking spots should be found by analyzing the signals on a street and digesting their meaning.
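As a minimal sketch of a starting point (the latitude, longitude, and description column names below are assumptions, so check the CSV header; folium is just one possible mapping library):

import json
import folium
import pandas as pd

# Street-side geometries (GeoJSON line segments).
with open("../input/gbdouble.json") as f:
    streets = json.load(f)

# Parking signals; the column names used below are assumptions.
signals = pd.read_csv("../input/signalisation.csv")
print(signals.columns)

m = folium.Map(location=[45.5088, -73.5617], zoom_start=13)  # downtown Montreal
for _, row in signals.head(500).iterrows():  # subsample to keep the map light
    folium.CircleMarker(location=[row["latitude"], row["longitude"]], radius=2,
                        popup=str(row.get("description", ""))).add_to(m)
m  # displays the interactive map in a notebook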
A preliminary analysis of the dataset is available on GitHub.
Final result: https://raw.githubusercontent.com/mnabaee/kernels/master/mtl-street-parking/finalres.png