6 datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and learn how to safely manage analysis of large BigQuery datasets.
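
    As a minimal sketch of such a query (assuming the standard google-cloud-bigquery client and a licenses table with license and repo_name columns; check the dataset's table list for the actual schema):

    from google.cloud import bigquery

    # Count repositories per license across the GitHub Repos dataset.
    # Table and column names here are assumptions; verify them in BigQuery first.
    client = bigquery.Client()
    query = """
        SELECT license, COUNT(repo_name) AS repo_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        ORDER BY repo_count DESC
    """
    job_config = bigquery.QueryJobConfig()
    job_config.maximum_bytes_billed = 10**10  # cap bytes billed to keep the query cheap
    for row in client.query(query, job_config=job_config).result():
        print(row.license, row.repo_count)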

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. Data from: Hacker News

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Hacker News (2019). Hacker News [Dataset]. https://www.kaggle.com/hacker-news/hacker-news
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset authored and provided by
    Hacker News (http://news.ycombinator.com/)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains all stories and comments from Hacker News since its launch in 2006. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

    Content

    Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

    Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.hacker_news.[TABLENAME]. Fork this kernel to get started.
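
    As a minimal sketch (assuming a full table with type, title, and score columns; verify the actual table names and schema in the dataset):

    from google.cloud import bigquery

    # Ten highest-scoring Hacker News stories.
    # The table name "full" and its columns are assumptions; check the dataset's schema.
    client = bigquery.Client()
    query = """
        SELECT title, score
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'story' AND score IS NOT NULL
        ORDER BY score DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.score, row.title)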

    Acknowledgements

    This dataset was kindly made publicly available by Hacker News under the MIT license.

    Inspiration

    • Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?

    • Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?

    • Is the amount of coverage by Hacker News predictive of a startup’s success?

  3. 3D MNIST

    • kaggle.com
    Updated Oct 18, 2019
    Cite
    David de la Iglesia Castro (2019). 3D MNIST [Dataset]. https://www.kaggle.com/daavoo/3d-mnist/Kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    David de la Iglesia Castro
    Description

    Context

    The aim of this dataset is to provide a simple way to get started with 3D computer vision problems such as 3D shape recognition.

    Accurate 3D point clouds can nowadays be acquired easily and cheaply from a variety of sources.

    However, there is a lack of large 3D datasets (you can find a good one here based on triangular meshes); it is especially hard to find datasets based on point clouds (which are the raw output of every 3D sensing device).

    This dataset contains 3D point clouds generated from the original images of the MNIST dataset, to give people who are used to working with 2D datasets (images) a familiar introduction to 3D.

    In the 3D_from_2D notebook you can find the code used to generate the dataset.

    You can use the code in the notebook to generate a bigger 3D dataset from the original.

    Content

    full_dataset_vectors.h5

    The entire dataset stored as 4096-D vectors obtained from the voxelization (x:16, y:16, z:16) of all the 3D point clouds.

    In addition to the original point clouds, it contains randomly rotated copies with noise.

    The full dataset is split into arrays:

    • X_train (10000, 4096)
    • y_train (10000)
    • X_test (2000, 4096)
    • y_test (2000)

    Example Python code for reading the full dataset:

    import h5py

    # read the voxelized vectors and labels from full_dataset_vectors.h5
    with h5py.File("../input/full_dataset_vectors.h5", "r") as hf:
        X_train = hf["X_train"][:]
        y_train = hf["y_train"][:]
        X_test = hf["X_test"][:]
        y_test = hf["y_test"][:]

    train_point_clouds.h5 & test_point_clouds.h5

    5000 (train) and 1000 (test) 3D point clouds stored in HDF5 file format. The point clouds have zero mean and a maximum dimension range of 1.

    Each file is divided into HDF5 groups.

    Each group is named after its corresponding array index in the original MNIST dataset, and it contains:

    • "points" dataset: x, y, z coordinates of each 3D point in the point cloud.
    • "normals" dataset: nx, ny, nz components of the unit normal associate to each point.
    • "img" dataset: the original mnist image.
    • "label" attribute: the original mnist label.

    Example Python code for reading two digits and storing some of the group content in tuples:

    import h5py

    with h5py.File("../input/train_point_clouds.h5", "r") as hf:
        a = hf["0"]
        b = hf["1"]
        digit_a = (a["img"][:], a["points"][:], a.attrs["label"])
        digit_b = (b["img"][:], b["points"][:], b.attrs["label"])

    voxelgrid.py

    A simple Python class that generates a grid of voxels from a 3D point cloud. Check the kernel for usage; a rough sketch of the idea follows.
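
    As a minimal sketch of what such a voxelization does (not the actual voxelgrid.py implementation; the 16x16x16 resolution matches the 4096-D vectors described above):

    import numpy as np

    def voxelize(points, n=16):
        # points: (N, 3) array of x, y, z coordinates of one point cloud
        mins, maxs = points.min(axis=0), points.max(axis=0)
        # map each coordinate to a bin index in [0, n-1]
        idx = np.floor((points - mins) / (maxs - mins + 1e-9) * n).astype(int)
        idx = np.clip(idx, 0, n - 1)
        grid = np.zeros((n, n, n))
        for x, y, z in idx:
            grid[x, y, z] += 1              # count points per voxel
        return (grid / grid.max()).ravel()  # normalized 4096-D vector for n=16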

    plot3D.py

    Module with functions to plot point clouds and voxel grids inside a Jupyter notebook. You have to run this locally because Kaggle notebooks do not support rendering iframes. See the GitHub issue here.

    Functions included:

    • array_to_color: converts a 1D array to RGB values for use as the colors kwarg in plot_points()

    • plot_points(xyz, colors=None, size=0.1, axis=False)

    • plot_voxelgrid(v_grid, cmap="Oranges", axis=False)
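
    A hypothetical local usage example, assuming only the signatures listed above and the HDF5 group layout described earlier:

    import h5py
    from plot3D import array_to_color, plot_points

    # load the point cloud of the first training digit
    with h5py.File("train_point_clouds.h5", "r") as hf:
        points = hf["0"]["points"][:]

    # color each point by its z coordinate and render the interactive plot
    colors = array_to_color(points[:, 2])
    plot_points(points, colors=colors, size=0.1)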

    Acknowledgements

    Have fun!

  4. Rare Pepes Stock Prices

    • kaggle.com
    Updated Jun 6, 2020
    Cite
    Larxel (2020). Rare Pepes Stock Prices [Dataset]. https://www.kaggle.com/andrewmvd/rare-pepes/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 6, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Larxel
    Description


    About this dataset

    Rare Pepes are sold within blockchain networks (the internet at its finest). This market has sold over $1.2 million worth of images; that's about 100 million in Pepe Cash, a cryptocurrency.

    How to use this dataset

    Acknowledgments

    This data was organized by fivethirtyeight for this story and collected and published by the Rare Pepe Foundation.

    Recommended links

    License

    License was not specified at the source

    Splash banner

    Splash banner by unknown (just like the origin of so many pepes).

  5. Reddit Sci/Tech Acronyms

    • kaggle.com
    Updated Jun 10, 2019
    Cite
    salbaroudi (2019). Reddit Sci/Tech Acronyms [Dataset]. https://www.kaggle.com/salbaroudi/reddit-scitech-acronyms/activity
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 10, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    salbaroudi
    License

    https://www.reddit.com/wiki/api

    Description

    Introduction:

    140k+ acronyms were mined from science, tech, bio, and future-leaning subreddits. This was done with PRAW and compiled into a .csv file. This dataset was originally mined to be a learning tool, used to illustrate pandas groupings, visualizations, and count-based time series. If the dataset is refined enough, it might be possible to use it for prediction.

    Data Acquisition (Codebook):

    A PRAW (Python 3 library) script was used to mine the data from a hand-selected list of subreddits. Science- and tech-themed subreddits were the focus, as they tend to have higher-quality content. To expand the list, a subreddit graph explorer was used to get a better view of the Sci/Tech subreddit network. Subreddits were excluded according to the following criteria:

    (1) Too few submissions and/or users.

    (2) Too esoteric, niche, or a subset of a much larger subreddit (example: pennystocks is a subset of stocks, in terms of content scope).

    (3) Satirical, politicized, or highly valenced in content (example: pcmasterrace).

    Some of these points depend on human interpretation, which may introduce bias into the data. See the subreddit.txt file for the list of subreddits selected. For each subreddit, up to 1000 submissions had their comment trees fully populated, and each comment was scanned for acronyms 3 to 7 letters in length. The associated information was then compiled and written to a CSV file (a rough sketch of this procedure follows the field list below). The format of the data table is below:

    commID: Reddit comment ID (base-36 integer) (primary key)

    time: Unix timestamp of the comment the acronym appears in. (float)

    user: username of the person making the comment. (string)

    subreddit: name of the subreddit the acronym appears in. (string)

    acronym: the term itself. (string)
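
    As a rough, hypothetical sketch of the mining procedure described above (placeholder credentials and subreddit names; the real script and acronym rules may differ):

    import csv
    import re
    import praw

    ACRONYM = re.compile(r"\b[A-Z]{3,7}\b")  # crude rule: runs of 3-7 capital letters

    reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="acronym-miner")

    with open("acronyms.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["commID", "time", "user", "subreddit", "acronym"])
        for sub_name in ["science", "technology"]:             # see subreddit.txt for the real list
            for submission in reddit.subreddit(sub_name).hot(limit=1000):
                submission.comments.replace_more(limit=None)   # fully populate the comment tree
                for comment in submission.comments.list():
                    for acronym in ACRONYM.findall(comment.body or ""):
                        writer.writerow([comment.id, comment.created_utc,
                                         str(comment.author), sub_name, acronym])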

    Data Statistics and Facts:

    See the kernel for more details.

    References:

    To reference this data set, use the following information: al-Baroudi, S. (2019, June). Reddit Sci/Tech Acronyms Dataset, Version 1. Retrieved (current date)

  6. Montreal Street Parking

    • kaggle.com
    Updated Nov 18, 2019
    Cite
    Mahdy Nabaee (2019). Montreal Street Parking [Dataset]. https://www.kaggle.com/mnabaee/mtlstreetparking/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 18, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mahdy Nabaee
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Montreal
    Description

    This dataset contains information about the streets and street parking signals in the City of Montreal. They are obtained from the City's portal at http://donnees.ville.montreal.qc.ca/group/transport.

    In this database, you will see three different files relevant to our problem of interest.

    gbdouble.json: This is a GeoJSON file which contains the geographical coordinates of each side of the streets in the City. Each street side is described by a number of line segments (the coordinates of a number of points). To open and read the coordinates of the street segments, you can have a look at https://www.kaggle.com/mnabaee/d/mnabaee/mtlstreetparking/load-street-side-coordinates/
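
    As a minimal sketch for reading the street-side coordinates (assuming a standard GeoJSON FeatureCollection of LineString features; verify against the actual file):

    import json

    with open("../input/gbdouble.json") as f:
        streets = json.load(f)

    # print the first few street sides; each is a list of (longitude, latitude) points
    for feature in streets.get("features", [])[:5]:
        geometry = feature.get("geometry", {})
        if geometry.get("type") == "LineString":
            print(geometry["coordinates"])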

    signalisation.csv: This CSV file includes all of the parking signals in the City. Each row has the latitude, longitude, and a text description, as well as a number of other fields. One may need to parse/digest the text description field to understand what the signal means.

    signalisation.pdf: This PDF contains a picture of each signal and its corresponding signal code (which can be matched against the codes provided in signalisation.csv).
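
    A minimal, hypothetical sketch for getting started with signalisation.csv (the column names are not documented above, so they are inspected at runtime rather than assumed):

    import pandas as pd

    # load the parking-signal table and inspect its schema before parsing the text field
    signals = pd.read_csv("../input/signalisation.csv")
    print(signals.columns.tolist())
    print(signals.head())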

    The main goal in analyzing this dataset is to create an interactive map of the street parking spots at a given time (interval). The available parking spots should be found by analyzing the signals in a street and digesting their meaning.

    Preliminary work on the dataset is available on GitHub; the resulting map is linked below.

    Final result image: https://raw.githubusercontent.com/mnabaee/kernels/master/mtl-street-parking/finalres.png

