28 datasets found
  1. GitHub Repos

    • kaggle.com
    zip
    Updated Mar 20, 2019
    + more versions
    Cite
    Github (2019). GitHub Repos [Dataset]. https://www.kaggle.com/datasets/github/github-repos
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    GitHub (https://github.com/)
    Authors
    Github
    Description

    GitHub is how people build software and is home to the largest community of open source developers in the world, with over 12 million people contributing to 31 million projects on GitHub since 2008.

    This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains a full snapshot of the content of more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths, and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.github_repos.[TABLENAME]. Fork this kernel to get started and to learn how to safely manage analysis of large BigQuery datasets.
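
    As a starting point, here is a minimal sketch using the google-cloud-bigquery client (preinstalled, with credentials, in Kaggle Kernels). The languages table and its language.name field are taken from the dataset's published schema, but verify them against the current schema before relying on them; the query counts how many repositories contain each language, in the spirit of the "language wars" question below.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Count how many repositories contain each language, using the
    # github_repos.languages table (language is a repeated record).
    query = """
        SELECT l.name AS language, COUNT(*) AS repo_count
        FROM `bigquery-public-data.github_repos.languages`,
             UNNEST(language) AS l
        GROUP BY language
        ORDER BY repo_count DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.language, row.repo_count)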

    Acknowledgements

    This dataset was made available per GitHub's terms of service. This dataset is available via Google Cloud Platform's Marketplace, GitHub Activity Data, as part of GCP Public Datasets.

    Inspiration

    • This is the perfect dataset for fighting language wars.
    • Can you identify any signals that predict which packages or languages will become popular, in advance of their mass adoption?
  2. Open Images

    • kaggle.com
    • opendatalab.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Open Images [Dataset]. https://www.kaggle.com/bigquery/open-images
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context

    Labeled datasets are useful in machine learning research.

    Content

    This public dataset contains approximately 9 million URLs and metadata for images that have been annotated with labels spanning more than 6,000 categories.

    Tables: 1) annotations_bbox 2) dict 3) images 4) labels

    Update Frequency: Quarterly

    Querying BigQuery Tables

    Fork this kernel to get started.
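
    As a hedged starting point, the sketch below assumes the google-cloud-bigquery client and that the dict table maps machine label names to human-readable display names (per the table list above); the exact column names (label_name, label_display_name) are assumptions to verify against the current schema. It addresses the "which labels have 'bus' in their display names?" question from the Inspiration section below.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Look up labels whose human-readable display name mentions "bus".
    query = """
        SELECT label_name, label_display_name
        FROM `bigquery-public-data.open_images.dict`
        WHERE LOWER(label_display_name) LIKE '%bus%'
    """
    for row in client.query(query).result():
        print(row.label_name, row.label_display_name)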

    Acknowledgements

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:open_images

    https://cloud.google.com/bigquery/public-data/openimages

    APA-style citation: Google Research (2016). The Open Images dataset [Image urls and labels]. Available from github: https://github.com/openimages/dataset.

    Use: The annotations are licensed by Google Inc. under CC BY 4.0 license.

    The images referenced in the dataset are listed as having a CC BY 2.0 license. Note: while we tried to identify images that are licensed under a Creative Commons Attribution license, we make no representations or warranties regarding the license status of each image and you should verify the license for each image yourself.

    Banner Photo by Mattias Diesel from Unsplash.

    Inspiration

    • Which labels are in the dataset?
    • Which labels have "bus" in their display names?
    • How many images of a trolleybus are in the dataset?
    • What are some landing pages of images with a trolleybus?
    • Which images with cherries are in the training set?

  3. Eclipse Megamovie

    • console.cloud.google.com
    Updated Jul 15, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:Google%20Cloud%20Public%20Datasets%20Program&hl=en_GB (2023). Eclipse Megamovie [Dataset]. https://console.cloud.google.com/marketplace/product/google-cloud-public-datasets/eclipse-megamovie?hl=en_GB
    Explore at:
    Dataset updated
    Jul 15, 2023
    Dataset provided by
    Google (http://google.com/)
    Description

    This is the full set of images submitted for the Eclipse Megamovie project, a citizen science project to capture images of the Sun’s corona during the August 21, 2017 total solar eclipse. These images were taken by volunteer photographers (as well as the general public) from across the country using consumer camera equipment. The Eclipse Megamovie project was a collaboration between UC Berkeley, Google, the Astronomical Society of the Pacific, and many more.* In addition to the dataset, the code used by the project to create the website and process individual movies can be found on GitHub. For a full description of the data fields, see below.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.

    *Additional partners: Center for Research on Lifelong STEM Learning, Oregon State University, Eclipse Across America, Foothill College, High Altitude Observatory of the National Center for Atmospheric Research, Ideum, Lick Observatory, Space Sciences Laboratory, University of California, Berkeley, University of Colorado at Boulder, Williams College and the IAU Working Group.

  4. GitHub Programming Languages Data

    • kaggle.com
    zip
    Updated Jan 2, 2022
    Cite
    Isaac Wen (2022). GitHub Programming Languages Data [Dataset]. https://www.kaggle.com/datasets/isaacwen/github-programming-languages-data
    Explore at:
    Available download formats: zip (41198 bytes)
    Dataset updated
    Jan 2, 2022
    Authors
    Isaac Wen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    A common question for those new to and familiar with computer science and software engineering is which programming language is the best and/or most popular. It is very difficult to give a definitive answer, as there is a seemingly indefinite number of metrics that can define the 'best' or 'most popular' programming language.

    One such metric that can be used to define a 'popular' programming language is the number of projects and files that are made using that programming language. As GitHub is the most popular public collaboration and file-sharing platform, analyzing the languages that are used for repositories, PRs, and issues on GitHub can be a good indicator of the popularity of a language.

    Content

    This dataset contains statistics about the programming languages used for repositories, PRs, and issues on GitHub. The data is from 2011 to 2021.

    Source

    This data was queried and aggregated from BigQuery's public github_repos and githubarchive datasets.

    Limitations

    Only data for public GitHub repositories, and their corresponding PRs/issues, have their data available publicly. Thus, this dataset is only based on public repositories, which may not be fully representative of all repositories on GitHub.

  5. Makani Flight Logs

    • console.cloud.google.com
    Updated Feb 15, 2020
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:BigQuery%20Public%20Datasets%20Program&hl=en-GB (2020). Makani Flight Logs [Dataset]. https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/makani-logs?hl=en-GB
    Explore at:
    Dataset updated
    Feb 15, 2020
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    License
    Description

    Makani developed energy kites, using a wing tethered to a ground station to efficiently harness energy from the wind, generating electricity at utility scale. As the kite flies autonomously in loops, rotors on the wing spin as the wind moves through them, generating electricity that is sent down the tether to the grid. The company was closed in February 2020, but major technical learnings have been made available in the public domain. This dataset is part of that public package.

    The main folder in this bucket is labeled 'merged logs' and contains all telemetry from the kite and base station collected during crosswind flights of the M600 kite between 2016 and 2019. The other buckets contain build files and databases that are used to build and run the Makani flight simulator, which can be accessed at github.com/google/makani.

    This public dataset is hosted in Google Cloud Storage and available free to use. Use this quick start guide to learn how to access public datasets on Google Cloud Storage.

  6. GitHub-Issues

    • kaggle.com
    zip
    Updated Apr 28, 2022
    Cite
    Hamel Husain (2022). GitHub-Issues [Dataset]. https://www.kaggle.com/datasets/hamelhusain/githubissues
    Explore at:
    Available download formats: zip (2612014366 bytes)
    Dataset updated
    Apr 28, 2022
    Authors
    Hamel Husain
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    GitHub Issues Data Pulled From BigQuery Public Dataset -- Access this query at https://console.cloud.google.com/bigquery?sq=235037502967:a71a4b32d74442558a2739b581064e5f

    This data is pulled with the following SQL query

    SELECT url, title, body
    FROM(
    SELECT url, title, body
      , ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 80, 120) ORDER BY url) as count_body_beg
    FROM(
    SELECT url, title, body
      , ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 40, 80) ORDER BY url) as count_body_beg
    FROM(
    SELECT url, title, body
     , ROW_NUMBER() OVER (PARTITION BY SUBSTR(body, 0, 40) ORDER BY url) as count_body_beg
    FROM(
    
      SELECT DISTINCT 
       url
       -- replace more than one white-space character in a row with a single space
      , REGEXP_REPLACE(title, r"\s{2,}", ' ') as title
      , REGEXP_REPLACE(body, r"\s{2,}", ' ') as body
      , ROW_NUMBER() OVER (PARTITION BY SUBSTR(title, 0, 22) ORDER BY url) as count_title_beg
      -- , RANK() OVER (PARTITION BY SUBSTR(body, 0, 1000) ORDER BY url) as count_body_beg
      FROM(
        SELECT
          JSON_EXTRACT(payload, '$.issue.html_url') as url
          -- extract the title and body removing parentheses, brackets, and quotes
     , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.title'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as title
     , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.body'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as body
        FROM `githubarchive.day.2021*`
        WHERE 
         -- ALL Of 2021 
           _TABLE_SUFFIX BETWEEN '0101' and '1231'
         and type="IssuesEvent" 
         -- Only want the issue at a specific point otherwise will have duplicates
         and JSON_EXTRACT(payload, '$.action') = "\"opened\"" 
         UNION ALL 
           SELECT
          JSON_EXTRACT(payload, '$.issue.html_url') as url
          -- extract the title and body removing parentheses, brackets, and quotes
     , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.title'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as title
     , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.body'), r"\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as body
        FROM `githubarchive.day.2020*`
        WHERE 
         -- ALL Of 2020 
           _TABLE_SUFFIX BETWEEN '0101' and '1231'
         and type="IssuesEvent" 
         -- Only want the issue at a specific point otherwise will have duplicates
         and JSON_EXTRACT(payload, '$.action') = "\"opened\""
    
      ) as tbl
    
      WHERE 
   -- the body must be at least 6 words long and the title at least 3 words long
       -- this is an arbitrary way to filter out empty or sparse issues
         ARRAY_LENGTH(SPLIT(body, ' ')) >= 6
       and ARRAY_LENGTH(SPLIT(title, ' ')) >= 3
       -- filter out issues that have really long titles or bodies
       --  (these are outliers, and will slow tokenization down).
       and LENGTH(title) <= 400
       and LENGTH(body) <= 2000
    ) tbl2
    WHERE count_title_beg = 1
    )tbl3
    WHERE count_body_beg = 1
    )tbl4
    WHERE count_body_beg = 1
    )tbl5
    WHERE count_body_beg = 1
    
  7. Synthetic Patient Data in OMOP

    • console.cloud.google.com
    Updated Jul 26, 2023
    Cite
    https://console.cloud.google.com/marketplace/browse?filter=partner:U.S.%20Department%20of%20Health%20%26%20Human%20Services&hl=ja (2023). Synthetic Patient Data in OMOP [Dataset]. https://console.cloud.google.com/marketplace/product/hhs/synpuf?hl=ja
    Explore at:
    Dataset updated
    Jul 26, 2023
    Dataset provided by
    Google (http://google.com/)
    Description

    The Synthetic Patient Data in OMOP Dataset is a synthetic database released by the Centers for Medicare and Medicaid Services (CMS) Medicare Claims Synthetic Public Use Files (SynPUF). It is synthetic data containing 2008-2010 Medicare insurance claims for development and demonstration purposes. It has been converted to the Observational Medical Outcomes Partnership (OMOP) common data model from its original form, CSV, by the open source community, as released on GitHub. Please refer to the CMS Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) User Manual for details regarding how DE-SynPUF was created.

    This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets.

  8. PatCit: A Comprehensive Dataset of Patent Citations

    • zenodo.org
    application/gzip, bin
    Updated Dec 23, 2020
    + more versions
    Cite
    Gaétan de Rassenfosse; Cyril Verluise (2020). PatCit: A Comprehensive Dataset of Patent Citations [Dataset]. http://doi.org/10.5281/zenodo.3710994
    Explore at:
    Available download formats: application/gzip, bin
    Dataset updated
    Dec 23, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gaétan de Rassenfosse; Cyril Verluise
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]

    Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.

    It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non Patent Literature (NPL) citations and in the patent text itself.

    Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long hidden information. That's the purpose of this project.

    IN PRACTICE

    A detailed presentation of the current state of the project is available in our March 2020 presentation.

    So far, we have:

    1. classified the 40 million NPL citations reported in the DOCDB database into 9 distinct research-oriented classes with a 90% accuracy rate.
    2. parsed and consolidated the 27 million NPL citations classified as bibliographical references.

    3. extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all time USPTO patents.

    The latest version of the dataset is v0.15. It is made of v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.

    Give it a try! The dataset is publicly available on Google Cloud BigQuery.

    FEATURES

    Open

    • The code is licensed under MIT-2 and the dataset is licensed under CC4. Two highly permissive licenses.
    • The project is thought to be dynamically improved by and for the community. Anyone should feel free to open discussions, raise issues, request features and contribute to the project.

    Comprehensive

    • We address worldwide patents, as long as the data is available.
    • We address all classes of citations, not only bibliographical references.
    • We address front-page and in-text citations.

    Highest standards

    • We use and implement state-of-the-art machine learning solutions.
    • We take great care to implement only the most efficient solutions. We believe that computational resources should be used sparingly, for both environmental sustainability and the long-term financial sustainability of the project.

  9. SOTorrent 2018-12-09

    • kaggle.com
    zip
    Updated Dec 18, 2018
    Cite
    SOTorrent (2018). SOTorrent 2018-12-09 [Dataset]. https://www.kaggle.com/datasets/sotorrent/2018-12-09
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Dec 18, 2018
    Dataset authored and provided by
    SOTorrent
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Please note

    Tables TitleVersion and Votes are not yet visible in the Data preview page, but they are accessible in Kernels.

    Context

    Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump.

    Content

    SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub. If you use this dataset in your work, please cite our MSR 2018 paper or our MSR 2019 mining challenge proposal.

    This version is based on the official Stack Overflow data dump released 2018-12-02 and the Google BigQuery GitHub data set queried 2018-12-09.

    Inspiration

    The goal of the MSR 2019 mining challenge is to study the origin, evolution, and usage of Stack Overflow code snippets. Questions that are, to the best of our knowledge, not sufficiently answered yet include:

    • How are code snippets on Stack Overflow maintained?
    • How many clones of code snippets exist inside Stack Overflow?
    • How can we detect buggy versions of Stack Overflow code snippets and find them in GitHub projects?
    • How frequently are code snippets copied from external sources into Stack Overflow and then co-evolve there?
    • How do snippets copied from Stack Overflow to GitHub co-evolve?
    • Does the evolution of Stack Overflow code snippets follow patterns?
    • Do these patterns differ between programming languages?
    • Are the licenses of external sources compatible with Stack Overflow’s license (CC BY-SA 3.0)?
    • How many code blocks on Stack Overflow do not contain source code (and are only used for markup)?
    • Can we reliably predict bug-fixing edits to code on Stack Overflow?
    • Can we reliably predict popularity of Stack Overflow code snippets on GitHub?

    These are just some of the questions that could be answered using SOTorrent. We encourage challenge participants to adapt the above questions or formulate their own research questions about the origin, evolution, and usage of content on Stack Overflow.

  10. Data from: arXiv Dataset

    • kaggle.com
    • huggingface.co
    • +1more
    zip
    Updated Nov 22, 2020
    + more versions
    Cite
    Cornell University (2020). arXiv Dataset [Dataset]. https://www.kaggle.com/Cornell-University/arxiv
    Explore at:
    Available download formats: zip (950178574 bytes)
    Dataset updated
    Nov 22, 2020
    Dataset authored and provided by
    Cornell University
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About ArXiv

    For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming depth.

    In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, we present a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more.

    Our hope is to empower new use cases that can lead to the exploration of richer machine learning techniques that combine multi-modal features towards applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

    The dataset is freely available via Google Cloud Storage buckets (more info here). Stay tuned for weekly updates to the dataset!

    ArXiv is a collaboratively funded, community-supported resource founded by Paul Ginsparg in 1991 and maintained and operated by Cornell University.

    The release of this dataset was featured further in a Kaggle blog post here.


    ArXiv On Kaggle

    Metadata

    This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in JSON format. This file contains an entry for each paper, containing:

    • id: ArXiv ID (can be used to access the paper, see below)
    • submitter: Who submitted the paper
    • authors: Authors of the paper
    • title: Title of the paper
    • comments: Additional info, such as number of pages and figures
    • journal-ref: Information about the journal the paper was published in
    • doi: https://www.doi.org
    • abstract: The abstract of the paper
    • categories: Categories / tags in the ArXiv system
    • versions: A version history

    You can access each paper directly on ArXiv using these links:

    • https://arxiv.org/abs/{id}: Page for this paper including its abstract and further links
    • https://arxiv.org/pdf/{id}: Direct link to download the PDF
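
    For instance, a small sketch for turning the metadata file into paper links (the filename below is the one used by the Kaggle mirror at the time of writing and may change; treat it as an assumption):

    import json

    # The metadata file contains one JSON object per line, one per paper.
    with open("arxiv-metadata-oai-snapshot.json") as f:
        for line in f:
            paper = json.loads(line)
            abs_url = f"https://arxiv.org/abs/{paper['id']}"
            pdf_url = f"https://arxiv.org/pdf/{paper['id']}"
            print(paper["title"].strip(), abs_url, pdf_url, sep="\n")
            break  # remove to iterate over the full metadata file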

    Bulk access

    The full set of PDFs is available for free in the GCS bucket gs://arxiv-dataset or through Google API (json documentation and xml documentation).

    You can use, for example, gsutil to download the data to your local machine.

    ```
    # List files:
    gsutil ls gs://arxiv-dataset/arxiv/

    # Download PDFs from March 2020:
    gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2003/ ./a_local_directory/

    # Download all the source files:
    gsutil cp -r gs://arxiv-dataset/arxiv/ ./a_local_directory/
    ```

    Update Frequency

    We're automatically updating the metadata as well as the GCS bucket on a weekly basis.

    License

    Creative Commons CC0 1.0 Universal Public Domain Dedication applies to the metadata in this dataset. See https://arxiv.org/help/license for further details and licensing on individual papers.

    Acknowledgements

    The original data is maintained by ArXiv, huge thanks to the team for building and maintaining this dataset.

    We're using https://github.com/mattbierbaum/arxiv-public-datasets to pull the original data, thanks to Matt Bierbaum for providing this tool.

  11. Learning Path Index Dataset

    • kaggle.com
    zip
    Updated Nov 6, 2024
    Cite
    Mani Sarkar (2024). Learning Path Index Dataset [Dataset]. https://www.kaggle.com/datasets/neomatrix369/learning-path-index-dataset/code
    Explore at:
    Available download formats: zip (151846 bytes)
    Dataset updated
    Nov 6, 2024
    Authors
    Mani Sarkar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Description

    The Learning Path Index Dataset is a comprehensive collection of byte-sized courses and learning materials tailored for individuals eager to delve into the fields of Data Science, Machine Learning, and Artificial Intelligence (AI), making it an indispensable reference for students, professionals, and educators in the Data Science and AI communities.

    This Kaggle Dataset along with the KaggleX Learning Path Index GitHub Repo were created by the mentors and mentees of Cohort 3 KaggleX BIPOC Mentorship Program (between August 2023 and November 2023, also see this). See Credits section at the bottom of the long description.

    Inspiration

    This dataset was created out of a commitment to facilitate learning and growth within the Data Science, Machine Learning, and AI communities. It started off as an idea during the brainstorming and feedback session at the end of Cohort 2 of the KaggleX BIPOC Mentorship Program, as one of the ideas for creating byte-sized learning material to help KaggleX mentees learn faster. It aspires to simplify the process of finding, evaluating, and selecting the most fitting educational resources.

    Context

    This dataset was meticulously curated to assist learners in navigating the vast landscape of Data Science, Machine Learning, and AI education. It serves as a compass for those aiming to develop their skills and expertise in these rapidly evolving fields.

    The mentors and mentees communicated via Discord, Trello, Google Hangouts, etc., to put together these artifacts and made them public for everyone to use and contribute back.

    Sources

    The dataset compiles data from a curated selection of reputable sources including leading educational platforms such as Google Developer, Google Cloud Skill Boost, IBM, Fast AI, etc. By drawing from these trusted sources, we ensure that the data is both accurate and pertinent. The raw data and other artifacts as a result of this exercise can be found on the GitHub Repo i.e. KaggleX Learning Path Index GitHub Repo.

    Content

    The dataset encompasses the following attributes:

    • Course / Learning Material: The title of the Data Science, Machine Learning, or AI course or learning material.
    • Source: The provider or institution offering the course.
    • Course Level: The proficiency level, ranging from Beginner to Advanced.
    • Type (Free or Paid): Indicates whether the course is available for free or requires payment.
    • Module: Specific module or section within the course.
    • Duration: The estimated time required to complete the module or course.
    • Module / Sub-module Difficulty Level: The complexity level of the module or sub-module.
    • Keywords / Tags / Skills / Interests / Categories: Relevant keywords, tags, or categories associated with the course with a focus on Data Science, Machine Learning, and AI.
    • Links: Hyperlinks to access the course or learning material directly.

    How to contribute to this initiative?

    • You can also join us by taking part in the next KaggleX BIPOC Mentorship program (also see this)
    • Keep your eyes open on the Kaggle Discussions page and other KaggleX social media channels. Or find us on the Kaggle Discord channel to learn more about the next steps
    • Create notebooks from this data
    • Create supplementary or complementary data for or from this dataset
    • Submit corrections/enhancements or anything else to help improve this dataset so it has a wider use and purpose

    License

    The Learning Path Index Dataset is openly shared under a permissive license, allowing users to utilize the data for educational, analytical, and research purposes within the Data Science, Machine Learning, and AI domains. Feel free to fork the dataset and make it your own, we would be delighted if you contributed back to the dataset and/or our KaggleX Learning Path Index GitHub Repo as well.

    Important Links

    Credits

    Credits for all the work done to create this Kaggle Dataset and the KaggleX [Learnin...

  12. Dataset for Design Ideation Study

    • dataverse.azure.uit.no
    • dataverse.no
    application/x-h5, pdf +3
    Updated Feb 28, 2024
    Cite
    Filip Gornitzka Abelson; Henrikke Dybvik; Martin Steinert (2024). Dataset for Design Ideation Study [Dataset]. http://doi.org/10.18710/PZQC4A
    Explore at:
    Available download formats: tsv(7501), txt(13093), application/x-h5(25860340), application/x-h5(286920385), zip(581532), tsv(295160), application/x-h5(540715825), tsv(767327), application/x-h5(49209334), application/x-h5(510702725), tsv(1336354), tsv(2010), tsv(1935109), pdf(33267), application/x-h5(272694817)
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    DataverseNO
    Authors
    Filip Gornitzka Abelson; Henrikke Dybvik; Martin Steinert
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Study information

    Design ideation study (N = 24) using eye tracking technology. Participants solved a total of twelve design problems while receiving inspirational stimuli on a monitor. Their task was to generate as many solutions to each problem as possible and to explain each solution briefly by thinking aloud. The study allows further insight into how inspirational stimuli improve idea fluency during design ideation. This dataset features processed data from the experiment. Eye tracking data includes gaze data, fixation data, blink data, and pupillometry data for all participants.

    The study is based on the following research paper and follows the same experimental setup: Goucher-Lambert, K., Moss, J., & Cagan, J. (2019). A neuroimaging investigation of design ideation with and without inspirational stimuli—understanding the meaning of near and far stimuli. Design Studies, 60, 1-38. DOI

    Dataset

    Most files in the dataset are saved as CSV files or other human readable file formats. Large files are saved in Hierarchical Data Format (HDF5/H5) to allow for smaller file sizes and higher compression. All data is described thoroughly in 00_ReadMe.txt. The following processed data is included in the dataset:

    • Concatenated annotations file of experimental flow for all participants (CSV).
    • All eye tracking raw data in concatenated files, annotated only with participant ID (CSV/HDF5).
    • Annotated eye tracking data for ideation routines only; a subset of the files above (CSV/HDF5).
    • Audio transcriptions from the Google Cloud Speech-to-Text API of each recording, with annotations (CSV).
    • Raw API response for each transcription; these files include the time offset for each word in a recording (JSON).
    • Data for questionnaire feedback and ideas generated during the experiment (CSV).
    • Data for the post-experiment survey, including demographic information (TSV).

    Python code used for the open-source experimental setup and dataset construction is hosted at GitHub. The repository also includes code showing how the dataset has been further processed.

  13. DICOM converted Slide Microscopy images for the TCGA-SKCM collection

    • zenodo.org
    bin
    Updated Aug 20, 2024
    + more versions
    Cite
    David Clunie; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim; Andrey Fedorov (2024). DICOM converted Slide Microscopy images for the TCGA-SKCM collection [Dataset]. http://doi.org/10.5281/zenodo.12690040
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Clunie; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim; Andrey Fedorov
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-SKCM. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.

    Collection description

    Melanoma is a cancer in a type of skin cells called melanocytes. Melanocytes are the cells that produce melanin, which colors the skin. When exposed to sun, these cells make more melanin, causing the skin to darken or tan. Melanoma can occur anywhere on the body and risk factors include fair complexion, family history of melanoma, and being exposed to natural or artificial sunlight over long periods of time. Melanoma is most often discovered because it has metastasized, or spread, to another organ, such as the lymph nodes. In many cases, the primary skin melanoma site is never found. Because of this challenge, TCGA is studying primarily metastatic cases (in contrast to other cancers selected for study, where metastatic cases are excluded). For 2011, it was estimated that there were 70,230 new cases of melanoma and 8,790 deaths from the disease.

    Please see the TCGA-SKCM information page to learn more about the images and to obtain any supporting metadata for this collection.

    Citation guidelines can be found on the Citing TCGA in Publications and Presentations information page.

    Files included

    A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.

    1. tcga_skcm-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
    2. tcga_skcm-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
    3. tcga_skcm-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

    Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.

    Download instructions

    Each of the manifests include instructions in the header on how to download the included files.

    To download the files using .s5cmd manifests:

    1. install idc-index package: pip install --upgrade idc-index
    2. download the files referenced by manifests included in this dataset by passing the .s5cmd manifest file: idc download manifest.s5cmd.

    To download the files using .dcf manifest, see manifest header.

    Acknowledgments

    Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

    References

    [1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180

  14. PheKnowLator Human Disease Knowledge Graph Benchmarks Archive

    • zenodo.org
    bin
    Updated Feb 22, 2024
    Cite
    PheKnowLator Ecosystem Developers (2024). PheKnowLator Human Disease Knowledge Graph Benchmarks Archive [Dataset]. http://doi.org/10.5281/zenodo.10689968
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 22, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    PheKnowLator Ecosystem Developers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PKT Human Disease KG Benchmark Builds

    The PheKnowLator (PKT) Human Disease KG (PKT-KG) was built to model mechanisms of human disease, which includes the Central Dogma and represents multiple biological scales of organization including molecular, cellular, tissue, and organ. The knowledge representation was designed in collaboration with a PhD-level molecular biologist (Figure).

    The PKT Human Disease KG was constructed using 12 OBO Foundry ontologies, 31 Linked Open Data sets, and results from two large-scale experiments (Supplementary Material). The 12 OBO Foundry ontologies were selected to represent chemicals and vaccines (i.e., ChEBI and Vaccine Ontology), cells and cell lines (i.e., Cell Ontology, Cell Line Ontology), gene/gene product attributes (i.e., Gene Ontology), phenotypes and diseases (i.e., Human Phenotype Ontology, Mondo Disease Ontology), proteins, including complexes and isoforms (i.e., Protein Ontology), pathways (i.e., Pathway Ontology), types and attributes of biological sequences (i.e., Sequence Ontology), and anatomical entities (Uberon ontology). The RO is used to provide relationships between the core OBO Foundry ontologies and database entities.

    The PKT Human Disease KG contained 18 node types and 33 edge types. Note that the number of nodes and edge types reflects those that are explicitly added to the core set of OBO Foundry ontologies and does not take into account the node and edge types provided by the ontologies. These nodes and edge types were used to construct 12 different PKT Human Disease benchmark KGs by altering the Knowledge Model (i.e., class- vs. instance-based), Relation Strategy (i.e., standard vs. inverse relations), and Semantic Abstraction (i.e., OWL-NETS (yes/no) with and without Knowledge Model harmonization [OWL-NETS Only vs. OWL-NETS + Harmonization]) parameters. Benchmarks within the PheKnowLator ecosystem are different versions of a KG that can be built under alternative knowledge models, relation strategies, and with or without semantic abstraction. They provide users with the ability to evaluate different modeling decisions (based on the prior mentioned parameters) and to examine the impact of these decisions on different downstream tasks.

    The Figures and Tables explaining attributes in the builds can be found here.

    Build Data Access

    Important Build Information

    The benchmarks were originally built and stored using Google Cloud Platform (GCP) resources. Details and a complete description of this process can be found on GitHub (here). Note that we have developed this Zenodo-based archive for the builds. While the original GCP resources contained all of the resources needed to generate the builds, due to the file size upload limits associated with each archive, we have limited the uploaded files to the KGs, associated metadata, and log files. The list of resources, including their URLs and date of download, can all be found in the logs associated with each build.

    🗂 For additional information on the KG file types please see the following Wiki page, which is also available as a download from this repository (PheKnowLator_HumanDiseaseKG_Output_FileInformation.xlsx).

    v1.0.0

    All Other Build Versions

    • Class-based Builds
      • Standard Relations
      • Inverse Relations
    • Instance-based Builds
      • Standard Relations
      • Inverse Relations

  15. Bitcoin Blockchain Historical Data

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Google BigQuery (2019). Bitcoin Blockchain Historical Data [Dataset]. https://www.kaggle.com/bigquery/bitcoin-blockchain
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Blockchain technology, first implemented by Satoshi Nakamoto in 2009 as a core component of Bitcoin, is a distributed, public ledger recording transactions. Its usage allows secure peer-to-peer communication by linking blocks containing hash pointers to a previous block, a timestamp, and transaction data. Bitcoin is a decentralized digital currency (cryptocurrency) which leverages the Blockchain to store transactions in a distributed manner in order to mitigate against flaws in the financial industry.

    Nearly ten years after its inception, Bitcoin and other cryptocurrencies experienced an explosion in popular awareness. The value of Bitcoin, on the other hand, has experienced more volatility. Meanwhile, as use cases of Bitcoin and Blockchain grow, mature, and expand, hype and controversy have swirled.

    Content

    In this dataset, you will have access to information about blockchain blocks and transactions. All historical data are in the bigquery-public-data:crypto_bitcoin dataset. It's updated every 10 minutes. The data can be joined with historical prices in kernels. See available similar datasets here: https://www.kaggle.com/datasets?search=bitcoin.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_bitcoin.[TABLENAME]. Fork this kernel to get started.
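
    As an illustrative sketch of the first Inspiration question below (how many bitcoins are sent each day), assuming the transactions table exposes a block_timestamp column and a repeated outputs record whose value field is denominated in satoshi (verify against the current schema):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Sum transaction output values per day; divide by 1e8 to convert
    # satoshi to BTC. Restricting the date range limits the data scanned.
    query = """
        SELECT DATE(block_timestamp) AS day,
               SUM(o.value) / 1e8 AS btc_sent
        FROM `bigquery-public-data.crypto_bitcoin.transactions`,
             UNNEST(outputs) AS o
        WHERE block_timestamp >= TIMESTAMP '2019-01-01'
        GROUP BY day
        ORDER BY day
    """
    for row in client.query(query).result():
        print(row.day, row.btc_sent)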

    Method & Acknowledgements

    Allen Day (Twitter | Medium), Google Cloud Developer Advocate & Colin Bookman, Google Cloud Customer Engineer retrieve data from the Bitcoin network using a custom client available on GitHub that they built with the bitcoinj Java library. Historical data from the origin block to 2018-01-31 were loaded in bulk to two BigQuery tables, blocks_raw and transactions. These tables contain fresh data, as they are now appended when new blocks are broadcast to the Bitcoin network. For additional information visit the Google Cloud Big Data and Machine Learning Blog post "Bitcoin in BigQuery: Blockchain analytics on public data".

    Photo by Andre Francois on Unsplash.

    Inspiration

    • How many bitcoins are sent each day?
    • How many addresses receive bitcoin each day?
    • Compare transaction volume to historical prices by joining with other available data sources
  16. BigQuery GIS Utility Datasets (U.S.)

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Google BigQuery (2019). BigQuery GIS Utility Datasets (U.S.) [Dataset]. https://www.kaggle.com/bigquery/utility-us
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Google (http://google.com/)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.utility_us.[TABLENAME].

    • Project: "bigquery-public-data"
    • Table: "utility_us"

    Fork this kernel to get started and to learn how to safely manage analysis of large BigQuery datasets.

    If you're using Python, you can start with this code:

    import pandas as pd
    from bq_helper import BigQueryHelper
    bq_assistant = BigQueryHelper("bigquery-public-data", "utility_us")
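
    From there, a short continuation sketch (assuming the bq_helper package behaves as it does in Kaggle Kernels) to explore the dataset and check a query's cost before running it:

    # List the tables in utility_us and preview the first one.
    tables = bq_assistant.list_tables()
    print(tables)
    print(bq_assistant.head(tables[0], num_rows=3))

    # Estimate how much data (in GB) a query would scan before running it.
    query = f"SELECT * FROM `bigquery-public-data.utility_us.{tables[0]}` LIMIT 10"
    print(bq_assistant.estimate_query_size(query))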
    
  17. Ethereum Classic Blockchain

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Google BigQuery (2019). Ethereum Classic Blockchain [Dataset]. https://www.kaggle.com/datasets/bigquery/crypto-ethereum-classic
    Explore at:
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset provided by
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Ethereum Classic is an open-source, public, blockchain-based distributed computing platform featuring smart contract (scripting) functionality. It provides a decentralized Turing-complete virtual machine, the Ethereum Virtual Machine (EVM), which can execute scripts using an international network of public nodes. Ethereum Classic and Ethereum have a value token called "ether", which can be transferred between participants, stored in a cryptocurrency wallet and is used to compensate participant nodes for computations performed in the Ethereum Platform.

    Ethereum Classic came into existence when some members of the Ethereum community rejected the DAO hard fork on the grounds of "immutability", the principle that the blockchain cannot be changed, and decided to keep using the unforked version of Ethereum. To this day, Ethereum Classic runs the original Ethereum chain.

    Content

    In this dataset, you will have access to Ethereum Classic (ETC) historical block data along with transactions and traces. You can access the data from BigQuery in your notebook with bigquery-public-data.crypto_ethereum_classic dataset.

    Querying BigQuery tables

    You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.crypto_ethereum_classic.[TABLENAME]. Fork this kernel to get started.
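
    As a rough illustration, and assuming the crypto_ethereum_classic.transactions table exposes a block_timestamp column (an assumption to verify against the current schema), a daily transaction count could look like this:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Count ETC transactions per day over the most recent 30 days of data.
    query = """
        SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
        FROM `bigquery-public-data.crypto_ethereum_classic.transactions`
        GROUP BY day
        ORDER BY day DESC
        LIMIT 30
    """
    for row in client.query(query).result():
        print(row.day, row.tx_count)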

    Acknowledgements

    This dataset wouldn't be possible without the help of Allen Day, Evgeny Medvedev and Yaz Khoury. This dataset uses Blockchain ETL. Special thanks to ETC community member @donsyang for the banner image.

    Inspiration

    One of the main questions we wanted to answer was the Gini coefficient of ETC data. We also wanted to analyze the DAO Smart Contract before and after the DAO Hack and the resulting Hardfork. We also wanted to analyze the network during the famous 51% attack and see what sort of patterns we can spot about the attacker.

  18. Hacker News Corpus

    • kaggle.com
    zip
    Updated Jun 29, 2017
    Cite
    Hacker News (2017). Hacker News Corpus [Dataset]. https://www.kaggle.com/hacker-news/hacker-news-corpus
    Explore at:
    Available download formats: zip (642956855 bytes)
    Dataset updated
    Jun 29, 2017
    Dataset authored and provided by
    Hacker News
    Description

    Context

    This dataset contains a randomized sample of roughly one quarter of all stories and comments from Hacker News from its launch in 2006. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".

    Content

    Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received.

    Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.

    Acknowledgements

    This dataset was kindly made publicly available by Hacker News under the MIT license.

    Inspiration

    • Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?

    • Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?

    • Is the amount of coverage by Hacker News predictive of a startup’s success?

    Use this dataset with BigQuery

    You can use Kernels to analyze, share, and discuss this data on Kaggle, but if you’re looking for real-time updates and bigger data, check out the data in BigQuery, too: https://cloud.google.com/bigquery/public-data/hacker-news

    The BigQuery version of this dataset has roughly four times as many articles.
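
    For example, the first Inspiration question above (is the site dominated by a small fraction of users?) could be approached with a sketch like the following, which assumes the BigQuery mirror exposes a full table with by and type columns; adjust to the actual schema:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Comments contributed by the 100 most active commenters.
    query = """
        SELECT `by` AS author, COUNT(*) AS n_comments
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = 'comment' AND `by` IS NOT NULL
        GROUP BY author
        ORDER BY n_comments DESC
        LIMIT 100
    """
    top = list(client.query(query).result())
    print(sum(row.n_comments for row in top), "comments from the top 100 commenters")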

  19. Global Development Analysis (2000-2020)

    • kaggle.com
    zip
    Updated May 11, 2025
    Cite
    Michael Matta (2025). Global Development Analysis (2000-2020) [Dataset]. https://www.kaggle.com/datasets/michaelmatta0/global-development-indicators-2000-2020
    Explore at:
    Available download formats: zip (1311638 bytes)
    Dataset updated
    May 11, 2025
    Authors
    Michael Matta
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Global Economic, Environmental, Health, and Social indicators Ready for Analysis

    📝 Description

    This comprehensive dataset merges global economic, environmental, technological, and human development indicators from 2000 to 2020. Sourced and transformed from multiple public datasets via Google BigQuery, it is designed for advanced exploratory data analysis, machine learning, policy modeling, and sustainability research.

    Curated by combining and transforming data from the Google BigQuery Public Data program, this dataset offers a harmonized view of global development across more than 40 key indicators spanning over two decades (2000–2020). It supports research across multiple domains such as:

    • Economic Growth
    • Climate Sustainability
    • Digital Transformation
    • Public Health
    • Human Development
    • Resilience and Governance

    for formulas and more details check: https://github.com/Michael-Matta1/datasets-collection/tree/main/Global%20Development

    📅 Temporal Coverage

    • Years: 2000–2020
    • Includes calculated features:

      • years_since_2000
      • years_since_century
      • is_pandemic_period (binary indicator for pandemic periods)

    🌍 Geographic Scope

    • Countries: Global (identified by ISO country codes)
    • Regions and Income Groups included for aggregated analysis

    📊 Key Feature Groups

    • Economic Indicators:

      • GDP (USD), GDP per capita
      • FDI, inflation, unemployment, economic growth index
    • Environmental Indicators:

      • CO₂ emissions, renewable energy use
      • Forest area, green transition score, CO₂ intensity
    • Technology & Connectivity:

      • Internet usage, mobile subscriptions
      • Digital readiness score, digital connectivity index
    • Health & Education:

      • Life expectancy, child mortality
      • School enrollment, healthcare capacity, health development ratio
    • Governance & Resilience:

      • Governance quality, global resilience
      • Human development composite, ecological preservation

    🔍 Use Cases

    • Trend analysis over time
    • Country-level comparisons
    • Modeling development outcomes
    • Predictive analytics on sustainability or human development
    • Correlation and clustering across multiple indicators
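
    As a quick, illustrative sketch of the trend-analysis and correlation use cases above (the filename is hypothetical; substitute the CSV name shipped in the downloaded zip):

    import pandas as pd

    # Hypothetical filename; replace with the CSV included in the dataset zip.
    df = pd.read_csv("global_development_indicators_2000_2020.csv")

    # Correlation between a few indicators across all country-years.
    cols = ["gdp_per_capita", "life_expectancy", "co2_emissions_kt", "renewable_energy_pct"]
    print(df[cols].corr())

    # Simple trend: global mean life expectancy per year.
    print(df.groupby("year")["life_expectancy"].mean())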

    ⚠️ Note on Missing Region and Income Group Data

    Approximately 18% of the entries in the region and income_group columns are null. This is primarily due to the inclusion of aggregate regions (e.g., Arab World, East Asia & Pacific, Africa Eastern and Southern) and non-country classifications (e.g., Early-demographic dividend, Central Europe and the Baltics). These entries represent groups of countries with diverse income levels and geographic characteristics, making it inappropriate or misleading to assign a single region or income classification. In some cases, the data source may have intentionally left these fields blank to avoid oversimplification or due to a lack of standardized classification.

    📋 Column Descriptions

    • year: Year of the recorded data, representing a time series for each country.
    • country_code: Unique code assigned to each country (ISO-3166 standard).
    • country_name: Name of the country corresponding to the data.
    • region: Geographical region of the country (e.g., Africa, Asia, Europe).
    • income_group: Income classification based on Gross National Income (GNI) per capita (low, lower-middle, upper-middle, high income).
    • currency_unit: Currency used in the country (e.g., USD, EUR).
    • gdp_usd: Gross Domestic Product (GDP) in USD (millions or billions).
    • population: Total population of the country for the given year.
    • gdp_per_capita: GDP divided by population (economic output per person).
    • inflation_rate: Annual rate of inflation (price level rise).
    • unemployment_rate: Percentage of the labor force unemployed but seeking employment.
    • fdi_pct_gdp: Foreign Direct Investment (FDI) as a percentage of GDP.
    • co2_emissions_kt: Total CO₂ emissions in kilotons (kt).
    • energy_use_per_capita: Energy consumption per person (kWh).
    • renewable_energy_pct: Percentage of energy consumption from renewable sources.
    • forest_area_pct: Percentage of total land area covered by forests.
    • electricity_access_pct: Percentage of the population with access to electricity.
    • life_expectancy: Average life expectancy at birth.
    • child_mortality: Deaths of children under 5 per 1,000 live births.
    • school_enrollment_secondary: Percentage of population enrolled in secondary education.
    • health_expenditure_pct_gdp: Percentage of GDP spent on healthcare.
    • hospital_beds_per_1000...
  20. Indie Map

    • kaggle.com
    • data.wu.ac.at
    zip
    Updated Jul 1, 2017
    Cite
    Ryan (2017). Indie Map [Dataset]. https://www.kaggle.com/snarfed/indiemap
    Explore at:
    Available download formats: zip (19873006 bytes)
    Dataset updated
    Jul 1, 2017
    Authors
    Ryan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The IndieWeb is a people-focused alternative to the "corporate" web. Participants use their own personal web sites to post, reply, share, organize events and RSVP, and interact in online social networking in ways that have otherwise been limited to centralized silos like Facebook and Twitter.

    The Indie Map dataset is a social network of the 2300 most active IndieWeb sites, including all connections between sites and number of links in each direction, broken down by type. It includes:

    • 5.8M web pages, including raw HTML, parsed microformats2, and extracted links with metadata.
    • 631M links and 706K "friend" relationships between sites.
    • 380GB of HTML and HTTP requests in WARC format.

    The zip file here contains a JSON file for each site, which includes metadata, a list of other sites linked to and from, and the number of links of each type.

    The complete dataset of 5.8M HTML pages is available in a publicly accessible Google BigQuery dataset. The raw pages can also be downloaded as WARC files. They're hosted on Google Cloud Storage.

    More details in the full documentation.

    Indie Map is free, open source, and placed into the public domain via CC0. Crawled content remains the property of each site's owner and author, and subject to their existing copyrights.
