Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Hacker News Stories Dataset
This is a dataset containing approximately 4 million stories from Hacker News, exported to a CSV file. The dataset includes the following fields:
id (int64): The unique identifier of the story.
title (string): The title of the story.
url (string): The URL of the story.
score (int64): The score of the story.
time (int64): The time the story was posted, in Unix time.
comments (int64): The number of comments on the story.
author (string): The username of…
See the full description on the dataset page: https://huggingface.co/datasets/julien040/hacker-news-posts.
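As a minimal sketch of working with this field list, the following Python reads a CSV with the columns named above, converts the int64 columns to integers, and turns the Unix `time` into a timezone-aware datetime. The column order, the two sample rows, and the helper names `parse_story` and `read_stories` are assumptions made for illustration; they are not taken from the dataset page.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical two-row sample; the real export has roughly 4 million rows.
SAMPLE_CSV = """id,title,url,score,time,comments,author
1,Example story,https://example.com/a,57,1160418111,15,alice
2,Another story,https://example.com/b,3,1700000000,0,bob
"""

# Columns the field list describes as int64.
INT_FIELDS = ("id", "score", "time", "comments")


def parse_story(row):
    """Convert the int64 columns to int and add a UTC datetime for `time`."""
    story = dict(row)
    for name in INT_FIELDS:
        story[name] = int(story[name])
    story["posted_at"] = datetime.fromtimestamp(story["time"], tz=timezone.utc)
    return story


def read_stories(handle):
    """Read every row from an open CSV file handle."""
    return [parse_story(row) for row in csv.DictReader(handle)]


stories = read_stories(io.StringIO(SAMPLE_CSV))
```

For the real file, the same function could be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.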
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains all stories and comments from Hacker News from its launch in 2006. Each story contains a story id, the author that made the post, when it was written, and the number of points the story received. Hacker News is a social news website focusing on computer science and entrepreneurship. It is run by Paul Graham's investment fund and startup incubator, Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".
Please note that the text field includes profanity. All texts are the author’s own, do not necessarily reflect the positions of Kaggle or Hacker News, and are presented without endorsement.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.hacker_news.[TABLENAME]. Fork this kernel to get started.
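A query through the BigQuery Python client might look like the sketch below. The table name `full`, the column names `by` and `type`, and the helper names are assumptions for illustration, so substitute the actual [TABLENAME] and columns you need. Running the query requires the `google-cloud-bigquery` package and valid credentials, so the import is kept inside the function that needs it.

```python
def build_top_authors_query(table="full", limit=10):
    """Build a query counting stories per author (table name is an assumption)."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # `by` is a reserved word in BigQuery standard SQL, so it is backtick-quoted.
    return (
        "SELECT `by` AS author, COUNT(*) AS stories\n"
        f"FROM `bigquery-public-data.hacker_news.{table}`\n"
        "WHERE type = 'story' AND `by` IS NOT NULL\n"
        "GROUP BY author\n"
        "ORDER BY stories DESC\n"
        f"LIMIT {int(limit)}"
    )


def run_query(sql):
    """Run the query and return the rows as a list of dicts."""
    from google.cloud import bigquery  # needs google-cloud-bigquery installed

    client = bigquery.Client()
    return [dict(row.items()) for row in client.query(sql).result()]


sql = build_top_authors_query()
```

Calling `run_query(sql)` from an environment with BigQuery access would return one dict per author.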
This dataset was kindly made publicly available by Hacker News under the MIT license.
Recent studies have found that many forums tend to be dominated by a very small fraction of users. Is this true of Hacker News?
Hacker News has received complaints that the site is biased towards Y Combinator startups. Do the data support this?
Is the amount of coverage by Hacker News predictive of a startup’s success?
Hacker News posts and comments
This is a dataset of all HN posts and comments, current as of November 1, 2023.
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Hacker News Sentiment Analysis Dataset is a public-opinion dataset for the technology community. For each of the top 141 Hacker News posts, it provides a sentiment analysis (polarity, subjectivity, and sentiment category) along with the title, URL, points, and comment count.
2) Data Utilization
(1) Characteristics of the Hacker News Sentiment Analysis Dataset:
• It includes polarity (-1 to 1), subjectivity (0 to 1), and category (positive/neutral/negative) columns that quantify the sentiment of comments using TextBlob, based on the top posts as of June 24, 2025.
• It was generated through web scraping and NLP preprocessing, and allows quantitative comparison of community responses to technology news.
(2) The Hacker News Sentiment Analysis Dataset can be used to:
• Visualize sentiment around technology trends: connect sentiment scores with post topics to analyze community response patterns to specific technology news, such as AI and policy.
• Train NLP models: sentiment classification models can be trained on comment data from real-world technical discussions, or the data can be applied to research on predicting the subjectivity of comments.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Schema
| field name | type | description |
| --- | --- | --- |
| title | STRING | Story title |
| url | STRING | Story url |
| text | STRING | Story or comment text |
| dead | BOOLEAN | Is dead? |
| by | STRING | The username of the item's author. |
| score | INTEGER | Story score |
| time | INTEGER | Unix time |
| timestamp | TIMESTAMP | Timestamp for the unix time |
| type | STRING | type of details (comment, comment_ranking, poll, story, job, pollopt) |
| id | INTEGER | The item's unique id. |
| parent | INTEGER | Parent comment ID |
| descendants | INTEGER | … |

See the full description on the dataset page: https://huggingface.co/datasets/labofsahil/hacker-news-dataset.
This dataset contains all stories and comments from Hacker News from its launch in 2006 to present. Each story contains a story ID, the author that made the post, when it was written, and the number of points the story received. This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A curated dataset from fh-bigquery:hackernews.stories
Only HN stories with more than 10 comments are included, and only comments from users with more than 10 comments are included.
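The filtering rule above could be sketched in Python as follows. This is an illustration, not the curator's actual procedure. The record shapes (`id`, `story_id`, `author`), the function name `curate`, and the choice to compute both counts before any filtering are assumptions.

```python
from collections import Counter


def curate(stories, comments, min_count=10):
    """Keep stories with more than `min_count` comments and, within them,
    comments whose author has written more than `min_count` comments.

    Assumed shapes: stories are dicts with 'id'; comments are dicts with
    'story_id' and 'author'. Counts are taken over the unfiltered input.
    """
    per_story = Counter(c["story_id"] for c in comments)
    per_author = Counter(c["author"] for c in comments)

    kept_stories = [s for s in stories if per_story[s["id"]] > min_count]
    kept_ids = {s["id"] for s in kept_stories}
    kept_comments = [
        c
        for c in comments
        if c["story_id"] in kept_ids and per_author[c["author"]] > min_count
    ]
    return kept_stories, kept_comments
```

Note that "more than 10" is implemented as a strict comparison, so a story with exactly 10 comments is dropped.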
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Hacker News corpus, 2007-Nov 2022
Dataset Description
Dataset Summary
Dataset Name: Hacker News Full Corpus (2007 - November 2022) Description:
NOTE: I am not affiliated with Y Combinator.
This dataset is a July 2023 snapshot of YCombinator's BigQuery dump of the entire archive of posts and comments made on Hacker News. It contains posts from Hacker News' inception in 2007 through to November 16, 2022, when the BigQuery database was last updated. The dataset… See the full description on the dataset page: https://huggingface.co/datasets/jkeisling/hacker-news-corpus-2007-2022.
This dataset was created by ashish01
This dataset was created by Michał Paliński
https://choosealicense.com/licenses/afl-3.0/
Dataset Card for [Dataset Name]
Dataset Summary
Hacker News stories with comments, up to 2015, collected from the Google BigQuery open dataset. We did no pre-processing other than removing HTML tags.
Supported Tasks and Leaderboards
Comment Generation; News analysis with comments; Other comment-based NLP tasks.
Languages
English
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/Linkseed/hacker_news_with_comments.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
See also: https://zenodo.org/record/45901 and https://zenodo.org/record/49899 and https://zenodo.org/record/49900
georgeck/hacker-news-discussion-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Hamza Jabbar Khan
gbonifacechang/hacker-news-regressor-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# TDMentions: A Dataset of Technical Debt Mentions in Online Posts (version 1.0)
TDMentions is a dataset that contains mentions of technical debt from Reddit, Hacker News, and Stack Exchange. It also contains a list of blog posts on Medium that were tagged as technical debt. The dataset currently contains approximately 35,000 items.
## Data collection and processing
The dataset is mainly collected from existing datasets. We used data from:
- the archive of Reddit posts by Jason Baumgartner (available at [https://pushshift.io](https://pushshift.io)),
- the archive of Hacker News available at Google's BigQuery (available at [https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news](https://console.cloud.google.com/marketplace/details/y-combinator/hacker-news)),
- the Stack Exchange data dump (available at [https://archive.org/details/stackexchange](https://archive.org/details/stackexchange)),
- the [GHTorrent](http://ghtorrent.org) project, and
- the [GH Archive](https://www.gharchive.org).
The data set currently contains data from the start of each source/service until 2018-12-31. For GitHub, we currently only include data from 2015-01-01.
We use the regular expression `tech(nical)?[\s\-_]*?debt` to find mentions in all sources except for Medium. We decided to limit our matches to variations of technical debt and tech debt. Other shorter forms, such as TD, can result in too many false positives. For Medium, we used the tag `technical-debt`.
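To illustrate what the pattern matches, here is a small Python sketch. Case-insensitive matching (`re.IGNORECASE`) and the helper name `mentions_technical_debt` are assumptions; the text does not state which flags or regex engine were used.

```python
import re

# Pattern quoted from the text; the IGNORECASE flag is an assumption.
TD_PATTERN = re.compile(r"tech(nical)?[\s\-_]*?debt", re.IGNORECASE)


def mentions_technical_debt(text):
    """Return True if the text contains a 'tech debt' / 'technical debt' variant."""
    return TD_PATTERN.search(text) is not None


examples = [
    "We have a lot of tech debt",    # matches
    "Technical-debt is piling up",   # matches (hyphen separator)
    "paying down technical_debt",    # matches (underscore separator)
    "TD is growing",                 # no match: short form is excluded
    "technological debt",            # no match: not a listed variant
]
results = [mentions_technical_debt(e) for e in examples]
```

The separator class `[\s\-_]*?` also permits no separator at all, so `techdebt` matches as well.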
## Data Format
The dataset is stored as a compressed (bzip2) JSON file with one JSON object per line. Each mention is represented as a JSON object with the following keys.
- `id`: the id used in the original source. We use the URL path to identify Medium posts.
- `body`: the text that contains the mention. This is either the comment or the title of the post. For Medium posts this is the title and subtitle (which might not mention technical debt, since posts are identified by the tag).
- `created_utc`: the time the item was posted in seconds since epoch in UTC.
- `author`: the author of the item. We use the username or userid from the source.
- `source`: where the item was posted. Valid sources are:
  - HackerNews Comment
  - HackerNews Job
  - HackerNews Submission
  - Reddit Comment
  - Reddit Submission
  - StackExchange Answer
  - StackExchange Comment
  - StackExchange Question
  - Medium Post
- `meta`: Additional information about the item specific to the source. This includes, e.g., the subreddit a Reddit submission or comment was posted to, the score, etc. We try to use the same names, e.g., `score` and `num_comments` for keys that have the same meaning/information across multiple sources.
This is a sample item from Reddit:
```JSON
{
  "id": "ab8auf",
  "body": "Technical Debt Explained (x-post r/Eve)",
  "created_utc": 1546271789,
  "author": "totally_100_human",
  "source": "Reddit Submission",
  "meta": {
    "title": "Technical Debt Explained (x-post r/Eve)",
    "score": 1,
    "num_comments": 0,
    "url": "http://jestertrek.com/eve/technical-debt-2.png",
    "subreddit": "RCBRedditBot"
  }
}
```
## Sample Analyses
We decided to use JSON to store the data, since it is easy to work with from multiple programming languages. In the following examples, we use [`jq`](https://stedolan.github.io/jq/) to process the JSON.
### How many items are there for each source?
```
lbzip2 -cd postscomments.json.bz2 | jq '.source' | sort | uniq -c
```
### How many submissions that mentioned technical debt were posted each month?
```
lbzip2 -cd postscomments.json.bz2 | jq 'select(.source == "Reddit Submission") | .created_utc | strftime("%Y-%m")' | sort | uniq -c
```
### What are the titles of items that link (`meta.url`) to PDF documents?
```
lbzip2 -cd postscomments.json.bz2 | jq '. as $r | select(.meta.url?) | .meta.url | select(endswith(".pdf")) | $r.body'
```
### Please, I want CSV!
```
lbzip2 -cd postscomments.json.bz2 | jq -r '[.id, .body, .author] | @csv'
```
Note that you need to specify the keys you want to include for the CSV, so it is easier to either ignore the meta information or process each source.
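For readers who prefer Python to `jq`, the following sketch reads the bzip2-compressed JSON-lines file and writes a CSV that also includes the shared `meta` keys `score` and `num_comments`. The function names and the `meta_` column prefix are assumptions for illustration, and items that lack a key simply get an empty cell.

```python
import bz2
import csv
import json

BASE_KEYS = ("id", "body", "created_utc", "author", "source")
# Meta keys described as shared across sources; others vary per source.
META_KEYS = ("score", "num_comments")


def iter_mentions(path):
    """Yield one dict per non-empty line of the compressed JSON-lines file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def to_row(item):
    """Flatten one item into a list: base keys first, then selected meta keys."""
    meta = item.get("meta") or {}
    return [item.get(k) for k in BASE_KEYS] + [meta.get(k) for k in META_KEYS]


def export_csv(src_path, dst_path):
    """Write all items to a CSV file and return the number of rows written."""
    header = list(BASE_KEYS) + [f"meta_{k}" for k in META_KEYS]
    count = 0
    with open(dst_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        for item in iter_mentions(src_path):
            writer.writerow(to_row(item))
            count += 1
    return count
```

A call such as `export_csv("postscomments.json.bz2", "mentions.csv")` would process the file named in the examples above.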
Please see [https://github.com/sse-lnu/tdmentions](https://github.com/sse-lnu/tdmentions) for more analyses.
## Limitations and Future Updates
The current version of the dataset lacks GitHub data and Medium comments. GitHub data will be added in the next update. Medium comments (responses) will be added in a future update if we find a good way to represent these.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Hacker News Discussion Summarization - Large
Dataset Summary
This dataset comprises 14,531 records of Hacker News front-page stories collected over 516 days. Each record includes the story's metadata and its associated discussion threads, formatted to facilitate the development of summarization models.
Supported Tasks and Leaderboards
The primary task supported by this dataset is summarization, specifically targeting the summarization of… See the full description on the dataset page: https://huggingface.co/datasets/georgeck/hacker-news-discussion-summarization-large.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This repository contains the Hacker News datasets used by https://github.com/anantn/hn-chatgpt-plugin. As of June 2025, they are exported as Parquet files instead of SQLite for space efficiency.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Oman Internet Usage: Social Media Market Share: Mobile: news.ycombinator.com data was reported at 0.000 % on 05 Apr 2025, unchanged from the previous figure of 0.000 % for 04 Apr 2025. The series is updated daily, with a median of 0.000 % from May 2024 to 05 Apr 2025 across 56 observations. The data reached an all-time high of 0.190 % on 31 Dec 2024 and a record low of 0.000 % on 05 Apr 2025. The series remains in active status in CEIC and is reported by Statcounter Global Stats. The data is categorized under Global Database’s Oman – Table OM.SC.IU: Internet Usage: Social Media Market Share.