21 datasets found
  1. France Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). France Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/france-weekly-real-estate-listings-2022-2023
    Explore at:
zip (2750497 bytes). Available download formats
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    France
    Description

These Kaggle datasets provide real estate listings downloaded from the French market, captured from Seloger, a leading platform in France, mirroring the approach taken for the US dataset (from Redfin) and the UK dataset (from Zoopla). They encompass detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.

The cleaning process mirrored that of the US dataset: removing irrelevant features, normalizing variable names for consistency with the USA and UK datasets, and adjusting variable value ranges to remove extreme outliers. To add depth, external factors such as inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on the drivers of France's real estate market.

    For exact column descriptions, see columns for France_clean_unique.csv and my thesis.

Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see the University Library entry and click Online Access -> Hlavni prace (main thesis).

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

Let me know if you want to see how I got from the raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.
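To make the last of those steps concrete, here is a minimal, hypothetical R sketch of deduplicating, renaming, and writing the merged file; the snapshot folder and the renamed column names are assumptions for illustration, not taken from the dataset itself.

    library(dplyr)
    library(readr)

    # Assumed layout: one CSV per weekly snapshot in a local snapshots/ folder (hypothetical path).
    files <- list.files("snapshots", pattern = "\\.csv$", full.names = TRUE)

    france <- files |>
      lapply(read_csv, show_col_types = FALSE) |>
      bind_rows() |>
      distinct() |>                                  # remove duplicate listings across weeks
      rename(price_eur = price, area_m2 = surface)   # hypothetical renames for cross-country consistency

    write_csv(france, "France_clean_unique.csv")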

  2. Reddit: /r/news

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/news [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-popularity-and-user-engagement-trends/discussion
    Explore at:
zip (146481 bytes). Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/news

    Exploring Topics, Scores, and Engagement

    By Reddit [source]

    About this dataset

This dataset provides an in-depth look at what communities find important and engaging in the news. With this data, researchers can discover trends in user engagement and popular topics within subreddits. By examining the "score" and "comms_num" columns, researchers can pinpoint which topics are most liked, discussed, or shared. They can also gain insight not only into how popular a topic is but into how it grows over time. Additionally, by exploring the body column, researchers can understand which types of news stories drive conversation within particular subreddits, providing an opportunity for deeper analysis of each subreddit's community dynamics.

The dataset includes eight columns: title, score, id, url, comms_num, created, body, and timestamp, which can help identify key insights into user engagement among popular subreddits. With this data we may also determine relationships between topics of discussion and their impact on user engagement, allowing a better understanding of issue-based conversations online as well as emerging trends in online news consumption habits.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is useful for those who are looking to gain insight into the popularity and user engagement of specific subreddits. The data includes 8 different columns including title, score, id, url, comms_num, created, body and timestamp. This can provide valuable information about how users view and interact with particular topics across various subreddits.

    In this guide we’ll look at how you can use this dataset to uncover trends in user engagement on topics within specific subreddits as well as measure the overall popularity of these topics within a subreddit.

1) Analyzing score: by analyzing the "score" column you can determine which news stories are popular in a particular subreddit and which are not, based on how many upvotes each story has received. With this data you can identify trends in the types of stories users preferred within a particular subreddit over time.

2) Analyzing comms_num: similarly, you can analyze the "comms_num" column to see which news stories had more engagement from users by tracking the number of comments on each post. This shows which types of stories tend to draw more comment activity in certain subreddits, whether over a single day or an extended period such as multiple weeks or months. 3) Analyzing body: by looking at the "body" column for each post, researchers can better understand which kinds of topics and news draw attention among specific Reddit communities. With that complete picture, researchers have access not only to data measuring Reddit buzz but also to the topic discussions and comments themselves, helping to explain why certain posts are more popular or receive more comments than others.

Overall, this dataset provides valuable insights about user engagement around topics trending across subreddits, giving anyone researching such questions an easier way to access those insights in one place.
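As a quick, hedged illustration of steps 1) and 2), the following R snippet loads news.csv (the file named in the Columns section) and lists the highest-scoring and most-commented posts; column names follow the description above.

    library(dplyr)
    library(readr)

    news <- read_csv("news.csv", show_col_types = FALSE)

    # 1) Most upvoted stories
    news |> arrange(desc(score)) |> select(title, score) |> head(10)

    # 2) Stories that drew the most comment activity
    news |> arrange(desc(comms_num)) |> select(title, comms_num) |> head(10)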

    Research Ideas

    • Grouping news topics within particular subreddits and assessing the overall popularity of those topics in terms of scores/user engagement.
    • Correlating user engagement with certain news topics to understand how they influence discussion or reactions on a subreddit.
    • Examining the potential correlation between score and the actual body content of a given post to assess what types of content are most successful in gaining interest from users and creating positive engagement for posts

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: news.csv | Column name | Description ...

  3. Data from: Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 13, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing these limitations:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter to waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

File contents (all zip files contain nested directories of .parquet files; example_script_for_using_parquet.R contains example code for using the R arrow package to open and query them):
- "year_byscene=XXXX.zip" – temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data was extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – a script with code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – a cross-walk file identifying the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – a cross-walk file identifying the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) identifying site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
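As a hedged sketch of the filtering recommendations above, the following R snippet assumes the year_byscene=2023 archive has been unzipped into the working directory and uses the arrow package to open the nested .parquet files; the exact schema should be checked against example_script_for_using_parquet.R.

    library(arrow)
    library(dplyr)

    # Open the nested, hive-partitioned .parquet files (path assumes the zip was extracted locally).
    byscene <- open_dataset("year_byscene=2023", format = "parquet")

    clean <- byscene |>
      mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) |>
      filter(
        percent_cloud_pixels < 0.5,   # example cloud-coverage threshold; pick your own
        wb_dswe1_pixels >= 10,        # drop lakes with very few water pixels
        dp_dswe == 1                  # keep waterbodies whose deepest point is water
      ) |>
      collect()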

  4. Reddit: /r/technology (Submissions & Comments)

    • kaggle.com
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/technology (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-technology-insights-through-reddit-di
    Explore at:
Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/technology (Submissions & Comments)

    Title, Score, ID, URL, Comment Number, and Timestamp

    By Reddit [source]

    About this dataset

This dataset, labeled Reddit Technology Data, provides thorough insight into the conversations and interactions around technology-related topics shared on Reddit, a well-known Internet discussion forum. It contains discussion titles, scores contributed by Reddit users, the unique IDs of the discussions, the URLs associated with those discussions (if any), comment counts for each discussion thread, and timestamps of when the conversations were initiated. This data is valuable for tech-savvy people wanting to stay up to date with new developments in their field and for professionals looking to keep abreast of industry trends. In short, it is a repository that helps people make sense of what is happening in the technology world at large, inspiring action or simply educating them about forthcoming changes.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

The dataset includes six columns: title, score, url (a link to the discussion page on Reddit), comms_num (comment count), created (the time the post was made), and body (the actual text written for that post or discussion). By analyzing each column separately, you can work out what kind of information users look for across the various aspects of technology-related discussion. You can develop hypotheses about correlations between the factors behind rating or comment count: what do people comment on or react to most, and does a high rating always come along with extremely long comment threads? Researching in this way can surface patterns hidden in social networking sites such as Reddit, which contain a large amount of rich information about users' interest in different tech topics. Similar trends, such as reactions to voice search technology, can be compared against other public forums like Stack Overflow or Facebook posts to visualize users' overall reactions to shared information. These small analyses can yield substantial insights for research purposes and reveal potential business opportunities when monitored over time.
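A small, hedged R sketch of the kind of correlation question raised above, using technology.csv and the columns listed in the Columns section (the 500-character threshold for a "long" body is an arbitrary illustration):

    library(dplyr)
    library(readr)

    technology <- read_csv("technology.csv", show_col_types = FALSE)

    # Is a higher score associated with more comments?
    cor.test(technology$score, technology$comms_num, method = "spearman")

    # Do posts with long bodies score or get commented on differently?
    technology |>
      mutate(long_body = nchar(coalesce(body, "")) > 500) |>
      group_by(long_body) |>
      summarise(posts = n(),
                median_score = median(score),
                median_comments = median(comms_num))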

    Research Ideas

    • Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
    • Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
    • Technology professionals can utilize the comments and discussions on this dataset as a way of gauging public opinion and consumer sentiment towards certain technological advancements or products

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: technology.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the discussion. (String) |
| score | The score of the discussion as measured by Reddit contributors. (Integer) |
| url | The website URL associated with the discussion. (String) |
| comms_num | The number of comments associated with the discussion. (Integer) |
| created | The date and time the discussion was created. (DateTime) |
| body | The body content of the discussion. (String) |
| timestamp | The timestamp of the discussion. (Integer) |

    Acknowledgements

If you use this dataset in your research, please credit the original authors and Reddit.

  5. Data from: [Dataset:] Barro Colorado Forest Census Plot Data (Version 2012)

    • smithsonian.figshare.com
    • search.dataone.org
    pdf
    Updated Apr 18, 2024
    Cite
Richard Condit; Suzanne Lao; Rolando Pérez; Steven B. Dolins; Robin Foster; Stephen Hubbell (2024). [Dataset:] Barro Colorado Forest Census Plot Data (Version 2012) [Dataset]. http://doi.org/10.5479/data.bci.20130603
    Explore at:
pdf. Available download formats
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Smithsonian Tropical Research Institute
    Authors
Richard Condit; Suzanne Lao; Rolando Pérez; Steven B. Dolins; Robin Foster; Stephen Hubbell
    License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Barro Colorado Island
    Description

Abstract: The 50-hectare plot at Barro Colorado Island, Panama, is a 1000 meter by 500 meter rectangle of forest inside of which all woody trees and shrubs with stems at least 1 cm in stem diameter have been censused. Every individual tree in the 50 hectares was permanently numbered with an aluminum tag in 1982, and every individual has been revisited six times since (in 1985, 1990, 1995, 2000, 2005, and 2010). In each census, every tree was measured, mapped and identified to species. Details of the census method are presented in Condit (Tropical forest census plots: Methods and results from Barro Colorado Island, Panama and a comparison with other plots; Springer-Verlag, 1998), and a description of the seven-census results in Condit, Chisholm, and Hubbell (Thirty years of forest census at Barro Colorado and the importance of immigration in maintaining diversity; PLoS ONE, 7:e49826, 2012).

Description:

CITATION TO DATABASE: Condit, R., Lao, S., Pérez, R., Dolins, S.B., Foster, R.B., Hubbell, S.P. 2012. Barro Colorado Forest Census Plot Data, 2012 Version. DOI http://dx.doi.org/10.5479/data.bci.20130603

CO-AUTHORS: Stephen Hubbell and Richard Condit have been principal investigators of the project for over 30 years. They are fully responsible for the field methods and data quality. As such, both request that data users contact them and invite them to be co-authors on publications relying on the data. More recent versions of the data, often with important updates, can be requested directly from R. Condit (conditr@gmail.com).

ACKNOWLEDGMENTS: The following should be acknowledged in publications for contributions to the 50-ha plot project: R. Foster as plot founder and the first botanist able to identify so many trees in a diverse forest; R. Pérez and S. Aguilar for species identification; S. Lao for data management; S. Dolins for database design; plus hundreds of field workers for the census work, now over 2 million tree measurements; and the National Science Foundation, Smithsonian Tropical Research Institute, and MacArthur Foundation for the bulk of the financial support.

File 1. RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (File 5).

File 2. RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (File 7).

File 3. ViewFullTable.zip: A zip archive with a single ASCII text file named ViewFullTable.txt holding a table with all census data from the BCI 50-ha plot. Each row is a single measurement of a single stem, with columns indicating the census, date, species name, plus tree and stem identifiers; all seven censuses are included. A full description of all columns in the table can be found at http://dx.doi.org/10.5479/data.bci.20130604 (ViewFullTable, pp. 21-22 of the pdf).

File 4. ViewTax.txt: An ASCII text table with information on all tree species recorded in the 50-ha plot. There are columns with taxonomic names (family, genus, species, and subspecies), plus the taxonomic authority. The column 'Mnemonic' gives a shortened code identifying each species, a code used in the R tables (Files 5, 7). The column 'IDLevel' indicates the depth to which the species is identified: if IDLevel='species', it is fully identified, but if IDLevel='genus', the genus is known but not the species. IDLevel can also be 'family', or 'none' in case the species is not even known to family.

File 5. bci.full.Rdata31Aug2012.zip: A zip archive holding seven R Analytical Tables, versions of the BCI 50-ha plot census data in R format, designed for data analysis. There are seven files, one for each of the 7 censuses: 'bci.full1.rdata' for the first census through 'bci.full7.rdata' for the seventh census. Each of the seven files is a table having one record per individual tree, and each includes a record for every tree found over the entire seven censuses (i.e. whether or not a tree was observed alive in the given census, there is a record). Detailed documentation of these tables is given in RoutputFull.pdf (File 1).

File 6. bci.spptable.rdata: A list of the 1064 species found across all tree plots and inventories in Panama, in R format. This is a superset of species found in the BCI censuses: every BCI species is included, plus additional species never observed at BCI. The column 'sp' in this table is a code identifying the species in the R census tables (Files 5, 7), matching 'Mnemonic' in ViewFullTable (File 3).

File 7. bci.stem.Rdata31Aug2012.zip: A zip archive holding seven R Analytical Tables, versions of the BCI 50-ha plot census data in R format, designed for data analysis. There are seven files, one for each of the 7 censuses: 'bci.stem1.rdata' for the first census through 'bci.stem7.rdata' for the seventh census. Each of the seven files is a table having one record per individual stem, necessary because some individual trees have more than one stem. Each includes a record for every stem found over the entire seven censuses (i.e. whether or not a stem was observed alive in the given census, there is a record). Detailed documentation of these tables is given in RoutputStem.pdf (File 2).

File 8. TSMAttributes.txt: An ASCII text table giving full descriptions of measurement codes, also referred to as TSMCodes. These short codes are used in the column 'code' in the R tables and in the column 'ListOfTSM' in ViewFullTable.txt, in both cases with individual codes separated by commas.

File 9. bci_31August2012_mysql.zip: A zip archive holding one file, 'bci.sql', which is a mysqldump of the complete MySQL database (version 5.0.95, http://www.mysql.com) created 31 August 2012. The database includes data collected from seven censuses of the BCI 50-ha plot plus censuses of many additional plots elsewhere in Panama, plus transects where only species identifications were collected and trees were neither tagged nor measured. Detailed documentation of all tables within the database can be found at http://dx.doi.org/10.5479/data.bci.20130604. This version of the data is intended for experienced SQL users; for most, the R Analytical Tables in Rtables.zip are more useful.
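A minimal R sketch of getting started with the R Analytical Tables, assuming bci.full.Rdata31Aug2012.zip has been unzipped and that the object names inside the .rdata files match the file names (an assumption to verify against RoutputFull.pdf):

    # Load the seventh-census tree table and the species table (file names as listed above).
    load("bci.full7.rdata")      # assumed to create a data frame named bci.full7
    load("bci.spptable.rdata")   # assumed to create bci.spptable

    # Join census records to taxonomy via the species code ('sp', as described for File 6)
    merged <- merge(bci.full7, bci.spptable, by = "sp")
    head(merged)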

  6. UK Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). UK Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/uk-weekly-real-estate-listings-2022-2023
    Explore at:
zip (29112488 bytes). Available download formats
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    United Kingdom
    Description

These Kaggle datasets provide real-estate listings downloaded from the UK market, captured from Zoopla, a leading platform in the UK, mirroring the approach taken for the US dataset (from Redfin) and the French dataset (from Seloger). They encompass detailed property listings, pricing, and market trends across the UK, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named UK_clean_unique.csv.

The cleaning process mirrored that of the US and French datasets: removing irrelevant features, normalizing variable names for consistency with the USA and France, and adjusting variable value ranges to remove extreme outliers. To add depth, external factors such as inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on the drivers of the UK's real estate market.

    For exact column descriptions, see columns for UK_clean_unique.csv and my thesis.

Table 2.6 and Section 2.2.2, which I refer to in the column descriptions, can be found in my thesis; see the University Library entry and click Online Access -> Hlavni prace (main thesis).

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

Let me know if you want to see how I got from the raw data to UK_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.

  7. Reddit: /r/videos

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/videos [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-popular-and-quality-video-content-on/code
    Explore at:
zip (127095 bytes). Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/videos

    Insights on Popularity and Content Quality

    By Reddit [source]

    About this dataset

This dataset explores media content on Reddit and how it is received by the community, providing detailed insights into both the popularity and quality of /r/videos posts. Here you will find data about videos posted on Reddit, compiled from metrics such as upvotes, number of comments, date and time posted, body text, and more. With this data you can dive deeper into the types of videos being shared and the topics being discussed, gaining a better understanding of what resonates with the Reddit community. This information shows what kind of content has the potential to reach a wide audience on Reddit; it also reveals which types of videos have been popular with users over time. These insights can help researchers uncover valuable findings about media trends on popular social media sites such as Reddit, so don't hesitate to explore!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    How To Use This Dataset

This dataset is a great resource for analyzing the content and popularity of videos posted on Reddit. It provides metrics such as score, url, comment count, and creation date that let you compare the different types of content being shared on the /r/videos subreddit.

    To get started, take a look at the title field for each post. This gives you an idea of what type of video is being shared, which can be helpful in understanding what topics are popular on the platform.

Next, use the score field to identify posts that received many upvotes from users: the higher the score, the more popular the post has been with viewers. A higher score does not necessarily indicate higher quality, however; look at each post's body field to gauge its content quality before making assumptions about its value based solely on a high score. That said, top-scoring posts are worth examining when researching popular topics or trends in media consumption across Reddit's user base (for example, trending topics among young adults). The url field provides links for directly accessing the videos, so you can review them yourself before sharing them with friends or colleagues for their feedback, something worth doing if your research project requires that level of detail. The comms_num column records how many comments each video has received, which indicates how engaged viewers were with stories submitted by the subreddit's members; this is useful if interactions and conversations around particular types of content are part of your research objective. Finally, check the timestamp column, which records when each story was created; this matters whenever you want to draw conclusions from time-oriented data points (a time series analysis would be handy here, as sketched below).
Taken together, these fields give researchers an accessible way to explore the popularity and quality of media shared on Reddit and to uncover useful insights about the video stories posted to /r/videos.
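A hedged R sketch of such a time-oriented view, counting posts per day in videos.csv; it assumes the created column parses as a date-time, which should be verified against the actual file:

    library(dplyr)
    library(readr)

    videos <- read_csv("videos.csv", show_col_types = FALSE)

    # Posts per day, using the 'created' column (assumed to be parseable as a date-time).
    videos |>
      mutate(day = as.Date(created)) |>
      count(day, name = "posts") |>
      arrange(day)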

    Research Ideas

    • Identifying and tracking trends in the popularity of different genres of videos posted on Reddit, such as interviews, music videos, or educational content.
    • Investigating audience engagement with certain types of content to determine the types of posts that resonate most with users on Reddit.
    • Examining correlations between video score or comment count and specific video characteristics such as length, topic or visual style

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: videos.csv | Column name | Description | |:--------------|:--------------------------------------------------------------------------| | title | The title of ...

  8. The Pizza Problem

    • kaggle.com
    zip
    Updated Feb 8, 2019
    Cite
    Jeremy Jeanne (2019). The Pizza Problem [Dataset]. https://www.kaggle.com/jeremyjeanne/google-hashcode-pizza-training-2019
    Explore at:
zip (178852 bytes). Available download formats
    Dataset updated
    Feb 8, 2019
    Authors
    Jeremy Jeanne
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Problem description

    Pizza

The pizza is represented as a rectangular, 2-dimensional grid of R rows and C columns. The cells within the grid are referenced using a pair of 0-based coordinates [r, c], denoting respectively the row and the column of the cell.

    Each cell of the pizza contains either:

    mushroom, represented in the input file as M
    tomato, represented in the input file as T
    

    Slice

    A slice of pizza is a rectangular section of the pizza delimited by two rows and two columns, without holes. The slices we want to cut out must contain at least L cells of each ingredient (that is, at least L cells of mushroom and at least L cells of tomato) and at most H cells of any kind in total - surprising as it is, there is such a thing as too much pizza in one slice. The slices being cut out cannot overlap. The slices being cut do not need to cover the entire pizza.

    Goal

The goal is to cut correct slices out of the pizza, maximizing the total number of cells in all slices.

Input data set

The input data is provided as a data set file: a plain text file containing exclusively ASCII characters with lines terminated by a single '\n' character at the end of each line (UNIX-style line endings).

    File format

    The file consists of:

    one line containing the following natural numbers separated by single spaces:
    R (1 ≤ R ≤ 1000) is the number of rows
    C (1 ≤ C ≤ 1000) is the number of columns
    L (1 ≤ L ≤ 1000) is the minimum number of each ingredient cells in a slice
    H (1 ≤ H ≤ 1000) is the maximum total number of cells of a slice
    


R lines describing the rows of the pizza (one after another). Each of these lines contains C characters describing the ingredients in the cells of the row (one cell after another). Each character is either 'M' (for mushroom) or 'T' (for tomato).

    Example

    3 5 1 6
    TTTTT
    TMMMT
    TTTTT
    

    3 rows, 5 columns, min 1 of each ingredient per slice, max 6 cells per slice

    Example input file.
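For illustration, a small R sketch that parses an input file in the format described above (the file name is hypothetical):

    # Parse a pizza input file: header "R C L H", then R rows of 'M'/'T' characters.
    parse_pizza <- function(path = "example.in") {
      lines  <- readLines(path)
      header <- as.integer(strsplit(lines[1], " ")[[1]])
      R <- header[1]; C <- header[2]; L <- header[3]; H <- header[4]

      # Grid of single characters, R rows by C columns
      grid <- do.call(rbind, lapply(lines[2:(R + 1)], function(row) strsplit(row, "")[[1]]))
      list(R = R, C = C, L = L, H = H, grid = grid)
    }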

    Submissions

    File format

    The file must consist of:

one line containing a single natural number S (0 ≤ S ≤ R × C), representing the total number of slices to be cut,
S lines describing the slices. Each of these lines must contain the following natural numbers separated by single spaces:
r1, c1, r2, c2 describing a slice of pizza delimited by the rows r1 and r2 (0 ≤ r1, r2 < R) and the columns c1 and c2 (0 ≤ c1, c2 < C), including the cells of the delimiting rows and columns. The rows (r1 and r2) can be given in any order. The columns (c1 and c2) can be given in any order too.
    

    Example

3
0 0 2 1
0 2 2 2
0 3 2 4
    

    3 slices.

    First slice between rows (0,2) and columns (0,1).
    Second slice between rows (0,2) and columns (2,2).
    Third slice between rows (0,2) and columns (3,4).
    Example submission file.
    

    Ā© Google 2017, All rights reserved.

Slices described in the example submission file, marked in green, orange and purple.

Validation

    For the solution to be accepted:

    the format of the file must match the description above,
    each cell of the pizza must be included in at most one slice,
    each slice must contain at least L cells of mushroom,
    each slice must contain at least L cells of tomato,
    total area of each slice must be at most H
    

    Scoring

The submission gets a score equal to the total number of cells in all slices. Note that there are multiple data sets representing separate instances of the problem. The final score for your team is the sum of your best scores on the individual data sets.

Scoring example

    The example submission file given above cuts the slices of 6, 3 and 6 cells, earning 6 + 3 + 6 = 15 points.
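A hedged R sketch of a validator and scorer for a submission, reusing parse_pizza() from the sketch above and assuming the submission file follows the format in the Submissions section (it reproduces the 6 + 3 + 6 = 15 example):

    # Validate a submission and compute its score (total cells across valid, non-overlapping slices).
    score_submission <- function(pizza, submission_path = "example.out") {
      lines   <- readLines(submission_path)
      S       <- as.integer(lines[1])
      covered <- matrix(FALSE, nrow = pizza$R, ncol = pizza$C)
      total   <- 0

      for (i in seq_len(S)) {
        s  <- as.integer(strsplit(lines[i + 1], " ")[[1]])
        r1 <- min(s[1], s[3]) + 1; r2 <- max(s[1], s[3]) + 1   # 0-based input, 1-based R indexing
        c1 <- min(s[2], s[4]) + 1; c2 <- max(s[2], s[4]) + 1
        cells <- pizza$grid[r1:r2, c1:c2]

        stopifnot(
          !any(covered[r1:r2, c1:c2]),    # slices must not overlap
          sum(cells == "M") >= pizza$L,   # at least L mushroom cells
          sum(cells == "T") >= pizza$L,   # at least L tomato cells
          length(cells) <= pizza$H        # at most H cells in total
        )
        covered[r1:r2, c1:c2] <- TRUE
        total <- total + length(cells)
      }
      total
    }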

  9. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
zip (1299 bytes). Available download formats
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
The insights will help the marketing team design a strategy for converting casual riders.
    

    Prepare

    Guiding Question:

Where is your data located?
  The data is located in Cyclistic's organizational data.

How is the data organized?
  The datasets are in CSV format, one per month, covering financial year 2022.

Are there issues with bias or credibility in this data? Does your data ROCCC?
  It is good; the data ROCCCs because it was collected by the Cyclistic organization itself.

How are you addressing licensing, privacy, security, and accessibility?
  The company holds its own license over the dataset, and the dataset does not contain any personal information about the riders.

How did you verify the data's integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights are always hidden in the data; we have to interpret the data to find them.

Are there any problems with the data?
  Yes, the starting station name and ending station name columns have null values.
    

    Process

    Guiding Question:

What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, because of the large dataset and to gain experience with the language.

Have you ensured the data's integrity?
  Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed, then new columns were added for analysis.

How can you verify that your data is clean and ready to analyze?
  Make sure the column names are consistent throughout all datasets before combining them with the bind_rows() function.

  Make sure column data types are consistent throughout all the datasets by using compare_df_cols() from the janitor package.
  Combine all the datasets into a single data frame to keep the analysis consistent.
  Remove the columns start_lat, start_lng, end_lat, and end_lng from the data frame because they are not required for the analysis.
  Create new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
  Create a ride_length column from the started_at and ended_at columns to find the average ride duration.
  Remove the null rows from the dataset using the na.omit() function.
  (An R sketch of these steps is shown after this section.)

Have you documented your cleaning process so you can review and share those results?
  Yes, the cleaning process is documented clearly.
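A minimal R sketch of the cleaning steps above; the folder holding the monthly CSVs is a hypothetical path, and base-R date handling is used where the case study leaves the tooling open:

    library(dplyr)
    library(readr)
    library(janitor)

    files   <- list.files("trip_data", pattern = "\\.csv$", full.names = TRUE)  # hypothetical folder
    monthly <- lapply(files, read_csv, show_col_types = FALSE)

    compare_df_cols(monthly)   # check that column names and types agree across the monthly files

    trips <- bind_rows(monthly) |>
      distinct() |>                                            # drop duplicate rows
      select(-start_lat, -start_lng, -end_lat, -end_lng) |>    # columns not needed for the analysis
      mutate(
        date        = as.Date(started_at),
        day         = weekdays(date),
        month       = format(date, "%m"),
        year        = format(date, "%Y"),
        ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
      ) |>
      na.omit()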
    

    Analyze Phase:

    Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized into one single data frame by using the read_csv function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data types.

What surprises did you discover in the data?
  Casual members' ride durations are higher than annual members'.
  Casual members use docked bikes far more than annual members.
What trends or relationships did you find in the data?
  Annual members mainly ride for commuting purposes.
  Casual members prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help to build a profile for each type of member.
    

    Share

Guiding Questions:

    Were you able to answer the question of how ...
    
  10. Reddit: /r/Damnthatsinteresting

    • kaggle.com
    zip
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/Damnthatsinteresting [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-power-of-user-engagement-on-damnth
    Explore at:
zip (139409 bytes). Available download formats
    Dataset updated
    Dec 18, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/Damnthatsinteresting

    Investigating Popularity, Score and Engagement Across Subreddits

    By Reddit [source]

    About this dataset

This dataset provides valuable insights into user engagement and popularity across the subreddit Damnthatsinteresting, with detailed metrics on each discussion such as the title, score, id, URL, number of comments, created date and time, body, and timestamp. It opens a window into user interaction on Reddit by letting researchers align their questions with data-driven results to understand social media behavior. Gain an understanding of what drives people to engage in certain conversations and why certain topics become trending phenomena; it is all here for analysis. Enjoy exploring this fascinating collection of information about Reddit users' activities!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

This dataset provides valuable insights into user engagement and the impact of users' interactions on the popular subreddit Damnthatsinteresting. Exploring it can help uncover trends in participation, what content resonates with viewers, and how different users engage with each other. To get the most out of the dataset, you will need to understand its structure in order to extract meaningful insights. The columns provided are: title, score, url, comms_num, created (date/time), body, and timestamp.

    Research Ideas

    • Analyzing the impact of user comments on the popularity and engagement of discussions
    • Examining trends in user behavior over time to gain insight into popular topics of discussion
    • Investigating which discussions reach higher levels of score, popularity or engagement to identify successful strategies for engaging users

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: Damnthatsinteresting.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the discussion thread. (String) |
| score | The number of upvotes the discussion has received from users. (Integer) |
| url | The URL link for the discussion thread itself. (String) |
| comms_num | The number of comments made on a particular discussion. (Integer) |
| created | The date and time when the discussion was first created on Reddit by its original poster (OP). (DateTime) |
| body | Full content including the text body with rich media embedded within posts such as images/videos. (String) |
| timestamp | When the post was last updated by any particular user. (DateTime) |

    Acknowledgements

If you use this dataset in your research, please credit the original authors and Reddit.

  11. Reddit: /r/pokemon

    • kaggle.com
    zip
    Updated Dec 19, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2022). Reddit: /r/pokemon [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-popular-pokemon-topics-and-user-inter
    Explore at:
zip (434545 bytes). Available download formats
    Dataset updated
    Dec 19, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/pokemon

Exploring Post Popularity, User Engagement and Topic Discussion

    By Reddit [source]

    About this dataset

This Kaggle dataset provides a unique opportunity to explore the ongoing conversations and discussions of the popular Pokémon franchise across Reddit communities. It contains over a thousand entries compiled from posts and comments made by avid Pokémon fans, providing valuable insights into post popularity, user engagement, and topic discussion. With comprehensive data points including post title, score, post ID, link URL, number of comments, date and time created, body text, and timestamp, powerful analysis can be conducted to assess how trends in Pokémon-related activity evolve over time. So why not dive into this fascinating world of Poké-interactions? Follow along as we navigate the wide range of topics being discussed on Reddit about this legendary franchise!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

This dataset contains over a thousand entries of user conversations related to the Pokémon community, posted and commented on Reddit. By using it, you can explore the popularity of Pokémon-related topics, the level of user engagement, and how user interactions shape the discussion around each topic. To do so, focus on columns such as title, score, url, comms_num (number of comments on a post), created (date and time when the post was created), and timestamp.
For starters, you can look at how many posts have been made about certain topics by using the "title" column as a keyword search, e.g. 'Magikarp' or 'Team Rocket', to see how many posts mention them in total (a small R sketch follows below). With this in mind, you could consider what makes popular posts popular by looking at the number of upvotes from users (stored in "score"): which posts caught people's attention? Beyond upvotes, can downvotes be taken into account when gauging popularity? You could also examine user engagement through comms_num, which records the number of comments left on each post: does an increase in comments lead to an increase in upvotes?
Additionally, you could study how posts were received by reading the body texts stored under 'body'. From this, you can build insights into the overall discussion per topic: is it conversational or argumentative? Are there regional trends among commenters who emphasize different elements of their Pokémon-related discussions?
This opens up possibilities for further investigation into Pokémon-related phenomena through Reddit discussion: finding out what makes certain topics prevalent while others stay obscure, seeing how different world regions appear within certain conversations, and understanding specific nuances within conversation trees between commenters.
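A small, hedged R example of the keyword search described above, using pokemon.csv and the columns from the Columns section:

    library(dplyr)
    library(readr)

    pokemon <- read_csv("pokemon.csv", show_col_types = FALSE)

    # How many posts mention a given keyword in the title, and how are they received?
    pokemon |>
      filter(grepl("Magikarp", title, ignore.case = TRUE)) |>
      summarise(posts = n(),
                median_score = median(score),
                median_comments = median(comms_num))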

    Research Ideas

    • Analyzing the influence of post upvotes in user engagement and conversation outcomes
    • Investigating the frequency of topics discussed in PokĆ©mon related conversations
    • Examining the correlation between post score and number of comments on each post

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: pokemon.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the post. (String) |
| score | The number of upvotes the post has received. (Integer) |
| url | The URL of the post. (String) |
| comms_num | The number of comments the post has received. (Integer) |
| created | The date and time the post was created. (DateTime) |
| body | The body text of the post. (String) |
| timestamp | The timestamp of the post. (Integer) |

    Acknowledgements

If you use this dataset in your research, please credit the original authors and Reddit.

  12. Reddit's /r/funny Subreddit

    • kaggle.com
    zip
    Updated Dec 15, 2022
    Cite
    The Devastator (2022). Reddit's /r/funny Subreddit [Dataset]. https://www.kaggle.com/datasets/thedevastator/explore-reddit-s-funny-subreddit-analyze-communi/code
    Explore at:
zip (93052 bytes). Available download formats
    Dataset updated
    Dec 15, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Explore Reddit's Funny Subreddit & Analyze Community Engagement!

    Quantifying Community Interaction Through Reddit Posts

    By Reddit [source]

    About this dataset

    This dataset offers an insightful analysis into one of the most talked-about online communities today: Reddit. Specifically, we are focusing on the funny subreddit, a subsection of the main forum that enjoys the highest engagement across all Reddit users. Not only does this dataset include post titles, scores and other details regarding post creation and engagement; it also includes powerful metrics to measure active community interaction such as comment numbers and timestamps. By diving deep into this data, we can paint a fuller picture in terms of what people find funny in our digital age - how well do certain topics draw responses? How does sentiment change over time? And how can community managers use these insights to grow their platforms and better engage their userbase for lasting success? With this comprehensive dataset at your fingertips, you'll be able to answer each question - and more

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Introduction

    Welcome to the Reddit's Funny Subreddit Kaggle Dataset. In this dataset you will explore and analyze posts from the popular subreddit to gain insights into community engagement. With this dataset, you can understand user engagement trends and learn how people interact with content from different topics. This guide will provide further information about how to use this dataset for your data analysis projects.

    Important Columns

This dataset contains columns such as: title, score, url, comms_num (number of comments), created (date of post), body (content of post), and timestamp. All of these columns are important for understanding user interactions with each post on Reddit's funny subreddit.

    Exploratory Data Analysis

In order to get a better understanding of user engagement on the subreddit, some initial exploration is necessary. Using graphical tools such as histograms or boxplots, we can easily understand basic parameter values, like scores or comment counts for each post, by observing their distribution over time or across different parameters (for example, type of joke).

    Dimensionality reduction

For more advanced analytics, it is recommended to apply a dimensionality reduction technique such as PCA before tackling the real analysis tasks, so that similar posts can be grouped together and conclusions can be drawn more confidently, leaving out conflicting or irrelevant variables that could otherwise cloud data-driven decisions later on.
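A hedged R sketch of that idea on funny.csv: build a few simple numeric features per post (title length is an illustrative, derived feature) and run PCA with base R's prcomp:

    library(dplyr)
    library(readr)

    funny <- read_csv("funny.csv", show_col_types = FALSE)

    # Simple numeric features per post
    features <- funny |>
      transmute(score,
                comms_num,
                title_length = nchar(title)) |>
      na.omit()

    pca <- prcomp(features, center = TRUE, scale. = TRUE)
    summary(pca)   # variance explained by each component
    head(pca$x)    # posts projected onto the principal components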

    Further Guidance

If further assistance with this dataset is required, readings on topics like text mining, natural language processing, and machine learning are highly recommended; they explain in detail the steps that can unlock greater value from Reddit's funny subreddit and give readers and researchers ideas about how to approach analyzing text-based online platforms such as Reddit in data analytics and data science work.

    Research Ideas

    • Analyzing post title length vs. engagement (i.e., score, comments).
    • Comparing sentiment of post bodies between posts that have high/low scores and comments.
    • Comparing topics within the posts that have high/low scores and comments to look for any differences in content or style of writing based on engagement level

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: funny.csv | Column name | Description | |:--------------|:------------------------...

  13. Comprehensive Literary Greats Dataset

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    The Devastator (2023). Comprehensive Literary Greats Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-literary-greats-dataset
    Explore at:
    zip(29940528 bytes)Available download formats
    Dataset updated
    Jan 29, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Comprehensive Literary Greats Dataset

    50,000+ Books Rated and Awarded Across Language, Genre, and Format

    By [source]

    About this dataset

    This remarkable dataset provides a collection of over 50,000 books spanning literature, poetry, and authorship. For each book, users can access a wealth of information: the title, the author(s), the average rating given by readers and critics, a brief description of its plot or characteristics, the language it is written in, a unique ISBN that lets potential buyers locate it with ease, the genres it belongs to, any awards it has won, and the characters that inhabit its story world.

    Additionally, reader opinion on exceptional books is easy to gauge thanks to bbeScore (the "best books ever" score) and the detailed rating breakdowns in the ratingsByStars column. Whether a title is a long-established classic or a recent release, its reception can be evaluated through reader engagement: the likedPercent column (the percentage of readers who liked the book), bbeVotes (the number of votes cast), and publication-date fields including firstPublishDate.

    Aspiring literature researchers, literary historians and anyone hunting for hidden literary gems will benefit from delving into this collection: 25 variables on different novels and poets, presented in the Kaggle open-source dataset "Best Books Ever: A Comprehensive Historical Collection of Literary Greats". What worlds await you?

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Whether you are a student, researcher, or enthusiast of literature, this dataset provides a valuable source for exploring literary works from varied time periods and genres. By accessing all 25 variables in the dataset, readers have the opportunity to use them for building visualizations, creating new analysis tools and models, or finding books you might be interested in reading.

    First, after downloading the dataset into the Kaggle Notebooks platform or another programming environment of your choice (such as RStudio or a Python Jupyter notebook with pandas), make sure the data is arranged into columns with clearly labelled names; this will help you see which variable holds which piece of information. Afterwards, explore each variable, looking for patterns across particular titles or interesting findings about certain authors or ratings relevant to your research interests.

    Utilize the vital columns Title (title), Author (author), Rating (rating), Description (description), Language (language), Genres (genres) and Characters (characters); these can help you discover trends between books by style of composition, character types and so on. Move further down to the more specific details offered by Book Format (bookFormat), Edition (edition) and Pages (pages), and peruse the publisher information along with Publish Date (publishDate). Also take note of the Awards column for the recognition different titles have received, observe how many ratings each text has collected in Number of Ratings (numRatings), analyse reader feedback through Ratings By Stars (ratingsByStars), and view the share of readers who liked a book in Liked Percent (likedPercent).

    Apart from the more accessible factors mentioned above, delve deeper into the remaining fields: Setting (setting), Cover Image (coverImg), BBE Score (bbeScore) and BBE Votes (bbeVotes). These provide greater insight when trying to explain why a certain book has made its way onto the Goodreads top selections list. To estimate value, test the Price (price) column too, and check whether some texts retain large popularity despite rather costly publishing options currently available on the market.

    Finally, combine the different aspects observed for individual titles to create personalised recommendations based on the comprehensive lists provided. To achieve that, use the ISBN code, compare the publication and first-publication dates recorded, and check the awards labels to put each book's progress over the years into context. A short loading-and-aggregation sketch follows below.
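
    As a rough illustration under assumed names (the CSV file name and exact column spellings may differ in the actual download), the sketch below loads the book list with pandas and aggregates average ratings by language and by genre.

    ```python
    import ast
    import pandas as pd

    # File name is an assumption; adjust it to match the actual download
    books = pd.read_csv("books.csv")

    # Average rating and number of books by language
    by_language = (books.groupby("language")
                        .agg(avg_rating=("rating", "mean"), n_books=("title", "count"))
                        .sort_values("avg_rating", ascending=False))
    print(by_language.head(10))

    # The genres column is often stored as a stringified list; parse it defensively
    def parse_genres(value):
        try:
            parsed = ast.literal_eval(value)
            return parsed if isinstance(parsed, list) else [value]
        except (ValueError, SyntaxError, TypeError):
            return [value]

    genres = books.assign(genre=books["genres"].map(parse_genres)).explode("genre")
    print(genres.groupby("genre")["rating"].mean().sort_values(ascending=False).head(10))
    ```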

    Research Ideas

    • Creating a web or mobile...
  14. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Explore at:
    zip(9619636 bytes)Available download formats
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep [source]

    About this dataset

    This dataset is a collection of posts and comments made on Reddit's /r/datasets board, covering everything from the subreddit's inception to March 1, 2022. The data was procured using SocialGrep. It does not include usernames, in order to preserve users' anonymity and to prevent targeted harassment.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    In order to use this dataset, you will need software that can open CSV files, such as LibreOffice or a plain-text editor, installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.

    Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.

    In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see whether there is a correlation between positive/negative sentiment and upvotes/downvotes. A minimal sketch follows below.
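
    A hedged pandas sketch along those lines, using the two file names listed in the Columns section below (the sentiment column is documented as a string, so it is coerced to numeric defensively):

    ```python
    import pandas as pd

    posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")
    comments = pd.read_csv("the-reddit-dataset-dataset-comments.csv")

    # Average score of posts on /r/datasets, and the most common link domains
    print("Average post score:", posts["score"].mean())
    print(posts["domain"].value_counts().head(10))

    # Relationship between comment sentiment and comment score
    comments["sentiment_num"] = pd.to_numeric(comments["sentiment"], errors="coerce")
    print(comments[["sentiment_num", "score"]].corr())
    ```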

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiments of post and comments on Reddit's /r/datasets board

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | body           | The body of the post. (String)                      |
    | sentiment      | The sentiment of the post. (String)                 |
    | score          | The score of the post. (Integer)                    |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | score          | The score of the post. (Integer)                    |
    | domain         | The domain of the post. (String)                    |
    | url            | The URL of the post. (String)                       |
    | selftext       | The self-text of the post. (String)                 |
    | title          | The title of the post. (String)                     |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and SocialGrep.

  15. Articles metadata from CrossRef

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Kea Kohv (2025). Articles metadata from CrossRef [Dataset]. https://www.kaggle.com/datasets/keakohv/articles-doi-metadata
    Explore at:
    zip(72982417 bytes)Available download formats
    Dataset updated
    Aug 1, 2025
    Authors
    Kea Kohv
    Description

    This data originates from the Crossref API. It contains metadata for the articles in the Data Citation Corpus whose cited dataset in the citation pair is identified by a DOI.

    How to recreate this dataset in Jupyter Notebook:

    1) Prepare the list of articles to query:

    ```python
    import pandas as pd

    # See: https://www.kaggle.com/datasets/keakohv/data-citation-coprus-v4-1-eupmc-and-datacite
    CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

    # Load the citation pairs from the Parquet file
    citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

    # Remove all rows where "https" appears in the 'dataset' column but "doi.org" does not
    citation_pairs = citation_pairs[
        ~((citation_pairs['dataset'].str.contains("https"))
          & (~citation_pairs['dataset'].str.contains("doi.org")))
    ]

    # Remove all rows where figshare is in the dataset name
    citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

    # Keep only citation pairs whose dataset identifier is a DOI
    citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
    citation_pairs_doi = citation_pairs[citation_pairs['is_doi'] == True].copy()

    # Unique article DOIs, with underscores converted back to slashes
    articles = list(set(citation_pairs_doi['publication'].to_list()))
    articles = [doi.replace("_", "/") for doi in articles]

    # Save the list of articles to a file, one DOI per line
    with open("articles.txt", "w") as f:
        for article in articles:
            f.write(f"{article}\n")
    ```

    2) Query articles from CrossRef API

    
    %%writefile enrich.py
    #!pip install -q aiolimiter
    import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
    from aiolimiter import AsyncLimiter
    
    # ---------- config ----------
    HEADERS  = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"} # Put your email here
    MAX_RPS  = 45           # polite pool limit (50), leave head-room
    BATCH_SIZE = 10_000         # rows per INSERT
    DB_PATH  = pathlib.Path("crossref.sqlite").resolve()
    ARTICLES  = pathlib.Path("articles.txt")
    # -----------------------------
    
    # ---- platform tweak: prefer selector loop on Windows ----
    if sys.platform == "win32":
      asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    
    # ---- read the DOI list ----
    with ARTICLES.open(encoding="utf-8") as f:
      DOIS = [line.strip() for line in f if line.strip()]
    
    # ---- make sure DB & table exist BEFORE the async part ----
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(DB_PATH) as db:
      db.execute("""
        CREATE TABLE IF NOT EXISTS works (
          doi  TEXT PRIMARY KEY,
          json TEXT
        )
      """)
      db.execute("PRAGMA journal_mode=WAL;")   # better concurrency
    
    # ---------- async section ----------
    limiter = AsyncLimiter(MAX_RPS, 1)       # 45 req / second
    sem   = asyncio.Semaphore(100)        # cap overall concurrency
    
    async def fetch_one(session, doi: str):
      url = f"https://api.crossref.org/works/{doi}"
      async with limiter, sem:
        try:
          async with session.get(url, headers=HEADERS, timeout=10) as r:
            if r.status == 404:         # common "not found"
              return doi, None
            r.raise_for_status()        # propagate other 4xx/5xx
            return doi, await r.json()
        except Exception as e:
          return doi, None            # log later, don’t crash
    
    async def main():
      start = time.perf_counter()
      db  = sqlite3.connect(DB_PATH)        # KEEP ONE connection
      db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak
    
      async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
          slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
          tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
          results = await asyncio.gather(*tasks)    # all tuples, no exc
    
          good_rows, bad_dois = [], []
          for doi, payload in results:
            if payload is None:
              bad_dois.append(doi)
            else:
              good_rows.append((doi, orjson.dumps(payload).decode()))
    
          if good_rows:
            db.executemany(
              "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
              good_rows,
            )
            db.commit()
    
          if bad_dois:                # append for later retry
            with open("failures.log", "a", encoding="utf-8") as fh:
              fh.writelines(f"{d}
    " for d in bad_dois)
    
          done = chunk_start + len(slice_)
          rate = done / (time.perf_counter() - start)
          print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    
      db.close()
    
    if __name__ == "__main__":
      asyncio.run(main())
    

    Then run the script from a notebook cell with `!python enrich.py` (or `python enrich.py` from a terminal).

    3) Finally extract the necessary fields

    import sqlite3
    import orjson
    i...
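
    The snippet above is truncated in the source. As a rough sketch only, one way the extraction step might look is to read the stored JSON back out of crossref.sqlite and pull a few fields from the Crossref message; which fields to keep is an assumption and depends on your needs.

    ```python
    import sqlite3

    import orjson
    import pandas as pd

    rows = []
    with sqlite3.connect("crossref.sqlite") as db:
        for doi, raw in db.execute("SELECT doi, json FROM works"):
            message = orjson.loads(raw).get("message", {})
            rows.append({
                "doi": doi,
                "title": (message.get("title") or [None])[0],
                "container_title": (message.get("container-title") or [None])[0],
                "type": message.get("type"),
                "issued_year": ((message.get("issued") or {}).get("date-parts") or [[None]])[0][0],
            })

    articles_meta = pd.DataFrame(rows)
    articles_meta.to_csv("articles_metadata.csv", index=False)
    ```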
    
  16. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Explore at:
    zip(629100 bytes)Available download formats
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: to predict the prices of houses in the City of Melbourne.
    • Approach: decision tree and random forest (screenshot: Picture1.png); an analogous Python sketch appears after this list.
    • Data cleaning:
    • The Date column is read as a character vector and is converted to a date vector using the 'lubridate' library.
    • A new column, age, is created because the age of a house can be a factor in its price; it is computed by extracting the year from the 'Date' column and subtracting the 'Year Built' column.
    • 11566 records with missing values are removed.
    • Columns which are not significant are dropped: 'X', 'suburb', 'address' (zipcode is kept since it serves the same purpose as suburb and address), 'type', 'method', 'SellerG', 'date', 'Car', 'year built', 'Council Area' and 'Region Name'.
    • The data is split into 'train' and 'test' sets in an 80/20 ratio using the sample function.
    • The libraries 'rpart', 'rpart.plot', 'rattle' and 'RColorBrewer' are loaded.
    • A decision tree is run using the rpart function, with 'Price' as the dependent variable (screenshot: Picture2.png).
    • The average price across 5464 houses is $1084349.
    • Where building area is less than 200.5, the average price of 4582 houses is $931445; where building area is less than 200.5 and the age of the building is less than 67.5 years, the average price of 3385 houses is $799299.6.
    • The highest average price, $4801538 across 13 houses, occurs where distance is lower than 5.35 and building area is greater than 280.5 (screenshot: Decision Tree Plot.jpeg).
    • The caret package is used for parameter tuning; the optimal complexity parameter found is 0.01 with an RMSE of 445197.9 (screenshot: Picture3.png).
    • The Metrics library gives an RMSE of $392107, a MAPE of 0.297 (about a 29.7% mean absolute percentage error) and an MAE of $272015.4.
    • The variables 'postcode', longitude and building area are the most important variables.
    • test$Price indicates the actual price and test$predicted the predicted price for six particular houses (screenshot: Picture4.png).
    • Random forest is first run with its default parameters on the train data (screenshot: Picture5.png).
    • The variable-importance plot indicates that building area, age of the house and distance are the variables that most affect the price of a house (screenshot: Random Forest Variables Importance.jpeg).
    • With the default parameters, RMSE is $250426.2, MAPE is 0.147 (about a 14.7% mean absolute percentage error) and MAE is $151657.7.
    • The error starts to level off between 100 and 200 trees, with almost no further reduction afterwards, so ntree = 200 is a reasonable choice (screenshot: Ntree Plot.jpeg).
    • Tuning the model shows that mtry = 3 has the lowest out-of-bag error.
    • Using the caret package with 5-fold cross-validation, RMSE is $252216.10, MAPE is 0.146 and MAE is $151669.4.
    • We can conclude that random forest gives more accurate results than the decision tree.
    • In random forest, the default parameters (ntree = 500) give lower RMSE and MAPE than ntree = 200, so we can proceed with those parameters.
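
    The steps above are carried out in R (rpart, randomForest, caret). Purely as an illustrative analogue, here is a hedged Python sketch of the same idea using pandas and scikit-learn; the file name melbourne_housing.csv and the exact column names are assumptions and must be matched to the actual data.

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # File and column names are assumptions; adjust to the actual Melbourne dataset
    df = pd.read_csv("melbourne_housing.csv")
    df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors="coerce")
    df["Age"] = df["Date"].dt.year - df["YearBuilt"]

    features = ["BuildingArea", "Age", "Distance", "Rooms", "Landsize", "Postcode"]
    df = df.dropna(subset=features + ["Price"])

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Price"], test_size=0.2, random_state=42)

    for name, model in [("decision tree", DecisionTreeRegressor(random_state=42)),
                        ("random forest", RandomForestRegressor(n_estimators=500, random_state=42))]:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = mean_squared_error(y_test, pred) ** 0.5
        print(f"{name}: RMSE={rmse:,.0f}  MAE={mean_absolute_error(y_test, pred):,.0f}")
    ```
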
  17. Record High Temperatures for US Cities

    • kaggle.com
    zip
    Updated Jan 18, 2023
    Cite
    The Devastator (2023). Record High Temperatures for US Cities [Dataset]. https://www.kaggle.com/datasets/thedevastator/record-high-temperatures-for-us-cities-in-2015
    Explore at:
    zip(9955 bytes)Available download formats
    Dataset updated
    Jan 18, 2023
    Authors
    The Devastator
    Area covered
    United States
    Description

    Record High Temperatures for US Cities

    Clearly Defined Monthly Data

    By Gary Hoover [source]

    About this dataset

    This dataset contains all the record-breaking temperatures for your favorite US cities in 2015. With this information, you can prepare for any unexpected weather that may come your way in the future, or just revel in the beauty of these high heat spells from days past! With record highs spanning from January to December, stay warm (or cool) with these handy historical temperature data points

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains the record high temperatures for various US cities through the year 2015. It includes a column for each individual month, along with a column for the record high over the entire year. The data is sourced from www.weatherbase.com and can be used to analyze which cities experienced hot summers, or to compare temperature variations between different regions.

    Here are some useful tips on how to work with this dataset:
    - Analyze individual monthly temperatures: compare high temperatures across months and locations to identify areas that experienced particularly hot summers or colder winters.
    - Compare annual versus monthly data: contrast average annual highs with monthly highs to understand temperature trends at a given location across all four seasons of a single year, or explore how different regions vary by yearly weather patterns as well as within given months of any one year.
    - Heatmap analysis: plot the temperature information as a heatmap to pinpoint regions that experience unusual weather conditions or higher-than-average warmth compared with cooler pockets of similar geographic size (see the sketch below).
    - Statistically model the relationships between independent variables (temperature variations by month, region/city and more) and dependent variables (e.g., tourism volumes), using regression techniques such as linear models (OLS), ARIMA models, nonlinear transformations and other methods in statistical software such as STATA or the R programming language.
    - Look into climate trends over longer periods: adjust the time frames included in analyses beyond 2018 when possible by expanding upon the monthly station observations already present within the study timeframe, and take advantage of digitally available historical temperature readings rather than relying only on printed reports.

    With these helpful tips, you can get started analyzing record high temperatures for US cities during 2015 using our 'Record High Temperatures for US Cities' dataset!
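
    For example, a minimal heatmap sketch along the lines of the tip above, assuming the CSV listed in the Columns section below (the month columns are assumed to run through DEC, although the documented list is truncated, and the temperatures are assumed to be in °F):

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt

    temps = pd.read_csv("Highest temperature on record through 2015 by US City.csv")
    months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
    data = temps.set_index("CITY")[months]

    fig, ax = plt.subplots(figsize=(8, 10))
    im = ax.imshow(data.values, aspect="auto", cmap="hot")
    ax.set_xticks(range(len(months)))
    ax.set_xticklabels(months)
    ax.set_yticks(range(len(data)))
    ax.set_yticklabels(data.index)
    ax.set_title("Record high temperatures by US city and month (through 2015)")
    fig.colorbar(im, ax=ax, label="Record high")
    plt.tight_layout()
    plt.show()
    ```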

    Research Ideas

    • Create a heat map chart of US cities representing the highest temperature on record for each city from 2015.
    • Analyze trends in monthly high temperatures in order to predict future climate shifts and weather patterns across different US cities.
    • Track and compare monthly high temperature records for all US cities to identify regional hot spots with higher than average records and potential implications for agriculture and resource management planning

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: Highest temperature on record through 2015 by US City.csv

    | Column name | Description                                                    |
    |:------------|:---------------------------------------------------------------|
    | CITY        | Name of the city. (String)                                      |
    | JAN         | Record high temperature for the month of January. (Integer)    |
    | FEB         | Record high temperature for the month of February. (Integer)   |
    | MAR         | Record high temperature for the month of March. (Integer)      |
    | APR         | Record high temperature for the month of April. (Integer)      |
    | MAY         | Record high temperature for the month of May. (Integer)        |
    | JUN         | Record high temperature for the month of June. (Integer)       |
    | JUL         | Record high temperature for the month of July. (Integer)       |
    | AUG         | Record high temperature for the month of August. (Integer)     |
    | SEP         | Record high temperature for the month of September. (Integer)  |
    | OCT         | Record high temperature for the month of October. (Integer)    |
    | ...         | ...                                                             |

  18. Reddit: /r/science

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/science [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-reddit-r-science-subreddit-interaction
    Explore at:
    zip(205948 bytes)Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/science

    Investigating Social Media Interactions and Popularity Metrics

    By Reddit [source]

    About this dataset

    The Reddit /r/science dataset offers an in-depth exploration of the science-related conversations and content taking place on the popular website Reddit. It provides valuable insights into user interactions, sentiment and popularity trends across science topics ranging from astrophysics to neuroscience. The data comprises key features such as post titles, post scores, comment counts, creation times and post URLs, which help us understand the dynamics and sentiment of the scientific discussions within this popular forum. Using this dataset, we can analyse how a certain topic has changed over time in terms of relevance, and what kinds of posts are most successful at gaining attention from users. Ultimately, we can leverage this analysis to better comprehend shifts in public opinion towards various aspects of current scientific knowledge.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Research Ideas

    • Analyzing topic trends within the subreddit over time, in order to understand which topics are most popular with readers.
    • Identifying relationships between levels of interaction (comments and upvotes) and sentiment (through text analysis), to track how users react to certain topics.
    • Tracking post and user metrics over time (such as average post length or number of comments per post), in order to monitor changes in the outlook of the subreddit community as a whole (a minimal sketch follows below).
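
    A small, hedged sketch of the first ideas, assuming science.csv with the columns documented below (the datetime parsing of 'created' may need adjusting):

    ```python
    import pandas as pd

    sci = pd.read_csv("science.csv")

    # Relationship between upvotes and comment activity
    print(sci[["score", "comms_num"]].corr())

    # Posts per month and median score over time
    sci["created"] = pd.to_datetime(sci["created"], errors="coerce")
    by_month = (sci.dropna(subset=["created"])
                   .set_index("created")["score"]
                   .resample("M")
                   .agg(["count", "median"]))
    print(by_month.tail(12))
    ```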

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: science.csv

    | Column name | Description                                                 |
    |:------------|:-------------------------------------------------------------|
    | title       | The title of the post. (String)                              |
    | score       | The number of upvotes the post has received. (Integer)       |
    | url         | The URL associated with the post. (String)                   |
    | comms_num   | The number of comments associated with the post. (Integer)   |
    | created     | The date and time the post was created. (DateTime)           |
    | body        | The content of the post. (String)                            |
    | timestamp   | The timestamp of the post. (DateTime)                        |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  19. KC_House Dataset -Linear Regression of Home Prices

    • kaggle.com
    zip
    Updated May 15, 2023
    Cite
    vikram amin (2023). KC_House Dataset -Linear Regression of Home Prices [Dataset]. https://www.kaggle.com/datasets/vikramamin/kc-house-dataset-home-prices
    Explore at:
    zip(776807 bytes)Available download formats
    Dataset updated
    May 15, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    1. Dataset: House pricing dataset containing 21 columns and 21613 rows.
    2. Programming Language : R
    3. Objective : To predict house prices by creating a model
    4. Steps: A) Import the dataset. B) Install and run libraries. C) Data cleaning: remove null values, change data types, and drop columns which are not important. D) Data analysis: (i) a linear regression model was used to establish the relationship between the dependent variable (price) and the other independent variables; (ii) outliers were identified and removed; (iii) the regression model was run once again after removing the outliers; (iv) multiple R-squared was calculated, indicating that the independent variables can explain 73% of the variation in the dependent variable; (v) the p-value was less than alpha = 0.05, which shows the result is statistically significant; (vi) the meaning of the coefficients was interpreted; (vii) the assumption of multicollinearity was checked; (viii) the VIF (variance inflation factor) was calculated for all the independent variables and its absolute value was found to be less than 5, so there is no threat of multicollinearity and we can proceed with the independent variables specified. A rough Python analogue of this workflow is sketched below.
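
    The steps above describe an R workflow. As an illustration only, a hedged Python sketch of the same regression-plus-VIF check using statsmodels, with assumed file and column names from the familiar King County housing data, might look like this:

    ```python
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # File and column names are assumptions; adjust to the actual dataset
    df = pd.read_csv("kc_house_data.csv")
    features = ["bedrooms", "bathrooms", "sqft_living", "floors", "grade", "sqft_lot"]
    df = df.dropna(subset=features + ["price"])

    X = sm.add_constant(df[features])
    y = df["price"]

    model = sm.OLS(y, X).fit()
    print(model.rsquared)   # share of price variation explained
    print(model.pvalues)    # significance of each coefficient

    # Multicollinearity check: VIF below ~5 is usually considered acceptable
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns)
    print(vif.drop("const"))
    ```
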
  20. AskReddit (Submissions & Comments)

    • kaggle.com
    zip
    Updated Dec 15, 2022
    Cite
    The Devastator (2022). AskReddit (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-askreddit-trends-a-study-of-subreddit
    Explore at:
    zip(124445 bytes)Available download formats
    Dataset updated
    Dec 15, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Uncovering AskReddit Trends: A Study of Subreddit Engagement

    User Engagement Through Posts

    By Reddit [source]

    About this dataset

    This comprehensive dataset contains information from the AskReddit subreddit on Reddit.com, with over 8 columns of data providing valuable insights into user engagement and interaction. It includes the title of each post, their score, how many comments are associated with them, and when they were created/posted. Use this data to gain insight into how different posts engage users on Reddit, what kind of content resonates with readers, and how user engagement has shifted over time. Learn more about AskReddit posts and analyze the patterns that emerge as you examine user engagement across different types of content

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    One of the most popular subreddits on Reddit is AskReddit – a platform for users to post questions and share knowledge with other Redditors. As active askers, commenters, and posters, it’s essential to gain insight into how people engage in the subreddit.

    Using this dataset, you can begin your exploration by examining distributions related to post scores (score), the number of comments on posts (comms_num), and the time elapsed between when posts are made and when they receive replies (created vs timestamp). There may also be interesting correlations between features such as post title length (title) and average words per comment (body). The ultimate goal is to uncover key trends in user behaviour and identify features which help predict outcomes, so that insight into various kinds of engagement becomes easier to access.

    Utilizing this dataset can provide a valuable understanding of how popular and active topics, whether political or scientific, are discussed on AskReddit, which in turn can improve the experience for both moderators and users (a minimal sketch follows below). Happy exploring!
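
    A minimal sketch of that kind of exploration, assuming AskReddit.csv with the columns documented below:

    ```python
    import pandas as pd

    ask = pd.read_csv("AskReddit.csv")

    # Does title length relate to engagement?
    ask["title_len"] = ask["title"].str.len()
    print(ask[["title_len", "score", "comms_num"]].corr())

    # What do the most-commented questions look like?
    print(ask.sort_values("comms_num", ascending=False)[["title", "score", "comms_num"]].head(10))
    ```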

    Research Ideas

    • Examining the influence of post titles on user engagement, such as looking at the correlation between more descriptive title lengths and higher scores or comment counts
    • Utilizing natural language processing techniques to analyze the body of posts in order to gain more insight into user attitudes and opinions
    • Studying post timing effects by tracking changes in user engagement over time for various subject topics as well as understanding when it may be best to create a post for maximum exposure

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: AskReddit.csv

    | Column name | Description                                               |
    |:------------|:-----------------------------------------------------------|
    | title       | The title of the post. (String)                            |
    | score       | The number of upvotes the post has received. (Integer)     |
    | url         | The URL of the post. (String)                              |
    | comms_num   | The number of comments the post has received. (Integer)    |
    | created     | The date and time the post was created. (Datetime)         |
    | body        | The body of the post. (String)                             |
    | timestamp   | The timestamp of the post. (Integer)                       |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

