21 datasets found
  1. France Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). France Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/france-weekly-real-estate-listings-2022-2023
    Explore at:
zip (2750497 bytes). Available download formats
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    France
    Description

These Kaggle datasets provide real estate listings downloaded from the French market, captured from Seloger, a leading platform in France, mirroring the approach taken for the US dataset (from Redfin) and the UK dataset (from Zoopla). They encompass detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.

The cleaning process mirrored that of the US dataset: removing irrelevant features, normalizing variable names for consistency with the USA and UK datasets, and adjusting variable value ranges to remove extreme outliers. To add depth, external factors such as inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on the drivers of France's real estate market.

    For exact column descriptions, see columns for France_clean_unique.csv and my thesis.

Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see the University Library entry and click Online Access -> Hlavni prace (main thesis).

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

Let me know if you want to see how I got from the raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.
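To make the last of those steps concrete, here is a minimal, hypothetical R sketch of deduplicating, renaming, and writing the merged file; the snapshot folder and the renamed column names are assumptions for illustration, not taken from the dataset itself.

    library(dplyr)
    library(readr)

    # Assumed layout: one CSV per weekly snapshot in a local snapshots/ folder (hypothetical path).
    files <- list.files("snapshots", pattern = "\\.csv$", full.names = TRUE)

    france <- files |>
      lapply(read_csv, show_col_types = FALSE) |>
      bind_rows() |>
      distinct() |>                                  # remove duplicate listings across weeks
      rename(price_eur = price, area_m2 = surface)   # hypothetical renames for cross-country consistency

    write_csv(france, "France_clean_unique.csv")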

  2. Reddit: /r/news

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/news [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-popularity-and-user-engagement-trends/discussion
    Explore at:
zip (146481 bytes). Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/news

    Exploring Topics, Scores, and Engagement

    By Reddit [source]

    About this dataset

This dataset provides an in-depth look at what communities find important and engaging in the news. With this data, researchers can discover trends in user engagement and popular topics within subreddits. By examining the "score" and "comms_num" columns, researchers can pinpoint which topics are most liked, discussed, or shared. They can also gain insight not only into how popular a topic is but into how it grows over time. Additionally, by exploring the body column, researchers can understand which types of news stories drive conversation within particular subreddits, providing an opportunity for deeper analysis of each subreddit's community dynamics.

The dataset includes eight columns: title, score, id, url, comms_num, created, body, and timestamp, which can help identify key insights into user engagement among popular subreddits. With this data we may also determine relationships between topics of discussion and their impact on user engagement, allowing a better understanding of issue-based conversations online as well as emerging trends in online news consumption habits.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset is useful for those who are looking to gain insight into the popularity and user engagement of specific subreddits. The data includes 8 different columns including title, score, id, url, comms_num, created, body and timestamp. This can provide valuable information about how users view and interact with particular topics across various subreddits.

    In this guide we’ll look at how you can use this dataset to uncover trends in user engagement on topics within specific subreddits as well as measure the overall popularity of these topics within a subreddit.

1) Analyzing score: by analyzing the "score" column you can determine which news stories are popular in a particular subreddit and which are not, based on how many upvotes each story has received. With this data you can identify trends in the types of stories users preferred within a particular subreddit over time.

2) Analyzing comms_num: similarly, you can analyze the "comms_num" column to see which news stories had more engagement from users by tracking the number of comments on each post. This shows which types of stories tend to draw more comment activity in certain subreddits, whether over a single day or an extended period such as multiple weeks or months. 3) Analyzing body: by looking at the "body" column for each post, researchers can better understand which kinds of topics and news draw attention among specific Reddit communities. With that complete picture, researchers have access not only to data measuring Reddit buzz but also to the topic discussions and comments themselves, helping to explain why certain posts are more popular or receive more comments than others.

Overall, this dataset provides valuable insights about user engagement around topics trending across subreddits, giving anyone researching such questions an easier way to access those insights in one place.
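As a quick, hedged illustration of steps 1) and 2), the following R snippet loads news.csv (the file named in the Columns section) and lists the highest-scoring and most-commented posts; column names follow the description above.

    library(dplyr)
    library(readr)

    news <- read_csv("news.csv", show_col_types = FALSE)

    # 1) Most upvoted stories
    news |> arrange(desc(score)) |> select(title, score) |> head(10)

    # 2) Stories that drew the most comment activity
    news |> arrange(desc(comms_num)) |> select(title, comms_num) |> head(10)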

    Research Ideas

    • Grouping news topics within particular subreddits and assessing the overall popularity of those topics in terms of scores/user engagement.
    • Correlating user engagement with certain news topics to understand how they influence discussion or reactions on a subreddit.
    • Examining the potential correlation between score and the actual body content of a given post to assess what types of content are most successful in gaining interest from users and creating positive engagement for posts

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: news.csv | Column name | Description ...

  3. Data from: Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Nov 13, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Water Temperature of Lakes in the Conterminous U.S. Using the Landsat 8 Analysis Ready Dataset Raster Images from 2013-2023 [Dataset]. https://catalog.data.gov/dataset/water-temperature-of-lakes-in-the-conterminous-u-s-using-the-landsat-8-analysis-ready-2013
    Explore at:
    Dataset updated
    Nov 13, 2025
    Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Contiguous United States, United States
    Description

This data release contains lake and reservoir water surface temperature summary statistics calculated from Landsat 8 Analysis Ready Dataset (ARD) images available within the Conterminous United States (CONUS) from 2013-2023. All zip files within this data release contain nested directories using .parquet files to store the data. The file example_script_for_using_parquet.R contains example code for using the R arrow package (Richardson and others, 2024) to open and query the nested .parquet files.

Limitations of this dataset include:
- All biases inherent to the Landsat Surface Temperature product are retained in this dataset, which can produce unrealistically high or low estimates of water temperature. This is observed to happen, for example, in cases with partial cloud coverage over a waterbody.
- Some waterbodies are split between multiple Landsat Analysis Ready Data tiles or orbit footprints. In these cases, multiple waterbody-wide statistics may be reported, one for each data tile. The deepest point values are extracted and reported for the tile covering the deepest point. A total of 947 waterbodies are split between multiple tiles (see the multiple_tiles = "yes" column of site_id_tile_hv_crosswalk.csv).
- Temperature data were not extracted from satellite images with more than 90% cloud cover.
- Temperature data represent skin temperature at the water surface and may differ from temperature observations from below the water surface.

Potential methods for addressing these limitations:
- Identifying and removing unrealistic temperature estimates:
  - Calculate the total percentage of cloud pixels over a given waterbody as percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels), and filter percent_cloud_pixels by a desired percentage of cloud coverage.
  - Remove lakes with a limited number of water pixel values available (wb_dswe1_pixels < 10).
  - Filter to waterbodies where the deepest point is identified as water (dp_dswe = 1).
- Handling waterbodies split between multiple tiles:
  - These waterbodies can be identified using the site_id_tile_hv_crosswalk.csv file (column multiple_tiles = "yes"). A user could combine sections of the same waterbody by spatially weighting the values using the number of water pixels available within each section (wb_dswe1_pixels). This should be done with caution, as some sections of the waterbody may have data available on different dates.

File contents (all zip files contain nested directories of .parquet files; example_script_for_using_parquet.R contains example code for using the R arrow package to open and query them):
- "year_byscene=XXXX.zip" – temperature summary statistics for individual waterbodies and the deepest points (the furthest point from land within a waterbody) within each waterbody by scene_date (when the satellite passed over). Individual waterbodies are identified by the National Hydrography Dataset (NHD) permanent_identifier included within the site_id column. Some of the .parquet files in the byscene datasets may include only one dummy row of data (identified by tile_hv="000-000"); this happens when no tabular data was extracted from the raster images because of clouds obscuring the image, a tile that covers mostly ocean with a very small amount of land, or other possible reasons. An example file path for this dataset: year_byscene=2023/tile_hv=002-001/part-0.parquet
- "year=XXXX.zip" – summary statistics for individual waterbodies and the deepest points within each waterbody by year (dataset=annual), month (year=0, dataset=monthly), and year-month (dataset=yrmon). The year_byscene=XXXX data are used as input for generating these summary tables, which aggregate temperature data by year, month, and year-month. Aggregated data are not available for the following tiles: 001-004, 001-010, 002-012, 028-013, and 029-012, because these tiles primarily cover ocean with limited land and no output data were generated. An example file path for this dataset: year=2023/dataset=lakes_annual/tile_hv=002-001/part-0.parquet
- "example_script_for_using_parquet.R" – a script with code to download zip files directly from ScienceBase, identify HUC04 basins within a desired Landsat ARD grid tile, download NHDPlus High Resolution data for visualization, use the R arrow package to compile .parquet files in nested directories, and create example static and interactive maps.
- "nhd_HUC04s_ingrid.csv" – a cross-walk file identifying the HUC04 watersheds within each Landsat ARD tile grid.
- "site_id_tile_hv_crosswalk.csv" – a cross-walk file identifying the site_id (nhdhr{permanent_identifier}) within each Landsat ARD tile grid. This file also includes a column (multiple_tiles) identifying site_ids that fall within multiple Landsat ARD tile grids.
- "lst_grid.png" – a map of the Landsat grid tiles labelled by the horizontal-vertical ID.
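As a hedged sketch of the filtering recommendations above, the following R snippet assumes the year_byscene=2023 archive has been unzipped into the working directory and uses the arrow package to open the nested .parquet files; the exact schema should be checked against example_script_for_using_parquet.R.

    library(arrow)
    library(dplyr)

    # Open the nested, hive-partitioned .parquet files (path assumes the zip was extracted locally).
    byscene <- open_dataset("year_byscene=2023", format = "parquet")

    clean <- byscene |>
      mutate(percent_cloud_pixels = wb_dswe9_pixels / (wb_dswe9_pixels + wb_dswe1_pixels)) |>
      filter(
        percent_cloud_pixels < 0.5,   # example cloud-coverage threshold; pick your own
        wb_dswe1_pixels >= 10,        # drop lakes with very few water pixels
        dp_dswe == 1                  # keep waterbodies whose deepest point is water
      ) |>
      collect()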

  4. Reddit: /r/technology (Submissions & Comments)

    • kaggle.com
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/technology (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-technology-insights-through-reddit-di
    Explore at:
Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 18, 2022
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/technology (Submissions & Comments)

    Title, Score, ID, URL, Comment Number, and Timestamp

    By Reddit [source]

    About this dataset

This dataset, labeled Reddit Technology Data, provides thorough insight into the conversations and interactions around technology-related topics shared on Reddit, a well-known Internet discussion forum. It contains discussion titles, scores contributed by Reddit users, the unique IDs of the discussions, the URLs associated with those discussions (if any), comment counts for each discussion thread, and timestamps of when the conversations were initiated. This data is valuable for tech-savvy people wanting to stay up to date with new developments in their field and for professionals looking to keep abreast of industry trends. In short, it is a repository that helps people make sense of what is happening in the technology world at large, inspiring action or simply educating them about forthcoming changes.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

The dataset includes six columns: title, score, url (a link to the discussion page on Reddit), comms_num (comment count), created (the time the post was made), and body (the actual text written for that post or discussion). By analyzing each column separately, you can work out what kind of information users look for across the various aspects of technology-related discussion. You can develop hypotheses about correlations between the factors behind rating or comment count: what do people comment on or react to most, and does a high rating always come along with extremely long comment threads? Researching in this way can surface patterns hidden in social networking sites such as Reddit, which contain a large amount of rich information about users' interest in different tech topics. Similar trends, such as reactions to voice search technology, can be compared against other public forums like Stack Overflow or Facebook posts to visualize users' overall reactions to shared information. These small analyses can yield substantial insights for research purposes and reveal potential business opportunities when monitored over time.
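A small, hedged R sketch of the kind of correlation question raised above, using technology.csv and the columns listed in the Columns section (the 500-character threshold for a "long" body is an arbitrary illustration):

    library(dplyr)
    library(readr)

    technology <- read_csv("technology.csv", show_col_types = FALSE)

    # Is a higher score associated with more comments?
    cor.test(technology$score, technology$comms_num, method = "spearman")

    # Do posts with long bodies score or get commented on differently?
    technology |>
      mutate(long_body = nchar(coalesce(body, "")) > 500) |>
      group_by(long_body) |>
      summarise(posts = n(),
                median_score = median(score),
                median_comments = median(comms_num))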

    Research Ideas

    • Companies can use this dataset to create targeted online marketing campaigns directed towards Reddit users interested in specific areas of technology.
    • Academic researchers can use the data to track and analyze trends in conversations related to technology on Reddit over time.
    • Technology professionals can utilize the comments and discussions on this dataset as a way of gauging public opinion and consumer sentiment towards certain technological advancements or products

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: technology.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the discussion. (String) |
| score | The score of the discussion as measured by Reddit contributors. (Integer) |
| url | The website URL associated with the discussion. (String) |
| comms_num | The number of comments associated with the discussion. (Integer) |
| created | The date and time the discussion was created. (DateTime) |
| body | The body content of the discussion. (String) |
| timestamp | The timestamp of the discussion. (Integer) |

    Acknowledgements

If you use this dataset in your research, please credit the original authors and Reddit.

  5. Data from: [Dataset:] Barro Colorado Forest Census Plot Data (Version 2012)

    • smithsonian.figshare.com
    • search.dataone.org
    pdf
    Updated Apr 18, 2024
    Cite
Richard Condit; Suzanne Lao; Rolando Pérez; Steven B. Dolins; Robin Foster; Stephen Hubbell (2024). [Dataset:] Barro Colorado Forest Census Plot Data (Version 2012) [Dataset]. http://doi.org/10.5479/data.bci.20130603
    Explore at:
pdf. Available download formats
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Smithsonian Tropical Research Institute
    Authors
Richard Condit; Suzanne Lao; Rolando Pérez; Steven B. Dolins; Robin Foster; Stephen Hubbell
    License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Barro Colorado Island
    Description

Abstract: The 50-hectare plot at Barro Colorado Island, Panama, is a 1000 meter by 500 meter rectangle of forest inside of which all woody trees and shrubs with stems at least 1 cm in stem diameter have been censused. Every individual tree in the 50 hectares was permanently numbered with an aluminum tag in 1982, and every individual has been revisited six times since (in 1985, 1990, 1995, 2000, 2005, and 2010). In each census, every tree was measured, mapped and identified to species. Details of the census method are presented in Condit (Tropical forest census plots: Methods and results from Barro Colorado Island, Panama and a comparison with other plots; Springer-Verlag, 1998), and a description of the seven-census results in Condit, Chisholm, and Hubbell (Thirty years of forest census at Barro Colorado and the importance of immigration in maintaining diversity; PLoS ONE, 7:e49826, 2012).

Description:

CITATION TO DATABASE: Condit, R., Lao, S., Pérez, R., Dolins, S.B., Foster, R.B., Hubbell, S.P. 2012. Barro Colorado Forest Census Plot Data, 2012 Version. DOI http://dx.doi.org/10.5479/data.bci.20130603

CO-AUTHORS: Stephen Hubbell and Richard Condit have been principal investigators of the project for over 30 years. They are fully responsible for the field methods and data quality. As such, both request that data users contact them and invite them to be co-authors on publications relying on the data. More recent versions of the data, often with important updates, can be requested directly from R. Condit (conditr@gmail.com).

ACKNOWLEDGMENTS: The following should be acknowledged in publications for contributions to the 50-ha plot project: R. Foster as plot founder and the first botanist able to identify so many trees in a diverse forest; R. Pérez and S. Aguilar for species identification; S. Lao for data management; S. Dolins for database design; plus hundreds of field workers for the census work, now over 2 million tree measurements; and the National Science Foundation, Smithsonian Tropical Research Institute, and MacArthur Foundation for the bulk of the financial support.

File 1. RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (File 5).

File 2. RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (File 7).

File 3. ViewFullTable.zip: A zip archive with a single ASCII text file named ViewFullTable.txt holding a table with all census data from the BCI 50-ha plot. Each row is a single measurement of a single stem, with columns indicating the census, date, species name, plus tree and stem identifiers; all seven censuses are included. A full description of all columns in the table can be found at http://dx.doi.org/10.5479/data.bci.20130604 (ViewFullTable, pp. 21-22 of the pdf).

File 4. ViewTax.txt: An ASCII text table with information on all tree species recorded in the 50-ha plot. There are columns with taxonomic names (family, genus, species, and subspecies), plus the taxonomic authority. The column 'Mnemonic' gives a shortened code identifying each species, a code used in the R tables (Files 5, 7). The column 'IDLevel' indicates the depth to which the species is identified: if IDLevel='species', it is fully identified, but if IDLevel='genus', the genus is known but not the species. IDLevel can also be 'family', or 'none' in case the species is not even known to family.

File 5. bci.full.Rdata31Aug2012.zip: A zip archive holding seven R Analytical Tables, versions of the BCI 50-ha plot census data in R format, designed for data analysis. There are seven files, one for each of the 7 censuses: 'bci.full1.rdata' for the first census through 'bci.full7.rdata' for the seventh census. Each of the seven files is a table having one record per individual tree, and each includes a record for every tree found over the entire seven censuses (i.e. whether or not a tree was observed alive in the given census, there is a record). Detailed documentation of these tables is given in RoutputFull.pdf (File 1).

File 6. bci.spptable.rdata: A list of the 1064 species found across all tree plots and inventories in Panama, in R format. This is a superset of species found in the BCI censuses: every BCI species is included, plus additional species never observed at BCI. The column 'sp' in this table is a code identifying the species in the R census tables (Files 5, 7), matching 'Mnemonic' in ViewFullTable (File 3).

File 7. bci.stem.Rdata31Aug2012.zip: A zip archive holding seven R Analytical Tables, versions of the BCI 50-ha plot census data in R format, designed for data analysis. There are seven files, one for each of the 7 censuses: 'bci.stem1.rdata' for the first census through 'bci.stem7.rdata' for the seventh census. Each of the seven files is a table having one record per individual stem, necessary because some individual trees have more than one stem. Each includes a record for every stem found over the entire seven censuses (i.e. whether or not a stem was observed alive in the given census, there is a record). Detailed documentation of these tables is given in RoutputStem.pdf (File 2).

File 8. TSMAttributes.txt: An ASCII text table giving full descriptions of measurement codes, also referred to as TSMCodes. These short codes are used in the column 'code' in the R tables and in the column 'ListOfTSM' in ViewFullTable.txt, in both cases with individual codes separated by commas.

File 9. bci_31August2012_mysql.zip: A zip archive holding one file, 'bci.sql', which is a mysqldump of the complete MySQL database (version 5.0.95, http://www.mysql.com) created 31 August 2012. The database includes data collected from seven censuses of the BCI 50-ha plot plus censuses of many additional plots elsewhere in Panama, plus transects where only species identifications were collected and trees were neither tagged nor measured. Detailed documentation of all tables within the database can be found at http://dx.doi.org/10.5479/data.bci.20130604. This version of the data is intended for experienced SQL users; for most, the R Analytical Tables in Rtables.zip are more useful.
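A minimal R sketch of getting started with the R Analytical Tables, assuming bci.full.Rdata31Aug2012.zip has been unzipped and that the object names inside the .rdata files match the file names (an assumption to verify against RoutputFull.pdf):

    # Load the seventh-census tree table and the species table (file names as listed above).
    load("bci.full7.rdata")      # assumed to create a data frame named bci.full7
    load("bci.spptable.rdata")   # assumed to create bci.spptable

    # Join census records to taxonomy via the species code ('sp', as described for File 6)
    merged <- merge(bci.full7, bci.spptable, by = "sp")
    head(merged)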

  6. UK Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). UK Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/uk-weekly-real-estate-listings-2022-2023
    Explore at:
zip (29112488 bytes). Available download formats
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    United Kingdom
    Description

These Kaggle datasets provide real-estate listings downloaded from the UK market, captured from Zoopla, a leading platform in the UK, mirroring the approach taken for the US dataset (from Redfin) and the French dataset (from Seloger). They encompass detailed property listings, pricing, and market trends across the UK, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named UK_clean_unique.csv.

The cleaning process mirrored that of the US and French datasets: removing irrelevant features, normalizing variable names for consistency with the USA and France, and adjusting variable value ranges to remove extreme outliers. To add depth, external factors such as inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on the drivers of the UK's real estate market.

    For exact column descriptions, see columns for UK_clean_unique.csv and my thesis.

Table 2.6 and Section 2.2.2, which I refer to in the column descriptions, can be found in my thesis; see the University Library entry and click Online Access -> Hlavni prace (main thesis).

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

Let me know if you want to see how I got from the raw data to UK_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.

  7. Reddit: /r/videos

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/videos [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-popular-and-quality-video-content-on/code
    Explore at:
zip (127095 bytes). Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/videos

    Insights on Popularity and Content Quality

    By Reddit [source]

    About this dataset

This dataset explores media content on Reddit and how it is received by the community, providing detailed insights into both the popularity and quality of /r/videos posts. Here you will find data about videos posted on Reddit, compiled from metrics such as upvotes, number of comments, date and time posted, body text, and more. With this data you can dive deeper into the types of videos being shared and the topics being discussed, gaining a better understanding of what resonates with the Reddit community. This information shows what kind of content has the potential to reach a wide audience on Reddit; it also reveals which types of videos have been popular with users over time. These insights can help researchers uncover valuable findings about media trends on popular social media sites such as Reddit, so don't hesitate to explore!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    How To Use This Dataset

This dataset is a great resource for analyzing the content and popularity of videos posted on Reddit. It provides metrics such as score, url, comment count, and creation date that let you compare the different types of content being shared on the /r/videos subreddit.

    To get started, take a look at the title field for each post. This gives you an idea of what type of video is being shared, which can be helpful in understanding what topics are popular on the platform.

Next, use the score field to identify posts that received many upvotes from users: the higher the score, the more popular the post has been with viewers. A higher score does not necessarily indicate higher quality, however; look at each post's body field to gauge its content quality before making assumptions about its value based solely on a high score. That said, top-scoring posts are worth examining when researching popular topics or trends in media consumption across Reddit's user base (for example, trending topics among young adults). The url field provides links for directly accessing the videos, so you can review them yourself before sharing them with friends or colleagues for their feedback, something worth doing if your research project requires that level of detail. The comms_num column records how many comments each video has received, which indicates how engaged viewers were with stories submitted by the subreddit's members; this is useful if interactions and conversations around particular types of content are part of your research objective. Finally, check the timestamp column, which records when each story was created; this matters whenever you want to draw conclusions from time-oriented data points (a time series analysis would be handy here, as sketched below).
Taken together, these fields give researchers an accessible way to explore the popularity and quality of media shared on Reddit and to uncover useful insights about the video stories posted to /r/videos.
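A hedged R sketch of such a time-oriented view, counting posts per day in videos.csv; it assumes the created column parses as a date-time, which should be verified against the actual file:

    library(dplyr)
    library(readr)

    videos <- read_csv("videos.csv", show_col_types = FALSE)

    # Posts per day, using the 'created' column (assumed to be parseable as a date-time).
    videos |>
      mutate(day = as.Date(created)) |>
      count(day, name = "posts") |>
      arrange(day)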

    Research Ideas

    • Identifying and tracking trends in the popularity of different genres of videos posted on Reddit, such as interviews, music videos, or educational content.
    • Investigating audience engagement with certain types of content to determine the types of posts that resonate most with users on Reddit.
    • Examining correlations between video score or comment count and specific video characteristics such as length, topic or visual style

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: videos.csv | Column name | Description | |:--------------|:--------------------------------------------------------------------------| | title | The title of ...

  8. The Pizza Problem

    • kaggle.com
    zip
    Updated Feb 8, 2019
    Cite
    Jeremy Jeanne (2019). The Pizza Problem [Dataset]. https://www.kaggle.com/jeremyjeanne/google-hashcode-pizza-training-2019
    Explore at:
zip (178852 bytes). Available download formats
    Dataset updated
    Feb 8, 2019
    Authors
    Jeremy Jeanne
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Problem description

    Pizza

The pizza is represented as a rectangular, 2-dimensional grid of R rows and C columns. The cells within the grid are referenced using a pair of 0-based coordinates [r, c], denoting respectively the row and the column of the cell.

    Each cell of the pizza contains either:

    mushroom, represented in the input file as M
    tomato, represented in the input file as T
    

    Slice

    A slice of pizza is a rectangular section of the pizza delimited by two rows and two columns, without holes. The slices we want to cut out must contain at least L cells of each ingredient (that is, at least L cells of mushroom and at least L cells of tomato) and at most H cells of any kind in total - surprising as it is, there is such a thing as too much pizza in one slice. The slices being cut out cannot overlap. The slices being cut do not need to cover the entire pizza.

    Goal

The goal is to cut correct slices out of the pizza, maximizing the total number of cells in all slices.

Input data set

The input data is provided as a data set file: a plain text file containing exclusively ASCII characters with lines terminated by a single '\n' character at the end of each line (UNIX-style line endings).

    File format

    The file consists of:

    one line containing the following natural numbers separated by single spaces:
    R (1 ≤ R ≤ 1000) is the number of rows
    C (1 ≤ C ≤ 1000) is the number of columns
    L (1 ≤ L ≤ 1000) is the minimum number of each ingredient cells in a slice
    H (1 ≤ H ≤ 1000) is the maximum total number of cells of a slice
    


R lines describing the rows of the pizza (one after another). Each of these lines contains C characters describing the ingredients in the cells of the row (one cell after another). Each character is either 'M' (for mushroom) or 'T' (for tomato).

    Example

    3 5 1 6
    TTTTT
    TMMMT
    TTTTT
    

    3 rows, 5 columns, min 1 of each ingredient per slice, max 6 cells per slice

    Example input file.
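For illustration, a small R sketch that parses an input file in the format described above (the file name is hypothetical):

    # Parse a pizza input file: header "R C L H", then R rows of 'M'/'T' characters.
    parse_pizza <- function(path = "example.in") {
      lines  <- readLines(path)
      header <- as.integer(strsplit(lines[1], " ")[[1]])
      R <- header[1]; C <- header[2]; L <- header[3]; H <- header[4]

      # Grid of single characters, R rows by C columns
      grid <- do.call(rbind, lapply(lines[2:(R + 1)], function(row) strsplit(row, "")[[1]]))
      list(R = R, C = C, L = L, H = H, grid = grid)
    }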

    Submissions

    File format

    The file must consist of:

one line containing a single natural number S (0 ≤ S ≤ R × C), representing the total number of slices to be cut,
S lines describing the slices. Each of these lines must contain the following natural numbers separated by single spaces:
r1, c1, r2, c2 describing a slice of pizza delimited by the rows r1 and r2 (0 ≤ r1, r2 < R) and the columns c1 and c2 (0 ≤ c1, c2 < C), including the cells of the delimiting rows and columns. The rows (r1 and r2) can be given in any order. The columns (c1 and c2) can be given in any order too.
    

    Example

3
0 0 2 1
0 2 2 2
0 3 2 4
    

    3 slices.

    First slice between rows (0,2) and columns (0,1).
    Second slice between rows (0,2) and columns (2,2).
    Third slice between rows (0,2) and columns (3,4).
    Example submission file.
    

    Ā© Google 2017, All rights reserved.

Slices described in the example submission file, marked in green, orange and purple.

Validation

    For the solution to be accepted:

    the format of the file must match the description above,
    each cell of the pizza must be included in at most one slice,
    each slice must contain at least L cells of mushroom,
    each slice must contain at least L cells of tomato,
    total area of each slice must be at most H
    

    Scoring

The submission gets a score equal to the total number of cells in all slices. Note that there are multiple data sets representing separate instances of the problem. The final score for your team is the sum of your best scores on the individual data sets.

Scoring example

    The example submission file given above cuts the slices of 6, 3 and 6 cells, earning 6 + 3 + 6 = 15 points.
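A hedged R sketch of a validator and scorer for a submission, reusing parse_pizza() from the sketch above and assuming the submission file follows the format in the Submissions section (it reproduces the 6 + 3 + 6 = 15 example):

    # Validate a submission and compute its score (total cells across valid, non-overlapping slices).
    score_submission <- function(pizza, submission_path = "example.out") {
      lines   <- readLines(submission_path)
      S       <- as.integer(lines[1])
      covered <- matrix(FALSE, nrow = pizza$R, ncol = pizza$C)
      total   <- 0

      for (i in seq_len(S)) {
        s  <- as.integer(strsplit(lines[i + 1], " ")[[1]])
        r1 <- min(s[1], s[3]) + 1; r2 <- max(s[1], s[3]) + 1   # 0-based input, 1-based R indexing
        c1 <- min(s[2], s[4]) + 1; c2 <- max(s[2], s[4]) + 1
        cells <- pizza$grid[r1:r2, c1:c2]

        stopifnot(
          !any(covered[r1:r2, c1:c2]),    # slices must not overlap
          sum(cells == "M") >= pizza$L,   # at least L mushroom cells
          sum(cells == "T") >= pizza$L,   # at least L tomato cells
          length(cells) <= pizza$H        # at most H cells in total
        )
        covered[r1:r2, c1:c2] <- TRUE
        total <- total + length(cells)
      }
      total
    }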

  9. Google Data Analytics Case Study Cyclistic

    • kaggle.com
    zip
    Updated Sep 27, 2022
    + more versions
    Cite
    Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions
    Explore at:
zip (1299 bytes). Available download formats
    Dataset updated
    Sep 27, 2022
    Authors
    Udayakumar19
    Description

    Introduction

    Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario

    You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Ask

    How do annual members and casual riders use Cyclistic bikes differently?

    Guiding Question:

    What is the problem you are trying to solve?
      How do annual members and casual riders use Cyclistic bikes differently?
    How can your insights drive business decisions?
The insights will help the marketing team design a strategy for converting casual riders.
    

    Prepare

    Guiding Question:

Where is your data located?
  The data is located in Cyclistic's organizational data.

How is the data organized?
  The datasets are in CSV format, one per month, covering financial year 2022.

Are there issues with bias or credibility in this data? Does your data ROCCC?
  It is good; the data ROCCCs because it was collected by the Cyclistic organization itself.

How are you addressing licensing, privacy, security, and accessibility?
  The company holds its own license over the dataset, and the dataset does not contain any personal information about the riders.

How did you verify the data's integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights are always hidden in the data; we have to interpret the data to find them.

Are there any problems with the data?
  Yes, the starting station name and ending station name columns have null values.
    

    Process

    Guiding Question:

What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, because of the large dataset and to gain experience with the language.

Have you ensured the data's integrity?
  Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed, then new columns were added for analysis.

How can you verify that your data is clean and ready to analyze?
  Make sure the column names are consistent throughout all datasets before combining them with the bind_rows() function.

  Make sure column data types are consistent throughout all the datasets by using compare_df_cols() from the janitor package.
  Combine all the datasets into a single data frame to keep the analysis consistent.
  Remove the columns start_lat, start_lng, end_lat, and end_lng from the data frame because they are not required for the analysis.
  Create new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
  Create a ride_length column from the started_at and ended_at columns to find the average ride duration.
  Remove the null rows from the dataset using the na.omit() function.
  (An R sketch of these steps is shown after this section.)

Have you documented your cleaning process so you can review and share those results?
  Yes, the cleaning process is documented clearly.
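A minimal R sketch of the cleaning steps above; the folder holding the monthly CSVs is a hypothetical path, and base-R date handling is used where the case study leaves the tooling open:

    library(dplyr)
    library(readr)
    library(janitor)

    files   <- list.files("trip_data", pattern = "\\.csv$", full.names = TRUE)  # hypothetical folder
    monthly <- lapply(files, read_csv, show_col_types = FALSE)

    compare_df_cols(monthly)   # check that column names and types agree across the monthly files

    trips <- bind_rows(monthly) |>
      distinct() |>                                            # drop duplicate rows
      select(-start_lat, -start_lng, -end_lat, -end_lng) |>    # columns not needed for the analysis
      mutate(
        date        = as.Date(started_at),
        day         = weekdays(date),
        month       = format(date, "%m"),
        year        = format(date, "%Y"),
        ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))
      ) |>
      na.omit()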
    

    Analyze Phase:

    Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized into one single data frame by using the read_csv function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data types.

What surprises did you discover in the data?
  Casual members' ride durations are higher than annual members'.
  Casual members use docked bikes far more than annual members.
What trends or relationships did you find in the data?
  Annual members mainly ride for commuting purposes.
  Casual members prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help to build a profile for each type of member.
    

    Share

Guiding Questions:

    Were you able to answer the question of how ...
    
  10. Reddit: /r/Damnthatsinteresting

    • kaggle.com
    zip
    Updated Dec 18, 2022
    Cite
    The Devastator (2022). Reddit: /r/Damnthatsinteresting [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-power-of-user-engagement-on-damnth
    Explore at:
zip (139409 bytes). Available download formats
    Dataset updated
    Dec 18, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/Damnthatsinteresting

    Investigating Popularity, Score and Engagement Across Subreddits

    By Reddit [source]

    About this dataset

This dataset provides valuable insights into user engagement and popularity across the subreddit Damnthatsinteresting, with detailed metrics on each discussion such as the title, score, id, URL, number of comments, created date and time, body, and timestamp. It opens a window into user interaction on Reddit by letting researchers align their questions with data-driven results to understand social media behavior. Gain an understanding of what drives people to engage in certain conversations and why certain topics become trending phenomena; it is all here for analysis. Enjoy exploring this fascinating collection of information about Reddit users' activities!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

This dataset provides valuable insights into user engagement and the impact of users' interactions on the popular subreddit Damnthatsinteresting. Exploring it can help uncover trends in participation, what content resonates with viewers, and how different users engage with each other. To get the most out of the dataset, you will need to understand its structure in order to extract meaningful insights. The columns provided are: title, score, url, comms_num, created (date/time), body, and timestamp.

    Research Ideas

    • Analyzing the impact of user comments on the popularity and engagement of discussions
    • Examining trends in user behavior over time to gain insight into popular topics of discussion
    • Investigating which discussions reach higher levels of score, popularity or engagement to identify successful strategies for engaging users

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: Damnthatsinteresting.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the discussion thread. (String) |
| score | The number of upvotes the discussion has received from users. (Integer) |
| url | The URL link for the discussion thread itself. (String) |
| comms_num | The number of comments made on a particular discussion. (Integer) |
| created | The date and time when the discussion was first created on Reddit by its original poster (OP). (DateTime) |
| body | Full content including the text body with rich media embedded within posts such as images/videos. (String) |
| timestamp | When the post was last updated by any particular user. (DateTime) |

    Acknowledgements

If you use this dataset in your research, please credit the original authors and Reddit.

  11. Reddit: /r/pokemon

    • kaggle.com
    zip
    Updated Dec 19, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Devastator (2022). Reddit: /r/pokemon [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-popular-pokemon-topics-and-user-inter
    Explore at:
zip (434545 bytes). Available download formats
    Dataset updated
    Dec 19, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/pokemon

Exploring Post Popularity, User Engagement and Topic Discussion

    By Reddit [source]

    About this dataset

This Kaggle dataset provides a unique opportunity to explore the ongoing conversations and discussions of the popular Pokémon franchise across Reddit communities. It contains over a thousand entries compiled from posts and comments made by avid Pokémon fans, providing valuable insights into post popularity, user engagement, and topic discussion. With comprehensive data points including post title, score, post ID, link URL, number of comments, date and time created, body text, and timestamp, powerful analysis can be conducted to assess how trends in Pokémon-related activity evolve over time. So why not dive into this fascinating world of Poké-interactions? Follow along as we navigate the wide range of topics being discussed on Reddit about this legendary franchise!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

This dataset contains over a thousand entries of user conversations related to the Pokémon community, posted and commented on Reddit. By using it, you can explore the popularity of Pokémon-related topics, the level of user engagement, and how user interactions shape the discussion around each topic. To do so, focus on columns such as title, score, url, comms_num (number of comments on a post), created (date and time when the post was created), and timestamp.
For starters, you can look at how many posts have been made about certain topics by using the "title" column as a keyword search, e.g. 'Magikarp' or 'Team Rocket', to see how many posts mention them in total (a small R sketch follows below). With this in mind, you could consider what makes popular posts popular by looking at the number of upvotes from users (stored in "score"): which posts caught people's attention? Beyond upvotes, can downvotes be taken into account when gauging popularity? You could also examine user engagement through comms_num, which records the number of comments left on each post: does an increase in comments lead to an increase in upvotes?
Additionally, you could study how posts were received by reading the body texts stored under 'body'. From this, you can build insights into the overall discussion per topic: is it conversational or argumentative? Are there regional trends among commenters who emphasize different elements of their Pokémon-related discussions?
This opens up possibilities for further investigation into Pokémon-related phenomena through Reddit discussion: finding out what makes certain topics prevalent while others stay obscure, seeing how different world regions appear within certain conversations, and understanding specific nuances within conversation trees between commenters.
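A small, hedged R example of the keyword search described above, using pokemon.csv and the columns from the Columns section:

    library(dplyr)
    library(readr)

    pokemon <- read_csv("pokemon.csv", show_col_types = FALSE)

    # How many posts mention a given keyword in the title, and how are they received?
    pokemon |>
      filter(grepl("Magikarp", title, ignore.case = TRUE)) |>
      summarise(posts = n(),
                median_score = median(score),
                median_comments = median(comms_num))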

    Research Ideas

    • Analyzing the influence of post upvotes in user engagement and conversation outcomes
    • Investigating the frequency of topics discussed in PokĆ©mon related conversations
    • Examining the correlation between post score and number of comments on each post

    Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

File: pokemon.csv

| Column name | Description |
|:------------|:------------|
| title | The title of the post. (String) |
| score | The number of upvotes the post has received. (Integer) |
| url | The URL of the post. (String) |
| comms_num | The number of comments the post has received. (Integer) |
| created | The date and time the post was created. (DateTime) |
| body | The body text of the post. (String) |
| timestamp | The timestamp of the post. (Integer) |

    Acknowledgements

If you use this dataset in your research, please credit the original authors and Reddit.

  12. Reddit's /r/funny Subreddit

    • kaggle.com
    zip
    Updated Dec 15, 2022
    Cite
    The Devastator (2022). Reddit's /r/funny Subreddit [Dataset]. https://www.kaggle.com/datasets/thedevastator/explore-reddit-s-funny-subreddit-analyze-communi/code
    Explore at:
zip (93052 bytes). Available download formats
    Dataset updated
    Dec 15, 2022
    Authors
    The Devastator
    License

https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Explore Reddit's Funny Subreddit & Analyze Community Engagement!

    Quantifying Community Interaction Through Reddit Posts

    By Reddit [source]

    About this dataset

    This dataset offers an insightful analysis into one of the most talked-about online communities today: Reddit. Specifically, we are focusing on the funny subreddit, a subsection of the main forum that enjoys the highest engagement across all Reddit users. Not only does this dataset include post titles, scores and other details regarding post creation and engagement; it also includes powerful metrics to measure active community interaction such as comment numbers and timestamps. By diving deep into this data, we can paint a fuller picture in terms of what people find funny in our digital age - how well do certain topics draw responses? How does sentiment change over time? And how can community managers use these insights to grow their platforms and better engage their userbase for lasting success? With this comprehensive dataset at your fingertips, you'll be able to answer each question - and more

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Introduction

    Welcome to the Reddit's Funny Subreddit Kaggle Dataset. In this dataset you will explore and analyze posts from the popular subreddit to gain insights into community engagement. With this dataset, you can understand user engagement trends and learn how people interact with content from different topics. This guide will provide further information about how to use this dataset for your data analysis projects.

    Important Columns

This dataset contains columns such as: title, score, url, comms_num (number of comments), created (date of post), body (content of post), and timestamp. All of these columns are important for understanding user interactions with each post on Reddit's funny subreddit.

    Exploratory Data Analysis

In order to get a better understanding of user engagement on the subreddit, some initial exploration is necessary. Using graphical tools such as histograms or boxplots, we can easily understand basic parameter values, like scores or comment counts for each post, by observing their distribution over time or across different parameters (for example, type of joke).

    Dimensionality reduction

For more advanced analytics, it is recommended to apply a dimensionality reduction technique such as PCA before tackling the real analysis tasks, so that similar posts can be grouped together and conclusions can be drawn more confidently, leaving out conflicting or irrelevant variables that could otherwise cloud data-driven decisions later on.
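A hedged R sketch of that idea on funny.csv: build a few simple numeric features per post (title length is an illustrative, derived feature) and run PCA with base R's prcomp:

    library(dplyr)
    library(readr)

    funny <- read_csv("funny.csv", show_col_types = FALSE)

    # Simple numeric features per post
    features <- funny |>
      transmute(score,
                comms_num,
                title_length = nchar(title)) |>
      na.omit()

    pca <- prcomp(features, center = TRUE, scale. = TRUE)
    summary(pca)   # variance explained by each component
    head(pca$x)    # posts projected onto the principal components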

    Further Guidance

If further assistance with this dataset is required, readings on topics like text mining, natural language processing, and machine learning are highly recommended; they explain in detail the steps that can unlock greater value from Reddit's funny subreddit and give readers and researchers ideas about how to approach analyzing text-based online platforms such as Reddit in data analytics and data science work.

    Research Ideas

    • Analyzing post title length vs. engagement (i.e., score, comments).
    • Comparing sentiment of post bodies between posts that have high/low scores and comments.
    • Comparing topics within the posts that have high/low scores and comments to look for any differences in content or style of writing based on engagement level

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: funny.csv | Column name | Description | |:--------------|:------------------------...

  13. Comprehensive Literary Greats Dataset

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    The Devastator (2023). Comprehensive Literary Greats Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-literary-greats-dataset
    Explore at:
    zip(29940528 bytes)Available download formats
    Dataset updated
    Jan 29, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Comprehensive Literary Greats Dataset

    50,000+ Books Rated and Awarded Across Language, Genre, and Format

    By [source]

    About this dataset

    This remarkable dataset provides a collection of over 50,000 books spanning literature, poetry, and authorship. For each book, users can access a wealth of information: the title, the author(s), the average rating given by readers and critics, a brief description of its plot or characteristics, the language it is written in, a unique ISBN that lets potential buyers locate it with ease, the genres it belongs to, any awards it has won, and the characters that inhabit its story world.

    Additionally, reader opinion on exceptional books is easy to gauge thanks to bbeScore (the "best books ever" score) and the detailed rating breakdowns in the ratingsByStars column. Whether a title is a long-established classic or a recent release, its reception can be evaluated through reader engagement: the likedPercent column (the percentage of readers who liked the book), bbeVotes (the number of votes cast), and publication-date fields including firstPublishDate.

    Aspiring literature researchers, literary historians and anyone hunting for hidden literary gems will benefit from delving into this collection: 25 variables on different novels and poets, presented in the Kaggle open-source dataset "Best Books Ever: A Comprehensive Historical Collection of Literary Greats". What worlds await you?

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Whether you are a student, researcher, or enthusiast of literature, this dataset provides a valuable source for exploring literary works from varied time periods and genres. By accessing all 25 variables in the dataset, readers have the opportunity to use them for building visualizations, creating new analysis tools and models, or finding books you might be interested in reading.

    First, after downloading the dataset into the Kaggle Notebooks platform or another programming environment of your choice (such as RStudio or a Python Jupyter notebook with pandas), make sure the data is arranged into columns with clearly labelled names; this will help you see which variable holds which piece of information. Afterwards, explore each variable, looking for patterns across particular titles or interesting findings about certain authors or ratings relevant to your research interests.

    Utilize the vital columns Title (title), Author (author), Rating (rating), Description (description), Language (language), Genres (genres) and Characters (characters); these can help you discover trends between books by style of composition, character types and so on. Move further down to the more specific details offered by Book Format (bookFormat), Edition (edition) and Pages (pages), and peruse the publisher information along with Publish Date (publishDate). Also take note of the Awards column for the recognition different titles have received, observe how many ratings each text has collected in Number of Ratings (numRatings), analyse reader feedback through Ratings By Stars (ratingsByStars), and view the share of readers who liked a book in Liked Percent (likedPercent).

    Apart from the more accessible factors mentioned above, delve deeper into the remaining fields: Setting (setting), Cover Image (coverImg), BBE Score (bbeScore) and BBE Votes (bbeVotes). These provide greater insight when trying to explain why a certain book has made its way onto the Goodreads top selections list. To estimate value, test the Price (price) column too, and check whether some texts retain large popularity despite rather costly publishing options currently available on the market.

    Finally, combine the different aspects observed for individual titles to create personalised recommendations based on the comprehensive lists provided. To achieve that, use the ISBN code, compare the publication and first-publication dates recorded, and check the awards labels to put each book's progress over the years into context. A short loading-and-aggregation sketch follows below.
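
    As a rough illustration under assumed names (the CSV file name and exact column spellings may differ in the actual download), the sketch below loads the book list with pandas and aggregates average ratings by language and by genre.

    ```python
    import ast
    import pandas as pd

    # File name is an assumption; adjust it to match the actual download
    books = pd.read_csv("books.csv")

    # Average rating and number of books by language
    by_language = (books.groupby("language")
                        .agg(avg_rating=("rating", "mean"), n_books=("title", "count"))
                        .sort_values("avg_rating", ascending=False))
    print(by_language.head(10))

    # The genres column is often stored as a stringified list; parse it defensively
    def parse_genres(value):
        try:
            parsed = ast.literal_eval(value)
            return parsed if isinstance(parsed, list) else [value]
        except (ValueError, SyntaxError, TypeError):
            return [value]

    genres = books.assign(genre=books["genres"].map(parse_genres)).explode("genre")
    print(genres.groupby("genre")["rating"].mean().sort_values(ascending=False).head(10))
    ```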

    Research Ideas

    • Creating a web or mobile...
  14. Reddit /r/datasets Dataset

    • kaggle.com
    zip
    Updated Nov 28, 2022
    Cite
    The Devastator (2022). Reddit /r/datasets Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/the-meta-corpus-of-datasets-the-reddit-dataset
    Explore at:
    zip(9619636 bytes)Available download formats
    Dataset updated
    Nov 28, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Meta-Corpus of Datasets: The Reddit Dataset

    The Complete Collection of Datasets Posted on Reddit

    By SocialGrep [source]

    About this dataset

    This dataset is a collection of posts and comments made on Reddit's /r/datasets board, covering everything from the subreddit's inception to March 1, 2022. The data was procured using SocialGrep. It does not include usernames, in order to preserve users' anonymity and to prevent targeted harassment.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    In order to use this dataset, you will need software that can open CSV files, such as LibreOffice or a plain-text editor, installed on your computer. You will also need a web browser such as Google Chrome or Mozilla Firefox.

    Once you have the necessary software installed, open the The Reddit Dataset folder and double-click on the the-reddit-dataset-dataset-posts.csv file to open it in your preferred text editor.

    In the document, you will see a list of posts with the following information for each one: title, sentiment, score, URL, created UTC, permalink, subreddit NSFW status, and subreddit name.

    You can use this information to analyze trends in datasets posted on /r/datasets over time. For example, you could calculate the average score for all posts and compare it to the average score for posts in specific subreddits. Additionally, sentiment analysis could be performed on the titles of posts to see whether there is a correlation between positive/negative sentiment and upvotes/downvotes. A minimal sketch follows below.
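
    A hedged pandas sketch along those lines, using the two file names listed in the Columns section below (the sentiment column is documented as a string, so it is coerced to numeric defensively):

    ```python
    import pandas as pd

    posts = pd.read_csv("the-reddit-dataset-dataset-posts.csv")
    comments = pd.read_csv("the-reddit-dataset-dataset-comments.csv")

    # Average score of posts on /r/datasets, and the most common link domains
    print("Average post score:", posts["score"].mean())
    print(posts["domain"].value_counts().head(10))

    # Relationship between comment sentiment and comment score
    comments["sentiment_num"] = pd.to_numeric(comments["sentiment"], errors="coerce")
    print(comments[["sentiment_num", "score"]].corr())
    ```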

    Research Ideas

    • Finding correlations between different types of datasets
    • Determining which datasets are most popular on Reddit
    • Analyzing the sentiments of post and comments on Reddit's /r/datasets board

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: the-reddit-dataset-dataset-comments.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | body           | The body of the post. (String)                      |
    | sentiment      | The sentiment of the post. (String)                 |
    | score          | The score of the post. (Integer)                    |

    File: the-reddit-dataset-dataset-posts.csv

    | Column name    | Description                                         |
    |:---------------|:----------------------------------------------------|
    | type           | The type of post. (String)                          |
    | subreddit.name | The name of the subreddit. (String)                 |
    | subreddit.nsfw | Whether or not the subreddit is NSFW. (Boolean)     |
    | created_utc    | The time the post was created, in UTC. (Timestamp)  |
    | permalink      | The permalink for the post. (String)                |
    | score          | The score of the post. (Integer)                    |
    | domain         | The domain of the post. (String)                    |
    | url            | The URL of the post. (String)                       |
    | selftext       | The self-text of the post. (String)                 |
    | title          | The title of the post. (String)                     |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and SocialGrep.

  15. Articles metadata from CrossRef

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Kea Kohv (2025). Articles metadata from CrossRef [Dataset]. https://www.kaggle.com/datasets/keakohv/articles-doi-metadata
    Explore at:
    zip(72982417 bytes)Available download formats
    Dataset updated
    Aug 1, 2025
    Authors
    Kea Kohv
    Description

    This data originates from the Crossref API. It contains metadata for the articles in the Data Citation Corpus whose cited dataset in the citation pair is identified by a DOI.

    How to recreate this dataset in Jupyter Notebook:

    1) Prepare the list of articles to query:

    ```python
    import pandas as pd

    # See: https://www.kaggle.com/datasets/keakohv/data-citation-coprus-v4-1-eupmc-and-datacite
    CITATIONS_PARQUET = "data_citation_corpus_filtered_v4.1.parquet"

    # Load the citation pairs from the Parquet file
    citation_pairs = pd.read_parquet(CITATIONS_PARQUET)

    # Remove all rows where "https" appears in the 'dataset' column but "doi.org" does not
    citation_pairs = citation_pairs[
        ~((citation_pairs['dataset'].str.contains("https"))
          & (~citation_pairs['dataset'].str.contains("doi.org")))
    ]

    # Remove all rows where figshare is in the dataset name
    citation_pairs = citation_pairs[~citation_pairs['dataset'].str.contains("figshare")]

    # Keep only citation pairs whose dataset identifier is a DOI
    citation_pairs['is_doi'] = citation_pairs['dataset'].str.contains('doi.org', na=False)
    citation_pairs_doi = citation_pairs[citation_pairs['is_doi'] == True].copy()

    # Unique article DOIs, with underscores converted back to slashes
    articles = list(set(citation_pairs_doi['publication'].to_list()))
    articles = [doi.replace("_", "/") for doi in articles]

    # Save the list of articles to a file, one DOI per line
    with open("articles.txt", "w") as f:
        for article in articles:
            f.write(f"{article}\n")
    ```

    2) Query articles from CrossRef API

    
    %%writefile enrich.py
    #!pip install -q aiolimiter
    import sys, pathlib, asyncio, aiohttp, orjson, sqlite3, time
    from aiolimiter import AsyncLimiter
    
    # ---------- config ----------
    HEADERS  = {"User-Agent": "ForDataCiteEnrichment (mailto:your_email)"} # Put your email here
    MAX_RPS  = 45           # polite pool limit (50), leave head-room
    BATCH_SIZE = 10_000         # rows per INSERT
    DB_PATH  = pathlib.Path("crossref.sqlite").resolve()
    ARTICLES  = pathlib.Path("articles.txt")
    # -----------------------------
    
    # ---- platform tweak: prefer selector loop on Windows ----
    if sys.platform == "win32":
      asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    
    # ---- read the DOI list ----
    with ARTICLES.open(encoding="utf-8") as f:
      DOIS = [line.strip() for line in f if line.strip()]
    
    # ---- make sure DB & table exist BEFORE the async part ----
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(DB_PATH) as db:
      db.execute("""
        CREATE TABLE IF NOT EXISTS works (
          doi  TEXT PRIMARY KEY,
          json TEXT
        )
      """)
      db.execute("PRAGMA journal_mode=WAL;")   # better concurrency
    
    # ---------- async section ----------
    limiter = AsyncLimiter(MAX_RPS, 1)       # 45 req / second
    sem   = asyncio.Semaphore(100)        # cap overall concurrency
    
    async def fetch_one(session, doi: str):
      url = f"https://api.crossref.org/works/{doi}"
      async with limiter, sem:
        try:
          async with session.get(url, headers=HEADERS, timeout=10) as r:
            if r.status == 404:         # common "not found"
              return doi, None
            r.raise_for_status()        # propagate other 4xx/5xx
            return doi, await r.json()
        except Exception as e:
          return doi, None            # log later, don’t crash
    
    async def main():
      start = time.perf_counter()
      db  = sqlite3.connect(DB_PATH)        # KEEP ONE connection
      db.execute("PRAGMA synchronous = NORMAL;")   # speed tweak
    
      async with aiohttp.ClientSession(json_serialize=orjson.dumps) as s:
        for chunk_start in range(0, len(DOIS), BATCH_SIZE):
          slice_ = DOIS[chunk_start:chunk_start + BATCH_SIZE]
          tasks = [asyncio.create_task(fetch_one(s, d)) for d in slice_]
          results = await asyncio.gather(*tasks)    # all tuples, no exc
    
          good_rows, bad_dois = [], []
          for doi, payload in results:
            if payload is None:
              bad_dois.append(doi)
            else:
              good_rows.append((doi, orjson.dumps(payload).decode()))
    
          if good_rows:
            db.executemany(
              "INSERT OR IGNORE INTO works (doi, json) VALUES (?, ?)",
              good_rows,
            )
            db.commit()
    
          if bad_dois:                # append for later retry
            with open("failures.log", "a", encoding="utf-8") as fh:
              fh.writelines(f"{d}
    " for d in bad_dois)
    
          done = chunk_start + len(slice_)
          rate = done / (time.perf_counter() - start)
          print(f"{done:,}/{len(DOIS):,} ({rate:,.1f} DOI/s)")
    
      db.close()
    
    if __name__ == "__main__":
      asyncio.run(main())
    

    Then run the script from a notebook cell with `!python enrich.py` (or `python enrich.py` from a terminal).

    3) Finally extract the necessary fields

    import sqlite3
    import orjson
    i...
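
    The snippet above is truncated in the source. As a rough sketch only, one way the extraction step might look is to read the stored JSON back out of crossref.sqlite and pull a few fields from the Crossref message; which fields to keep is an assumption and depends on your needs.

    ```python
    import sqlite3

    import orjson
    import pandas as pd

    rows = []
    with sqlite3.connect("crossref.sqlite") as db:
        for doi, raw in db.execute("SELECT doi, json FROM works"):
            message = orjson.loads(raw).get("message", {})
            rows.append({
                "doi": doi,
                "title": (message.get("title") or [None])[0],
                "container_title": (message.get("container-title") or [None])[0],
                "type": message.get("type"),
                "issued_year": ((message.get("issued") or {}).get("date-parts") or [[None]])[0][0],
            })

    articles_meta = pd.DataFrame(rows)
    articles_meta.to_csv("articles_metadata.csv", index=False)
    ```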
    
  16. Housing Price Prediction using DT and RF in R

    • kaggle.com
    zip
    Updated Aug 31, 2023
    Cite
    vikram amin (2023). Housing Price Prediction using DT and RF in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/housing-price-prediction-using-dt-and-rf-in-r
    Explore at:
    zip(629100 bytes)Available download formats
    Dataset updated
    Aug 31, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    • Objective: to predict the prices of houses in the City of Melbourne.
    • Approach: decision tree and random forest (screenshot: Picture1.png); an analogous Python sketch appears after this list.
    • Data cleaning:
    • The Date column is read as a character vector and is converted to a date vector using the 'lubridate' library.
    • A new column, age, is created because the age of a house can be a factor in its price; it is computed by extracting the year from the 'Date' column and subtracting the 'Year Built' column.
    • 11566 records with missing values are removed.
    • Columns which are not significant are dropped: 'X', 'suburb', 'address' (zipcode is kept since it serves the same purpose as suburb and address), 'type', 'method', 'SellerG', 'date', 'Car', 'year built', 'Council Area' and 'Region Name'.
    • The data is split into 'train' and 'test' sets in an 80/20 ratio using the sample function.
    • The libraries 'rpart', 'rpart.plot', 'rattle' and 'RColorBrewer' are loaded.
    • A decision tree is run using the rpart function, with 'Price' as the dependent variable (screenshot: Picture2.png).
    • The average price across 5464 houses is $1084349.
    • Where building area is less than 200.5, the average price of 4582 houses is $931445; where building area is less than 200.5 and the age of the building is less than 67.5 years, the average price of 3385 houses is $799299.6.
    • The highest average price, $4801538 across 13 houses, occurs where distance is lower than 5.35 and building area is greater than 280.5 (screenshot: Decision Tree Plot.jpeg).
    • The caret package is used for parameter tuning; the optimal complexity parameter found is 0.01 with an RMSE of 445197.9 (screenshot: Picture3.png).
    • The Metrics library gives an RMSE of $392107, a MAPE of 0.297 (about a 29.7% mean absolute percentage error) and an MAE of $272015.4.
    • The variables 'postcode', longitude and building area are the most important variables.
    • test$Price indicates the actual price and test$predicted the predicted price for six particular houses (screenshot: Picture4.png).
    • Random forest is first run with its default parameters on the train data (screenshot: Picture5.png).
    • The variable-importance plot indicates that building area, age of the house and distance are the variables that most affect the price of a house (screenshot: Random Forest Variables Importance.jpeg).
    • With the default parameters, RMSE is $250426.2, MAPE is 0.147 (about a 14.7% mean absolute percentage error) and MAE is $151657.7.
    • The error starts to level off between 100 and 200 trees, with almost no further reduction afterwards, so ntree = 200 is a reasonable choice (screenshot: Ntree Plot.jpeg).
    • Tuning the model shows that mtry = 3 has the lowest out-of-bag error.
    • Using the caret package with 5-fold cross-validation, RMSE is $252216.10, MAPE is 0.146 and MAE is $151669.4.
    • We can conclude that random forest gives more accurate results than the decision tree.
    • In random forest, the default parameters (ntree = 500) give lower RMSE and MAPE than ntree = 200, so we can proceed with those parameters.
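
    The steps above are carried out in R (rpart, randomForest, caret). Purely as an illustrative analogue, here is a hedged Python sketch of the same idea using pandas and scikit-learn; the file name melbourne_housing.csv and the exact column names are assumptions and must be matched to the actual data.

    ```python
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # File and column names are assumptions; adjust to the actual Melbourne dataset
    df = pd.read_csv("melbourne_housing.csv")
    df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors="coerce")
    df["Age"] = df["Date"].dt.year - df["YearBuilt"]

    features = ["BuildingArea", "Age", "Distance", "Rooms", "Landsize", "Postcode"]
    df = df.dropna(subset=features + ["Price"])

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Price"], test_size=0.2, random_state=42)

    for name, model in [("decision tree", DecisionTreeRegressor(random_state=42)),
                        ("random forest", RandomForestRegressor(n_estimators=500, random_state=42))]:
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = mean_squared_error(y_test, pred) ** 0.5
        print(f"{name}: RMSE={rmse:,.0f}  MAE={mean_absolute_error(y_test, pred):,.0f}")
    ```
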
  17. Record High Temperatures for US Cities

    • kaggle.com
    zip
    Updated Jan 18, 2023
    Cite
    The Devastator (2023). Record High Temperatures for US Cities [Dataset]. https://www.kaggle.com/datasets/thedevastator/record-high-temperatures-for-us-cities-in-2015
    Explore at:
    zip(9955 bytes)Available download formats
    Dataset updated
    Jan 18, 2023
    Authors
    The Devastator
    Area covered
    United States
    Description

    Record High Temperatures for US Cities

    Clearly Defined Monthly Data

    By Gary Hoover [source]

    About this dataset

    This dataset contains all the record-breaking temperatures for your favorite US cities in 2015. With this information, you can prepare for any unexpected weather that may come your way in the future, or just revel in the beauty of these high heat spells from days past! With record highs spanning from January to December, stay warm (or cool) with these handy historical temperature data points

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains the record high temperatures for various US cities through the year 2015. It includes a column for each individual month, along with a column for the record high over the entire year. The data is sourced from www.weatherbase.com and can be used to analyze which cities experienced hot summers, or to compare temperature variations between different regions.

    Here are some useful tips on how to work with this dataset:
    - Analyze individual monthly temperatures: compare high temperatures across months and locations to identify areas that experienced particularly hot summers or colder winters.
    - Compare annual versus monthly data: contrast average annual highs with monthly highs to understand temperature trends at a given location across all four seasons of a single year, or explore how different regions vary by yearly weather patterns as well as within given months of any one year.
    - Heatmap analysis: plot the temperature information as a heatmap to pinpoint regions that experience unusual weather conditions or higher-than-average warmth compared with cooler pockets of similar geographic size (see the sketch below).
    - Statistically model the relationships between independent variables (temperature variations by month, region/city and more) and dependent variables (e.g., tourism volumes), using regression techniques such as linear models (OLS), ARIMA models, nonlinear transformations and other methods in statistical software such as STATA or the R programming language.
    - Look into climate trends over longer periods: adjust the time frames included in analyses beyond 2018 when possible by expanding upon the monthly station observations already present within the study timeframe, and take advantage of digitally available historical temperature readings rather than relying only on printed reports.

    With these helpful tips, you can get started analyzing record high temperatures for US cities during 2015 using our 'Record High Temperatures for US Cities' dataset!
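
    For example, a minimal heatmap sketch along the lines of the tip above, assuming the CSV listed in the Columns section below (the month columns are assumed to run through DEC, although the documented list is truncated, and the temperatures are assumed to be in °F):

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt

    temps = pd.read_csv("Highest temperature on record through 2015 by US City.csv")
    months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
    data = temps.set_index("CITY")[months]

    fig, ax = plt.subplots(figsize=(8, 10))
    im = ax.imshow(data.values, aspect="auto", cmap="hot")
    ax.set_xticks(range(len(months)))
    ax.set_xticklabels(months)
    ax.set_yticks(range(len(data)))
    ax.set_yticklabels(data.index)
    ax.set_title("Record high temperatures by US city and month (through 2015)")
    fig.colorbar(im, ax=ax, label="Record high")
    plt.tight_layout()
    plt.show()
    ```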

    Research Ideas

    • Create a heat map chart of US cities representing the highest temperature on record for each city from 2015.
    • Analyze trends in monthly high temperatures in order to predict future climate shifts and weather patterns across different US cities.
    • Track and compare monthly high temperature records for all US cities to identify regional hot spots with higher than average records and potential implications for agriculture and resource management planning

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: Highest temperature on record through 2015 by US City.csv

    | Column name | Description                                                    |
    |:------------|:---------------------------------------------------------------|
    | CITY        | Name of the city. (String)                                      |
    | JAN         | Record high temperature for the month of January. (Integer)    |
    | FEB         | Record high temperature for the month of February. (Integer)   |
    | MAR         | Record high temperature for the month of March. (Integer)      |
    | APR         | Record high temperature for the month of April. (Integer)      |
    | MAY         | Record high temperature for the month of May. (Integer)        |
    | JUN         | Record high temperature for the month of June. (Integer)       |
    | JUL         | Record high temperature for the month of July. (Integer)       |
    | AUG         | Record high temperature for the month of August. (Integer)     |
    | SEP         | Record high temperature for the month of September. (Integer)  |
    | OCT         | Record high temperature for the month of October. (Integer)    |
    | ...         | ...                                                             |

  18. Reddit: /r/science

    • kaggle.com
    zip
    Updated Dec 17, 2022
    Cite
    The Devastator (2022). Reddit: /r/science [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-reddit-r-science-subreddit-interaction
    Explore at:
    zip(205948 bytes)Available download formats
    Dataset updated
    Dec 17, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reddit: /r/science

    Investigating Social Media Interactions and Popularity Metrics

    By Reddit [source]

    About this dataset

    The Reddit /r/science dataset offers an in-depth exploration of the science-related conversations and content taking place on the popular website Reddit. It provides valuable insights into user interactions, sentiment and popularity trends across science topics ranging from astrophysics to neuroscience. The data comprises key features such as post titles, post scores, comment counts, creation times and post URLs, which help us understand the dynamics and sentiment of the scientific discussions within this popular forum. Using this dataset, we can analyse how a certain topic has changed over time in terms of relevance, and what kinds of posts are most successful at gaining attention from users. Ultimately, we can leverage this analysis to better comprehend shifts in public opinion towards various aspects of current scientific knowledge.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Research Ideas

    • Analyzing topic trends within the subreddit over time, in order to understand which topics are most popular with readers.
    • Identifying relationships between levels of interaction (comments and upvotes) and sentiment (through text analysis), to track how users react to certain topics.
    • Tracking post and user metrics over time (such as average post length or number of comments per post), in order to monitor changes in the outlook of the subreddit community as a whole (a minimal sketch follows below).
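
    A small, hedged sketch of the first ideas, assuming science.csv with the columns documented below (the datetime parsing of 'created' may need adjusting):

    ```python
    import pandas as pd

    sci = pd.read_csv("science.csv")

    # Relationship between upvotes and comment activity
    print(sci[["score", "comms_num"]].corr())

    # Posts per month and median score over time
    sci["created"] = pd.to_datetime(sci["created"], errors="coerce")
    by_month = (sci.dropna(subset=["created"])
                   .set_index("created")["score"]
                   .resample("M")
                   .agg(["count", "median"]))
    print(by_month.tail(12))
    ```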

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: science.csv

    | Column name | Description                                                 |
    |:------------|:-------------------------------------------------------------|
    | title       | The title of the post. (String)                              |
    | score       | The number of upvotes the post has received. (Integer)       |
    | url         | The URL associated with the post. (String)                   |
    | comms_num   | The number of comments associated with the post. (Integer)   |
    | created     | The date and time the post was created. (DateTime)           |
    | body        | The content of the post. (String)                            |
    | timestamp   | The timestamp of the post. (DateTime)                        |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

  19. KC_House Dataset -Linear Regression of Home Prices

    • kaggle.com
    zip
    Updated May 15, 2023
    Cite
    vikram amin (2023). KC_House Dataset -Linear Regression of Home Prices [Dataset]. https://www.kaggle.com/datasets/vikramamin/kc-house-dataset-home-prices
    Explore at:
    zip(776807 bytes)Available download formats
    Dataset updated
    May 15, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    1. Dataset: House pricing dataset containing 21 columns and 21613 rows.
    2. Programming Language : R
    3. Objective : To predict house prices by creating a model
    4. Steps: A) Import the dataset. B) Install and run libraries. C) Data cleaning: remove null values, change data types, and drop columns which are not important. D) Data analysis: (i) a linear regression model was used to establish the relationship between the dependent variable (price) and the other independent variables; (ii) outliers were identified and removed; (iii) the regression model was run once again after removing the outliers; (iv) multiple R-squared was calculated, indicating that the independent variables can explain 73% of the variation in the dependent variable; (v) the p-value was less than alpha = 0.05, which shows the result is statistically significant; (vi) the meaning of the coefficients was interpreted; (vii) the assumption of multicollinearity was checked; (viii) the VIF (variance inflation factor) was calculated for all the independent variables and its absolute value was found to be less than 5, so there is no threat of multicollinearity and we can proceed with the independent variables specified. A rough Python analogue of this workflow is sketched below.
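
    The steps above describe an R workflow. As an illustration only, a hedged Python sketch of the same regression-plus-VIF check using statsmodels, with assumed file and column names from the familiar King County housing data, might look like this:

    ```python
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # File and column names are assumptions; adjust to the actual dataset
    df = pd.read_csv("kc_house_data.csv")
    features = ["bedrooms", "bathrooms", "sqft_living", "floors", "grade", "sqft_lot"]
    df = df.dropna(subset=features + ["price"])

    X = sm.add_constant(df[features])
    y = df["price"]

    model = sm.OLS(y, X).fit()
    print(model.rsquared)   # share of price variation explained
    print(model.pvalues)    # significance of each coefficient

    # Multicollinearity check: VIF below ~5 is usually considered acceptable
    vif = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns)
    print(vif.drop("const"))
    ```
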
  20. AskReddit (Submissions & Comments)

    • kaggle.com
    zip
    Updated Dec 15, 2022
    Cite
    The Devastator (2022). AskReddit (Submissions & Comments) [Dataset]. https://www.kaggle.com/datasets/thedevastator/uncovering-askreddit-trends-a-study-of-subreddit
    Explore at:
    zip(124445 bytes)Available download formats
    Dataset updated
    Dec 15, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Uncovering AskReddit Trends: A Study of Subreddit Engagement

    User Engagement Through Posts

    By Reddit [source]

    About this dataset

    This comprehensive dataset contains information from the AskReddit subreddit on Reddit.com, with over 8 columns of data providing valuable insights into user engagement and interaction. It includes the title of each post, their score, how many comments are associated with them, and when they were created/posted. Use this data to gain insight into how different posts engage users on Reddit, what kind of content resonates with readers, and how user engagement has shifted over time. Learn more about AskReddit posts and analyze the patterns that emerge as you examine user engagement across different types of content

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    One of the most popular subreddits on Reddit is AskReddit – a platform for users to post questions and share knowledge with other Redditors. As active askers, commenters, and posters, it’s essential to gain insight into how people engage in the subreddit.

    Using this dataset, you can begin your exploration by examining distributions related to post scores (score), the number of comments on posts (comms_num), and the time elapsed between when posts are made and when they receive replies (created vs timestamp). There may also be interesting correlations between features such as post title length (title) and average words per comment (body). The ultimate goal is to uncover key trends in user behaviour and identify features which help predict outcomes, so that insight into various kinds of engagement becomes easier to access.

    Utilizing this dataset can provide a valuable understanding of how popular and active topics, whether political or scientific, are discussed on AskReddit, which in turn can improve the experience for both moderators and users (a minimal sketch follows below). Happy exploring!
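
    A minimal sketch of that kind of exploration, assuming AskReddit.csv with the columns documented below:

    ```python
    import pandas as pd

    ask = pd.read_csv("AskReddit.csv")

    # Does title length relate to engagement?
    ask["title_len"] = ask["title"].str.len()
    print(ask[["title_len", "score", "comms_num"]].corr())

    # What do the most-commented questions look like?
    print(ask.sort_values("comms_num", ascending=False)[["title", "score", "comms_num"]].head(10))
    ```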

    Research Ideas

    • Examining the influence of post titles on user engagement, such as looking at the correlation between more descriptive title lengths and higher scores or comment counts
    • Utilizing natural language processing techniques to analyze the body of posts in order to gain more insight into user attitudes and opinions
    • Studying post timing effects by tracking changes in user engagement over time for various subject topics as well as understanding when it may be best to create a post for maximum exposure

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: AskReddit.csv

    | Column name | Description                                               |
    |:------------|:-----------------------------------------------------------|
    | title       | The title of the post. (String)                            |
    | score       | The number of upvotes the post has received. (Integer)     |
    | url         | The URL of the post. (String)                              |
    | comms_num   | The number of comments the post has received. (Integer)    |
    | created     | The date and time the post was created. (Datetime)         |
    | body        | The body of the post. (String)                             |
    | timestamp   | The timestamp of the post. (Integer)                       |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Reddit.

