83 datasets found

Daily website visitors (time series regression)
kaggle.com
Updated Aug 20, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 20, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bob Nau
Description
Context

This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

Content

The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.

Inspiration

This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.
A
‘Popular Website Traffic Over Time ’ analyzed by Analyst-2
analyst-2.ai
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Popular Website Traffic Over Time ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-popular-website-traffic-over-time-62e4/62549059/?iid=003-357&v=presentation
Explore at:
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Popular Website Traffic Over Time ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/popular-website-traffice on 13 February 2022.

--- Dataset description provided by original source is as follows ---

About this dataset

Background

Have you every been in a conversation and the question comes up, who uses Bing? This question comes up occasionally because people wonder if these sites have any views. For this research study, we are going to be exploring popular website traffic for many popular websites.

Methodology

The data collected originates from SimilarWeb.com.

Source

For the analysis and study, go to The Concept Center

This dataset was created by Chase Willden and contains around 0 samples along with 1/1/2017, Social Media, technical information and other features such as: - 12/1/2016 - 3/1/2017 - and more.

How to use this dataset

Analyze 11/1/2016 in relation to 2/1/2017

Study the influence of 4/1/2017 on 1/1/2017

More datasets

Acknowledgements

If you use this dataset in your research, please credit Chase Willden

Start A New Notebook!

--- Original source retains full ownership of the source dataset ---
Google Analytics Sample
kaggle.com
zip
Updated Sep 19, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/bigquery/google-analytics-sample
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Sep 19, 2019
Dataset provided by
Googlehttp://google.com/
BigQueryhttps://cloud.google.com/bigquery
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

Content

The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

Fork this kernel to get started.

Acknowledgements

Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

Banner Photo by Edho Pratama from Unsplash.

Inspiration

What is the total number of transactions generated per device browser in July 2017?

The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

What was the average number of product pageviews for users who made a purchase in July 2017?

What was the average number of product pageviews for users who did not make a purchase in July 2017?

What was the average total transactions per user that made a purchase in July 2017?

What is the average amount of money spent per session in July 2017?

What is the sequence of pages viewed?
u
Data from: Google Analytics & Twitter dataset from a movies, TV series and...
portalcientificovalencia.univeuropea.com
figshare.com
Updated 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yeste, Víctor; Yeste, Víctor (2024). Google Analytics & Twitter dataset from a movies, TV series and videogames website [Dataset]. https://portalcientificovalencia.univeuropea.com/documentos/67321ed3aea56d4af0485dc8
Explore at:
Dataset updated
2024
Authors
Yeste, Víctor; Yeste, Víctor
Description
Author: Víctor Yeste. Universitat Politècnica de Valencia.The object of this study is the design of a cybermetric methodology whose objectives are to measure the success of the content published in online media and the possible prediction of the selected success variables.In this case, due to the need to integrate data from two separate areas, such as web publishing and the analysis of their shares and related topics on Twitter, has opted for programming as you access both the Google Analytics v4 reporting API and Twitter Standard API, always respecting the limits of these.The website analyzed is hellofriki.com. It is an online media whose primary intention is to solve the need for information on some topics that provide daily a vast number of news in the form of news, as well as the possibility of analysis, reports, interviews, and many other information formats. All these contents are under the scope of the sections of cinema, series, video games, literature, and comics.This dataset has contributed to the elaboration of the PhD Thesis:Yeste Moreno, VM. (2021). Diseño de una metodología cibermétrica de cálculo del éxito para la optimización de contenidos web [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/176009Data have been obtained from each last-minute news article published online according to the indicators described in the doctoral thesis. All related data are stored in a database, divided into the following tables:tesis_followers: User ID list of media account followers.tesis_hometimeline: data from tweets posted by the media account sharing breaking news from the web.status_id: Tweet IDcreated_at: date of publicationtext: content of the tweetpath: URL extracted after processing the shortened URL in textpost_shared: Article ID in WordPress that is being sharedretweet_count: number of retweetsfavorite_count: number of favoritestesis_hometimeline_other: data from tweets posted by the media account that do not share breaking news from the web. Other typologies, automatic Facebook shares, custom tweets without link to an article, etc. With the same fields as tesis_hometimeline.tesis_posts: data of articles published by the web and processed for some analysis.stats_id: Analysis IDpost_id: Article ID in WordPresspost_date: article publication date in WordPresspost_title: title of the articlepath: URL of the article in the middle webtags: Tags ID or WordPress tags related to the articleuniquepageviews: unique page viewsentrancerate: input ratioavgtimeonpage: average visit timeexitrate: output ratiopageviewspersession: page views per sessionadsense_adunitsviewed: number of ads viewed by usersadsense_viewableimpressionpercent: ad display ratioadsense_ctr: ad click ratioadsense_ecpm: estimated ad revenue per 1000 page viewstesis_stats: data from a particular analysis, performed at each published breaking news item. Fields with statistical values can be computed from the data in the other tables, but total and average calculations are saved for faster and easier further processing.id: ID of the analysisphase: phase of the thesis in which analysis has been carried out (right now all are 1)time: "0" if at the time of publication, "1" if 14 days laterstart_date: date and time of measurement on the day of publicationend_date: date and time when the measurement is made 14 days latermain_post_id: ID of the published article to be analysedmain_post_theme: Main section of the published article to analyzesuperheroes_theme: "1" if about superheroes, "0" if nottrailer_theme: "1" if trailer, "0" if notname: empty field, possibility to add a custom name manuallynotes: empty field, possibility to add personalized notes manually, as if some tag has been removed manually for being considered too generic, despite the fact that the editor put itnum_articles: number of articles analysednum_articles_with_traffic: number of articles analysed with traffic (which will be taken into account for traffic analysis)num_articles_with_tw_data: number of articles with data from when they were shared on the media’s Twitter accountnum_terms: number of terms analyzeduniquepageviews_total: total page viewsuniquepageviews_mean: average page viewsentrancerate_mean: average input ratioavgtimeonpage_mean: average duration of visitsexitrate_mean: average output ratiopageviewspersession_mean: average page views per sessiontotal: total of ads viewedadsense_adunitsviewed_mean: average of ads viewedadsense_viewableimpressionpercent_mean: average ad display ratioadsense_ctr_mean: average ad click ratioadsense_ecpm_mean: estimated ad revenue per 1000 page viewsTotal: total incomeretweet_count_mean: average incomefavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesterms_ini_num_tweets: total tweets on the terms on the day of publicationterms_ini_retweet_count_total: total retweets on the terms on the day of publicationterms_ini_retweet_count_mean: average retweets on the terms on the day of publicationterms_ini_favorite_count_total: total of favorites on the terms on the day of publicationterms_ini_favorite_count_mean: average of favorites on the terms on the day of publicationterms_ini_followers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the terms on the day of publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms on the day of publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who spoke about the terms on the day of publicationterms_ini_user_age_mean: average age in days of users who have spoken of the terms on the day of publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms on the day of publicationterms_end_num_tweets: total tweets on terms 14 days after publicationterms_ini_retweet_count_total: total retweets on terms 14 days after publicationterms_ini_retweet_count_mean: average retweets on terms 14 days after publicationterms_ini_favorite_count_total: total bookmarks on terms 14 days after publicationterms_ini_favorite_count_mean: average of favorites on terms 14 days after publicationterms_ini_followers_talking_rate: ratio of media Twitter account followers who have recently posted a tweet talking about the terms 14 days after publicationterms_ini_user_num_followers_mean: average followers of users who have spoken of the terms 14 days after publicationterms_ini_user_num_tweets_mean: average number of tweets published by users who have spoken about the terms 14 days after publicationterms_ini_user_age_mean: the average age in days of users who have spoken of the terms 14 days after publicationterms_ini_ur_inclusion_rate: URL inclusion ratio of tweets talking about terms 14 days after publication.tesis_terms: data of the terms (tags) related to the processed articles.stats_id: Analysis IDtime: "0" if at the time of publication, "1" if 14 days laterterm_id: Term ID (tag) in WordPressname: Name of the termslug: URL of the termnum_tweets: number of tweetsretweet_count_total: total retweetsretweet_count_mean: average retweetsfavorite_count_total: total of favoritesfavorite_count_mean: average of favoritesfollowers_talking_rate: ratio of followers of the media Twitter account who have recently published a tweet talking about the termuser_num_followers_mean: average followers of users who were talking about the termuser_num_tweets_mean: average number of tweets published by users who were talking about the termuser_age_mean: average age in days of users who were talking about the termurl_inclusion_rate: URL inclusion ratio
D
Monthly Page Views to CDC.gov
data.cdc.gov
data.virginia.gov
+4more
application/rdfxml +5
Updated Jul 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Office of the Associate Director for Communication, Division of News and Electronic Media (2025). Monthly Page Views to CDC.gov [Dataset]. https://data.cdc.gov/Web-Metrics/Monthly-Page-Views-to-CDC-gov/rq85-buyi
Explore at:
xml, application/rdfxml, json, csv, application/rssxml, tsvAvailable download formats
Dataset updated
Jul 1, 2025
Dataset authored and provided by
Office of the Associate Director for Communication, Division of News and Electronic Media
Description
For more information on CDC.gov metrics please see http://www.cdc.gov/metrics/
d
Website Analytics
catalog.data.gov
data.nola.gov
+4more
Updated Jun 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.nola.gov (2025). Website Analytics [Dataset]. https://catalog.data.gov/dataset/website-analytics
Explore at:
Dataset updated
Jun 28, 2025
Dataset provided by
data.nola.gov
Description
This data about nola.gov provides a window into how people are interacting with the the City of New Orleans online. The data comes from a unified Google Analytics account for New Orleans. We do not track individuals and we anonymize the IP addresses of all visitors.
P
Alexa Domains Dataset
paperswithcode.com
opendatalab.com
Updated Feb 1, 2001
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac Corley; Jonathan Lwowski; Justin Hoffman (2001). Alexa Domains Dataset [Dataset]. https://paperswithcode.com/dataset/gagan-bhatia
Explore at:
Dataset updated
Feb 1, 2001
Authors
Isaac Corley; Jonathan Lwowski; Justin Hoffman
Description
This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest
Google Analytics Sample
console.cloud.google.com
Updated Jul 15, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data&hl=de&inv=1&invt=Ab2fng (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?hl=de
Explore at:
Dataset updated
Jul 15, 2017
Dataset provided by
Googlehttp://google.com/
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data Learn more about the data The data includes The data is typical of what an ecommerce website would see and includes the following information:Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display trafficContent data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions on the Google Merchandise Store website.Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated such as fullVisitorId, or removed such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery
SSC 4D Orbit Viewer Web Services API at the Space Physics Data Facility...
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Mar 31, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nasa.gov (2025). SSC 4D Orbit Viewer Web Services API at the Space Physics Data Facility (SPDF) - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/ssc-4d-orbit-viewer-web-services-api-at-the-space-physics-data-facility-spdf
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
This API allows any web application to offer an interactive 4D display of the trajectories of satellites known to SSCweb (Satellite Situation Center). The Space Physics Data Facility (SPDF) is the archive of non-solar data for the Heliospheric Science Division (HSD) at NASA's Goddard Space Flight Center.
NewsUnravel Dataset
zenodo.org
data.niaid.nih.gov
csv, png
Updated Jul 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
anon; anon (2024). NewsUnravel Dataset [Dataset]. http://doi.org/10.5281/zenodo.8344891
Explore at:
csv, pngAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8344891
Dataset updated
Jul 11, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
anon; anon
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
About the NUDA Dataset
Media bias is a multifaceted problem, leading to one-sided views and impacting decision-making. A way to address bias in news articles is to automatically detect and indicate it through machine-learning methods. However, such detection is limited due to the difficulty of obtaining reliable training data. To facilitate the data-gathering process, we introduce NewsUnravel, a news-reading web application leveraging an initially tested feedback mechanism to collect reader feedback on machine-generated bias highlights within news articles. Our approach augments dataset quality by significantly increasing inter-annotator agreement by 26.31% and improving classifier performance by 2.49%. As the first human-in-the-loop application for media bias, NewsUnravel shows that a user-centric approach to media bias data collection can return reliable data while being scalable and evaluated as easy to use. NewsUnravel demonstrates that feedback mechanisms are a promising strategy to reduce data collection expenses, fluidly adapt to changes in language, and enhance evaluators' diversity.

General

This dataset was created through user feedback on automatically generated bias highlights on news articles on the website NewsUnravel made by ANON. Its goal is to improve the detection of linguistic media bias for analysis and to indicate it to the public. Support came from ANON. None of the funders played any role in the dataset creation process or publication-related decisions.

The dataset consists of text, namely biased sentences with binary bias labels (processed, biased or not biased) as well as metadata about the article. It includes all feedback that was given. The single ratings (unprocessed) used to create the labels with correlating User IDs are included.

For training, this dataset was combined with the BABE dataset. All data is completely anonymous. Some sentences might be offensive or triggering as they were taken from biased or more extreme news sources. The dataset does not identify sub-populations or can be considered sensitive to them, nor is it possible to identify individuals.

Description of the Data Files

This repository contains the datasets for the anonymous NewsUnravel submission. The tables contain the following data:

NUDAdataset.csv: the NUDA dataset with 310 new sentences with bias labels
Statistics.png: contains all Umami statistics for NewsUnravel's usage data
Feedback.csv: holds the participantID of a single feedback with the sentence ID (contentId), the bias rating, and provided reasons
Content.csv: holds the participant ID of a rating with the sentence ID (contentId) of a rated sentence and the bias rating, and reason, if given
Article.csv: holds the article ID, title, source, article metadata, article topic, and bias amount in %
Participant.csv: holds the participant IDs and data processing consent

Collection Process

Data was collected through interactions with the Feedback Mechanism on NewsUnravel. A news article was displayed with automatically generated bias highlights. Each highlight could be selected, and readers were able to agree or disagree with the automatic label. Through a majority vote, labels were generated from those feedback interactions. Spammers were excluded through a spam detection approach.

Readers came to our website voluntarily through posts on LinkedIn and social media as well as posts on university boards. The data collection period lasted for one week, from March 4th to March 11th (2023). The landing page informed them about the goal and the data processing. After being informed, they could proceed to the article overview.

So far, the dataset has been used on top of BABE to train a linguistic bias classifier, adopting hyperparameter configurations from BABE with a pre-trained model from Hugging Face.
The dataset will be open source. On acceptance, a link with all details and contact information will be provided. No third parties are involved.

The dataset will not be maintained as it captures the first test of NewsUnravel at a specific point in time. However, new datasets will arise from further iterations. Those will be linked in the repository. Please cite the NewsUnravel paper if you use the dataset and contact us if you're interested in more information or joining the project.
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
h
fineweb
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
Explore at:
Unique identifier
https://doi.org/10.57967/hf/2493
Dataset authored and provided by
FineData
License
https://choosealicense.com/licenses/odc-by/https://choosealicense.com/licenses/odc-by/
Description
🍷 FineWeb

15 trillion tokens of the finest data the 🌐 web has to offer

What is it?

The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated english web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
E-commerce - Users of a French C2C fashion store
kaggle.com
Updated Feb 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeffrey Mvutu Mabilama (2024). E-commerce - Users of a French C2C fashion store [Dataset]. https://www.kaggle.com/jmmvutu/ecommerce-users-of-a-french-c2c-fashion-store/notebooks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 24, 2024
Dataset provided by
Kaggle
Authors
Jeffrey Mvutu Mabilama
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
French
Description
Foreword

This users dataset is a preview of a much bigger dataset, with lots of related data (product listings of sellers, comments on listed products, etc...).

My Telegram bot will answer your queries and allow you to contact me.

Context

There are a lot of unknowns when running an E-commerce store, even when you have analytics to guide your decisions.

Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other user's articles).

This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try and understand what you can expect of your users and determine in advance how your grows may be.

For instance, if you see that most of your users are not very active, you may look into this dataset to compare your store's performance.

If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to suits more your needs.

This dataset is part of a preview of a much larger dataset. Please contact me for more.

Content

The data was scraped from a successful online C2C fashion store with over 10M registered users. The store was first launched in Europe around 2009 then expanded worldwide.

Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

Questions you might want to answer using this dataset:

Are e-commerce users interested in social network feature ?

Are my users active enough (compared to those of this dataset) ?

How likely are people from other countries to sign up in a C2C website ?

How many users are likely to drop off after years of using my service ?

Example works:

Report(s) made using SQL queries can be found on the data.world page of the dataset.

Notebooks may be found on the Kaggle page of the dataset.

License

CC-BY-NC-SA 4.0

For other licensing options, contact me.
Linear Regression E-commerce Dataset
kaggle.com
zip
Updated Sep 16, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saurabh Kolawale (2019). Linear Regression E-commerce Dataset [Dataset]. https://www.kaggle.com/datasets/kolawale/focusing-on-mobile-app-or-website
Explore at:
zip(44169 bytes)Available download formats
Dataset updated
Sep 16, 2019
Authors
Saurabh Kolawale
Description
This dataset is having data of customers who buys clothes online. The store offers in-store style and clothing advice sessions. Customers come in to the store, have sessions/meetings with a personal stylist, then they can go home and order either on a mobile app or website for the clothes they want.

The company is trying to decide whether to focus their efforts on their mobile app experience or their website.
d
People Counts May 2022
data.dundeecity.gov.uk
dtechtive.com
+1more
Updated Jun 29, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DCC Public GIS Portal (2022). People Counts May 2022 [Dataset]. https://data.dundeecity.gov.uk/datasets/976e97bf048c4aa6b985b260c3f39804
Explore at:
Dataset updated
Jun 29, 2022
Dataset authored and provided by
DCC Public GIS Portal
Description
People Counts This data set is sourced from Dundee City Council’s Public Space Camera Surveillance System. It shows a count of people in 8 specified areas across Dundee. The data set shows a snapshot of people within these areas every Monday, Wednesday and Saturday during the period 1pm-2pm.This data is experimental and subject to further refinement. Please note that due the nature of CCTV cameras at times data may not be collected as specified above. Therefore, caution should be exercised when analysing data and drawing conclusions for this data set.CCTV datasets contain information on object detections taken from a selection of the CCTV cameras throughout Dundee City. CCTV images are translated into object counts, objects counted include ‘person’, ‘car’, ‘bicycle’, ‘bus’, ‘motorcycle', 'truck, ‘pickup truck 'and ‘van’. The data is generated and owned by Dundee City Council. Copyright © Dundee City Council 2022. This dataset is available for use under the Open Government Licence.Background information about the Dundee CCTV cameras including a map showing the location of the cameras is available on the Dundee City Council website and can be accessed using the following link:https://www.dundeecity.gov.uk/service-area/city-development/sustainable-transport-and-roads/dundees-public-space-camera-surveillance-system
Z
Effect of suicide rates on life expectancy dataset
data.niaid.nih.gov
zenodo.org
Updated Apr 16, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Filip Zoubek (2021). Effect of suicide rates on life expectancy dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4694269
Explore at:
Dataset updated
Apr 16, 2021
Dataset authored and provided by
Filip Zoubek
License
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0)https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
Description
Effect of suicide rates on life expectancy dataset

Abstract In 2015, approximately 55 million people died worldwide, of which 8 million committed suicide. In the USA, one of the main causes of death is the aforementioned suicide, therefore, this experiment is dealing with the question of how much suicide rates affects the statistics of average life expectancy. The experiment takes two datasets, one with the number of suicides and life expectancy in the second one and combine data into one dataset. Subsequently, I try to find any patterns and correlations among the variables and perform statistical test using simple regression to confirm my assumptions.

Data

The experiment uses two datasets - WHO Suicide Statistics[1] and WHO Life Expectancy[2], which were firstly appropriately preprocessed. The final merged dataset to the experiment has 13 variables, where country and year are used as index: Country, Year, Suicides number, Life expectancy, Adult Mortality, which is probability of dying between 15 and 60 years per 1000 population, Infant deaths, which is number of Infant Deaths per 1000 population, Alcohol, which is alcohol, recorded per capita (15+) consumption, Under-five deaths, which is number of under-five deaths per 1000 population, HIV/AIDS, which is deaths per 1 000 live births HIV/AIDS, GDP, which is Gross Domestic Product per capita, Population, Income composition of resources, which is Human Development Index in terms of income composition of resources, and Schooling, which is number of years of schooling.

LICENSE

THE EXPERIMENT USES TWO DATASET - WHO SUICIDE STATISTICS AND WHO LIFE EXPECTANCY, WHICH WERE COLLEECTED FROM WHO AND UNITED NATIONS WEBSITE. THEREFORE, ALL DATASETS ARE UNDER THE LICENSE ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 3.0 IGO (https://creativecommons.org/licenses/by-nc-sa/3.0/igo/).

[1] https://www.kaggle.com/szamil/who-suicide-statistics

[2] https://www.kaggle.com/kumarajarshi/life-expectancy-who
M
U.S. Wind Turbine Database, Minnesota and National
gisdata.mn.gov
ags_mapserver, csv +6
Updated Jun 24, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Geospatial Information Office (2025). U.S. Wind Turbine Database, Minnesota and National [Dataset]. https://gisdata.mn.gov/dataset/util-uswtdb
Explore at:
csv, html, fgdb, webapp, ags_mapserver, shp, gpkg, jpegAvailable download formats
Dataset updated
Jun 24, 2025
Dataset provided by
Geospatial Information Office
Area covered
Minnesota
Description
This dataset provides locations and technical specifications of wind turbines in the United States, almost all of which are utility-scale. Utility-scale turbines are ones that generate power and feed it into the grid, supplying a utility with energy. They are usually much larger than turbines that would feed a homeowner or business.

The data formats downloadable from the Minnesota Geospatial Commons contain just the Minnesota turbines. Data, maps and services accessed from the USWTDB website provide nationwide turbines.

The regularly updated database has wind turbine records that have been collected, digitized, and locationally verified. Turbine data were gathered from the Federal Aviation Administration's (FAA) Digital Obstacle File (DOF) and Obstruction Evaluation Airport Airspace Analysis (OE-AAA), the American Wind Energy Association (AWEA), Lawrence Berkeley National Laboratory (LBNL), and the United States Geological Survey (USGS), and were merged and collapsed into a single data set.

Verification of the turbine positions was done by visual interpretation using high-resolution aerial imagery in Esri ArcGIS Desktop. A locational error of plus or minus 10 meters for turbine locations was tolerated. Technical specifications for turbines were assigned based on the wind turbine make and models as provided by manufacturers and project developers directly, and via FAA datasets, information on the wind project developer or turbine manufacturer websites, or other online sources. Some facility and turbine information on make and model did not exist or was difficult to obtain. Thus, uncertainty may exist for certain turbine specifications. Similarly, some turbines were not yet built, not built at all, or for other reasons cannot be verified visually. Location and turbine specifications data quality are rated and a confidence is recorded for both. None of the data are field verified.

The U.S. Wind Turbine Database website provides the national data in many different formats: shapefile, CSV, GeoJSON, web services (cached and dynamic), API, and web viewer. See: https://eerscmap.usgs.gov/uswtdb/

The web viewer provides many options to search; filter by attribute, date and location; and customize the map display. For details and screenshots of these options, see: https://eerscmap.usgs.gov/uswtdb/help/

------------
This metadata record was adapted by the Minnesota Geospatial Information Office (MnGeo) from the national version of the metadata. It describes the Minnesota extract of the shapefile data that has been projected from geographic to UTM coordinates and converted to Esri file geodatabase (fgdb) format. There may be more recent updates available on the national website. Accessing the data via the national web services or API will always provide the most recent data.
b
Corporate Website — Analytics — Popular pages
data.brisbane.qld.gov.au
prod-brisbane-queensland.opendatasoft.com
csv, excel, json
Updated Apr 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Corporate Website — Analytics — Popular pages [Dataset]. https://data.brisbane.qld.gov.au/explore/dataset/corporate-website-analytics-popular-pages/
Explore at:
json, excel, csvAvailable download formats
Dataset updated
Apr 17, 2025
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Monthly analytics reports for the Brisbane City Council website

Information regarding the sessions for Brisbane City Council website during the month including page views and unique page views.
Movie lens 1m
kaggle.com
grouplens.org
+1more
zip
Updated Apr 7, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Omar Younis (2024). Movie lens 1m [Dataset]. https://www.kaggle.com/datasets/oyounis/movie-lens-1m
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Apr 7, 2024
Authors
Omar Younis
Description
MovieLens 1M Dataset

Overview

The MovieLens 1M dataset is a collection of movie ratings provided by the GroupLens Research Project at the University of Minnesota. This dataset contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 users who joined MovieLens in 2000. It is widely used in research on recommendation systems, data mining, and machine learning.

Contents

Ratings: 1,000,209 ratings on a scale of 1-5 stars.

Movies: Information on 3,883 movies including titles, genres, and release dates.

Users: Demographic information for 6,040 users including age, gender, occupation, and zip codes.

Features

Ratings: Each rating has the following attributes:

UserID: A unique identifier for the user.

MovieID: A unique identifier for the movie.

Rating: The rating given by the user (1-5 stars).

Timestamp: The time the rating was made.

Movies: Each movie entry includes:

MovieID: A unique identifier for the movie.

Title: The title of the movie.

Genres: The genres associated with the movie (e.g., Action, Comedy).

Users: Each user entry includes:

UserID: A unique identifier for the user.

Gender: The gender of the user.

Age: The age of the user.

Occupation: The occupation of the user.

Zip-code: The zip-code of the user.

Use Cases

Recommendation Systems: Ideal for building and testing collaborative filtering and content-based recommendation algorithms.

Data Mining: Suitable for discovering patterns and insights in user behavior.

Machine Learning: Can be used for training predictive models for various tasks such as predicting user ratings, movie popularity, etc.

Format

The dataset is provided in plain text format with tab-separated values, making it easy to load and process using various data analysis tools and programming languages.

Acknowledgements

This dataset is made available by the GroupLens Research Project at the University of Minnesota. For more information about their work, visit the GroupLens website.
Texas County Voting Website Data
kaggle.com
zip
Updated Sep 1, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emily Russell (2020). Texas County Voting Website Data [Dataset]. https://www.kaggle.com/mewbius/lwv-oct-2017
Explore at:
zip(40347 bytes)Available download formats
Dataset updated
Sep 1, 2020
Authors
Emily Russell
Area covered
Texas
Description
General

The League of Women Voters conducts surveys of Texas County voting websites. The data and further reading is available here (under County Website Reports). Any mistakes or errors found here are mine and the data on the LWV website is the authoritative data - I have no affiliation with the LWV but wanted to make the datasets more accessible.

Data Changes

I cleaned some of the data (split numeric and text ratings from one column to two columns) and made a few edits to values that appeared to be typos based on context - these will be noted in the description of each set. Column names were shortened in some cases and "NA" was added to empty cells. Each survey used slightly different questions, thought both 2016 sets appear to use the same ones and the 2017 is very similar.

Commonalities

Abbreviations used include SOS for the Texas Secretary of State website and 203 refers to Section 203 of the federal Voting Rights Act (for information, see this 2016 report).

Each dataset has at least these columns: county name, fips, date, total points, overall evaluation, perc calc na, and perc calc num.

The county name was changed to match the name listed in the FIPS set, there were some typos and variations with hyphens.

The FIPS (Federal Information Processing Standards) code is from here.

The date is the month and year associated with the survey.

The total number of points is the sum of all points a county received.

The overall evaluation is the category associated with the number of points - these varied between sets and for 2020 the categories from the report were added to the dataset.

I added two columns to the end of each set, perc_calc_na and perc_calc_num that represent the percent of total points for that county out of the possible points for that dataset - the first has "NA" for any county without a website and the second has "-1" for those counties. Some of the surveys included bonus points - these were included in the total possible points for the calculation.

Facebook

Twitter

Click to copy link

Link copied

Cite

Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code

Daily website visitors (time series regression)

Predict tomorrow's number of website visitors from 5 years of daily data

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 20, 2020

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Bob Nau

Description

Context

This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

Content

The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.

Inspiration

This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.

Clear search

Close search

Google apps

Main menu

Daily website visitors (time series regression)

Context

Content

Inspiration

‘Popular Website Traffic Over Time ’ analyzed by Analyst-2

About this dataset

Background

Methodology

Source

How to use this dataset

Acknowledgements

Start A New Notebook!

Google Analytics Sample

Context

Content

Acknowledgements

Inspiration

Data from: Google Analytics & Twitter dataset from a movies, TV series and...

Monthly Page Views to CDC.gov

Website Analytics

Alexa Domains Dataset

Google Analytics Sample

SSC 4D Orbit Viewer Web Services API at the Space Physics Data Facility...

NewsUnravel Dataset

Datasets for Sentiment Analysis

fineweb

E-commerce - Users of a French C2C fashion store

Foreword

Context

Content

Acknowledgements

Inspiration

License

Linear Regression E-commerce Dataset

People Counts May 2022

Effect of suicide rates on life expectancy dataset

U.S. Wind Turbine Database, Minnesota and National

Corporate Website — Analytics — Popular pages

Movie lens 1m

MovieLens 1M Dataset

Overview

Contents

Features

Use Cases

Format

Acknowledgements

Texas County Voting Website Data

General

Data Changes

Commonalities

Daily website visitors (time series regression)

Predict tomorrow's number of website visitors from 5 years of daily data

Context

Content

Inspiration