46 datasets found

Daily website visitors (time series regression)
kaggle.com
Updated Aug 20, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 20, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Bob Nau
Description
Context

This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

Content

The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.

Inspiration

This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.
D
Website Analytics
data.nola.gov
gimi9.com
+4more
application/rdfxml +4
Updated Feb 2, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Information Technology and Innovation Web Team (2017). Website Analytics [Dataset]. https://data.nola.gov/City-Administration/Website-Analytics/62d3-pst8
Explore at:
csv, tsv, xml, application/rdfxml, jsonAvailable download formats
Dataset updated
Feb 2, 2017
Dataset authored and provided by
Information Technology and Innovation Web Team
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This data about nola.gov provides a window into how people are interacting with the the City of New Orleans online. The data comes from a unified Google Analytics account for New Orleans. We do not track individuals and we anonymize the IP addresses of all visitors.
Google Analytics Sample
console.cloud.google.com
Updated Jul 15, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
https://console.cloud.google.com/marketplace/browse?filter=partner:Obfuscated%20Google%20Analytics%20360%20data&hl=de&inv=1&invt=Ab2fng (2017). Google Analytics Sample [Dataset]. https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?hl=de
Explore at:
Dataset updated
Jul 15, 2017
Dataset provided by
Googlehttp://google.com/
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset provides 12 months (August 2016 to August 2017) of obfuscated Google Analytics 360 data from the Google Merchandise Store , a real ecommerce store that sells Google-branded merchandise, in BigQuery. It’s a great way analyze business data and learn the benefits of using BigQuery to analyze Analytics 360 data Learn more about the data The data includes The data is typical of what an ecommerce website would see and includes the following information:Traffic source data: information about where website visitors originate, including data about organic traffic, paid search traffic, and display trafficContent data: information about the behavior of users on the site, such as URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions on the Google Merchandise Store website.Limitations: All users have view access to the dataset. This means you can query the dataset and generate reports but you cannot complete administrative tasks. Data for some fields is obfuscated such as fullVisitorId, or removed such as clientId, adWordsClickInfo and geoNetwork. “Not available in demo dataset” will be returned for STRING values and “null” will be returned for INTEGER values when querying the fields containing no data.This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset. Watch this short video to learn how to get started quickly using BigQuery to access public datasets. What is BigQuery
d
Open Data Website Traffic
catalog.data.gov
data.lacity.org
+1more
Updated Jun 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.lacity.org (2025). Open Data Website Traffic [Dataset]. https://catalog.data.gov/dataset/open-data-website-traffic
Explore at:
Dataset updated
Jun 21, 2025
Dataset provided by
data.lacity.org
Description
Daily utilization metrics for data.lacity.org and geohub.lacity.org. Updated monthly
Google Analytics Sample
kaggle.com
zip
Updated Sep 19, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/bigquery/google-analytics-sample
Explore at:
zip(0 bytes)Available download formats
Dataset updated
Sep 19, 2019
Dataset provided by
BigQueryhttps://cloud.google.com/bigquery
Googlehttp://google.com/
Authors
Google BigQuery
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

Content

The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

Fork this kernel to get started.

Acknowledgements

Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

Banner Photo by Edho Pratama from Unsplash.

Inspiration

What is the total number of transactions generated per device browser in July 2017?

The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

What was the average number of product pageviews for users who made a purchase in July 2017?

What was the average number of product pageviews for users who did not make a purchase in July 2017?

What was the average total transactions per user that made a purchase in July 2017?

What is the average amount of money spent per session in July 2017?

What is the sequence of pages viewed?
c
Exhibit of Datasets
datacatalogue.cessda.eu
ssh.datastations.nl
Updated Sep 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
P.K. Doorn; L. Breure (2024). Exhibit of Datasets [Dataset]. http://doi.org/10.17026/SS/TLTMIR
Explore at:
Unique identifier
https://doi.org/10.17026/SS/TLTMIR
Dataset updated
Sep 3, 2024
Dataset provided by
DANS (retired)
Authors
P.K. Doorn; L. Breure
Description
The Exhibit of Datasets was an experimental project with the aim of providing concise introductions to research datasets in the humanities and social sciences deposited in a trusted repository and thus made accessible for the long term. The Exhibit consists of so-called 'showcases', short webpages summarizing and supplementing the corresponding data papers, published in the Research Data Journal for the Humanities and Social Sciences. The showcase is a quick introduction to such a dataset, a bit longer than an abstract, with illustrations, interactive graphs and other multimedia (if available). As a rule it also offers the option to get acquainted with the data itself, through an interactive online spreadsheet, a data sample or link to the online database of a research project. Usually, access to these datasets requires several time consuming actions, such as downloading data, installing the appropriate software and correctly uploading the data into these programs. This makes it difficult for interested parties to quickly assess the possibilities for reuse in other projects.

The Exhibit aimed to help visitors of the website to get the right information at a glance by: - Attracting attention to (recently) acquired deposits: showing why data are interesting. - Providing a concise overview of the dataset's scope and research background; more details are to be found, for example, in the associated data paper in the Research Data Journal (RDJ). - Bringing together references to the location of the dataset and to more detailed information elsewhere, such as the project website of the data producers. - Allowing visitors to explore (a sample of) the data without downloading and installing associated software at first (see below). - Publishing related multimedia content, such as videos, animated maps, slideshows etc., which are currently difficult to include in online journals as RDJ. - Making it easier to review the dataset. The Exhibit would also have been the right place to publish these reviews in the same way as a webshop publishes consumer reviews of a product, but this could not yet be achieved within the limited duration of the project.

Note (1) The text of the showcase is a summary of the corresponding data paper in RDJ, and as such a compilation made by the Exhibit editor. In some cases a section 'Quick start in Reusing Data' is added, whose text is written entirely by the editor. (2) Various hyperlinks such as those to pages within the Exhibit website will no longer work. The interactive Zoho spreadsheets are also no longer available because this facility has been discontinued.
e
OGD Portal: Daily usage by record (since January 2024)
data.europa.eu
csv, excel xls, json +5
Updated Apr 6, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
kanton-basel-landschaft (2025). OGD Portal: Daily usage by record (since January 2024) [Dataset]. https://data.europa.eu/data/datasets/12610-kanton-basel-landschaft?locale=en
Explore at:
n3, rdf xml, csv, json-ld, json, rdf turtle, parquet, excel xlsAvailable download formats
Dataset updated
Apr 6, 2025
Dataset authored and provided by
kanton-basel-landschaft
License
http://dcat-ap.ch/vocabulary/licenses/terms_byhttp://dcat-ap.ch/vocabulary/licenses/terms_by
Description
The data on the use of the data sets on the OGD portal BL (data.bl.ch) are collected and published by the specialist and coordination office OGD BL. Contains the day the usage was measured.dataset_title: The title of the dataset_id record: The technical ID of the dataset.visitors: Specifies the number of daily visitors to the record. Visitors are recorded by counting the unique IP addresses that recorded access on the day of the survey. The IP address represents the network address of the device from which the portal was accessed.interactions: Includes all interactions with any record on data.bl.ch. A visitor can trigger multiple interactions. Interactions include clicks on the website (searching datasets, filters, etc.) as well as API calls (downloading a dataset as a JSON file, etc.).RemarksOnly calls to publicly available datasets are shown.IP addresses and interactions of users with a login of the Canton of Basel-Landschaft - in particular of employees of the specialist and coordination office OGD - are removed from the dataset before publication and therefore not shown.Calls from actors that are clearly identifiable as bots by the user agent header are also not shown.Combinations of dataset and date for which no use occurred (Visitors == 0 & Interactions == 0) are not shown.Due to synchronization problems, data may be missing by the day.
Website Metrics
catalog.data.gov
datasets.ai
+1more
Updated Jun 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FEMA/Office of External Affairs/Communication Division (2025). Website Metrics [Dataset]. https://catalog.data.gov/dataset/website-metrics
Explore at:
Dataset updated
Jun 7, 2025
Dataset provided by
Federal Emergency Management Agencyhttp://www.fema.gov/
Description
Per the Federal Digital Government Strategy, the Department of Homeland Security Metrics Plan, and the Open FEMA Initiative, FEMA is providing the following web performance metrics with regards to FEMA.gov.rnrnInformation in this dataset includes total visits, avg visit duration, pageviews, unique visitors, avg pages/visit, avg time/page, bounce ratevisits by source, visits by Social Media Platform, and metrics on new vs returning visitors.rnrnExternal Affairs strives to make all communications accessible. If you have any challenges accessing this information, please contact FEMAWebTeam@fema.dhs.gov.
n
FOI-01782 - Datasets - Open Data Portal
opendata.nhsbsa.net
Updated Mar 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). FOI-01782 - Datasets - Open Data Portal [Dataset]. https://opendata.nhsbsa.net/dataset/foi-01782
Explore at:
Dataset updated
Mar 21, 2024
Description
Thank you for explaining that you don’t collect data on the number of abandoned applications. Alternatively, please could you share the website analytics which shows the number of visitors to each webpage, from this information we can compare against form completion rates and if there is a particular drop in traffic on certain pages/questions? Response A copy of the information is attached. Please read the below notes to ensure correct understanding of the data. Attached is raw data covering individual page hits from 19 February 2024 to 17 March 2024. Please be advised that our Data Analysts have viewed the Google analytics for the Healthy Start website pages, and despite the search options including country, regions and town or city, the data provided within these fields is an approximation and cannot be guaranteed as a true location of a user. We believe that Google analytics geo location capabilities are based on IP (Internet Protocol) addresses which may not resolve to a true location, and instead could be based off the users ISP (Internet Service Provider) server location. Therefore, please be aware that this raw data is not reliable.
E-commerce - Users of a French C2C fashion store
kaggle.com
Updated Feb 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jeffrey Mvutu Mabilama (2024). E-commerce - Users of a French C2C fashion store [Dataset]. https://www.kaggle.com/jmmvutu/ecommerce-users-of-a-french-c2c-fashion-store/notebooks
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 24, 2024
Dataset provided by
Kaggle
Authors
Jeffrey Mvutu Mabilama
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
French
Description
Foreword

This users dataset is a preview of a much bigger dataset, with lots of related data (product listings of sellers, comments on listed products, etc...).

My Telegram bot will answer your queries and allow you to contact me.

Context

There are a lot of unknowns when running an E-commerce store, even when you have analytics to guide your decisions.

Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other user's articles).

This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try and understand what you can expect of your users and determine in advance how your grows may be.

For instance, if you see that most of your users are not very active, you may look into this dataset to compare your store's performance.

If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to suits more your needs.

This dataset is part of a preview of a much larger dataset. Please contact me for more.

Content

The data was scraped from a successful online C2C fashion store with over 10M registered users. The store was first launched in Europe around 2009 then expanded worldwide.

Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.

Acknowledgements

We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

Inspiration

Questions you might want to answer using this dataset:

Are e-commerce users interested in social network feature ?

Are my users active enough (compared to those of this dataset) ?

How likely are people from other countries to sign up in a C2C website ?

How many users are likely to drop off after years of using my service ?

Example works:

Report(s) made using SQL queries can be found on the data.world page of the dataset.

Notebooks may be found on the Kaggle page of the dataset.

License

CC-BY-NC-SA 4.0

For other licensing options, contact me.
A
‘Spotify Past Decades Songs Attributes’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Spotify Past Decades Songs Attributes’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-spotify-past-decades-songs-attributes-57a7/4e9b7dfe/?iid=011-638&v=presentation
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Spotify Past Decades Songs Attributes’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/cnic92/spotify-past-decades-songs-50s10s on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

Why do we like some songs more than others? Is there something about a song that pleases out subconscious, making us listening to it on repeat? To understand this I collected various attributes from a selection of songs available in the Spotify's playlist "All out ..s" starting from the 50s up to the newly ended 10s. Can you find the secret sauce to make a song popular?

Content

This data repo contains 7 datasets (.csv files), each representing a Spotify's "All out ..s" type of playlist. Those playlists collect the most popular/iconic songs from the decade. For each song, a set of attributes have been reported in order to perform some data analysis. The attributes have been scraped from this amazing website. In particular, according to the website the attributes are:

top genre: genre of the song

year: year of the song (due to re-releases, the year might not correspond to the release year of the original song)

bpm(beats per minute): beats per minute

nrgy(energy): energy of a song, the higher the value the more energetic the song is

dnce(danceability): the higher the value, the easier it is to dance to this song.

dB(loudness): the higher the value, the louder the song.

live(liveness): the higher the value, the more likely the song is a live recording.

val(valence): the higher the value, the more positive mood for the song.

dur(duration): the duration of the song.

acous(acousticness): the higher the value the more acoustic the song is.

spch(speechiness): the higher the value the more spoken word the song contains.

pop(popularity): the higher the value the more popular the song is.

Acknowledgements

I got inspired by the top-notch work by Leonardo Henrique in this dataset. Thanks to him I discovered this website, from which all the data collected here have been scraped.

--- Original source retains full ownership of the source dataset ---
P
Alexa Domains Dataset
paperswithcode.com
opendatalab.com
Updated Feb 1, 2001
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Isaac Corley; Jonathan Lwowski; Justin Hoffman (2001). Alexa Domains Dataset [Dataset]. https://paperswithcode.com/dataset/gagan-bhatia
Explore at:
Dataset updated
Feb 1, 2001
Authors
Isaac Corley; Jonathan Lwowski; Justin Hoffman
Description
This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest
The Items Dataset
zenodo.org
Updated Nov 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Patrick Egan; Patrick Egan (2024). The Items Dataset [Dataset]. http://doi.org/10.5281/zenodo.10964134
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.10964134
Dataset updated
Nov 13, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Patrick Egan; Patrick Egan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset originally created 03/01/2019 UPDATE: Packaged on 04/18/2019 UPDATE: Edited README on 04/18/2019

I. About this Data Set This data set is a snapshot of work that is ongoing as a collaboration between Kluge Fellow in Digital Studies, Patrick Egan and an intern at the Library of Congress in the American Folklife Center. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled, “Connections In Sound” invites you to use and re-use this data.

The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.

The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by searches of the collections by the scholar both on the internet at https://catalog.loc.gov and by using card catalogs.

Updates to these datasets will be announced and published as the project progresses.

II. What’s included? This data set includes:

The Items Dataset – a .CSV containing Media Note, OriginalFormat, On Website, Collection Ref, Missing In Duplication, Collection, Outside Link, Performer, Solo/multiple, Sub-item, type of tune, Tune, Position, Location, State, Date, Notes/Composer, Potential Linked Data, Instrument, Additional Notes, Tune Cleanup. This .CSV is the direct export of the Items Google Spreadsheet

III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improve access and connect sounds in the American Folklife Collections by focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.

IV. Data Set Field Descriptions

IV

a) Collections dataset field descriptions

ItemId – this is the identifier for the collection that was found at the AFC

Viewed – if the collection has been viewed, or accessed in any way by the researchers.

On LOC – whether or not there are audio recordings of this collection available on the Library of Congress website.

On Other Website – if any of the recordings in this collection are available elsewhere on the internet

Original Format – the format that was used during the creation of the recordings that were found within each collection

Search – this indicates the type of search that was performed in order that resulted in locating recordings and collections within the AFC

Collection – the official title for the collection as noted on the Library of Congress website

State – The primary state where recordings from the collection were located

Other States – The secondary states where recordings from the collection were located

Era / Date – The decade or year associated with each collection

Call Number – This is the official reference number that is used to locate the collections, both in the urls used on the Library website, and in the reference search for catalog cards (catalog cards can be searched at this address: https://memory.loc.gov/diglib/ihas/html/afccards/afccards-home.html)

Finding Aid Online? – Whether or not a finding aid is available for this collection on the internet

b) Items dataset field descriptions

id – the specific identification of the instance of a tune, song or dance within the dataset

Media Note – Any information that is included with the original format, such as identification, name of physical item, additional metadata written on the physical item

Original Format – The physical format that was used when recording each specific performance. Note: this field is used in order to calculate the number of physical items that were created in each collection such as 32 wax cylinders.

On Webste? – Whether or not each instance of a performance is available on the Library of Congress website

Collection Ref – The official reference number of the collection

Missing In Duplication – This column marks if parts of some recordings had been made available on other websites, but not all of the recordings were included in duplication (see recordings from Philadelphia Céilí Group on Villanova University website)

Collection – The official title of the collection given by the American Folklife Center

Outside Link – If recordings are available on other websites externally

Performer – The name of the contributor(s)

Solo/multiple – This field is used to calculate the amount of solo performers vs group performers in each collection

Sub-item – In some cases, physical recordings contained extra details, the sub-item column was used to denote these details

Type of item – This column describes each individual item type, as noted by performers and collectors

Item – The item title, as noted by performers and collectors. If an item was not described, it was entered as “unidentified”

Position – The position on the recording (in some cases during playback, audio cassette player counter markers were used)

Location – Local address of the recording

State – The state where the recording was made

Date – The date that the recording was made

Notes/Composer – The stated composer or source of the item recorded

Potential Linked Data – If items may be linked to other recordings or data, this column was used to provide examples of potential relationships between them

Instrument – The instrument(s) that was used during the performance

Additional Notes – Notes about the process of capturing, transcribing and tagging recordings (for researcher and intern collaboration purposes)

Tune Cleanup – This column was used to tidy each item so that it could be read by machines, but also so that spelling mistakes from the Item column could be corrected, and as an aid to preserving iterations of the editing process

V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under creative commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to the creators of these datasets.

VI. Creator and Contributor Information

Creator: Connections In Sound

Contributors: Library of Congress Labs

VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.
Z
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music...
data.niaid.nih.gov
Updated Mar 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Berg-Kirkpatrick, Taylor (2025). PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13763755
Explore at:
Dataset updated
Mar 17, 2025
Dataset provided by
Berg-Kirkpatrick, Taylor
McAuley, Julian
Long, Phillip
Novack, Zachary
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We introduce PDMX: a Public Domain MusicXML dataset for symbolic music processing. Refer to our paper for more information, and our GitHub repository for any code-related details. Please cite both our paper and our collaborators' paper if you use this dataset (see our GitHub for more information).

Upon further use of the PDMX dataset, we discovered a discrepancy between the public-facing copyright metadata on the MuseScore website and the internal copyright data of the MuseScore files themselves, which affected 31,221 (12.29% of) songs. We have decided to proceed with the former given its public visibility on Musescore (i.e. this is what the MuseScore website presents its users with). We have noted files with conflicting internal licenses in the license_conflict column of PDMX. We recommend using the no_license_conflict subset of PDMX (which still includes 222,856 songs) moving forward.

Additionally, for each song in PDMX, we not only provide the MusicRender and metadata JSON files, but we also try to include the associated compressed MusicXML (MXL), sheet music (PDF), and MIDI (MID) files when available. Due to the corruption of 42 of the original MuseScore files, these songs lack those associated files (since they could not be converted to those formats) and only include the MusicRender and metadata JSON files. The all_valid subset of PDMX describes the songs where all associated files are valid.
Z
Dataset for: The Evolution of the Manosphere Across the Web
data.niaid.nih.gov
Updated Aug 30, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emiliano De Cristofaro (2020). Dataset for: The Evolution of the Manosphere Across the Web [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4007912
Explore at:
Dataset updated
Aug 30, 2020
Dataset provided by
Emiliano De Cristofaro
Gianluca Stringhini
Jeremy Blackburn
Savvas Zannettou
Summer Long
Barry Bradlyn
Stephanie Greenberg
Manoel Horta Ribeiro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Evolution of the Manosphere Across the Web

We make available data related to subreddit and standalone forums from the manosphere.

We also make available Perspective API annotations for all posts.

You can find the code in GitHub.

Please cite this paper if you use this data:

@article{ribeiroevolution2021, title={The Evolution of the Manosphere Across the Web}, author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas}, booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}}, year={2021} }

Reddit data

We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). These are available, 1 line per post in each subreddit Reddit in /ndjson/reddit.ndjson. A sample for example is:

{ "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.

Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.

Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.

No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.

I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.

Tallcels are fakecels and they all can (and should) suck my cock.

If I were 17cm taller my life would be a heaven and I would be the happiest man alive.

Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }

Forums

We here describe the .sqlite and .ndjson files that contain the data from the following forums.

(avfm) --- https://d2ec906f9aea-003845.vbulletin.net (incels) --- https://incels.co/ (love_shy) --- http://love-shy.com/lsbb/ (redpilltalk) --- https://redpilltalk.com/ (mgtow) --- https://www.mgtow.com/forums/ (rooshv) --- https://www.rooshvforum.com/ (pua_forum) --- https://www.pick-up-artist-forum.com/ (the_attraction) --- http://www.theattractionforums.com/

The files are in folders /sqlite/ and /ndjson.

2.1 .sqlite

All the tables in the sqlite. datasets follow a very simple {key:value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a python dictionary or a list. This file contains three tables:

idx each key is the relative address to a thread and maps to a post. Each post is represented by a dict:

"type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticked/pool/locked posts.
"title": (str) title of the thread; "link": (str) link to the thread; "author_topic": (str) username that created the thread; "replies": (int) number of replies, may differ from number of posts due to difference in crawling date; "views": (int) number of views; "subforum": (str) name of the subforum; "collected": (bool) indicates if raw posts have been collected; "crawled_idx_at": (str) datetime of the collection.

processed_posts each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:

"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

raw_posts each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:

"post_raw": (binary) raw html binary; "crawled_at": (str) datetime of the collection.

2.2 .ndjson

Each line consists of a json object representing a different comment with the following fields:

"author": (str) author's username; "resume_author": (str) author's little description; "joined_author": (str) date author joined; "messages_author": (int) number of messages the author has; "text_post": (str) text of the main post; "number_post": (int) number of the post in the thread; "id_post": (str) unique post identifier (depends), for sure unique within thread; "id_post_interaction": (list) list with other posts ids this post quoted; "date_post": (str) datetime of the post, "links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw'); "thread": (str) same as key; "crawled_at": (str) datetime of the collection.

Perspective

We also run each post and reddit post through perspective, the files are located in the /perspective/ folder. They are compressed with gzip. One example output

{ "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }

Working with sqlite

A nice way to read some of the files of the dataset is using SqliteDict, for example:

from sqlitedict import SqliteDict processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

for key, posts in processed_posts.items(): for post in posts: # here you could do something with each post in the dataset pass

Helpers

Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:

channel_dict.sqlite a sqlite where each key corresponds to a subreddit and values are lists of dictionaries users who posted on it, along with timestamps.

author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.

These are used in the paper for the migration analyses.

Examples and particularities for forums

Although we did our best to clean the data and be consistent across forums, this is not always possible. In the following subsections we talk about the particularities of each forum, directions to improve the parsing which were not pursued as well as give some examples on how things work in each forum.

6.1 incels

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: for the incel forums the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these in the on the example thread page.

quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.

6.2 LoveShy

Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.

types: no types were parsed. There are some rules in the forum, but not significant.

quotes: quotes were obtained from exact text+author match, or author match + a jaccard
g
Daily web access to the Open Data portal | gimi9.com
gimi9.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daily web access to the Open Data portal | gimi9.com [Dataset]. https://gimi9.com/dataset/eu_ds1474/
Explore at:
Description
The dataset contains information, divided by day, on the accesses made to the online services offered by the opendata portal and provided by the municipality of Milan. The pageviews column represents the total number of web pages, which have been displayed within the time frame used. The visits column represents the total number of visits made, within the time frame used. The visitors column represents the total number of unique visitors who have accessed the web pages. By unique visitor, we mean a visitor counted only once within the time frame used.
R
Basketball Dataset
universe.roboflow.com
zip
Updated May 25, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
zaki (2022). Basketball Dataset [Dataset]. https://universe.roboflow.com/zaki-b86c6/basketball-jagmz/model/4
Explore at:
zipAvailable download formats
Dataset updated
May 25, 2022
Dataset authored and provided by
zaki
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Hit Bounding Boxes
Description
Here are a few use cases for this project:

Sports Analysis: Coaches and analysts can use this computer vision model to track the performance of players during a game or practice session. They can get insights about precise ball movements, successful hits, and goal rates, leading to better training and strategic decisions.

Highlight Generation: Sports media companies can implement the "basketball" model to automatically detect exciting moments like successful goals or impressive hits during a game. This can enable them to create instant highlights for social media, web portals, or live broadcasts, enhancing user engagement.

Virtual Coaching: This model can be integrated into mobile applications or websites that offer virtual basketball coaching. Users would be able to upload their videos, and the model would provide them with feedback based on their technique, ball handling, and shooting accuracy.

Smart Camera Systems: The "basketball" model can be embedded in smart cameras for sports facilities or courts. This would allow the cameras to follow the action as it happens, automatically zooming in on goals or exciting plays, thus enhancing the overall viewing experience for spectators.

Basketball Simulation Games: Game developers can utilize the model's capability to recognize various aspects of a basketball game to create more realistic and engaging basketball simulation games. The AI-driven virtual players would exhibit authentic in-game actions and responses, providing a closer-to-real gaming experience to the users.
f
Data from: ADVERPred–Web Service for Prediction of Adverse Effects of Drugs
acs.figshare.com
zip
Updated Jun 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergey M. Ivanov; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Vladimir V. Poroikov (2023). ADVERPred–Web Service for Prediction of Adverse Effects of Drugs [Dataset]. http://doi.org/10.1021/acs.jcim.7b00568.s002
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.1021/acs.jcim.7b00568.s002
Dataset updated
Jun 7, 2023
Dataset provided by
ACS Publications
Authors
Sergey M. Ivanov; Alexey A. Lagunin; Anastasia V. Rudik; Dmitry A. Filimonov; Vladimir V. Poroikov
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Application of structure–activity relationships (SARs) for the prediction of adverse effects of drugs (ADEs) has been reported in many published studies. Training sets for the creation of SAR models are usually based on drug label information which allows for the generation of data sets for many hundreds of drugs. Since many ADEs may not be related to drug consumption, one of the main problems in such studies is the quality of data on drug–ADE pairs obtained from labels. The information on ADEs may be included in three sections of the drug labels: “Boxed warning,” “Warnings and Precautions,” and “Adverse reactions.” The first two sections, especially Boxed warning, usually contain the most frequent and severe ADEs that have either known or probable relationships to drug consumption. Using this information, we have created manually curated data sets for the five most frequent and severe ADEs: myocardial infarction, arrhythmia, cardiac failure, severe hepatotoxicity, and nephrotoxicity, with more than 850 drugs on average for each effect. The corresponding SARs were built with PASS (Prediction of Activity Spectra for Substances) software and had balanced accuracy values of 0.74, 0.7, 0.77, 0.67, and 0.75, respectively. They were implemented in a freely available ADVERPred web service (http://www.way2drug.com/adverpred/), which enables a user to predict five ADEs based on the structural formula of compound. This web service can be applied for estimation of the corresponding ADEs for hits and lead compounds at the early stages of drug discovery.
C
Monthly web access to the Open Data portal
ckan.mobidatalab.eu
csv, json
Updated Oct 9, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Unità Open Data (2023). Monthly web access to the Open Data portal [Dataset]. https://ckan.mobidatalab.eu/ru/dataset/ds1475_monthly-web-accesses-to-the-open-data-portal
Explore at:
csv(4846), json(14505)Available download formats
Dataset updated
Oct 9, 2023
Dataset provided by
Unità Open Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains information, divided by month, on accesses made to the online services offered by the opendata portal and provided by the municipality of Milan. The pageviews column represents the total number of web pages that have been viewed within the time frame used. The visits column represents the total visits made, within the time frame used. The visitors column represents the total number of unique visitors who have accessed the web pages. By unique visitor, we mean a visitor counted only once within the time frame used.
Game by Game MLB Batter Data (2017-2020)
kaggle.com
Updated Aug 5, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
John Adamek (2022). Game by Game MLB Batter Data (2017-2020) [Dataset]. https://www.kaggle.com/datasets/johnadamek/game-by-game-mlb-batter-data-20172020
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 5, 2022
Dataset provided by
Kaggle
Authors
John Adamek
Description
Content

This dataset utilized raw data from Advanced Sports Analytics (https://www.advancedsportsanalytics.com/).

This is a great website that provides raw MLB game data for every game. It is quite messy and requires a quite a bit cleaning but the data is worth it! Batting, Pitching, and play by play data was exported into csv files for the 2017-2020 seasons. R script is provided

Columns

Key Column information:

Batting Order = Where the player batted in the lineup for that given day Position = The position they played for that game Pit = Total amount of pitches they saw over the course of the game Str = Total amount of strikes they saw over the course of the game Team.R = Total runs scored by the batters team in the game Team.H = Total hits by the batters team in the game Opponent.R = Total runs scored by the opposing team in the game Opponent.H = Total hits by the opposing team in the game X1b.Ump = First base umpire for the game X2b.Ump = Second base umpire for the game X3b.Ump = Third base umpire for the game HP.Ump = Home Plate umpire for the game Date = Date of the game Game.Time = Game time H.A = Home or Away Precipitation = yes/no Sky = Whether it was sunny, cloudy, overcast, rain, drizzle, night, or in dome Stadium = Stadium played in Temperature = Temperature at game time Weather = Character combining temperature, wind speed, wind direction, and stadium/sky ** Wind.Direction** = Direction of the wind speed Wind.Speed = Wind speed in mph Starting.Pitcher = Starting pitcher Over.Under = Over/Under of the game Moneyline = The moneyline for the batters team Wagers = Amount of wagers placed on the game

UPDATE

Unfortunately, it seems like they no longer have this raw data available on their website so I will be uploading the raw data along with the cleaned files so that other's can manipulate the data anyway they like!

Facebook

Twitter

Click to copy link

Link copied

Cite

Bob Nau (2020). Daily website visitors (time series regression) [Dataset]. https://www.kaggle.com/bobnau/daily-website-visitors/code

Daily website visitors (time series regression)

Predict tomorrow's number of website visitors from 5 years of daily data

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Aug 20, 2020

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Bob Nau

Description

Context

This file contains 5 years of daily time series data for several measures of traffic on a statistical forecasting teaching notes website whose alias is statforecasting.com. The variables have complex seasonality that is keyed to the day of the week and to the academic calendar. The patterns you you see here are similar in principle to what you would see in other daily data with day-of-week and time-of-year effects. Some good exercises are to develop a 1-day-ahead forecasting model, a 7-day ahead forecasting model, and an entire-next-week forecasting model (i.e., next 7 days) for unique visitors.

Content

The variables are daily counts of page loads, unique visitors, first-time visitors, and returning visitors to an academic teaching notes website. There are 2167 rows of data spanning the date range from September 14, 2014, to August 19, 2020. A visit is defined as a stream of hits on one or more pages on the site on a given day by the same user, as identified by IP address. Multiple individuals with a shared IP address (e.g., in a computer lab) are considered as a single user, so real users may be undercounted to some extent. A visit is classified as "unique" if a hit from the same IP address has not come within the last 6 hours. Returning visitors are identified by cookies if those are accepted. All others are classified as first-time visitors, so the count of unique visitors is the sum of the counts of returning and first-time visitors by definition. The data was collected through a traffic monitoring service known as StatCounter.

Inspiration

This file and a number of other sample datasets can also be found on the website of RegressIt, a free Excel add-in for linear and logistic regression which I originally developed for use in the course whose website generated the traffic data given here. If you use Excel to some extent as well as Python or R, you might want to try it out on this dataset.

Clear search

Close search

Google apps

Main menu

Daily website visitors (time series regression)

Context

Content

Inspiration

Website Analytics

Google Analytics Sample

Open Data Website Traffic

Google Analytics Sample

Context

Content

Acknowledgements

Inspiration

Exhibit of Datasets

OGD Portal: Daily usage by record (since January 2024)

Website Metrics

FOI-01782 - Datasets - Open Data Portal

E-commerce - Users of a French C2C fashion store

Foreword

Context

Content

Acknowledgements

Inspiration

License

‘Spotify Past Decades Songs Attributes’ analyzed by Analyst-2

Context

Content

Acknowledgements

Alexa Domains Dataset

The Items Dataset

PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music...

Dataset for: The Evolution of the Manosphere Across the Web

Daily web access to the Open Data portal | gimi9.com

Basketball Dataset

Data from: ADVERPred–Web Service for Prediction of Adverse Effects of Drugs

Monthly web access to the Open Data portal

Game by Game MLB Batter Data (2017-2020)

Content

Columns

UPDATE

Daily website visitors (time series regression)

Predict tomorrow's number of website visitors from 5 years of daily data

Context

Content

Inspiration