https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive overview of online sales transactions across different product categories. Each row represents a single transaction with detailed information such as the order ID, date, category, product name, quantity sold, unit price, total price, region, and payment method.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset used for the online stats training website (https://www.rensvandeschoot.com/tutorials/) and is based on the data used by van de Schoot, van der Velden, Boom, and Brugman (2010).
The dataset is based on a study that investigates the association between popularity status and antisocial behavior in at-risk adolescents (n = 1491), with gender and ethnic background as moderators of that association. The study distinguished subgroups within the popular status group in terms of overt and covert antisocial behavior. For more information on the sample, instruments, methodology, and research context, we refer interested readers to van de Schoot, van der Velden, Boom, and Brugman (2010).
Variable name Description
Respnr = Respondents’ number
Dutch = Respondents’ ethnic background (0 = Dutch origin, 1 = non-Dutch origin)
gender = Respondents’ gender (0 = boys, 1 = girls)
sd = Adolescents’ socially desirable answering patterns
covert = Covert antisocial behavior
overt = Overt antisocial behavior
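A quick way to explore the moderation structure described above is to compare group means. This is a minimal sketch; the file name is a placeholder, since the listing does not name the data file, and it assumes the variables are stored as listed.

```python
import pandas as pd

# "popular.csv" is a hypothetical file name for this dataset.
df = pd.read_csv("popular.csv")

# Compare mean overt and covert antisocial behavior across the two
# moderators described above: gender and ethnic background (Dutch).
print(df.groupby(["gender", "Dutch"])[["overt", "covert"]].mean())
```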
This dataset contains a list of 3654 Dutch websites that we considered the most popular websites in 2015. This list served as a whitelist for the Newstracker research project, in which we monitored the online web behaviour of a group of respondents. The research project 'The Newstracker' was a subproject of the NWO-funded project 'The New News Consumer: A User-Based Innovation Project to Meet Paradigmatic Change in News Use and Media Habits'. For the Newstracker project we aimed to understand the web behaviour of a group of respondents. We created custom-built software to monitor their web browsing behaviour on their laptops and desktops (the code is available in open access at https://github.com/NITechLabs/NewsTracker). For reasons of scale and privacy we created a whitelist of the websites that were the most popular in 2015. We manually compiled this list using data from DDMM, Alexa, and our own research. The dataset consists of 5 columns:
- URL: the website's URL
- Type of website: we created a list of website types and manually labeled each website with 1 category
- Nieuws-regio: when the category was 'News', we subdivided these websites by regional focus: International, National or Local
- Nieuws-onderwerp: each website under the category News was further subdivided by type of news website; for this we created our own list of news categories and manually coded each website
- Bron: for each website we noted which source we used to find it
The full description of the research design of the Newstracker, including the set-up of this whitelist, is included in the following article: Kleppe, M., Otte, M. (in print), 'Analysing & understanding news consumption patterns by tracking online user behaviour with a multimodal research design', Digital Scholarship in the Humanities, doi 10.1093/llc/fqx030.
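As an illustration of how the whitelist might be used, the sketch below filters the news websites with a national focus. The file name and exact column headers are assumptions; only the five fields are documented above.

```python
import pandas as pd

# Hypothetical file name and headers based on the column description.
whitelist = pd.read_csv("whitelist.csv",
                        names=["URL", "Type", "Nieuws-regio",
                               "Nieuws-onderwerp", "Bron"])

# Keep only news websites with a national regional focus.
news = whitelist[whitelist["Type"] == "News"]
print(news[news["Nieuws-regio"] == "National"]["URL"].head())
```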
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises 2,271 entries and provides insights into user interface (UI) and user experience (UX) preferences across various digital platforms. Key information includes user demographics (Name, Age, Gender) and platform preferences (e.g., Twitter, YouTube, Facebook, Website). It captures user experiences and satisfaction levels with various UI/UX elements such as color schemes, visual hierarchy, typography, multimedia usage, and layout design. The dataset also includes evaluations of mobile responsiveness, call-to-action buttons, form usability, feedback/error messages, loading speed, personalization, accessibility, and interactions (like scrolling behavior and gestures). Each UI/UX component is rated on a scale, allowing for quantitative analysis of user preferences and experiences, making this dataset valuable for research in user-centered design and usability optimization.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1,000 of the world's top websites. They have been automatically annotated to label the following classes:
* button - navigation links, tabs, etc.
* heading - text that was enclosed in <h1> to <h6> tags
* link - inline, textual <a> tags
* label - text labeling form fields
* text - all other text
* image - <img>, <svg>, or <video> tags, and icons
* iframe - ads and 3rd-party content
This is an example image and annotation from the dataset: https://i.imgur.com/mOG3u3Z.png (a Wikipedia screenshot).
Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.
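As a rough illustration, the sketch below counts annotations per class, assuming a COCO-format export (one of the export formats Roboflow offers); the annotation file name is a placeholder.

```python
import json
from collections import Counter

# "_annotations.coco.json" is the usual name in a Roboflow COCO export,
# but treat it as an assumption here.
with open("_annotations.coco.json") as f:
    coco = json.load(f)

# Map category IDs to names (button, heading, link, label, ...).
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
print(counts)
```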
Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless. Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.
https://creativecommons.org/publicdomain/zero/1.0/
Context
The data presented here was obtained on a Kali machine at the University of Cincinnati, Cincinnati, Ohio, by carrying out packet captures for 1 hour during the evening of Oct 9th, 2023, using Wireshark. The dataset consists of 394137 instances stored in a CSV (Comma Separated Values) file. This large dataset can be utilised for different machine learning applications, for instance classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
The dataset can be used for a variety of machine learning tasks, such as network intrusion detection, traffic classification, and anomaly detection.
Content:
This network traffic dataset consists of 7 features. Each instance contains the source and destination IP addresses. The majority of the properties are numeric in nature, but there are also nominal and date kinds due to the Timestamp.
The network traffic flow statistics (No. Time Source Destination Protocol Length Info) were obtained using Wireshark (https://www.wireshark.org/).
Dataset Columns:
- No: number of the instance
- Timestamp: timestamp of the instance of network traffic
- Source IP: IP address of the source
- Destination IP: IP address of the destination
- Protocol: protocol used by the instance
- Length: length of the instance
- Info: information about the traffic instance
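For example, a first exploratory pass over the capture could aggregate traffic volume per protocol; this is a minimal sketch with a placeholder file name, using the columns listed above.

```python
import pandas as pd

# "network_traffic.csv" is a placeholder for the released CSV file.
df = pd.read_csv("network_traffic.csv")

# Packet counts and byte volume per protocol: a typical first step
# before training a traffic classification model.
summary = df.groupby("Protocol")["Length"].agg(["count", "sum", "mean"])
print(summary.sort_values("sum", ascending=False))
```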
Acknowledgements:
I would like to thank the University of Cincinnati for providing the infrastructure for generating this network traffic dataset.
Ravikumar Gattu, Susmitha Choppadandi
Inspiration: This dataset goes beyond the majority of network traffic classification datasets, which only identify the type of application (WWW, DNS, ICMP, ARP, RARP) that an IP flow contains. Instead, it supports building machine learning models that can identify specific applications (like TikTok, Wikipedia, Instagram, YouTube, websites, blogs, etc.) from IP flow statistics (there are currently 25 applications in total).
**Dataset License:** CC0: Public Domain
Dataset Usages: This dataset can be used for different machine learning applications in the field of cybersecurity, such as classification of network traffic, network performance monitoring, network security management, network traffic management, network intrusion detection, and anomaly detection.
ML techniques that benefit from this dataset:
This dataset is highly useful because it consists of 394137 instances of network traffic data obtained by using the 25 applications on public, private, and enterprise networks. The dataset also contains very important features that can be used for most applications of machine learning in cybersecurity. A few of the potential machine learning applications that could benefit from this dataset are:
1. Network Performance Monitoring: This large network traffic dataset can be utilised for analysing the traffic to identify patterns in the network. This helps in designing network security algorithms that minimise network problems.
2. Anomaly Detection: A large network traffic dataset can be utilised for training machine learning models to find irregularities in the traffic, which could help identify cyber attacks.
3. Network Intrusion Detection: This large dataset could be utilised for training machine learning algorithms and designing models for the detection of traffic issues, malicious traffic, network attacks, and DoS attacks as well.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are publishing a dataset we created for HTTPS traffic classification.
Since the data were captured mainly in a real backbone network, we omitted IP addresses and ports. The datasets consist of features calculated from bidirectional flows exported with the flow probe ipfixprobe. This exporter can export a sequence of packet lengths and times and a sequence of packet bursts and times. For more information, please visit the ipfixprobe repository.
During our research, we divided HTTPS traffic into the following categories: L -- Live Video Streaming, P -- Video Player, M -- Music Player, U -- File Upload, D -- File Download, and W -- Website and other traffic.
We chose service representatives known for particular traffic types based on the Alexa Top 1M list and Moz's list of the 500 most popular websites for each category. We also used several popular websites that primarily focus on the audience in our country. The identified traffic classes and their representatives are provided below:
- Live Video Stream: Twitch, Czech TV, YouTube Live
- Video Player: DailyMotion, Stream.cz, Vimeo, YouTube
- Music Player: AppleMusic, Spotify, SoundCloud
- File Upload/Download: FileSender, OwnCloud, OneDrive, Google Drive
- Website and Other Traffic: websites from the Alexa Top 1M list
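To give an idea of how the exported packet-length sequences can feed a classifier, here is a minimal feature-extraction sketch. How the sequences are encoded in the released files is not specified above, so the input format here is an assumption.

```python
import numpy as np

def flow_features(packet_lengths):
    """Summary statistics over one flow's packet-length sequence,
    the kind of per-flow features used for HTTPS traffic classification."""
    arr = np.asarray(packet_lengths, dtype=float)
    return {
        "num_packets": int(arr.size),
        "total_bytes": float(arr.sum()),
        "mean_len": float(arr.mean()),
        "std_len": float(arr.std()),
        "max_len": float(arr.max()),
    }

# Example flow: a few full-size packets and some small control packets.
print(flow_features([1500, 1500, 66, 1500, 120]))
```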
Likes and image data from the community art website Behance. This is a small, anonymized version of a larger proprietary dataset.
Metadata includes:
- appreciates (likes)
- timestamps
- extracted image features
Basic Statistics:
Users: 63,497
Items: 178,788
Appreciates (likes): 1,000,000
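For recommender-system experiments, the appreciates can be loaded into a sparse user-item matrix. The file name and column order below are assumptions; the listing only states that the data contains likes with timestamps.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical file name and column layout for the appreciates data.
likes = pd.read_csv("behance_appreciates.csv",
                    names=["user_id", "item_id", "timestamp"])

# Binary implicit-feedback matrix: 1 where a user appreciated an item.
users = likes["user_id"].astype("category")
items = likes["item_id"].astype("category")
matrix = csr_matrix((np.ones(len(likes)),
                     (users.cat.codes, items.cat.codes)))
print(matrix.shape)  # should be roughly (63497, 178788)
```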
https://brightdata.com/license
Use our G2 dataset to collect product descriptions, ratings, reviews, and pricing information from the world's largest tech marketplace. You may purchase a full or partial dataset depending on your business needs. The G2 Software Products Dataset, with a focus on top-rated products, serves as a valuable resource for software buyers, businesses, and technology enthusiasts. This use case highlights products that have received exceptional ratings and positive reviews on the G2 platform, offering insights into customer satisfaction and popularity. For software buyers, this dataset acts as a trusted guide, presenting a curated selection of G2's top-rated software products, ensuring a higher likelihood of satisfaction with purchases. Businesses and technology professionals can leverage this dataset to identify popular and well-reviewed software solutions, optimizing their decision-making process. This use case emphasizes the dataset's utility for those specifically interested in exploring and acquiring top-rated software products from G2.
Product Overview
The G2 software products and reviews dataset offers a detailed and thorough overview of leading software companies. The dataset includes all major data points:
- Product descriptions
- Average rating (1-5)
- Seller's number of reviews
- Key features (highest and lowest rated)
- Competitors
- Website & social media links
- and more
Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.
The Portuguese footballer is the most-followed person on the photo-sharing platform, with 628 million followers. Instagram's own account was ranked first, with roughly 672 million followers.
How popular is Instagram?
Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States and experts project this figure to surpass 127 million users in 2023.
Who uses Instagram?
Instagram audiences are predominantly young – recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media for teens and one of the social networks with the biggest reach among teens in the United States.
Celebrity influencers on Instagram
Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
The global number of Facebook users was forecast to increase continuously between 2023 and 2027 by a total of 391 million users (+14.36 percent). After a fourth consecutive year of increases, the Facebook user base is estimated to reach 3.1 billion users, a new peak, in 2027. Notably, the number of Facebook users has been continuously increasing over the past years. User figures, shown here regarding the platform Facebook, have been estimated by taking into account company filings or press material, secondary research, app downloads, and traffic data. They refer to the average monthly active users over the period and count multiple accounts by persons only once. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic, and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations, and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is available on Brisbane City Council’s open data website – data.brisbane.qld.gov.au. The site provides additional features for viewing and interacting with the data and for downloading the data in various formats.
Monthly analytics reports for the Brisbane City Council website
Information regarding the sessions for Brisbane City Council website during the month including page views and unique page views.
https://brightdata.com/license
Utilize our machine learning datasets to develop and validate your models. Our datasets are designed to support a variety of machine learning applications, from image recognition to natural language processing and recommendation systems. You can access a comprehensive dataset or tailor a subset to fit your specific requirements, using data from a combination of various sources and websites, including custom ones. Popular use cases include model training and validation, where the dataset can be used to ensure robust performance across different applications. Additionally, the dataset helps in algorithm benchmarking by providing extensive data to test and compare various machine learning algorithms, identifying the most effective ones for tasks such as fraud detection, sentiment analysis, and predictive maintenance. Furthermore, it supports feature engineering by allowing you to uncover significant data attributes, enhancing the predictive accuracy of your machine learning models for applications like customer segmentation, personalized marketing, and financial forecasting.
Facebook received 73,390 user data requests from federal agencies and courts in the United States during the second half of 2023. The social network produced some user data in 88.84 percent of requests from U.S. federal authorities. The United States accounts for the largest share of Facebook user data requests worldwide.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset comprises information on top-rated anime and manga sourced from the popular website MyAnimeList.
- anime.csv: Contains information about top-rated anime series.
- manga.csv: Contains information about top-rated manga series.
Columns in anime.csv:
- Title: title of the anime (in both Japanese and English)
- Score: score given by users (out of 10)
- Votes: number of user votes for the anime
- Ranked: rank of the anime based on score
- Popularity: popularity rank
- Episodes: number of episodes
- Status: current airing status (e.g., Finished Airing)
- Aired: airing period
- Premiered: premiere season and year
- Producers: production companies
- Licensors: licensing companies
- Studios: animation studios
- Source: source material (e.g., Manga)
- Duration: duration per episode
- Rating: age rating
Columns in manga.csv:
- Title: title of the manga (in both Japanese and English)
- Score: score given by users (out of 10)
- Votes: number of user votes for the manga
- Ranked: rank of the manga based on score
- Popularity: popularity rank
- Members: number of community members who have added the manga to their lists
- Favorites: number of community members who favorited the manga
- Volumes: number of volumes
- Chapters: number of chapters
- Status: publishing status (e.g., Finished)
- Published: publication period
- Genres: genres of the manga
- Themes: themes explored in the manga
- Demographics: target demographic (e.g., Shounen)
- Serialization: manga serialization information (e.g., Shounen Jump)
- Authors: authors of the manga
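A minimal sketch for exploring the files, using the column names listed above (the exact header spelling in the CSVs is an assumption):

```python
import pandas as pd

anime = pd.read_csv("anime.csv")

# Ten highest-scoring anime by user score.
top = anime.sort_values("Score", ascending=False)[["Title", "Score"]]
print(top.head(10))
```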
The dataset was scraped from MyAnimeList in January 2024.
This dataset is released under the Creative Commons Zero v1.0 Universal license. You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
Please give an upvote 👍️ if you found this dataset useful! Thank you and enjoy! 🌟
https://brightdata.com/licensehttps://brightdata.com/license
Unlock the full potential of LinkedIn data with our extensive dataset that combines profiles, company information, and job listings into one powerful resource for business decision-making, strategic hiring, competitive analysis, and market trend insights. This all-encompassing dataset is ideal for professionals, recruiters, analysts, and marketers aiming to enhance their strategies and operations across various business functions.
Dataset Features
- Profiles: Dive into detailed public profiles featuring names, titles, positions, experience, education, skills, and more. Utilize this data for talent sourcing, lead generation, and investment signaling, with a refresh rate ensuring up to 30 million records per month.
- Companies: Access comprehensive company data including ID, country, industry, size, number of followers, website details, subsidiaries, and posts. Tailored subsets by industry or region provide invaluable insights for CRM enrichment, competitive intelligence, and understanding the startup ecosystem, updated monthly with up to 40 million records.
- Job Listings: Explore current job opportunities detailed with job titles, company names, locations, and employment specifics such as seniority levels and employment functions. This dataset includes direct application links and real-time application numbers, serving as a crucial tool for job seekers and analysts looking to understand industry trends and job market dynamics.
Customizable Subsets for Specific Needs
Our LinkedIn dataset offers the flexibility to tailor the dataset according to your specific business requirements. Whether you need comprehensive insights across all data points or are focused on specific segments like job listings, company profiles, or individual professional details, we can customize the dataset to match your needs. This modular approach ensures that you get only the data that is most relevant to your objectives, maximizing efficiency and relevance in your strategic applications.
Popular Use Cases
- Strategic Hiring and Recruiting: Track talent movement, identify growth opportunities, and enhance your recruiting efforts with targeted data.
- Market Analysis and Competitive Intelligence: Gain a competitive edge by analyzing company growth, industry trends, and strategic opportunities.
- Lead Generation and CRM Enrichment: Enrich your database with up-to-date company and professional data for targeted marketing and sales strategies.
- Job Market Insights and Trends: Leverage detailed job listings for a nuanced understanding of employment trends and opportunities, facilitating effective job matching and market analysis.
- AI-Driven Predictive Analytics: Utilize AI algorithms to analyze large datasets for predicting industry shifts, optimizing business operations, and enhancing decision-making processes based on actionable data insights.
Whether you are mapping out competitive landscapes, sourcing new talent, or analyzing job market trends, our LinkedIn dataset provides the tools you need to succeed. Customize your access to fit specific needs, ensuring that you have the most relevant and timely data at your fingertips.
How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as lesser developed digital markets catch up with other regions
when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America spend the highest average time per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets of 50+ requirements each, expressed as user stories.
The dataset has been created by gathering data from web sources, and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator exercised the utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removal of that dataset [see Zenodo's policies].
The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt
(2018) originates from early data in the Federal Spending Transparency project, which pertain to the website that is used to share publicly the spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS (DATA Act Information Model Schema), or Data Broker. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt
(2018) is a set of requirements extracted from a document, by Loudoun County, Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt
(2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and is at the basis of a students' project on website design; the code is available (no license).
g05-openspending.txt
(2018) is about the OpenSpending project (www), a project of the Open Knowledge Foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with an unknown license.
g11-nsf.txt
(2018) is a collection of user stories for the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt
(2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and on the web. The specific set of user stories was collected in 2016 by GitHub user @danfowler and is stored in a Trello board.
g14-datahub.txt
(2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt
(2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt
(2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt
(2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt
(2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the project's GitHub page.
g23-archivesspace.txt
(2012-2013) refers to ArchivesSpace: an open source web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and born-digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source.
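Since all files in the collection express requirements as user stories, a small parser for the usual "As a ..., I want ..., so that ..." template can structure them; this sketch assumes one story per line, which may not hold for every file.

```python
import re

# Role / goal / benefit split for the standard user-story template.
pattern = re.compile(
    r"As an? (?P<role>.+?),\s*I want (?P<goal>.+?)"
    r"(?:,?\s*so that (?P<benefit>.+))?$",
    re.IGNORECASE,
)

with open("g02-federalspending.txt") as f:  # any file from the collection
    for line in f:
        m = pattern.match(line.strip())
        if m:
            print(m.group("role"), "->", m.group("goal"))
```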
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
This dataset supplements publication "Multilingual Scraper of Privacy Policies and Terms of Service" at ACM CSLAW’25, March 25–27, 2025, München, Germany. It includes the first 12 months of scraped policies and terms from about 800k websites, see concrete numbers below.
The following table lists the amount of websites visited per month:
Month | Number of websites |
---|---|
2024-01 | 551'148 |
2024-02 | 792'921 |
2024-03 | 844'537 |
2024-04 | 802'169 |
2024-05 | 805'878 |
2024-06 | 809'518 |
2024-07 | 811'418 |
2024-08 | 813'534 |
2024-09 | 814'321 |
2024-10 | 817'586 |
2024-11 | 828'662 |
2024-12 | 827'101 |
The number of websites visited should always be higher than the number of jobs (Table 1 of the paper), as a website may redirect, resulting in two websites being scraped, or it may have to be retried.
To simplify access, we release the data in large CSVs: one file for policies and another for terms per month. All of these files contain all metadata usable for the analysis. If your favourite CSV parser reports the same numbers as above, then our dataset is correctly parsed. We use ',' as a separator, the first row is the heading, and strings are in quotes.
Our scraper sometimes collects documents other than policies and terms (for how often this happens, see the evaluation in Sec. 4 of the publication), and these may contain personal data, such as addresses of website authors who maintain them only for a selected audience. We therefore decided to reduce the risks for websites by anonymizing the data using Presidio. Presidio substitutes personal data with tokens. If your personal data has not been effectively anonymized from the database and you wish for it to be deleted, please contact us.
The uncompressed dataset is about 125 GB in size, so you will need sufficient storage. This also means that you likely cannot process all the data at once in memory, so we split the data by month and into separate files for policies and terms.
The files have the following names:
Both files contain the following metadata columns:
- website_month_id - identification of the crawled website
- job_id - one website can have multiple jobs in case of redirects (but most commonly has only one)
- website_index_status - network state of loading the index page, as resolved by the Chrome DevTools Protocol:
  - DNS_ERROR - domain cannot be resolved
  - OK - all fine
  - REDIRECT - domain redirects to somewhere else
  - TIMEOUT - the request timed out
  - BAD_CONTENT_TYPE - 415 Unsupported Media Type
  - HTTP_ERROR - 404 error
  - TCP_ERROR - error in the network connection
  - UNKNOWN_ERROR - unknown error
- website_lang - language of the index page, detected with the langdetect library
- website_url - the URL of the website sampled from the CrUX list (may contain subdomains, etc.). Use this as a unique identifier for connecting data between months.
- job_domain_status - indicates the status of loading the index page. Can be:
  - OK - all works well (at the moment, should be all entries)
  - BLACKLISTED - URL is on our list of blocked URLs
  - UNSAFE - website is not safe according to the Safe Browsing API by Google
  - LOCATION_BLOCKED - country is in the list of blocked countries
- job_started_at - when the visit of the website was started
- job_ended_at - when the visit of the website was ended
- job_crux_popularity - JSON with all popularity ranks of the website this month
- job_index_redirect - when we detect that the domain redirects us, we stop the crawl and create a new job with the target URL. This saves time if many websites redirect to one target, as it will be crawled only once. The index_redirect is then the job.id corresponding to the redirect target.
- job_num_starts - number of crawlers that started this job (counts restarts in case of an unsuccessful crawl; max is 3)
- job_from_static - whether this job was included in the static selection (see Sec. 3.3 of the paper)
- job_from_dynamic - whether this job was included in the dynamic selection (see Sec. 3.3 of the paper); this is not exclusive with job_from_static - both can be true when the lists overlap
- job_crawl_name - our name of the crawl, contains year and month (e.g., 'regular-2024-12' for the regular crawl in Dec 2024)
- policy_url_id - ID of the URL this policy has
- policy_keyword_score - score (higher is better) according to the crawler's keyword list that the given document is a policy
- policy_ml_probability - probability assigned by the BERT model that the given document is a policy
- policy_consideration_basis - the basis on which we decided that this URL is a policy; the following three options are executed by the crawler in this order:
- policy_url - full URL to the policy
- policy_content_hash - used as an identifier; if the document remained the same between crawls, it won't create a new entry
- policy_content - the text of policies and terms extracted to Markdown using Mozilla's readability library
- policy_lang - language of the content, detected by fasttext

The terms columns are analogous to the policy columns; just substitute policy with terms.
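Given the size of the monthly files, reading them in chunks keeps memory bounded. The file name below is a placeholder, since the exact names are not listed here; the metadata columns follow the listing above.

```python
import pandas as pd

lang_counts = {}
# Stream one monthly policies file in chunks instead of loading
# the whole multi-GB CSV at once.
for chunk in pd.read_csv("policies-2024-01.csv", sep=",",
                         quotechar='"', chunksize=100_000):
    for lang, n in chunk["policy_lang"].value_counts().items():
        lang_counts[lang] = lang_counts.get(lang, 0) + int(n)

# Ten most common policy languages in that month.
print(sorted(lang_counts.items(), key=lambda kv: -kv[1])[:10])
```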
Check this Google Doc for an updated version of this README.md.
http://opendatacommons.org/licenses/dbcl/1.0/
The Flights Booking Dataset of various airlines was scraped date-wise from a well-known travel website in a structured format. The dataset contains records of flight travel details between cities in India. Multiple features are present, such as source & destination city, arrival & departure time, duration, and price of the flight.
This data is available as a CSV file. We are going to analyze this dataset using the Pandas DataFrame.
This analysis will be helpful for those working in the airlines and travel domain.
Using this dataset, we answered multiple questions with Python in our Project.
Q.1. What are the airlines in the dataset, accompanied by their frequencies?
Q.2. Show Bar Graphs representing the Departure Time & Arrival Time.
Q.3. Show Bar Graphs representing the Source City & Destination City.
Q.4. Does price vary with airline?
Q.5. Does ticket price change based on the departure time and arrival time?
Q.6. How does the price change with the source and destination?
Q.7. How is the price affected when tickets are bought in just 1 or 2 days before departure?
Q.8. How does the ticket price vary between Economy and Business class?
Q.9. What will be the average price of a Vistara flight from Delhi to Hyderabad in Business class?
These are the main features/columns available in the dataset:
1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.
3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.
8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10) Days Left: This is a derived characteristic calculated by subtracting the booking date from the trip date.
11) Price: Target variable stores information of the ticket price.
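As an example, Q.9 can be answered with a simple filtered mean; the file name and exact header spellings are assumptions based on the feature list above.

```python
import pandas as pd

df = pd.read_csv("flights.csv")  # placeholder file name

# Q.9: average Business-class fare for Vistara from Delhi to Hyderabad.
mask = (
    (df["Airline"] == "Vistara")
    & (df["Source City"] == "Delhi")
    & (df["Destination City"] == "Hyderabad")
    & (df["Class"] == "Business")
)
print(df.loc[mask, "Price"].mean())
```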