100+ datasets found
  1. Global Web Data | Web Scraping Data | Job Postings Data | Source: Company Website | 232M+ Records

    • datarade.ai
    .json
    + more versions
    Cite
    PredictLeads, Global Web Data | Web Scraping Data | Job Postings Data | Source: Company Website | 232M+ Records [Dataset]. https://datarade.ai/data-products/predictleads-web-data-web-scraping-data-job-postings-dat-predictleads
    Explore at:
    .json
    Available download formats
    Dataset authored and provided by
    PredictLeads
    Area covered
    Bosnia and Herzegovina, French Guiana, Kuwait, El Salvador, Virgin Islands (British), Northern Mariana Islands, Comoros, Guadeloupe, Bonaire, Kosovo
    Description

    PredictLeads Job Openings Data provides high-quality hiring insights sourced directly from company websites - not job boards. Using advanced web scraping technology, our dataset offers real-time access to job trends, salaries, and skills demand, making it a valuable resource for B2B sales, recruiting, investment analysis, and competitive intelligence.

    Key Features:

    ✅ 232M+ Job Postings Tracked – Data sourced from 92 million company websites worldwide.
    ✅ 7.1M+ Active Job Openings – Updated in real time to reflect hiring demand.
    ✅ Salary & Compensation Insights – Extract salary ranges, contract types, and job seniority levels.
    ✅ Technology & Skill Tracking – Identify emerging tech trends and industry demands.
    ✅ Company Data Enrichment – Link job postings to employer domains, firmographics, and growth signals.
    ✅ Web Scraping Precision – Directly sourced from employer websites for unmatched accuracy.

    Primary Attributes:

    • id (string, UUID) – Unique identifier for the job posting.
    • type (string, constant: "job_opening") – Object type.
    • title (string) – Job title.
    • description (string) – Full job description, extracted from the job listing.
    • url (string, URL) – Direct link to the job posting.
    • first_seen_at (timestamp) – Timestamp when the job was first detected.
    • last_seen_at (timestamp) – Timestamp when the job was last detected.
    • last_processed_at (timestamp) – Timestamp when the job data was last processed.

    Job Metadata:

    • contract_types (array of strings) – Type of employment (e.g., "full time", "part time", "contract").
    • categories (array of strings) – Job categories (e.g., "engineering", "marketing").
    • seniority (string) – Seniority level of the job (e.g., "manager", "non_manager").
    • status (string) – Job status (e.g., "open", "closed").
    • language (string) – Language of the job posting.
    • location (string) – Full location details as listed in the job description.
    • location_data (array of objects) – Structured location details, one object per location:
    • city (string, nullable) – City where the job is located.
    • state (string, nullable) – State or region of the job location.
    • zip_code (string, nullable) – Postal/ZIP code.
    • country (string, nullable) – Country where the job is located.
    • region (string, nullable) – Broader geographical region.
    • continent (string, nullable) – Continent name.
    • fuzzy_match (boolean) – Indicates whether the location was inferred.

    Salary Data (salary_data)

    • salary (string) – Salary range extracted from the job listing.
    • salary_low (float, nullable) – Minimum salary in original currency.
    • salary_high (float, nullable) – Maximum salary in original currency.
    • salary_currency (string, nullable) – Currency of the salary (e.g., "USD", "EUR").
    • salary_low_usd (float, nullable) – Converted minimum salary in USD.
    • salary_high_usd (float, nullable) – Converted maximum salary in USD.
    • salary_time_unit (string, nullable) – Time unit for the salary (e.g., "year", "month", "hour").

    Occupational Data (onet_data) (object, nullable)

    • code (string, nullable) – ONET occupation code.
    • family (string, nullable) – Broad occupational family (e.g., "Computer and Mathematical").
    • occupation_name (string, nullable) – Official ONET occupation title.

    Additional Attributes:

    • tags (array of strings, nullable) – Extracted skills and keywords (e.g., "Python", "JavaScript").
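To make the attributes above concrete, here is a minimal Python sketch of handling a record shaped like this schema; the values and the helper function are hypothetical illustrations, not part of the PredictLeads product.

```python
# Hypothetical record shaped like the job_opening schema above;
# all values are invented for illustration.
job = {
    "id": "4f8c1d2e-0000-0000-0000-000000000000",
    "type": "job_opening",
    "title": "Senior Data Engineer",
    "contract_types": ["full time"],
    "seniority": "non_manager",
    "status": "open",
    "salary_data": {
        "salary_low_usd": 120000.0,
        "salary_high_usd": 160000.0,
        "salary_time_unit": "year",
    },
    "tags": ["Python", "Airflow"],
}

def salary_midpoint_usd(record):
    """Return the midpoint of the USD salary range, or None if unavailable."""
    s = record.get("salary_data") or {}
    low, high = s.get("salary_low_usd"), s.get("salary_high_usd")
    if low is None or high is None:
        return None
    return (low + high) / 2

print(salary_midpoint_usd(job))
```

Since the salary fields are nullable in the schema, the helper returns None rather than raising when a range is missing.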

    📌 Trusted by enterprises, recruiters, and investors for high-precision job market insights.

    PredictLeads Dataset: https://docs.predictleads.com/v3/guide/job_openings_dataset

  2. Data from: Essential-Web v1.0: 24T tokens of organized web data

    • registry.opendata.aws
    Updated Sep 18, 2025
    Cite
    EssentialAI (2025). Essential-Web v1.0: 24T tokens of organized web data [Dataset]. https://registry.opendata.aws/eai-essential-web-v1/
    Explore at:
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    EssentialAI (https://www.essential.ai)
    Description

    A 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality.

  3. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Explore at:
    .json, .xml, .csv, .xls
    Available download formats
    Dataset updated
    Feb 23, 2017
    Dataset authored and provided by
    Oxylabs
    Area covered
    Isle of Man, Northern Mariana Islands, Tunisia, Andorra, Nepal, British Indian Ocean Territory, Moldova (Republic of), Bangladesh, Taiwan, Canada
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence.
    • AngelList: Receive fresh startup data transformed into actionable insights.
    • CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.
    • Craft.co: Make data-informed business decisions with Craft.co's company datasets.
    • Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.
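As a minimal sketch, a delivered CSV row with the fields listed above could be loaded with the standard library; the header names and values here are hypothetical, not Oxylabs' actual schema.

```python
import csv
import io

# Hypothetical CSV snippet using the fields listed above;
# header and values are invented for illustration.
raw = """company_name,size,founding_date,location,industry,revenue,employee_count,competitors
Acme Corp,51-200,2012,Berlin,Software,10000000,120,"Globex;Initech"
"""

rows = list(csv.DictReader(io.StringIO(raw)))
company = rows[0]
# Split the multi-valued competitors field into a list.
company["competitors"] = company["competitors"].split(";")
print(company["company_name"], company["competitors"])
```

The same rows could equally arrive as JSON; the point is only that each record carries one value per field above, with multi-valued fields like competitors needing a split step.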

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  4. Google SERP Data, Web Search Data, Google Images Data | Real-Time API

    • datarade.ai
    .json, .csv
    Cite
    OpenWeb Ninja, Google SERP Data, Web Search Data, Google Images Data | Real-Time API [Dataset]. https://datarade.ai/data-products/openweb-ninja-google-data-google-image-data-google-serp-d-openweb-ninja
    Explore at:
    .json, .csv
    Available download formats
    Dataset authored and provided by
    OpenWeb Ninja
    Area covered
    Uganda, Burundi, Barbados, South Georgia and the South Sandwich Islands, Panama, Grenada, Tokelau, Ireland, Virgin Islands (U.S.), Uruguay
    Description

    OpenWeb Ninja's Google Images Data (Google SERP Data) API provides real-time image search capabilities for images sourced from all public sources on the web.

    The API enables you to search and access more than 100 billion images from across the web including advanced filtering capabilities as supported by Google Advanced Image Search. The API provides Google Images Data (Google SERP Data) including details such as image URL, title, size information, thumbnail, source information, and more data points. The API supports advanced filtering and options such as file type, image color, usage rights, creation time, and more. In addition, any Advanced Google Search operators can be used with the API.

    OpenWeb Ninja's Google Images Data & Google SERP Data API common use cases:

    • Creative Media Production: Enhance digital content with a vast array of real-time images, ensuring engaging and brand-aligned visuals for blogs, social media, and advertising.

    • AI Model Enhancement: Train and refine AI models with diverse, annotated images, improving object recognition and image classification accuracy.

    • Trend Analysis: Identify emerging market trends and consumer preferences through real-time visual data, enabling proactive business decisions.

    • Innovative Product Design: Inspire product innovation by exploring current design trends and competitor products, ensuring market-relevant offerings.

    • Advanced Search Optimization: Improve search engines and applications with enriched image datasets, providing users with accurate, relevant, and visually appealing search results.

    OpenWeb Ninja's Annotated Imagery Data & Google SERP Data Stats & Capabilities:

    • 100B+ Images: Access an extensive database of over 100 billion images.

    • Images Data from all Public Sources (Google SERP Data): Benefit from a comprehensive aggregation of image data from various public websites, ensuring a wide range of sources and perspectives.

    • Extensive Search and Filtering Capabilities: Utilize advanced search operators and filters to refine image searches by file type, color, usage rights, creation time, and more, making it easy to find exactly what you need.

    • Rich Data Points: Each image comes with more than 10 data points, including URL, title (annotation), size information, thumbnail, and source information, providing a detailed context for each image.
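To illustrate the per-image data points described above, here is a sketch of filtering result records by resolution; the field names and records are assumptions for illustration, not the API's actual response schema.

```python
# Hypothetical image-result records carrying the data points described
# above (URL, title, size, source); the field names are assumed.
results = [
    {"url": "https://example.com/a.png", "title": "red sneakers",
     "width": 1920, "height": 1080, "source": "example.com"},
    {"url": "https://example.com/b.jpg", "title": "red sneakers studio",
     "width": 320, "height": 240, "source": "example.com"},
]

def at_least(records, min_width, min_height):
    """Keep only results meeting a minimum resolution."""
    return [r for r in records
            if r["width"] >= min_width and r["height"] >= min_height]

large = at_least(results, 1280, 720)
print(len(large))
```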

  5. Web Analytics Dataset

    • kaggle.com
    zip
    Updated Oct 12, 2020
    Cite
    Merve Afranur ARTAR (2020). Web Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/afranur/web-analytics-dataset
    Explore at:
    zip (7376 bytes)
    Available download formats
    Dataset updated
    Oct 12, 2020
    Authors
    Merve Afranur ARTAR
    Description

    Dataset

    This dataset was created by Merve Afranur ARTAR

    Contents

  6. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and runs on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  7. Dataset for Privacy Exercises

    • kaggle.com
    zip
    Updated Apr 9, 2024
    Cite
    Shining (2024). Dataset for Privacy Exercises [Dataset]. https://www.kaggle.com/datasets/shiningana/dataset-for-privacy-exercises
    Explore at:
    zip (7327312 bytes)
    Available download formats
    Dataset updated
    Apr 9, 2024
    Authors
    Shining
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset gives some data of a hypothetical business that can be used to practice your privacy data transformation and analysis skills.

    The dataset contains the following files/tables:
    1. customer_orders_for_privacy_exercises.csv – data of a business about customer orders (columns separated by commas)
    2. users_web_browsing_for_privacy_exercises.csv – data collected by the business website about its users (columns separated by commas)
    3. iot_example.csv – data collected by a smart device on users' biometric data (columns separated by commas)
    4. members.csv – data collected by a library on its users (columns separated by commas)

  8. NYC STEW-MAP Staten Island organizations' website hyperlink webscrape

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
    Explore at:
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    New York, Staten Island
    Description

    The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis.

    For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites and that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020).

    For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "README" file for further details.

    References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/.

    This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
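The standard edge-array layout (node1, node2, edge attribute) used for dataset 2 can be illustrated with a few toy rows; the organization names and edge attribute below are invented, not drawn from the data.

```python
from collections import Counter

# Toy edge array in the (node1, node2, edge attribute) layout described
# above; each row is one hyperlink from one organization's site to another.
edges = [
    ("org_a", "org_b", "hyperlink"),
    ("org_a", "org_c", "hyperlink"),
    ("org_b", "org_c", "hyperlink"),
]

# Out-degree: how many hyperlinks each organization's site sends out,
# a basic quantity in hyperlink network analysis.
out_degree = Counter(src for src, _dst, _attr in edges)
print(out_degree)
```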

  9. Data set of the article: Using Machine Learning for Web Page Classification in Search Engine Optimization

    • data.niaid.nih.gov
    Updated Jan 6, 2021
    Cite
    Matošević, Goran; Dobša, Jasminka; Mladenić, Dunja (2021). Data set of the article: Using Machine Learning for Web Page Classification in Search Engine Optimization [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4416122
    Explore at:
    Dataset updated
    Jan 6, 2021
    Dataset provided by
    Faculty of Organization and Informatics Varaždin, University of Zagreb, 10000 Zagreb, Croatia
    Faculty of Economics and Tourism, University of Pula, 52100 Pula, Croatia
    Jožef Stefan Institute, 1000 Ljubljana, Slovenia
    Authors
    Matošević, Goran; Dobša, Jasminka; Mladenić, Dunja
    License

    Attribution 4.0 (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data of investigation published in the article: "Using Machine Learning for Web Page Classification in Search Engine Optimization"

    Abstract of the article:

    This paper presents a novel approach of using machine learning algorithms based on experts’ knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations—classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classification of samples in the majority class (48.83%). Practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. Also, the results of this study contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to be taken into consideration when preparing a web page are page title, meta description, H1 tag (heading), and body text—which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.

  10. Google Analytics Sample

    • kaggle.com
    zip
    Updated Sep 19, 2019
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    zip (0 bytes)
    Available download formats
    Dataset updated
    Sep 19, 2019
    Dataset provided by
    Google (http://google.com/)
    BigQuery (https://cloud.google.com/bigquery)
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

    Content

    The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. It includes the following kinds of information:

    Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.

    Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.

    Transactional data: information about the transactions that occur on the Google Merchandise Store website.

    Fork this kernel to get started.

    Acknowledgements

    Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

    Banner Photo by Edho Pratama from Unsplash.

    Inspiration

    What is the total number of transactions generated per device browser in July 2017?

    The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

    What was the average number of product pageviews for users who made a purchase in July 2017?

    What was the average number of product pageviews for users who did not make a purchase in July 2017?

    What was the average total transactions per user that made a purchase in July 2017?

    What is the average amount of money spent per session in July 2017?

    What is the sequence of pages viewed?
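For instance, the "real bounce rate" question above can be sketched in plain Python; the session records and field names here are invented stand-ins for illustration, not the actual BigQuery export schema.

```python
# Toy session records standing in for GA 360 rows; the field names
# are assumptions for illustration only.
sessions = [
    {"traffic_source": "organic", "pageviews": 1},
    {"traffic_source": "organic", "pageviews": 5},
    {"traffic_source": "paid", "pageviews": 1},
    {"traffic_source": "paid", "pageviews": 1},
]

def bounce_rate(records, source):
    """Real bounce rate: percentage of visits with exactly one pageview."""
    visits = [s for s in records if s["traffic_source"] == source]
    bounces = [s for s in visits if s["pageviews"] == 1]
    return 100.0 * len(bounces) / len(visits)

print(bounce_rate(sessions, "organic"))
```

In the real dataset the same aggregation would be expressed as a query over the ga_sessions tables, grouped by traffic source.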

  11. Hydroinformatics Instruction Module Example Code: Programmatic Data Access with USGS Data Retrieval

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Mar 3, 2022
    Cite
    Amber Spackman Jones; Jeffery S. Horsburgh (2022). Hydroinformatics Instruction Module Example Code: Programmatic Data Access with USGS Data Retrieval [Dataset]. https://www.hydroshare.org/resource/a58b5d522d7f4ab08c15cd05f3fd2ad3
    Explore at:
    zip (34.5 KB)
    Available download formats
    Dataset updated
    Mar 3, 2022
    Dataset provided by
    HydroShare
    Authors
    Amber Spackman Jones; Jeffery S. Horsburgh
    License

    Attribution 4.0 (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This resource contains Jupyter Notebooks with examples for accessing USGS NWIS data via web services and performing subsequent analysis related to drought with particular focus on sites in Utah and the southwestern United States (could be modified to any USGS sites). The code uses the Python DataRetrieval package. The resource is part of set of materials for hydroinformatics and water data science instruction. Complete learning module materials are found in HydroLearn: Jones, A.S., Horsburgh, J.S., Bastidas Pacheco, C.J. (2022). Hydroinformatics and Water Data Science. HydroLearn. https://edx.hydrolearn.org/courses/course-v1:USU+CEE6110+2022/about.

    This resource consists of 6 example notebooks:
    1. Example 1: Import and plot daily flow data
    2. Example 2: Import and plot instantaneous flow data for multiple sites
    3. Example 3: Perform analyses with USGS annual statistics data
    4. Example 4: Retrieve data and find daily flow percentiles
    5. Example 5: Further examination of drought year flows
    6. Coding challenge: Assess drought severity
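The percentile step (Example 4) can be sketched with the Python standard library on synthetic values; in the notebooks themselves, real daily flow data would come from NWIS via the Python DataRetrieval package mentioned above.

```python
import statistics

# Synthetic daily discharge values standing in for an NWIS record;
# the numbers are invented for illustration.
daily_flows = [12.0, 15.5, 9.8, 30.2, 22.1, 18.4, 11.0, 25.6]

# statistics.quantiles with n=100 returns the 1st..99th percentile
# cut points of the distribution.
cuts = statistics.quantiles(daily_flows, n=100)
p10, p90 = cuts[9], cuts[89]
print(round(p10, 2), round(p90, 2))
```

Comparing a current day's flow against historical percentiles like p10 is a common way to flag drought conditions.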

  12. Chapter 14 Examples

    • data.world
    zip
    Updated Sep 21, 2025
    Cite
    Semantic Web for the Working Ontologist (2025). Chapter 14 Examples [Dataset]. https://data.world/swwo/chapter-14-examples
    Explore at:
    zip
    Available download formats
    Dataset updated
    Sep 21, 2025
    Authors
    Semantic Web for the Working Ontologist
    Description

    Example data for Chapter 14 of Semantic Web for the Working Ontologist

  13. DATAANT | Custom Data Extraction | Web Scraping Data | Dataset, API | Data Parsing and Processing | Worldwide

    • datarade.ai
    Cite
    Dataant, DATAANT | Custom Data Extraction | Web Scraping Data | Dataset, API | Data Parsing and Processing | Worldwide [Dataset]. https://datarade.ai/data-products/dataant-custom-data-extraction-web-scraping-data-datase-dataant
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt
    Available download formats
    Dataset authored and provided by
    Dataant
    Area covered
    Bulgaria, Israel, Morocco, Lithuania, Andorra, Vanuatu, Uruguay, Niger, Algeria, Yemen
    Description

    DATAANT provides the ability to extract data from any website using its web scraping service.

    Receive raw HTML data by triggering the API or request a custom dataset from any website.

    Use the received data for:
    - data analysis
    - data enrichment
    - data intelligence
    - data comparison

    The only two parameters needed to start a data extraction project:
    - data source (website URL)
    - attributes set for extraction

    All the data can be delivered using the following:
    - One-Time delivery
    - Scheduled updates delivery
    - DB access
    - API
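As a hedged sketch, the two required parameters could be represented as a simple job description; the payload shape below is a hypothetical illustration, not DATAANT's documented API.

```python
# Hypothetical extraction-job payload built from the two parameters
# described above; the shape is an assumption for illustration.
extraction_job = {
    "data_source": "https://example.com/products",     # website URL
    "attributes": ["title", "price", "availability"],  # fields to extract
}

def is_ready(job):
    """A job is ready once both parameters are present and non-empty."""
    return bool(job.get("data_source")) and bool(job.get("attributes"))

print(is_ready(extraction_job))
```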

    All the projects are highly customizable, so our team of data specialists can provide any data enrichment.

  14. Network Traffic Analysis: Data and Code

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jun 12, 2024
    Cite
    Moran, Madeline; Honig, Joshua; Ferrell, Nathan; Soni, Shreena; Homan, Sophia; Chan-Tin, Eric (2024). Network Traffic Analysis: Data and Code [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11479410
    Explore at:
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    Loyola University Chicago
    Authors
    Moran, Madeline; Honig, Joshua; Ferrell, Nathan; Soni, Shreena; Homan, Sophia; Chan-Tin, Eric
    License

    Attribution 4.0 (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code:

    Packet_Features_Generator.py & Features.py

    To run this code:

    pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j

    -h, --help    Show this help message and exit
    -i TXTFILE    Input text file
    -x X          Add first X number of total packets as features.
    -y Y          Add first Y number of negative packets as features.
    -z Z          Add first Z number of positive packets as features.
    -ml           Output to text file all websites in the format of websiteNumber1,feature1,feature2,...
    -s S          Generate samples using size s.
    -j

    Purpose:

    Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.

    Uses Features.py to calculate the features.

    startMachineLearning.sh & machineLearning.py

    To run this code:

    bash startMachineLearning.sh

    This code then runs machineLearning.py in a tmux session with the necessary file paths and flags.

    Options (to be edited within this file):

    --evaluate-only to test 5 fold cross validation accuracy

    --test-scaling-normalization to test 6 different combinations of scalers and normalizers

    Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use

    --grid-search to test the best grid search hyperparameters - note: the possible hyperparameters must be added to train_model under 'if not evaluateOnly:' - once best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'

    Purpose:

    Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a RandomForest Classifier on the provided data and provides results using cross validation. These results include the best scaling and normalization options for each data set as well as the best grid search hyperparameters based on the provided ranges.

    Data

    Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, different Google search queries (collected in the form of their autocomplete results and their results page), and different actions taken on a Virtual Reality headset.

    Data for this experiment was stored and analyzed in the form of a txt file for each experiment which contains:

    The first number is a classification number denoting which website, query, or VR action is taking place.

    The remaining numbers in each line denote the size of a packet and the direction it is traveling: negative numbers denote incoming packets; positive numbers denote outgoing packets.
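The line format described above (a class label followed by signed packet sizes) can be parsed with a few lines of Python; the sample line is invented, and the parser is a sketch rather than the repository's own code.

```python
def parse_trace_line(line):
    """Parse one line of the txt layout described above: a classification
    label followed by signed packet sizes (negative = incoming,
    positive = outgoing)."""
    nums = [int(tok) for tok in line.split()]
    label, sizes = nums[0], nums[1:]
    incoming = [s for s in sizes if s < 0]
    outgoing = [s for s in sizes if s > 0]
    return {"label": label, "incoming": incoming, "outgoing": outgoing}

sample = "3 -1500 520 -60 1500 -40"   # invented example line
features = parse_trace_line(sample)
print(features["label"], len(features["incoming"]), len(features["outgoing"]))
```

Counts of the first X total, Y negative, and Z positive packets, as exposed by the script flags above, can then be derived from these lists.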

    Figure 4 Data

    This data uses specific lines from the Virtual Reality.txt file.

    The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.

    The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.

    The .xlsx and .csv files are identical.

    Each file includes (from right to left):

    The original packet data,

    each line of data organized from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,

    and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
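The sort → mean/standard deviation → CDF steps described above can be sketched with the standard library; the packet sizes below are invented, not taken from the capture files.

```python
import statistics

# Toy packet-size capture; values are invented for illustration.
sizes = [40, 60, 520, 1500, 1500, 60]
ordered = sorted(sizes)  # smallest to largest, as described above
mean, stdev = statistics.mean(ordered), statistics.stdev(ordered)

def ecdf(values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

print(round(mean, 1), round(ecdf(ordered, 60), 2))
```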

  15. Amazon Web Services - Public Data Sets

    • data.wu.ac.at
    Updated Oct 10, 2013
    + more versions
    Cite
    Global (2013). Amazon Web Services - Public Data Sets [Dataset]. https://data.wu.ac.at/schema/datahub_io/NTYxNjkxNmYtNmZlNS00N2EwLWJkYTktZjFjZWJkNTM2MTNm
    Explore at:
    Dataset updated
    Oct 10, 2013
    Dataset provided by
    Global
    Description

    About

    From website:

    Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

    Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

  16. Data from: Web Traffic Dataset

    • kaggle.com
    zip
    Updated May 19, 2024
    Cite
    Ramin Huseyn (2024). Web Traffic Dataset [Dataset]. https://www.kaggle.com/datasets/raminhuseyn/web-traffic-time-series-dataset
    Explore at:
    zip(14740 bytes)Available download formats
    Dataset updated
    May 19, 2024
    Authors
    Ramin Huseyn
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The dataset contains information about web requests to a single website. It's a time series dataset, which means it tracks data over time, making it great for machine learning analysis.

  17. BGS offshore activities and samples Web service - Dataset - data.gov.uk

    • ckan.publishing.service.gov.uk
    Updated Aug 5, 2025
    Cite
    ckan.publishing.service.gov.uk (2025). BGS offshore activities and samples Web service - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/bgs-offshore-activities-and-samples-web-service
    Explore at:
    Dataset updated
    Aug 5, 2025
    Dataset provided by
    CKAN (https://ckan.org/)
    Description

    This Web service provides layers which show metadata relating to offshore sample collection and other activities undertaken by the British Geological Survey (BGS) and its predecessors. The layers are point layers which indicate the spatial locations of the samples or activities. This service groups data by the type of sample: borehole-type samples (including boreholes, vibrocorers, piston corers and other types of corer), grab samples and other samples (including dredge samples and cone penetrometer tests). For each sample type, two layers are provided: 1) A summary metadata layer containing details about the sample, the survey or cruise during which it was collected, and additional descriptive information, plus a link to scanned images of sample station datasheets (where available). 2) For samples which have undergone further geological interpretation, a layer which contains geological observations, measurements and interpretations at specific depth intervals. Two additional layers containing the results of particle size analysis (PSA) and geotechnical data (where collected) are also provided.

  18. Sample Power BI Data

    • kaggle.com
    zip
    Updated Oct 2, 2022
    Cite
    AmitRaghav007 (2022). Sample Power BI Data [Dataset]. https://www.kaggle.com/datasets/amitraghav007/us-store-data
    Explore at:
    zip(1031090 bytes)Available download formats
    Dataset updated
    Oct 2, 2022
    Authors
    AmitRaghav007
    Description

    Dataset

    This dataset was created by AmitRaghav007

    Contents

    E-commerce website data for building reports.

  19. 365 Data Science Web site statistics

    • kaggle.com
    zip
    Updated Aug 9, 2024
    Cite
    yasser messahli (2024). 365 Data Science Web site statistics [Dataset]. https://www.kaggle.com/yassermessahli/365-data-science-web-site-statistics
    Explore at:
    zip(3895191 bytes)Available download formats
    Dataset updated
    Aug 9, 2024
    Authors
    yasser messahli
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    365 Data Science Database

    365 Data Science is a website that provides online courses and resources for learning data science, machine learning, and data analysis.

    It is common for websites that offer online courses to maintain **databases** to store information about their courses, students, and progress. They may also use databases to store and organize the data used in their courses and examples.

    If you're looking for specific information about the database used by 365 Data Science, consider reaching out to them directly through their website or support channels.

  20. Web Scraper Software Market Report

    • promarketreports.com
    doc, pdf, ppt
    Updated Jul 14, 2025
    Cite
    Pro Market Reports (2025). Web Scraper Software Market Report [Dataset]. https://www.promarketreports.com/reports/web-scraper-software-market-8662
    Explore at:
    doc, ppt, pdfAvailable download formats
    Dataset updated
    Jul 14, 2025
    Dataset authored and provided by
    Pro Market Reports
    License

    https://www.promarketreports.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The web scraper software market offers a range of solutions tailored to different needs and complexities:

      • General-Purpose Web Crawlers: Versatile tools designed to extract data from a wide variety of websites with diverse structures and content. Popular examples include UiPath and Octoparse, offering robust features and flexibility for broad-scale scraping projects.
      • Specialized Web Crawlers: Solutions optimized for specific websites or domains, providing enhanced efficiency and accuracy for targeted data extraction. Scrapinghub is a notable example, offering specialized tools and integrations for specific web applications.
      • Incremental Web Crawlers: Designed for ongoing data updates, these crawlers identify and extract only newly added or modified content, ensuring datasets remain current and relevant. Mozenda exemplifies this category, providing tools for efficient monitoring and updating.
      • Deep Web Crawlers: Advanced tools that access data residing within hidden or protected sections of the web that are not readily accessible through traditional methods. DeepCrawl is an example of a platform designed to navigate and extract data from these less accessible areas of the internet.

Cite
PredictLeads, Global Web Data | Web Scraping Data | Job Postings Data | Source: Company Website | 232M+ Records [Dataset]. https://datarade.ai/data-products/predictleads-web-data-web-scraping-data-job-postings-dat-predictleads

Global Web Data | Web Scraping Data | Job Postings Data | Source: Company Website | 232M+ Records

Explore at:
.jsonAvailable download formats
Dataset authored and provided by
PredictLeads
Area covered
Bosnia and Herzegovina, French Guiana, Kuwait, El Salvador, Virgin Islands (British), Northern Mariana Islands, Comoros, Guadeloupe, Bonaire, Kosovo
Description

PredictLeads Job Openings Data provides high-quality hiring insights sourced directly from company websites - not job boards. Using advanced web scraping technology, our dataset offers real-time access to job trends, salaries, and skills demand, making it a valuable resource for B2B sales, recruiting, investment analysis, and competitive intelligence.

Key Features:

✅ 232M+ Job Postings Tracked – Data sourced from 92 million company websites worldwide.
✅ 7.1M+ Active Job Openings – Updated in real time to reflect hiring demand.
✅ Salary & Compensation Insights – Extract salary ranges, contract types, and job seniority levels.
✅ Technology & Skill Tracking – Identify emerging tech trends and industry demands.
✅ Company Data Enrichment – Link job postings to employer domains, firmographics, and growth signals.
✅ Web Scraping Precision – Directly sourced from employer websites for unmatched accuracy.

Primary Attributes:

  • id (string, UUID) – Unique identifier for the job posting.
  • type (string, constant: "job_opening") – Object type.
  • title (string) – Job title.
  • description (string) – Full job description, extracted from the job listing.
  • url (string, URL) – Direct link to the job posting.
  • first_seen_at – Timestamp when the job was first detected.
  • last_seen_at – Timestamp when the job was last detected.
  • last_processed_at – Timestamp when the job data was last processed.

Job Metadata:

  • contract_types (array of strings) – Type of employment (e.g., "full time", "part time", "contract").
  • categories (array of strings) – Job categories (e.g., "engineering", "marketing").
  • seniority (string) – Seniority level of the job (e.g., "manager", "non_manager").
  • status (string) – Job status (e.g., "open", "closed").
  • language (string) – Language of the job posting.
  • location (string) – Full location details as listed in the job description.
  • Location Data (location_data) (array of objects)
  • city (string, nullable) – City where the job is located.
  • state (string, nullable) – State or region of the job location.
  • zip_code (string, nullable) – Postal/ZIP code.
  • country (string, nullable) – Country where the job is located.
  • region (string, nullable) – Broader geographical region.
  • continent (string, nullable) – Continent name.
  • fuzzy_match (boolean) – Indicates whether the location was inferred.

Salary Data (salary_data)

  • salary (string) – Salary range extracted from the job listing.
  • salary_low (float, nullable) – Minimum salary in original currency.
  • salary_high (float, nullable) – Maximum salary in original currency.
  • salary_currency (string, nullable) – Currency of the salary (e.g., "USD", "EUR").
  • salary_low_usd (float, nullable) – Converted minimum salary in USD.
  • salary_high_usd (float, nullable) – Converted maximum salary in USD.
  • salary_time_unit (string, nullable) – Time unit for the salary (e.g., "year", "month", "hour").

Occupational Data (onet_data) (object, nullable)

  • code (string, nullable) – ONET occupation code.
  • family (string, nullable) – Broad occupational family (e.g., "Computer and Mathematical").
  • occupation_name (string, nullable) – Official ONET occupation title.

Additional Attributes:

  • tags (array of strings, nullable) – Extracted skills and keywords (e.g., "Python", "JavaScript").
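The attribute layout above can be consumed with ordinary JSON tooling. Below is a minimal sketch, assuming the field names listed in this description; the sample record and the `describe_job` helper are fabricated for illustration and are not part of the PredictLeads API.

```python
def describe_job(job):
    """Format a one-line summary from a job_opening record (hypothetical helper)."""
    # location_data is an array of objects with nullable fields.
    loc = (job.get("location_data") or [{}])[0]
    city = loc.get("city") or "unknown city"
    # salary_data carries both original-currency and USD-converted bounds.
    sal = job.get("salary_data") or {}
    low, high = sal.get("salary_low_usd"), sal.get("salary_high_usd")
    if low is not None and high is not None:
        pay = f"${low:,.0f}-${high:,.0f} USD per {sal.get('salary_time_unit', '?')}"
    else:
        pay = "salary n/a"
    return f"{job['title']} ({job.get('seniority') or 'n/a'}) - {city} - {pay}"

# Fabricated sample record following the schema described above.
sample = {
    "id": "00000000-0000-0000-0000-000000000000",  # hypothetical UUID
    "type": "job_opening",
    "title": "Data Engineer",
    "seniority": "non_manager",
    "status": "open",
    "location_data": [{"city": "Berlin", "country": "Germany", "fuzzy_match": False}],
    "salary_data": {"salary_low_usd": 70000.0, "salary_high_usd": 90000.0,
                    "salary_time_unit": "year"},
}
summary = describe_job(sample)
```

Because most fields are nullable, the helper falls back to placeholders rather than raising when location or salary data is absent.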

📌 Trusted by enterprises, recruiters, and investors for high-precision job market insights.

PredictLeads Dataset: https://docs.predictleads.com/v3/guide/job_openings_dataset
