TagX Web Browsing Clickstream Data: Unveiling Digital Behavior Across North America and EU

Unique Insights into Online User Behavior

TagX Web Browsing Clickstream Data offers an unparalleled window into the digital lives of 1 million users across North America and the European Union. This comprehensive dataset stands out in the market due to its breadth, depth, and stringent compliance with data protection regulations.

What Makes Our Data Unique?

Extensive Geographic Coverage: Spanning two major markets, our data provides a holistic view of web browsing patterns in developed economies.
Large User Base: With 300K active users, our dataset offers statistically significant insights across various demographics and user segments.
GDPR and CCPA Compliance: We prioritize user privacy and data protection, ensuring that our data collection and processing methods adhere to the strictest regulatory standards.
Real-time Updates: Our clickstream data is continuously refreshed, providing up-to-the-minute insights into evolving online trends and user behaviors.
Granular Data Points: We capture a wide array of metrics, including time spent on websites, click patterns, search queries, and user journey flows.

Data Sourcing: Ethical and Transparent

Our web browsing clickstream data is sourced through a network of partnered websites and applications. Users explicitly opt in to data collection, ensuring transparency and consent. We employ advanced anonymization techniques to protect individual privacy while maintaining the integrity and value of the aggregated data. Key aspects of our data sourcing process include:

Voluntary user participation through clear opt-in mechanisms
Regular audits of data collection methods to ensure ongoing compliance
Collaboration with privacy experts to implement best practices in data anonymization
Continuous monitoring of regulatory landscapes to adapt our processes as needed
Primary Use Cases and Verticals

TagX Web Browsing Clickstream Data serves a multitude of industries and use cases, including but not limited to:

Digital Marketing and Advertising:
Audience segmentation and targeting
Campaign performance optimization
Competitor analysis and benchmarking

E-commerce and Retail:
Customer journey mapping
Product recommendation enhancements
Cart abandonment analysis

Media and Entertainment:
Content consumption trends
Audience engagement metrics
Cross-platform user behavior analysis

Financial Services:
Risk assessment based on online behavior
Fraud detection through anomaly identification
Investment trend analysis

Technology and Software:
User experience optimization
Feature adoption tracking
Competitive intelligence

Market Research and Consulting:
Consumer behavior studies
Industry trend analysis
Digital transformation strategies
Integration with Broader Data Offering

TagX Web Browsing Clickstream Data is a cornerstone of our comprehensive digital intelligence suite. It seamlessly integrates with our other data products to provide a 360-degree view of online user behavior:

Social Media Engagement Data: Combine clickstream insights with social media interactions for a holistic understanding of digital footprints.
Mobile App Usage Data: Cross-reference web browsing patterns with mobile app usage to map the complete digital journey.
Purchase Intent Signals: Enrich clickstream data with purchase intent indicators to power predictive analytics and targeted marketing efforts.
Demographic Overlays: Enhance web browsing data with demographic information for more precise audience segmentation and targeting.

By leveraging these complementary datasets, businesses can unlock deeper insights and drive more impactful strategies across their digital initiatives.

Data Quality and Scale

We pride ourselves on delivering high-quality, reliable data at scale:

Rigorous Data Cleaning: Advanced algorithms filter out bot traffic, VPNs, and other non-human interactions.
Regular Quality Checks: Our data science team conducts ongoing audits to ensure data accuracy and consistency.
Scalable Infrastructure: Our robust data processing pipeline can handle billions of daily events, ensuring comprehensive coverage.
Historical Data Availability: Access up to 24 months of historical data for trend analysis and longitudinal studies.
Customizable Data Feeds: Tailor the data delivery to your specific needs, from raw clickstream events to aggregated insights.

Empowering Data-Driven Decision Making

In today's digital-first world, understanding online user behavior is crucial for businesses across all sectors. TagX Web Browsing Clickstream Data empowers organizations to make informed decisions, optimize their digital strategies, and stay ahead of the competition. Whether you're a marketer looking to refine your targeting, a product manager seeking to enhance user experience, or a researcher exploring digital trends, our cli...
The collection comprises a variety of tables sourced from the Swedish website 'Lantmäteriet' (National Land Survey). Each table groups related data, and together the tables give a broad view of the subject. The collection is organized into rows and columns, making it easy to navigate and understand the data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of trackers as captured by PrivacyScore (privacyscore.org) on February 19th, 2019. Data collection with PrivacyScore was performed one time only, as the presence of trackers is tied to the website, not to the research subject. The data reflects the state of each website on that particular date. The data collected showed that the number of third-party embeds (third parties that provide services to the first party) is 575 for only ten websites, set by 328 unique companies, and the number of third-party calls is 172. The queried websites were: Wired, The Guardian, Ars Technica, EuraktivJobs, Forumotion, Motherboard (VICE), Politico EU. The trackers found were triangulated with data from Better.fyi and WhoTracksMe in order to determine the purpose of the tracking and the tracking type. Visual analysis is provided in the published paper (see details below).
https://www.wiseguyreports.com/pages/privacy-policy
Attribute | Detail |
---|---|
BASE YEAR | 2024 |
HISTORICAL DATA | 2019 - 2024 |
REPORT COVERAGE | Revenue Forecast, Competitive Landscape, Growth Factors, and Trends |
MARKET SIZE 2023 | 3.24 (USD Billion) |
MARKET SIZE 2024 | 3.73 (USD Billion) |
MARKET SIZE 2032 | 11.46 (USD Billion) |
SEGMENTS COVERED | Deployment Mode, Data Source, Extraction Type, Cloud Type, Application, Regional |
COUNTRIES COVERED | North America, Europe, APAC, South America, MEA |
KEY MARKET DYNAMICS | 1. AI-powered data extraction 2. Growing demand for structured data 3. Cloud-based data scraping services 4. Real-time web data extraction 5. Increased use of web scraping for business intelligence |
MARKET FORECAST UNITS | USD Billion |
KEY COMPANIES PROFILED | Dexi.io, Cheerio, ScrapingBee, Import.io, Scrapinghub, 80legs, Bright Data, Mozenda, Phantombuster, Helium Scraper, ScraperAPI, Octoparse, Apify, ParseHub, Diffbot |
MARKET FORECAST PERIOD | 2024 - 2032 |
KEY MARKET OPPORTUNITIES | Automation for efficient data collection; Real-time data extraction for enhanced decision-making; Cloud-based tools for scalability and flexibility; AI-powered tools for advanced data analysis; Increased demand for web scraping in various industries |
COMPOUND ANNUAL GROWTH RATE (CAGR) | 15.06% (2024 - 2032) |
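As a quick sanity check (a worked calculation, not part of the original report), the stated CAGR is consistent with the 2024 and 2032 market-size figures above:

$\mathrm{CAGR} = \left(\tfrac{11.46}{3.73}\right)^{1/8} - 1 \approx 0.1506 = 15.06\%$ over 2024-2032.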
Altosight | AI Custom Web Scraping Data
✦ Altosight provides global web scraping data services with AI-powered technology that bypasses CAPTCHAs, blocking mechanisms, and handles dynamic content.
We extract data from marketplaces like Amazon, aggregators, e-commerce, and real estate websites, ensuring comprehensive and accurate results.
✦ Our solution offers free unlimited data points across any project, with no additional setup costs.
We deliver data through flexible methods such as API, CSV, JSON, and FTP, all at no extra charge.
― Key Use Cases ―
➤ Price Monitoring & Repricing Solutions
🔹 Automatic repricing, AI-driven repricing, and custom repricing rules (a minimal illustration follows below)
🔹 Receive price suggestions via API or CSV to stay competitive
🔹 Track competitors in real-time or at scheduled intervals
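As a minimal sketch of what a custom repricing rule applied to a CSV price feed could look like: the column names (sku, our_price, competitor_price, cost), the file name, and the 1% undercut rule are illustrative assumptions, not Altosight's actual schema or logic.

```python
import csv

UNDERCUT = 0.01      # undercut the cheapest competitor price by 1% (illustrative rule)
MIN_MARGIN = 1.10    # never price below cost * 1.10 (illustrative floor)

def suggest_price(our_price: float, competitor_price: float, cost: float) -> float:
    """Return a suggested price: just below the competitor, but above a cost floor."""
    candidate = competitor_price * (1 - UNDERCUT)
    floor = cost * MIN_MARGIN
    return round(max(candidate, floor), 2)

# Hypothetical CSV feed with one row per SKU.
with open("competitor_prices.csv", newline="") as f:
    for row in csv.DictReader(f):
        suggestion = suggest_price(
            float(row["our_price"]), float(row["competitor_price"]), float(row["cost"])
        )
        print(row["sku"], suggestion)
```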
➤ E-commerce Optimization
🔹 Extract product prices, reviews, ratings, images, and trends
🔹 Identify trending products and enhance your e-commerce strategy
🔹 Build dropshipping tools or marketplace optimization platforms with our data
➤ Product Assortment Analysis
🔹 Extract the entire product catalog from competitor websites
🔹 Analyze product assortment to refine your own offerings and identify gaps
🔹 Understand competitor strategies and optimize your product lineup
➤ Marketplaces & Aggregators
🔹 Crawl entire product categories and track best-sellers
🔹 Monitor position changes across categories
🔹 Identify which eRetailers sell specific brands and which SKUs for better market analysis
➤ Business Website Data
🔹 Extract detailed company profiles, including financial statements, key personnel, industry reports, and market trends, enabling in-depth competitor and market analysis
🔹 Collect customer reviews and ratings from business websites to analyze brand sentiment and product performance, helping businesses refine their strategies
➤ Domain Name Data
🔹 Access comprehensive data, including domain registration details, ownership information, expiration dates, and contact information. Ideal for market research, brand monitoring, lead generation, and cybersecurity efforts
➤ Real Estate Data
🔹 Access property listings, prices, and availability
🔹 Analyze trends and opportunities for investment or sales strategies
― Data Collection & Quality ―
► Publicly Sourced Data: Altosight collects web scraping data from publicly available websites, online platforms, and industry-specific aggregators
► AI-Powered Scraping: Our technology handles dynamic content, JavaScript-heavy sites, and pagination, ensuring complete data extraction
► High Data Quality: We clean and structure unstructured data, ensuring it is reliable, accurate, and delivered in formats such as API, CSV, JSON, and more
► Industry Coverage: We serve industries including e-commerce, real estate, travel, finance, and more. Our solution supports use cases like market research, competitive analysis, and business intelligence
► Bulk Data Extraction: We support large-scale data extraction from multiple websites, allowing you to gather millions of data points across industries in a single project
► Scalable Infrastructure: Our platform is built to scale with your needs, allowing seamless extraction for projects of any size, from small pilot projects to ongoing, large-scale data extraction
― Why Choose Altosight? ―
✔ Unlimited Data Points: Altosight offers unlimited free attributes, meaning you can extract as many data points from a page as you need without extra charges
✔ Proprietary Anti-Blocking Technology: Altosight utilizes proprietary techniques to bypass blocking mechanisms, including CAPTCHAs, Cloudflare, and other obstacles. This ensures uninterrupted access to data, no matter how complex the target websites are
✔ Flexible Across Industries: Our crawlers easily adapt across industries, including e-commerce, real estate, finance, and more. We offer customized data solutions tailored to specific needs
✔ GDPR & CCPA Compliance: Your data is handled securely and ethically, ensuring compliance with GDPR, CCPA and other regulations
✔ No Setup or Infrastructure Costs: Start scraping without worrying about additional costs. We provide a hassle-free experience with fast project deployment
✔ Free Data Delivery Methods: Receive your data via API, CSV, JSON, or FTP at no extra charge. We ensure seamless integration with your systems
✔ Fast Support: Our team is always available via phone and email, resolving over 90% of support tickets within the same day
― Custom Projects & Real-Time Data ―
✦ Tailored Solutions: Every business has unique needs, which is why Altosight offers custom data projects. Contact us for a feasibility analysis, and we’ll design a solution that fits your goals
✦ Real-Time Data: Whether you need real-time data delivery or scheduled updates, we provide the flexibility to receive data when you need it. Track price changes, monitor product trends, or gather...
Field sampling occurred at locations within Ten Thousand Islands National Wildlife Refuge on three transects along the natural salinity gradient of increasing salinity to the coastal south. We used three replicates per tier (east-west) for a total of nine sampling sites. Sites were approximately 1300 m apart in all directions. Sampling events occurred every 3–4 weeks from January to May for 7 sampling events in 2015 and 5 in 2016. These dates were selected to capture signals of the natural variation in water levels and salinity that occur during the transition from the wet season to the dry season. Fish traps were deployed at each of the nine sites and then retrieved the following day, allowing 24 hours soak time. Two minnow traps (Gee Minnow Trap; 22.9 x 44.5 cm, 0.6 cm mesh, 1.9 cm diameter opening) and two Breder traps (15 cm x 15 cm x 30 cm, 12 mm opening width, 15 cm opening height) were placed at each sampling site because the two trap types have the potential to catch different fish species.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code:
Packet_Features_Generator.py & Features.py
To run this code:
pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j
-h, --help    show this help message and exit
-i TXTFILE    input text file
-x X          Add first X number of total packets as features.
-y Y          Add first Y number of negative packets as features.
-z Z          Add first Z number of positive packets as features.
-ml           Output to text file all websites in the format of websiteNumber1,feature1,feature2,...
-s S          Generate samples using size s.
-j
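For example, a hypothetical invocation (the input file name and the X/Y/Z values are illustrative, not prescribed by the tool):

pkt_features.py -i packet_sizes.txt -x 10 -y 10 -z 10 -ml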
Purpose:
Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.
Uses Features.py to calculate the features.
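A minimal sketch of the kind of transformation described above, assuming the input format documented in the Data section below (one whitespace-separated line per capture: a class label followed by signed packet sizes). The Website class and the particular features computed here are illustrative; the actual Features.py may use a different feature set.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Website:
    """One labelled packet capture: a class label plus signed packet sizes."""
    label: int
    packets: List[int] = field(default_factory=list)

    def features(self, x: int = 10) -> List[float]:
        incoming = [p for p in self.packets if p < 0]   # negative = incoming
        outgoing = [p for p in self.packets if p > 0]   # positive = outgoing
        first_x = self.packets[:x] + [0] * max(0, x - len(self.packets))
        return [len(self.packets), len(incoming), len(outgoing),
                sum(outgoing), -sum(incoming)] + first_x

def load_captures(path: str) -> List[Website]:
    """Parse one capture per line: label followed by signed packet sizes."""
    sites = []
    with open(path) as f:
        for line in f:
            nums = [int(n) for n in line.split()]
            if nums:
                sites.append(Website(label=nums[0], packets=nums[1:]))
    return sites
```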
startMachineLearning.sh & machineLearning.py
To run this code:
bash startMachineLearning.sh
This code then runs machineLearning.py in a tmux session with the necessary file paths and flags.
Options (to be edited within this file):
--evaluate-only to test 5 fold cross validation accuracy
--test-scaling-normalization to test 6 different combinations of scalers and normalizers
Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use
--grid-search to test the best grid search hyperparameters. Note: the possible hyperparameters must be added to train_model under 'if not evaluateOnly:'; once the best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'
Purpose:
Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a RandomForest classifier on the provided data and reports results using cross validation. These results include the best scaling and normalization options for each data set as well as the best grid search hyperparameters based on the provided ranges.
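A minimal sketch of the training and evaluation loop described above, using scikit-learn. The .ml file layout, the set of scalers compared, and the RandomForest settings are assumptions; the original machineLearning.py may differ in its preprocessing and hyperparameter ranges.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Assumed .ml layout: one sample per line, "label,feature1,feature2,..."
data = np.loadtxt("websites.ml", delimiter=",")
y, X = data[:, 0], data[:, 1:]

# Compare a few scaling options with 5-fold cross-validation.
for scaler in (None, StandardScaler(), MinMaxScaler(), RobustScaler()):
    steps = ([scaler] if scaler is not None else []) + [
        RandomForestClassifier(n_estimators=100, random_state=0)
    ]
    pipe = make_pipeline(*steps)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(type(scaler).__name__ if scaler else "no scaling", scores.mean())
```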
Data
Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, different Google search queries (collected in the form of their autocomplete results and their results page), and different actions taken on a Virtual Reality headset.
Data for this experiment was stored and analyzed in the form of a txt file for each experiment which contains:
The first number is a classification number denoting which website, query, or VR action took place.
The remaining numbers in each line denote:
The size of a packet,
and the direction it is traveling.
negative numbers denote incoming packets
positive numbers denote outgoing packets
Figure 4 Data
This data uses specific lines from the Virtual Reality.txt file.
The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.
The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.
The .xlsx and .csv files are identical.
Each file includes (from right to left):
The original packet data,
each line of data organized from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,
and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
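A minimal sketch of the Figure 4 computation described above: sort the packet sizes of a capture, report their mean and standard deviation, and build the empirical CDF. Variable names and the example capture are illustrative, and absolute packet sizes are assumed.

```python
import numpy as np

def empirical_cdf(packet_sizes):
    """Sort absolute packet sizes and return (sorted_sizes, cumulative probabilities)."""
    sizes = np.sort(np.abs(np.asarray(packet_sizes)))
    cdf = np.arange(1, len(sizes) + 1) / len(sizes)
    return sizes, cdf

packets = [-1514, 66, -66, 1514, 120, -560]   # one example capture (illustrative)
sizes, cdf = empirical_cdf(packets)
print("mean:", sizes.mean(), "std:", sizes.std())
print(list(zip(sizes.tolist(), cdf.tolist())))
```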
https://qdr.syr.edu/policies/qdr-restricted-access-conditions
Project Summary This dataset contains all qualitative and quantitative data collected in the first phase of the Pandemic Journaling Project (PJP). PJP is a combined journaling platform and interdisciplinary, mixed-methods research study developed by two anthropologists, with support from a team of colleagues and students across the social sciences, humanities, and health fields. PJP launched in Spring 2020 as the COVID-19 pandemic was emerging in the United States. PJP was created in order to “pre-design an archive” of COVID-19 narratives and experiences open to anyone around the world. The project is rooted in a commitment to democratizing knowledge production, in the spirit of “archival activism” and using methods of “grassroots collaborative ethnography” (Willen et al. 2022; Wurtz et al. 2022; Zhang et al 2020; see also Carney 2021). The motto on the PJP website encapsulates these commitments: “Usually, history is written only by the powerful. When the history of COVID-19 is written, let’s make sure that doesn’t happen.” (A version of this Project Summary with links to the PJP website and other relevant sites is included in the public documentation of the project at QDR.) In PJP’s first phase (PJP-1), the project provided a digital space where participants could create weekly journals of their COVID-19 experiences using a smartphone or computer. The platform was designed to be accessible to as wide a range of potential participants as possible. Anyone aged 15 or older, living anywhere in the world, could create journal entries using their choice of text, images, and/or audio recordings. The interface was accessible in English and Spanish, but participants could submit text and audio in any language. PJP-1 ran on a weekly basis from May 2020 to May 2022. Data Overview This Qualitative Data Repository (QDR) project contains all journal entries and closed-ended survey responses submitted during PJP-1, along with accompanying descriptive and explanatory materials. The dataset includes individual journal entries and accompanying quantitative survey responses from more than 1,800 participants in 55 countries. Of nearly 27,000 journal entries in total, over 2,700 included images and over 300 are audio files. All data were collected via the Qualtrics survey platform. PJP-1 was approved as a research study by the Institutional Review Board (IRB) at the University of Connecticut. Participants were introduced to the project in a variety of ways, including through the PJP website as well as professional networks, PJP’s social media accounts (on Facebook, Instagram, and Twitter) , and media coverage of the project. Participants provided a single piece of contact information — an email address or mobile phone number — which was used to distribute weekly invitations to participate. This contact information has been stripped from the dataset and will not be accessible to researchers. PJP uses a mixed-methods research approach and a dynamic cohort design. After enrolling in PJP-1 via the project’s website, participants received weekly invitations to contribute to their journals via their choice of email or SMS (text message). Each weekly invitation included a link to that week’s journaling prompts and accompanying survey questions. Participants could join at any point, and they could stop participating at any point as well. They also could stop participating and later restart. Retention was encouraged with a monthly raffle of three $100 gift cards. 
All individuals who had contributed that month were eligible. Regardless of when they joined, all participants received the project’s narrative prompts and accompanying survey questions in the same order. In Week 1, before contributing their first journal entries, participants were presented with a baseline survey that collected demographic information, including political leanings, as well as self-reported data about COVID-19 exposure and physical and mental health status. Some of these survey questions were repeated at periodic intervals in subsequent weeks, providing quantitative measures of change over time that can be analyzed in conjunction with participants' qualitative entries. Surveys employed validated questions where possible. The core of PJP-1 involved two weekly opportunities to create journal entries in the format of their choice (text, image, and/or audio). Each week, journalers received a link with an invitation to create one entry in response to a recurring narrative prompt (“How has the COVID-19 pandemic affected your life in the past week?”) and a second journal entry in response to their choice of two more tightly focused prompts. Typically the pair of prompts included one focusing on subjective experience (e.g., the impact of the pandemic on relationships, sense of social connectedness, or mental health) and another with an external focus (e.g., key sources of scientific information, trust in government, or COVID-19’s economic impact). Each week,...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset originally created 03/01/2019. UPDATE: Packaged on 04/18/2019. UPDATE: Edited README on 04/18/2019.
I. About this Data Set This data set is a snapshot of work that is ongoing as a collaboration between Kluge Fellow in Digital Studies, Patrick Egan and an intern at the Library of Congress in the American Folklife Center. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled “Connections In Sound”, invites you to use and re-use this data.
The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.
The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by searches of the collections by the scholar both on the internet at https://catalog.loc.gov and by using card catalogs.
Updates to these datasets will be announced and published as the project progresses.
II. What’s included? This data set includes:
III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improve access and connect sounds in the American Folklife Collections by focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.
IV. Data Set Field Descriptions
a) Collections dataset field descriptions
b) Items dataset field descriptions
V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under creative commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to the creators of these datasets.
VI. Creator and Contributor Information
Creator: Connections In Sound
Contributors: Library of Congress Labs
VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.
We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.
Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.
We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.
https://creativecommons.org/publicdomain/zero/1.0/
Imgur is an image hosting and sharing website founded in 2009. It became one of the most popular websites around the world with approximately 250 million users. The website does not require registration and anyone can browse its content. However, to be able to post an account must be created. It is famous for an event that it created in 2013 where members get to register to send/receive gifts from other members on the website. The event takes place during Christmas time and people share their gifts via the website where they post pictures of the process or what they received in a specific tag. Today the data provided covers two sections that I think are important to understanding certain patterns within the Imgur community. The first is the Most Viral section and the second is the Secret Santa tag.
I have participated twice in the Imgur Secret Santa event and always found funny and interesting posts in its Most Viral section. I would like, with the help of the Kaggle community, to identify trends from the data provided and maybe make a comparison between the Secret Santa data and the Most Viral data.
There are two Dataframes included and they are almost identical in the number of columns:
The first Dataframe is Imgur Most Viral posts. This contains many of the posts that were labelled as Viral by The Imgur community and team using specific algorithms to track number of likes and dislikes across multiple platforms. The posts might be videos, gifs, pictures or just text.
The second Dataframe is Imgur Secret Santa Tag. Secret Santa is an annual Imgur tradition where members can sign up to send gifts to and receive gifts from other members during the Christmas holiday. This contains many of the posts that were tagged with Secret Santa by the Imgur community. The posts might be videos, gifs, pictures or just text. There is an (is_viral) column in this Dataframe that is not available in the Most Viral Dataframe since all of the posts there are viral.
Feature | Type | Dataset | Description |
---|---|---|---|
account_id | object | Imgur_Viral/imgur_secret_santa | Unique Account ID per member |
comment_count | float64 | Imgur_Viral/imgur_secret_santa | Number of comments made in the post |
datetime | float64 | Imgur_Viral/imgur_secret_santa | TimeStamp containing Date and Time Details |
downs | float64 | Imgur_Viral/imgur_secret_santa | Number of dislikes for the post |
favorite_count | float64 | Imgur_Viral/imgur_secret_santa | Number of users that marked the post as a favourite |
id | object | Imgur_Viral/imgur_secret_santa | Unique Post ID. Even if it was posted by the same member, different posts will have different IDs |
images_count | float64 | Imgur_Viral/imgur_secret_santa | Number of images included in the post |
points | float64 | Imgur_Viral/imgur_secret_santa | Each post will have calculated points based on (ups - downs) |
score | float64 | Imgur_Viral/imgur_secret_santa | Ticket number |
tags | object | Imgur_Viral/imgur_secret_santa | Tags are sub albums that the post will show under |
title | object | Imgur_Viral/imgur_secret_santa | Title of the post |
ups | float64 | Imgur_Viral/imgur_secret_santa | Number of likes for the post |
views | float64 | Imgur_Viral/imgur_secret_santa | Number of people that viewed the post |
is_most_viral | boolean | imgur_secret_santa | If the post is viral or not |
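A minimal sketch of how the two Dataframes could be loaded and compared once exported to CSV. The file names are hypothetical; the column names follow the data dictionary above.

```python
import pandas as pd

# Hypothetical CSV exports of the two Dataframes described above.
viral = pd.read_csv("imgur_most_viral.csv")
santa = pd.read_csv("imgur_secret_santa.csv")

# Compare engagement: all Most Viral posts are viral by definition,
# so split the Secret Santa tag by its is_most_viral flag.
print(viral[["ups", "downs", "views"]].mean())
print(santa.groupby("is_most_viral")[["ups", "downs", "views"]].mean())
```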
I would like to thank Imgur for providing an API that made collecting data from its website easier. With their help we might be able to better understand certain trends that emerge from its community.
There is no problem to solve from this data; it is just a fun way to explore and learn more about programming and analyzing data. I hope you enjoy playing with the data as much as I did collecting it and browsing the website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a dataset of Tor cell files extracted from browsing simulations using Tor Browser. The simulations cover both desktop and mobile webpages. The data collection process used the WFP-Collector tool (https://github.com/irsyadpage/WFP-Collector), with all the necessary configuration to perform the simulation as detailed in the tool repository. The webpage URLs were selected using the first 100 websites listed at https://dataforseo.com/free-seo-stats/top-1000-websites. Each webpage URL was visited 90 times in each of the desktop and mobile browsing modes.
The dataset provides information on green waste collection sites located in the city of Vilnius.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 8. Technical characteristics of tools as identified from the review, survey, and tool developers’ websites/direct contact. Dataset that describes the technical characteristics of the electronic tools as identified from the systematic review (2010–2020), survey or from review of software websites or contact with software developers. Where no data were found on a particular characteristic of a tool, “don’t know” was entered in the database and where a tool only had one function (data collection or management or analysis), “NA” for not applicable was added to the relevant columns. The Samaritan’s Purse Reporting System was excluded from this database on request from the organisation.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset aggregates information about 191 research data repositories that were shut down. The data collection was based on the registry of research data repositories re3data and a comprehensive content analysis of repository websites and related materials. Documented in the dataset are the period in which a repository was active, the risks resulting in its shutdown, and the repositories taking over custody of the data after.
The dataset collection titled 'index_100km' comprises one or more tables of related data, organized in a structured format of columns and rows for easy understanding and analysis. The data has been sourced from the website of 'Lantmäteriet' (The Land Survey) in Sweden, a reputable source. The collection can be useful for analyses by researchers and professionals in various fields.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of 22 data sets of 50+ requirements each, expressed as user stories.
The dataset has been created by gathering data from web sources and we are not aware of license agreements or intellectual property rights on the requirements / user stories. The curator took utmost diligence in minimizing the risks of copyright infringement by using non-recent data that is less likely to be critical, by sampling a subset of the original requirements collection, and by qualitatively analyzing the requirements. In case of copyright infringement, please contact the dataset curator (Fabiano Dalpiaz, f.dalpiaz@uu.nl) to discuss the possibility of removing that dataset [see Zenodo's policies].
The data sets have been originally used to conduct experiments about ambiguity detection with the REVV-Light tool: https://github.com/RELabUU/revv-light
This collection has been originally published in Mendeley data: https://data.mendeley.com/datasets/7zbk8zsd8y/1
The following text provides a description of the datasets, including links to the systems and websites, when available. The datasets are organized by macro-category and then by identifier.
g02-federalspending.txt
(2018) originates from early data in the Federal Spending Transparency project, which pertains to the website used to share publicly the spending data for the U.S. government. The website was created because of the Digital Accountability and Transparency Act of 2014 (DATA Act). The specific dataset pertains to a system called DAIMS or Data Broker, which stands for DATA Act Information Model Schema. The sample that was gathered refers to a sub-project related to allowing the government to act as a data broker, thereby providing data to third parties. The data for the Data Broker project is currently not available online, although the backend seems to be hosted on GitHub under a CC0 1.0 Universal license. Current and recent snapshots of federal spending related websites, including many more projects than the one described in the shared collection, can be found here.
g03-loudoun.txt
(2018) is a set of extracted requirements from a document, by the Loudoun County Virginia, that describes the to-be user stories and use cases about a system for land management readiness assessment called Loudoun County LandMARC. The source document can be found here and it is part of the Electronic Land Management System and EPlan Review Project - RFP RFQ issued in March 2018. More information about the overall LandMARC system and services can be found here.
g04-recycling.txt
(2017) concerns a web application where recycling and waste disposal facilities can be searched and located. The application operates through the visualization of a map that the user can interact with. The dataset was obtained from a GitHub website and it is at the basis of a students' project on website design; the code is available (no license).
g05-openspending.txt
(2018) is about the OpenSpending project (www), a project of the Open Knowledge foundation which aims at transparency about how local governments spend money. At the time of the collection, the data was retrieved from a Trello board that is currently unavailable. The sample focuses on publishing, importing and editing datasets, and how the data should be presented. Currently, OpenSpending is managed via a GitHub repository which contains multiple sub-projects with unknown license.
g11-nsf.txt
(2018) refers to a collection of user stories referring to the NSF Site Redesign & Content Discovery project, which originates from a publicly accessible GitHub repository (GPL 2.0 license). In particular, the user stories refer to an early version of the NSF's website. The user stories can be found as closed Issues.
g08-frictionless.txt
(2016) regards the Frictionless Data project, which offers an open source dataset for building data infrastructures, to be used by researchers, data scientists, and data engineers. Links to the many projects within the Frictionless Data project are on GitHub (with a mix of Unlicense and MIT license) and web. The specific set of user stories has been collected in 2016 by GitHub user @danfowler and are stored in a Trello board.
g14-datahub.txt
(2013) concerns the open source project DataHub, which is currently developed via a GitHub repository (the code has Apache License 2.0). DataHub is a data discovery platform which has been developed over multiple years. The specific data set is an initial set of user stories, which we can date back to 2013 thanks to a comment therein.
g16-mis.txt
(2015) is a collection of user stories that pertains to a repository for researchers and archivists. The source of the dataset is a public Trello repository. Although the user stories do not have explicit links to projects, it can be inferred that the stories originate from some project related to the library of Duke University.
g17-cask.txt
(2016) refers to the Cask Data Application Platform (CDAP). CDAP is an open source application platform (GitHub, under Apache License 2.0) that can be used to develop applications within the Apache Hadoop ecosystem, an open-source framework which can be used for distributed processing of large datasets. The user stories are extracted from a document that includes requirements regarding dataset management for Cask 4.0, which includes the scenarios, user stories and a design for the implementation of these user stories. The raw data is available in the following environment.
g18-neurohub.txt
(2012) is concerned with the NeuroHub platform, a neuroscience data management, analysis and collaboration platform for researchers in neuroscience to collect, store, and share data with colleagues or with the research community. The user stories were collected at a time when NeuroHub was still a research project sponsored by the UK Joint Information Systems Committee (JISC). For information about the research project from which the requirements were collected, see the following record.
g22-rdadmp.txt
(2018) is a collection of user stories from the Research Data Alliance's working group on DMP Common Standards. Their GitHub repository contains a collection of user stories that were created by asking the community to suggest functionality that should be part of a website that manages data management plans. Each user story is stored as an issue on the GitHub page.
g23-archivesspace.txt
(2012-2013) refers to ArchivesSpace: an open source, web application for managing archives information. The application is designed to support core functions in archives administration such as accessioning; description and arrangement of processed materials including analog, hybrid, and
born digital content; management of authorities and rights; and reference service. The application supports collection management through collection management records, tracking of events, and a growing number of administrative reports. ArchivesSpace is open source and its
Simplify your research data collection with the help of the research data repository managed by the Terrestrial Ecosystem Research Network. Our collection of ecosystem data includes ecoacoustics, bioacoustics, leaf area index information and much more.
The TERN research data collection provides analysis-ready environment data that facilitates a wide range of ecological research projects undertaken by established and emerging scientists from Australia and around the world. The resources which we provide support scientific investigation in a wide array of environment and climate research fields along with decision-making initiatives.
Open access ecosystem data collections via the TERN Data Discovery Portal and sub-portals:
Access all TERN Environment Data
Discover datasets published by TERN’s observing platforms and collaborators. Search geographically, then browse, query and extract the data via the TERN Data Discovery Portal.
Search EcoPlots data
Search, integrate and access Australia’s plot-based ecology survey data.
Download ausplotsR
Extract, prepare, visualise and analyse TERN Ecosystem Surveillance monitoring data in R.
Search EcoImages
Search and download Leaf Area Index (LAI), Phenocam and Photopoint images.
Tools that support the discovery, analysis and re-use of data:
Visualise the data
We’ve teamed up with ANU to provide 50 landscape and ecosystem datasets presented graphically.
Access CoESRA Virtual Desktop
A virtual desktop environment that enables users to create, execute and share environmental data simulations.
Submit data with SHaRED
Our user friendly tool to upload your data securely to our environment database so you can contribute to Australia’s ecological research.
The Soil and Landscape Grid of Australia provides relevant, consistent, comprehensive, nation-wide data in an easily-accessible format. It provides detailed digital maps of the country’s soil and landscape attributes at a finer resolution than ever before in Australia.
The annual Australia’s Environment products summarise a large amount of observations on the trajectory of our natural resources and ecosystems. Use the data explorer to view and download maps, accounts or charts by region and land use type. The website also has national summary reports and report cards for different types of administrative and geographical regions.
TERN’s ausplotsR is an R package for extracting, preparing, visualising and analysing TERN’s Ecosystem Surveillance monitoring data. Users can use the package to directly access plot-based data on vegetation and soils across Australia, with simple function calls to extract the data and merge them into species occurrence matrices for analysis or to calculate things like basal area and fractional cover.
The Australian Cosmic-Ray Neutron Soil Moisture Monitoring Network (CosmOz) delivers soil moisture data for 16 sites over an area of about 30 hectares, at soil depths of between 10 and 50 cm. In 2020, the CosmOz soil moisture network, which is led by CSIRO, is set to be expanded to 23 sites.
The TERN Mangrove Data Portal provides a diverse range of historical and contemporary remotely-sensed datasets on extent and change of mangrove ecosystems across Australia. It includes multi-scale field measurements of mangrove floristics, structure and biomass, a diverse range of airborne imagery collected since the 1950s, and multispectral and hyperspectral imagery captured by drones, aircraft and satellites.
The TERN Wetlands and Riparian Zones Data Portal provides access to relevant national to local remotely-sensed datasets and also facilitates the collation and collection of on-ground data that support validation.
ecocloud provides easy access to large volumes of curated ecosystem science data and tools, a computing platform and resources and tools for innovative research. ecocloud gives you 10GB of persistent storage to keep your code/notebooks so they are ready to go when you start up a server (R or Python environment). It uses the JupyterLabs interface, which includes connections to GitHub, Google Drive and Dropbox.
Our research data collection makes it easier for scientists and researchers to investigate and answer their questions by providing them with open data, research and management tools, infrastructure, and site-based research tools.
The TERN data portal provides open access ecosystem data. Our tools support data discovery, analysis, and re-use. The services which we provide facilitate research, education, and management. We maintain a network of monitoring site and sensor data streams for long-term research as part of our research data repository.
https://www.icpsr.umich.edu/web/ICPSR/studies/6358/terms
In 1986, the Bureau of Justice Assistance (BJA) funded a demonstration project of intensive supervision programs (ISPs), alternatives to control sanctions that involve community sanctions and emphasize stringent conditions and close monitoring of convicted offenders. The primary intent of the demonstration project was to determine the effects of participation in an ISP program on the subsequent behavior of offenders and to test the feasibility of the ISP's stated objectives: (1) to reduce recidivism by providing a seemingly cost-effective alternative to imprisonment, and (2) to provide an intermediate punishment between incarceration and regular probation that allows the punishment to fit the crime. Fourteen sites in nine states participated in the project and each of the selected sites was funded for 18 to 24 months. Individual agencies in each site tailored their ISP programs to their local needs, resources, and contexts, developed their own eligibility criteria, and determined whether probationers met those criteria. While the individual ISP projects differed, each site was required to follow identical procedures regarding random assignment, data collection, and overall program evaluation. Data collection instruments that differed in the amount of drug-related questions asked were used for the six- and twelve-month reviews. The "non-drug" data collection instrument, used in Contra Costa, Ventura, and Los Angeles counties, CA, Marion County, OR, and Milwaukee, WI, gathered drug data only on the number of monthly drug and alcohol tests given to offenders. The "drug" data collection instrument was distributed in Atlanta, Macon, and Waycross, GA, Seattle, WA, Santa Fe, NM, Des Moines, IA, and Winchester, VA. Variables regarding drug use included the number of drug tests ordered, the number of drug tests taken, and the number of positives for alcohol, cocaine, heroin, uppers, downers, quaaludes, LSD/hallucinogens, PCP, marijuana/hashish, and "other". The drug questions on the instrument used in Dallas and Houston, TX, were the same as those asked at the drug sites. Once a site determined that an offender was eligible for inclusion, RAND staff randomly assigned the offender to either the experimental ISP program (prison diversion, enhanced probation, or enhanced parole) or to a control sanction (prison, routine probation, or parole). Assignment periods began in January 1987 and some sites continued to accept cases through January 1990. Each offender was followed for a period of one year, beginning on the day of assignment to the experimental or control program. The six-month and twelve-month review data contain identical variables: the current status of the offender (prison, ISP, or terminated), record of each arrest and/or technical violation, its disposition, and sentence or sanction. Information was also recorded for each month during the follow-up regarding face-to-face contacts, phone and collateral contacts, monitoring and record checks, community service hours, days on electronic surveillance (if applicable), contacts between client and community sponsor, number and type of counseling sessions and training, days in paid employment and earnings, number of drug and alcohol tests taken, and amount of restitution, fines, court costs, and probation fees paid. 
Background variables include sex, race, age at assignment, prior criminal history, drug use and treatment history, type of current offense, sentence characteristics, conditions imposed, and various items relating to risk of recidivism and need for treatment. For the two Texas sites, information on each arrest and/or technical violation, its disposition, and sentence or sanction was recorded in separate recidivism files (Parts 10 and 17). Dates were converted by RAND to time-lapse variables for the public release files that comprise this data collection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of data collection, abstraction, and management at audit sites A–G.