When asked about "Attitudes towards the internet", most Mexican respondents pick "It is important to me to have mobile internet access in any place at any time" as an answer. 55 percent did so in our online survey in 2024.
The Easiest Way to Collect Data from the Internet
Download anything you see on the internet into spreadsheets within a few clicks using our ready-made web crawlers, or with a few lines of code using our APIs.
We have made it as simple as possible to collect data from websites.
Easy to Use Crawlers
Amazon Product Details and Pricing Scraper: Get product information, pricing, FBA, best seller rank, and much more from Amazon.
Google Maps Search Results: Get details like place name, phone number, address, website, ratings, and open hours from Google Maps or Google Places search results.
Twitter Scraper: Get tweets, Twitter handle, content, number of replies, number of retweets, and more. All you need to provide is a URL to a profile, hashtag, or an advanced search URL from Twitter.
Amazon Product Reviews and Ratings: Get customer reviews for any product on Amazon and get details like product name, brand, reviews and ratings, and more.
Google Reviews Scraper: Scrape Google reviews and get details like business or location name, address, review, ratings, and more for businesses and places.
Walmart Product Details & Pricing: Get the product name, pricing, number of ratings, reviews, product images, URL, and other product-related data from Walmart.
Amazon Search Results Scraper: Get product search rank, pricing, availability, best seller rank, and much more from Amazon.
Amazon Best Sellers: Get the bestseller rank, product name, pricing, number of ratings, rating, product images, and more from any Amazon Bestseller List.
Google Search Scraper: Scrape Google search results and get details like search rank, paid and organic results, knowledge graph, related search results, and more.
Walmart Product Reviews & Ratings: Get customer reviews for any product on Walmart.com and get details like product name, brand, reviews, and ratings.
Scrape Emails and Contact Details: Get emails, addresses, contact numbers, and social media links from any website.
Walmart Search Results Scraper: Get product details such as pricing, availability, reviews, ratings, and more from Walmart search results and categories.
Glassdoor Job Listings: Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Glassdoor.
Indeed Job Listings: Scrape job details such as job title, salary, job description, location, company name, number of reviews, and ratings from Indeed.
LinkedIn Jobs Scraper (Premium): Scrape job listings on LinkedIn and extract job details such as job title, job description, location, company name, number of reviews, and more.
Redfin Scraper (Premium): Scrape real estate listings from Redfin. Extract property details such as address, price, mortgage, Redfin estimate, broker name and more.
Yelp Business Details Scraper: Scrape business details such as phone number, address, website, and more from Yelp search and business details pages.
Zillow Scraper (Premium): Scrape real estate listings from Zillow. Extract property details such as address, price, broker, broker name and more.
Amazon Product Offers and Third Party Sellers: Get product pricing, delivery details, FBA, seller details, and much more from the Amazon offer listing page.
Realtor Scraper (Premium): Scrape real estate listings from Realtor.com. Extract property details such as address, price, area, broker and more.
Target Product Details & Pricing: Get product details from search results and category pages such as pricing, availability, rating, reviews, and 20+ data points from Target.
Trulia Scraper (Premium): Scrape real estate listings from Trulia. Extract property details such as address, price, area, mortgage and more.
Amazon Customer FAQs: Get FAQs for any product on Amazon and get details like the question, answer, answered user name, and more.
Yellow Pages Scraper: Get details like business name, phone number, address, website, ratings, and more from Yellow Pages search results.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Afghanistan Internet Usage: Search Engine Market Share: All Platforms: WEB.DE data was reported at 0.000 % in 03 Sep 2024. This stayed constant from the previous number of 0.000 % for 02 Sep 2024. Afghanistan Internet Usage: Search Engine Market Share: All Platforms: WEB.DE data is updated daily, averaging 0.000 % from Jul 2024 (Median) to 03 Sep 2024, with 24 observations. The data reached an all-time high of 0.090 % in 30 Aug 2024 and a record low of 0.000 % in 03 Sep 2024. Afghanistan Internet Usage: Search Engine Market Share: All Platforms: WEB.DE data remains active status in CEIC and is reported by Statcounter Global Stats. The data is categorized under Global Database’s Afghanistan – Table AF.SC.IU: Internet Usage: Search Engine Market Share.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Poland Internet Usage: Search Engine Market Share: Tablet: WEB.DE data was reported at 0.000 % in 01 Jul 2024. This stayed constant from the previous number of 0.000 % for 30 Jun 2024. Poland Internet Usage: Search Engine Market Share: Tablet: WEB.DE data is updated daily, averaging 0.000 % from Jun 2024 (Median) to 01 Jul 2024, with 13 observations. The data reached an all-time high of 0.090 % in 27 Jun 2024 and a record low of 0.000 % in 01 Jul 2024. Poland Internet Usage: Search Engine Market Share: Tablet: WEB.DE data remains active status in CEIC and is reported by Statcounter Global Stats. The data is categorized under Global Database’s Poland – Table PL.SC.IU: Internet Usage: Search Engine Market Share.
The population share with mobile internet access in North America was forecast to increase by a total of 2.9 percentage points between 2024 and 2029. This overall increase does not happen continuously, notably not in 2028 and 2029. The mobile internet penetration is estimated to amount to 84.21 percent in 2029. Notably, the population share with mobile internet access had been increasing continuously over the past years. The penetration rate refers to the share of the total population having access to the internet via a mobile broadband connection. The shown data are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press and they are processed to generate comparable data sets (see supplementary notes under details for more information). Find more key insights for the population share with mobile internet access in regions like the Caribbean and Europe.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Internet Usage: Search Engine Market Share: Tablet: WEB.DE data was reported at 0.100 % in 11 Mar 2025. This records an increase from the previous number of 0.000 % for 10 Mar 2025. Internet Usage: Search Engine Market Share: Tablet: WEB.DE data is updated daily, averaging 0.000 % from Mar 2025 (Median) to 11 Mar 2025, with 5 observations. The data reached an all-time high of 0.100 % in 11 Mar 2025 and a record low of 0.000 % in 10 Mar 2025. Internet Usage: Search Engine Market Share: Tablet: WEB.DE data remains active status in CEIC and is reported by Statcounter Global Stats. The data is categorized under Global Database’s Italy – Table IT.SC.IU: Internet Usage: Search Engine Market Share.
Success.ai’s Online Search Trends Data API empowers businesses, marketers, and product teams to stay ahead by monitoring real-time online search behaviors of over 700 million users worldwide. By tapping into continuously updated, AI-validated data, you can track evolving consumer interests, pinpoint emerging keywords, and better understand buyer intent.
This intelligence allows you to refine product positioning, anticipate market shifts, and deliver hyper-relevant campaigns. Backed by our Best Price Guarantee, Success.ai’s solution provides the valuable insight needed to outpace competitors, adapt to changing market dynamics, and consistently meet consumer expectations.
Why Choose Success.ai’s Online Search Trends Data API?
Real-Time Global Insights
AI-Validated Accuracy
Continuous Data Updates
Ethical and Compliant
Data Highlights:
Key Features of the Online Search Trends Data API:
On-Demand Trend Analysis
Advanced Filtering and Segmentation
Real-Time Validation and Reliability
Scalable and Flexible Integration
Strategic Use Cases:
Product Development and Innovation
Content Marketing and SEO
Market Entry and Expansion
Advertising and Campaign Optimization
Why Choose Success.ai?
Best Price Guarantee
Seamless Integration
Data Accuracy with AI Validation
Customizable and Scalable Solutions
Additional APIs for Enhanced Functionality:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States - Total Revenue for Internet Publishing and Broadcasting and Web Search Portals, Establishments Subject to Federal Income Tax, Employer Firms was 338743.00000 Mil. of $ in January of 2022, according to the United States Federal Reserve. Historically, United States - Total Revenue for Internet Publishing and Broadcasting and Web Search Portals, Establishments Subject to Federal Income Tax, Employer Firms reached a record high of 338743.00000 in January of 2022 and a record low of 9071.00000 in January of 2002. Trading Economics provides the current actual value, an historical data chart and related indicators for United States - Total Revenue for Internet Publishing and Broadcasting and Web Search Portals, Establishments Subject to Federal Income Tax, Employer Firms - last updated from the United States Federal Reserve on March of 2025.
Success.ai's Web and Search Trends Intent Data equips businesses with the cutting-edge capability to track and analyze online behaviors and search engine data effectively. This service is essential for understanding current market trends, optimizing advertising strategies, and enhancing B2B marketing efforts. By capturing and analyzing data from across the web, including search engine behaviors and purchase intent signals, Success.ai offers invaluable insights that can drastically improve your strategic outcomes.
Harness the Power of Web Search Data: Gain access to vast amounts of web search data to understand what your potential customers are searching for and how they interact with the web. This information is crucial for refining SEO strategies, improving website content, and creating more engaging user experiences.
Advanced Analysis of Online Search Trends: Stay ahead of the competition by leveraging detailed insights into online search trends. Success.ai helps you identify emerging trends, monitor industry movements, and anticipate market changes with precision.
Drive Marketing and Advertising Success: Utilize detailed search trend data to tailor your marketing and advertising campaigns. By understanding the specific interests and needs of your target audience, you can create more effective campaigns that resonate with potential customers and result in higher conversion rates.
B2B Intent Data to Fuel Sales Strategies: Our B2B intent data provides a deep dive into the purchase intentions of businesses, helping sales teams prioritize leads that show a high likelihood of conversion. This targeted approach ensures that your sales efforts are focused and efficient.
Key Benefits of Choosing Success.ai:
Use Cases for Success.ai's Web Search and Intent Data:
Get started with Success.ai today to leverage our advanced web and search trends intent data, and take your business to new heights with insights that drive real results.
Contact us now to learn more about our services and how we can help you capitalize on the latest online trends, for the best possible price.
DataForSEO Labs API offers three powerful keyword research algorithms and historical keyword data:
• Related Keywords from the “searches related to” element of Google SERP.
• Keyword Suggestions that match the specified seed keyword with additional words before, after, or within the seed key phrase.
• Keyword Ideas that fall into the same category as specified seed keywords.
• Historical Search Volume with current cost-per-click, and competition values.
Based on in-market categories of Google Ads, you can get keyword ideas from the relevant Categories For Domain and discover relevant Keywords For Categories. You can also obtain Top Google Searches with AdWords and Bing Ads metrics, product categories, and Google SERP data.
You will find well-rounded ways to scout the competitors:
• Domain Whois Overview with ranking and traffic info from organic and paid search.
• Ranked Keywords that any domain or URL has positions for in SERP.
• SERP Competitors and the rankings they hold for the keywords you specify.
• Competitors Domain with a full overview of its rankings and traffic from organic and paid search.
• Domain Intersection keywords for which both specified domains rank within the same SERPs.
• Subdomains for the target domain you specify along with the ranking distribution across organic and paid search.
• Relevant Pages of the specified domain with rankings and traffic data.
• Domain Rank Overview with ranking and traffic data from organic and paid search.
• Historical Rank Overview with historical data on rankings and traffic of the specified domain from organic and paid search.
• Page Intersection keywords for which the specified pages rank within the same SERP.
All DataForSEO Labs API endpoints function in the Live mode. This means you will be provided with the results in response right after sending the necessary parameters with a POST request.
The limit is 2000 API calls per minute; however, you can contact our support team if your project requires higher rates.
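A client can stay under a per-minute call limit like the one stated above by throttling itself before each request. The following is a minimal sketch of a sliding-window limiter. It is not part of any DataForSEO client library: the class name `SlidingWindowLimiter` and its parameters are hypothetical, and only the figure of 2000 calls per minute comes from the text.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any `period`-second window.

    Hypothetical helper for illustration; the defaults mirror the
    2000-calls-per-minute limit quoted in the text.
    """

    def __init__(self, max_calls=2000, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._stamps = deque()   # start times of calls still inside the window

    def acquire(self):
        """Block until one more call fits in the window, then record it."""
        while True:
            now = self._clock()
            # Forget calls that have left the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return
            # Window is full: wait until the oldest call expires.
            self._sleep(self.period - (now - self._stamps[0]))
```

Calling `limiter.acquire()` immediately before each POST request keeps the client within the window; the server-side limit still applies regardless.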
We offer well-rounded API documentation, GUI for API usage control, comprehensive client libraries for different programming languages, free sandbox API testing, ad hoc integration, and deployment support.
We have a pay-as-you-go pricing model. You simply add funds to your account and use them to get data. The account balance doesn't expire.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Checkmynet.lu is a measurement tool to test the speed and quality of internet connections. It was published by ILR (Luxembourg Institute of Regulation, www.ilr.lu). The published data, accessible via API or simple *.csv download, contains the test results such as download speed, upload speed, operator, equipment used, GPS coordinates, and so on. Checkmynet.lu is an independent, crowd-sourced, open-source and open-data based solution:
• Designed to measure availability, quality and neutrality of the internet
• Generates and processes all results objectively, securely and transparently
• Tests 150+ parameters: speed, Quality of Service & Quality of Experience
• Runs on Android, iOS, web browsers
• Displays results on a map with several filter options
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
United States - Producer Price Index by Industry: Internet Publishing and Web Search Portals: Internet Publishing and Web Search Portals - Display and Other Advertising Sales was 87.20000 Index Dec 2009=100 in October of 2020, according to the United States Federal Reserve. Historically, United States - Producer Price Index by Industry: Internet Publishing and Web Search Portals: Internet Publishing and Web Search Portals - Display and Other Advertising Sales reached a record high of 109.60000 in November of 2010 and a record low of 68.60000 in April of 2014. Trading Economics provides the current actual value, an historical data chart and related indicators for United States - Producer Price Index by Industry: Internet Publishing and Web Search Portals: Internet Publishing and Web Search Portals - Display and Other Advertising Sales - last updated from the United States Federal Reserve on March of 2025.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Forecast: Internet Publishing and Broadcasting and Web Search Portals Industry Gross Output in the US 2024 - 2028 Discover more data with ReportLinker!
Against the background of these requirements for sensor calibration, intercalibration and product validation, the subgroup on Calibration and Validation of the Committee on Earth Observation Satellites (CEOS) formulated the following recommendation during the plenary session held in China at the end of 2004, with the goal of setting up and operating an internet-based system to provide sensor data, protocols and guidelines for these purposes:

Background: Reference Datasets are required to support the understanding of climate change and quality assure operational services by Earth Observing satellites. The data from different sensors and the resulting synergistic data products require a high level of accuracy that can only be obtained through continuous traceable calibration and validation activities.

Requirement: Initiate an activity to document a reference methodology to predict Top of Atmosphere (TOA) radiance for which currently flying and planned wide swath sensors can be intercompared, i.e. define a standard for traceability. Also create and maintain a fully accessible web page containing, on an instrument basis, links to all instrument characteristics needed for intercomparisons as specified above, ideally in a common format. In addition, create and maintain a database (e.g. SADE) of instrument data for specific vicarious calibration sites, including site characteristics, in a common format. Each agency is responsible for providing data for their instruments in this common format.

Recommendation: The required activities described above should be supported for an implementation period of two years and a maintenance period over two subsequent years. The CEOS should encourage a member agency to accept the lead role in supporting this activity. CEOS should request all member agencies to support this activity by providing appropriate information and data in a timely manner.
Pseudo-Invariant Calibration Sites (PICS): Mauritania 2 is one of six CEOS reference Pseudo-Invariant Calibration Sites (PICS) that are CEOS Reference Test Sites. Besides the nominally good site characteristics (temporal stability, uniformity, homogeneity, etc.), these six PICS were selected by also taking into account their heritage, the large number of datasets from multiple instruments that already existed in the EO archives, and the long history of characterization performed over these sites. The PICS have high reflectance and are usually made up of sand dunes with climatologically low aerosol loading and practically no vegetation. Consequently, these PICS can be used to evaluate the long-term stability of an instrument and to facilitate inter-comparison of multiple instruments.
This dataset was created by Tetiana Klimonova
It contains the following files:
Delve into Serpstat's comprehensive website ratings, providing an in-depth analysis of websites across 1K+ industries and 229 countries. Our datasets offer a wealth of valuable metrics, including domain global rating, domain rating within the category, and domain category (e.g., Shopping/Apparel/Footwear). Gain insights into domain estimated search traffic, domain visibility (an indicator of domain visibility in the top 20 Google results), number of domain SEO keywords, number of referring domains, number of backlinks, and Serpstat Domain Rank (a domain authority indicator).
With these robust metrics, our datasets empower businesses to make informed decisions and optimize their online strategies effectively. Whether you're conducting competitor analysis, market research, or SEO optimization, Serpstat's website ratings provide the comprehensive insights needed to drive success in today's digital landscape. Plus, our datasets are refreshed on demand, ensuring that you always have access to the most up-to-date information for strategic decision-making.
Datos brings to market anonymized, consolidated, privacy-secured datasets at scale, with granularity rarely found in the market. Datos offers access to the desktop and mobile browsing behavior of millions of users across the globe, packaged into clean, easy-to-understand data products and reports for use by our clients.
The Datos Search Events Feed is an event-level accounting of what searches are being executed on specific properties, potentially supporting both organic and paid results, and what clicks, if any, result from those searches. This feed can be delivered on a daily basis, delivering the previous day’s data, and can be filtered by any of the fields, but the most common ways our customers segment this feed are by:
Country Code (focus on a specific market)
Search Engine (focus on activity within a specific domain or set of domains)
Target Domain (focus on specific landing domains to understand inbound keywords)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Evolution of the Manosphere Across the Web
We make available data related to subreddit and standalone forums from the manosphere.
We also make available Perspective API annotations for all posts.
You can find the code in GitHub.
Please cite this paper if you use this data:
@article{ribeiroevolution2021,
  title={The Evolution of the Manosphere Across the Web},
  author={Ribeiro, Manoel Horta and Blackburn, Jeremy and Bradlyn, Barry and De Cristofaro, Emiliano and Stringhini, Gianluca and Long, Summer and Greenberg, Stephanie and Zannettou, Savvas},
  booktitle = {{Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM'21)}},
  year={2021}
}
We make available data for forums and for relevant subreddits (56 of them, as described in subreddit_descriptions.csv). The Reddit data is available, one line per post, in /ndjson/reddit.ndjson. A sample line is:
{ "author": "Handheld_Gaming", "date_post": 1546300852, "id_post": "abcusl", "number_post": 9.0, "subreddit": "Braincels", "text_post": "Its been 2019 for almost 1 hour And I am at a party with 120 people, half of them being foids. The last year had been the best in my life. I actually was happy living hope because I was redpilled to the death.
Now that I am blackpilled I see that I am the shortest of all men and that I am the only one with a recessed jaw.
Its over. Its only thanks to my age old friendship with chads and my social skills I had developed in the past year that a lot of men like me a lot as a friend.
No leg lengthening syrgery is gonna save me. Ignorance was a bliss. Its just horror now seeing that everyone can make out wirth some slin hoe at the party.
I actually feel so unbelivably bad for turbomanlets. Life as an unattractive manlet is a pain, I cant imagine the hell being an ugly turbomanlet is like. I would have roped instsntly if I were one. Its so unfair.
Tallcels are fakecels and they all can (and should) suck my cock.
If I were 17cm taller my life would be a heaven and I would be the happiest man alive.
Just cope and wait for affordable body tranpslants.", "thread": "t3_abcusl" }
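A minimal sketch of reading such a file line by line is given below. It assumes each post occupies exactly one physical line in the file (the line breaks inside text_post in the sample above would then be escaped in the real file) and that date_post is a Unix timestamp in seconds, which matches the sample value. The helper names `iter_posts` and `post_time` are hypothetical.

```python
import json
from datetime import datetime, timezone


def iter_posts(lines):
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def post_time(post):
    """Interpret `date_post` as Unix seconds (an assumption) and return UTC."""
    return datetime.fromtimestamp(post["date_post"], tz=timezone.utc)


# Typical use with the file named in the text:
# with open("ndjson/reddit.ndjson", encoding="utf-8") as fh:
#     for post in iter_posts(fh):
#         print(post["subreddit"], post_time(post).date())
```

Streaming the file this way avoids loading every post into memory at once.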
We here describe the .sqlite and .ndjson files that contain the data from the following forums.
(avfm) --- https://d2ec906f9aea-003845.vbulletin.net (incels) --- https://incels.co/ (love_shy) --- http://love-shy.com/lsbb/ (redpilltalk) --- https://redpilltalk.com/ (mgtow) --- https://www.mgtow.com/forums/ (rooshv) --- https://www.rooshvforum.com/ (pua_forum) --- https://www.pick-up-artist-forum.com/ (the_attraction) --- http://www.theattractionforums.com/
The files are in folders /sqlite/ and /ndjson.
2.1 .sqlite
All the tables in the .sqlite datasets follow a very simple {key:value} format. Each key is a thread name (for example /threads/housewife-is-like-a-job.123835/) and each value is a Python dictionary or a list. Each file contains three tables:
idx: each key is the relative address to a thread and maps to a post. Each post is represented by a dict:
"type": (list) in some forums you can add a descriptor such as [RageFuel] to each topic, and you may also have special types of posts, like sticky/pool/locked posts;
"title": (str) title of the thread;
"link": (str) link to the thread;
"author_topic": (str) username that created the thread;
"replies": (int) number of replies, may differ from number of posts due to difference in crawling date;
"views": (int) number of views;
"subforum": (str) name of the subforum;
"collected": (bool) indicates if raw posts have been collected;
"crawled_idx_at": (str) datetime of the collection.
processed_posts: each key is the relative address to a thread and maps to a list with posts (in order). Each post is represented by a dict:
"author": (str) author's username;
"resume_author": (str) author's little description;
"joined_author": (str) date author joined;
"messages_author": (int) number of messages the author has;
"text_post": (str) text of the main post;
"number_post": (int) number of the post in the thread;
"id_post": (str) unique post identifier (depends), for sure unique within thread;
"id_post_interaction": (list) list with other posts ids this post quoted;
"date_post": (str) datetime of the post;
"links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
"thread": (str) same as key;
"crawled_at": (str) datetime of the collection.
raw_posts: each key is the relative address to a thread and maps to a list with unprocessed posts (in order). Each post is represented by a dict:
"post_raw": (binary) raw html binary;
"crawled_at": (str) datetime of the collection.
2.2 .ndjson
Each line consists of a JSON object representing a different comment with the following fields:
"author": (str) author's username;
"resume_author": (str) author's little description;
"joined_author": (str) date author joined;
"messages_author": (int) number of messages the author has;
"text_post": (str) text of the main post;
"number_post": (int) number of the post in the thread;
"id_post": (str) unique post identifier (depends), for sure unique within thread;
"id_post_interaction": (list) list with other posts ids this post quoted;
"date_post": (str) datetime of the post;
"links": (tuple) nice tuple with the url parsed, e.g. ('https', 'www.youtube.com', '/S5t6K9iwcdw');
"thread": (str) same as key;
"crawled_at": (str) datetime of the collection.
We also ran each forum post and Reddit post through the Perspective API. The output files are located in the /perspective/ folder and are compressed with gzip. One example output:
{ "id_post": 5200, "hate_output": { "text": "I still can\u2019t wrap my mind around both of those articles about these c~~~s sleeping with poor Haitian Men. Where\u2019s the uproar?, where the hell is the outcry?, the \u201cpig\u201d comments or the \u201ccreeper comments\u201d. F~~~ing hell, if roles were reversed and it was an article about Men going to Europe where under 18 sex in legal, you better believe they would crucify the writer of that article and DEMAND an apology by the paper that wrote it.. This is exactly what I try and explain to people about the double standards within our modern society. A bunch of older women, wanna get their kicks off by sleeping with poor Men, just before they either hit or are at menopause age. F~~~ing unreal, I\u2019ll never forget going to Sweden and Norway a few years ago with one of my buddies and his girlfriend who was from there, the legal age of consent in Norway is 16 and in Sweden it\u2019s 15. I couldn\u2019t believe it, but my friend told me \u201c hey, it\u2019s normal here\u201d . Not only that but the age wasn\u2019t a big different in other European countries as well. One thing i learned very quickly was how very Misandric Sweden as well as Denmark were.", "TOXICITY": 0.6079781, "SEVERE_TOXICITY": 0.53744453, "INFLAMMATORY": 0.7279288, "PROFANITY": 0.58842486, "INSULT": 0.5511079, "OBSCENE": 0.9830818, "SPAM": 0.17009115 } }
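The annotation files can be scanned without decompressing them to disk. The sketch below assumes each gzip file holds one JSON object per line with the layout of the example above; the one-object-per-line layout is an assumption, and the helper names and the 0.8 threshold are hypothetical.

```python
import gzip
import json


def iter_scored(path):
    """Yield annotation records from a gzip file, one JSON object per line (assumed)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def above_threshold(records, attribute="TOXICITY", threshold=0.8):
    """Return the id_post of every record whose score for `attribute` is >= threshold."""
    return [
        rec["id_post"]
        for rec in records
        if rec["hate_output"].get(attribute, 0.0) >= threshold
    ]
```

The threshold is a placeholder; which cutoff is appropriate depends on the analysis.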
A nice way to read some of the files of the dataset is using SqliteDict, for example:
from sqlitedict import SqliteDict

# open the processed_posts table of one forum as a dict-like object
processed_posts = SqliteDict("./data/forums/incels.sqlite", tablename="processed_posts")

for key, posts in processed_posts.items():
    for post in posts:
        # here you could do something with each post in the dataset
        pass
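As one illustration of "doing something with each post", the sketch below counts posts per author. It works on any mapping from thread key to a list of post dicts with an "author" field, as described for the processed_posts table. The function name `posts_per_author` is hypothetical.

```python
from collections import Counter


def posts_per_author(threads):
    """Count posts per author across a mapping of thread key -> list of post dicts."""
    counts = Counter()
    for _thread, posts in threads.items():
        for post in posts:
            counts[post["author"]] += 1
    return counts
```

Passing the `processed_posts` object opened above should work in the same way as passing a plain dict, since only `.items()` is used.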
Additionally, we provide two .sqlite files that are helpers used in the analyses. These are related to reddit, and not to the forums! They are:
channel_dict.sqlite: a sqlite file where each key corresponds to a subreddit and values are lists of dictionaries of the users who posted on it, along with timestamps.
author_dict.sqlite a sqlite where each key corresponds to an author and values are lists of dictionaries of the subreddits they posted on, along with timestamps.
These are used in the paper for the migration analyses.
Although we did our best to clean the data and be consistent across forums, this was not always possible. In the following subsections we discuss the particularities of each forum, note directions for improving the parsing that were not pursued, and give some examples of how things work in each forum.
6.1 incels
Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.
types: for the incel forums the special types associated with each thread in the idx table are “Sticky”, “Pool”, “Closed”, and the custom types added by users, such as [LifeFuel]. These last ones are all in brackets. You can see some examples of these on the example thread page.
quotes: quotes in this forum were quite nice and thus, all quotations are deterministic.
6.2 LoveShy
Check out an archived version of the front page, the thread page and a post page, as well as a dump of the data stored for a thread page and a post page.
types: no types were parsed. There are some rules in the forum, but not significant.
quotes: quotes were obtained from exact text+author match, or author match + a jaccard
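The description of the matching rule is cut off in the source, so the exact procedure is not known. As a hedged illustration only, a Jaccard similarity over word sets, combined with an author match, could look like the sketch below. The tokenization, the 0.5 threshold and the function names are all assumptions, not the authors' implementation.

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase whitespace-token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def best_match(quote_text, quote_author, posts, min_score=0.5):
    """Return the post by `quote_author` most similar to `quote_text`, or None.

    `min_score` is a hypothetical cutoff; the source does not state one.
    """
    best, best_score = None, min_score
    for post in posts:
        if post["author"] != quote_author:
            continue
        score = jaccard(quote_text, post["text_post"])
        if score >= best_score:
            best, best_score = post, score
    return best
```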
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
AIT Log Data Sets
This repository contains synthetic log data suitable for evaluation of intrusion detection systems. The logs were collected from four independent testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by Landauer et al. (2020) [1]. Please refer to the paper for more detailed information on automatic testbed generation and cite it if the data is used for academic publications. In brief, each testbed simulates user accesses to a webserver that runs Horde Webmail and OkayCMS. The duration of the simulation is six days. On the fifth day (2020-03-04) two attacks are launched against each web server.
The archive AIT-LDS-v1_0.zip contains the directories "data" and "labels".
The data directory is structured as follows. Each directory mail.
Setup details of the web servers:
OS: Debian Stretch 9.11.6
Services:
Apache2
PHP7
Exim 4.89
Horde 5.2.22
OkayCMS 2.3.4
Suricata
ClamAV
MariaDB
Setup details of user machines:
OS: Ubuntu Bionic
Services:
Chromium
Firefox
User host machines are assigned to web servers in the following way:
mail.cup.com is accessed by users from host machines user-{0, 1, 2, 6}
mail.spiral.com is accessed by users from host machines user-{3, 5, 8}
mail.insect.com is accessed by users from host machines user-{4, 9}
mail.onion.com is accessed by users from host machines user-{7, 10}
The following attacks are launched against the web servers (different starting times for each web server, please check the labels for exact attack times):
Attack 1: multi-step attack with sequential execution of the following attacks:
nmap scan
nikto scan
smtp-user-enum tool for account enumeration
hydra brute force login
webshell upload through Horde exploit (CVE-2019-9858)
privilege escalation through Exim exploit (CVE-2019-10149)
Attack 2: webshell injection through malicious cookie (CVE-2019-16885)
Attacks are launched from the following user host machines. In each of the corresponding directories user-
user-6 attacks mail.cup.com
user-5 attacks mail.spiral.com
user-4 attacks mail.insect.com
user-7 attacks mail.onion.com
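The host assignments and attacker hosts listed above can be captured in a small lookup table, for example to tell attacker directories apart from the other user directories when loading logs. The following is a minimal Python sketch; the dictionary and function names are illustrative assumptions and are not part of the data set.

```python
# Host machines (user-N) that access each web server, as listed in the description.
USERS_BY_SERVER = {
    "mail.cup.com": {0, 1, 2, 6},
    "mail.spiral.com": {3, 5, 8},
    "mail.insect.com": {4, 9},
    "mail.onion.com": {7, 10},
}

# Host machine from which the attacks against each web server are launched.
ATTACKER_BY_SERVER = {
    "mail.cup.com": 6,
    "mail.spiral.com": 5,
    "mail.insect.com": 4,
    "mail.onion.com": 7,
}


def non_attacking_users(server):
    """Return the sorted host names that access `server` but launch no attacks."""
    attacker = ATTACKER_BY_SERVER[server]
    return [f"user-{n}" for n in sorted(USERS_BY_SERVER[server] - {attacker})]


for server in USERS_BY_SERVER:
    print(server, f"attacker=user-{ATTACKER_BY_SERVER[server]}",
          "others=" + ",".join(non_attacking_users(server)))
```

In every testbed the attacking host is one of the hosts that is also assigned to that web server as a regular user machine.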
The log data collected from the web servers includes
Apache access and error logs
syscall logs collected with the Linux audit daemon
suricata logs
exim logs
auth logs
daemon logs
mail logs
syslogs
user logs
Note that, due to their large size, the audit/audit.log files of each server were compressed into a .zip archive. If these logs are needed for analysis, they must first be unzipped.
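The unzipping step can be scripted with the Python standard library. This is a minimal sketch that assumes the archives sit inside each server's audit directory; the archive file names are not stated here, so the code simply extracts every .zip file it finds in a directory named audit. The function name is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_audit_logs(root):
    """Extract every .zip archive located in a directory named 'audit' below `root`.

    Files are extracted next to their archive. Returns the extracted paths.
    Assumption: the compressed audit logs live in the 'audit' directories.
    """
    extracted = []
    for archive in sorted(Path(root).rglob("*.zip")):
        if archive.parent.name != "audit":
            continue  # skip archives that are not audit logs
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(archive.parent)
            extracted.extend(archive.parent / name for name in zf.namelist())
    return extracted
```

Calling `extract_audit_logs("data")` from the directory that holds the unpacked AIT-LDS-v1_0.zip would then leave the uncompressed audit logs beside their archives, provided the layout matches the assumption above.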
Labels are organized in the same directory structure as logs. Each label file contains two comma-separated labels for each log line: the first is based on the occurrence time, the second on similarity and ordering. Note that this procedure does not guarantee correct labels for all lines and that no manual corrections were made.
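Because the labels mirror the directory structure of the logs, a log file and its label file can be read side by side. The sketch below assumes that a label file has the same relative path and file name as its log file and that its rows appear in the same order as the log lines; both are assumptions drawn from the description, and the function names are hypothetical.

```python
from pathlib import Path


def label_path_for(log_path, archive_root):
    """Map a log file under <root>/data to the mirrored file under <root>/labels."""
    root = Path(archive_root)
    relative = Path(log_path).relative_to(root / "data")
    return root / "labels" / relative


def read_labeled_lines(log_path, label_path):
    """Yield (log_line, time_label, similarity_label) for each log line.

    Assumes one label row per log line, in the same order, with the two
    labels separated by a single comma.
    """
    with open(log_path, errors="replace") as logs, open(label_path) as labels:
        for line, label_row in zip(logs, labels):
            time_label, similarity_label = label_row.rstrip("\r\n").split(",", 1)
            yield line.rstrip("\r\n"), time_label, similarity_label
```

Keeping both labels in the output lets an evaluation compare the time-based and the similarity-based labeling, which matters because neither is guaranteed to be correct for every line.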
Version history and related data sets:
AIT-LDS-v1.0: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.
AIT-LDS-v1.1: Removed carriage return of line endings in audit.log files.
AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.
Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU project GUARD (833456).
If you use the dataset, please cite the following publication:
[1] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317.
Collecting network information on political elites using conventional methods such as surveys and text records is challenging in authoritarian and/or conflict-ridden states. I introduce a data collection method for elite networks using scraping algorithms to capture public co-appearances at political and social events. Validity checks using existing data show the method effectively replicates interaction-based networks but not networks based on behavioral similarities; in both cases, measurement error remains a concern. Applying the method to Nigeria illustrates that patronage, measured in terms of public connectivity, does not drive national-oil-company appointments. Given that theories of elite behavior aim to understand individual-level interactions, data collected with this technique are well suited to situations where intrusive data collection is costly or prohibitive.