https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by scraping different websites and then classifying them into different categories based on the extracted text.
Below are the values each column has. The column names are largely self-explanatory. website_url: the URL of the website. cleaned_website_text: the cleaned text content extracted from the website.
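A minimal sketch of loading and inspecting this dataset with pandas; the CSV file name is an assumption and may differ from the file shipped with the dataset.

```python
import pandas as pd

# Hypothetical file name; adjust to the CSV shipped with the dataset.
df = pd.read_csv("website_classification.csv")

print(df.columns.tolist())                              # expect website_url, cleaned_website_text, ...
print(df[["website_url", "cleaned_website_text"]].head())
print(df["cleaned_website_text"].str.len().describe())  # size of the extracted text per site
```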
https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Website dataset is designed to facilitate the development of models for URL-based website classification.
2) Data Utilization
(1) Website data has characteristics that:
• This dataset is crucial for training models that can automatically classify websites based on their URL structures.
(2) Website data can be used to:
• Enhance cybersecurity measures by detecting malicious websites.
• Improve content filtering systems for safer browsing experiences.
The Internet is a vast space that, while hosting a plethora of resources, also serves as a breeding ground for malicious activities. URLs are often leveraged as a primary tool by adversaries to conduct various types of cyber attacks. In response, the cybersecurity community has developed numerous techniques, with URL blacklisting being a prevalent method. However, this reactive approach falls short against the constantly evolving landscape of new malicious URLs.
This dataset aims to contribute to the proactive detection and categorization of URLs by analyzing their lexical features. It facilitates the development and testing of models capable of distinguishing between benign and malicious (malware) URLs. The dataset is divided into three main parts: training, validation, and testing sets, encompassing a broad spectrum of data points for comprehensive analysis.
The URLs are categorized into two classes:
- Benign (Good): URLs that are deemed safe and do not host any form of malicious content.
- Malware (Bad): URLs associated with malicious websites, including those that distribute malware, host phishing attempts, or engage in other harmful activities.
The benign URLs were meticulously collected from Alexa's top-ranked websites, ensuring a representation of commonly visited and trusted domains. On the other hand, the malware URLs were curated from various sources known for listing active and dangerous URLs. Each URL underwent rigorous verification to ensure its correct classification, providing a reliable basis for model training and testing.
This dataset is pivotal for researchers and cybersecurity practitioners aiming to devise effective strategies for early detection of malicious URLs. By employing lexical analysis and machine learning techniques, it is possible to identify potentially harmful URLs before they can impact users. Such proactive measures are essential in the ongoing battle against cyber threats, enhancing the overall security posture of online environments.
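As a rough illustration of the lexical analysis this dataset is meant to support, the sketch below derives a few simple character-level features from a URL string. The specific features are illustrative assumptions, not the dataset authors' feature set.

```python
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    """Derive simple lexical features from a URL string (illustrative only)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host, path = parsed.netloc, parsed.path
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(c.isdigit() for c in url),
        "num_special": sum(not c.isalnum() for c in url),
        "host_is_ip": host.replace(".", "").isdigit(),
        "path_depth": path.count("/"),
    }

print(lexical_features("http://93.184.216.34/login/update.php?id=42"))
```

Features of this kind can then be fed to any standard classifier trained on the benign/malware labels provided in the training, validation, and testing splits.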
https://creativecommons.org/publicdomain/zero/1.0/
DMOZ is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a passionate, global community of volunteer editors. It was historically known as the Open Directory Project (ODP).
DMOZ was founded in the spirit of the Open Source movement and is the only major directory that is 100% free. There is not, nor will there ever be, a cost to submit a site to the directory or to use the directory's data. Its data is made available for free to anyone who agrees to comply with its free use license.
DMOZ is the most widely distributed database of Web content classified by humans. It serves as input to the Web's largest and most popular search engines and portals, including AOL Search, Google, Lycos, HotBot, and hundreds of others.
According to our latest research, the global Web Categorization Services market size reached USD 2.41 billion in 2024, demonstrating strong momentum driven by the imperative need for advanced digital content management and security solutions. The market is expected to expand at a robust CAGR of 13.6% during the forecast period, reaching a projected value of USD 7.09 billion by 2033. Growth in this sector is primarily fueled by the increasing sophistication of cyber threats, the proliferation of web-based applications, and the growing adoption of cloud-based technologies across diverse industry verticals.
A significant growth factor for the Web Categorization Services market is the exponential rise in online content and the corresponding necessity for organizations to manage, filter, and secure web traffic efficiently. As businesses increasingly migrate operations to digital platforms, the sheer volume of web content—ranging from benign corporate resources to potentially harmful or non-compliant material—necessitates sophisticated categorization solutions. Web categorization services leverage artificial intelligence and machine learning algorithms to analyze and classify millions of web pages in real time, enabling organizations to enforce content policies, enhance productivity, and mitigate the risks associated with inappropriate or malicious sites. This capability is particularly crucial in sectors such as BFSI, healthcare, and government, where regulatory compliance and data protection are paramount.
Another pivotal driver behind market growth is the rising emphasis on brand safety and digital advertising effectiveness. As enterprises allocate larger portions of their marketing budgets to online channels, the risk of brand advertisements appearing alongside inappropriate or harmful content has become a critical concern. Web categorization services play a vital role in ensuring brand safety by dynamically assessing the context and category of web pages where ads are displayed, thus protecting brand reputation and optimizing ad targeting. This trend is further amplified by the increasing adoption of programmatic advertising and real-time bidding platforms, which rely heavily on accurate web categorization to maximize campaign ROI and minimize reputational risk.
The surge in remote work and the widespread adoption of bring-your-own-device (BYOD) policies have also contributed to the expansion of the Web Categorization Services market. With employees accessing corporate networks from diverse locations and devices, organizations face heightened challenges in maintaining security and compliance. Web categorization solutions enable IT departments to implement granular access controls, monitor user activity, and prevent data breaches by blocking access to malicious or unauthorized websites. The integration of web categorization with existing security information and event management (SIEM) systems further enhances threat intelligence and incident response capabilities, making these services indispensable in the modern enterprise security stack.
In the context of securing web environments, the role of a Secure Web Gateway (SWG) is becoming increasingly pivotal. As organizations strive to protect their networks from sophisticated cyber threats, SWGs offer a comprehensive solution by filtering unwanted software and malware from user-initiated web traffic. They ensure that only safe and compliant content is accessed, thereby safeguarding sensitive data and maintaining regulatory compliance. The integration of SWGs with web categorization services enhances the overall security posture by providing real-time threat intelligence and advanced content filtering capabilities. This synergy is particularly beneficial for enterprises with distributed workforces and BYOD policies, where maintaining a secure web environment is critical to operational integrity.
Regionally, North America continues to dominate the Web Categorization Services market, accounting for the largest share in 2024. This leadership is attributed to the early adoption of advanced cybersecurity solutions, the presence of major technology vendors, and stringent regulatory frameworks governing data privacy and digital content. However, the Asia Pacific region is anticipated to exhibit the highest CAGR over the forecast period.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The USDA-NRCS Soil Series Classification Database contains the taxonomic classification of each soil series identified in the United States, Territories, Commonwealths, and Island Nations served by USDA-NRCS. Along with the taxonomic classification, the database contains other information about the soil series, such as office of responsibility, series status, dates of origin and establishment, and geographic areas of usage. The database is maintained by the soils staff of the NRCS MLRA Soil Survey Region Offices across the country. Additions and changes are continually being made, resulting from ongoing soil survey work and refinement of the soil classification system. As the database is updated, the changes are immediately available to the user, so the data retrieved is always the most current. Web access to this soil classification database provides capabilities to view the contents of individual series records, to query the database on any data element and produce a report with the selected soils, or to produce national reports with all soils in the database. The standard reports available allow the user to display the soils by series name or by taxonomic classification. The SC database was migrated into the NASIS database with version 6.2.
Resources in this dataset:
Resource Title: Website Pointer to Soil Series Classification Database (SC). File Name: Web Page, url: https://www.nrcs.usda.gov/wps/portal/nrcs/detail/soils/survey/class/data/?cid=nrcs142p2_053583 Supports the following queries:
The training dataset consists of 20 million pairs of product offers referring to the same products, categorized into 25 product categories. We also created a categorization gold standard by manually verifying more than 2,000 clusters of offers belonging to 25 different product categories.
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Web Categorization Services market size reached USD 3.42 billion in 2024, with a robust year-on-year growth trajectory. The market is expected to expand at a CAGR of 14.2% during the forecast period, projecting a value of USD 9.03 billion by 2033. This growth is primarily driven by the increasing need for advanced content filtering, regulatory compliance, and enhanced cybersecurity measures across diverse industries. The rising adoption of cloud-based solutions and the proliferation of digital content are further fueling market expansion, as organizations worldwide prioritize secure and efficient web usage management.
One of the chief growth drivers for the Web Categorization Services market is the escalating frequency and sophistication of cyber threats. Enterprises are under mounting pressure to safeguard their digital assets and ensure compliance with evolving regulatory frameworks. Web categorization services play a pivotal role in this context by enabling organizations to classify and filter web content, thereby minimizing exposure to malicious or inappropriate sites. The surge in remote working and the widespread digital transformation initiatives have further accentuated the need for robust web categorization solutions, as businesses seek to protect their distributed workforce and sensitive data from cyber risks.
Another significant factor propelling market growth is the increasing emphasis on brand safety and ad targeting within the digital advertising ecosystem. As brands allocate larger budgets to online advertising, the importance of ensuring that ads are displayed on appropriate and safe web pages has become paramount. Web categorization services empower advertisers and publishers to effectively filter out unsuitable or harmful content, thereby preserving brand reputation and maximizing campaign ROI. The integration of artificial intelligence and machine learning technologies into these services is enhancing their accuracy and efficiency, making them indispensable tools in the digital marketing landscape.
The growing need for regulatory compliance across sectors such as BFSI, healthcare, and government is also bolstering the adoption of web categorization services. Stringent data protection laws and industry-specific regulations require organizations to monitor and control online activities diligently. Web categorization solutions facilitate compliance by providing granular visibility into web usage and enabling the enforcement of acceptable use policies. As regulatory landscapes become increasingly complex, the demand for automated, scalable, and customizable web categorization services is expected to surge, driving sustained market growth over the forecast period.
From a regional perspective, North America continues to dominate the Web Categorization Services market, accounting for the largest share in 2024. This leadership is attributed to the high digital adoption rates, presence of major technology vendors, and stringent cybersecurity regulations in the region. Meanwhile, Asia Pacific is witnessing the fastest growth, fueled by rapid digitalization, expanding internet user base, and increasing investments in cybersecurity infrastructure. Europe also holds a significant market share, driven by robust regulatory frameworks such as GDPR and a strong focus on data privacy. The Middle East & Africa and Latin America are emerging as promising markets, supported by growing awareness of cybersecurity and rising adoption of digital technologies.
The Component segment of the Web Categorization Services market is bifurcated into software and services, each playing a critical role in the overall value proposition offered to end-users. The software segment comprises standalone and integrated solutions that automate the process of classifying and filtering web content. These solutions leverage advanced algorithms, artificial intelligence, and machine learning to deliver precise and real-time categorization. The growing complexity of web content and the need for scalable, customizable solutions have spurred significant investments in software development, resulting in continuous innovation and feature enhancements.
On the other hand, the services segment encompasses a range of offerings, including consulting, implementation, and support.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the base training dataset for training an IAB taxonomy text/website classification model. We are actively using it at https://front-page.com for domain/website classification and search query enhancement.
Freemium APIs are available at https://www.rest-apis.com/
If you need test data to run your model against, the counterpart data set is https://www.kaggle.com/datasets/bpmtips/target-domains-for-iab-textwebsite-classification
Have Fun !!
Let us know if you exceed the accuracy of our model.
Uploaded a new version with only English-language sites.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the full set of code and data for the CompuCrawl database. The database contains the archived websites of publicly traded North American firms listed in the Compustat database between 1996 and 2020, representing 11,277 firms with 86,303 firm/year observations and 1,617,675 webpages in the final cleaned and selected set.

The files are ordered by moment of use in the workflow. For example, the first file in the list is the input file for code files 01 and 02, which create and update the two tracking files "scrapedURLs.csv" and "URLs_1_deeper.csv" and which write HTML files to its folder. "HTML.zip" is the resultant folder, converted to .zip for ease of sharing. Code file 03 then reads this .zip file and is therefore below it in the ordering.

The full set of files, in order of use, is as follows:
- Compustat_2021.xlsx: The input file containing the URLs to be scraped and their date range.
- 01 Collect frontpages.py: Python script scraping the front pages of the list of URLs and generating a list of URLs one page deeper in the domains.
- URLs_1_deeper.csv: List of URLs one page deeper on the main domains.
- 02 Collect further pages.py: Python script scraping the list of URLs one page deeper in the domains.
- scrapedURLs.csv: Tracking file containing all URLs that were accessed and their scraping status.
- HTML.zip: Archived version of the set of individual HTML files.
- 03 Convert HTML to plaintext.py: Python script converting the individual HTML pages to plaintext.
- TXT_uncleaned.zip: Archived version of the converted yet uncleaned plaintext files.
- input_categorization_allpages.csv: Input file for classification of pages using GPT according to their HTML title and URL.
- 04 GPT application.py: Python script using OpenAI's API to classify selected pages according to their HTML title and URL.
- categorization_applied.csv: Output file containing the classification of selected pages.
- exclusion_list.xlsx: File containing three sheets: 'gvkeys' containing the GVKEYs of duplicate observations (that need to be excluded), 'pages' containing page IDs for pages that should be removed, and 'sentences' containing (sub-)sentences to be removed.
- 05 Clean and select.py: Python script applying data selection and cleaning (including selection based on page category), with settings and decisions described at the top of the script. This script also combines individual pages into one combined observation per GVKEY/year.
- metadata.csv: Metadata containing information on all processed HTML pages, including those not selected.
- TXT_cleaned.zip: Archived version of the selected and cleaned plaintext page files. This file serves as input for the word embeddings application.
- TXT_combined.zip: Archived version of the combined plaintext files at the GVKEY/year level. This file serves as input for the data description using topic modeling.
- 06 Topic model.R: R script that loads the combined text data from the folder stored in "TXT_combined.zip", applies further cleaning, and estimates a 125-topic model.
- TM_125.RData: RData file containing the results of the 125-topic model.
- loadings125.csv: CSV file containing the loadings for all 125 topics for all GVKEY/year observations that were included in the topic model.
- 125_topprob.xlsx: Overview of top-loading terms for the 125-topic model.
- 07 Word2Vec train and align.py: Python script that loads the plaintext files in the "TXT_cleaned.zip" archive to train a series of Word2Vec models and subsequently align them in order to compare word embeddings across time periods.
- Word2Vec_models.zip: Archived version of the saved Word2Vec models, both unaligned and aligned.
- 08 Word2Vec work with aligned models.py: Python script which loads the trained Word2Vec models to trace the development of the embeddings for the terms "sustainability" and "profitability" over time.
- 99 Scrape further levels down.py: Python script that can be used to generate a list of unscraped URLs from the pages that themselves were one level deeper than the front page.
- URLs_2_deeper.csv: CSV file containing unscraped URLs from the pages that themselves were one level deeper than the front page.

For those only interested in downloading the final database of texts, the files "HTML.zip", "TXT_uncleaned.zip", "TXT_cleaned.zip", and "TXT_combined.zip" contain the full set of HTML pages, the processed but uncleaned texts, the selected and cleaned texts, and the combined and cleaned texts at the GVKEY/year level, respectively.

The following webpage contains answers to frequently asked questions: https://haans-mertens.github.io/faq/. More information on the database and the underlying project can be found at https://haans-mertens.github.io/ and in the following article: "The Internet Never Forgets: A Four-Step Scraping Tutorial, Codebase, and Database for Longitudinal Organizational Website Data", by Richard F.J. Haans and Marc J. Mertens, in Organizational Research Methods. The full paper can be accessed here.
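For readers who want a feel for the first step of this pipeline before opening the scripts, here is a minimal, hypothetical sketch of what a front-page collection step like "01 Collect frontpages.py" does: fetch each URL, store the HTML, and log the outcome in a tracking file. The input file name ("urls.csv", standing in for the URL list derived from Compustat_2021.xlsx) and the exact columns of the tracking file are assumptions, not the authors' implementation.

```python
import csv
import hashlib
import pathlib
import requests

out_dir = pathlib.Path("HTML")
out_dir.mkdir(exist_ok=True)

# "urls.csv" is a hypothetical stand-in for the list of URLs to scrape.
with open("urls.csv", newline="") as src, open("scrapedURLs.csv", "w", newline="") as log:
    writer = csv.writer(log)
    writer.writerow(["url", "status"])            # assumed tracking-file layout
    for row in csv.reader(src):
        url = row[0]
        try:
            resp = requests.get(url, timeout=10)
            name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".html"
            (out_dir / name).write_text(resp.text, encoding="utf-8")
            writer.writerow([url, resp.status_code])
        except requests.RequestException as exc:
            writer.writerow([url, f"error: {exc}"])
```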
The Reddit Subreddit Dataset by Dataplex offers a comprehensive and detailed view of Reddit’s vast ecosystem, now enhanced with appended AI-generated columns that provide additional insights and categorization. This dataset includes data from over 2.1 million subreddits, making it an invaluable resource for a wide range of analytical applications, from social media analysis to market research.
Dataset Overview:
This dataset includes detailed information on subreddit activities, user interactions, post frequency, comment data, and more. The inclusion of AI-generated columns adds an extra layer of analysis, offering sentiment analysis, topic categorization, and predictive insights that help users better understand the dynamics of each subreddit.
2.1 Million Subreddits with Enhanced AI Insights: The dataset covers over 2.1 million subreddits and now includes AI-enhanced columns that provide:
- Sentiment Analysis: AI-driven sentiment scores for posts and comments, allowing users to gauge community mood and reactions.
- Topic Categorization: Automated categorization of subreddit content into relevant topics, making it easier to filter and analyze specific types of discussions.
- Predictive Insights: AI models that predict trends, content virality, and user engagement, helping users anticipate future developments within subreddits.
Sourced Directly from Reddit:
All social media data in this dataset is sourced directly from Reddit, ensuring accuracy and authenticity. The dataset is updated regularly, reflecting the latest trends and user interactions on the platform. This ensures that users have access to the most current and relevant data for their analyses.
Data Quality and Reliability:
The Reddit Subreddit Dataset emphasizes data quality and reliability. Each record is carefully compiled from Reddit’s vast database, ensuring that the information is both accurate and up-to-date. The AI-generated columns further enhance the dataset's value, providing automated insights that help users quickly identify key trends and sentiments.
Integration and Usability:
The dataset is provided in a format that is compatible with most data analysis tools and platforms, making it easy to integrate into existing workflows. Users can quickly import, analyze, and utilize the data for various applications, from market research to academic studies.
User-Friendly Structure and Metadata:
The data is organized for easy navigation and analysis, with metadata files included to help users identify relevant subreddits and data points. The AI-enhanced columns are clearly labeled and structured, allowing users to efficiently incorporate these insights into their analyses.
Ideal For:
This dataset is an essential resource for anyone looking to understand the intricacies of Reddit's vast ecosystem, offering the data and AI-enhanced insights needed to drive informed decisions and strategies across various fields. Whether you’re tracking emerging trends, analyzing user behavior, or conduc...
Often we find ourselves needing to classify businesses and companies against a standard taxonomy. This dataset comes with pre-classified companies along with data scraped from their websites.
The scraped data from the website includes:
1. Category: The target label into which the company is classified
2. website: The website of the company / business
3. company_name: The company / business name
4. homepage_text: Visible homepage text
5. h1: The heading 1 tags from the HTML of the home page
6. h2: The heading 2 tags from the HTML of the home page
7. h3: The heading 3 tags from the HTML of the home page
8. nav_link_text: The visible titles of navigation links on the homepage (e.g., Home, Services, Product, About Us, Contact Us)
9. meta_keywords: The meta keywords in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
10. meta_description: The meta description in the header of the page HTML for SEO (more info: https://www.w3schools.com/tags/tag_meta.asp)
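A minimal sketch of the most direct use of these columns: predicting Category from homepage_text. The file name and the model choice are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical file name for the dataset described above.
df = pd.read_csv("company_classification.csv").dropna(subset=["homepage_text", "Category"])

X_train, X_test, y_train, y_test = train_test_split(
    df["homepage_text"], df["Category"], test_size=0.2, random_state=42
)
model = make_pipeline(TfidfVectorizer(max_features=20000), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```

Richer variants could concatenate the h1/h2/h3, nav_link_text, and meta fields into the text input.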
CATH Domain Classification List (latest release) - protein structural domains classified into CATH hierarchy.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A confusion matrix can be used to compare a machine’s predictions against human classification. We can use confusion matrices to understand the consumption segments that the classifier is struggling to distinguish between. A confusion matrix for our XGBoost classification of web-scraped clothing data is available in this data download.
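For readers unfamiliar with the format, a confusion matrix is simply a count of predicted labels against true labels. A small illustrative sketch follows; the labels and predictions are placeholders, not the web-scraped clothing categories from this download.

```python
from sklearn.metrics import confusion_matrix

y_true = ["coats", "dresses", "jeans", "coats", "dresses", "jeans"]   # human classification
y_pred = ["coats", "jeans",   "jeans", "coats", "dresses", "dresses"] # machine predictions
labels = ["coats", "dresses", "jeans"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = human label, columns = predicted label
```

Large off-diagonal counts indicate the consumption segments the classifier struggles to distinguish between.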
Website visit data with URLs, categories, timestamps, and anonymized unique device identifiers.
Over 50 million unique devices per day. 1 billion+ raw signals per month with historical raw data available.
This data can be combined with demographic and lifestyle data to provide a richer view of the anonymous users/devices.
Intended for training ML and AI models.
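A minimal sketch of how such visit-level signals might be aggregated into per-device training features; the file and column names are assumptions about a dataset of this shape, not the provider's actual schema.

```python
import pandas as pd

# Hypothetical extract: one row per visit with device_id, url, category, timestamp.
visits = pd.read_csv("web_visits.csv", parse_dates=["timestamp"])

# Device x category visit counts, a common starting point for ML features.
features = (
    visits.groupby(["device_id", "category"])
    .size()
    .unstack(fill_value=0)
)
print(features.head())
```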
According to our latest research, the global Web Filtering Software market size reached USD 5.42 billion in 2024 and is anticipated to grow at a robust CAGR of 14.1% during the forecast period. By 2033, the market is projected to achieve a value of USD 18.14 billion. This significant growth is driven by increasing concerns about cybersecurity threats, the proliferation of internet-connected devices, and the need for organizations to comply with stringent regulatory requirements. The rising adoption of cloud computing and the expansion of remote workforces are further fueling the demand for advanced web filtering solutions worldwide.
A primary growth factor for the web filtering software market is the escalating frequency and sophistication of cyberattacks targeting both enterprises and individuals. As organizations digitize their operations and employees access corporate resources remotely, the attack surface has expanded considerably. Threat actors are leveraging advanced phishing schemes, malware, and ransomware campaigns, making it essential for businesses to deploy comprehensive web filtering solutions. These tools help block access to malicious websites, prevent data exfiltration, and enforce acceptable use policies, thereby reducing the risk of breaches and safeguarding sensitive information. The increasing awareness of these threats among enterprises and the public sector is translating into higher investments in web filtering technologies.
Another significant driver is the growing emphasis on regulatory compliance across various industries, especially those handling sensitive data such as BFSI, healthcare, and government. Regulations like the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), and the Payment Card Industry Data Security Standard (PCI DSS) require organizations to implement robust security controls, including web filtering, to protect personal and financial data. Non-compliance can result in hefty fines and reputational damage, prompting organizations to adopt advanced web filtering software as part of their broader cybersecurity strategy. This regulatory push is particularly pronounced in developed regions, but is also gaining traction in emerging markets as digitalization accelerates.
The surge in remote and hybrid work models has further accelerated the adoption of web filtering software. With employees accessing corporate networks from various locations and devices, traditional perimeter-based security approaches are no longer sufficient. Organizations are increasingly turning to cloud-based web filtering solutions that offer centralized policy management, real-time threat intelligence, and seamless scalability. These solutions enable businesses to maintain consistent security postures regardless of where employees are located, ensuring productivity and data protection in a highly distributed work environment. The flexibility and cost-effectiveness of cloud-based offerings are attracting both large enterprises and small and medium enterprises (SMEs).
Web Categorization Services play a crucial role in enhancing the effectiveness of web filtering software. By classifying websites into various categories, these services enable organizations to implement more granular and precise filtering policies. This categorization helps in blocking access to inappropriate or harmful content while allowing access to legitimate and necessary resources. As the internet continues to grow and evolve, the ability to accurately categorize web content is becoming increasingly important for businesses and institutions aiming to maintain a secure and productive online environment. The integration of Web Categorization Services with web filtering solutions ensures that organizations can adapt to new threats and content trends, providing a dynamic and responsive approach to cybersecurity.
From a regional perspective, North America currently dominates the web filtering software market, accounting for the largest revenue share in 2024. The region's leadership is attributed to the high adoption of digital technologies, a strong focus on cybersecurity, and the presence of leading market players. However, Asia Pacific is expected to witness the fastest growth during the forecast period, driven by rapid digital transformation across the region.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The AIToolBuzz — 16,763 AI Tools Dataset is a comprehensive collection of publicly available information on artificial intelligence tools and platforms curated from AIToolBuzz.com.
It compiles detailed metadata about each tool, including name, description, category, founding year, technologies used, website, and operational status.
The dataset serves as a foundation for AI trend analysis, product discovery, market research, and NLP-based categorization projects.
It enables researchers, developers, and analysts to explore the evolution of AI tools, detect emerging sectors, and study keyword trends across industries.
| Column | Description |
|---|---|
| Name | Tool’s official name |
| Link | URL of its page on AIToolBuzz |
| Logo | Direct logo image URL |
| Category | Functional domain (e.g., Communication, Marketing, Development) |
| Primary Task | Main purpose or capability |
| Keywords | Comma-separated tags describing tool functions and industries |
| Year Founded | Year of company/tool inception |
| Short Description | Concise summary of the tool |
| Country | Headquarters or operating country |
| industry | Industry classification |
| technologies | Key technologies or frameworks associated |
| Website | Official product/company website |
| Website Status | Website availability (Active / Error / Not Reachable / etc.) |
Example rows:

| Name | Category | Year Founded | Country | Website Status |
|---|---|---|---|---|
| ChatGPT | Communication and Support | 2022 | Estonia | Active |
| Claude | Operations and Management | 2023 | United States | Active |
The data was collected with requests + BeautifulSoup, extracting metadata from each tool’s public page (a minimal sketch of this approach follows the citation below). The recommended license is CC BY 4.0. If you use this dataset, please cite as:
AIToolBuzz — 16,763 AI Tools (Complete Directory with Metadata).
Kaggle. https://aitoolbuzz.com
You are free to share and adapt the data for research or analysis with proper attribution to AIToolBuzz.com as the original source.
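The collection method stated above (requests + BeautifulSoup over each tool's public page) can be sketched roughly as follows; the example URL and the tags/selectors are assumptions, since AIToolBuzz's actual page structure is not documented here.

```python
import requests
from bs4 import BeautifulSoup

def scrape_tool_page(url: str) -> dict:
    """Fetch a tool page and pull a few metadata fields (illustrative selectors only)."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.find("h1")                                       # assumed to hold the tool name
    meta_desc = soup.find("meta", attrs={"name": "description"})  # assumed short description
    return {
        "Name": title.get_text(strip=True) if title else None,
        "Short Description": meta_desc["content"] if meta_desc else None,
        "Link": url,
    }

print(scrape_tool_page("https://aitoolbuzz.com/"))  # placeholder URL
```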
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains a list of 186 Digital Humanities projects leveraging information visualisation methods. Each project has been classified according to visualisation and interaction techniques, narrativity and narrative solutions, domain, methods for the representation of uncertainty and interpretation, and the employment of critical and custom approaches to visually represent humanities data.
Classification schema: categories and columns
The project_id column contains unique internal identifiers assigned to each project. Meanwhile, the last_access column records the most recent date (in DD/MM/YYYY format) on which each project was reviewed based on the web address specified in the url column. The remaining columns can be grouped into descriptive categories aimed at characterising projects according to different aspects:
Narrativity. It reports the presence of narratives employing information visualisation techniques. Here, the term narrative encompasses both author-driven linear data stories and more user-directed experiences where the narrative sequence is composed of user exploration [1]. We define 2 columns to identify projects using visualisation techniques in narrative, or non-narrative sections. Both conditions can be true for projects employing visualisations in both contexts. Columns:
non_narrative (boolean)
narrative (boolean)
Domain. The humanities domain to which the project is related. We rely on [2] and the chapters of the first part of [3] to abstract a set of general domains. Column:
domain (categorical):
History and archaeology
Art and art history
Language and literature
Music and musicology
Multimedia and performing arts
Philosophy and religion
Other: both extra-list domains and cases of collections without a unique or specific thematic focus.
Visualisation of uncertainty and interpretation. Building upon the frameworks proposed by [4] and [5], a set of categories was identified, highlighting a distinction between precise and impressional communication of uncertainty. Precise methods explicitly represent quantifiable uncertainty such as missing, unknown, or uncertain data, precisely locating and categorising it using visual variables and positioning. Two sub-categories are interactive distinction, when uncertain data is not visually distinguishable from the rest of the data but can be dynamically isolated or included/excluded categorically through interaction techniques (usually filters); and visual distinction, when uncertainty visually “emerges” from the representation by means of dedicated glyphs and spatial or visual cues and variables. On the other hand, impressional methods communicate the constructed and situated nature of data [6], exposing the interpretative layer of the visualisation and indicating more abstract and unquantifiable uncertainty using graphical aids or interpretative metrics. Two sub-categories are: ambiguation, when the use of graphical expedients (like permeable glyph boundaries or broken lines) visually conveys the ambiguity of a phenomenon; and interpretative metrics, when expressive, non-scientific, or non-punctual metrics are used to build a visualisation. Column:
uncertainty_interpretation (categorical):
Interactive distinction
Visual distinction
Ambiguation
Interpretative metrics
Critical adaptation. We identify projects in which, for at least one visualisation, the following criteria are fulfilled: 1) it avoids uncritical repurposing of prepackaged, generic-use, or ready-made solutions; 2) it is tailored and unique to reflect the peculiarities of the phenomena at hand; 3) it avoids extreme simplifications, embracing and depicting complexity to promote time-spending visualisation-based inquiry. Column:
critical_adaptation (boolean)
Non-temporal visualisation techniques. We adopt and partially adapt the terminology and definitions from [7]. A column is defined for each type of visualisation and accounts for its presence within a project, also including stacked layouts and more complex variations. Columns and inclusion criteria:
plot (boolean): visual representations that map data points onto a two-dimensional coordinate system.
cluster_or_set (boolean): sets or cluster-based visualisations used to unveil possible inter-object similarities.
map (boolean): geographical maps used to show spatial insights. While we do not specify the variants of maps (e.g., pin maps, dot density maps, flow maps, etc.), we make an exception for maps where each data point is represented by another visualisation (e.g., a map where each data point is a pie chart) by accounting for the presence of both in their respective columns.
network (boolean): visual representations highlighting relational aspects through nodes connected by links or edges.
hierarchical_diagram (boolean): tree-like structures such as tree diagrams, radial trees, but also dendrograms. They differ from networks for their strictly hierarchical structure and absence of closed connection loops.
treemap (boolean): still hierarchical, but highlighting quantities expressed by means of area size. It also includes circle packing variants.
word_cloud (boolean): clouds of words, where each instance’s size is proportional to its frequency in a related context.
bars (boolean): includes bar charts, histograms, and variants. It coincides with “bar charts” in [7] but with a more generic term to refer to all bar-based visualisations.
line_chart (boolean): the display of information as sequential data points connected by straight-line segments.
area_chart (boolean): similar to a line chart but with a filled area below the segments. It also includes density plots.
pie_chart (boolean): circular graphs divided into slices which can also use multi-level solutions.
plot_3d (boolean): plots that use a third dimension to encode an additional variable.
proportional_area (boolean): representations used to compare values through area size. Typically, using circle- or square-like shapes.
other (boolean): it includes all other types of non-temporal visualisations that do not fall into the aforementioned categories.
Temporal visualisations and encodings. In addition to non-temporal visualisations, a group of techniques to encode temporality is considered in order to enable comparisons with [7]. Columns:
timeline (boolean): the display of a list of data points or spans in chronological order. They include timelines working either with a scale or simply displaying events in sequence. As in [7], we also include structured solutions resembling Gantt chart layouts.
temporal_dimension (boolean): to report when time is mapped to any dimension of a visualisation, with the exclusion of timelines. We use the term “dimension” and not “axis” as in [7] as more appropriate for radial layouts or more complex representational choices.
animation (boolean): temporality is perceived through an animation changing the visualisation according to time flow.
visual_variable (boolean): another visual encoding strategy is used to represent any temporality-related variable (e.g., colour).
Interaction techniques. A set of categories to assess affordable interaction techniques based on the concept of user intent [8] and user-allowed data actions [9]. The following categories roughly match the “processing”, “mapping”, and “presentation” actions from [9] and the manipulative subset of methods of the “how” an interaction is performed in the conception of [10]. Only interactions that affect the visual representation or the aspect of data points, symbols, and glyphs are taken into consideration. Columns:
basic_selection (boolean): the demarcation of an element either for the duration of the interaction or more permanently until the occurrence of another selection.
advanced_selection (boolean): the demarcation involves both the selected element and connected elements within the visualisation or leads to brush and link effects across views. Basic selection is tacitly implied.
navigation (boolean): interactions that allow moving, zooming, panning, rotating, and scrolling the view but only when applied to the visualisation and not to the web page. It also includes “drill” interactions (to navigate through different levels or portions of data detail, often generating a new view that replaces or accompanies the original) and “expand” interactions generating new perspectives on data by expanding and collapsing nodes.
arrangement (boolean): methods to organise visualisation elements (symbols, glyphs, etc.) or multi-visualisation layouts spatially through drag and drop or according to a criterion via more automatic triggers.
change (boolean): visual encoding alterations involving different aspects of visualisation as a whole: the same content is presented with another visualisation technique; the change involves symbols or glyphs aspect (colour, size, shape, etc.); the visualisation type is unaltered, but the layout variant changes (e.g., to stacked layouts); or other changes like axes inversion and scale modifications. The presence of all the visualisation techniques involved in a change is reported.
visualisation_filter (boolean): filters to exclude or include visualisation elements with respect to defined criteria, without reloading or generating a new visualisation. Unlike options triggering the fetch of new data to alter the visualisation content, filters seamlessly operate on existing visual elements.
collection_filter (boolean): the interaction with visualised elements acts as a filter for a related collection or list of items (e.g., clicking a region on a map filters a list of items according to spatial metadata).
aggregation (boolean): changes to the granularity of visual elements according to a
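Once the schema above is loaded as a table, typical questions about the corpus reduce to simple filters and counts. A minimal sketch follows; the CSV file name is an assumption, the column names follow the schema described above, and the boolean columns are assumed to parse as True/False values.

```python
import pandas as pd

projects = pd.read_csv("dh_visualisation_projects.csv")  # hypothetical file name

# Narrative projects that combine a map with a timeline.
subset = projects[projects["narrative"] & projects["map"] & projects["timeline"]]
print(subset[["project_id", "domain", "url"]])

# Distribution of uncertainty/interpretation strategies across the corpus.
print(projects["uncertainty_interpretation"].value_counts(dropna=False))
```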
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
arXiv publications dataset with simulated citation relationships.
https://github.com/jacekmiecznikowski/neo4index - the app evaluates scientific research impact using author-level metrics (h-index and more).
This collection contains data acquired from arXiv.org via the OAI2 protocol. arXiv does not provide citation metadata, so this data was pseudo-randomly simulated. We evaluated scientific research impact using six popular author-level metrics: h-index, m quotient, e-index, m-index, r-index, and ar-index.

Source: https://arxiv.org/help/bulk_data (downloaded: 2018-03-23; over 1.3 million publications)

Files:
* arxiv_bulk_metadata_2018-03-23.tar.gz - file downloaded using oai-harvester; contains metadata of all arXiv publications to date.
* categories.csv - contains categories from arXiv with the category/subcategory division.
* publications.csv - contains information about articles such as id, title, abstract, url, categories, and date.
* authors.csv - contains author data such as first name, last name, and the id of the published article.
* citations.csv - contains simulated relationships between all publications, generated using arxivCite.
* indices.csv - contains the 6 author-level metrics calculated on the database using neo4index.

Statistics:

| Metric | Average | Median | Mode |
|---|---|---|---|
| h-index | 3.5836524733724495 | 1.0 | 1.0 |
| m quotient | 0.5831426366846965 | 0.4167 | 1.0 |
| e-index | 7.9260187734579075 | 5.3852 | 0.0 |
| m-index | 29.436844659143155 | 17.0 | 0.0 |
| r-index | 8.931101630575293 | 5.831 | 0.0 |
| ar-index | 3.5439082808721025 | 2.7928 | 0.0 |
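As a concrete example of the first metric in the list, the sketch below computes an h-index from a list of per-paper citation counts (using the standard definition: the largest h such that h papers have at least h citations each); the sample counts are placeholders, not values from this dataset.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that the author has h papers with at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)

print(h_index([10, 8, 5, 4, 3]))  # -> 4
print(h_index([25, 8, 5, 3, 3]))  # -> 3
```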
https://choosealicense.com/licenses/mpl-2.0/
Lightspeed Categories
This is a database of the categories of the Lightspeed Filter. You can use it with the Lightspeed API, which will provide you with the category number for a URL.