https://creativecommons.org/publicdomain/zero/1.0/
This dataset consists of the top 50 most visited websites in the world, along with the category and principal country/territory for each site. The data provides insights into which sites are most popular globally and what type of content is most popular in different parts of the world.
This dataset can be used to track the most popular websites in the world over time. It can also be used to compare website popularity between different countries and categories.
- To track the most popular websites in the world over time
- To see how website popularity changes by region
- To find out which website categories are most popular
Dataset by Alexa Internet, Inc. (2019), released on Kaggle under the Open Data Commons Public Domain Dedication and License (ODC-PDDL)
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: df_1.csv
| Column name | Description |
|:---|:---|
| Site | The name of the website. (String) |
| Domain Name | The domain name of the website. (String) |
| Category | The category of the website. (String) |
| Principal country/territory | The principal country/territory where the website is based. (String) |
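As a minimal sketch of the category and country comparisons described above, the snippet below tallies categories overall and per country using only the Python standard library. The column names follow the df_1.csv description; the rows in `SAMPLE_CSV` are invented placeholders, not values from the actual file, and the `summarize` helper is hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented placeholder rows that only mimic the column layout of df_1.csv.
SAMPLE_CSV = """Site,Domain Name,Category,Principal country/territory
Example Search,search.example,Search engine,Country A
Example Video,video.example,Video sharing,Country A
Example Shop,shop.example,E-commerce,Country B
Example Portal,portal.example,Search engine,Country B
Example Social,social.example,Social network,Country A
"""


def summarize(csv_text):
    """Count sites per category, overall and per principal country."""
    reader = csv.DictReader(io.StringIO(csv_text))
    by_category = Counter()
    by_country = defaultdict(Counter)
    for row in reader:
        by_category[row["Category"]] += 1
        by_country[row["Principal country/territory"]][row["Category"]] += 1
    return by_category, by_country


by_category, by_country = summarize(SAMPLE_CSV)
print("Most common category:", by_category.most_common(1))
for country, counts in sorted(by_country.items()):
    print(country, dict(counts))
```

To run this against the real file, the `io.StringIO` wrapper could be replaced with an opened file handle, assuming the header names match the description above.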
The global number of internet users was forecast to increase continuously between 2024 and 2029 by a total of 1.3 billion users (+23.66 percent). After the fifteenth consecutive year of growth, the number of users is estimated to reach 7 billion, a new peak, in 2029. Notably, the number of internet users has increased continuously over the past years.
Depicted is the estimated number of individuals in the country or region at hand who use the internet. As the data source clarifies, connection quality and usage frequency are distinct aspects that are not taken into account here.
The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Find more key insights for the number of internet users in regions like the Americas and Asia.
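The quoted figures can be sanity-checked with a little arithmetic. The sketch below assumes the +23.66 percent is measured against the 2024 baseline, which the text does not state explicitly; the function name `implied_baseline` is made up for illustration.

```python
def implied_baseline(increase, pct_growth):
    """Baseline implied by an absolute increase and its percentage growth."""
    return increase / (pct_growth / 100.0)


# Figures quoted in the forecast above (in billions of users).
increase_bn = 1.3
growth_pct = 23.66

baseline_bn = implied_baseline(increase_bn, growth_pct)
final_bn = baseline_bn + increase_bn

print(f"implied 2024 baseline: {baseline_bn:.2f} billion")
print(f"implied 2029 total:    {final_bn:.2f} billion")
```

Under that assumption the implied baseline is about 5.5 billion and the implied 2029 total about 6.8 billion. The gap to the quoted 7 billion is plausibly due to rounding in the published increase, although the source does not confirm this.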
A list of domains - updated weekly. Each domain is parsed out in the following fields:
This is not a list of just registered domains but rather of domains that have, at some point, returned a valid web response. The dataset can be used as a building block for other web-based data sets.
The global number of smartphone users was forecast to increase continuously between 2024 and 2029 by a total of 1.8 billion users (+42.62 percent). After the ninth consecutive year of growth, the smartphone user base is estimated to reach 6.1 billion users, a new peak, in 2029. Notably, the number of smartphone users has increased continuously over the past years.
Smartphone users here are limited to internet users of any age using a smartphone. The figures shown have been derived from survey data that has been processed to estimate missing demographics.
The data shown are an excerpt of Statista's Key Market Indicators (KMI). The KMI are a collection of primary and secondary indicators on the macro-economic, demographic and technological environment in up to 150 countries and regions worldwide. All indicators are sourced from international and national statistical offices, trade associations and the trade press, and they are processed to generate comparable data sets (see supplementary notes under details for more information).
Find more key insights for the number of smartphone users in regions like Australia & Oceania and Asia.
Revolutionize Customer Engagement with Our Comprehensive Ecommerce Data
Our Ecommerce Data is designed to elevate your customer engagement strategies, providing you with unparalleled insights and precision targeting capabilities. With over 61 million global contacts, this dataset goes beyond conventional data, offering a unique blend of shopping cart links, business emails, phone numbers, and LinkedIn profiles. This comprehensive approach ensures that your marketing strategies are not just effective but also highly personalized, enabling you to connect with your audience on a deeper level.
What Makes Our Ecommerce Data Stand Out?
Unique Features for Enhanced Targeting
Our Ecommerce Data is distinguished by its depth and precision. Unlike many other datasets, it includes shopping cart links—a rare and valuable feature that provides you with direct insights into consumer behavior and purchasing intent. This information allows you to tailor your marketing efforts with unprecedented accuracy. Additionally, the integration of business emails, phone numbers, and LinkedIn profiles adds multiple layers to traditional contact data, enriching your understanding of clients and enabling more personalized engagement.
Robust and Reliable Data Sourcing
We pride ourselves on our dual-sourcing strategy that ensures the highest levels of data accuracy and relevance:
Primary Use Cases Across Industries
Our Ecommerce Data is versatile and can be leveraged across various industries for multiple applications:
- Precision Targeting in Marketing: Create personalized marketing campaigns based on detailed shopping cart activities, ensuring that your outreach resonates with individual customer preferences.
- Sales Enrichment: Sales teams can benefit from enriched client profiles that include comprehensive contact information, enabling them to connect with key decision-makers more effectively.
- Market Research and Analytics: Research and analytics departments can use this data for in-depth market studies and trend analyses, gaining valuable insights into consumer behavior and market dynamics.
Global Coverage for Comprehensive Engagement
Our Ecommerce Data spans the globe, providing you with extensive reach and the ability to engage with customers in diverse regions:
- North America: United States, Canada, Mexico
- Europe: United Kingdom, Germany, France, Italy, Spain, Netherlands, Sweden, and more
- Asia: China, Japan, India, South Korea, Singapore, Malaysia, and more
- South America: Brazil, Argentina, Chile, Colombia, and more
- Africa: South Africa, Nigeria, Kenya, Egypt, and more
- Australia and Oceania: Australia, New Zealand
- Middle East: United Arab Emirates, Saudi Arabia, Israel, Qatar, and more
Comprehensive Employee and Revenue Size Information
Our dataset also includes detailed information on:
- Employee Size: Whether you’re targeting small businesses or large corporations, our data covers all employee sizes, from startups to global enterprises.
- Revenue Size: Gain insights into companies across various revenue brackets, enabling you to segment the market more effectively and target your efforts where they will have the most impact.
Seamless Integration into Broader Data Offerings
Our Ecommerce Data is not just a standalone product; it is a critical piece of our broader data ecosystem. It seamlessly integrates with our comprehensive suite of business and consumer datasets, offering you a holistic approach to data-driven decision-making:
- Tailored Packages: Choose customized data packages that meet your specific business needs, combining Ecommerce Data with other relevant datasets for a complete view of your market.
- Holistic Insights: Whether you are looking for industry-specific details or a broader market overview, our integrated data solutions provide you with the insights necessary to stay ahead of the competition and make informed business decisions.
Elevate Your Business Decisions with Our Ecommerce Data
In essence, our Ecommerce Data is more than just a collection of contacts—it’s a strategic tool designed to give you a competitive edge in understanding and engaging your target audience. By leveraging the power of this comprehensive dataset, you can elevate your business decisions, enhance customer interactions, and navigate the digital landscape with confi...
By Throwback Thursday
Here are some tips on how to make the most out of this dataset:
Data Exploration:
- Begin by understanding the structure and contents of the dataset. Evaluate the number of rows (sites) and columns (attributes) available.
- Check for missing values or inconsistencies in data entry that may impact your analysis.
- Assess column descriptions to understand what information is included in each attribute.
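The exploration steps above can be sketched with the standard library alone. The column names below (`Name`, `Category`, `Country`, `Latitude`, `Longitude`) are assumptions based on the attributes this description mentions, the rows are invented placeholders, and the `profile` helper is hypothetical.

```python
import csv
import io

# Invented placeholder rows; the real file's columns may differ.
SAMPLE_CSV = """Name,Category,Country,Latitude,Longitude
Placeholder Site A,Cultural Site,Country A,10.5,20.25
Placeholder Site B,Natural Site,,35.0,-5.75
Placeholder Site C,Cultural Site,Country B,,
"""


def profile(csv_text):
    """Return the row count, the column names, and missing values per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])
    missing = {name: 0 for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            # Treat empty or whitespace-only cells as missing.
            if not (row[name] or "").strip():
                missing[name] += 1
    return n_rows, columns, missing


n_rows, columns, missing = profile(SAMPLE_CSV)
print(f"{n_rows} rows, {len(columns)} columns")
for name, count in missing.items():
    print(f"{name}: {count} missing")
```

Counting empty cells per column before any analysis makes it easier to decide which attributes are complete enough to use, for example whether the coordinates are usable for mapping.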
Geographical Analysis:
- Leverage geographical features such as latitude and longitude coordinates provided in this dataset.
- Plot these sites on a map using any mapping software or library like Google Maps or Folium for Python. Visualizing their distribution can provide insights into patterns based on location, climate, or cultural factors.
Analyzing Attributes:
- Familiarize yourself with different attributes available for analysis. Possible attributes include Name, Description, Category, Region, Country, etc.
- Understand each attribute's format and content type (categorical, numerical) for better utilization during data analysis.
Exploring Categories & Regions:
- Look at unique categories mentioned in the Category column (e.g., Cultural Site, Natural Site) to explore specific interests. This could help identify clusters within particular heritage types across countries/regions worldwide.
- Analyze regions with high concentrations of heritage sites using data visualizations like bar plots or word clouds based on frequency counts.
Identify Trends & Patterns:
- Discover recurring themes across various sites by analyzing descriptive text attributes such as names and descriptions.
- Identify patterns and correlations between attributes by performing statistical analysis or utilizing machine learning techniques.
Comparison:
- Compare different attributes to gain a deeper understanding of the sites.
- For example, analyze the number of heritage sites per country/region or compare the distribution between cultural and natural heritage sites.
Additional Data Sources:
- Use this dataset as a foundation to combine it with other datasets for in-depth analysis. There are several sources available that provide additional data on UNESCO World Heritage Sites, such as travel blogs, official tourism websites, or academic research databases.
Remember to cite this dataset appropriately if you use it in
- Travel Planning: This dataset can be used to identify and plan visits to UNESCO World Heritage sites around the world. It provides information about the location, category, and date of inscription for each site, allowing users to prioritize their travel destinations based on personal interests or preferences.
- Cultural Preservation: Researchers or organizations interested in cultural preservation can use this dataset to analyze trends in UNESCO World Heritage site listings over time. By studying factors such as geographical distribution, types of sites listed, and inscription dates, they can gain insights into patterns of cultural heritage recognition and protection.
- Statistical Analysis: The dataset can be used for statistical analysis to explore various aspects related to UNESCO World Heritage sites. For example, it could be used to examine the correlation between a country's economic indicators (such as GDP per capita) and the number or type of World Heritage sites it possesses. This analysis could provide insights into the relationship between economic development and cultural preservation efforts at a global scale
If you use this dataset in your research, please credit the original authors.
Data Source
See the dataset description for more information.
If you use this dataset in your research, please credit Throwback Thursday.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
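The two processing steps named above, regex-based URL extraction and SHA-256 hashing of IDs, can be illustrated roughly as follows. This is a simplified sketch and not the authors' pipeline: the pattern `WIKI_URL`, the helper names, and the sample text are all assumptions, and the real extraction presumably handles many more URL variants. Whether the authors salt the IDs before hashing is not stated, so none is applied here.

```python
import hashlib
import re

# Illustrative pattern only: matches desktop and mobile Wikipedia article URLs
# and captures the language subdomain and the article title.
WIKI_URL = re.compile(
    r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s)\]>]+)",
    re.IGNORECASE,
)


def extract_wikipedia_links(text):
    """Return (language, title) pairs for Wikipedia URLs found in text."""
    links = []
    for match in WIKI_URL.finditer(text):
        language = match.group(1).lower()
        # Drop sentence punctuation that directly follows the URL.
        title = match.group(2).rstrip(".,;:!?")
        links.append((language, title))
    return links


def anonymize_id(raw_id):
    """Hash an identifier with SHA-256 and return the hex digest."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


comment = (
    "See https://en.wikipedia.org/wiki/Silk_Road, "
    "and https://de.m.wikipedia.org/wiki/Berlin."
)
print(extract_wikipedia_links(comment))
print(anonymize_id("abc"))
```

Because the hash is deterministic, the same Reddit ID always maps to the same digest, so rows can still be joined across tables without exposing the original identifier.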
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
post_id | TEXT | Unique identifier for the Reddit post. |
created_at | TIMESTAMP | The timestamp when the post was created. |
updated_at | TIMESTAMP | The timestamp when the post was last updated. |
language_code | TEXT | The language code of the post. |
score | INTEGER | The score (upvotes minus downvotes) of the post. |
upvote_ratio | REAL | The ratio of upvotes to total votes. |
gildings | INTEGER | Number of awards (gildings) received by the post. |
num_comments | INTEGER | Number of comments on the post. |
comments
Column Name | Type | Description |
---|---|---|
subreddit_id | TEXT | The unique identifier for the subreddit. |
post_id | TEXT | The ID of the Reddit post the comment belongs to. |
parent_id | TEXT | The ID of the parent comment (if a reply). |
comment_id | TEXT | Unique identifier for the comment. |
created_at | TIMESTAMP | The timestamp when the comment was created. |
last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
score | INTEGER | The score (upvotes minus downvotes) of the comment. |
upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks
Column Name | Type | Description |
---|---|---|
post_id | TEXT | Unique identifier for the Reddit post. |
end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the Reddit post. |
final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final URL. |
final_url | TEXT | The final URL after redirections. |
redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
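As a rough sketch of how the posts and postlinks tables relate, the example below builds a tiny in-memory database and joins the two on `post_id`. It assumes SQLite as the engine (the description only says "SQL database"), uses a subset of the documented columns, and fills the tables with invented rows; the `valid_links_by_language` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Subset of the documented columns, with the documented types.
conn.executescript(
    """
    CREATE TABLE posts (
        post_id TEXT PRIMARY KEY,
        subreddit_id TEXT,
        created_at TIMESTAMP,
        language_code TEXT,
        score INTEGER
    );
    CREATE TABLE postlinks (
        post_id TEXT,
        final_url TEXT,
        final_valid INTEGER,
        final_status INTEGER,
        in_title INTEGER
    );
    """
)

# Invented rows; real IDs are SHA-256 digests.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [
        ("hash_a", "sub_1", "2021-03-01 12:00:00", "en", 10),
        ("hash_b", "sub_1", "2021-03-02 08:30:00", "en", 4),
        ("hash_c", "sub_2", "2022-07-15 19:45:00", "de", 7),
    ],
)
conn.executemany(
    "INSERT INTO postlinks VALUES (?, ?, ?, ?, ?)",
    [
        ("hash_a", "https://en.wikipedia.org/wiki/Example_A", 1, 200, 0),
        ("hash_a", "https://en.wikipedia.org/wiki/Example_B", 1, 200, 1),
        ("hash_b", "https://en.wikipedia.org/wiki/Example_C", 0, 404, 0),
        ("hash_c", "https://de.wikipedia.org/wiki/Example_D", 1, 200, 0),
    ],
)


def valid_links_by_language(connection):
    """Count valid links and average post score per post language."""
    query = """
        SELECT p.language_code, COUNT(*) AS n_links, AVG(p.score) AS mean_score
        FROM posts AS p
        JOIN postlinks AS l ON l.post_id = p.post_id
        WHERE l.final_valid = 1
        GROUP BY p.language_code
        ORDER BY n_links DESC, p.language_code
    """
    return connection.execute(query).fetchall()


result = valid_links_by_language(conn)
print(result)
```

A post with several links appears once per link after the join, so per-post statistics such as the mean score are weighted by link count unless the links are deduplicated first.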
commentlinks
Column Name | Type | Description |
---|---|---|
comment_id | TEXT | Unique identifier for the Reddit comment. |
end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
end_processed_url | TEXT | The extracted URL from the comment. |
final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
final_status | INTEGER | HTTP status code of the final |
Non-anonymized subset of the databases used in the paper "Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace" (Christin, 2013). In this dataset, textual information (item name, description, or feedback text) and handles have not been anonymized and are thus available. We don't expect any private identifiers or other PII to be present in the data, which was collected from a publicly available website -- the Silk Road anonymous marketplace -- for a few months in 2012.
For less restricted usage terms, please consider the anonymized version, which is also available without any restrictions. This non-anonymized dataset should only be requested if your project MUST rely on full textual descriptions of items and/or feedback.
Christin (2013) Traveling the Silk Road: A measurement analysis of a large anonymous online marketplace. To appear in Proceedings of the 22nd International World Wide Web Conference (WWW'13). Rio de Janeiro, Brazil. May 2013.
Data from Fortune 500's 2023 ranking.
Includes data on top 1000 companies w/ additional info (Stock symbol/*ticker*, CEO name).
Update (New dataset): 2024 Fortune 1000 Companies
From Investopedia:
The Fortune 1000 is an annual list of the 1000 largest American companies maintained by the popular magazine Fortune. Fortune ranks the eligible companies by revenue generated from core operations, discounted operations, and consolidated subsidiaries. Since revenue is the basis for inclusion, every company is authorized to operate in the United States and files a 10-K or comparable financial statement with a government agency.
Fortune magazine publishes this list every year and some lists can be found from different sources. From looking at this year's available datasets, some features were missing or could not be found. This was built from scraping the standard features as well as what's included on Company Info (such as CEO, Ticker and website) from the Fortune magazine website. Details on how the data was generated can be found on this notebook where a few of the features were also visualized.
The source code of the 2023 Fortune 500 ranking page includes 1000 companies. A reference page (slug) with additional info is included for each company; these pages were also scraped to complete the dataset.
Available formats: csv, parquet
Features are as follows:
[Note: References to datatypes are relevant when using the parquet file; Labels refer to the original website names]
Patterns of educational attainment vary greatly across countries, and across population groups within countries. In some countries, virtually all children complete basic education whereas in others large groups fall short. The primary purpose of this database, and the associated research program, is to document and analyze these differences using a compilation of a variety of household-based data sets: Demographic and Health Surveys (DHS); Multiple Indicator Cluster Surveys (MICS); Living Standards Measurement Study Surveys (LSMS); as well as country-specific Integrated Household Surveys (IHS) such as Socio-Economic Surveys.
As shown at the website associated with this database, there are dramatic differences in attainment by wealth. When households are ranked according to their wealth status (or more precisely, a proxy based on the assets owned by members of the household) there are striking differences in the attainment patterns of children from the richest 20 percent compared to the poorest 20 percent.
In Mali in 2012, only 34 percent of 15 to 19 year olds in the poorest quintile had completed grade 1, whereas 80 percent of the richest quintile had done so. In many countries, for example Pakistan, Peru and Indonesia, almost all the children from the wealthiest households have completed at least one year of schooling. In some countries, like Mali and Pakistan, wealth gaps are evident from grade 1 on; in other countries, like Peru and Indonesia, wealth gaps emerge later in the school system.
The EdAttain website allows a visual exploration of gaps in attainment and enrollment within and across countries, based on the international database which spans multiple years from over 120 countries and includes indicators disaggregated by wealth, gender and urban/rural location. The database underlying that site can be downloaded from here.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Worldwide Soundscapes project is a global, open inventory of spatio-temporally replicated soundscape datasets. This Zenodo entry comprises the data tables that constitute its (meta-)database, as well as their description.
The overview of all sampling sites can be found on the corresponding project on ecoSound-web, as well as a demonstration collection containing selected recordings. More information on the project can be found here and on ResearchGate.
The audio recording criteria justifying inclusion into the meta-database are:
The individual columns of the provided data tables are described in the following. Data tables are linked through primary keys; joining them will result in a database.
datasets
datasets-sites
sites
deployments
Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.
Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.
User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.
Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.
GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.
Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.
High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.
Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.
Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.
How much time do people spend on social media? As of 2025, the average daily social media usage of internet users worldwide amounted to 141 minutes per day, down from 143 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of 3 hours and 49 minutes on social media each day. In comparison, the daily time spent on social media in the U.S. was just 2 hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively. People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general. During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics, and heightened everyday distractions.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
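The relationship between the long and wide files can be sketched as a reduction that keeps, for each film, the earliest sample festival. Only the variable name `fest` comes from the description above; the other keys (`film_id`, `year`, `month`), the IDs, the year, and the `long_to_wide` helper are invented for illustration and may not match the codebook.

```python
# Invented long-format rows: one row per film and sample festival.
LONG_ROWS = [
    {"film_id": "F001", "fest": "Frameline", "year": 2013, "month": 6},
    {"film_id": "F001", "fest": "Berlinale", "year": 2013, "month": 2},
    {"film_id": "F002", "fest": "Frameline", "year": 2013, "month": 6},
]


def long_to_wide(rows):
    """Keep one row per film, using the first sample festival it appeared at."""
    first_seen = {}
    for row in rows:
        key = (row["year"], row["month"])
        current = first_seen.get(row["film_id"])
        if current is None or key < (current["year"], current["month"]):
            first_seen[row["film_id"]] = row
    return [
        {"film_id": film_id, "fest": first_seen[film_id]["fest"]}
        for film_id in sorted(first_seen)
    ]


wide = long_to_wide(LONG_ROWS)
for row in wide:
    print(row)
```

In this toy input, film F001 appears at both festivals in the long rows but is listed under Berlinale in the wide result, mirroring the Berlinale and Frameline example in the description.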
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains a binary gender variable assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL on the IMDb website for each film from the core dataset. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if the advanced search finds no matches, potentially using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches each film in the core dataset with the suggested films on the IMDb search page and records matching scores. Matching used data on directors, production year (+/- one year), and title. Titles were compared with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
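To make the two title-matching methods concrete, here is a minimal Python sketch of an OSA distance and a cosine similarity over character bigrams, combined with the director and year (+/- one year) checks. It is not the authors' R code. The thresholds `min_cosine` and `max_osa`, the exact-match comparison of director names, the record fields, and the example titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def osa_distance(a, b):
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def cosine_similarity(a, b, q=2):
    """Cosine similarity between the character q-gram count vectors of a and b."""
    va = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    vb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(va[g] * vb[g] for g in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def is_candidate_match(core, imdb, min_cosine=0.8, max_osa=2):
    """Hypothetical rule: same director, year within one, and a similar title."""
    year_ok = abs(core["year"] - imdb["year"]) <= 1
    director_ok = core["director"].lower() == imdb["director"].lower()
    t1, t2 = core["title"].lower(), imdb["title"].lower()
    title_ok = (cosine_similarity(t1, t2) >= min_cosine
                or osa_distance(t1, t2) <= max_osa)
    return year_ok and director_ok and title_ok


# Made-up records: the suggested title contains a transposition typo.
core_film = {"title": "The Long Summer", "year": 2016, "director": "A. Example"}
suggested = {"title": "The Long Sumemr", "year": 2017, "director": "A. Example"}
print(osa_distance(core_film["title"], suggested["title"]))
print(is_candidate_match(core_film, suggested))
```

The transposed letters lower the bigram cosine score but cost only one OSA edit, which illustrates why the two methods complement each other.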
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It reports the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cyber attacks are a growing concern for small businesses during COVID-19. Be protected while you work. Upgrade your small business's virus protection today! Cybersecurity solutions for small to mid-sized businesses deliver enterprise-level protection.
Download this data set (Checklist for a Small Firm's Cybersecurity Program 2020-2021) to help secure various aspects of your small business, including employee data, your website, and more. This checklist is provided to
assist small member firms with limited resources to establish a cybersecurity program to identify and assess cybersecurity threats,
protect assets from cyber intrusions,
detect when their systems and assets have been compromised,
plan for the response when a compromise occurs and implement a plan to recover lost, stolen or unavailable assets.
Train employees in security principles.
Protect information, computers, and networks from malware attacks.
Provide firewall security for your Internet connection.
Create a mobile device action plan.
Make backup copies of important business data and information.
Learn about the threats and how to protect your website.
Protect Your Small Business site.
Learn the basics of protecting your business websites from cyber attacks at the WP Hacked Help Blog.
Created With Inputs From Security Experts at WP Hacked Help - Pioneer In WordPress Malware Removal & Security
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
There are a lot of unknowns when running an e-commerce store, even when you have analytics to guide your decisions.
Users are an important factor in an e-commerce business. This is especially true in a C2C-oriented store, since they are both the suppliers (by uploading their products) AND the customers (by purchasing other users' articles).
This dataset aims to serve as a benchmark for an e-commerce fashion store. Using this dataset, you may want to try to understand what you can expect of your users and estimate in advance what your growth may look like.
If you think this kind of dataset may be useful or if you liked it, don't forget to show your support or appreciation with an upvote/comment. You may even include how you think this dataset might be of use to you. This way, I will be more aware of specific needs and be able to adapt my datasets to better suit your needs.
This dataset is part of a preview of a much larger dataset. Please contact me for more.
The data was scraped from a successful online C2C fashion store with over 9M registered users. The store was first launched in Europe around 2009 then expanded worldwide.
Visitors vs Users: Visitors do not appear in this dataset. Only registered users are included. "Visitors" cannot purchase an article but can view the catalog.
Questions you might want to answer using this dataset:
For other licensing options, contact me.
GLOBE is a project to develop the best available 30-arc-second (nominally 1 kilometer) global digital elevation data set. This version of GLOBE contains data from 11 sources, and 17 combinations of source and lineage. It continues much in the tradition of the National Geophysical Data Center's TerrainBase (FGDC 1090), as TerrainBase served as a generally lower-resolution prototype of GLOBE data management and compilation techniques. The GLOBE mosaic has been compiled onto CD-ROMs for the international user community. It is also available from the World Wide Web (linked from the online linkage noted above) and anonymous ftp. Improvements to the global model are anticipated, as appropriate data and/or methods are made available. In addition, individual contributions to GLOBE (several areas have more than one candidate) should become available at the same website. GLOBE may be used for technology development, such as helping plan infrastructure for cellular communications networks, other public works, satellite data processing, and environmental monitoring and analysis. GLOBE prototypes (and probably GLOBE itself after its release) have been used to help develop terrain avoidance systems for aircraft. In all cases, GLOBE data should be treated like any potentially useful but guaranteed imperfect data set. Mission- or life-critical applications should consider the documented artifacts, as well as likely undocumented imperfections, in the data.
Click Web Traffic Combined with Transaction Data: A New Dimension of Shopper Insights
Consumer Edge is a leader in alternative consumer data for public and private investors and corporate clients. Click enhances the unparalleled accuracy of CE Transact by allowing investors to delve deeper and browse further into global online web traffic for CE Transact companies and more. Leverage the unique fusion of web traffic and transaction datasets to understand the addressable market and understand spending behavior on consumer and B2B websites. See the impact of changes in marketing spend, search engine algorithms, and social media awareness on visits to a merchant’s website, and discover the extent to which product mix and pricing drive or hinder visits and dwell time. Plus, Click uncovers a more global view of traffic trends in geographies not covered by Transact. Doubleclick into better forecasting, with Click.
Consumer Edge’s Click is available in machine-readable file delivery and enables:
- Comprehensive Global Coverage: Insights across 620+ brands and 59 countries, including key markets in the US, Europe, Asia, and Latin America.
- Integrated Data Ecosystem: Click seamlessly maps web traffic data to CE entities and stock tickers, enabling a unified view across various business intelligence tools.
- Near Real-Time Insights: Daily data delivery with a 5-day lag ensures timely, actionable insights for agile decision-making.
- Enhanced Forecasting Capabilities: Combining web traffic indicators with transaction data helps identify patterns and predict revenue performance.
Use Case: Analyze Year Over Year Growth Rate by Region
Problem: A public investor wants to understand how a company’s year-over-year growth differs by region.
Solution: The firm leveraged Consumer Edge Click data to:
- Gain visibility into key metrics like views, bounce rate, visits, and addressable spend
- Analyze year-over-year growth rates for a time period
- Break out data by geographic region to see growth trends
Metrics Include:
- Spend
- Items
- Volume
- Transactions
- Price Per Volume
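The year-over-year comparison in this use case comes down to the ratio (current - previous) / previous, computed separately for each region. The Python sketch below shows that arithmetic on made-up visit totals. The data layout, the function name, and the numbers are assumptions and do not reflect the actual Click file format.

```python
def yoy_growth_by_region(totals):
    """totals maps (region, year) to a metric total (e.g. visits).

    Returns a dict mapping (region, year) to the fractional growth over
    the previous year, for every pair that has a non-zero prior-year value.
    """
    growth = {}
    for (region, year), value in totals.items():
        previous = totals.get((region, year - 1))
        if previous:  # skip missing or zero baselines
            growth[(region, year)] = (value - previous) / previous
    return growth


# Illustrative numbers only.
visits = {
    ("US", 2022): 1000, ("US", 2023): 1200,
    ("Europe", 2022): 500, ("Europe", 2023): 450,
}
growth = yoy_growth_by_region(visits)
for (region, year), rate in sorted(growth.items()):
    print(f"{region} {year}: {rate:+.1%}")
```

Keeping the region in the key means growth is only ever compared within the same region, which is what the regional breakout requires.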
Inquire about a Click subscription to perform more complex, near real-time analyses on public tickers and private brands as well as for industries beyond CPG like:
- Monitor web traffic as a leading indicator of stock performance and consumer demand
- Analyze customer interest and sentiment at the brand and sub-brand levels
Consumer Edge offers a variety of datasets covering the US, Europe (UK, Austria, France, Germany, Italy, Spain), and across the globe, with subscription options serving a wide range of business needs.
Consumer Edge is the Leader in Data-Driven Insights Focused on the Global Consumer
https://www.icpsr.umich.edu/web/ICPSR/studies/38050/terms
Launched on April 28, 2009, Kickstarter is a Public Benefit Corporation based in Brooklyn, New York. It is a global crowdfunding platform that helps to fund new creative projects and ideas through direct support from individuals (backers) from around the world who pledge money to bring these projects and ideas to life. Kickstarter supports many different kinds of projects. Everything from films, games, and music to art, design, and technology. Funding on Kickstarter is based on the all-or-nothing model. Backers who pledge their support towards a particular project won't be charged unless the funding goal has been reached. Successfully funded projects reward their backers with one-of-a-kind experiences, e.g., limited editions, or copies of the creative work being produced. This study includes three datasets: (1) Kickstarter Project (public-use file), (2) Backer Location file, and (3) Kickstarter Project (restricted-use file). The public-use Kickstarter Project dataset contains detailed information about all successful and unsuccessful Kickstarter projects (N=610,015) from 2009-2023, including the project category and subcategory, project location (city, state (for U.S.-based projects), and country), funding goal in original and U.S. currencies, amount pledged in dollars, and the number of backers for each project. The restricted file adds the project title, 150-character project description, and the URL for the project on the Kickstarter site. The Backer Location dataset includes information about backers' country and state and the total amount pledged for each geographic location.
Our tabular dataset offers comprehensive B2B contact information extracted from import and export trades designed to fuel lead generation efforts. With meticulous field-checking processes, our data is a reliable resource for businesses seeking to expand their networks and explore new trade opportunities.
Each entry in our dataset undergoes rigorous validation protocols to ensure accuracy and completeness. Our quality control measures include cross-referencing multiple sources, verifying contact details, and validating trade information against authoritative databases. Maintaining high data integrity standards guarantees that our clients receive actionable insights to drive their business strategies forward.
The dataset encompasses many industries, capturing import and export trades across diverse sectors and regions. Our dataset provides a panoramic view of global trade dynamics from manufacturing to technology, agriculture to healthcare. With detailed information on products, quantities, and trading partners, businesses can identify promising leads, forge strategic partnerships, and capitalize on emerging market trends.
Our dataset offers substantial coverage in terms of scale, encompassing millions of trade transactions and B2B contacts worldwide. Whether clients seek to explore new markets, source reliable suppliers, or connect with potential buyers, our dataset is a valuable asset for informed decision-making.
On the data marketplace, we offer flexible licensing options tailored to meet the diverse needs of our clients. Whether they require a subset of data for targeted campaigns or the entire dataset for comprehensive market analysis, we provide customizable solutions to accommodate varying requirements.
Our commitment to transparency and data privacy ensures that clients can confidently leverage our dataset, knowing that their information is handled with the utmost care and security. We adhere to stringent data protection regulations and industry best practices, safeguarding sensitive information and fostering trust among our clientele.
Our tabular dataset of import and export trades B2B contacts represents a goldmine of opportunities for businesses seeking to expand their global footprint. With unparalleled accuracy, breadth, and flexibility, it is a cornerstone for successful lead generation and strategic decision-making in today's dynamic marketplace.
Fields: - First Name - Last Name - Title - Company - Company Name for Emails - Email - Seniority - Departments - First Phone - Corporate Phone - Employees - Industry - Person Linkedin Url - Website - Company Linkedin Url - Facebook Url - City - State - Country - Company Address - Company City - Company State - Company Country - Company Phone - Technologies - Annual Revenue