63 datasets found
  1. Media Web Reputation Ranking - SCImago

    • kaggle.com
    Updated Apr 9, 2025
    Cite
    Ali Jalaali (2025). Media Web Reputation Ranking - SCImago [Dataset]. https://www.kaggle.com/datasets/alijalali4ai/media-web-reputation-ranking-scimago
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ali Jalaali
    Description

    Using four metrics—**Authority Score, Referring Domains, Citation Flow, and Trust Flow**—with an equal weight of 25%, SCImago constructs an overall indicator that reflects media websites’ digital reputation. The results define their relative position in the ranking and permit a comparison of digital development and leadership.
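    The equal weighting described above can be sketched as follows. This is a minimal illustration, not SCImago's actual procedure: the class and function names are hypothetical, and it assumes each metric has already been normalized to a common scale (here 0 to 100), which the description does not specify.

```python
from dataclasses import dataclass


@dataclass
class SiteMetrics:
    # Assumption: all four metrics are pre-normalized to the same 0-100 scale.
    authority_score: float
    referring_domains: float
    citation_flow: float
    trust_flow: float


def overall_indicator(m: SiteMetrics) -> float:
    """Combine the four metrics with an equal weight of 25% each."""
    values = (m.authority_score, m.referring_domains, m.citation_flow, m.trust_flow)
    return sum(0.25 * v for v in values)


def rank_sites(sites: dict[str, SiteMetrics]) -> list[tuple[str, float]]:
    """Return (site, score) pairs sorted from highest to lowest score."""
    scored = [(name, overall_indicator(m)) for name, m in sites.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

    With hypothetical values, `overall_indicator(SiteMetrics(80, 60, 40, 20))` averages the four metrics, and `rank_sites` orders sites by that score.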

    The entire dataset is obtained from the public, open-access data of SCImago Media Rankings.

  2. Data from: Journal Ranking Dataset

    • kaggle.com
    Updated Aug 15, 2023
    Cite
    Abir (2023). Journal Ranking Dataset [Dataset]. https://www.kaggle.com/datasets/xabirhasan/journal-ranking-dataset
    Dataset provided by
    Kaggle
    Authors
    Abir
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Journals & Ranking

    An academic journal or research journal is a periodical publication in which research articles relating to a particular academic discipline are published, according to Wikipedia. Currently, there are more than 25,000 peer-reviewed journals indexed in citation index databases such as Scopus and Web of Science. These journals are ranked on the basis of various metrics such as CiteScore, H-index, etc. The metrics are calculated from the journal's yearly citation data. A lot of effort goes into devising a metric that reflects a journal's quality.

    Journal Ranking Dataset

    This is a comprehensive dataset on academic journals covering their metadata as well as citation, metrics, and ranking information. Detailed data on their subject area is also given in this dataset. The dataset is collected from the following indexing databases:

    • Scimago Journal Ranking
    • Scopus
    • Web of Science Master Journal List

    The data was collected by scraping and then cleaned; details can be found HERE.

    Key Features

    • Rank: Overall rank of journal (derived from sorted SJR index).
    • Title: Name or title of journal.
    • OA: Open Access or not.
    • Country: Country of origin.
    • SJR-index: A citation index calculated by Scimago.
    • CiteScore: A citation index calculated by Scopus.
    • H-index: Hirsch index, the largest number h such that at least h articles in that journal were cited at least h times each.
    • Best Quartile: Top Q-index or quartile a journal has in any subject area.
    • Best Categories: Subject areas with top quartile.
    • Best Subject Area: Highest ranking subject area.
    • Best Subject Rank: Rank of the highest ranking subject area.
    • Total Docs.: Total number of documents of the journal.
    • Total Docs. 3y: Total number of documents in the past 3 years.
    • Total Refs.: Total number of references of the journal.
    • Total Cites 3y: Total number of citations in the past 3 years.
    • Citable Docs. 3y: Total number of citable documents in the past 3 years.
    • Cites/Doc. 2y: Total number of citations divided by the total number of documents in the past 2 years.
    • Refs./Doc.: Total number of references divided by the total number of documents.
    • Publisher: Name of the publisher company of the journal.
    • Core Collection: Web of Science core collection name.
    • Coverage: Starting year of coverage.
    • Active: Active or inactive.
    • In-Press: Articles in press or not.
    • ISO Language Code: Three-letter ISO 639 code for language.
    • ASJC Codes: All Science Journal Classification codes for the journal.

    The rest of the features provide further details on the journal's subject area or category:

    • Life Sciences: Top level subject area.
    • Social Sciences: Top level subject area.
    • Physical Sciences: Top level subject area.
    • Health Sciences: Top level subject area.
    • 1000 General: ASJC main category.
    • 1100 Agricultural and Biological Sciences: ASJC main category.
    • 1200 Arts and Humanities: ASJC main category.
    • 1300 Biochemistry, Genetics and Molecular Biology: ASJC main category.
    • 1400 Business, Management and Accounting: ASJC main category.
    • 1500 Chemical Engineering: ASJC main category.
    • 1600 Chemistry: ASJC main category.
    • 1700 Computer Science: ASJC main category.
    • 1800 Decision Sciences: ASJC main category.
    • 1900 Earth and Planetary Sciences: ASJC main category.
    • 2000 Economics, Econometrics and Finance: ASJC main category.
    • 2100 Energy: ASJC main category.
    • 2200 Engineering: ASJC main category.
    • 2300 Environmental Science: ASJC main category.
    • 2400 Immunology and Microbiology: ASJC main category.
    • 2500 Materials Science: ASJC main category.
    • 2600 Mathematics: ASJC main category.
    • 2700 Medicine: ASJC main category.
    • 2800 Neuroscience: ASJC main category.
    • 2900 Nursing: ASJC main category.
    • 3000 Pharmacology, Toxicology and Pharmaceutics: ASJC main category.
    • 3100 Physics and Astronomy: ASJC main category.
    • 3200 Psychology: ASJC main category.
    • 3300 Social Sciences: ASJC main category.
    • 3400 Veterinary: ASJC main category.
    • 3500 Dentistry: ASJC main category.
    • 3600 Health Professions: ASJC main category.
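    The H-index definition given in the key features (the largest h such that at least h articles were cited at least h times each) can be computed directly from a list of per-article citation counts. The sketch below is illustrative only: the function name and the sample counts are hypothetical and are not taken from the dataset.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            # The top `position` articles all have at least `position` citations.
            h = position
        else:
            break
    return h
```

    For example, a journal whose articles were cited 10, 8, 5, 4 and 3 times has four articles with at least four citations each, but not five with at least five.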

  3. Esports Performance Rankings and Results

    • kaggle.com
    Updated Dec 12, 2022
    Cite
    The Devastator (2022). Esports Performance Rankings and Results [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-collegiate-esports-performance-with-bu/suggestions
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Esports Performance Rankings and Results

    Performance Rankings and Results from Multiple Esports Platforms

    By [source]

    About this dataset

    This dataset provides a detailed look into the world of competitive video gaming in universities. It covers a wide range of topics, from performance rankings and results across multiple esports platforms to the individual team and university rankings within each tournament. With an incredible wealth of data, fans can discover statistics on their favorite teams or explore the challenges placed upon university gamers as they battle it out to be the best. Dive into the information provided and get an inside view into the world of collegiate esports tournaments as you assess all things from Match ID, Team 1, University affiliations, Points earned or lost in each match and special Seeds or UniSeeds for exceptional teams. Of course don't forget about exploring all the great Team Names along with their corresponding websites for further details on stats across tournaments!


    How to use the dataset

    Download files: First, make sure you have downloaded the CS_week1, CS_week2, CS_week3 and seeds datasets on Kaggle. You will also need to download the currentRankings file for each week of competition. All files should be saved under their originally assigned names so that your analysis tools can read them properly (e.g., CS_week1.csv).

    Understand file structure: Once all the data has been collected and organized into separate files on your device, familiarize yourself with the information in each file. The main folder contains the main data files: week1-3 and seedings. The week1-3 files list the teams matched against one another, with each team's university, the point score from the match results, the team name, and the website URL associated with the university entry. The seedings file ranks the university entries and includes the team names, website URLs, and so on. An additional file contains currentRankings scores for each individual player or team for a given period of competition (e.g., the first week).

    Analyzing data: Now that everything is set up, it's time to explore. You can dig into trends among universities or individual players, looking at specific match performances or overall standings across the weeks of competition. You can also build graphs from the compiled data in the BUECTracker dataset. For example, to compare two universities, say Harvard University and Cornell University, since the beginning of the event, extract the respective points and dates columns (found under the result tab), the regions (North America vs. Europe, etc.), and general stats such as maps played. Any other custom ideas that come up when working with similar datasets can be explored the same way.

    Research Ideas

    • Analyze the performance of teams and identify areas for improvement for better performance in future competitions.
    • Assess which esports platforms are the most popular among gamers.
    • Gain a better understanding of player rankings across different regions, based on the rankings system, to create targeted strategies that could boost individual players' scoring potential or a team's overall success in competitive gaming events.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: CS_week1.csv

    | Column name | Description |
    |:---------------|:----------------------------------------------|
    | Match ID | Unique identifier for each match. (Integer) |
    | Team 1 | Name of the first team in the match. (String) |
    | University | University associated with the team. (String) |

    File: CS_week1_currentRankings.csv | Column name | Description | |:--------------|:-----------------------------------------------------------|...

  4. Most visited websites by hierachycal categories

    • kaggle.com
    Updated Sep 18, 2020
    Cite
    Natanael de Souza Figueiredo (2020). Most visited websites by hierachycal categories [Dataset]. https://www.kaggle.com/natanael127/most-visited-websites-by-hierachycal-categories/code
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Natanael de Souza Figueiredo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Alexa Internet was founded in April 1996 by Brewster Kahle and Bruce Gilliat. The company's name was chosen in homage to the Library of Alexandria of Ptolemaic Egypt, drawing a parallel between the largest repository of knowledge in the ancient world and the potential of the Internet to become a similar store of knowledge. (from Wikipedia)

    The categories list was due to be discontinued by September 17, 2020, so I wanted to save it. https://support.alexa.com/hc/en-us/articles/360051913314

    This dataset was produced by this Python script (V2.0): https://github.com/natanael127/dump-alexa-ranking

    Content

    The sites are grouped into 17 macro categories, and the tree ends up with more than 360,000 nodes. Subjects are well organized, and each has its own ranking of most accessed domains. So even the keys of a sub-dictionary may make a good small dataset.

    Acknowledgements

    Thank you to my friend André (https://github.com/andrerclaudio) for helping me with tips on Google Colaboratory and with computational power to get the data before our deadline.

    Inspiration

    The Alexa ranking was inspired by the Library of Alexandria. In the modern world, it may be a good starting point for AI to learn more about many, many subjects of the world.

  5. Alexa Domains Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 1, 2001
    Cite
    Isaac Corley; Jonathan Lwowski; Justin Hoffman (2001). Alexa Domains Dataset [Dataset]. https://paperswithcode.com/dataset/gagan-bhatia
    Authors
    Isaac Corley; Jonathan Lwowski; Justin Hoffman
    Description

    This dataset is composed of the URLs of the top 1 million websites. The domains are ranked using the Alexa traffic ranking, which is determined using a combination of the browsing behavior of users on the website, the number of unique visitors, and the number of pageviews. In more detail, unique visitors are the number of unique users who visit a website on a given day, and pageviews are the total number of user URL requests for the website. However, multiple requests for the same website on the same day are counted as a single pageview. The website with the highest combination of unique visitors and pageviews is ranked the highest.

  6. Data set of the article: Using Machine Learning for Web Page Classification...

    • explore.openaire.eu
    Updated Jan 4, 2021
    Cite
    Goran Matošević; Jasminka Dobša; Dunja Mladenić (2021). Data set of the article: Using Machine Learning for Web Page Classification in Search Engine Optimization [Dataset]. http://doi.org/10.5281/zenodo.4416123
    Authors
    Goran Matošević; Jasminka Dobša; Dunja Mladenić
    Description

    Data of investigation published in the article: "Using Machine Learning for Web Page Classification in Search Engine Optimization" Abstract of the article: This paper presents a novel approach of using machine learning algorithms based on experts’ knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations—classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classification of samples in the majority class (48.83%). Practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. Also, the results of this study contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to be taken into consideration when preparing a web page are page title, meta description, H1 tag (heading), and body text—which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.

  7. Empirical Analysis of Ranking Models for an Adaptable Dataset Search:...

    • figshare.com
    zip
    Updated Jun 2, 2023
    Cite
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme; Marco Antonio Casanova (2023). Empirical Analysis of Ranking Models for an Adaptable Dataset Search: complementary material [Dataset]. http://doi.org/10.6084/m9.figshare.5620651.v4
    Available download formats: zip
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme; Marco Antonio Casanova
    License

    https://www.gnu.org/licenses/gpl-3.0.html

    Description

    This repository contains performance measures of dataset ranking models. Usage: from Results/src, run Python results m1 m2 ... where each mi can be omitted or be any element of the list of model labels ['bayesian-12C', 'bayesian-5L', 'bayesian-5L12C', 'cos-12C', 'cos-5L', 'cos-5L5C', 'j48-12C', 'j48-5L', 'j48-5L5C', 'jrip-12C', 'jrip-5L', 'jrip-5L5C', 'sn-12C', 'sn-5L', 'sn-5L12C']. Results of the selected models will be plotted in a 2D line plot. If no model is provided, all models will be listed.

  8. Data set of the article: Ranking by relevance and citation counts, a...

    • zenodo.org
    bin
    Updated Jan 24, 2020
    Cite
    Cristòfol Rovira; Lluís Codina; Frederic Guerrero-Solé; Carlos Lopezosa (2020). Data set of the article: Ranking by relevance and citation counts, a comparative study: Google Scholar, Microsoft Academic, WoS and Scopus [Dataset]. http://doi.org/10.5281/zenodo.3381151
    Available download formats: bin
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristòfol Rovira; Lluís Codina; Frederic Guerrero-Solé; Carlos Lopezosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data of investigation published in the article "Ranking by relevance and citation counts, a comparative study: Google Scholar, Microsoft Academic, WoS and Scopus".

    Abstract of the article:

    Search engine optimization (SEO) constitutes the set of methods designed to increase the visibility of, and the number of visits to, a web page by means of its ranking on the search engine results pages. Recently, SEO has also been applied to academic databases and search engines, in a trend that is in constant growth. This new approach, known as academic SEO (ASEO), has generated a field of study with considerable future growth potential due to the impact of open science. The study reported here forms part of this new field of analysis. The ranking of results is a key aspect in any information system since it determines the way in which these results are presented to the user. The aim of this study is to analyse and compare the relevance ranking algorithms employed by various academic platforms to identify the importance of citations received in their algorithms. Specifically, we analyse two search engines and two bibliographic databases: Google Scholar and Microsoft Academic, on the one hand, and Web of Science and Scopus, on the other. A reverse engineering methodology is employed based on the statistical analysis of Spearman’s correlation coefficients. The results indicate that the ranking algorithms used by Google Scholar and Microsoft are the two that are most heavily influenced by citations received. Indeed, citation counts are clearly the main SEO factor in these academic search engines. An unexpected finding is that, at certain points in time, WoS used citations received as a key ranking factor, despite the fact that WoS support documents claim this factor does not intervene.
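    To illustrate the kind of statistic the abstract refers to, the sketch below computes Spearman's rank correlation between result positions and citation counts in plain Python. It is not the authors' analysis code: the function names are hypothetical, ties are handled with average ranks, and the sample numbers used afterwards are invented for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

    If the first result always has the most citations, the second the next most, and so on, position and citation count are perfectly inversely ranked and the coefficient is -1; values near 0 would suggest citations play little role in the ordering.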

  9. Data from: A dataset on the evaluation of the accessibility of the home...

    • ieee-dataport.org
    • observatorio-cientifico.ua.es
    • +2more
    Updated Aug 26, 2021
    Cite
    Milton Campoverde Molina (2021). A dataset on the evaluation of the accessibility of the home pages of the web portals of Ecuadorian higher education institutions ranked in Webometrics [Dataset]. https://ieee-dataport.org/documents/dataset-evaluation-accessibility-home-pages-web-portals-ecuadorian-higher-education
    Authors
    Milton Campoverde Molina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Ecuador
    Description

    This research aims to evaluate the accessibility of the home pages of the web portals of the Ecuadorian higher education institutions ranked in Webometrics, using the Web Content Accessibility Guidelines (WCAG) 2.1 of the World Wide Web Consortium.

  10. Dataset covidgilance signals

    • zenodo.org
    bin, csv +3
    Updated Sep 25, 2020
    Cite
    Gaudinat Arnaud (2020). Dataset covidgilance signals [Dataset]. http://doi.org/10.5281/zenodo.4048460
    Available download formats: csv, tsv, bin, text/x-python, txt
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gaudinat Arnaud
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Research datasets about the top signals for COVID-19 (coronavirus), for study with Google Trends (GT) and SEO metrics.

    Website

    The study is currently published on the https://covidgilance.org website (in French).

    Datasets description

    covid signals -> |selection| -> 4 dataset -> |serp.py| -> 4 serp datasets -> |aggregate_serp.pl| -> 4 aggregated dataset of serp -> |prepare datasets| -> 4 ranked top seo dataset

    Original lists of signals (mainly covid symptoms) - dataset

    Description: contains the original list of relevant signals for covid19 (here, a list of queries for which a relevant signal is visible in GT during the covid 19 period)
    Name: covid_signal_list.tsv

    List of content:

    - id: unique id for the topic
    - topic-fr: name of the topic in French
    - topic-en: name of the topic in English
    - topic-id: GT topic id
    - keyword fr: one or several keywords in French for GT
    - keyword en: one or several keywords in English for GT
    - fr-topic-url-12M: link to 12-months French query topic in GT in France
    - en-topic-url-12M: link to 12-months English query topic in GT in US
    - fr-url-12M: link to 12-months French queries in GT in France
    - en-url-12M: link to 12-months English queries topic in GT in US
    - fr-topic-url-5M: link to 5-months French query topic in GT in France
    - en-topic-url-5M: link to 5-months English query topic in GT in US
    - fr-url-5M: link to 5-months French queries in GT in France
    - en-url-5M: link to 5-months English queries topic in GT in US

    Tool to get SERP of covid signals - tool

    Description: query google with a list of covid signals and obtain a list of serps in csv (tsv in fact) file format
    Name: serper.py

    python serper.py

    SERP files - datasets

    Description: SERP results for 4 datasets of queries
    Names:
    simple version of covid signals from google.ch in French: serp_signals_20_ch_fr.csv
    simple version of covid signals from google.com in English: serp_signals_20_en.csv
    amplified version of covid signals from google.ch in French: serp_signals_covid_20_ch_fr.csv
    amplified version of covid signals from google.com in English: serp_signals_covid_20_en.csv

    Amplified version means that for each query we create two queries, one with the keyword "covid" and one with "coronavirus".

    Tool to aggregate SERP results - tool

    Description: load csv serp data and aggregate the data to create a new csv file where each line is a website and each column is a query.
    Name: aggregate_serp.pl

    perl aggregate_serp.pl > aggregated_signals_20_en.csv

    datasets of top website from the SERP results - dataset

    Description: an aggregated version of the SERP where each line is a website and each column a query
    Names:
    aggregated_signals_20_ch_fr.csv
    aggregated_signals_20_en.csv
    aggregated_signals_covid_20_ch_fr.csv
    aggregated_signals_covid_20_en.csv

    List of content:

    - domain: domain name of the website
    - signal 1: Position of query 1 (signal 1) in the SERP, where 30 arbitrarily indicates that this website is not present in the SERP
    - signal ...: Position of the query (signal) in the SERP, where 30 arbitrarily indicates that this website is not present in the SERP
    - signal n: Position of query n (signal n) in the SERP, where 30 arbitrarily indicates that this website is not present in the SERP
    - total: average position (total of all positions divided by the number of queries)
    - missing: Total number of missing results in the SERP for this website
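    The aggregation described above (one line per website, one column per query, 30 as the placeholder for an absent website) can be sketched in Python as follows. This is not the author's aggregate_serp.pl script: the function name, the input tuple layout, and the choice to keep the best position when a domain appears several times for one query are all assumptions made for illustration.

```python
from collections import defaultdict

MISSING_POSITION = 30  # arbitrary placeholder used when a website is absent from a SERP


def aggregate_serp(rows, queries):
    """Pivot (query, domain, position) rows into one record per domain.

    Assumption: if a domain appears more than once for a query,
    its best (lowest) position is kept.
    """
    best = defaultdict(dict)
    for query, domain, position in rows:
        previous = best[domain].get(query, MISSING_POSITION)
        best[domain][query] = min(previous, position)

    table = {}
    for domain, found in best.items():
        positions = [found.get(q, MISSING_POSITION) for q in queries]
        table[domain] = {
            "positions": positions,                      # one entry per query column
            "total": sum(positions) / len(queries),      # average position
            "missing": sum(1 for q in queries if q not in found),
        }
    return table
```

    A domain found at position 2 for one of two queries and absent for the other gets positions [2, 30], an average of 16.0, and a missing count of 1.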

    datasets ranked top seo - dataset

    Description: a ranked (by weighted average position) version of the aggregated SERP where each line is a website and each column a query. The TOP 20 have more information about the type and HONcode validity (as of the collection date: September 2020)

    Names:
    ranked_signals_20_ch_fr.csv
    ranked_signals_20_en.csv
    ranked_signals_covid_20_ch_fr.csv
    ranked_signals_covid_20_en.csv

    List of content:

    - domain: domain name of the website
    - signal 1: Position of query 1 (signal 1) in the SERP, where 30 arbitrarily indicates that this website is not present in the SERP
    - signal ...: Position of the query (signal) in the SERP, where 30 arbitrarily indicates that this website is not present in the SERP
    - signal n: Position of query n (signal n) in the SERP, where 30 arbitrarily indicates that this website is not present in the SERP
    - avg position: average position (total of all positions divided by the number of queries)
    - nb missing: Total number of missing results in the SERP for this website
    - % presence: % of presence
    - weighted avg postion: combination of avg position and % of presence for final ranking
    - honcode: status of the Honcode certificate for this website (none/valid/expired)
    - type: type of the website (health, gov, edu or media)

  11. Data from: Webis-Web-Archive-17

    • data.niaid.nih.gov
    • webis.de
    • +2more
    Updated Jul 19, 2024
    + more versions
    Cite
    Stein, Benno (2024). Webis-Web-Archive-17 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1002203
    Dataset provided by
    Kiesel, Johannes
    Potthast, Martin
    Hagen, Matthias
    Kneist, Florian
    Stein, Benno
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to involve a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality. See this overview for all datasets that built upon this one. If you use this dataset in your research, please cite it using this paper.

  12. Traces captured by visiting the top 1500 website

    • kaggle.com
    zip
    Updated Aug 25, 2021
    Cite
    DNS_dataset (2021). Traces captured by visiting the top 1500 website [Dataset]. https://www.kaggle.com/jacksontang16/traces-captured-by-visiting-the-top-1500-website
    Available download formats: zip (5852806 bytes)
    Authors
    DNS_dataset
    Description

    Dataset

    This dataset was created by DNS_dataset

    Contents

  13. Best Virtual Data Rooms 2024 Dataset

    • dataroom-providers.org
    Updated Sep 6, 2018
    Cite
    Dataroom Providers (2018). Best Virtual Data Rooms 2024 Dataset [Dataset]. https://dataroom-providers.org/
    Dataset authored and provided by
    Dataroom Providers
    Description

    The Best Virtual Data Rooms 2024 dataset was created to provide data room users and M&A specialists with detailed information on the best virtual data rooms. The dataset contains a description of each data room solution and its rating.

  14. Yahoo-Learning-to-Rank-Challenge

    • huggingface.co
    Updated Dec 15, 2024
    Cite
    Yahoo-Research (2024). Yahoo-Learning-to-Rank-Challenge [Dataset]. https://huggingface.co/datasets/YahooResearch/Yahoo-Learning-to-Rank-Challenge
    Dataset provided by
    Yahoo! (https://tw.yahoo.com/)
    Yahoo! Research
    Authors
    Yahoo-Research
    License

    https://choosealicense.com/licenses/other/

    Description

    Yahoo! Learning to Rank Challenge, version 1.0

    Machine learning has been successfully applied to web search ranking, and the goal of this dataset is to benchmark such machine learning algorithms. The dataset consists of features extracted from (query,url) pairs along with relevance judgments. The queries, URLs and feature descriptions are not given; only the feature values are. There are two datasets in this distribution: a large one and a small one. Each dataset is divided in 3 sets:… See the full description on the dataset page: https://huggingface.co/datasets/YahooResearch/Yahoo-Learning-to-Rank-Challenge.

  15. MSLR WEB30K Dataset

    • paperswithcode.com
    Updated Apr 14, 2025
    + more versions
    Cite
    Tao Qin; Tie-Yan Liu (2025). MSLR WEB30K Dataset [Dataset]. https://paperswithcode.com/dataset/mslr-web30k
    Explore at:
    Dataset updated
    Apr 14, 2025
    Authors
    Tao Qin; Tie-Yan Liu
    Description

    The datasets are machine learning data, in which queries and urls are represented by IDs. The datasets consist of feature vectors extracted from query-url pairs along with relevance judgment labels:

    (1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values from 0 (irrelevant) to 4 (perfectly relevant).

    (2) The features were extracted by us and are those widely used in the research community.

    In the data files, each row corresponds to a query-url pair. The first column is the relevance label of the pair, the second column is the query id, and the following columns are features. The larger the relevance label, the more relevant the query-url pair. A query-url pair is represented by a 136-dimensional feature vector.
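
    The row layout described above can be sketched as a small parser. This is a minimal sketch, not the official loader: it assumes the space-separated `label qid:<id> <index>:<value> ...` token layout with 1-based feature indices, which the description does not spell out, and the names `Row` and `parse_row` are hypothetical.

```python
from typing import List, NamedTuple

NUM_FEATURES = 136  # feature-vector size stated in the description


class Row(NamedTuple):
    relevance: int         # 0 (irrelevant) .. 4 (perfectly relevant)
    query_id: str
    features: List[float]


def parse_row(line: str) -> Row:
    """Parse one query-url row.

    Assumption: tokens look like 'label qid:<id> 1:<v> 2:<v> ... 136:<v>'.
    """
    tokens = line.split()
    relevance = int(tokens[0])
    if not 0 <= relevance <= 4:
        raise ValueError(f"relevance label out of range: {relevance}")
    # Assumption: the second column is written as 'qid:<id>'.
    query_id = tokens[1].split(":", 1)[1]
    # Features missing from the row are left at 0.0.
    features = [0.0] * NUM_FEATURES
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index) - 1] = float(value)  # assumed 1-based indices
    return Row(relevance, query_id, features)


if __name__ == "__main__":
    row = parse_row("2 qid:10 1:3 2:0.5 136:1")
    print(row.relevance, row.query_id, len(row.features))
```

    Grouping the parsed rows by `query_id` would then give the per-query lists that learning-to-rank methods typically train on.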

  16. Real Estate Data | Property Listing, Sold Properties, Rankings, Agent...

    • datarade.ai
    Cite
    Grepsr, Real Estate Data | Property Listing, Sold Properties, Rankings, Agent Datasets | Global Coverage | For Competitive Property Pricing and Investment [Dataset]. https://datarade.ai/data-products/real-estate-property-data-grepsr-grepsr
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Grepsr
    Area covered
    Malaysia, Kazakhstan, Congo (Democratic Republic of the), Iraq, Australia, Spain, South Sudan, Tonga, Holy See, Kuwait
    Description

    Extract detailed property data points (address, URL, prices, floor space, overview, parking, agents, and more) from any real estate listings. The Rankings data contains the ranking of properties as they appear in the search engine results pages (SERPs) of different property listing sites. Furthermore, with our real estate agents' data, you can contact real estate agents and brokers directly by email or phone.

    A. Use cases and applications possible with the data:

    1. Property pricing - Accurate property data for real estate valuation. Gather information about properties and their valuations from Federal, State, or County level websites. Monitor the real estate market across the country and decide the best time to buy or sell based on data.

    2. Secure your real estate investment - Monitor foreclosures and auctions to identify investment opportunities. Identify areas within special economic and opportunity zones such as QOZs - cross-map that with commercial or residential listings to identify leads. Ensure the safety of your investments, property, and personnel by analyzing crime data prior to investing.

    3. Identify hot, emerging markets - Gather rent, demographic, and population data to expand retail and e-commerce businesses and drive better investment decisions.

    4. Profile a building’s retrofit history - A building permit is required before the start of any construction activity on a building, such as changing the building structure, remodeling, or installing new equipment. Many large cities also provide public datasets of historical building permits. Use building permits to profile a city’s building retrofit history.

    5. Study market changes - New construction data helps measure and evaluate the size, composition, and changes occurring within the housing and construction sectors.

    6. Finding leads - Property records can reveal a wealth of information, such as how long an owner has currently lived in a home. US Census Bureau data and City-Data.com provide profiles of towns and city neighborhoods as well as demographic statistics. This data is available for free and can help agents increase their expertise in their communities and get a feel for the local market.

    7. Searching for Targeted Leads - Focusing on small, niche areas of the real estate market can sometimes be the most efficient method of finding leads. For example, targeting high-end home sellers may take longer to develop a lead, but the payoff could be greater. Or, you may have a special interest or background in a certain type of home that would improve your chances of connecting with potential sellers. In these cases, focused data searches may help you find the best leads and develop relationships with future sellers.

    How does it work?

    • Analyze sample data
    • Customize parameters to suit your needs
    • Add to your projects
    • Contact support for further customization
  17. Dataset Search WebApp

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme (2023). Dataset Search WebApp [Dataset]. http://doi.org/10.6084/m9.figshare.5217958.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Angelo Batista Neves Júnior; Luiz André Portes Paes Leme
    License

    https://www.gnu.org/copyleft/gpl.html

    Description

    Although extensive lists of open datasets are available in catalogues, most data publishers still link their datasets to other popular datasets, such as DBpedia, Freebase and Geonames. Although linking to popular datasets allows external resources to be explored, it fails to cover highly specialized information. Catalogues of linked data describe the content of datasets in terms of update periodicity, authors, SPARQL endpoints, and linksets with other datasets, among others, as recommended by the W3C VoID Vocabulary. However, catalogues by themselves do not provide any explicit information to help the URI linkage process. Search techniques can rank the available datasets S_i according to the probability that links can be defined between URIs of S_i and a given dataset T to be published, so that most of the links, if not all, can be found by inspecting the most relevant datasets in the ranking. dataset-search is a tool for searching datasets for linkage.

  18. ‘QS World University Rankings 2017 - 2022’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Aug 1, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘QS World University Rankings 2017 - 2022’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-qs-world-university-rankings-2017-2022-7fc4/d793e726/?iid=007-103&v=presentation
    Explore at:
    Dataset updated
    Aug 1, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘QS World University Rankings 2017 - 2022’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/padhmam/qs-world-university-rankings-2017-2022 on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    QS World University Rankings is an annual publication of global university rankings by Quacquarelli Symonds. The QS ranking receives approval from the International Ranking Expert Group (IREG), and is viewed as one of the three most widely read university rankings in the world. QS publishes its university rankings in partnership with Elsevier.

    Content

    This dataset contains university data from the year 2017 to 2022. It has a total of 15 features:

    - university: name of the university
    - year: year of ranking
    - rank_display: rank given to the university
    - score: score of the university based on the six key metrics mentioned above
    - link: link to the university profile page on the QS website
    - country: country in which the university is located
    - city: city in which the university is located
    - region: continent in which the university is located
    - logo: link to the logo of the university
    - type: type of university (public or private)
    - research_output: quality of research at the university
    - student_faculty_ratio: number of students per faculty member
    - international_students: number of international students enrolled at the university
    - size: size of the university in terms of area
    - faculty_count: number of faculty or academic staff at the university

    Acknowledgements

    This dataset was acquired by scraping the QS World University Rankings website with Python and Selenium.

    Inspiration

    Some of the questions that can be answered with this dataset:

    1. What makes a top-ranked university?
    2. Does the location of a university play a role in its ranking?
    3. What do the best universities have in common?
    4. How important is academic research for a university?
    5. Which country is preferred by international students?

    --- Original source retains full ownership of the source dataset ---

  19. Cross-language corpora of privacy policies

    • zenodo.org
    • explore.openaire.eu
    csv, zip
    Updated Jun 17, 2023
    Cite
    Francesco Ciclosi; Silvia Vidor; Fabio Massacci (2023). Cross-language corpora of privacy policies [Dataset]. http://doi.org/10.5281/zenodo.7729546
    Explore at:
    Available download formats: csv, zip
    Dataset updated
    Jun 17, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Ciclosi; Silvia Vidor; Fabio Massacci
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of three different privacy policy corpora (in English and Italian) composed of 81 unique privacy policy texts spanning the period 2018-2021. This dataset makes available an example of three corpora of privacy policies. The first corpus is the English-language corpus, the original used in the study by Tang et al. [2]. The other two are cross-language corpora built (one, the source corpus, in English, and the other, the replication corpus, in Italian, which is the language of a potential replication study) from the first corpus.

    The policies were collected from:

    1. the Alexa top 10 Italy and U.S. websites rank;
    2. the Play Store apps rank in the "most profitable games" category of the Play Store for Italy and the U.S.

    We manually analyzed the Alexa top 10 Italy websites as of November 2021. Analogously, we analyzed selected apps that, in the same period, had ranked better in the "most profitable games" category of the Play Store for Italy.

    All the privacy policies are ANSI-encoded text files and have been manually read and verified.
    The dataset is helpful as a starting point for building comparable cross-language privacy policies corpora. The availability of these comparable cross-language privacy policies corpora helps replicate studies in different languages.
    Details on the methodology can be found in the accompanying paper.

    The available files are as follows:

    • policies-texts.zip --> contains a directory of text files with the policy texts. File names are the SHA1 hashes of the policy text.
    • policy-metadata.csv --> a CSV file with the metadata for each privacy policy.
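
    Because the file names are said to be the SHA1 hashes of the policy texts, an extracted copy of the archive can be checked for integrity. The sketch below is illustrative only: it assumes the hash was computed over each file's raw bytes and that any file extension is not part of the hash, neither of which the description confirms. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def sha1_of_file(path: Path) -> str:
    """Return the hex SHA1 digest of the file's raw bytes."""
    return hashlib.sha1(path.read_bytes()).hexdigest()


def mismatched_files(directory: Path) -> List[str]:
    """List files whose name (without extension) differs from their SHA1 digest.

    Assumption: the hash covers the raw file bytes, not a decoded string.
    """
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.stem.lower() != sha1_of_file(p)
    )
```

    Calling `mismatched_files(Path("policies-texts"))` on the extracted directory (the directory name is assumed from the archive name) returns an empty list if every file matches. A non-empty result may only mean that the hashes were computed differently, for example over decoded text rather than raw bytes.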

    This dataset is the original dataset used in the publication [1]. The original English U.S. corpus is described in the publication [2].

    [1] F. Ciclosi, S. Vidor and F. Massacci. "Building cross-language corpora for human understanding of privacy policies." Workshop on Digital Sovereignty in Cyber Security: New Challenges in Future Vision. Communications in Computer and Information Science. Springer International Publishing, 2023, In press.

    [2] J. Tang, H. Shoemaker, A. Lerner, and E. Birrell. Defining Privacy: How Users Interpret Technical Terms in Privacy Policies. Proceedings on Privacy Enhancing Technologies, 3:70–94, 2021.

  20. Data articles in journals

    • zenodo.org
    bin, csv, txt
    Updated Sep 21, 2023
    Cite
    Carlota Balsa-Sanchez; Vanesa Loureiro (2023). Data articles in journals [Dataset]. http://doi.org/10.5281/zenodo.7458466
    Explore at:
    Available download formats: bin, txt, csv
    Dataset updated
    Sep 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carlota Balsa-Sanchez; Vanesa Loureiro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Last Version: 4

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2022/12/15

    General description: Datasets can be published according to the FAIR principles by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v4.xlsx: full list of 140 academic journals in which data papers and/or software papers could be published
    - data_articles_journal_list_v4.csv: full list of 140 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 4th version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR), Scopus and Web of Science (WOS), Journal Master List.

    Version: 3

    Authors: Carlota Balsa-Sánchez, Vanesa Loureiro

    Date of data collection: 2022/10/28

    General description: Datasets can be published according to the FAIR principles by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v3.xlsx: full list of 124 academic journals in which data papers and/or software papers could be published
    - data_articles_journal_list_3.csv: full list of 124 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 3rd version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).

    Erratum - Data articles in journals Version 3:

    Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
    Data -- ISSN 2306-5729 -- JCR (JIF) n/a
    Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a

    Version: 2

    Author: Francisco Rubio, Universitat Politècnica de València.

    Date of data collection: 2020/06/23

    General description: Datasets can be published according to the FAIR principles by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers.
    File list:

    - data_articles_journal_list_v2.xlsx: full list of 56 academic journals in which data papers and/or software papers could be published
    - data_articles_journal_list_v2.csv: full list of 56 academic journals in which data papers and/or software papers could be published

    Relationship between files: both files have the same information. Two different formats are offered to improve reuse

    Type of version of the dataset: final processed version

    Versions of the files: 2nd version
    - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types
    - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Scimago Journal and Country Rank (SJR)

    Total size: 32 KB

    Version 1: Description

    This dataset contains a list of journals that publish data articles, code, software articles and database articles.

    The search strategy in DOAJ and Ulrichsweb was to search for the word "data" in the journal titles.
    Acknowledgements:
    Xaquín Lores Torres for his invaluable help in preparing this dataset.
