Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. 𝑄𝑏𝑖𝑎𝑠 - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, and slant other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship, left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags, that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represents the search query suggestions at the respective positions returned by the search engines at the given time of search "datetime". We scraped our data from a US server saved in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool, that allows for the automatic retrieval of all available articles at the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, that scrapes all available AllSides news articles and gathers available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data. Purpose As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication. The goals of this study were to establish where agricultural researchers in the United States-- land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies -- currently publish their data, including general research data repositories, domain-specific databases, and the top journals compare how much data is in institutional vs. domain-specific vs. federal platforms determine which repositories are recommended by top journals that require or recommend the publication of supporting data ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data Approach The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered. Search methods We first compiled a list of known domain specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” /“ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects. We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories. Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGEA, Protein Data Bank, and Zenodo. Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals were compiled, in which USDA published in 2012 and 2016, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each the top 50 journals. Evaluation We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind. Results A summary of the major findings from our data review: Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors. There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection. Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation. See included README file for descriptions of each individual data file in this dataset. Resources in this dataset:Resource Title: Journals. File Name: Journals.csvResource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csvResource Title: TDWG presentation. File Name: TDWG_Presentation.pptxResource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csvResource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csvResource Title: General repositories containing ag data. File Name: general_repos_1.csvResource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for the forthcoming publication 'Contested Components: Studying Interface Enrichment as a Form of Content Moderation on Google and Bing'.
Datasets contain information on SERP components for Google Bing when querying 2000 controversial and 914 non-controversial questions.
Files include:
question_data.csv
: Information on questions sourced from 4chan and leftychan boards in November 2024. Columns include the counts per board (/fit/, /b/, /pol/. /int/, /k/, /lgbt/, and /leftypol/), categorization as controversial/non-controverial, and toxicity scores determined by Perspective API.serp_components.csv
: Information on the SERP data gathered using Zoekplaatje. Collected on 24 November 2024.screenshots.zip
: Screenshots of all SERPs. Note that at times, expanding the AI Overview box on Google resulted in the search bar overlaying the generated text.component_analysis.ipynb
: Code for analyzing the data.
Search engine |
Component name |
Count |
Example |
|
organic |
19,470 | |
Bing |
organic |
18,677 | |
|
related-questions |
1,534 | |
Bing |
related-queries |
1,425 | |
Bing |
info-card |
1,320 |
(each card is its own info-card component) |
|
related-queries |
1,289 | |
Bing |
organic-answer |
1,140 |
(often summarised through AI-assisted means) |
Bing |
video-widget |
776 | |
Bing |
organic-showcase |
752 | |
Bing |
related-questions |
725 | |
|
ai-overview |
499 | |
Bing |
organic-wiki-widget |
271 | |
|
did-you-mean |
223 | |
Bing |
related-queries-carousel |
219 | |
Bing |
info-card-image |
136 | |
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset contains the data produced for the dissertation ""User Interface Variations in Search Engine Results Pages Across Types of Search Queries and Search Engines"". The project was conducted by student Adelaide Miranda Santos at FEUP, University of Porto, as part of the Masters in Informatics and Computing Engineering. The primary objective of this work is to study interface variations in search engine results pages (SERPs) across different search engines and types of search queries. To this end, nearly 8,000 SERPs were captured using the ORCAS-I-gold dataset across six leading web search engines: Google, Microsoft Bing, Yandex, Yahoo!, Baidu, and DuckDuckGo. For each captured SERP, the number of occurrences of each interface element was recorded. Additionally, to analyze how the language of a search query affects SERP composition in Yandex and Baidu, the original English queries were translated into Russian and Simplified Chinese." The dataset is organized in the following folders: Search Query Dataset Translation Contains the search queries from the ORCAS-I-gold dataset translated into Russian and Simplified Chinese. The translation was made using ChatGPT-4o and verified by native speakers. In addition to the translated queries, the complete original ORCAS-I-gold dataset is also included as an independent resource. SERP Captures Includes HTML files of the search engine results pages collected from Baidu, Microsoft Bing, DuckDuckGo, Google, Yahoo!, and Yandex. Each top-level subfolder is named after the respective search engine. Within each of these, there are folders named according to the language and the query intent associated with the search query. These folders contain the corresponding SERP HTML files. File names represent the search queries and may be either encoded or displayed as in the original dataset. Occurrence of Elements per SERP For each captured SERP, we recorded the frequency of each interface element. This data is organized in a relational database structure composed of the following CSV files: - elements.csv: Lists all identified SERP elements along with their corresponding IDs, categories, types, and subtypes (if applicable). - identifiers.csv: Contains the selectors or identifiers used for automatic detection of each element, along with their associated element ID, identifier ID, and the corresponding search engine ID. - intents.csv: Maps query intent names to their corresponding intent IDs. - search-engines.csv: Maps search engine names to their corresponding IDs. - main.csv: Records the frequency of each element in each captured SERP. Each row represents an observation and includes the following fields: element ID, identifier ID, search engine ID, query language, intent ID, query ID (as defined in the ORCAS-I-gold dataset), and the number of occurrences.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The current dataset is consisted of 200 search results extracted from Google and Bing engines (100 of Google and 100 of Bing). The search terms are selected from the 10 most search keywords of 2021 based on the provided data of Google Trends. The rest of the sheets include the performance of the websites according to three technical evaluation aspects. That is, SEO, Speed and Security. The performance dataset has been developed through the utilization of CheckBot crawling tool. The whole dataset can help information retrieval scientists to compare the two engines in terms of their position/ranking and their performance related to these factors.
For more information about the thinking of the of the structure of the dataset please contact the Information Management Lab of University of West Attica.
Contact Persons: Vasilis Ntararas (lb17032@uniwa.gr) , Georgios Ntimo (lb17100@uniwa.gr) and Ioannis C. Drivas (idrivas@uniwa.gr)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was extracted for a study on the evolution of Web search engine interfaces since their appearance. The well-known list of “10 blue links” has evolved into richer interfaces, often personalized to the search query, the user, and other aspects. We used the most searched queries by year to extract a representative sample of SERP from the Internet Archive. The Internet Archive has been keeping snapshots and the respective HTML version of webpages over time and tts collection contains more than 50 billion webpages. We used Python and Selenium Webdriver, for browser automation, to visit each capture online, check if the capture is valid, save the HTML version, and generate a full screenshot. The dataset contains all the extracted captures. Each capture is represented by a screenshot, an HTML file, and a files' folder. We concatenate the initial of the search engine (G) with the capture's timestamp for file naming. The filename ends with a sequential integer "-N" if the timestamp is repeated. For example, "G20070330145203-1" identifies a second capture from Google by March 30, 2007. The first is identified by "G20070330145203". Using this dataset, we analyzed how SERP evolved in terms of content, layout, design (e.g., color scheme, text styling, graphics), navigation, and file size. We have registered the appearance of SERP features and analyzed the design patterns involved in each SERP component. We found that the number of elements in SERP has been rising over the years, demanding a more extensive interface area and larger files. This systematic analysis portrays evolution trends in search engine user interfaces and, more generally, web design. We expect this work will trigger other, more specific studies that can take advantage of the dataset we provide here. This graphic represents the diversity of captures by year and search engine (Google and Bing).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IndianFoodNet-30 is created by Ritu Agarwal, Nikunj Bansal, Tanupriya Choudhury, Tanmay Sarkar & Neelu Jyothi Ahuja with a goal of building an Indian Food detection model. It contains more than 5500 images of 30 popular Indian food items.
We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.
This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes.(adapted from Christopher Thomas).
If you find our dataset useful, please cite us as:
@dataset{dataset,
author = {Agarwal, Ritu and Bansal, Nikunj and Choudhury, Tanupriya and Sarkar, Tanmay and J.Ahuja, Neelu},
year = {2023},
title = {IndianFoodNet-30 Dataset},
publisher = {Roboflow Universe},
url = {https://universe.roboflow.com/indianfoodnet/indianfoodnet},
}
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
IndianFood-7 is created by Ritu Agarwal, Nikunj Bansal, Tanmay Sarkar, Tanupriya Choudhury and Neelu Jyothi Ahuja with a goal of building a Indian Food detection model. It contains more than 800 images of 7 popular Indian food items.
We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.
This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes.(adapted from Christopher Thomas).
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
bundestag.csv - UTF-8 encoded comma separated text file
This dataset contains the members of the 18th German Bundestag in the constitution of late 2016.
terms.csv - UTF-8 encoded comma separated text file
This dataset contains the unordered and pooled auto-completions for the German politicians from Bing search (http://api.bing.net/osjson.aspx), from Duck-Duck-Go (https://duckduckgo.com/ac/) and from Google search (http://clients1.google.de/complete/search). The data was crawled on (mostly) two times per day from 2017/02/03 to 2017/06/19. German language settings were used for Google and Bing, English language setting was used for Duck-Duck-Go. The API requests were sent with an IP address from Cologne, Germany.
: google, bing or ddg
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the query suggestions for 5 major German parties (terms: "afd", "csu", "dielinke", "fdp", "grüne", "spd") and ten popular politicians and party leaders (terms: "Alexander Gauland", "Alice Weidel", "Angela Merkel", "Cem Özdemir", "Christian Lindner", "Dietmar Bartsch", "Katrin Göring-Eckardt", "Martin Schulz", "Sahra Wagenknecht").
The data was crawled on (mostly) two times per day from Tue Aug 04, 2017 to Tue Oct 31, 2017. The dataset contains 20001 suggestions from Bing search (http://api.bing.net/osjson.aspx), 11935 suggestions from Duck-Duck-Go (https://duckduckgo.com/ac/) and 33521 suggestions from Google search (http://clients1.google.de/complete/search). Note, that for some terms and dates no suggestions were returned by some of the APIs.
German language settings were used for Google and Bing, English language setting was used for Duck-Duck-Go. The API requests were sent with an IP address from Cologne, Germany.
google, bing or ddg
the query term
the date and time of the API call formatted as ISO8601
the suggested query completion (the query term was removed from the suggestion)
the position of the query suggestion within the list returned by the API (ranges from 0 to 19)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Allergen30 is created by Mayank Mishra, Nikunj Bansal, Tanmay Sarkar and Tanupriya Choudhury with a goal of building a robust detection model that can assist people in avoiding possible allergic reactions.
It contains more than 6,000 images of 30 commonly used food items which can cause an adverse reaction within a human body. This dataset is one of the first research attempts in training a deep learning based computer vision model to detect the presence of such food items from images. It also serves as a benchmark for evaluating the efficacy of object detection methods in learning the otherwise difficult visual cues related to food items.
There are multiple food items pertaining to specific food intolerances which can trigger an allergic reaction. Such food intolerance primarily include Lactose, Histamine, Gluten, Salicylate, Caffeine and Ovomucoid intolerance.
https://github.com/mmayank74567/mmayank74567.github.io/blob/master/images/FoodIntol.png?raw=true" alt="Food intolerance">
The following table contains the description relating to the 30 class labels in our dataset.
S. No. | Allergen | Food label | Description |
---|---|---|---|
1 | Ovomucoid | egg | Images of egg with yolk (e.g. sunny side up eggs) |
2 | Ovomucoid | whole_egg_boiled | Images of soft and hard boiled eggs |
3 | Lactose/Histamine | milk | Images of milk in a glass |
4 | Lactose | icecream | Images of icecream scoops |
5 | Lactose | cheese | Images of swiss cheese |
6 | Lactose/ Caffeine | milk_based_beverage | Images of tea/ coffee with milk in a cup/glass |
7 | Lactose/Caffeine | chocolate | Images of chocolate bars |
8 | Caffeine | non_milk_based_beverage | Images of soft drinks and tea/coffee without milk in a cup/glass |
9 | Histamine | cooked_meat | Images of cooked meat |
10 | Histamine | raw_meat | Images of raw meat |
11 | Histamine | alcohol | Images of alcohol bottles |
12 | Histamine | alcohol_glass | Images of wine glasses with alcohol |
13 | Histamine | spinach | Images of spinach bundle |
14 | Histamine | avocado | Images of avocado sliced in half |
15 | Histamine | eggplant | Images of eggplant |
16 | Salicylate | blueberry | Images of blueberry |
17 | Salicylate | blackberry | Images of blackberry |
18 | Salicylate | strawberry | Images of strawberry |
19 | Salicylate | pineapple | Images of pineapple |
20 | Salicylate | capsicum | Images of bell pepper |
21 | Salicylate | mushroom | Images of mushrooms |
22 | Salicylate | dates | Images of dates |
23 | Salicylate | almonds | Images of almonds |
24 | Salicylate | pistachios | Images of pistachios |
25 | Salicylate | tomato | Images of tomato and tomato slices |
26 | Gluten | roti | Images of roti |
27 | Gluten | pasta | Images of one serving of penne pasta |
28 | Gluten | bread | Images of bread slices |
29 | Gluten | bread_loaf | Images of bread loaf |
30 | Gluten | pizza | Images of pizza and pizza slices |
We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.
This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes.(adapted from Christopher Thomas).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset is a wheat breeding multimodal dataset, including wheat germplasm data, wheat phenotypic data, wheat planting technique data, wheat plant protection technique data, wheat seed price data, UAV remote sensing data, and weather data. The data sources are field acquisition and online public data.
Wheat germplasm data comes from the China Crop Germplasm Information Network (https://www.cgris.net/). The data on wheat planting technique and wheat plant protection technique come from search engines (Google, Bing, Baidu), and the search terms include "wheat planting technique, wheat cultivation technique, wheat plant protection technique, 小麦种植技术, 小麦栽培技术, 小麦植保技术". Wheat seed price data comes from the National Seed Market Monitoring Information Release Platform (http://202.127.45.18/) - China. UAV remote sensing data is the result of further processing after being obtained from field experiments. Weather data comes from meteorological equipment at various agricultural experimental bases and meteorological observation stations of the China Meteorological Administration.
The acquisition and processing of data are described in the relevant part of the manuscript.
This dataset will be continuously updated in the future to help breeding work be carried out efficiently and accelerate the breeding process of excellent varieties.
This repository contains pre-release files for the paper entitled “Detecting Unsigned Physical Road Incidents from Driver-view Images”, released in IEEE Transactions on Intelligent Vehicles. This image dataset contains over 12.500 images of unsigned (without warning signs) road incidents spread out over 8 classes (fire, animal on road, road collapse, snow, rain, treefall, landslides, and vehicle crashes). It consists of a zip file which contains images from the Geograph project and their attribution, as well as a SQLite database which contains URLs to images harvested from Google, Bing, and Flickr. URLs marked as '0' have since disappeared from the web. Unfortunately, due to copyright concerns we cannot make these images available. Please contact the authors directly if you are interested in using the full dataset. We make this dataset available for non-commercial purposes only. The preferred citation for this dataset will be added once the paper has been released.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.htmlhttp://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
After designing an image classification model, the next step is to create a One Piece character detector to which this classifier will be applied.
The folders contain approximately 650 images of : Luffy, Zoro, Sanji, Nami, Franky, Brook, Chopper, Usopp, Jinbei, Robin, Ace, Law, Shanks, Kurohige, Mihawk, Rayleigh.
I do not own these images.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis Query-Task-Mapping Corpus 2019 (Webis-QTM-19) comprises three benchmark datasets on the query-task-mapping problem, which consists of finding the correct task for a new query in a given task-split background query log.
It comprises three subdatasets in separate CSV files, each of which has three columns:
Further details can be found in reference:
Michael Völske, Ehsan Fatehifar, Benno Stein, and Matthias Hagen. Query-Task Mapping. In 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR 2019), July 2019. ACM.
http://doi.acm.org/10.1145/3331184.3331286
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Appendix A1-A16 report the top 10 highly cited articles in Google Patents from the Bing API search in the sixteen selected fields. It shows that there are some minority articles with many patent citations but with few or no Scopus citation.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in
Fabian Haak and Philipp Schaer. 2023. 𝑄𝑏𝑖𝑎𝑠 - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.
Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)
The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news feature three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, and slant other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship, left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles. Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.
To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The repository also contains regularly updated more recent versions of the dataset with additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.
Dataset 2: Search Query Suggestions (suggestions.csv)
The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags, that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.
The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represents the search query suggestions at the respective positions returned by the search engines at the given time of search "datetime". We scraped our data from a US server saved in "location".
We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
AllSides Scraper
At https://github.com/irgroup/Qbias, we provide a scraping tool, that allows for the automatic retrieval of all available articles at the AllSides balanced news headlines.
We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, that scrapes all available AllSides news articles and gathers available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.