14 datasets found

Data from: Qbias – A Dataset on Media Bias in Search Queries and Query...
zenodo.org
data.niaid.nih.gov
csv
Updated Mar 1, 2023
Cite
Fabian Haak; Philipp Schaer (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. http://doi.org/10.5281/zenodo.7682915
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.7682915
Dataset updated
Mar 1, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Fabian Haak; Philipp Schaer
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We present Qbias, two novel datasets that promote the investigation of bias in online news search, as described in:

Fabian Haak and Philipp Schaer. 2023. Qbias - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news features three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles.
Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The more recent versions in the repository also carry additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

Dataset 2: Search Query Suggestions (suggestions.csv)

The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides biased news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.
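
The figures quoted above are mutually consistent, which a few lines of arithmetic can confirm. The snippet below is only a sanity check on the numbers stated in the description; the variable names are illustrative and not part of the dataset.

```python
# Figures quoted in the dataset description.
google_suggestions = 318_185
bing_suggestions = 353_484
topic_tags = 1_431

# Total number of suggestions across both search engines.
total = google_suggestions + bing_suggestions

# Mean number of suggestions per topic tag.
average_per_topic = total / topic_tags

print(total)                     # 671669
print(round(average_per_topic))  # 469
```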

The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions at the respective positions returned by the search engines at the given time of search ("datetime"). We scraped our data from a US server; its location is saved in "location".

We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
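
The query expansion described above can be sketched as follows. This is a minimal illustration, not the authors' scraper: the function name `expand_root_query` is hypothetical, and no request is sent to any search engine. It only shows how one root term plus the letters a to z yields 27 query inputs, and therefore up to 270 suggestions at ten suggestions per input.

```python
import string


def expand_root_query(root_term: str) -> list[str]:
    """Return the root term followed by 'root a' ... 'root z' (hypothetical helper)."""
    return [root_term] + [f"{root_term} {letter}" for letter in string.ascii_lowercase]


SUGGESTIONS_PER_INPUT = 10  # ten suggestions per query input, as stated in the description

inputs = expand_root_query("democrats")
print(len(inputs))                          # 27
print(inputs[:3])                           # ['democrats', 'democrats a', 'democrats b']
print(len(inputs) * SUGGESTIONS_PER_INPUT)  # 270
```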

AllSides Scraper

At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.

We want to provide an easy means of retrieving the news and all corresponding information. For many tasks it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper that scrapes all available AllSides news articles and gathers available information. By providing the scraper we facilitate access to a recent version of the dataset for other researchers.
Data from: Inventory of online public databases and repositories holding...
catalog.data.gov
agdatacommons.nal.usda.gov
Updated Apr 21, 2025
Cite
Agricultural Research Service (2025). Inventory of online public databases and repositories holding agricultural data in 2017 [Dataset]. https://catalog.data.gov/dataset/inventory-of-online-public-databases-and-repositories-holding-agricultural-data-in-2017-d4c81
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service (https://www.ars.usda.gov/)
Description
United States agricultural researchers have many options for making their data available online. This dataset aggregates the primary sources of ag-related data and determines where researchers are likely to deposit their agricultural data. These data serve as both a current landscape analysis and also as a baseline for future studies of ag research data.

Purpose

As sources of agricultural data become more numerous and disparate, and collaboration and open data become more expected if not required, this research provides a landscape inventory of online sources of open agricultural data. An inventory of current agricultural data sharing options will help assess how the Ag Data Commons, a platform for USDA-funded data cataloging and publication, can best support data-intensive and multi-disciplinary research. It will also help agricultural librarians assist their researchers in data management and publication.

The goals of this study were to:
- establish where agricultural researchers in the United States (land grant and USDA researchers, primarily ARS, NRCS, USFS and other agencies) currently publish their data, including general research data repositories, domain-specific databases, and the top journals
- compare how much data is in institutional vs. domain-specific vs. federal platforms
- determine which repositories are recommended by top journals that require or recommend the publication of supporting data
- ascertain where researchers not affiliated with funding or initiatives possessing a designated open data repository can publish data

Approach

The National Agricultural Library team focused on Agricultural Research Service (ARS), Natural Resources Conservation Service (NRCS), and United States Forest Service (USFS) style research data, rather than ag economics, statistics, and social sciences data. To find domain-specific, general, institutional, and federal agency repositories and databases that are open to US research submissions and have some amount of ag data, resources including re3data, libguides, and ARS lists were analysed. Primarily environmental or public health databases were not included, but places where ag grantees would publish data were considered.

Search methods

We first compiled a list of known domain-specific USDA / ARS datasets / databases that are represented in the Ag Data Commons, including ARS Image Gallery, ARS Nutrition Databases (sub-components), SoyBase, PeanutBase, National Fungus Collection, i5K Workspace @ NAL, and GRIN. We then searched using search engines such as Bing and Google for non-USDA / federal ag databases, using Boolean variations of “agricultural data” / “ag data” / “scientific data” + NOT + USDA (to filter out the federal / USDA results). Most of these results were domain specific, though some contained a mix of data subjects.

We then used search engines such as Bing and Google to find top agricultural university repositories using variations of “agriculture”, “ag data” and “university” to find schools with agriculture programs. Using that list of universities, we searched each university web site to see if their institution had a repository for their unique, independent research data if not apparent in the initial web browser search. We found both ag specific university repositories and general university repositories that housed a portion of agricultural data. Ag specific university repositories are included in the list of domain-specific repositories. Results included Columbia University – International Research Institute for Climate and Society, UC Davis – Cover Crops Database, etc. If a general university repository existed, we determined whether that repository could filter to include only data results after our chosen ag search terms were applied. General university databases that contain ag data included Colorado State University Digital Collections, University of Michigan ICPSR (Inter-university Consortium for Political and Social Research), and University of Minnesota DRUM (Digital Repository of the University of Minnesota). We then split out NCBI (National Center for Biotechnology Information) repositories.

Next we searched the internet for open general data repositories using a variety of search engines, and repositories containing a mix of data, journals, books, and other types of records were tested to determine whether that repository could filter for data results after search terms were applied. General subject data repositories include Figshare, Open Science Framework, PANGAEA, Protein Data Bank, and Zenodo.

Finally, we compared scholarly journal suggestions for data repositories against our list to fill in any missing repositories that might contain agricultural data. Extensive lists of journals in which USDA published in 2012 and 2016 were compiled, combining search results in ARIS, Scopus, and the Forest Service's TreeSearch, plus the USDA web sites Economic Research Service (ERS), National Agricultural Statistics Service (NASS), Natural Resources and Conservation Service (NRCS), Food and Nutrition Service (FNS), Rural Development (RD), and Agricultural Marketing Service (AMS). The top 50 journals' author instructions were consulted to see if they (a) ask or require submitters to provide supplemental data, or (b) require submitters to submit data to open repositories. Data are provided for Journals based on a 2012 and 2016 study of where USDA employees publish their research studies, ranked by number of articles, including 2015/2016 Impact Factor, Author guidelines, Supplemental Data?, Supplemental Data reviewed?, Open Data (Supplemental or in Repository) Required? and Recommended data repositories, as provided in the online author guidelines for each of the top 50 journals.

Evaluation

We ran a series of searches on all resulting general subject databases with the designated search terms. From the results, we noted the total number of datasets in the repository, type of resource searched (datasets, data, images, components, etc.), percentage of the total database that each term comprised, any dataset with a search term that comprised at least 1% and 5% of the total collection, and any search term that returned greater than 100 and greater than 500 results. We compared domain-specific databases and repositories based on parent organization, type of institution, and whether data submissions were dependent on conditions such as funding or affiliation of some kind.

Results

A summary of the major findings from our data review:
- Over half of the top 50 ag-related journals from our profile require or encourage open data for their published authors.
- There are few general repositories that are both large AND contain a significant portion of ag data in their collection. GBIF (Global Biodiversity Information Facility), ICPSR, and ORNL DAAC were among those that had over 500 datasets returned with at least one ag search term and had that result comprise at least 5% of the total collection.
- Not even one quarter of the domain-specific repositories and datasets reviewed allow open submission by any researcher regardless of funding or affiliation.

See included README file for descriptions of each individual data file in this dataset.

Resources in this dataset:
- Resource Title: Journals. File Name: Journals.csv
- Resource Title: Journals - Recommended repositories. File Name: Repos_from_journals.csv
- Resource Title: TDWG presentation. File Name: TDWG_Presentation.pptx
- Resource Title: Domain Specific ag data sources. File Name: domain_specific_ag_databases.csv
- Resource Title: Data Dictionary for Ag Data Repository Inventory. File Name: Ag_Data_Repo_DD.csv
- Resource Title: General repositories containing ag data. File Name: general_repos_1.csv
- Resource Title: README and file inventory. File Name: README_InventoryPublicDBandREepAgData.txt
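
The evaluation criteria described above (share of the collection at 1% and 5%, and result counts above 100 and 500) can be expressed as a small sketch. This is an illustration under stated assumptions, not the study's actual procedure: the `TermResult` class, the `flag_term` function, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermResult:
    term: str
    hits: int             # results returned for the search term
    collection_size: int  # total number of datasets in the repository


def flag_term(result: TermResult) -> dict:
    """Apply the 1% / 5% share and 100 / 500 result thresholds (hypothetical helper)."""
    share_pct = result.hits * 100 / result.collection_size
    return {
        "share_pct": share_pct,
        "at_least_1_pct": share_pct >= 1,
        "at_least_5_pct": share_pct >= 5,
        "over_100_results": result.hits > 100,
        "over_500_results": result.hits > 500,
    }


# Made-up numbers for illustration only; they do not come from the dataset.
example = TermResult(term="agriculture", hits=600, collection_size=10_000)
flags = flag_term(example)
print(flags)
```

A repository that meets both the 5% share and the 500-result criteria for at least one term corresponds to the "large and significantly agricultural" case highlighted in the results.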
Interface Element Frequencies in Search Engine Results Pages (SERPs) Across...
rdm.inesctec.pt
Updated Jul 22, 2025
Cite
(2025). Interface Element Frequencies in Search Engine Results Pages (SERPs) Across Query Intents, Search Engines and Languages [Dataset]. https://rdm.inesctec.pt/dataset/cs-2025-006
Dataset updated
Jul 22, 2025
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset contains the data produced for the dissertation "User Interface Variations in Search Engine Results Pages Across Types of Search Queries and Search Engines". The project was conducted by student Adelaide Miranda Santos at FEUP, University of Porto, as part of the Masters in Informatics and Computing Engineering. The primary objective of this work is to study interface variations in search engine results pages (SERPs) across different search engines and types of search queries. To this end, nearly 8,000 SERPs were captured using the ORCAS-I-gold dataset across six leading web search engines: Google, Microsoft Bing, Yandex, Yahoo!, Baidu, and DuckDuckGo. For each captured SERP, the number of occurrences of each interface element was recorded. Additionally, to analyze how the language of a search query affects SERP composition in Yandex and Baidu, the original English queries were translated into Russian and Simplified Chinese.

The dataset is organized in the following folders:

Search Query Dataset Translation
Contains the search queries from the ORCAS-I-gold dataset translated into Russian and Simplified Chinese. The translation was made using ChatGPT-4o and verified by native speakers. In addition to the translated queries, the complete original ORCAS-I-gold dataset is also included as an independent resource.

SERP Captures
Includes HTML files of the search engine results pages collected from Baidu, Microsoft Bing, DuckDuckGo, Google, Yahoo!, and Yandex. Each top-level subfolder is named after the respective search engine. Within each of these, there are folders named according to the language and the query intent associated with the search query. These folders contain the corresponding SERP HTML files. File names represent the search queries and may be either encoded or displayed as in the original dataset.

Occurrence of Elements per SERP
For each captured SERP, we recorded the frequency of each interface element. This data is organized in a relational database structure composed of the following CSV files:
- elements.csv: Lists all identified SERP elements along with their corresponding IDs, categories, types, and subtypes (if applicable).
- identifiers.csv: Contains the selectors or identifiers used for automatic detection of each element, along with their associated element ID, identifier ID, and the corresponding search engine ID.
- intents.csv: Maps query intent names to their corresponding intent IDs.
- search-engines.csv: Maps search engine names to their corresponding IDs.
- main.csv: Records the frequency of each element in each captured SERP. Each row represents an observation and includes the following fields: element ID, identifier ID, search engine ID, query language, intent ID, query ID (as defined in the ORCAS-I-gold dataset), and the number of occurrences.
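
Because the CSV files above form a small relational structure, reading them usually means resolving the IDs in main.csv against the lookup files. The sketch below shows one way to do that with the Python standard library. The column headers (`search_engine_id`, `element_id`, `name`, `occurrences`) and the sample rows are assumptions made for illustration; the real files may use different header names and contain more columns.

```python
import csv
import io

# Hypothetical sample data standing in for search-engines.csv, elements.csv and main.csv.
SEARCH_ENGINES_CSV = """search_engine_id,name
1,Google
2,Microsoft Bing
"""
ELEMENTS_CSV = """element_id,name
10,organic result
11,advertisement
"""
MAIN_CSV = """element_id,search_engine_id,occurrences
10,1,9
11,1,2
10,2,8
"""


def load_lookup(text: str, key: str, value: str) -> dict:
    """Build an ID-to-name mapping from CSV text."""
    return {row[key]: row[value] for row in csv.DictReader(io.StringIO(text))}


engines = load_lookup(SEARCH_ENGINES_CSV, "search_engine_id", "name")
elements = load_lookup(ELEMENTS_CSV, "element_id", "name")

# Sum element occurrences per (search engine, element) pair.
totals = {}
for row in csv.DictReader(io.StringIO(MAIN_CSV)):
    pair = (engines[row["search_engine_id"]], elements[row["element_id"]])
    totals[pair] = totals.get(pair, 0) + int(row["occurrences"])

for (engine, element), count in sorted(totals.items()):
    print(engine, element, count)
```

To run this against the published files, replace the `io.StringIO(...)` sources with opened file handles and adjust the header names to match the actual CSVs.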
hyperlinks
huggingface.co
Cite
Goker Cebeci, hyperlinks [Dataset]. https://huggingface.co/datasets/goker/hyperlinks
Authors
Goker Cebeci
Description
Hyperlinks Dataset

This dataset contains subpage links, their features, and corresponding search engine rankings from Google and Bing. The data was collected as part of the research project: "Accessible Hyperlinks and Search Engine Rankings: An Empirical Investigation".

Dataset Description

This dataset is designed to facilitate research on the relationship between website accessibility, specifically hyperlink accessibility, and search engine rankings. It consists of… See the full description on the dataset page: https://huggingface.co/datasets/goker/hyperlinks.
Search Engines Comparison and Websites Performance
zenodo.org
data.niaid.nih.gov
bin
Updated Jul 1, 2023
Cite
Georgios Ntimo; Vasilios Ntararas (2023). Search Engines Comparison and Websites Performance [Dataset]. http://doi.org/10.5281/zenodo.8102700
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.8102700
Dataset updated
Jul 1, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Georgios Ntimo; Vasilios Ntararas
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset consists of 200 search results extracted from the Google and Bing search engines (100 from Google and 100 from Bing). The search terms were selected from the 10 most searched keywords of 2021, based on data provided by Google Trends. The remaining sheets record the performance of the websites on three technical evaluation aspects: SEO, speed, and security. The performance dataset was developed using the CheckBot crawling tool. The whole dataset can help information retrieval scientists compare the two engines in terms of position/ranking and performance on these factors.

For more information about the reasoning behind the structure of the dataset, please contact the Information Management Lab of the University of West Attica.

Contact Persons: Vasilis Ntararas (lb17032@uniwa.gr), Georgios Ntimo (lb17100@uniwa.gr), and Ioannis C. Drivas (idrivas@uniwa.gr)
Indianfoodnet Dataset
universe.roboflow.com
zip
Updated Dec 4, 2023
Cite
IndianFoodNet (2023). Indianfoodnet Dataset [Dataset]. https://universe.roboflow.com/indianfoodnet/indianfoodnet/model/1
Available download formats: zip
Dataset updated
Dec 4, 2023
Dataset authored and provided by
IndianFoodNet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Indian Dishes Bounding Boxes
Description
IndianFoodNet-30

About IndianFoodNet-30

IndianFoodNet-30 was created by Ritu Agarwal, Nikunj Bansal, Tanupriya Choudhury, Tanmay Sarkar & Neelu Jyothi Ahuja with the goal of building an Indian food detection model. It contains more than 5,500 images of 30 popular Indian food items.

Data collection

We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.

Fair use

This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes (adapted from Christopher Thomas).

Citation

If you find our dataset useful, please cite us as:

@dataset{dataset,
  author = {Agarwal, Ritu and Bansal, Nikunj and Choudhury, Tanupriya and Sarkar, Tanmay and J.Ahuja, Neelu},
  year = {2023},
  title = {IndianFoodNet-30 Dataset},
  publisher = {Roboflow Universe},
  url = {https://universe.roboflow.com/indianfoodnet/indianfoodnet},
}
Query auto-completions for German politicians of the 18th Bundestag
data.niaid.nih.gov
Updated Jan 24, 2020
Cite
Samokhina, Anastasiia (2020). Query auto-completions for German politicians of the 18th Bundestag [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3462045
Dataset updated
Jan 24, 2020
Dataset provided by
Schaer, Philipp
Samokhina, Anastasiia
Bonart, Malte
Heisenberg, Gernot
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Germany
Description
bundestag.csv - UTF-8 encoded comma separated text file

This dataset contains the members of the 18th German Bundestag as constituted in late 2016.

name of the politician

birthday

party membership of the politician

state of the politician

gender of the politician

age of the politician (as of 2017)

number of unique auto-completions assigned to topic: "location information"

number of unique auto-completions assigned to topic: "personal and emotional"

number of unique auto-completions assigned to topic: "politics and economics"

total number of unique auto-completions

terms.csv - UTF-8 encoded comma separated text file

This dataset contains the unordered and pooled auto-completions for the German politicians from Bing search (http://api.bing.net/osjson.aspx), from Duck-Duck-Go (https://duckduckgo.com/ac/) and from Google search (http://clients1.google.de/complete/search). The data was crawled (mostly) twice per day from 2017/02/03 to 2017/06/19. German language settings were used for Google and Bing; an English language setting was used for Duck-Duck-Go. The API requests were sent with an IP address from Cologne, Germany.

google, bing or ddg

the query term, matches the name of the politician in the file

the suggested query auto-completion
Indian_food Dataset
universe.roboflow.com
zip
Updated Jul 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
IndianFood (2024). Indian_food Dataset [Dataset]. https://universe.roboflow.com/indianfood/indian_food-pwzlc/dataset/2
Available download formats: zip
Dataset updated
Jul 16, 2024
Dataset authored and provided by
IndianFood
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Indian Food Bounding Boxes
Description
IndianFood-7

About IndianFood-7

IndianFood-7 was created by Ritu Agarwal, Nikunj Bansal, Tanmay Sarkar, Tanupriya Choudhury and Neelu Jyothi Ahuja with the goal of building an Indian food detection model. It contains more than 800 images of 7 popular Indian food items.

Data collection

We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.

Fair use

This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes (adapted from Christopher Thomas).
Wheat Breeding Multimodal Dataset
zenodo.org
scidb.cn
bin, xls
Updated Feb 11, 2025
Cite
Guofeng Yang; Yu Li; Yong He; Zhenjiang Zhou; Lingzhen Ye; Hui Fang; Xuping Feng (2025). Wheat Breeding Multimodal Dataset [Dataset]. http://doi.org/10.5281/zenodo.14841928
Available download formats: bin, xls
Unique identifier
https://doi.org/10.5281/zenodo.14841928
Dataset updated
Feb 11, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Guofeng Yang; Yu Li; Yong He; Zhenjiang Zhou; Lingzhen Ye; Hui Fang; Xuping Feng
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This dataset is a wheat breeding multimodal dataset, including wheat germplasm data, wheat phenotypic data, wheat cultivation technique data, wheat plant protection technique data, wheat seed price data, UAV remote sensing data, and experimental site weather data. The data sources are field acquisition and online public data.

The wheat germplasm data comes from the Chinese Crop Germplasm Information Network (https://www.cgris.net/). The data on wheat cultivation technique and wheat plant protection technique come from search engines (Google, Bing, Baidu), and the search terms include "wheat cultivation technique, wheat plant protection technique, 小麦栽培技术, 小麦植保技术". The wheat seed historical price data comes from the National Seed Market Monitoring Information Release Platform (http://202.127.45.18/) - China. UAV remote sensing data is the result of further processing after being obtained from field experiments. Weather data comes from meteorological equipment at various agricultural experimental bases and meteorological observation stations of the China Meteorological Administration.

The acquisition and processing of data are described in the relevant part of the manuscript.

This dataset will be continuously updated in the future to help breeding work be carried out efficiently and accelerate the breeding process of excellent varieties.

Food_new Dataset

universe.roboflow.com

zip

Updated Jul 16, 2024

Cite

Allergen30 (2024). Food_new Dataset [Dataset]. https://universe.roboflow.com/allergen30/food_new-uuulf/dataset/2

Available download formats: zip

Dataset updated

Jul 16, 2024

Dataset authored and provided by

Allergen30

License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Variables measured

Food Bounding Boxes

Description

Allergen30

About Allergen30

Allergen30 is created by Mayank Mishra, Nikunj Bansal, Tanmay Sarkar and Tanupriya Choudhury with a goal of building a robust detection model that can assist people in avoiding possible allergic reactions.

It contains more than 6,000 images of 30 commonly used food items which can cause an adverse reaction within a human body. This dataset is one of the first research attempts in training a deep learning based computer vision model to detect the presence of such food items from images. It also serves as a benchmark for evaluating the efficacy of object detection methods in learning the otherwise difficult visual cues related to food items.

Description of class labels

There are multiple food items pertaining to specific food intolerances which can trigger an allergic reaction. Such food intolerances primarily include Lactose, Histamine, Gluten, Salicylate, Caffeine and Ovomucoid intolerance.

The following table contains the description relating to the 30 class labels in our dataset.

S. No.	Allergen	Food label	Description
1	Ovomucoid	egg	Images of egg with yolk (e.g. sunny side up eggs)
2	Ovomucoid	whole_egg_boiled	Images of soft and hard boiled eggs
3	Lactose/Histamine	milk	Images of milk in a glass
4	Lactose	icecream	Images of icecream scoops
5	Lactose	cheese	Images of swiss cheese
6	Lactose/ Caffeine	milk_based_beverage	Images of tea/ coffee with milk in a cup/glass
7	Lactose/Caffeine	chocolate	Images of chocolate bars
8	Caffeine	non_milk_based_beverage	Images of soft drinks and tea/coffee without milk in a cup/glass
9	Histamine	cooked_meat	Images of cooked meat
10	Histamine	raw_meat	Images of raw meat
11	Histamine	alcohol	Images of alcohol bottles
12	Histamine	alcohol_glass	Images of wine glasses with alcohol
13	Histamine	spinach	Images of spinach bundle
14	Histamine	avocado	Images of avocado sliced in half
15	Histamine	eggplant	Images of eggplant
16	Salicylate	blueberry	Images of blueberry
17	Salicylate	blackberry	Images of blackberry
18	Salicylate	strawberry	Images of strawberry
19	Salicylate	pineapple	Images of pineapple
20	Salicylate	capsicum	Images of bell pepper
21	Salicylate	mushroom	Images of mushrooms
22	Salicylate	dates	Images of dates
23	Salicylate	almonds	Images of almonds
24	Salicylate	pistachios	Images of pistachios
25	Salicylate	tomato	Images of tomato and tomato slices
26	Gluten	roti	Images of roti
27	Gluten	pasta	Images of one serving of penne pasta
28	Gluten	bread	Images of bread slices
29	Gluten	bread_loaf	Images of bread loaf
30	Gluten	pizza	Images of pizza and pizza slices

Data collection

We used search engines (Google and Bing) to crawl and look for suitable images using JavaScript queries for each food item from the list created. The images with incomplete RGB channels were removed, and the images collected from different search engines were compiled. When downloading images from search engines, many images were irrelevant to the purpose, especially the ones with a lot of text in them. We deployed the EAST text detector to segregate such images. Finally, a comprehensive manual inspection was conducted to ensure the relevancy of images in the dataset.

Fair use

This dataset contains some copyrighted material whose use has not been specifically authorized by the copyright owners. In an effort to advance scientific research, we make this material available for academic research. If you wish to use copyrighted material in our dataset for purposes of your own that go beyond non-commercial research and academic purposes, you must obtain permission directly from the copyright owner. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit to those who have expressed a prior interest in receiving the included information for non-commercial research and educational purposes (adapted from Christopher Thomas).

**Citatio

Apparel Dataset
kaggle.com
Updated Apr 26, 2020
Cite
Kais (2020). Apparel Dataset [Dataset]. https://www.kaggle.com/kaiska/apparel-dataset/code
Dataset updated
Apr 26, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Kais
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

This dataset was created in order for me to practice multi-label classification based on Jeremy Howard's FastAi lecture 3. The dataset contains 8 different clothing categories in 9 different colours. The main objective of multi-label classification is to be able to label items found in photos based on these categories.

Content

The dataset consists of 16,170 images that were scraped from Google, Bing and DuckDuckGo. It includes the following categories:

Black Dress: 450
Black Pants: 870
Black Shirt: 715
Black Shoes: 766
Black Shorts: 328
Black Suit: 320
Blue Dress: 502
Blue Pants: 798
Blue Shirt: 741
Blue Shoes: 523
Blue Shorts: 299
Brown Hoodie: 188
Brown Pants: 311
Brown Shoes: 464
Green Pants: 227
Green Shirt: 230
Green Shoes: 455
Green Shorts: 135
Green Suit: 243
Pink Hoodie: 347
Pink Pants: 246
Pink Skirt: 513
Red Dress: 800
Red Hoodie: 349
Red Pants: 308
Red Shirt: 332
Red Shoes: 610
Silver Shoes: 403
Silver Skirt: 361
White Dress: 818
White Pants: 274
White Shoes: 600
White Shorts: 120
White Suit: 354
Yellow Dress: 566
Yellow Shorts: 195
Yellow Skirt: 409
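
Each category name combines a colour and a garment type, which is what makes the task multi-label. The sketch below parses the counts listed above, checks that they add up to the stated total, and builds a simple multi-hot target with one bit for the colour and one for the garment. The encoding and the names `vocab` and `multi_hot` are illustrative assumptions, not the dataset author's code.

```python
import re

# Per-category image counts, copied from the listing above.
COUNTS_TEXT = (
    "Black Dress: 450 Black Pants: 870 Black Shirt: 715 Black Shoes: 766 "
    "Black Shorts: 328 Black Suit: 320 Blue Dress: 502 Blue Pants: 798 "
    "Blue Shirt: 741 Blue Shoes: 523 Blue Shorts: 299 Brown Hoodie: 188 "
    "Brown Pants: 311 Brown Shoes: 464 Green Pants: 227 Green Shirt: 230 "
    "Green Shoes: 455 Green Shorts: 135 Green Suit: 243 Pink Hoodie: 347 "
    "Pink Pants: 246 Pink Skirt: 513 Red Dress: 800 Red Hoodie: 349 "
    "Red Pants: 308 Red Shirt: 332 Red Shoes: 610 Silver Shoes: 403 "
    "Silver Skirt: 361 White Dress: 818 White Pants: 274 White Shoes: 600 "
    "White Shorts: 120 White Suit: 354 Yellow Dress: 566 Yellow Shorts: 195 "
    "Yellow Skirt: 409"
)

# Each entry has the form "<Colour> <Garment>: <count>".
entries = [
    (colour, garment, int(count))
    for colour, garment, count in re.findall(
        r"([A-Z][a-z]+) ([A-Z][a-z]+): (\d+)", COUNTS_TEXT
    )
]

colours = sorted({colour for colour, _, _ in entries})
garments = sorted({garment for _, garment, _ in entries})
vocab = colours + garments  # one slot per possible label
total_images = sum(count for _, _, count in entries)


def multi_hot(colour, garment):
    """Return a 0/1 vector over vocab with the colour and garment bits set."""
    return [1 if label in (colour, garment) else 0 for label in vocab]


print(len(colours), len(garments), total_images)
print(multi_hot("Red", "Dress"))
```

With this encoding a classifier predicts the colour and the garment independently, so it can in principle label combinations that have no folder of their own.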

Acknowledgements

While searching the internet for a good dataset to apply multi-label classification to, I stumbled upon pyimagesearch's article on multi-label classification with Keras, in which Adrian used a very simple and small dataset containing 3 clothing categories. To expand on that dataset, I combined it with trolukovich's dataset and with my own images, which I scraped from Google and Bing using cwerner's fastclass package.

Virtual E Dataset
figshare.com
zip
Updated Oct 25, 2017
Cite
Seung Seog Han (2017). Virtual E Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.5513407.v2
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.5513407.v2
Dataset updated
Oct 25, 2017
Dataset provided by
Figshare (http://figshare.com/)
Authors
Seung Seog Han
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Virtual E Dataset

E dataset (3,317 images) - Diagnosis predicted by CNNs (ResNet-152 + VGG-19; arithmetic mean of both outputs; training dataset: A1)

We created the E dataset to assess the semisupervised learning performance by conducting a Web-based image search for “tinea,” “onychomycosis,” “nail dystrophy,” “onycholysis,” and “melanonychia” in English, Korean, and Japanese on http://google.com and http://bing.com, and downloaded a total of 15,844 images. From these images, the R-CNNs created a nail dataset of 3,317 images, since we had to discard many images because of low image resolution. The CNNs (model: ResNet-152 + VGG-19; arithmetic mean of both outputs; training dataset: A1) automatically classified the images generated by the R-CNNs into six classes (760 onychomycosis, 1,316 nail dystrophy, 363 onycholysis, 185 melanonychia, 424 normal, and 269 others).
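
The description says the diagnosis comes from the arithmetic mean of the ResNet-152 and VGG-19 outputs. A minimal sketch of that kind of two-model averaging over the six classes follows. The probability values are made up for illustration, the function names are hypothetical, and nothing here reproduces the authors' actual pipeline.

```python
CLASSES = [
    "onychomycosis",
    "nail dystrophy",
    "onycholysis",
    "melanonychia",
    "normal",
    "others",
]


def ensemble_mean(probs_a, probs_b):
    """Average two per-class probability vectors element by element."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


def predict(probs_a, probs_b):
    """Return the class whose averaged probability is highest."""
    averaged = ensemble_mean(probs_a, probs_b)
    best = max(range(len(CLASSES)), key=lambda i: averaged[i])
    return CLASSES[best]


# Made-up softmax outputs for a single image (illustrative values only).
resnet_out = [0.50, 0.30, 0.05, 0.05, 0.05, 0.05]
vgg_out = [0.20, 0.60, 0.05, 0.05, 0.05, 0.05]

label = predict(resnet_out, vgg_out)
print(label)
```

In this made-up case the two models disagree on the top class, and the average settles on the class with the higher combined probability.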
O
ANIMAL-8
opendatalab.com
zip
Updated Dec 14, 2022
Cite
openinnolab (2022). ANIMAL-8 [Dataset]. https://opendatalab.com/OpenDataLab/ANIMAL-8
Available download formats: zip
Dataset updated
Dec 14, 2022
Dataset provided by
openinnolab
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This is an image dataset for developing deep learning algorithms. The data was derived from ANIMAL-10N, whose images were pulled from several online search engines, including Bing and Google, using predefined tags as search keywords, and then categorized by 15 recruited participants (10 undergraduate and 5 graduate students). It was later adapted by the Intelligent Education Center team of Shanghai Artificial Intelligence Laboratory and now contains 8 categories (4 pairs), mainly in ImageNet format. It contains four pairs of easily confused animals with a total of 39,607 images. The four pairs are: (cat, lynx), (jaguar, cheetah), (wolf, coyote), (hamster, guinea pig). It can be applied to training and testing basic image classification algorithms.

Data from: An Automatic Method to Extract Patent Citations from Google
figshare.com
pdf
Updated May 30, 2023
Cite
Kayvan Kousha; Mike Thelwall (2023). An Automatic Method to Extract Patent Citations from Google [Dataset]. http://doi.org/10.6084/m9.figshare.1418234.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.1418234.v1
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Kayvan Kousha; Mike Thelwall
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Appendices A1-A16 report the top 10 most highly cited articles in Google Patents from the Bing API search in the sixteen selected fields. They show that a minority of articles have many patent citations but few or no Scopus citations.

Cite

Fabian Haak; Philipp Schaer (2023). Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions [Dataset]. http://doi.org/10.5281/zenodo.7682915

Data from: Qbias – A Dataset on Media Bias in Search Queries and Query Suggestions

Available download formats: csv

Unique identifier

https://doi.org/10.5281/zenodo.7682915

Dataset updated

Mar 1, 2023

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Fabian Haak; Philipp Schaer

License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

We present Qbias, two novel datasets that promote the investigation of bias in online news search as described in

Fabian Haak and Philipp Schaer. 2023. 𝑄𝑏𝑖𝑎𝑠 - A Dataset on Media Bias in Search Queries and Query Suggestions. In Proceedings of ACM Web Science Conference (WebSci’23). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3578503.3583628.

Dataset 1: AllSides Balanced News Dataset (allsides_balanced_news_headlines-texts.csv)

The dataset contains 21,747 news articles collected from AllSides balanced news headline roundups in November 2022 as presented in our publication. The AllSides balanced news features three expert-selected U.S. news articles from sources of different political views (left, right, center), often featuring spin bias, slant, and other forms of non-neutral reporting on political news. All articles are tagged with a bias label by four expert annotators based on the expressed political partisanship: left, right, or neutral. The AllSides balanced news aims to offer multiple political perspectives on important news stories, educate users on biases, and provide multiple viewpoints. Collected data further includes headlines, dates, news texts, topic tags (e.g., "Republican party", "coronavirus", "federal jobs"), and the publishing news outlet. We also include AllSides' neutral description of the topic of the articles.
Overall, the dataset contains 10,273 articles tagged as left, 7,222 as right, and 4,252 as center.

To provide easier access to the most recent and complete version of the dataset for future research, we provide a scraping tool and a regularly updated version of the dataset at https://github.com/irgroup/Qbias. The more recent versions in the repository include additional tags (such as the URL to the article). We chose to publish the version used for fine-tuning the models on Zenodo to enable the reproduction of the results of our study.

Dataset 2: Search Query Suggestions (suggestions.csv)

The second dataset we provide consists of 671,669 search query suggestions for root queries based on tags of the AllSides balanced news dataset. We collected search query suggestions from Google and Bing for the 1,431 topic tags that have been used for tagging AllSides news at least five times, which is approximately half of the total number of topics. The topic tags include names, a wide range of political terms, agendas, and topics (e.g., "communism", "libertarian party", "same-sex marriage"), cultural and religious terms (e.g., "Ramadan", "pope Francis"), locations, and other news-relevant terms. On average, the dataset contains 469 search queries for each topic. In total, 318,185 suggestions have been retrieved from Google and 353,484 from Bing.

The file contains a "root_term" column based on the AllSides topic tags. The "query_input" column contains the search term submitted to the search engine ("search_engine"). "query_suggestion" and "rank" represent the search query suggestions and their respective positions as returned by the search engines at the time of search ("datetime"). We scraped our data from a US server; its location is saved in "location".
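
As a rough illustration of working with these columns, the sketch below counts suggestion rows per search engine using Python's `csv` module. The column order, the comma delimiter, and the sample values for "rank", "datetime", and "location" are assumptions made for the example (only the column names and the "democrats" strings come from the description), and `suggestions_per_engine` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Two made-up rows in the described column layout. The rank, datetime,
# and location values are placeholders, not real data.
SAMPLE_CSV = """\
root_term,query_input,query_suggestion,rank,search_engine,datetime,location
democrats,democrats a,democrats and recession,1,google,2022-11-01 12:00:00,Iowa
democrats,democrats a,democrats and recession,3,bing,2022-11-01 12:00:05,Iowa
"""


def suggestions_per_engine(csv_file):
    """Count how many suggestion rows each search engine contributed."""
    reader = csv.DictReader(csv_file)
    return Counter(row["search_engine"] for row in reader)


counts = suggestions_per_engine(io.StringIO(SAMPLE_CSV))
print(counts)
```

To run the same count on the published file, the in-memory sample could be replaced by a file handle opened with `open("suggestions.csv", newline="", encoding="utf-8")`, assuming the file is comma-delimited with a header row.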

We retrieved ten search query suggestions provided by the Google and Bing search autocomplete systems for the input of each of these root queries, without performing a search. Furthermore, we extended the root queries by the letters a to z (e.g., "democrats" (root term) >> "democrats a" (query input) >> "democrats and recession" (query suggestion)) to simulate a user's input during information search and generate a total of up to 270 query suggestions per topic and search engine. The dataset we provide contains columns for root term, query input, and query suggestion for each suggested query. The location from which the search is performed is the location of the Google servers running Colab, in our case Iowa in the United States of America, which is added to the dataset.
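
The figure of up to 270 suggestions per topic and search engine follows from 27 query inputs (the root term plus its 26 letter extensions) with ten suggestions each. A small sketch of that expansion is given below; the function name `expand_root_query` is hypothetical, and the code only generates the inputs without contacting any search engine.

```python
import string

SUGGESTIONS_PER_INPUT = 10  # ten suggestions were retrieved per query input


def expand_root_query(root_term):
    """Return the root term plus the root term extended by each letter a to z."""
    extended = [f"{root_term} {letter}" for letter in string.ascii_lowercase]
    return [root_term] + extended


inputs = expand_root_query("democrats")
max_suggestions = len(inputs) * SUGGESTIONS_PER_INPUT

print(inputs[:3])
print(len(inputs), max_suggestions)
```

The total is an upper bound because an autocomplete system may return fewer than ten suggestions for some inputs.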

AllSides Scraper

At https://github.com/irgroup/Qbias, we provide a scraping tool that allows for the automatic retrieval of all available articles from the AllSides balanced news headlines.

We want to provide an easy means of retrieving the news and all corresponding information. For many tasks, it is relevant to have the most recent documents available. Thus, we provide this Python-based scraper, which scrapes all available AllSides news articles and gathers the available information. By providing the scraper, we facilitate access to a recent version of the dataset for other researchers.
