Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
Dataset Composition:
Intended Use:
Additional Information:
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value, so care must be taken to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as:
@article{shahi2021overview,
  title = {Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author = {Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal = {Working Notes of CLEF},
  year = {2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and this categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem: the task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided in the format of Id, title, text, rating, and domain; the descriptions of the columns are as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample file
public_id, predicted_domain
1, health
2, crime
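For illustration, here is a minimal sketch that writes submission files in the sample formats above for both subtasks; the `predictions` mappings are hypothetical stand-ins for a model's output:

import csv

# Hypothetical predictions: public_id -> predicted label
predictions_3a = {1: "false", 2: "true"}
predictions_3b = {1: "health", 2: "crime"}

def write_submission(path, header, predictions):
    """Write one row per article in the sample-file format shown above."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for public_id, label in sorted(predictions.items()):
            writer.writerow([public_id, label])

write_submission("task3a_run1.csv", ["public_id", "predicted_rating"], predictions_3a)
write_submission("task3b_run1.csv", ["public_id", "predicted_domain"], predictions_3b)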
Additional data for Training
To train your model, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the macro-averaged F1 measure for the ranking of teams, as illustrated below. There is a limit of 5 runs (in total, not per day), and only one person from a team is allowed to submit runs.
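Macro-averaged F1 computes the F1 score per class and takes their unweighted mean, so rare classes count as much as frequent ones. A minimal sketch with scikit-learn (labels are invented for illustration):

from sklearn.metrics import f1_score

y_true = ["false", "true", "partially false", "other", "false"]
y_pred = ["false", "true", "false", "other", "partially false"]

# Macro-F1: unweighted mean of per-class F1 scores, so a rare class
# such as "other" counts as much as a frequent one.
print(f1_score(y_true, y_pred, average="macro"))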
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
https://www.kappasignal.com/p/legal-disclaimer.html
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.
Historical daily stock prices (open, high, low, close, volume)
Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)
Technical indicators (e.g., moving averages, RSI, MACD, average directional index, Aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution (A/D) line, parabolic SAR, Bollinger Bands, Fibonacci retracements, Williams %R, commodity channel index)
Feature engineering based on financial data and technical indicators
Sentiment analysis data from social media and news articles
Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)
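As an illustration of the feature engineering listed above, a minimal pandas sketch computing a simple moving average and a basic RSI from a hypothetical close-price series (window lengths and prices are assumptions for illustration, not part of the dataset):

import pandas as pd

def sma(close: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average over `window` periods."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Basic RSI: average gains over average losses, scaled to 0-100."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

close = pd.Series([100, 102, 101, 105, 107, 106, 108, 110, 109, 111,
                   113, 112, 115, 117, 116, 118, 120, 119, 121, 123], dtype=float)
features = pd.DataFrame({"sma_5": sma(close, 5), "rsi_14": rsi(close)})
print(features.tail())

Typical applications of such features include: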
Stock price prediction
Portfolio optimization
Algorithmic trading
Market sentiment analysis
Risk management
Researchers investigating the effectiveness of machine learning in stock market prediction
Analysts developing quantitative buy/sell trading strategies
Individuals interested in building their own stock market prediction models
Students learning about machine learning and financial applications
The dataset may include different levels of granularity (e.g., daily, hourly)
Data cleaning and preprocessing are essential before model training
Regular updates are recommended to maintain the accuracy and relevance of the data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset combines two parts: fake news and true news. The fake news was collected from Kaggle, and some true news was collected from IEEE DataPort. Because additional true news was needed to balance the fake news, more true news articles were collected from trusted online sites. Finally, the fake and true news were concatenated into a single dataset to help researchers who want to work on this topic.
Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.
Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.
User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.
Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.
GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.
Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.
High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.
Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.
Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Italian Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Italian language, advancing the field of artificial intelligence.
Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Italian. A context paragraph is given for each question, from which the answer is to be extracted. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Italian people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Italian Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence (an illustrative record is shown below).
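For illustration, a hypothetical record built from the annotation fields listed above; all values are invented, and the exact field names in the released files may differ:

import json

# Hypothetical record; field names follow the description above, values invented.
record = {
    "id": "it_qa_00001",
    "context": "...context paragraph in Italian...",
    "context_reference_link": "https://example.com/source-article",
    "question": "...question in Italian...",
    "question_type": "multiple-choice",
    "question_complexity": "medium",
    "question_category": "science",
    "domain": "general",
    "prompt_type": "instruction",
    "answer": "...answer in Italian...",
    "answer_type": "short phrase",
    "rich_text_presence": False,
}
print(json.dumps(record, ensure_ascii=False, indent=2))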
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Italian version is grammatically accurate, without spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Italian Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently, datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty.
Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method.
Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking, the level of difficulty of a dataset depends on the algorithm; these levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software.
The Statistical Reference Datasets project is also supported by the Standard Reference Data Program.
The following published OGC-compliant WMTS services facilitate access to live geospatial data from the City of Toronto. All WMTS services are in the Web Mercator projection.
Orthorectified Aerial Imagery: The following dataset provides access to the most current geometrically corrected (orthorectified) aerial photography for the City of Toronto. Previous-year orthoimagery is available through the links provided below.
Historic Aerial Imagery: These datasets are all sourced from scans of the original black-and-white aerial photography. These images have not gone through the same rigorous process that current aerial imagery goes through to create a seamless orthorectified image corrected for changes in elevation across the City. Because of this, the spatial accuracy of these datasets varies across the City. Be aware that there are known issues with some regions of data due to issues with the source data. These datasets' intended use is to show land-use changes over time and other similar tasks. They are not suitable for sub-metre-accuracy feature collection and are provided "as-is".
Aerial LiDAR - Hillshade: A hillshade is a hypothetical illumination of a surface, produced by determining illumination values for each cell in a raster. It is calculated by setting a position for a hypothetical light source and calculating the illumination values of each cell in relation to neighboring cells. It can be used to greatly enhance the visualization of a surface for analysis or graphical display, especially when using transparency. The City of Toronto publishes hillshades in both bare-earth (no above-ground features included) and full-feature versions: Bare Earth, Full Feature.
The Armed Conflict Location & Event Data Project (ACLED) is a US-registered non-profit whose mission is to provide the highest quality real-time data on political violence and demonstrations globally. The information collected includes the type of event, its date, the location, the actors involved, a brief narrative summary, and any reported fatalities. ACLED users rely on our robust global dataset to support decision-making around policy and programming, accurately analyze political and country risk, support operational security planning, and improve supply chain management. ACLED's transparent methodology, expert team composed of 250 individuals speaking more than 70 languages, real-time coding system, and weekly update schedule are unrivaled in the field of data collection on conflict and disorder.
Global Coverage: We track political violence, demonstrations, and strategic developments around the world, covering more than 240 countries and territories.
Published Weekly: Our data are collected in real time and published weekly. It is the only dataset of its kind to provide such a high update frequency, with peer datasets most often updating monthly or yearly.
Historical Data: Our dataset contains at least two full years of data for all countries and territories, with more extensive coverage available for multiple regions.
Experienced Researchers: Our data are coded by experienced researchers with local, country, and regional expertise and language skills.
Thorough Data Collection and Sourcing: Pulling from traditional media, reports, local partner data, and verified new media, ACLED uses a tailor-made sourcing methodology for individual regions/countries.
Extensive Review Process: Our data go through an exhaustive multi-stage quality assurance process to ensure their accuracy and reliability. This process includes both manual and automated error checking and contextual review.
Clean, Standardized, and Validated: Our data can be easily connected with internal dashboards through our API or downloaded through the Data Export Tool on our website.
Resources Available on ESRI's Living Atlas
ACLED data are available through the Living Atlas for the most recent 12-month period. The data are mapped to the centroid of first administrative divisions ("admin1") within countries (e.g., states, districts, provinces) and aggregated by month. Variables in the data include:
The number of events per admin1-month, disaggregated by event type (protests, riots, battles, violence against civilians, explosions/remote violence, and strategic developments)
A conservative estimate of reported fatalities per admin1-month
The total number of distinct violent actors active in the corresponding admin1 for each month
This Living Atlas item is a Web Map, which provides a pre-configured view of ACLED event data in a few layers:
ACLED Event Counts layer: events per admin1-month, styled by predominant event type for each location.
ACLED Violent Actors layer: the number of distinct violent actors per admin1-month.
ACLED Fatality Estimates layer: the estimated number of fatalities from political violence per admin1-month.
These layers are based on the ACLED Conflict and Demonstrations Event Data Feature Layer, which has the same data but only a basic default styling that is similar to the Event Counts layer. The Web Map layers are configured with a time-slider component to account for the multiple months of data per admin1 unit.
These indicators are also available in the ACLED Conflict and Demonstrations Data Key Indicators Group Layer, which includes the same preconfigured layers but without the time-slider component or background layers.
Resources Available on the ACLED Website
The fully disaggregated dataset is available for download on ACLED's website, including:
Date (day, month, year)
Actors, associated actors, and actor types
Location information (ADMIN1, ADMIN2, ADMIN3, location, and geo coordinates)
A conservative fatality estimate
Disorder type, event types, and sub-event types
Tags further categorizing the data
A notes column providing a narrative of the event
For more information, please see the ACLED Codebook. To explore ACLED's full dataset, please register on the ACLED Access Portal, following the instructions available in this Access Guide. Upon registration, you'll receive access to ACLED data on a limited basis. Commercial users have access to 3 free data downloads company-wide, with access to up to one year of historical data. Public sector users have access to 6 downloads of up to three years of historical data organization-wide. To explore options for extended access, please reach out to our Access Team (access@acleddata.com). With an ACLED license, users can also leverage ACLED's interactive Global Dashboard and check in for weekly data updates and analysis tracking key political violence and protest trends around the world. ACLED also has several analytical tools available, such as our Early Warning Dashboard, Conflict Alert System (CAST), and Conflict Index Dashboard.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Tamil Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Tamil language, advancing the field of artificial intelligence.
Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Tamil. A context paragraph is given for each question, from which the answer is to be extracted. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Tamil people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Tamil Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Tamil version is grammatically accurate, without spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Tamil Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We propose the Safe Human dataset, consisting of 17 different object classes, referred to as the SH17 dataset. We scraped images from the Pexels website, which offers clear usage rights for all its images, showcasing a range of human activities across diverse industrial operations.
To extract relevant images, we used multiple queries such as manufacturing worker, industrial worker, human worker, labor, etc. The tags associated with Pexels images proved reasonably accurate. After removing duplicate samples, we obtained a dataset of 8,099 images. The dataset exhibits significant diversity, representing manufacturing environments globally, thus minimizing potential regional or racial biases. Samples of the dataset are shown below.
Key features
Collected from diverse industrial environments globally
High quality images (max resolution 8192x5462, min 1920x1002)
Average of 9.38 instances per image
Includes small objects like ears and earmuffs (39,764 annotations < 1% image area, 59,025 annotations < 5% area)
Classes
Person
Head
Face
Glasses
Face-mask-medical
Face-guard
Ear
Earmuffs
Hands
Gloves
Foot
Shoes
Safety-vest
Tools
Helmet
Medical-suit
Safety-suit
The data consists of three folders and two file lists:
images: contains all images
labels: contains labels in YOLO format for all images
voc_labels: contains labels in VOC format for all images
train_files.txt: list of all images used for training
val_files.txt: list of all images used for validation
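A minimal sketch of reading one YOLO-format label file and mapping class indices to the 17 class names above. It assumes the standard YOLO convention (`class x_center y_center width height`, normalized coordinates) and that class indices follow the order of the list above; both are assumptions to verify against the dataset documentation:

from pathlib import Path

# Assumed class order (matching the list above); verify against the dataset docs.
CLASSES = ["person", "head", "face", "glasses", "face-mask-medical", "face-guard",
           "ear", "earmuffs", "hands", "gloves", "foot", "shoes", "safety-vest",
           "tools", "helmet", "medical-suit", "safety-suit"]

def read_yolo_labels(path: Path):
    """Parse one YOLO label file: class_id x_center y_center width height (normalized)."""
    boxes = []
    for line in path.read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((CLASSES[int(cls)], float(xc), float(yc), float(w), float(h)))
    return boxes

for name, xc, yc, w, h in read_yolo_labels(Path("labels/example.txt")):
    print(f"{name}: center=({xc:.3f}, {yc:.3f}) size=({w:.3f}, {h:.3f})")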
Disclaimer and Responsible Use:
This dataset, scraped from the Pexels website, is intended for educational, research, and analysis purposes only. The data may be used for training machine learning models only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.
Users should adhere to the Pexels Copyright Notice when utilizing this dataset.
Legal Simplicity: All photos and videos on Pexels can be downloaded and used for free.
Allowed 👌
All photos and videos on Pexels are free to use.
Attribution is not required. Giving credit to the photographer or Pexels is not necessary but always appreciated.
You can modify the photos and videos from Pexels. Be creative and edit them as you like.
Not allowed 👎
Identifiable people may not appear in a bad light or in a way that is offensive.
Don't sell unaltered copies of a photo or video, e.g. as a poster, print or on a physical product without modifying it first.
Don't imply endorsement of your product by people or brands on the imagery.
Don't redistribute or sell the photos and videos on other stock photo or wallpaper platforms.
Don't use the photos or videos as part of your trade-mark, design-mark, trade-name, business name or service mark.
No Warranty Disclaimer:
The dataset is provided "as is," without warranty, and the creator disclaims any legal liability for its use by others.
Ethical Use:
Users are encouraged to consider the ethical implications of their analyses and the potential impact on the broader community.
GitHub Page:
This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY), and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as the NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment. These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012).
The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR), defined as the proportion of holdout observations falling within the estimated PI. The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one.
This data release contains ten tables formatted as tab-delimited text files. The "CA_data.txt" and "NY_data.txt" tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for the CA and NY datasets, respectively. The "CA_sites.txt" and "NY_sites.txt" tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively. The "CA_NRMSE.txt" and "NY_NRMSE.txt" tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using the CA and NY datasets, respectively. The "CA_CR.txt" and "NY_CR.txt" tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively. The "CA_p_per_n.txt" and "NY_p_per_n.txt" tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models, compared to training error for the same models on the entire CA and NY datasets, respectively.
References Cited
Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE.
Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597.
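For reference, a minimal sketch of the two accuracy measures described above, computed for one test dataset; the observed holdout values, imputed estimates, and prediction-interval bounds are assumed inputs, with invented values for illustration:

import numpy as np

def nrmse(observed: np.ndarray, imputed: np.ndarray) -> float:
    """RMSE divided by the standard deviation of the observed holdout values."""
    rmse = np.sqrt(np.mean((observed - imputed) ** 2))
    return rmse / np.std(observed)

def coverage_rate(observed: np.ndarray, pi_lower: np.ndarray, pi_upper: np.ndarray) -> float:
    """Proportion of holdout observations falling within the estimated PI."""
    return np.mean((observed >= pi_lower) & (observed <= pi_upper))

# Hypothetical holdout values (feet below land surface) and imputed estimates
obs = np.array([12.3, 8.7, 15.1, 10.4])
imp = np.array([11.9, 9.2, 14.6, 10.9])
print(nrmse(obs, imp))
print(coverage_rate(obs, imp - 1.5, imp + 1.5))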
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Automated UI Testing: "Reorganized" can be employed to perform automated UI testing for web and mobile applications, helping developers quickly identify and verify the presence of specific UI elements such as buttons, fields, and images to ensure that each component is functioning properly and meets design specifications.
Accessibility Enhancement: Utilizing "Reorganized" can help in improving the accessibility of websites and applications by automatically identifying and labeling different GUI elements, enabling screen reader software to provide more accurate and detailed information for visually impaired users.
UI Design Evaluation: "Reorganized" can assist in analyzing and comparing UI designs of different applications to evaluate consistency, user experience and adherence to design principles. By identifying specific elements, it can provide insights to designers on which areas need improvement or adjustments.
Content Curation and Classification: The computer vision model can be used to analyze and sort through large collections of web pages or applications to categorize and curate content based on the presence of specific GUI elements like text, images, buttons, etc. This can be helpful in creating repositories, educational material, or designing targeted advertisements.
Website Migration and Conversion: Using "Reorganized" can significantly speed up the process of migrating or converting websites, especially when transitioning from one content management system to another. By identifying and extracting GUI elements, it becomes easier to map these elements to a new system and ensure a seamless transfer.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.
Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. A context paragraph is given for each question, from which the answer is to be extracted. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Filipino version is grammatically accurate, without spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Success.ai is at the forefront of delivering precise consumer behavior insights that empower businesses to understand and anticipate customer needs more effectively. Our extensive datasets provide a deep dive into the nuances of consumer actions, preferences, and trends, enabling businesses to tailor their strategies for maximum engagement and conversion.
Explore the Multifaceted Dimensions of Consumer Behavior:
Why Choose Success.ai for Consumer Behavior Data?
Strategic Applications of Consumer Behavior Data for Business Growth:
Empower Your Business with Actionable Consumer Insights from Success.ai
Success.ai provides not just data, but a gateway to transformative business strategies. Our comprehensive consumer behavior insights allow you to make informed decisions, personalize customer interactions, and ultimately drive higher engagement and sales.
Get in touch with us today to discover how our Consumer Behavior Intent Data can revolutionize your business strategies and help you achieve your market potential.
Contact Success.ai now and start transforming data into growth. Let us show you how our unmatched data solutions can be the cornerstone of your business success.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Estimates of the sensitivity and specificity for new diagnostic tests based on evaluation against a known gold standard are imprecise when the accuracy of the gold standard is imperfect. Bayesian latent class models (LCMs) can be helpful under these circumstances, but the necessary analysis requires expertise in computational programming. Here, we describe open-access web-based applications that allow non-experts to apply Bayesian LCMs to their own data sets via a user-friendly interface.
Methods/Principal Findings: Applications for Bayesian LCMs were constructed on a web server using the R and WinBUGS programs. The models provided (http://mice.tropmedres.ac) include two Bayesian LCMs: the two-tests in two-populations model (Hui and Walter model) and the three-tests in one-population model (Walter and Irwig model). Both models are available with simplified and advanced interfaces. In the former, all settings for Bayesian statistics are fixed as defaults. Users input their data set into a table provided on the webpage. Disease prevalence and the accuracy of the diagnostic tests are then estimated using the Bayesian LCM and provided on the web page within a few minutes. With the advanced interfaces, experienced researchers can modify all settings in the models as needed. These settings include correlation among diagnostic test results and prior distributions for all unknown parameters. The web pages provide worked examples with both models using the original data sets presented by Hui and Walter in 1980, and by Walter and Irwig in 1988. We also illustrate the utility of the advanced interface using the Walter and Irwig model on a data set from a recent melioidosis study. The results obtained from the web-based applications were comparable to those published previously.
Conclusions: The newly developed web-based applications are open-access and provide an important new resource for researchers worldwide to evaluate new diagnostic tests.
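To make the model structure concrete, here is a minimal sketch of the two-tests in two-populations (Hui and Walter) latent class model under conditional independence, written in PyMC rather than the WinBUGS backend the applications use; the cross-classified counts are invented for illustration:

import numpy as np
import pymc as pm

# Hypothetical counts per population: [T1+/T2+, T1+/T2-, T1-/T2+, T1-/T2-]
counts = np.array([[49, 10, 8, 33],
                   [25, 14, 9, 52]])

with pm.Model():
    prev = pm.Beta("prev", 1.0, 1.0, shape=2)  # prevalence in each population
    se = pm.Beta("se", 1.0, 1.0, shape=2)      # sensitivity of each test
    sp = pm.Beta("sp", 1.0, 1.0, shape=2)      # specificity of each test

    for k in range(2):
        p = prev[k]
        # Cell probabilities, assuming the tests are independent given true status
        cells = pm.math.stack([
            p * se[0] * se[1] + (1 - p) * (1 - sp[0]) * (1 - sp[1]),
            p * se[0] * (1 - se[1]) + (1 - p) * (1 - sp[0]) * sp[1],
            p * (1 - se[0]) * se[1] + (1 - p) * sp[0] * (1 - sp[1]),
            p * (1 - se[0]) * (1 - se[1]) + (1 - p) * sp[0] * sp[1],
        ])
        pm.Multinomial(f"y{k}", n=counts[k].sum(), p=cells, observed=counts[k])

    idata = pm.sample()

The advanced interfaces additionally allow correlation terms between tests and custom priors, which this independence sketch omits.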
Introducing Job Posting Datasets: Uncover labor market insights!
Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.
Job Posting Datasets Source:
Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.
Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.
StackShare: Access StackShare datasets to make data-driven technology decisions.
Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.
Choose your preferred dataset delivery options for convenience:
Receive datasets in various formats, including CSV, JSON, and more. Opt for storage solutions such as AWS S3, Google Cloud Storage, and more. Customize data delivery frequencies, whether one-time or per your agreed schedule.
Why Choose Oxylabs Job Posting Datasets:
Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.
Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.
Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.
Pricing Options:
Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example Image:
https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
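As an illustration of that two-stage workflow, a minimal sketch that crops detector-predicted text boxes and passes each crop to Tesseract via pytesseract; the image filename and box coordinates are hypothetical stand-ins for a detection model's output:

from PIL import Image
import pytesseract

image = Image.open("example.png")

# Hypothetical detector output: (left, top, right, bottom) boxes around text
boxes = [(40, 60, 300, 110), (50, 200, 480, 260)]

for i, box in enumerate(boxes):
    crop = image.crop(box)                     # isolate one text region
    text = pytesseract.image_to_string(crop)   # run traditional OCR on the crop
    print(f"box {i}: {text.strip()}")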
Alternatively, you could try your hand using this as a neural font identification dataset. Nvidia, amongst others, have had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset includes news articles collected from Al Jazeera through web scraping. The scraping code was developed in November/December 2022, and it may need updates to accommodate changes in the website's structure since then. Users are advised to review and adapt the scraping code according to the current structure of the Al Jazeera website.
Please note that changes in website structure may impact the classes and elements used for scraping. The provided code is a starting point and may require adjustments to ensure accurate data extraction.
If you wish to scrape news articles from different categories beyond Science & Technology, Economics, or Sports, or if you need a more extensive dataset, the scraping code is available in the repository. Visit the repository to access the code and instructions for scraping additional categories or obtaining a larger dataset.
Repository Link: https://github.com/uma-oo/Aljazeera-Scraper
Dataset Structure:
- Category
- Title
- Text of the Article
The dataset is structured with three main columns: Category, Title, and Text of the Article. Explore and utilize the dataset for various analytical and natural language processing tasks.
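A minimal sketch of loading the dataset for analysis; the CSV filename is a placeholder, so adjust it to the actual file in the release:

import pandas as pd

# Placeholder filename; substitute the actual dataset file
df = pd.read_csv("aljazeera_articles.csv")

# The three columns described above
print(df[["Category", "Title", "Text of the Article"]].head())
print(df["Category"].value_counts())  # e.g., Science & Technology, Economics, Sports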
Feel free to explore and contribute to the code for your specific scraping needs!
USGS is assessing the feasibility of map projections and grid systems for lunar surface operations. We propose developing a new Lunar Transverse Mercator (LTM) system, a Lunar Polar Stereographic (LPS) system, and a Lunar Grid Reference System (LGRS). We have also designed additional grids to NASA requirements for astronaut navigation, referred to as LGRS in Artemis Condensed Coordinates (ACC). This data release includes LGRS grids finer than 25 km (1 km, 100 m, and 10 m) in ACC format. LTM, LPS, and LGRS grids are not released here but may be accessed at https://doi.org/10.5066/P13YPWQD. LTM, LPS, and LGRS are similar in design and use to the Universal Transverse Mercator (UTM), Universal Polar Stereographic (UPS), and Military Grid Reference System (MGRS), but adhere to NASA requirements. The LGRS ACC format is similar in design and structure to the historic Army Map Service Apollo orthotopophoto charts for navigation.
The Lunar Transverse Mercator (LTM) projection system is a globalized set of lunar map projections that divides the Moon into zones to provide a uniform coordinate system for accurate spatial representation. It uses a transverse Mercator projection, which maps the Moon into 45 transverse Mercator strips, each 8° of longitude wide. These strips are subdivided at the lunar equator for a total of 90 zones: forty-five in the northern hemisphere and forty-five in the south. LTM specifies a topocentric, rectangular coordinate system (easting and northing coordinates) for spatial referencing. This projection is commonly used in GIS and surveying for its ability to represent large areas with high positional accuracy while maintaining consistent scale.
The Lunar Polar Stereographic (LPS) projection system contains projection specifications for the Moon's polar regions. It uses a polar stereographic projection, which maps the polar regions onto an azimuthal plane. The LPS system contains 2 zones, one at each pole, referred to as the LPS northern and LPS southern zones. LPS, like its equatorial counterpart LTM, specifies a topocentric, rectangular coordinate system (easting and northing coordinates) for spatial referencing. This projection is commonly used in GIS and surveying for its ability to represent large polar areas with high positional accuracy while maintaining consistent scale across the map region.
LGRS is a globalized grid system for lunar navigation supported by the LTM and LPS projections. LGRS provides an alphanumeric grid coordinate structure for both the LTM and LPS systems. This labeling structure is utilized similarly to MGRS. LGRS defines a global area grid based on latitude and longitude and a 25×25 km grid based on LTM and LPS coordinate values. Two implementations of LGRS are used, as polar areas require an LPS projection and equatorial areas a transverse Mercator; we describe the differences in the techniques and methods in this data release. See McClernan et al. (in press) for more information.
ACC is a method of simplifying LGRS coordinates and is similar in use to the Army Map Service Apollo orthotopophoto charts for navigation. These grids are designed to condense a full LGRS coordinate to a relative coordinate of 6 characters in length. LGRS in ACC format is implemented by imposing a 1 km grid within the LGRS 25 km grid, then truncating the grid precision to 10 m.
To meet the character limit, a coordinate is reported as a value relative to the lower-left corner of the 25 km LGRS zone without the zone information; however, zone information can be reported. As implemented, any 25×25 km area on the lunar surface will have a unique set of ACC coordinates for reporting locations. The shapefiles provided in this data release are projected in the LTM or LPS PCRSs and must utilize these projections to be dimensioned correctly.
LGRS ACC grid files and resolutions (each site includes 1 km, 100 m, and 10 m grid shapefiles):
LGRS ACC grids in the LPS portion: Amundsen_Rim, Nobile_Rim_2, Haworth, Faustini_Rim_A, de_Gerlache_Rim_2, Connecting_Ridge_Extension, Connecting_Ridge, Nobile_Rim_1, Peak_Near_Shackleton, de_Gerlache_Rim, Leibnitz_Beta_Plateau, Malapert_Massif, de_Gerlache-Kocher_Massif
LGRS ACC grids in the LTM portion: Apollo_11, Apollo_12, Apollo_14, Apollo_15, Apollo_16, Apollo_17
LTM, LPS, and LGRS PCRS shapefiles utilize either a custom transverse Mercator or polar stereographic projection. For PCRS grids, the LTM and LPS projections are recommended for all LTM, LPS, and LGRS grid sizes; see McClernan et al. (in press) for these projections. GIS utilization of grid shapefiles projected in lunar latitude and longitude should use a registered lunar geographic coordinate system (GCS) such as IAU_2015:30100 or ESRI:104903; this only applies to grids that cross multiple LTM zones. Note: all shapefiles require a specific projection and datum. The recommended projection is LTM or LPS or, when needed, IAU_2015:30100 or ESRI:104903. The datum utilized must be the Jet Propulsion Laboratory (JPL) Development Ephemeris (DE) 421 in the Mean Earth (ME) Principal Axis Orientation, as recommended by the International Astronomical Union (IAU) (Archinal et al., 2008).
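For GIS users, a minimal sketch of defining a lunar transverse Mercator zone with pyproj PROJ strings. The central meridian, scale factor, false easting, and the spherical radius of 1,737,400 m are illustrative assumptions only; use the LTM/LPS definitions in McClernan et al. (in press) for production work:

from pyproj import CRS, Transformer

# Lunar geographic coordinates on a sphere of radius 1,737,400 m (illustrative)
lunar_gcs = CRS.from_proj4("+proj=longlat +R=1737400 +no_defs")

# One hypothetical 8-degree-wide LTM zone centered on longitude 4 E;
# k and x_0 borrow UTM conventions and are assumptions, not LTM spec values.
ltm_zone = CRS.from_proj4("+proj=tmerc +lat_0=0 +lon_0=4 +k=0.9996 "
                          "+x_0=500000 +y_0=0 +R=1737400 +units=m +no_defs")

to_ltm = Transformer.from_crs(lunar_gcs, ltm_zone, always_xy=True)
easting, northing = to_ltm.transform(4.25, -10.0)  # lon, lat in degrees
print(easting, northing)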