Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
Dataset Composition:
Intended Use:
Additional Information:
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Data Access: The data in this research collection may only be used for research purposes. Portions of the data are copyrighted and have commercial value, so care must be taken to use them only for research purposes. Due to these restrictions, the collection is not open data. Please download the Data Sharing Agreement and send the signed form to fakenewstask@gmail.com.
Citation
Please cite our work as:
@article{shahi2021overview,
  title = {Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection},
  author = {Shahi, Gautam Kishore and Stru{\ss}, Julia Maria and Mandl, Thomas},
  journal = {Working Notes of CLEF},
  year = {2021}
}
Problem Definition: Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other (e.g., claims in dispute) and detect the topical domain of the article. This task will run in English.
Subtask 3A: Multi-class fake news detection of news articles (English). Subtask 3A frames fake news detection as a four-class classification problem. The training data will be released in batches and comprises roughly 900 articles with their respective labels. Given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other. Our definitions for the categories are as follows:
False - The main claim made in an article is untrue.
Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information but cannot be considered 100% true. It includes all articles in categories like partially false, partially true, mostly true, miscaptioned, misleading etc., as defined by different fact-checking services.
True - This rating indicates that the primary elements of the main claim are demonstrably true.
Other - An article that cannot be categorised as true, false, or partially false due to a lack of evidence about its claims. This category includes articles in dispute and unproven articles.
Subtask 3B: Topical Domain Classification of News Articles (English). Fact-checkers require background expertise to identify the truthfulness of an article, and this categorisation will help to automate the sampling process from a stream of data. Given the text of a news article, determine the topical domain of the article (English). This is a classification problem: the task is to categorise fake news articles into six topical categories such as health, election, crime, climate, and education. This task will be offered for a subset of the data of Subtask 3A.
Input Data
The data will be provided in the format of Id, title, text, rating, and domain; the descriptions of the columns are as follows:
Task 3a
Task 3b
Output data format
Task 3a
Sample File
public_id, predicted_rating
1, false
2, true
Task 3b
Sample file
public_id, predicted_domain
1, health
2, crime
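For illustration, here is a minimal sketch that writes submission files in the sample formats above for both subtasks; the `predictions` mappings are hypothetical stand-ins for a model's output:

import csv

# Hypothetical predictions: public_id -> predicted label
predictions_3a = {1: "false", 2: "true"}
predictions_3b = {1: "health", 2: "crime"}

def write_submission(path, header, predictions):
    """Write one row per article in the sample-file format shown above."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for public_id, label in sorted(predictions.items()):
            writer.writerow([public_id, label])

write_submission("task3a_run1.csv", ["public_id", "predicted_rating"], predictions_3a)
write_submission("task3b_run1.csv", ["public_id", "predicted_domain"], predictions_3b)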
Additional data for Training
To train your model, participants can use additional data in a similar format; some datasets are available on the web. We do not provide the ground truth for those datasets. For testing, we will not use any articles from other datasets. Some possible sources:
IMPORTANT!
Evaluation Metrics
This task is evaluated as a classification task. We will use the macro-averaged F1 measure for the ranking of teams, as illustrated below. There is a limit of 5 runs (in total, not per day), and only one person from a team is allowed to submit runs.
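Macro-averaged F1 computes the F1 score per class and takes their unweighted mean, so rare classes count as much as frequent ones. A minimal sketch with scikit-learn (labels are invented for illustration):

from sklearn.metrics import f1_score

y_true = ["false", "true", "partially false", "other", "false"]
y_pred = ["false", "true", "false", "other", "partially false"]

# Macro-F1: unweighted mean of per-class F1 scores, so a rare class
# such as "other" counts as much as a frequent one.
print(f1_score(y_true, y_pred, average="macro"))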
Submission Link: https://competitions.codalab.org/competitions/31238
Related Work
https://www.kappasignal.com/p/legal-disclaimer.html
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.
Historical daily stock prices (open, high, low, close, volume)
Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)
Technical indicators (e.g., moving averages, RSI, MACD, average directional index, Aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution (A/D) line, parabolic SAR, Bollinger Bands, Fibonacci retracements, Williams %R, commodity channel index)
Feature engineering based on financial data and technical indicators
Sentiment analysis data from social media and news articles
Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)
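As an illustration of the feature engineering listed above, a minimal pandas sketch computing a simple moving average and a basic RSI from a hypothetical close-price series (window lengths and prices are assumptions for illustration, not part of the dataset):

import pandas as pd

def sma(close: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average over `window` periods."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Basic RSI: average gains over average losses, scaled to 0-100."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

close = pd.Series([100, 102, 101, 105, 107, 106, 108, 110, 109, 111,
                   113, 112, 115, 117, 116, 118, 120, 119, 121, 123], dtype=float)
features = pd.DataFrame({"sma_5": sma(close, 5), "rsi_14": rsi(close)})
print(features.tail())

Typical applications of such features include: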
Stock price prediction
Portfolio optimization
Algorithmic trading
Market sentiment analysis
Risk management
Researchers investigating the effectiveness of machine learning in stock market prediction
Analysts developing quantitative buy/sell trading strategies
Individuals interested in building their own stock market prediction models
Students learning about machine learning and financial applications
The dataset may include different levels of granularity (e.g., daily, hourly)
Data cleaning and preprocessing are essential before model training
Regular updates are recommended to maintain the accuracy and relevance of the data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset combines two parts: fake news and true news. The fake news was collected from Kaggle, and some true news was collected from IEEE DataPort. Because additional true news was needed to balance the fake news, more true news articles were collected from trusted online sites. Finally, the fake and true news were concatenated into a single dataset to help researchers who want to work on this topic.
Unlock the Power of Behavioural Data with GDPR-Compliant Clickstream Insights.
Swash clickstream data offers a comprehensive and GDPR-compliant dataset sourced from users worldwide, encompassing both desktop and mobile browsing behaviour. Here's an in-depth look at what sets us apart and how our data can benefit your organisation.
User-Centric Approach: Unlike traditional data collection methods, we take a user-centric approach by rewarding users for the data they willingly provide. This unique methodology ensures transparent data collection practices, encourages user participation, and establishes trust between data providers and consumers.
Wide Coverage and Varied Categories: Our clickstream data covers diverse categories, including search, shopping, and URL visits. Whether you are interested in understanding user preferences in e-commerce, analysing search behaviour across different industries, or tracking website visits, our data provides a rich and multi-dimensional view of user activities.
GDPR Compliance and Privacy: We prioritise data privacy and strictly adhere to GDPR guidelines. Our data collection methods are fully compliant, ensuring the protection of user identities and personal information. You can confidently leverage our clickstream data without compromising privacy or facing regulatory challenges.
Market Intelligence and Consumer Behaviour: Gain deep insights into market intelligence and consumer behaviour using our clickstream data. Understand trends, preferences, and user behaviour patterns by analysing the comprehensive user-level, time-stamped raw or processed data feed. Uncover valuable information about user journeys, search funnels, and paths to purchase to enhance your marketing strategies and drive business growth.
High-Frequency Updates and Consistency: We provide high-frequency updates and consistent user participation, offering both historical data and ongoing daily delivery. This ensures you have access to up-to-date insights and a continuous data feed for comprehensive analysis. Our reliable and consistent data empowers you to make accurate and timely decisions.
Custom Reporting and Analysis: We understand that every organisation has unique requirements. That's why we offer customisable reporting options, allowing you to tailor the analysis and reporting of clickstream data to your specific needs. Whether you need detailed metrics, visualisations, or in-depth analytics, we provide the flexibility to meet your reporting requirements.
Data Quality and Credibility: We take data quality seriously. Our data sourcing practices are designed to ensure responsible and reliable data collection. We implement rigorous data cleaning, validation, and verification processes, guaranteeing the accuracy and reliability of our clickstream data. You can confidently rely on our data to drive your decision-making processes.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Italian Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Italian language, advancing the field of artificial intelligence.
Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Italian. A context paragraph is given for each question, from which the answer is to be extracted. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Italian people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Italian Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence (an illustrative record is shown below).
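For illustration, a hypothetical record built from the annotation fields listed above; all values are invented, and the exact field names in the released files may differ:

import json

# Hypothetical record; field names follow the description above, values invented.
record = {
    "id": "it_qa_00001",
    "context": "...context paragraph in Italian...",
    "context_reference_link": "https://example.com/source-article",
    "question": "...question in Italian...",
    "question_type": "multiple-choice",
    "question_complexity": "medium",
    "question_category": "science",
    "domain": "general",
    "prompt_type": "instruction",
    "answer": "...answer in Italian...",
    "answer_type": "short phrase",
    "rich_text_presence": False,
}
print(json.dumps(record, ensure_ascii=False, indent=2))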
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Italian version is grammatically accurate, without spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Italian Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
The purpose of this project is to improve the accuracy of statistical software by providing reference datasets with certified computational results that enable the objective evaluation of statistical software. Currently, datasets and certified values are provided for assessing the accuracy of software for univariate statistics, linear regression, nonlinear regression, and analysis of variance. The collection includes both generated and 'real-world' data of varying levels of difficulty.
Generated datasets are designed to challenge specific computations. These include the classic Wampler datasets for testing linear regression algorithms and the Simon & Lesage datasets for testing analysis of variance algorithms. Real-world data include challenging datasets such as the Longley data for linear regression, and more benign datasets such as the Daniel & Wood data for nonlinear regression. Certified values are 'best-available' solutions. The certification procedure is described in the web pages for each statistical method.
Datasets are ordered by level of difficulty (lower, average, and higher). Strictly speaking, the level of difficulty of a dataset depends on the algorithm; these levels are merely provided as rough guidance for the user. Producing correct results on all datasets of higher difficulty does not imply that your software will pass all datasets of average or even lower difficulty. Similarly, producing correct results for all datasets in this collection does not imply that your software will do the same for your particular dataset. It will, however, provide some degree of assurance, in the sense that your package provides correct results for datasets known to yield incorrect results for some software.
The Statistical Reference Datasets project is also supported by the Standard Reference Data Program.
The following published OGC-compliant WMTS services facilitate access to live geospatial data from the City of Toronto. All WMTS services are in the Web Mercator projection.
Orthorectified Aerial Imagery: The following dataset provides access to the most current geometrically corrected (orthorectified) aerial photography for the City of Toronto. Previous-year orthoimagery is available through the links provided below.
Historic Aerial Imagery: These datasets are all sourced from scans of the original black-and-white aerial photography. These images have not gone through the same rigorous process that current aerial imagery goes through to create a seamless orthorectified image corrected for changes in elevation across the City. Because of this, the spatial accuracy of these datasets varies across the City. Be aware that there are known issues with some regions of data due to issues with the source data. These datasets' intended use is to show land-use changes over time and other similar tasks. They are not suitable for sub-metre-accuracy feature collection and are provided "as-is".
Aerial LiDAR - Hillshade: A hillshade is a hypothetical illumination of a surface, produced by determining illumination values for each cell in a raster. It is calculated by setting a position for a hypothetical light source and calculating the illumination values of each cell in relation to neighboring cells. It can be used to greatly enhance the visualization of a surface for analysis or graphical display, especially when using transparency. The City of Toronto publishes hillshades in both bare-earth (no above-ground features included) and full-feature versions: Bare Earth, Full Feature.
The Armed Conflict Location & Event Data Project (ACLED) is a US-registered non-profit whose mission is to provide the highest quality real-time data on political violence and demonstrations globally. The information collected includes the type of event, its date, the location, the actors involved, a brief narrative summary, and any reported fatalities. ACLED users rely on our robust global dataset to support decision-making around policy and programming, accurately analyze political and country risk, support operational security planning, and improve supply chain management. ACLED's transparent methodology, expert team composed of 250 individuals speaking more than 70 languages, real-time coding system, and weekly update schedule are unrivaled in the field of data collection on conflict and disorder.
Global Coverage: We track political violence, demonstrations, and strategic developments around the world, covering more than 240 countries and territories.
Published Weekly: Our data are collected in real time and published weekly. It is the only dataset of its kind to provide such a high update frequency, with peer datasets most often updating monthly or yearly.
Historical Data: Our dataset contains at least two full years of data for all countries and territories, with more extensive coverage available for multiple regions.
Experienced Researchers: Our data are coded by experienced researchers with local, country, and regional expertise and language skills.
Thorough Data Collection and Sourcing: Pulling from traditional media, reports, local partner data, and verified new media, ACLED uses a tailor-made sourcing methodology for individual regions/countries.
Extensive Review Process: Our data go through an exhaustive multi-stage quality assurance process to ensure their accuracy and reliability. This process includes both manual and automated error checking and contextual review.
Clean, Standardized, and Validated: Our data can be easily connected with internal dashboards through our API or downloaded through the Data Export Tool on our website.
Resources Available on ESRI's Living Atlas
ACLED data are available through the Living Atlas for the most recent 12-month period. The data are mapped to the centroid of first administrative divisions ("admin1") within countries (e.g., states, districts, provinces) and aggregated by month. Variables in the data include:
The number of events per admin1-month, disaggregated by event type (protests, riots, battles, violence against civilians, explosions/remote violence, and strategic developments)
A conservative estimate of reported fatalities per admin1-month
The total number of distinct violent actors active in the corresponding admin1 for each month
This Living Atlas item is a Web Map, which provides a pre-configured view of ACLED event data in a few layers:
ACLED Event Counts layer: events per admin1-month, styled by predominant event type for each location.
ACLED Violent Actors layer: the number of distinct violent actors per admin1-month.
ACLED Fatality Estimates layer: the estimated number of fatalities from political violence per admin1-month.
These layers are based on the ACLED Conflict and Demonstrations Event Data Feature Layer, which has the same data but only a basic default styling that is similar to the Event Counts layer. The Web Map layers are configured with a time-slider component to account for the multiple months of data per admin1 unit.
These indicators are also available in the ACLED Conflict and Demonstrations Data Key Indicators Group Layer, which includes the same preconfigured layers but without the time-slider component or background layers.
Resources Available on the ACLED Website
The fully disaggregated dataset is available for download on ACLED's website, including:
Date (day, month, year)
Actors, associated actors, and actor types
Location information (ADMIN1, ADMIN2, ADMIN3, location, and geo coordinates)
A conservative fatality estimate
Disorder type, event types, and sub-event types
Tags further categorizing the data
A notes column providing a narrative of the event
For more information, please see the ACLED Codebook. To explore ACLED's full dataset, please register on the ACLED Access Portal, following the instructions available in this Access Guide. Upon registration, you'll receive access to ACLED data on a limited basis. Commercial users have access to 3 free data downloads company-wide, with access to up to one year of historical data. Public sector users have access to 6 downloads of up to three years of historical data organization-wide. To explore options for extended access, please reach out to our Access Team (access@acleddata.com). With an ACLED license, users can also leverage ACLED's interactive Global Dashboard and check in for weekly data updates and analysis tracking key political violence and protest trends around the world. ACLED also has several analytical tools available, such as our Early Warning Dashboard, Conflict Alert System (CAST), and Conflict Index Dashboard.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Tamil Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Tamil language, advancing the field of artificial intelligence.
Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Tamil. A context paragraph is given for each question, from which the answer is to be extracted. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Tamil people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Tamil Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Tamil version is grammatically accurate, without spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Tamil Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We propose the Safe Human dataset, consisting of 17 different object classes, referred to as the SH17 dataset. We scraped images from the Pexels website, which offers clear usage rights for all its images, showcasing a range of human activities across diverse industrial operations.
To extract relevant images, we used multiple queries such as manufacturing worker, industrial worker, human worker, labor, etc. The tags associated with Pexels images proved reasonably accurate. After removing duplicate samples, we obtained a dataset of 8,099 images. The dataset exhibits significant diversity, representing manufacturing environments globally, thus minimizing potential regional or racial biases. Samples of the dataset are shown below.
Key features
Collected from diverse industrial environments globally
High quality images (max resolution 8192x5462, min 1920x1002)
Average of 9.38 instances per image
Includes small objects like ears and earmuffs (39,764 annotations < 1% image area, 59,025 annotations < 5% area)
Classes
Person
Head
Face
Glasses
Face-mask-medical
Face-guard
Ear
Earmuffs
Hands
Gloves
Foot
Shoes
Safety-vest
Tools
Helmet
Medical-suit
Safety-suit
The data consists of three folders and two file lists:
images: contains all images
labels: contains labels in YOLO format for all images
voc_labels: contains labels in VOC format for all images
train_files.txt: list of all images used for training
val_files.txt: list of all images used for validation
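A minimal sketch of reading one YOLO-format label file and mapping class indices to the 17 class names above. It assumes the standard YOLO convention (`class x_center y_center width height`, normalized coordinates) and that class indices follow the order of the list above; both are assumptions to verify against the dataset documentation:

from pathlib import Path

# Assumed class order (matching the list above); verify against the dataset docs.
CLASSES = ["person", "head", "face", "glasses", "face-mask-medical", "face-guard",
           "ear", "earmuffs", "hands", "gloves", "foot", "shoes", "safety-vest",
           "tools", "helmet", "medical-suit", "safety-suit"]

def read_yolo_labels(path: Path):
    """Parse one YOLO label file: class_id x_center y_center width height (normalized)."""
    boxes = []
    for line in path.read_text().splitlines():
        cls, xc, yc, w, h = line.split()
        boxes.append((CLASSES[int(cls)], float(xc), float(yc), float(w), float(h)))
    return boxes

for name, xc, yc, w, h in read_yolo_labels(Path("labels/example.txt")):
    print(f"{name}: center=({xc:.3f}, {yc:.3f}) size=({w:.3f}, {h:.3f})")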
Disclaimer and Responsible Use:
This dataset, scraped from the Pexels website, is intended for educational, research, and analysis purposes only. The data may be used for training machine learning models only. Users are urged to use this data responsibly, ethically, and within the bounds of legal stipulations.
Users should adhere to the Pexels Copyright Notice when utilizing this dataset.
Legal Simplicity: All photos and videos on Pexels can be downloaded and used for free.
Allowed 👌
All photos and videos on Pexels are free to use.
Attribution is not required. Giving credit to the photographer or Pexels is not necessary but always appreciated.
You can modify the photos and videos from Pexels. Be creative and edit them as you like.
Not allowed 👎
Identifiable people may not appear in a bad light or in a way that is offensive.
Don't sell unaltered copies of a photo or video, e.g. as a poster, print or on a physical product without modifying it first.
Don't imply endorsement of your product by people or brands on the imagery.
Don't redistribute or sell the photos and videos on other stock photo or wallpaper platforms.
Don't use the photos or videos as part of your trade-mark, design-mark, trade-name, business name or service mark.
No Warranty Disclaimer:
The dataset is provided "as is," without warranty, and the creator disclaims any legal liability for its use by others.
Ethical Use:
Users are encouraged to consider the ethical implications of their analyses and the potential impact on the broader community.
GitHub Page:
This data release provides two example groundwater-level datasets used to benchmark the Automated Regional Correlation Analysis for Hydrologic Record Imputation (ARCHI) software package (Levy and others, 2024). The first dataset contains groundwater-level records and site metadata for wells located on Long Island, New York (NY), and some surrounding mainland sites in New York and Connecticut. The second dataset contains groundwater-level records and site metadata for wells located in the southeastern San Joaquin Valley of the Central Valley, California (CA). For ease of exposition these are referred to as the NY and CA datasets, respectively. Both datasets are formatted with column headers that can be read by the ARCHI software package within the R computing environment. These datasets were used to benchmark the imputation accuracy of three ARCHI model settings (OLS, ridge, and MOVE.1) against the widely used imputation program missForest (Stekhoven and Bühlmann, 2012).
The ARCHI program was used to process the NY and CA datasets on monthly and annual timesteps, respectively, filter out sites with insufficient data for imputation, and create 200 test datasets from each of the example datasets with 5 percent of observations removed at random (herein referred to as "holdouts"). Imputation accuracy for test datasets was assessed using normalized root mean square error (NRMSE), which is the root mean square error divided by the standard deviation of the observed holdout values. ARCHI produces prediction intervals (PIs) using a non-parametric bootstrapping routine, which were assessed by computing a coverage rate (CR), defined as the proportion of holdout observations falling within the estimated PI. The multiple regression models included with the ARCHI package (OLS and ridge) were further tested on all test datasets at eleven different levels of the p_per_n input parameter, which limits the maximum ratio of regression model predictors (p) per observations (n) as a decimal fraction greater than zero and less than or equal to one.
This data release contains ten tables formatted as tab-delimited text files. The "CA_data.txt" and "NY_data.txt" tables contain 243,094 and 89,997 depth-to-groundwater measurement values (value, in feet below land surface) indexed by site identifier (site_no) and measurement date (date) for the CA and NY datasets, respectively. The "CA_sites.txt" and "NY_sites.txt" tables contain site metadata for the 4,380 and 476 unique sites included in the CA and NY datasets, respectively. The "CA_NRMSE.txt" and "NY_NRMSE.txt" tables contain NRMSE values computed by imputing 200 test datasets with 5 percent random holdouts to assess imputation accuracy for three different ARCHI model settings and missForest using the CA and NY datasets, respectively. The "CA_CR.txt" and "NY_CR.txt" tables contain CR values used to evaluate non-parametric PIs generated by bootstrapping regressions with three different ARCHI model settings using the CA and NY test datasets, respectively. The "CA_p_per_n.txt" and "NY_p_per_n.txt" tables contain mean NRMSE values computed for 200 test datasets with 5 percent random holdouts at 11 different levels of p_per_n for OLS and ridge models, compared to training error for the same models on the entire CA and NY datasets, respectively.
References Cited
Levy, Z.F., Stagnitta, T.J., and Glas, R.L., 2024, ARCHI: Automated Regional Correlation Analysis for Hydrologic Record Imputation, v1.0.0: U.S. Geological Survey software release, https://doi.org/10.5066/P1VVHWKE.
Stekhoven, D.J., and Bühlmann, P., 2012, MissForest—non-parametric missing value imputation for mixed-type data: Bioinformatics 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597.
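For reference, a minimal sketch of the two accuracy measures described above, computed for one test dataset; the observed holdout values, imputed estimates, and prediction-interval bounds are assumed inputs, with invented values for illustration:

import numpy as np

def nrmse(observed: np.ndarray, imputed: np.ndarray) -> float:
    """RMSE divided by the standard deviation of the observed holdout values."""
    rmse = np.sqrt(np.mean((observed - imputed) ** 2))
    return rmse / np.std(observed)

def coverage_rate(observed: np.ndarray, pi_lower: np.ndarray, pi_upper: np.ndarray) -> float:
    """Proportion of holdout observations falling within the estimated PI."""
    return np.mean((observed >= pi_lower) & (observed <= pi_upper))

# Hypothetical holdout values (feet below land surface) and imputed estimates
obs = np.array([12.3, 8.7, 15.1, 10.4])
imp = np.array([11.9, 9.2, 14.6, 10.9])
print(nrmse(obs, imp))
print(coverage_rate(obs, imp - 1.5, imp + 1.5))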
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here are a few use cases for this project:
Automated UI Testing: "Reorganized" can be employed to perform automated UI testing for web and mobile applications, helping developers quickly identify and verify the presence of specific UI elements such as buttons, fields, and images to ensure that each component is functioning properly and meets design specifications.
Accessibility Enhancement: Utilizing "Reorganized" can help in improving the accessibility of websites and applications by automatically identifying and labeling different GUI elements, enabling screen reader software to provide more accurate and detailed information for visually impaired users.
UI Design Evaluation: "Reorganized" can assist in analyzing and comparing UI designs of different applications to evaluate consistency, user experience and adherence to design principles. By identifying specific elements, it can provide insights to designers on which areas need improvement or adjustments.
Content Curation and Classification: The computer vision model can be used to analyze and sort through large collections of web pages or applications to categorize and curate content based on the presence of specific GUI elements like text, images, buttons, etc. This can be helpful in creating repositories, educational material, or designing targeted advertisements.
Website Migration and Conversion: Using "Reorganized" can significantly speed up the process of migrating or converting websites, especially when transitioning from one content management system to another. By identifying and extracting GUI elements, it becomes easier to map these elements to a new system and ensure a seamless transfer.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.
Dataset Content: This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. A context paragraph is given for each question, from which the answer is to be extracted. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity: To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats: To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details: This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy: The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Filipino version is grammatically accurate, without spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License: The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Success.ai is at the forefront of delivering precise consumer behavior insights that empower businesses to understand and anticipate customer needs more effectively. Our extensive datasets provide a deep dive into the nuances of consumer actions, preferences, and trends, enabling businesses to tailor their strategies for maximum engagement and conversion.
Explore the Multifaceted Dimensions of Consumer Behavior:
Why Choose Success.ai for Consumer Behavior Data?
Strategic Applications of Consumer Behavior Data for Business Growth:
Empower Your Business with Actionable Consumer Insights from Success.ai
Success.ai provides not just data, but a gateway to transformative business strategies. Our comprehensive consumer behavior insights allow you to make informed decisions, personalize customer interactions, and ultimately drive higher engagement and sales.
Get in touch with us today to discover how our Consumer Behavior Intent Data can revolutionize your business strategies and help you achieve your market potential.
Contact Success.ai now and start transforming data into growth. Let us show you how our unmatched data solutions can be the cornerstone of your business success.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Estimates of the sensitivity and specificity for new diagnostic tests based on evaluation against a known gold standard are imprecise when the accuracy of the gold standard is imperfect. Bayesian latent class models (LCMs) can be helpful under these circumstances, but the necessary analysis requires expertise in computational programming. Here, we describe open-access web-based applications that allow non-experts to apply Bayesian LCMs to their own data sets via a user-friendly interface.
Methods/Principal Findings: Applications for Bayesian LCMs were constructed on a web server using the R and WinBUGS programs. The models provided (http://mice.tropmedres.ac) include two Bayesian LCMs: the two-tests in two-populations model (Hui and Walter model) and the three-tests in one-population model (Walter and Irwig model). Both models are available with simplified and advanced interfaces. In the former, all settings for Bayesian statistics are fixed as defaults. Users input their data set into a table provided on the webpage. Disease prevalence and the accuracy of the diagnostic tests are then estimated using the Bayesian LCM and provided on the web page within a few minutes. With the advanced interfaces, experienced researchers can modify all settings in the models as needed. These settings include correlation among diagnostic test results and prior distributions for all unknown parameters. The web pages provide worked examples with both models using the original data sets presented by Hui and Walter in 1980, and by Walter and Irwig in 1988. We also illustrate the utility of the advanced interface using the Walter and Irwig model on a data set from a recent melioidosis study. The results obtained from the web-based applications were comparable to those published previously.
Conclusions: The newly developed web-based applications are open-access and provide an important new resource for researchers worldwide to evaluate new diagnostic tests.
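To make the model structure concrete, here is a minimal sketch of the two-tests in two-populations (Hui and Walter) latent class model under conditional independence, written in PyMC rather than the WinBUGS backend the applications use; the cross-classified counts are invented for illustration:

import numpy as np
import pymc as pm

# Hypothetical counts per population: [T1+/T2+, T1+/T2-, T1-/T2+, T1-/T2-]
counts = np.array([[49, 10, 8, 33],
                   [25, 14, 9, 52]])

with pm.Model():
    prev = pm.Beta("prev", 1.0, 1.0, shape=2)  # prevalence in each population
    se = pm.Beta("se", 1.0, 1.0, shape=2)      # sensitivity of each test
    sp = pm.Beta("sp", 1.0, 1.0, shape=2)      # specificity of each test

    for k in range(2):
        p = prev[k]
        # Cell probabilities, assuming the tests are independent given true status
        cells = pm.math.stack([
            p * se[0] * se[1] + (1 - p) * (1 - sp[0]) * (1 - sp[1]),
            p * se[0] * (1 - se[1]) + (1 - p) * (1 - sp[0]) * sp[1],
            p * (1 - se[0]) * se[1] + (1 - p) * sp[0] * (1 - sp[1]),
            p * (1 - se[0]) * (1 - se[1]) + (1 - p) * sp[0] * sp[1],
        ])
        pm.Multinomial(f"y{k}", n=counts[k].sum(), p=cells, observed=counts[k])

    idata = pm.sample()

The advanced interfaces additionally allow correlation terms between tests and custom priors, which this independence sketch omits.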
Introducing Job Posting Datasets: Uncover labor market insights!
Elevate your recruitment strategies, forecast future labor industry trends, and unearth investment opportunities with Job Posting Datasets.
Job Posting Datasets Source:
Indeed: Access datasets from Indeed, a leading employment website known for its comprehensive job listings.
Glassdoor: Receive ready-to-use employee reviews, salary ranges, and job openings from Glassdoor.
StackShare: Access StackShare datasets to make data-driven technology decisions.
Job Posting Datasets provide meticulously acquired and parsed data, freeing you to focus on analysis. You'll receive clean, structured, ready-to-use job posting data, including job titles, company names, seniority levels, industries, locations, salaries, and employment types.
Choose your preferred dataset delivery options for convenience:
Receive datasets in various formats, including CSV, JSON, and more. Opt for storage solutions such as AWS S3, Google Cloud Storage, and more. Customize data delivery frequencies, whether one-time or per your agreed schedule.
Why Choose Oxylabs Job Posting Datasets:
Fresh and accurate data: Access clean and structured job posting datasets collected by our seasoned web scraping professionals, enabling you to dive into analysis.
Time and resource savings: Focus on data analysis and your core business objectives while we efficiently handle the data extraction process cost-effectively.
Customized solutions: Tailor our approach to your business needs, ensuring your goals are met.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is a founding member of the Ethical Web Data Collection Initiative, aligning with GDPR and CCPA best practices.
Pricing Options:
Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Effortlessly access fresh job posting data with Oxylabs Job Posting Datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Open Poetry Vision dataset is a synthetic dataset created by Roboflow for OCR tasks.
It combines a random image from the Open Images Dataset with text primarily sampled from Gwern's GPT-2 Poetry project. Each image in the dataset contains between 1 and 5 strings in a variety of fonts and colors randomly positioned in the 512x512 canvas. The classes correspond to the font of the text.
Example Image:
https://i.imgur.com/sZT516a.png
A common OCR workflow is to use a neural network to isolate text for input into traditional optical character recognition software. This dataset could make a good starting point for an OCR project like business card parsing or automated paper form-processing.
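As an illustration of that two-stage workflow, a minimal sketch that crops detector-predicted text boxes and passes each crop to Tesseract via pytesseract; the image filename and box coordinates are hypothetical stand-ins for a detection model's output:

from PIL import Image
import pytesseract

image = Image.open("example.png")

# Hypothetical detector output: (left, top, right, bottom) boxes around text
boxes = [(40, 60, 300, 110), (50, 200, 480, 260)]

for i, box in enumerate(boxes):
    crop = image.crop(box)                     # isolate one text region
    text = pytesseract.image_to_string(crop)   # run traditional OCR on the crop
    print(f"box {i}: {text.strip()}")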
Alternatively, you could try your hand using this as a neural font identification dataset. Nvidia, amongst others, have had success with this task.
Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.
Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
Developers reduce 50% of their code when using Roboflow's workflow, automate annotation quality assurance, save training time, and increase model reproducibility.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset includes news articles collected from Al Jazeera through web scraping. The scraping code was developed in November/December 2022, and it may need updates to accommodate changes in the website's structure since then. Users are advised to review and adapt the scraping code according to the current structure of the Al Jazeera website.
Please note that changes in website structure may impact the classes and elements used for scraping. The provided code is a starting point and may require adjustments to ensure accurate data extraction.
If you wish to scrape news articles from different categories beyond Science & Technology, Economics, or Sports, or if you need a more extensive dataset, the scraping code is available in the repository. Visit the repository to access the code and instructions for scraping additional categories or obtaining a larger dataset.
Repository Link: https://github.com/uma-oo/Aljazeera-Scraper
Dataset Structure:
- Category
- Title
- Text of the Article
The dataset is structured with three main columns: Category, Title, and Text of the Article. Explore and utilize the dataset for various analytical and natural language processing tasks.
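A minimal sketch of loading the dataset for analysis; the CSV filename is a placeholder, so adjust it to the actual file in the release:

import pandas as pd

# Placeholder filename; substitute the actual dataset file
df = pd.read_csv("aljazeera_articles.csv")

# The three columns described above
print(df[["Category", "Title", "Text of the Article"]].head())
print(df["Category"].value_counts())  # e.g., Science & Technology, Economics, Sports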
Feel free to explore and contribute to the code for your specific scraping needs!
USGS is assessing the feasibility of map projections and grid systems for lunar surface operations. We propose developing a new Lunar Transverse Mercator (LTM) system, a Lunar Polar Stereographic (LPS) system, and a Lunar Grid Reference System (LGRS). We have also designed additional grids to NASA requirements for astronaut navigation, referred to as LGRS in Artemis Condensed Coordinates (ACC). This data release includes LGRS grids finer than 25 km (1 km, 100 m, and 10 m) in ACC format. LTM, LPS, and LGRS grids are not released here but may be accessed at https://doi.org/10.5066/P13YPWQD. LTM, LPS, and LGRS are similar in design and use to the Universal Transverse Mercator (UTM), Universal Polar Stereographic (UPS), and Military Grid Reference System (MGRS), but adhere to NASA requirements. The LGRS ACC format is similar in design and structure to the historic Army Map Service Apollo orthotopophoto charts for navigation.
The Lunar Transverse Mercator (LTM) projection system is a globalized set of lunar map projections that divides the Moon into zones to provide a uniform coordinate system for accurate spatial representation. It uses a transverse Mercator projection, which maps the Moon into 45 transverse Mercator strips, each 8° of longitude wide. These strips are subdivided at the lunar equator for a total of 90 zones: forty-five in the northern hemisphere and forty-five in the south. LTM specifies a topocentric, rectangular coordinate system (easting and northing coordinates) for spatial referencing. This projection is commonly used in GIS and surveying for its ability to represent large areas with high positional accuracy while maintaining consistent scale.
The Lunar Polar Stereographic (LPS) projection system contains projection specifications for the Moon's polar regions. It uses a polar stereographic projection, which maps the polar regions onto an azimuthal plane. The LPS system contains 2 zones, one at each pole, referred to as the LPS northern and LPS southern zones. LPS, like its equatorial counterpart LTM, specifies a topocentric, rectangular coordinate system (easting and northing coordinates) for spatial referencing. This projection is commonly used in GIS and surveying for its ability to represent large polar areas with high positional accuracy while maintaining consistent scale across the map region.
LGRS is a globalized grid system for lunar navigation supported by the LTM and LPS projections. LGRS provides an alphanumeric grid coordinate structure for both the LTM and LPS systems. This labeling structure is utilized similarly to MGRS. LGRS defines a global area grid based on latitude and longitude and a 25×25 km grid based on LTM and LPS coordinate values. Two implementations of LGRS are used, as polar areas require an LPS projection and equatorial areas a transverse Mercator; we describe the differences in the techniques and methods in this data release. See McClernan et al. (in press) for more information.
ACC is a method of simplifying LGRS coordinates and is similar in use to the Army Map Service Apollo orthotopophoto charts for navigation. These grids are designed to condense a full LGRS coordinate to a relative coordinate of 6 characters in length. LGRS in ACC format is implemented by imposing a 1 km grid within the LGRS 25 km grid, then truncating the grid precision to 10 m.
To meet the character limit, a coordinate is reported as a value relative to the lower-left corner of the 25 km LGRS zone without the zone information; however, zone information can be reported. As implemented, any 25×25 km area on the lunar surface will have a unique set of ACC coordinates for reporting locations. The shapefiles provided in this data release are projected in the LTM or LPS PCRSs and must utilize these projections to be dimensioned correctly.
LGRS ACC grid files and resolutions (each site includes 1 km, 100 m, and 10 m grid shapefiles):
LGRS ACC grids in the LPS portion: Amundsen_Rim, Nobile_Rim_2, Haworth, Faustini_Rim_A, de_Gerlache_Rim_2, Connecting_Ridge_Extension, Connecting_Ridge, Nobile_Rim_1, Peak_Near_Shackleton, de_Gerlache_Rim, Leibnitz_Beta_Plateau, Malapert_Massif, de_Gerlache-Kocher_Massif
LGRS ACC grids in the LTM portion: Apollo_11, Apollo_12, Apollo_14, Apollo_15, Apollo_16, Apollo_17
LTM, LPS, and LGRS PCRS shapefiles utilize either a custom transverse Mercator or polar stereographic projection. For PCRS grids, the LTM and LPS projections are recommended for all LTM, LPS, and LGRS grid sizes; see McClernan et al. (in press) for these projections. GIS utilization of grid shapefiles projected in lunar latitude and longitude should use a registered lunar geographic coordinate system (GCS) such as IAU_2015:30100 or ESRI:104903; this only applies to grids that cross multiple LTM zones. Note: all shapefiles require a specific projection and datum. The recommended projection is LTM or LPS or, when needed, IAU_2015:30100 or ESRI:104903. The datum utilized must be the Jet Propulsion Laboratory (JPL) Development Ephemeris (DE) 421 in the Mean Earth (ME) Principal Axis Orientation, as recommended by the International Astronomical Union (IAU) (Archinal et al., 2008).
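For GIS users, a minimal sketch of defining a lunar transverse Mercator zone with pyproj PROJ strings. The central meridian, scale factor, false easting, and the spherical radius of 1,737,400 m are illustrative assumptions only; use the LTM/LPS definitions in McClernan et al. (in press) for production work:

from pyproj import CRS, Transformer

# Lunar geographic coordinates on a sphere of radius 1,737,400 m (illustrative)
lunar_gcs = CRS.from_proj4("+proj=longlat +R=1737400 +no_defs")

# One hypothetical 8-degree-wide LTM zone centered on longitude 4 E;
# k and x_0 borrow UTM conventions and are assumptions, not LTM spec values.
ltm_zone = CRS.from_proj4("+proj=tmerc +lat_0=0 +lon_0=4 +k=0.9996 "
                          "+x_0=500000 +y_0=0 +R=1737400 +units=m +no_defs")

to_ltm = Transformer.from_crs(lunar_gcs, ltm_zone, always_xy=True)
easting, northing = to_ltm.transform(4.25, -10.0)  # lon, lat in degrees
print(easting, northing)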