Psychological scientists increasingly study web data, such as user ratings or social media postings. However, whether research relying on such web data leads to the same conclusions as research based on traditional data is largely unknown. To test this, we (re)analyzed three datasets, thereby comparing web data with lab and online survey data. We calculated correlations across these different datasets (Study 1) and investigated identical, illustrative research questions in each dataset (Studies 2 to 4). Our results suggest that web and traditional data are not fundamentally different and usually lead to similar conclusions, but also that it is important to consider differences between data types such as populations and research settings. Web data can be a valuable tool for psychologists when accounting for such differences, as it allows for testing established research findings in new contexts, complementing them with insights from novel data sources.
This dataset contains the 30 questions that were posed to the chatbots (i) ChatGPT-3.5; (ii) ChatGPT-4; and (iii) Google Bard, in May 2023 for the study “Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard”. These 30 questions describe mathematics and logic problems that have a unique correct answer. The questions are fully described with plain text only, without the need for any images or special formatting. The questions are divided into two sets of 15 questions each (Set A and Set B). The questions of Set A are 15 “Original” problems that cannot be found online, at least in their exact wording, while Set B contains 15 “Published” problems that one can find online by searching on the internet, usually with their solution. Each question is posed three times to each chatbot.
This dataset contains the following: (i) the full set of the 30 questions, A01-A15 and B01-B15; (ii) the correct answer for each of them; (iii) an explanation of the solution, for the problems where such an explanation is needed; and (iv) the 30 (questions) × 3 (chatbots) × 3 (answers) = 270 detailed answers of the chatbots. For the published problems of Set B, we also provide a reference to the source from which each problem was taken.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Measuring the quality of Question Answering (QA) systems is a crucial task for validating the results of novel approaches. However, there are already indicators of a reproducibility crisis, as many published systems have used outdated datasets or subsets of QA benchmarks, making it hard to compare results. We identified the following core problems: there is no standard data format; instead, the different, partly inconsistent datasets use proprietary data representations. Additionally, the characteristics of the datasets are typically not documented by the dataset maintainers or the system publishers. To overcome these problems, we established an ontology, the Question Answering Dataset Ontology (QADO), for representing QA datasets in RDF. The following datasets were mapped into the ontology: the QALD series, the LC-QuAD series, the RuBQ series, ComplexWebQuestions, and Mintaka. Hence, the integrated data in QADO covers widely used datasets and multilinguality. Additionally, we performed intensive analyses of the datasets to identify their characteristics, making it easier for researchers to identify specific research questions and to select well-defined subsets. The provided resource will enable the research community to improve the quality of their research and support the reproducibility of experiments.
Here, the mapping results of the QADO process, the SPARQL queries for data analytics, and the archived analytics results file are provided.
Up-to-date statistics can be created automatically by the script provided at the corresponding QADO GitHub RDFizer repository.
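As a rough illustration of how such an RDF mapping can be queried locally with Python, the sketch below uses rdflib. The file name, namespace IRI, and property names are assumptions made for illustration only; the actual QADO vocabulary and ready-made SPARQL queries are available in the QADO GitHub repositories.

```python
# Illustrative only: query a local copy of the QADO RDF mapping with rdflib.
# The file name and the qado: IRIs below are placeholders, not the real schema.
from rdflib import Graph

g = Graph()
g.parse("qado_mappings.ttl", format="turtle")  # hypothetical local dump

# Count questions per source dataset (property names are placeholders).
query = """
PREFIX qado: <https://example.org/qado#>
SELECT ?dataset (COUNT(?question) AS ?n)
WHERE {
  ?question a qado:Question ;
            qado:isElementOf ?dataset .
}
GROUP BY ?dataset
ORDER BY DESC(?n)
"""
for dataset, n in g.query(query):
    print(dataset, n)
```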
Conducted an in-depth analysis comparing:
Trends Derived from Tags: Extracted and analyzed tags from the Stack Exchange API to identify programming language trends.
Annual User Survey Data: Examined data from Stack Overflow's annual user survey to understand user preferences and technology adoption.
By comparing these two data sources, I validated trends and patterns, offering a comprehensive understanding of the current programming language and technology landscape.
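For example, the tag-based trend extraction can be approximated with a short request to the public Stack Exchange API (v2.3 /tags endpoint); the sketch below omits pagination, throttling, and an API key.

```python
# Minimal sketch: rank Stack Overflow tags by question count via the
# Stack Exchange API. A real trend analysis would page through all tags
# and track counts over time.
import requests

resp = requests.get(
    "https://api.stackexchange.com/2.3/tags",
    params={"order": "desc", "sort": "popular",
            "site": "stackoverflow", "pagesize": 20},
    timeout=30,
)
resp.raise_for_status()

for tag in resp.json().get("items", []):
    print(f"{tag['name']:<20} {tag['count']:>12,}")
```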
This data package includes information about the List and Rating of Home Health Care Agencies, Health Care for Patient Survey data, and State data for several home health agency quality measures, as well as State Averages for Home Health Agency (HHA) Quality Measures. It also provides datasets on Hospice General Information, Provider data, and CASPER or ASPEN Information about hospice agencies.
https://creativecommons.org/publicdomain/zero/1.0/
By openai (From Huggingface) [source]
This dataset contains comparisons between WebGPT models and OpenAI models, along with various metrics used to evaluate their performance. The dataset includes several columns such as 'question' which represents the question asked in the comparison, 'quotes_0' and 'quotes_1' which correspond to the quotes or statements from WebGPT model and OpenAI model respectively. The answers provided by both models are recorded in the columns 'answer_0' and 'answer_1'. Additionally, there are columns indicating the number of tokens used by each model ('tokens_0' and 'tokens_1'), as well as the score or confidence level of their respective answers ('score_0' and 'score_1').
The purpose of this dataset is to provide training data for comparing different versions of WebGPT models with OpenAI models. By capturing various aspects such as question formulation, generated answers, token usage, and confidence scores, this dataset aims to enable a comprehensive analysis of the performance and capabilities of these models.
Overall, this dataset offers researchers an opportunity to explore the similarities and differences between WebGPT models and OpenAI models based on real-world comparisons. It can serve as a valuable resource for training machine learning algorithms, conducting comparative analyses, understanding model behavior, or developing new techniques in natural language processing.
Overview
The dataset consists of several columns that contain valuable information for each comparison. Here is an overview of the columns present in this dataset:
- question: The question asked in the comparison.
- quotes_0: The quotes or statements from the WebGPT model.
- answer_0: The answer provided by the WebGPT model.
- tokens_0: The number of tokens used by the WebGPT model to generate the answer.
- score_0: The score or confidence level of the answer provided by the WebGPT model.
- quotes_1: The quotes or statements from the OpenAI model.
- answer_1: The answer provided by the OpenAI model.
- tokens_1: The number of tokens used by the OpenAI model to generate the answer.
- score_1: The score or confidence level of the answer provided by the OpenAI model.

Dataset Usage
This dataset can be utilized in various ways for research, analysis, and improvement purposes related to comparing the performance of different models.
Here are a few examples:
1) Model Comparison:
You can compare and analyze how well both models (WebGPT and OpenAI) perform on specific questions based on their answers, scores/confidence levels, token usage, and supporting quotes/statements.
2) Metric Evaluation:
By examining both scores/confidence levels (score_0 & score_1), you can evaluate which model tends to provide more reliable answers overall.
3) Token Efficiency:
By analyzing token usage (tokens_0 & tokens_1), you can gain insights into which model is more efficient at generating answers within token limits (see the sketch after this list).
4) Model Improvements:
The dataset can be used to identify areas of improvement for both the WebGPT and OpenAI models. By analyzing the answers, quotes, and scores, you may discover patterns or common pitfalls that can guide future model enhancements.
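As a rough illustration of points 1) to 3) above, here is a minimal pandas sketch. It assumes the comparisons have been exported to a flat CSV with the columns listed in the Overview; the file name is hypothetical, and the exact layout of the original Hugging Face release may differ.

```python
# Sketch: compare recorded scores and token usage of the two models.
import pandas as pd

df = pd.read_csv("webgpt_comparisons.csv")  # hypothetical flat export

# 1) / 2) Which model's answer received the higher score, per comparison?
webgpt_preferred = (df["score_0"] > df["score_1"]).mean()
print(f"WebGPT answer scored higher in {webgpt_preferred:.1%} of comparisons")

# 3) Token efficiency: average number of tokens used per answer.
print(df[["tokens_0", "tokens_1"]].mean().rename(
    {"tokens_0": "WebGPT avg tokens", "tokens_1": "OpenAI avg tokens"}))
```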
Conclusion
This dataset provides a valuable resource for comparing WebGPT and OpenAI models. With the information provided in each column, researchers can perform a wide range of analyses to better understand the strengths and weaknesses of each model, whether the goal is model evaluation, feature engineering, or bias analysis:
- Model Evaluation: This dataset can be used to compare the performance of different models, specifically WebGPT models and OpenAI models. The scores, quotes, answers, and token counts provided by each model can be analyzed to determine which model performs better for a given task.
- Feature Engineering: The dataset can be used to extract relevant features that indicate the quality or accuracy of an answer generated by a model. These features can then be used in building machine learning models to improve the performance of question answering systems.
- Bias Analysis: By analyzing the quotes and answers provided by WebGPT and OpenAI models, this dataset can help identify any biases or patterns in their responses. This analysis can provide insights into potential biases present in AI-generated content and inform efforts towards making AI systems more fair and unbiased.
The dataset allows replication of the results of the following article: Höhne, J. K., Gavras, K., & Claassen, J. (2024; accepted). Typing or Speaking? Comparing text and voice answers to open questions on sensitive topics in smartphone surveys. Social Science Computer Review.
Millennium Challenge Corporation hired Mathematica Policy Research to conduct an independent evaluation of the BRIGHT II program. The three main research questions of interest are:
- What was the impact of the program on school enrollment, attendance, and retention?
- What was the impact of the program on test scores?
- Are the impacts different for girls than for boys?
Mathematica will compare data collected from the 132 communities served by BRIGHT II (the "treatment group") with that collected from the 161 communities that applied but were not selected for the program (the "comparison group"). Using a statistical technique called regression discontinuity, Mathematica will compare the outcomes of the treatment villages just above the cutoff point to the outcomes of the comparison villages just below the cutoff point. If the intervention had an impact, we will observe a "jump" in outcomes at the point of discontinuity.
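To make the regression discontinuity idea concrete, the following illustrative Python sketch estimates the jump at the cutoff on simulated village data; it is not Mathematica's actual specification or data.

```python
# Illustrative local-linear regression discontinuity on simulated data:
# the coefficient on `treated` estimates the jump in outcomes at the cutoff.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
score = rng.uniform(-10, 10, 500)              # eligibility score, cutoff at 0
treated = (score >= 0).astype(int)             # villages above the cutoff
enrollment = 0.5 + 0.01 * score + 0.12 * treated + rng.normal(0, 0.1, 500)
df = pd.DataFrame({"score": score, "treated": treated, "enrollment": enrollment})

# Separate slopes on each side of the cutoff; `treated` captures the discontinuity.
model = smf.ols("enrollment ~ treated + score + treated:score", data=df).fit()
print(model.params["treated"])
print(model.conf_int().loc["treated"])
```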
Mathematica will perform additional analyses to estimate the overall merit of the BRIGHT investment. By conducting a cost-benefit analysis and a cost-effectiveness analysis and calculating the economic rate of return, Mathematica will be able to answer questions related to the sustainability of the program, and compare the program to interventions and social investments in other sectors. The household survey is designed to capture household-level data rather than community-level data; however, questions have been included to measure head-of-household expectations of educational attainment. These questions ask the head of household what grade level he hopes each child will attain and what grade level he thinks the child will be capable of achieving in reality.
132 rural villages throughout the 10 provinces of Burkina Faso in which girls' enrollment rates were lowest
Households
Households, students, and educators in the 287 villages surveyed
Sample survey data [ssd]
The BRIGHT II program was implemented in the same 132 villages that received the BRIGHT I interventions. These 132 villages were originally selected using a scoring process, with eligibility scores based on the villages’ potential to improve girls’ educational outcomes. A total of 293 villages applied to receive a BRIGHT school; the Burkina Faso Ministry of Basic Education (MEBA) selected the 132 villages with scores that were above a certain cutoff point. Whenever possible, the survey will be conducted with the same children in the same households and schools surveyed during the BRIGHT I evaluation. By visiting the same households and schools, the evaluator will be able to better assess the longer-term impacts of the BRIGHT project.
Mathematica has developed two surveys, a household survey and a school survey, to collect relevant data from villages in both the treatment and comparison groups. The household survey was administered to a new cross-section of households compared to the BRIGHT I evaluation. Data will be collected on the attendance and educational attainment of school-age children in the household, attitudes towards girls' education, and parental assessment of the extent to which the complementary interventions influenced school enrollment decisions. It will also assess the performance of all household children on basic tests of French and math. The school survey, to be administered to all local schools in the 293 villages, gathers data on school characteristics, personnel, and physical structure, and collects enrollment and attendance records. Data will be gathered by a local data collection firm selected by MCA-Burkina Faso, with Mathematica providing technical assistance and oversight.
Following data collection, Mathematica will work with BERD to ensure that the data are correctly entered and are complete and clean. This will include a review of all frequencies for out-of-range responses, missing data, or other problems, as well as a comparison between the data and paper copies for a random selection of variables.
https://spdx.org/licenses/CC0-1.0.html
Time series are a critical component of ecological analysis, used to track changes in biotic and abiotic variables. Information can be extracted from the properties of time series for tasks such as classification (e.g. assigning species to individual bird calls); clustering (e.g. clustering similar responses in population dynamics to abrupt changes in the environment or management interventions); prediction (e.g. accuracy of model predictions to original time series data); and anomaly detection (e.g. detecting possible catastrophic events from population time series). These common tasks in ecological research rely on the notion of (dis-)similarity, which can be determined using distance measures. A plethora of distance measures have been described, predominantly in the computer and information sciences, but many have not been introduced to ecologists. Furthermore, little is known about how to select appropriate distance measures for time-series-related tasks. Therefore, many potential applications remain unexplored. Here we describe 16 properties of distance measures that are likely to be of importance to a variety of ecological questions involving time series. We then test 42 distance measures for each property and use the results to develop an objective method to select appropriate distance measures for any task and ecological dataset. We demonstrate our selection method by applying it to a set of real-world data on breeding bird populations in the UK and discuss other potential applications for distance measures, along with associated technical issues common in ecology. Our real-world population trends exhibit a common challenge for time series comparisons: a high level of stochasticity. We demonstrate two different ways of overcoming this challenge, first by selecting distance measures with properties that make them well-suited to comparing noisy time series, and second by applying a smoothing algorithm before selecting appropriate distance measures. In both cases, the distance measures chosen through our selection method are not only fit-for-purpose but are consistent in their rankings of the population trends. The results of our study should lead to an improved understanding of, and greater scope for, the use of distance measures for comparing ecological time series, and help us answer new ecological questions.

Methods

Distance measure test results were produced using R and can be replicated using scripts available on GitHub at https://github.com/shawndove/Trend_compare. Detailed information on wading bird trends can be found in Jellesmark et al. (2021): Jellesmark, S., Ausden, M., Blackburn, T. M., Gregory, R. D., Hoffmann, M., Massimino, D., McRae, L., & Visconti, P. (2021). A counterfactual approach to measure the impact of wet grassland conservation on U.K. breeding bird populations. Conservation Biology, 35(5), 1575–1585. https://doi.org/10.1111/cobi.13692
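As a toy illustration of what a distance measure between two population time series looks like (separate from the R scripts referenced above), the Python sketch below computes two common measures, Euclidean distance and dynamic time warping; the example trends are invented.

```python
# Two simple distance measures between population time series.
import numpy as np

def euclidean(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dtw(a, b):
    """Classic O(n*m) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

trend_a = [1.00, 1.20, 1.50, 1.40, 1.60]   # toy population indices
trend_b = [1.00, 1.10, 1.20, 1.60, 1.50]
print(euclidean(trend_a, trend_b), dtw(trend_a, trend_b))
```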
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
In this dataset, a variety of questions spanning different subjects and mediums are presented, and a comparison is made between the actual marks obtained by human respondents and the marks gained by the ChatGPT model. The dataset encompasses questions related to logical equivalences, programming concepts, and applications of various logical laws.
Each entry in the dataset includes the following information:
- Questions: The text of the questions asked.
- Subject: The subject of the question (e.g., Data Structures).
- Medium: The type of assessment (e.g., Exam, Quiz, Assignment).
- Max Marks: The maximum possible marks for the question.
- Marks Obtained: The actual marks obtained by human respondents.
- Marks Obtained ChatGPT: The marks gained by the ChatGPT model.
The dataset aims to provide insights into the performance of both human respondents and the ChatGPT model across different question types and assessment scenarios. It serves as a resource for evaluating the effectiveness of the model in predicting human-level performance on various question-based assessments, helping to understand the alignment between human reasoning and the model's responses.
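For example, the human-versus-ChatGPT comparison described above can be summarized with a short pandas sketch; the file name is hypothetical, and the column names follow the list above.

```python
# Sketch: compare human and ChatGPT marks, normalized by the maximum marks.
import pandas as pd

df = pd.read_csv("chatgpt_vs_human_marks.csv")  # hypothetical file name

df["human_pct"] = df["Marks Obtained"] / df["Max Marks"]
df["chatgpt_pct"] = df["Marks Obtained ChatGPT"] / df["Max Marks"]

# Average performance by subject and by assessment medium.
print(df.groupby("Subject")[["human_pct", "chatgpt_pct"]].mean())
print(df.groupby("Medium")[["human_pct", "chatgpt_pct"]].mean())
```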
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CSVvsVizQA
Dataset to compare question answering ability from CSV Data vs Data Visualization images.
https://creativecommons.org/publicdomain/zero/1.0/
ELI5 means "Explain Like I'm 5". It is originally a long, free-form question-answering dataset scraped from the Reddit ELI5 subforum. The original ELI5 dataset (https://github.com/facebookresearch/ELI5) can be used to train a model for long, free-form question answering, e.g. with encoder-decoder models such as T5 or BART.
Once we have such a model, how can we estimate its performance, i.e. its ability to give high-quality answers? The conventional approach is the ROUGE family of metrics (see the ELI5 paper linked above).
However, ROUGE scores are based on n-gram overlap and require comparing a generated answer to a ground-truth answer. Unfortunately, n-gram scoring cannot properly evaluate high-quality paraphrased answers.
Worse, needing a ground-truth answer to compare against and compute a (ROUGE) score goes against the spirit of free-form question answering, where there are many possible (non-paraphrase) valid and good answers.
To summarize, creative, high-quality answers cannot be evaluated with ROUGE, which prevents us from building (and evaluating) creative models.
This dataset, in contrast, is aimed at training a scoring (regression) model that can predict an upvote score for each Q-A pair individually (not an A-A pair as with ROUGE).
The data is simply a CSV file containing Q-A pairs and their scores. Each line contains the Q-A text (in RoBERTa format) and its upvote score (a non-negative integer).
It is intended to make it easy and direct to create a scoring model with RoBERTa (or other Transformer models, by changing the separator token).
The CSV file has a qa column and an answer_score column.
Each row in qa is written in the RoBERTa paired-sentences format, i.e. the question followed by the answer, separated by the model's separator token.
For answer_score, the following principles apply:
- High quality answer related to its question should get high score (upvotes)
- Low quality answer related to its question should get low score
- Well written answer NOT related to its question should get 0 score
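A minimal sketch of how such a scoring model could be set up, assuming the CSV layout described above (the file name is hypothetical); it only shows loading the data and running an untrained RoBERTa regression head, not the full training loop.

```python
# Sketch: tokenize the paired-sentence `qa` texts and run a RoBERTa
# regression head (num_labels=1) that would be fine-tuned to predict
# the `answer_score` upvote values.
import pandas as pd
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

df = pd.read_csv("eli5_qa_scores.csv")  # hypothetical file; columns: qa, answer_score

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression")

enc = tokenizer(df["qa"].tolist()[:8], truncation=True, padding=True,
                max_length=256, return_tensors="pt")
out = model(**enc)
print(out.logits.shape)  # (batch, 1): predicted scores, meaningless before training
```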
Each positive Q-A pair comes from the original ELI5 dataset (true upvote score). Each 0-score Q-A pair is constructed as described below.
The principle is contrastive training. We need somewhat high-quality 0-score pairs for the model to generalize; too-easy 0-score pairs (e.g. a question paired with a random answer) would teach the model nothing.
Therefore, for each question, we try to construct two answers (two 0-score pairs) where each answer is related to the topic of the question but does not answer it.
This can be achieved by vectorizing all questions with RetriBERT and storing the vectors in FAISS. We can then measure the distance between two question vectors using cosine distance.
More precisely, for a question Q1, we choose the answers of two related (but non-identical) questions Q2 and Q3, i.e. answers A2 and A3, to construct the 0-score pairs Q1-A2 and Q1-A3. Combined with the positive-score pair Q1-A1, this gives 3 pairs for Q1, and 3 pairs for each question in total. Therefore, from the 272,000 examples of the original ELI5, this dataset contains three times as many: 816,000 examples.
Note that two question vectors that are very close may represent the same (paraphrased) question, while two questions that are very far apart are on totally different topics. Therefore, we need a threshold that selects not-too-close and not-too-far question pairs, so that we get non-identical but same-topic questions. In a simple experiment, a cosine distance of 10-11 between RetriBERT vectors seemed to work well, so we use this range as the threshold for constructing 0-score Q-A pairs.
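The 0-score pair construction could be sketched roughly as follows. This is an illustration, not the author's exact pipeline: it uses a FAISS L2 index and a generic distance band, whereas the description above uses cosine distance on RetriBERT vectors with a threshold of about 10-11.

```python
# Sketch: for each question, pick answers of near-but-not-identical questions
# (inside a distance band) to form 0-score pairs.
import numpy as np
import faiss

def build_negative_pairs(question_vecs, answers, lo=10.0, hi=11.0, k=20):
    """question_vecs: (n, d) float32 array; answers: list of answer strings."""
    index = faiss.IndexFlatL2(question_vecs.shape[1])
    index.add(question_vecs)
    dists, idxs = index.search(question_vecs, k)
    negatives = []
    for qi in range(len(answers)):
        # up to two neighbours inside the band, excluding the question itself
        picks = [j for d, j in zip(dists[qi], idxs[qi])
                 if j != qi and lo <= d <= hi][:2]
        negatives.extend((qi, answers[j], 0) for j in picks)  # 0-score pairs
    return negatives
```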
A roberta-base baseline with an MAE of 3.91 on the validation set can be found here:
https://www.kaggle.com/ratthachat/eli5-scorer-roberta-base-500k-mae391
Thanks to the Facebook AI team for creating the original ELI5 dataset, and to the Huggingface NLP library for making this dataset easily accessible.
- https://github.com/facebookresearch/ELI5
- https://huggingface.co/nlp/viewer/
My project on ELI5 is mainly inspired by this amazing work of Yacine Jernite: https://yjernite.github.io/lfqa.html
These datasets contain 1.48 million question and answer pairs about products from Amazon.
Metadata includes
question and answer text
is the question binary (yes/no), and if so does it have a yes/no answer?
timestamps
product ID (to reference the review dataset)
Basic Statistics:
Questions: 1.48 million
Answers: 4,019,744
Labeled yes/no questions: 309,419
Number of unique products with questions: 191,185
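A hedged loading sketch: it assumes a per-category gzipped file in which each line is a Python dict literal with the fields listed above; the file name and the questionType field are assumptions, so adjust the parsing to match the actual distribution format.

```python
# Sketch: stream one category file and count binary (yes/no) questions.
import ast
import gzip

def records(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield ast.literal_eval(line)  # one dict per line (assumed format)

n_total = n_yesno = 0
for qa in records("qa_Electronics.json.gz"):     # hypothetical file name
    n_total += 1
    if qa.get("questionType") == "yes/no":       # assumed field name
        n_yesno += 1
print(n_total, n_yesno)
```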
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A comparison between MetaQA and recent table question answering approaches.
The MAR allows the District Government to more easily compare information across databases and agencies. Learn more about the MAR with these frequently asked questions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis Comparative Questions 2022 dataset contains about 31,000 questions labeled as comparative or not.
3,500 comparative questions are labeled on the token level with comparison objects, aspects, predicates, or none.
For 950 questions, text passages that potentially answer the questions are labeled with the stance: pro first comparison object, pro second, neutral, or no stance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Readability scores for ChatGPT-4o, Gemini, and Perplexity responses to the most frequently asked ankylosing spondylitis-related questions, and a statistical comparison of the text content to a 6th-grade reading level [median, 95% confidence interval (CI) (lower limit of CI to upper limit of CI)].
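For reference, readability grades of this kind can be computed with the textstat package; the snippet below is a generic illustration using the Flesch-Kincaid grade, not the scoring pipeline behind this dataset.

```python
# Check whether a chatbot response reads at roughly a 6th-grade level.
import textstat

response = (
    "Ankylosing spondylitis is a type of arthritis that mainly affects the "
    "spine. It causes pain and stiffness, especially in the lower back."
)
grade = textstat.flesch_kincaid_grade(response)
print(f"Flesch-Kincaid grade: {grade:.1f}",
      "(at or below 6th grade)" if grade <= 6 else "(above 6th grade)")
```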
This file has two sheets. Data are measurements by Citizen Science Air Monitors (CSAM) and Federal Monitors, which sampled particulate matter (PM), nitrogen dioxide (NO2), relative humidity (RH), and temperature (T). Variables for each sheet are described in more detail below.

The sheet “Snorkel No-Snorkel Comparison” includes data from two CSAM units, CSAM-2 and CSAM-3. CSAM-2 used a snorkel tube to sample outdoor air, and CSAM-3 did not. CSAM-2 and CSAM-3 were not in the same sampling location but did record contemporaneous measurements. These data were used to perform a snorkel versus no-snorkel comparison.

The sheet “CSAM-1 and Federal Monitor” includes data from a CSAM unit (CSAM-1) and a Federal Monitor (which is used for regulatory measurements of air pollution). CSAM-1 and the Federal Monitor were installed in the same sampling location and recorded contemporaneous measurements. For CSAM-1, the original recorded measurements are included, as well as measurements corrected (using regression equations) to better match the Federal Monitor values.

This dataset is associated with the following publication: Barzyk, T., H. Huang, R. Williams, A. Kaufman, and J. Essoka. Advice and Frequently Asked Questions (FAQs) for Citizen-Science Environmental Health Assessments. International Journal of Environmental Research and Public Health. Molecular Diversity Preservation International, Basel, SWITZERLAND, 15(5): 960, (2018).
This study investigates the effect of survey mode on respondent learning and fatigue during repeated choice experiments. Stated preference data are obtained from an experiment concerning high-speed Internet service conducted on samples of mail respondents and online respondents. We identify and estimate aspects of the error components for different subsets of the choice questions, for both mail and online respondents. Results show mail respondents answer questions consistently throughout a series of choice experiments, but the quality of the online respondents' answers declines. Therefore, while the online survey provides lower survey administration costs and reduced time between implementation and data analysis, such benefits come at the cost of less precise responses.
This data collection was undertaken to gather information on the extent of police officers' knowledge of search and seizure law, an issue with important consequences for law enforcement. A specially-produced videotape depicting line duty situations that uniformed police officers frequently encounter was viewed by 478 line uniformed police officers from 52 randomly-selected cities in which search and seizure laws were determined to be no more restrictive than applicable United States Supreme Court decisions. Testing of the police officers occurred in all regions as established by the Federal Bureau of Investigation, except for the Pacific region (California, Oregon, and Washington), since search and seizure laws in these states are, in some instances, more restrictive than United States Supreme Court decisions. No testing occurred in cities with populations under 10,000 because of budget limitations. Fourteen questions to which the officers responded were presented in the videotape. Each police officer also completed a questionnaire that included questions on demographics, training, and work experience, covering their age, sex, race, shift worked, years of police experience, education, training on search and seizure law, effectiveness of various types of training instructors and methods, how easily they could obtain advice about search and seizure questions they encountered, and court outcomes of search and seizure cases in which they were involved. Police department representatives completed a separate questionnaire providing department characteristics and information on search and seizure training and procedures, such as the number of sworn officers, existence of general training and the number of hours required, existence of in-service search and seizure training and the number of hours and testing required, existence of policies and procedures on search and seizure, and means of advice available to officers about search and seizure questions. These data comprise Part 1. For purposes of comparison and interpretation of the police officer test scores, question responses were also obtained from other sources. Part 2 contains responses from 36 judges from states with search and seizure laws no more restrictive than the United States Supreme Court decisions, as well as responses from a demographic and work-experience questionnaire inquiring about their age, law school attendance, general judicial experience, and judicial experience and education specific to search and seizure laws. All geographic regions except New England and the Pacific were represented by the judges. Part 3, Comparison Data, contains answers to the 14 test questions only, from 15 elected district attorneys, 6 assistant district attorneys, the district attorney in another city and 11 of his assistant district attorneys, a police attorney with expertise in search and seizure law, 24 police academy trainees with no previous police work experience who were tested before search and seizure law training, a second group of 17 police academy trainees -- some with police work experience but no search and seizure law training, 55 law enforcement officer trainees from a third academy tested immediately after search and seizure training, 7 technical college students with no previous education or training on search and seizure law, and 27 university criminal justice course students, also with no search and seizure law education or training.