These data are from a human study collected under IRB protocol ClinicalTrials.gov # NCT01874834; as such, it is a violation of Federal Law to publish them. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). The dataset contains information about human research subjects, and because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. This dataset is associated with the following publication: Stiegel, M., J. Pleil, J. Sobus, T. Stevens, and M. Madden. Linking physiological parameters to perturbations in the human exposome: Environmental exposures modify blood pressure and lung function via inflammatory cytokine pathway. Journal of Toxicology and Environmental Health, Part A: Current Issues. Taylor & Francis, Inc., Philadelphia, PA, USA, 80(9): 485-501, (2017).
Overview: This dataset offers a unique collection of research abstracts, comprising both human-written and AI-generated versions. Each entry provides a title, followed by the abstract text, with annotations specifying whether the abstract was human-authored or generated by GPT. This dataset was created for the research article "Detecting AI Authorship: Analyzing Descriptive Features for AI Detection".
Structure: The dataset is structured in the following manner:
title: A research paper's title that remains consistent for both human-written and GPT-generated abstracts.
abstract: The main content of the abstract. Each title is associated with two abstract texts — one penned by a human author, and another created by GPT.
ai_generated (Boolean): True indicates the abstract was generated by GPT; False indicates the abstract was human-authored.
is_ai_generated (Binary): 1 denotes an AI-generated abstract; 0 denotes a human-written abstract.
The human abstracts were taken from this dataset: https://www.kaggle.com/datasets/Cornell-University/arxiv
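Given the structure above, here is a minimal pandas sketch that pairs each title's human and GPT abstracts via the ai_generated flag (the file name abstracts.csv is a placeholder, and pandas is an assumption, not part of the dataset description):

```python
import pandas as pd

# Hypothetical file name; the actual dataset file may differ.
df = pd.read_csv("abstracts.csv")

# Each title appears twice: once human-written, once GPT-generated.
human = df[df["ai_generated"] == False][["title", "abstract"]]
ai = df[df["ai_generated"] == True][["title", "abstract"]]

# Pair the two versions of each abstract by title.
pairs = human.merge(ai, on="title", suffixes=("_human", "_ai"))
print(pairs.head())
```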
Licence: This dataset is under the MIT licence (https://opensource.org/license/mit/), meaning that permission is granted to "any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software..."
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
This archive contains the files submitted to the 4th International Workshop on Data: Acquisition To Analysis (DATA) at SenSys. Files provided in this package are associated with the paper titled "Dataset: Analysis of IFTTT Recipes to Study How Humans Use Internet-of-Things (IoT) Devices"
With the rapid development and usage of Internet-of-Things (IoT) and smart-home devices, researchers continue efforts to improve the "smartness" of those devices to address daily needs in people's lives. Such efforts usually begin with understanding evolving user behaviors regarding how humans utilize the devices and what they expect of them. However, while research efforts abound, there are very few datasets that researchers can use both to understand how people use IoT devices and to evaluate algorithms or systems for smart spaces. In this paper, we collect and characterize more than 50,000 recipes from the online If-This-Then-That (IFTTT) service to understand a seemingly straightforward but complicated question: "What kinds of behaviors do humans expect from their IoT devices?" The dataset we collected contains the basic information of the IFTTT rules, trigger and action events, and how many people are using each rule.
For more detail about this dataset, please refer to the paper listed above.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Many fundamental concepts in evolutionary biology were discovered using non-human study systems. Humans are poorly suited to key study designs used to advance this field, and are subject to cultural, technological, and medical influences often considered to restrict the pertinence of human studies to other species and general contexts. Whether studies using current and recent human populations provide insights that have broader biological relevance in evolutionary biology is, therefore, frequently questioned. We first surveyed researchers in evolutionary biology and related fields on their opinions regarding whether studies on contemporary humans can advance evolutionary biology. Almost all 442 participants agreed that humans still evolve, but fewer agreed that this occurs through natural selection. Most agreed that human studies made valuable contributions to evolutionary biology, although those less exposed to human studies expressed more negative views. With a series of examples, we discuss strengths and limitations of evolutionary studies on contemporary humans. These show that human studies provide fundamental insights into evolutionary processes, improve understanding of the biology of many other species, and will make valuable contributions to evolutionary biology in the future.
This study was a response to the Trafficking Victims Protection Reauthorization Act passed by Congress in 2005, which called for a collection of data; a comprehensive statistical review and analysis of human trafficking data; and a biennial report to Congress on sex trafficking and unlawful commercial sex acts. It examined the human trafficking experiences (and to a lesser extent commercial sex acts) among a random sample of 60 counties across the United States. In contrast to prior research that had examined the issue from a federal perspective, this study examined experiences with human trafficking at the local level across the United States. The specific aims of the research were to: Identify victims and potential victims of domestic labor and sex trafficking; Determine whether they have been identified as victims by law enforcement; and Explore differences between sex trafficking and unlawful commercial sex. To achieve these goals the researchers collected data through telephone interviews with local law enforcement, prosecutors, and service providers; a mail-out statistical survey completed by knowledgeable officials in those jurisdictions; and an examination of case files in four local communities. This latter effort consisted of reviewing incident and arrest reports and charging documents for a variety of offenses that might have involved criminal conduct with characteristics of human trafficking. Through this method, the researchers not only gained a sense of how local authorities handled these types of cases but also the ways in which trafficking victims "fall through the cracks" in the interfaces between local and federal judicial systems as well as among local, state, and federal law enforcement and social service systems.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The PhysioIntent database was acquired during a master's thesis at INESC TEC. The dataset was built to research human movement intention through biosignals (electromyogram (EMG), electroencephalogram (EEG), and electrocardiogram (ECG)) using the Cyton board from OpenBCI [1]. Inertial data (9-axis) were also recorded with a proprietary device from INESC TEC named iHandU [2]. A camera (Logitech C270 HD) was also used to record each participant's session video, to better support post-processing of the recorded data and verification of agreement between the protocol and the participant's activity. All data were then synchronized with the aid of a photoresistor, correlating the visual stimuli presented to the user with the signals acquired.

The acquisitions are divided into two phases; the 2nd phase was performed to address setbacks encountered in the 1st phase, such as data loss and synchronization issues. The 1st phase study included 6 healthy volunteers (age range = 22 to 25; average age = 22.3±0.9; 2 males and 4 females; all right-handed). The 2nd phase study included 3 healthy volunteers (age range = 20 to 26; average age = 22.6±2.5; 2 males and 1 female; all right-handed).

The protocol consists of the execution and imagination of several upper limb movements, repeated several times throughout the protocol. There are a total of three different movements during the session: hand-grasping, wrist supination, and pick-and-place. Each sequence of movement imagination and execution, together with the resting periods, is called a trial. A run is a sequence of trials that ends with a 60 s break. This dataset has two different phases of acquisition: Phase 1 has a total of four runs with fifteen trials each, while Phase 2 has five runs with eighteen trials each. During Phase 1, in every run, each movement was imagined and executed 5 times, corresponding to a total of 20 repetitions per movement during each session. In Phase 2, in every run, each movement was executed and imagined 6 times, resulting in 30 repetitions per movement in each session.

In Phase 1, four muscles were measured: biceps brachii, triceps brachii, flexor carpi radialis, and extensor digitorum. For the EEG, the measured channels were FP1, FP2, FCZ, C3, CZ, C4, CP3, CP4, P3, and P4. During Phase 2, only one muscle, the extensor digitorum, was measured. For the EEG, the channels measured were FP1, FP2, FC3, FCz, FC4, C1, C3, Cz, C2, C4, CP3, CP4, P3, and P4.

Before the experiments, the participants were informed about the experimental protocols, paradigms, and purpose. After ensuring they understood the information, the participants signed a written consent form approved by the DPO of INESC TEC.

All files are grouped by subject. Detailed descriptions of how the files are organized can be found in the README file. There is also an extra folder called "PhysioIntent supporting material" containing additional material, including a script with functions to help read the data, a description of the experimental protocol, and the setup created for each phase. For each subject, the data are organized according to the data model ("Subject_data_storage_model"), in which each type of data is placed in a different folder. Regarding biosignals (openBCI/ folder), both raw and processed data are provided. For some subjects there is an additional README file containing particular details of the acquisition.

[1] Cyton + Daisy Biosensing Boards (16-Channels). (2022). 
Retrieved 23 August 2022, from https://shop.openbci.com/products [2] Oliveira, Ana, Duarte Dias, Elodie Múrias Lopes, Maria do Carmo Vilas-Boas, and João Paulo Silva Cunha. "SnapKi—An Inertial Easy-to-Adapt Wearable Textile Device for Movement Quantification of Neurological Patients." Sensors 20, no. 14 (2020): 3875.
MIT License: https://opensource.org/licenses/MIT
This dataset contains 20,000 pieces of text collected from Wikipedia, Gutenberg, and CNN/DailyMail. The text was cleaned by replacing symbols such as (.*?/) with whitespace using automated scripts and regex.
The data was collected from these sources to ensure the highest level of integrity against AI-generated text:
* Wikipedia: The 20220301 dataset was chosen to minimize the chance of including articles generated or heavily edited by AI.
* Gutenberg: Books from this source are guaranteed to be written by real humans and span various genres and time periods.
* CNN/DailyMail: These news articles were written by professional journalists and cover a variety of topics, ensuring diversity in writing style and subject matter.
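As a rough illustration of that cleaning step, here is a minimal Python sketch; the exact symbol set and script used by the dataset authors are not specified, so the pattern below is an assumption:

```python
import re

# Assumed pattern: strip punctuation/symbol characters such as . * ? / ( )
# and collapse the result to single spaces; the authors' actual regex may differ.
SYMBOLS = re.compile(r"[.*?/()\[\]{}<>|\\#@~^]")

def clean_text(text: str) -> str:
    text = SYMBOLS.sub(" ", text)               # replace symbols with a space
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

print(clean_text("Hello (world)... see /path/* for more?"))
# -> "Hello world see path for more"
```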
The dataset consists of 5 CSV files:
1. CNN_DailyMail.csv: Contains all processed news articles.
2. Gutenberg.csv: Contains all processed books.
3. Wikipedia.csv: Contains all processed Wikipedia articles.
4. Human.csv: Combines all three datasets in order.
5. Shuffled_Human.csv: The randomly shuffled version of Human.csv.
Each file has 2 columns:
- Title: The title of the item.
- Text: The content of the item.
This dataset is suitable for a wide range of NLP tasks, including: - Training models to distinguish between human-written and AI-generated text (Human/AI classifiers). - Training LSTMs or Transformers for chatbots, summarization, or topic modeling. - Sentiment analysis, genre classification, or linguistic research.
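As a starting point for the first of these tasks, here is a minimal pandas sketch that loads the shuffled file as the human-written class (pandas is an assumption; pairing with an AI-generated corpus is left to the user):

```python
import pandas as pd

# Load the shuffled human-written corpus described above.
human_df = pd.read_csv("Shuffled_Human.csv")

# Label every row as the human class (0); an AI-generated corpus
# labeled 1 would be concatenated here to train a Human/AI classifier.
human_df["label"] = 0

texts = human_df["Text"].astype(str).tolist()
labels = human_df["label"].tolist()
print(len(texts), "human-written documents loaded")
```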
While the data was collected from such sources, it may not be entirely free of AI-generated text. Wikipedia articles may reflect systemic biases in contributor demographics, and CNN/DailyMail articles may focus on specific news topics or regions.
For details on how the dataset was created, see the Kaggle notebook used to build it.
This dataset is published under the MIT License, allowing free use for both personal and commercial purposes. Attribution is encouraged but not required.
The "https://addhealth.cpc.unc.edu/" Target="_blank">National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32*. Add Health combines longitudinal survey data on respondents' social, economic, psychological and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships, providing unique opportunities to study how social environments and behaviors in adolescence are linked to health and achievement outcomes in young adulthood. The fourth wave of interviews expanded the collection of biological data in Add Health to understand the social, behavioral, and biological linkages in health trajectories as the Add Health cohort ages through adulthood. The fifth wave of data collection is planned to begin in 2016.
Initiated in 1994 and supported by three program project grants from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) (https://www.nichd.nih.gov/) with co-funding from 23 other federal agencies and foundations, Add Health is the largest, most comprehensive longitudinal survey of adolescents ever undertaken. Beginning with an in-school questionnaire administered to a nationally representative sample of students in grades 7-12, the study followed up with a series of in-home interviews conducted in 1995, 1996, 2001-02, and 2008. Other sources of data include questionnaires for parents, siblings, fellow students, and school administrators and interviews with romantic partners. Preexisting databases provide information about neighborhoods and communities.
Add Health was developed in response to a mandate from the U.S. Congress to fund a study of adolescent health, and Waves I and II focus on the forces that may influence adolescents' health and risk behaviors, including personal traits, families, friendships, romantic relationships, peer groups, schools, neighborhoods, and communities. As participants have aged into adulthood, however, the scientific goals of the study have expanded and evolved. Wave III, conducted when respondents were between 18 and 26** years old, focuses on how adolescent experiences and behaviors are related to decisions, behavior, and health outcomes in the transition to adulthood. At Wave IV, respondents were ages 24-32* and assuming adult roles and responsibilities. Follow up at Wave IV has enabled researchers to study developmental and health trajectories across the life course of adolescence into adulthood using an integrative approach that combines the social, behavioral, and biomedical sciences in its research objectives, design, data collection, and analysis.
* 52 respondents were 33-34 years old at the time of the Wave IV interview.
** 24 respondents were 27-28 years old at the time of the Wave III interview.
The Wave III public-use data are helpful in analyzing the transition between adolescence and young adulthood. Included in this dataset are data on pregnancy.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
20 participants installed and interacted with a thematic analysis coding assistant (TACA), an interactive machine learning desktop application designed to train a classifier on user-defined coded datasets in order to generate additional coding suggestions. Interviews were conducted with the participants after they had interacted with the tool for 20 minutes, or until no further benefits were perceived. The questions aimed to understand the participants' experience with TACA and their perceptions of the ML model.

The coded_transcripts.docx file contains the anonymised interview transcripts, coded with codes appearing as comments. The document is split into Study 1 (5 participants) and Study 2 (15 participants). The participants in Study 1 imported their own dataset into TACA, while the participants in Study 2 used a set of newspaper restaurant reviews given to them by the researchers. Participant IDs follow the structure "S[study number]_P[participant number]", e.g. "S2_P1".

The themes.csv file shows all the codes below each corresponding theme, the result of conducting thematic analysis on the interview transcripts.

The restaurant_reviews.docx file is the collection of 21 restaurant reviews from the newspaper The Guardian (Restaurants + Reviews | Food | The Guardian) that was given to the 15 of the 20 participants who did not have their own dataset available for the study.

The logs folder contains, for each participant, an anonymised log of interactions with the TACA interface, named with the corresponding participant ID. The interaction logs for participants S1_P4 and S2_P5 are missing due to an issue in data storage.
A database based on a random sample of the noninstitutionalized population of the United States, developed for the purpose of studying the effects of demographic and socio-economic characteristics on differentials in mortality rates. It consists of data from 26 U.S. Current Population Survey (CPS) cohorts, annual Social and Economic Supplements, and the 1980 Census cohort, combined with death certificate information to identify mortality status and cause of death over the interval 1979 to 1998. The Current Population Surveys are March Supplements selected from the period March 1973 to March 1998. The NLMS routinely links geographical and demographic information from Census Bureau surveys and censuses to the NLMS database, and other available sources upon request. The Census Bureau and CMS have approved the linkage protocol, and data acquisition is currently underway. The plan for the NLMS is to link information on mortality to the NLMS every two years from 1998 through 2006, with research on the resulting database to continue at least through 2009. The NLMS will continue to incorporate data from the yearly Annual Social and Economic Supplement as the data become available. Based on the expected size of the Annual Social and Economic Supplements to be conducted, the updating process is expected to increase the mortality content of the study to nearly 500,000 cases out of a total of approximately 3.3 million records. This effort also includes expanding the NLMS population base by incorporating new March Supplement Current Population Survey data into the study as they become available. Linkages to the SEER and CMS datasets are also available.

Data Availability: Due to the confidential nature of the data used in the NLMS, the public use dataset consists of a reduced number of CPS cohorts with a fixed follow-up period of five years. NIA does not make the data available directly. Research access to the entire NLMS database can be obtained through the NIA program contact listed. Interested investigators should email the NIA contact and send a one-page prospectus of the proposed project. NIA will approve projects based on their relevance to NIA/BSR's areas of emphasis. Approved projects are then assigned to NLMS statisticians at the Census Bureau, who work directly with the researcher to interface with the database. A modified version of the public use data files is also available through the Census restricted Data Centers. However, since the database is quite complex, many investigators have found that the most efficient way to access it is through the Census programmers.

* Dates of Study: 1973-2009
* Study Features: Longitudinal
* Sample Size: ~3.3 Million

Link: ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/00134
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Introduction: This dataset was gathered during the Vid2Real online video-based study, which investigates humans' perception of robots' intelligence in the context of an incidental human-robot encounter. The dataset contains participants' questionnaire responses to four video study conditions, namely Baseline, Verbal, Body Language, and Body Language + Verbal. The videos depict a scenario where a pedestrian incidentally encounters a quadruped robot trying to enter a building. The robot uses verbal commands or body language to ask the pedestrian for help in the different study conditions. The differences between the conditions were manipulated using the robot's verbal and expressive movement functionalities.

Dataset Purpose: The dataset includes human subjects' responses about the robot's social intelligence, used to validate the hypothesis that robot social intelligence is positively correlated with human compliance in an incidental human-robot encounter context. The video-based dataset was also developed to obtain empirical evidence that can be used to design future real-world HRI studies.

Dataset Contents:
- Four videos, each corresponding to a study condition.
- Four sets of Perceived Social Intelligence Scale data; each set corresponds to one study condition.
- Four sets of compliance likelihood questions; each set includes one Likert question and one free-form question.
- One set of Godspeed questionnaire data.
- One set of Anthropomorphism questionnaire data.
- A CSV file containing the participants' demographic data, Likert scale data, and text responses.
- A data dictionary explaining the meaning of each of the fields in the CSV file.

Study Conditions: There are four videos (i.e., study conditions); the video scenarios are as follows.
- Baseline: The robot walks up to the entrance and waits for the pedestrian to open the door without any additional behaviors. This is also the "control" condition.
- Verbal: The robot walks up to the entrance, says "Can you please open the door for me" to the pedestrian while facing the same direction, then waits for the pedestrian to open the door.
- Body Language: The robot walks up to the entrance, turns its head to look at the pedestrian, then turns its head to face the door, and waits for the pedestrian to open the door.
- Body Language + Verbal: The robot walks up to the entrance, turns its head to look at the pedestrian, says "Can you open the door for me" to the pedestrian, then waits for the pedestrian to open the door.

[Images in the record show the Verbal condition and the Body Language condition.]

A within-subject design was adopted, and all participants experienced all conditions. The order of the videos, as well as the PSI scales, was randomized. After giving consent, participants were presented with one video, followed by the PSI questions and the two exploratory (compliance likelihood) questions described above. This set was repeated four times, after which the participants reported their general perceptions of the robot via the Godspeed and AMPH questionnaires. Each video was around 20 seconds long, and the total study time was around 10 minutes.

Video as a Study Method: A video-based study is a common data collection method in human-robot interaction research. Videos can easily be distributed via online participant recruiting platforms and can reach a larger sample than in-person/lab-based studies. It is therefore a fast and easy data collection method for research aiming to obtain empirical evidence.
Video Filming: The videos were filmed from a first-person point of view in order to maximize the alignment between the video and real-world settings. The recording device was an iPhone 12 Pro, and the videos were shot in 4K at 60 fps. For better accessibility, the videos have been converted to lower resolutions.

Instruments: The questionnaires used in the study include the Perceived Social Intelligence Scale (PSI), the Godspeed Questionnaire, and the Anthropomorphism Questionnaire (AMPH). In addition to these questionnaires, a 5-point Likert question and a free-text response measuring human compliance were added for the purposes of the video-based study. Participant demographic data were also collected. Questionnaire items are attached as part of this dataset.

Human Subjects: Participants were recruited through Prolific and are therefore Prolific users. Additionally, they were restricted to people currently living in the United States who are fluent in English and have no hearing or visual impairments. No other restrictions were imposed. Among the 385 participants, 194 identified as female and 191 as male; ages ranged from 19 to 75 (M = 38.53, SD = 12.86). Human subjects remained anonymous. Participants were compensated with $4 upon submission approval. This study was reviewed and approved by the UT Austin Institutional Review Board.

Robot: The dataset contains data about humans' perceived...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Introduction
There are several works based on Natural Language Processing on newspaper reports. Rameshbhai et al. [1] mined opinions from headlines using Stanford NLP and SVM, comparing several algorithms on a small and a large dataset. Rubin et al., in their paper [2], created a mechanism to differentiate fake news from real news by building a set of characteristics of news according to their types; the purpose was to contribute to the low-resource data available for training machine learning algorithms. Doumit et al. [3] implemented LDA, a topic modeling approach, to study bias present in online news media.
However, not much NLP research has been invested in studying COVID-19. Most applications include classification of chest X-rays and CT scans to detect the presence of pneumonia in lungs [4], a consequence of the virus. Other research areas include studying the genome sequence of the virus [5][6][7] and replicating its structure to fight it and find a vaccine. This research is crucial in battling the pandemic. The few NLP-based research publications include sentiment classification of online tweets by Samuel et al. [8] to understand the fear persisting in people due to the virus. Similar work has been done using an LSTM network to classify sentiments from online discussion forums by Jelodar et al. [9]. To the best of our knowledge, the NKK dataset is the first study on a comparatively larger dataset of newspaper reports on COVID-19, contributing to awareness of the virus.
2 Data-set Introduction
2.1 Data Collection
We accumulated 1000 online newspaper reports from the United States of America (USA) on COVID-19. The newspapers include The Washington Post (USA) and StarTribune (USA). We named this collection "Covid-News-USA-NNK". We also accumulated 50 online newspaper reports from Bangladesh on the issue and named it "Covid-News-BD-NNK". The newspapers include The Daily Star (BD) and Prothom Alo (BD). All these newspapers are top providers and among the most read in their respective countries. The collection was done manually by 10 human data collectors of age group 23- with university degrees. This approach was preferable to automation to ensure the news was highly relevant to the subject: the newspapers' online sites had dynamic content with advertisements in no particular order, so there was a high chance that automated scrapers would collect inaccurate news reports. One challenge while collecting the data was the requirement of subscriptions; each newspaper required $1 per subscription. Some criteria for collecting the news reports, provided as guidelines to the human data collectors, were as follows:
The headline must have one or more words directly or indirectly related to COVID-19.
The content of each news must have 5 or more keywords directly or indirectly related to COVID-19.
The genre of the news can be anything as long as it is relevant to the topic. Political, social, and economic genres are to be prioritized.
Avoid taking duplicate reports.
Maintain a time frame for the above mentioned newspapers.
To collect these data we used a Google Form for both USA and BD. Two human editors went through each entry to check for any spam or troll entries.
2.2 Data Pre-processing and Statistics
Some pre-processing steps performed on the newspaper report dataset are as follows:
Remove hyperlinks.
Remove non-English alphanumeric characters.
Remove stop words.
Lemmatize text.
While more pre-processing could have been applied, we tried to keep the data as unchanged as possible, since altering sentence structures could result in the loss of valuable information. While this was done with the help of a script, we also assigned the same human collectors to cross-check for the presence of the above-mentioned criteria.
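A minimal sketch of these four pre-processing steps using NLTK (the authors' actual script is not provided, so this is an assumed reimplementation):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # keep English alphanumerics only
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]  # remove stop words
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)           # lemmatize

print(preprocess("COVID-19 cases are rising; see https://example.com for details."))
```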
The primary data statistics of the two datasets are shown in Tables 1 and 2.
Table 1: Covid-News-USA-NNK data statistics
No of words per headline: 7 to 20
No of words per body content: 150 to 2100

Table 2: Covid-News-BD-NNK data statistics
No of words per headline: 10 to 20
No of words per body content: 100 to 1500
2.3 Dataset Repository
We used GitHub as our primary data repository under the account name NKK^1. There, we created two repositories, USA-NKK^2 and BD-NNK^3. The dataset is available in both CSV and JSON formats. We regularly update the CSV files and regenerate the JSON using a Python script. We also provide a Python script file for essential operations. We welcome all outside collaboration to enrich the dataset.
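A minimal sketch of such a CSV-to-JSON regeneration step with pandas (the file names are placeholders; the repository's actual script may differ):

```python
import pandas as pd

# Placeholder file name; the repository's actual layout may differ.
df = pd.read_csv("Covid-News-USA-NNK.csv")

# Regenerate the JSON copy from the canonical CSV, one record per article.
df.to_json("Covid-News-USA-NNK.json", orient="records", indent=2, force_ascii=False)
```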
3 Literature Review
Natural Language Processing (NLP) deals with text (also known as categorical) data in computer science, utilizing numerous diverse methods like one-hot encoding, word embedding, etc., that transform text to machine language, which can be fed to multiple machine learning and deep learning algorithms.
Some well-known applications of NLP include fraud detection on online media sites [10], authorship attribution in fallback authentication systems [11], intelligent conversational agents or chatbots [12], and the machine translation used by Google Translate [13]. While these are all downstream tasks, several exciting developments have been made in algorithms solely for Natural Language Processing tasks. The two most prominent are BERT [14], which uses a bidirectional transformer encoder architecture that achieves near-perfect performance on classification tasks and masked-word prediction, and the GPT-3 models released by OpenAI [15], which can generate almost human-like text. However, these are all pre-trained models, since they carry a huge computation cost. Information Extraction is a generalized concept of retrieving information from a dataset. Information extraction from an image could be retrieving vital feature spaces or targeted portions of an image; information extraction from speech could be retrieving information about names, places, etc. [16]. Information extraction in texts could be identifying named entities, locations, or other essential data. Topic modeling is a sub-task of NLP and also a process of information extraction: it clusters words and phrases of the same context together into groups. Topic modeling is an unsupervised learning method that gives us a brief idea about a set of texts. One commonly used topic model is Latent Dirichlet Allocation, or LDA [17].
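As an illustration of LDA-based topic modeling of the kind described here, a short scikit-learn sketch (the corpus and parameters are illustrative only, not from the paper):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "masks help slow the spread of the virus",
    "the stock market fell as the economy slowed",
    "hospitals report rising infection numbers",
    "jobs and markets react to the crisis",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Cluster the vocabulary into 2 latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:]]
    print(f"Topic {i}: {top}")
```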
Keyword extraction is a process of information extraction and a sub-task of NLP that extracts essential words and phrases from a text. TextRank [18] is an efficient keyword extraction technique that uses a graph to calculate the weight of each word and picks the words with the highest weight.
Word clouds are a great visualization technique to understand the overall ’talk of the topic’. The clustered words give us a quick understanding of the content.
4 Our Experiments and Result Analysis
We used the wordcloud library^4 to create the word clouds. Figures 1 and 3 present the word clouds of the Covid-News-USA-NNK dataset by month from February to May. From Figures 1, 2, and 3, we can note a few observations (a short word-cloud generation sketch follows this list):
In February, both newspapers talked about China and the source of the outbreak.
StarTribune emphasized Minnesota as the most concerned state; in April, its concern appeared to deepen.
Both newspapers talked about the virus impacting the economy, i.e., banks, elections, administrations, and markets.
The Washington Post discussed global issues more than StarTribune.
StarTribune in February mentioned the first precautionary measure, wearing masks, and the uncontrollable spread of the virus throughout the nation.
While both newspapers mentioned the outbreak in China in February, the spread within the United States is more heavily highlighted throughout March to May, displaying the critical impact caused by the virus.
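The word clouds discussed above can be reproduced with the wordcloud library; a minimal sketch (the monthly article text is a placeholder):

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Placeholder: concatenated article text for one month of one newspaper.
february_text = "china outbreak virus wuhan travel masks economy markets"

wc = WordCloud(width=800, height=400, background_color="white").generate(february_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```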
We used a script to extract all numbers related to certain keywords such as 'Deaths', 'Infected', 'Died', 'Infections', 'Quarantined', 'Lock-down', 'Diagnosed', etc. from the news reports and created a case count series for both newspapers. Figure 4 shows the statistics of this series. From this extraction technique, we can observe that April was the peak month for COVID cases, rising gradually from February. Both newspapers clearly show that the rise in COVID cases from February to March was slower than the rise from March to April; this is an important indicator of possible recklessness in preparations to battle the virus. However, the steep fall from April to May also shows the positive response against the attack.

We used VADER sentiment analysis to extract the sentiment of the headlines and bodies. On average, the sentiments ranged from -0.5 to -0.9 on the VADER scale, which runs from -1 (highly negative) to 1 (highly positive). There were some cases where the sentiment scores of the headline and body contradicted each other, i.e., the sentiment of the headline was negative but the sentiment of the body was slightly positive. Overall, sentiment analysis can assist us in sorting the most concerning (most negative) news from the positive news, from which we can learn more about the indicators related to COVID-19 and the serious impact caused by it. Moreover, sentiment analysis can also provide information about how a state or country is reacting to the pandemic.

We used the PageRank algorithm to extract keywords from the headlines as well as the body content. PageRank efficiently highlights important relevant keywords in the text. Some frequently occurring important keywords extracted from both datasets are: 'China', 'Government', 'Masks', 'Economy', 'Crisis', 'Theft', 'Stock market', 'Jobs', 'Election', 'Missteps', 'Health', 'Response'. Keyword extraction acts as a filter allowing quick searches for indicators in case of locating situations of the economy,
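A minimal sketch of the VADER scoring described above, using the vaderSentiment package (the headline is illustrative):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

headline = "Infections surge as hospitals struggle with the crisis"
scores = analyzer.polarity_scores(headline)

# 'compound' is the normalized score in [-1, 1]; -1 is highly negative.
print(scores["compound"])
```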
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data package, titled Data from: Cross-sectional personal network analysis of adult smoking in rural areas, includes several files. First, there are anonymized raw data files in .rds format (ego_data.rds & alter_data.rds). Second, there is the R code that allows replication of the various statistical analyses. Interested parties may consult the R code as a .pdf file (Supplementary_Material_R_Code.pdf), a .Rmd file (which can be run to create the .pdf file), or a .R file (which can be accessed with R and RStudio). Moreover, the labels files are useful for recreating the Supplementary Material pdf file.
Readers should know that this dataset corresponds to the study (paper) Cross-sectional personal network analysis of adult smoking in rural areas.
The ego_data.rds file includes 20 variables by 76 observations (respondents), while the alter_data.rds file includes 46 variables by 1681 observations (social contacts). We collected this information by deploying a personal network analysis research design. Initially, we interviewed 83 respondents (dubbed egos). Due to missing data, we kept 76 egos in the analysis and dropped seven respondents. We recruited the respondents using a link-tracing sampling framework: we started from six seeds, interviewed the seeds, and then asked them to recommend other people for the study. We continued in this referral fashion until 83 interviews were completed. The study was performed in a small rural Romanian community (4124 residents): Lerești (Argeș county).
Our study was carried out in accordance with the relevant recommendations, guidelines, and regulations (specifically, those provided by the Romanian Sociologists Society, i.e., the professional association of Romanian sociologists). The research was performed in accordance with the Declaration of Helsinki. The research protocol was approved by a named institutional/licensing committee: the Ethics Committee of the Center for Innovation in Medicine (InoMed) reviewed and approved all study procedures (EC-INOMED Decision No. D001/09-06-2023 and No. D001/19-01-2024). All participants gave written informed consent. The privacy rights of the study participants were observed. The authors did not have access to information that could identify participants. Face-to-face interviews were conducted between September 13 and 23, 2023, in Lerești, Romania. After each interview, information that could identify the participants was anonymized. Before conducting the interview, we provided each participant with a dossier containing informative materials about the project's objectives, how the data would be analyzed and reported, and their participation rights (e.g., the right to withdraw from the project at any time, even after the interview was completed). All study participants gave their written informed consent prior to enrolment in the study.
The variables in the ego_data.rds file are as follows:
(1) "networkCanvasEgoUUID" (unique alpha numeric code for each observation);
(2) "ego_age" (the age of each study participant);
(3) "ego_age.cen" (the age of each study participant, centered);
(4) "ego_educ_b" (the education of each ego, binary);
(5) "ego_educ_f" (the education of each ego, educational achievement);
(6) "ego_marital.s_f" (the marital status of each ego);
(7) "ego_occupation.cat2_f" (the occupation of each ego);
(8) "ego_occupation_b" (the occupation of each ego, unemployed vs employed);
(9) "ego_relstatus_b" (whether the ego is in a relationship or not);
(10) "ego_sex_f" (the sex of the ego assigned at birth; male & female);
(11) "ego_sex_n" (the sex of the ego assigned at birth; 0 = male & 1 = female);
(12) "ego_smk_status_b1" (smoking status: 1 smoking, 0 others);
(13) "ego_smk_status_b2" (smoking status: 1 former smoker, 0 others);
(14) "ego_smk_status_b3" (smoking status: 1 not a smoker, 0 others);
(15) "ego_smkstatus_f" (smoking status: former smoker, never-smoker, non-smoker (smoked too little), occasional smoker, smoker);
(16) "ego_smoking_3cat" (smoking status: non-smoker, former smoker, smoker);
(17) "net.size" (number of social contacts, alters, that were elicited by an ego);
(18) "net.components" (number of strong components in the personal network);
(19) "net.deg.centralization" (personal network degree centralization);
(20) "net.density" (personal network density).
The variables in the alter_data.rds file are as follows:
(1) "alter_age" (the age of the alter);
(2) "alter_age.cen" (the age of the alter - centered);
(3) "alter_btw" (alter's betweenness score);
(4) "alter_btw.cen" (alter's betweenness score - centered);
(5) "alter_deg" (alter's degree score);
(6) "alter_deg.cen" (alter's degree score - centered);
(7) "alter_educ_b" (alter's education);
(8) "alter_educ_f" (alter's education);
(9) "alter_marital.s_f" (alter's marital status);
(10) "alter_relstatus_b" (alter's marital status - binary variable);
(11) "alter_sex_f" (alter's sex assigned at birth);
(12) "alter_sex_n" (alter's sex assigned at birth; 1 - female; 0 - male);
(13) "alter_smk_status_b1" (alter's smoking status; 1 smoker, 0 others);
(14) "alter_smk_status_b2" (alter's smoking status; 1 former smoker, 0 others);
(15) "alter_smk_status_b3" (alter's smoking status; 1 non-smoker, 0 others);
(16) "alter_smoking_3cat" (alter's smoking status: three categories - smoker, non-smoker, former smoker);
(17) "assortativity_score_fsmoker" (assortativity score for alter, former smoker);
(18) "assortativity_score_nsmoker" (assortativity score for alter, non-smoker);
(19) "assortativity_score_smoker" (assortativity score for alter, smoker);
(20) "ego.alter_meet_f" (ego's meeting frequency with alter);
(21) "ego_alter_meet_b" (ego's meeting frequency with alter, binary variable);
(22) "ego.alter_meet_n" (ego's meeting frequency with alter, numerical codes);
(23) "alter_rel.w.ego_f" (type of alters in an ego's network);
(24) "networkCanvasUUID" (alpha numeric code for alter);
(25) "networkCanvasEgoUUID" (alpha numeric code for ego);
(26) "ego_smkstatus_f" (smoking status: former smoker, never-smoker, non-smoker (smoked too little), occasional smoker, smoker);
(27) "ego_smoking_3cat" (three categories, smoking status: former smoker, non-smoker, smoker);
(28) "ego_type_fsmk" (former smoking egos by type of ego-alter relationship);
(29) "ego_type_nsmk" (non smoking egos by type of ego-alter relationship);
(30) "ego_type_smk" (smoking egos by type of ego-alter relationship);
(31) "ego_sex_f" (ego's sex, binary);
(32) "ego_sex_n" (ego's sex, numerical code, 1 female, 0 male);
(33) "ego_educ_b" (ego's education, binary variable)
(34) "ego_age" (ego's age)
(35) "ego_age.cen" (ego's age, centered)
(36) "ego_relstatus_b" (ego's marital status, binary variable)
(37) "ego_occupation_b" (ego's employment status, binary variable)
(38) "net.components" (number of strong components in the personal network)
(39) "net.deg.centralization" (degree centralization score in the personal networ)
(40) "net.density" (density score in the personal network)
(41) "prop_fsmokers" (proportion of former smokers in the personal network - alters)
(42) "prop_fsmokers.cen" (proportion of former smokers in the personal network, centered- alters)
(43) "prop_nsmokers" (proportion of non-smokers in the personal network- alters)
(44) "prop_nsmokers.cen" (proportion of non-smokers in the personal network, centered- alters)
(45) "prop_smokers" (proportion of smokers in the personal network- alters)
(46) "prop_smokers.cen" (proportion of smokers in the personal network, centered- alters)
This research project developed and fully documented a method to estimate the number of females and males trafficked for the purposes of sexual and labor exploitation from eight countries (Colombia, Ecuador, El Salvador, Guatemala, Mexico, Nicaragua, Peru, and Venezuela) into the United States at the Southwest border. The model utilizes only open source data. This research represents the first phase of a two-phase project and:
- Provides a conceptual framework for identifying potential data sources to estimate the number of victims at different stages in trafficking
- Develops statistical models to estimate the number of males and females at risk of being trafficked for sexual and labor exploitation from the eight countries, and the number of males and females actually trafficked for sex and labor
- Incorporates into the estimation models the transit journey of trafficking victims from the eight countries to the southwest border of the United States
- Designs the estimation models such that they are highly flexible and modular so that they can evolve as the body of data expands
- Utilizes open source data as inputs to the statistical model, making the model accessible to anyone interested in using it
- Presents preliminary estimates that illustrate the use of the statistical methods
- Illuminates gaps in data sources
The data included in this collection are the open source data which were primarily used in the models to estimate the number of males and females at risk of being trafficked.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Taste Quality Representation in the Human Brain. The Journal of Neuroscience, January 29, 2020, 40(5):1042–1052. https://doi.org/10.1523/JNEUROSCI.1751-19.2019
In the mammalian brain, the insula is the primary cortical substrate involved in the perception of taste. Recent imaging studies in rodents have identified a 'gustotopic' organization in the insula, whereby distinct insula regions are selectively responsive to one of the five basic tastes. However, numerous studies in monkeys have reported that gustatory cortical neurons are broadly tuned to multiple tastes, and that tastes are not represented in discrete spatial locations. Neuroimaging studies in humans have thus far been unable to discern between these two models, though this may be due to the relatively low spatial resolution employed in taste studies to date. In the present study, we examined the spatial representation of taste within the human brain of 18 healthy subjects using ultra-high-resolution functional magnetic resonance imaging (fMRI) at high magnetic field strength (7 Tesla). During scanning, male and female participants tasted sweet, salty, sour, and tasteless liquids, delivered via a custom-built MRI-compatible tastant-delivery system. During the fMRI task, participants received 0.5 mL of sweet, sour, salty, and neutral tastants in a block design (4 identical taste events per 20 s block), followed by a 5 s wash period. Functional MRI data were acquired at ultra-high voxel resolution (1.2 mm x 1.2 mm x 1.2 mm) at 7 Tesla. Echo-planar images (EPI) were acquired in 68 axial slices, in a scan window that ranged from the top of the cingulate gyrus (superiorly) to the tip of the temporal pole (inferiorly).
MRI files included are:
A) Skull-stripped T1w anatomical scans from the MP2RAGE acquisition (uni_den volume), resolution 0.7 mm x 0.7 mm x 0.7 mm.
B) 8 task EPI files (130 vols each), acquired anterior-to-posterior.
C) 1 EPI file (fmap, 20 vols), acquired posterior-to-anterior, for topup spatial distortion correction.
The National Child Development Study (NCDS) is a continuing longitudinal study that seeks to follow the lives of all those living in Great Britain who were born in one particular week in 1958. The aim of the study is to improve understanding of the factors affecting human development over the whole lifespan.
The NCDS has its origins in the Perinatal Mortality Survey (PMS) (the original PMS study is held at the UK Data Archive under SN 2137). This study was sponsored by the National Birthday Trust Fund and designed to examine the social and obstetric factors associated with stillbirth and death in early infancy among the 17,000 children born in England, Scotland and Wales in that one week. Selected data from the PMS form NCDS sweep 0, held alongside NCDS sweeps 1-3, under SN 5565.
Survey and Biomeasures Data (GN 33004):
To date there have been nine attempts to trace all members of the birth cohort in order to monitor their physical, educational and social development. The first three sweeps were carried out by the National Children's Bureau, in 1965, when respondents were aged 7, in 1969, aged 11, and in 1974, aged 16 (these sweeps form NCDS1-3, held together with NCDS0 under SN 5565). The fourth sweep, also carried out by the National Children's Bureau, was conducted in 1981, when respondents were aged 23 (held under SN 5566). In 1985 the NCDS moved to the Social Statistics Research Unit (SSRU) - now known as the Centre for Longitudinal Studies (CLS). The fifth sweep was carried out in 1991, when respondents were aged 33 (held under SN 5567). For the sixth sweep, conducted in 1999-2000, when respondents were aged 42 (NCDS6, held under SN 5578), fieldwork was combined with the 1999-2000 wave of the 1970 Birth Cohort Study (BCS70), which was also conducted by CLS (and held under GN 33229). The seventh sweep was conducted in 2004-2005 when the respondents were aged 46 (held under SN 5579), the eighth sweep was conducted in 2008-2009 when respondents were aged 50 (held under SN 6137) and the ninth sweep was conducted in 2013 when respondents were aged 55 (held under SN 7669).
Four separate datasets covering responses to NCDS over all sweeps are available. National Child Development Deaths Dataset: Special Licence Access (SN 7717) covers deaths; National Child Development Study Response and Outcomes Dataset (SN 5560) covers all other responses and outcomes; National Child Development Study: Partnership Histories (SN 6940) includes data on live-in relationships; and National Child Development Study: Activity Histories (SN 6942) covers work and non-work activities. Users are advised to order these studies alongside the other waves of NCDS.
From 2002-2004, a Biomedical Survey was completed and is available under End User Licence (EUL) (SN 8731) and Special Licence (SL) (SN 5594). Proteomics analyses of blood samples are available under SL SN 9254.
Linked Geographical Data (GN 33497):
A number of geographical variables are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies.
Linked Administrative Data (GN 33396):
A number of linked administrative datasets are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies. These include a Deaths dataset (SN 7717) available under SL and the Linked Health Administrative Datasets (SN 8697) available under Secure Access.
Additional Sub-Studies (GN 33562):
In addition to the main NCDS sweeps, further studies have also been conducted on a range of subjects such as parent migration, unemployment, behavioural studies and respondent essays. The full list of NCDS studies available from the UK Data Service can be found on the NCDS series access data webpage.
How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from NCDS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.
Further information about the full NCDS series can be found on the Centre for Longitudinal Studies website.
The National Child Development Study: Biomedical Survey 2002-2004 was funded under the Medical Research Council 'Health of the Public' initiative, and was carried out in 2002-2004 in collaboration with the Institute of Child Health, St George's Hospital Medical School, and NatCen. The survey was designed to obtain objective measures of ill-health and biomedical risk factors in order to address a wide range of specific hypotheses relating to anthropometry: cardiovascular, respiratory and allergic diseases; visual and hearing impairment; and mental ill-health.
The majority of the biomedical data (1,064 variables) are now available under EUL (SN 8731), with some data considered sensitive still available under Special Licence (SN 5594). This decision was the result of the CLS's disclosure assessment of each variable and the broad aim to make as much data available with the lowest possible barriers. Information about the medication taken by the cohort members of the study is also available under EUL for the first time. These data were collected in 2002-2004 but had not previously been released via the UKDS.
The Special Licence dataset contains 122 variables, including new data on child adversity not previously released, as well as a number of original variables that were previously available under Special Licence due to their sensitive nature, such as Clinical Interview Schedule-Revised (CIS-R) specific questions on mental health and questions containing categories with small frequencies related to personal details such as skin colour, pregnancy, a surgical operation, specific height, and an unusually high number of children.
For the second edition (December 2020), the data and documentation have been revised. Previously unreleased variables on child adversity have been added and some variables removed as they are now available under EUL. Users are advised to download the EUL version (SN 8731) before deciding to apply for the Special Licence version.
https://choosealicense.com/licenses/llama3/
Enhancing Human-Like Responses in Large Language Models
🤗 Models | 📊 Dataset | 📄 Paper
Human-Like-DPO-Dataset
This dataset was created as part of research aimed at improving conversational fluency and engagement in large language models. It is suitable for formats like Direct Preference Optimization (DPO) to guide models toward generating more human-like responses. The dataset includes 10,884 samples across 256 topics, including Technology, Daily Life, Science… See the full description on the dataset page: https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset.
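A minimal sketch of loading it with the Hugging Face datasets library (the exact column layout is not stated here, so the code only inspects it; DPO-style datasets typically pair a prompt with chosen/rejected responses):

```python
from datasets import load_dataset

ds = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")
print(ds)        # inspect the actual column names and sample count

sample = ds[0]
print(sample)    # DPO-style rows typically pair a prompt with a preferred
                 # ("chosen") and a dispreferred ("rejected") response
```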
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This dataset was created as part of the study presented in IEEE Transactions on Human-Machine Systems with the title "An Online Multi-Index Approach to Human Ergonomics Assessment in the Workplace" by Marta Lorenzini, Wansoo Kim, and Arash Ajoudani. This paper introduces an online approach to monitor kinematic and dynamic quantities on workers, providing on the spot an estimate of the physical load required in their daily jobs. A set of ergonomic indexes is defined to account for multiple potential contributors to work-related musculoskeletal disorders (WMSDs), which remain one of the major occupational safety and health problems in the European Union today. The continuous tracking of workers' exposure to the factors that may contribute to their development is therefore paramount. To evaluate the proposed framework, a thorough experimental analysis was conducted.
Twelve healthy adult subjects were recruited in the experimental study to perform, in the laboratory settings, occupational activities that are commonly carried out by workers in the current industrial scenario. Three tasks were selected to encompass the most significant risk factors in the workplace: mechanical overloading of the body joints, variable and high-intensity interaction forces, and repetitive and monotonous movements. Accordingly, lifting/lowering of a heavy object, drilling, and painting with a lightweight tool were considered, respectively, in this study. While the subjects were carrying out such activities, the data regarding the whole-body motion and the forces exchanged with the environment (both ground reaction force (GRF) and interaction forces at the end-effector) were collected. In addition, ten surface electromyography (sEMG) sensors were placed on the body of each subject to measure muscle activity as a reference to the effective physical effort required for the tasks.
The whole experimental procedure was carried out in accordance with the Declaration of Helsinki, and the protocol was approved by the ethics committee Azienda Sanitaria Locale (ASL) Genovese N.3 (Protocol IIT_HRII_ERGOLEAN 156/2020).
This dataset originates from a series of experimental studies titled “Tough on People, Tolerant to AI? Differential Effects of Human vs. AI Unfairness on Trust”. The project investigates how individuals respond to unfair behavior (distributive, procedural, and interactional unfairness) enacted by artificial intelligence versus human agents, and how such behavior affects cognitive and affective trust.

1. Experiment 1a: The Impact of AI vs. Human Distributive Unfairness on Trust
Overview: This dataset comes from an experimental study aimed at examining how individuals respond in terms of cognitive and affective trust when distributive unfairness is enacted by either an artificial intelligence (AI) agent or a human decision-maker. Experiment 1a focuses specifically on the main effect of the “type of decision-maker” on trust.
Data Generation and Processing: The data were collected through Credamo, an online survey platform. Initially, 98 responses were gathered from students at a university in China; additional student participants were then recruited via Credamo to supplement the sample. Attention check items were embedded in the questionnaire, and participants who failed them were automatically excluded in real time. Data collection continued until 202 valid responses were obtained. SPSS software was used for data cleaning and analysis.
Data Structure and Format: The data file is named “Experiment1a.sav” and is in SPSS format. It contains 202 rows and 28 columns, where each row corresponds to one participant. Columns represent measured variables, including: grouping and randomization variables; one manipulation check item; four items measuring distributive fairness perception; six items on cognitive trust; five items on affective trust; three items for honesty checks; and four demographic variables (gender, age, education, and grade level). The final three columns contain computed means for distributive fairness, cognitive trust, and affective trust.
Additional Information: No missing data are present. All variable names are labeled in English abbreviations to facilitate further analysis. The dataset can be directly opened in SPSS or exported to other formats.

2. Experiment 1b: The Mediating Role of Perceived Ability and Benevolence (Distributive Unfairness)
Overview: This dataset originates from an experimental study designed to replicate the findings of Experiment 1a and to further examine the potential mediating roles of perceived ability and perceived benevolence.
Data Generation and Processing: Participants were recruited via the Credamo online platform. Attention check items were embedded in the survey to ensure data quality. Data were collected using a rolling recruitment method, with invalid responses removed in real time. A total of 228 valid responses were obtained.
Data Structure and Format: The dataset is stored in a file named Experiment1b.sav in SPSS format and can be directly opened in SPSS software. It consists of 228 rows and 40 columns. Each row represents one participant’s data record, and each column corresponds to a different measured variable. Specifically, the dataset includes: random assignment and grouping variables; one manipulation check item; four items measuring perceived distributive fairness; six items on perceived ability; five items on perceived benevolence; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The last five columns contain the computed mean scores for perceived distributive fairness, ability, benevolence, cognitive trust, and affective trust.
Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be analyzed directly in SPSS or exported to other formats as needed.

3. Experiment 2a: Differential Effects of AI vs. Human Procedural Unfairness on Trust
Overview: This dataset originates from an experimental study aimed at examining whether individuals respond differently in terms of cognitive and affective trust when procedural unfairness is enacted by artificial intelligence versus human decision-makers. Experiment 2a focuses on the main effect of the decision agent on trust outcomes.
Data Generation and Processing: Participants were recruited via the Credamo online survey platform from two universities located in different regions of China. A total of 227 responses were collected; after excluding those who failed the attention check items, 204 valid responses were retained for analysis. Data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in a file named Experiment2a.sav in SPSS format and can be directly opened in SPSS software. It contains 204 rows and 30 columns. Each row represents one participant’s response record, while each column corresponds to a specific variable. Variables include: random assignment and grouping; one manipulation check item; seven items measuring perceived procedural fairness; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for procedural fairness, cognitive trust, and affective trust.
Additional Notes: The dataset contains no missing values. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be directly analyzed in SPSS or exported to other formats as needed.

4. Experiment 2b: Mediating Role of Perceived Ability and Benevolence (Procedural Unfairness)
Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 2a and to further examine the potential mediating roles of perceived ability and perceived benevolence in shaping trust responses under procedural unfairness.
Data Generation and Processing: Participants were working adults recruited through the Credamo online platform. A rolling data collection strategy was used, in which responses failing attention checks were excluded in real time. The final dataset includes 235 valid responses. All data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in a file named Experiment2b.sav, in SPSS format, and can be directly opened using SPSS software. It contains 235 rows and 43 columns. Each row corresponds to a single participant, and each column represents a specific measured variable. These include: random assignment and group labels; one manipulation check item; seven items measuring procedural fairness; six items for perceived ability; five items for perceived benevolence; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, and education). The final five columns contain the computed average scores for procedural fairness, perceived ability, perceived benevolence, cognitive trust, and affective trust.
Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to support future reuse and secondary analysis. The dataset can be directly analyzed in SPSS and easily converted to other formats if needed.

5. Experiment 3a: Effects of AI vs. Human Interactional Unfairness on Trust
Overview: This dataset comes from an experimental study that investigates how interactional unfairness, when enacted by either artificial intelligence or human decision-makers, influences individuals’ cognitive and affective trust. Experiment 3a focuses on the main effect of the “decision-maker type” under interactional unfairness conditions.
Data Generation and Processing: Participants were college students recruited from two universities in different regions of China through the Credamo survey platform. After excluding responses that failed attention checks, a total of 203 valid cases were retained from an initial pool of 223 responses. All data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in a file named Experiment3a.sav, in SPSS format and compatible with SPSS software. It contains 203 rows and 27 columns. Each row represents a single participant, while each column corresponds to a specific measured variable. These include: random assignment and condition labels; one manipulation check item; four items measuring interactional fairness perception; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for interactional fairness, cognitive trust, and affective trust.
Additional Notes: There are no missing values in the dataset. All variable names are provided using standardized English abbreviations to facilitate secondary analysis. The data can be directly analyzed using SPSS and exported to other formats as needed.

6. Experiment 3b: The Mediating Role of Perceived Ability and Benevolence (Interactional Unfairness)
Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 3a and to further examine the potential mediating roles of perceived ability and perceived benevolence under conditions of interactional unfairness.
Data Generation and Processing: Participants were working adults recruited via the Credamo platform. Attention check questions were embedded in the survey, and responses that failed these checks were excluded in real time. Data collection proceeded in a rolling manner until a total of 227 valid responses were obtained. All data were processed and analyzed using SPSS software.
Data Structure and Format: The dataset is stored in a file named Experiment3b.sav, in SPSS format and compatible with SPSS software. It includes 227 rows and
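Outside SPSS, these .sav files can be read with standard tools. Below is a minimal loading sketch in Python; pyreadstat is one possible reader (an assumption, not part of the dataset documentation), and the sanity checks use the figures quoted above for Experiment 1a.

    # Load an SPSS .sav file in Python with pyreadstat (pip install pyreadstat).
    import pyreadstat

    df, meta = pyreadstat.read_sav("Experiment1a.sav")

    # Sanity checks against the description above: 202 participants x 28
    # variables, and no missing values anywhere in the file.
    assert df.shape == (202, 28)
    assert int(df.isna().sum().sum()) == 0

    print(meta.column_names)   # abbreviated English variable names
    print(meta.column_labels)  # longer variable labels, if present in the file

The same pattern applies to the other five files; only the file name and expected shape change (e.g., Experiment1b.sav with 228 rows and 40 columns).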
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This multi-subject and multi-session EEG dataset for modelling human visual object recognition (MSS) contains:
Further details about the dataset are described below.
32 participants (4 female, 28 male; aged 21-33 years) were recruited from college students in Beijing, and 100 sessions were conducted in total. Participants were paid and gave written informed consent. The study was approved by the ethics committee of the Institute of Automation, Chinese Academy of Sciences (approval number IA21-2410-020201).
Each rapid serial sequence lasted approximately 7.5 seconds. It started with a 750 ms blank screen with a white fixation cross, followed by 20 or 21 images presented at 5 Hz with a 50% duty cycle, and ended with another 750 ms blank screen. After every 50 sequences, there was a break for the participants to rest.
After each rapid serial sequence, there was a 2-second interval during which participants were instructed to blink and then report, using a keyboard, whether a special image had appeared in the sequence. During each run, 20 sequences were randomly inserted with additional special images at random positions; the special images are logos for brain-computer interfaces.
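As a quick consistency check of these timing figures (plain Python; all numbers are taken from the description above), the image sequence itself spans about 5.5 s, and the quoted 7.5 s per trial appears to include the 2-second report interval:

    # Timing of one rapid serial trial, using the numbers given above.
    blank_ms = 750            # leading and trailing blank screens
    n_images = 20             # each sequence holds 20 or 21 images
    soa_ms = 1000 / 5         # 5 Hz presentation -> 200 ms per image
    on_ms = soa_ms * 0.5      # 50% duty cycle -> 100 ms on, 100 ms off

    sequence_ms = blank_ms + n_images * soa_ms + blank_ms
    print(sequence_ms)          # 5500.0 -> ~5.5 s for the sequence proper
    print(sequence_ms + 2000)   # 7500.0 -> ~7.5 s with the 2 s report interval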
In the low-speed task, each image was displayed for 1 second and was followed by 11 choice boxes (1 correct class box, 9 random class boxes, and 1 reject box). Participants were required to select the correct class of the displayed image using a mouse, a task included to increase their engagement. After the selection, a white fixation cross was displayed for 1 second in the centre of the screen to remind participants to pay attention to the upcoming task.
The stimuli are drawn from two image databases, ImageNet and PASCAL. The final set consists of 10,000 images: 500 images for each of 20 classes.
The derivatives/annotations folder contains additional information about MSS:
The EEG signals were pre-processed using the MNE package, version 1.3.1, with Python 3.9.16. The data were sampled at a rate of 1,000 Hz, with a bandpass filter applied between 0.1 and 100 Hz and a notch filter used to remove 50 Hz power-line interference. Epochs were created for each trial ranging from 0 to 500 ms relative to stimulus onset. No further preprocessing or artefact correction methods were applied in the technical validation; however, researchers may want to consider widely used steps such as baseline correction or eye-movement correction. After preprocessing, each session yielded two matrices: an RSVP EEG data matrix of shape (8,000 image conditions × 122 EEG channels × 125 EEG time points) and a low-speed EEG data matrix of shape (400 image conditions × 122 EEG channels × 125 EEG time points).
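A minimal sketch of this pipeline in MNE-Python is given below. The file name and event handling are assumptions (the description does not specify the on-disk format), and the final resampling step is an inference from the reported 125 time points per 500 ms epoch, since no resampling is mentioned in the text.

    # Sketch of the described preprocessing with MNE-Python (assumed paths/events).
    import mne

    raw = mne.io.read_raw("sub-01_ses-01_task-rsvp_eeg.vhdr", preload=True)  # hypothetical file
    raw.filter(l_freq=0.1, h_freq=100.0)  # band-pass 0.1-100 Hz
    raw.notch_filter(freqs=50.0)          # remove 50 Hz power-line noise

    # Epochs from 0 to 500 ms relative to stimulus onset; no baseline correction
    # or artefact rejection, matching the technical validation.
    events, event_id = mne.events_from_annotations(raw)
    epochs = mne.Epochs(raw, events, event_id=event_id,
                        tmin=0.0, tmax=0.5, baseline=None, preload=True)

    # 125 time points per 500 ms epoch implies an effective 250 Hz rate;
    # resampling is not described in the text, so this step is an assumption.
    epochs.resample(250)
    data = epochs.get_data()  # (n_image_conditions, n_channels, n_time_points)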