Construal level theory proposes that events that are temporally proximate are represented more concretely than events that are temporally distant. We tested this prediction using two large natural language text corpora. In study 1 we examined posts on Twitter that referenced the future, and found that tweets mentioning temporally proximate dates used more concrete words than those mentioning distant dates. In study 2 we obtained all New York Times articles that referenced U.S. presidential elections between 1987 and 2007. We found that the concreteness of the words in these articles increased with the temporal proximity to their corresponding election. Additionally, the reduction in concreteness after the election was much greater than the increase in concreteness leading up to the election, though both changes in concreteness were well described by an exponential function. We replicated this finding with New York Times articles referencing US public holidays. Overall, our results provide strong support for the predictions of construal level theory, and additionally illustrate how large natural language datasets can be used to inform psychological theory.

This network project brings together economists, psychologists, computer and complexity scientists from three leading centres for behavioural social science at Nottingham, Warwick and UEA. This group will lead a research programme with two broad objectives: to develop and test cross-disciplinary models of human behaviour and behaviour change; and to draw out their implications for the formulation and evaluation of public policy. Foundational research will focus on three inter-related themes: understanding individual behaviour and behaviour change; understanding social and interactive behaviour; and rethinking the foundations of policy analysis. The project will explore implications of the basic science for policy via a series of applied projects connecting naturally with the three themes. These will include: the determinants of consumer credit behaviour; the formation of social values; and strategies for evaluation of policies affecting health and safety. The research will integrate theoretical perspectives from multiple disciplines and utilise a wide range of complementary methodologies including: theoretical modeling of individuals, groups and complex systems; conceptual analysis; lab and field experiments; and analysis of large data sets. The Network will promote high quality cross-disciplinary research and serve as a policy forum for understanding behaviour and behaviour change.

Experimental data. In study 1, we collected and analyzed millions of time-indexed posts on Twitter. In this study we obtained a large number of tweets that referenced dates in the future, and used these tweets to determine the concreteness of the language used to describe events at those dates. This allowed us to observe how psychological distance influences everyday discourse, and to put the key assumptions of CLT to a real-world test. In study 2, we analyzed word concreteness in news articles using the New York Times (NYT) Annotated Corpus (Sandhaus, 2008). This corpus contains over 1.8 million NYT articles written between 1987 and 2007. Importantly for our purposes, the articles are tagged with keywords describing their topics. In this study we obtained all NYT articles written before and after the 1988, 1992, 1996, 2000, and 2004 US presidential elections that were tagged as pertaining to those elections.
We subsequently tested how the concreteness of the words used in the articles varied as a function of temporal distance to the election they reference. We also performed this analysis with NYT articles referencing three popular public holidays. Unlike study 1 and prior work (such as Snefjella & Kuperman, 2015), study 2 allowed us to examine the influence of temporal distance in the past and in the future, while controlling for the exact time when specific events occurred.
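As an illustration of this kind of analysis, the sketch below scores documents by mean word concreteness and fits an exponential change with temporal distance. The lexicon entries, document texts, distances, and the specific exponential form are all illustrative assumptions, not the authors' actual pipeline or data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical lexicon: word -> concreteness rating (e.g., on a 1-5 scale).
concreteness = {"ballot": 4.6, "vote": 4.0, "democracy": 1.8, "tomorrow": 2.5}

def doc_concreteness(text):
    """Mean concreteness of the in-lexicon words of a document."""
    ratings = [concreteness[w] for w in text.lower().split() if w in concreteness]
    return float(np.mean(ratings)) if ratings else np.nan

# Hypothetical documents and their temporal distance (days) to the referenced event.
days = np.array([1.0, 7.0, 30.0, 90.0, 365.0])
docs = ["ballot vote", "vote tomorrow", "democracy vote", "democracy tomorrow", "democracy"]
scores = np.array([doc_concreteness(d) for d in docs])

# Assumed exponential relationship between concreteness and temporal distance.
def exp_decay(t, a, b, c):
    return a * np.exp(-b * t) + c

params, _ = curve_fit(exp_decay, days, scores, p0=(2.0, 0.05, 2.0), maxfev=10000)
print("fitted (a, b, c):", params)
```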
On 2018 February 4.41, the All-Sky Automated Survey for SuperNovae (ASAS-SN) discovered ASASSN-18bt in the K2 Campaign 16 field. With a redshift of z = 0.01098 and a peak apparent magnitude of B_max = 14.31, ASASSN-18bt is the nearest and brightest Type Ia supernova (SN Ia) yet observed by the Kepler spacecraft. Here we present the discovery of ASASSN-18bt, the K2 light curve, and prediscovery data from ASAS-SN and the Asteroid Terrestrial-impact Last Alert System. The K2 early-time light curve has an unprecedented 30-minute cadence and photometric precision for an SN Ia light curve, and it unambiguously shows a ~4 day nearly linear phase followed by a steeper rise. Thus, ASASSN-18bt joins a growing list of SNe Ia whose early light curves are not well described by a single power law. We show that a double-power-law model fits the data reasonably well, hinting that two physical processes must be responsible for the observed rise. However, we find that current models of the interaction with a nondegenerate companion predict an abrupt rise and cannot adequately explain the initial, slower linear phase. Instead, we find that existing published models with shallow ⁵⁶Ni are able to span the observed behavior and, with tuning, may be able to reproduce the ASASSN-18bt light curve. Regardless, more theoretical work is needed to satisfactorily model this and other early-time SN Ia light curves. Finally, we use Swift X-ray nondetections to constrain the presence of circumstellar material (CSM) at much larger distances and lower densities than is possible with the optical light curve. For a constant-density CSM, these nondetections constrain the density to ρ < 4.5×10⁵ cm⁻³ at a radius of 4×10¹⁵ cm from the progenitor star. Assuming a wind-like environment, we place mass-loss limits of dM/dt < 8×10⁻⁶ M☉/yr for v_w = 100 km/s, ruling out some symbiotic progenitor systems. This work highlights the power of well-sampled early-time data and the need for immediate multiband, high-cadence follow-up for progress in understanding SNe Ia.
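The double-power-law rise mentioned above can be written as the sum of two power-law components, each switching on at its own start time. The sketch below is illustrative only; the parameter values are made up and are not the published fit to ASASSN-18bt.

```python
import numpy as np

def double_power_law(t, a1, t1, n1, a2, t2, n2):
    """Sum of two power-law components, each contributing only after its start time."""
    f1 = a1 * np.where(t > t1, t - t1, 0.0) ** n1
    f2 = a2 * np.where(t > t2, t - t2, 0.0) ** n2
    return f1 + f2

# Illustrative parameters: a nearly linear component from t = 0 days, followed by a
# steeper (roughly quadratic) component switching on about 4 days later.
t = np.linspace(0.0, 10.0, 101)
flux = double_power_law(t, a1=0.3, t1=0.0, n1=1.0, a2=0.1, t2=4.0, n2=2.0)
print(flux[::20])  # coarse view of the rise (arbitrary flux units)
```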
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
# Title
Interview-Based Stress Assessment Dataset
# Overview
The dataset includes stress evaluations (6 grades) assessed through interviews with 50 Japanese workers (49 completed follow-up), as well as self-reported stress, demographic attributes, and personality information measured at baseline and at the one-month follow-up.
# Data Source
Interviews were conducted between December 2022 and January 2023. The average follow-up period was 34.2 days.
The main variable was the interview-based stress evaluation; self-reported stress (stress load, mental symptoms, and physical symptoms from the Brief Job Stress Questionnaire), well-being (life satisfaction and happiness), and burnout were measured at baseline and one month later. Interview-based stress evaluations were made by two occupational health professionals in addition to an evaluation by the interviewer, a psychologist.
# Data Description
## Main variables: total (time 1 self-reported stress), burnout, wellbeing, meanStressEv (mean of the overall stress ratings of the interviewer and two evaluators), T2_loadAll, T2_mental, T2_physical, T2_burnout, and T2_wellbeing
no: Record number or identifier.
age: Age of the individual in years.
gender: Gender of the individual. Possible values include 'male', 'female', etc.
height_cm: Height of the individual in centimeters.
weight_kg: Weight of the individual in kilograms.
BMI: Body Mass Index, calculated based on height and weight.
drinking_freq: Frequency of alcohol consumption. Example values might be 'daily', 'weekly', 'monthly', etc.
smoking_habits: Smoking habits of the individual. Possible values include 'smoker', 'non-smoker', etc.
money_spending_hobby: Attitude towards spending money on hobbies. Describes how much an individual spends on their hobbies.
employment_status: Current employment status. Possible values include 'employed', 'unemployed', 'self-employed', etc.
full_time: employment_status
part_time: employment_status
discretionary: employment_status
side_job: This variable likely indicates whether the individual has a side job in addition to their primary employment. The values could be binary (yes/no) or provide more detail about the nature of the side job.
work_type: This variable probably categorizes the type of work the individual is engaged in. It could include categories such as 'full-time', 'part-time', 'contract', 'freelance', etc.
fixedHours: This variable might indicate whether the individual's work schedule has fixed hours. It could be a binary variable (yes/no) indicating the presence or absence of a fixed work schedule.
rotationalShifts: This variable likely denotes whether the individual works in rotational shifts. It could be a binary (yes/no) variable or provide details on the shift rotation pattern.
flexibleShifts: This variable possibly reflects if the individual has flexible shift options in their work. This could involve varying start and end times or the ability to switch shifts.
flexTime: This variable might indicate the presence of 'flextime' in the individual's work arrangement, allowing them to choose their working hours within certain limits.
adjustableWorkHours: This variable probably denotes whether the individual has the ability to adjust their work hours, suggesting a degree of flexibility in their work schedule.
discretionaryWork: This variable could indicate whether the individual's work involves a degree of discretion or autonomy in decision-making or task execution.
nightShift: This variable likely indicates if the individual works night shifts. It could be a simple binary (yes/no) or provide details about the frequency or regularity of night shifts.
remote_work_freq: This variable probably measures the frequency of remote work. It could include categories like 'never', 'sometimes', 'often', or 'always'.
primary_job_industry: This variable likely categorizes the industry sector of the individual's primary job. It could include sectors like 'technology', 'healthcare', 'education', 'finance', etc.
ind: industry
ind.manu–ind.gove: binary coding of industry
primary_job_role: This variable likely represents the specific role or position held by the individual in their primary job. It could include titles like 'manager', 'engineer', 'teacher', etc.
job: job
job.admi–job.carClPa: binary coding of job
job_duration_years: This variable probably indicates the duration of the individual's current job in years. It typically measures the length of time an individual has been in their current job role.
years: Without additional context, this variable could represent various time-related aspects, such as years of experience in a particular field, age in years, or years in a specific role. It generally signifies a duration or period in years.
months: Similar to 'years', this variable could refer to a duration in months. It might represent age in months (for younger individuals), months of experience, or months spent in a current role or activity.
job_duration_months: This variable is likely to indicate the total duration of the individual's current job in months. It's a more precise measure compared to 'job_duration_years', especially for shorter employment periods.
working_days_per_week: This variable probably denotes the number of days the individual works in a typical week. It helps to understand the work pattern, whether it's a standard five-day workweek or otherwise.
work_hours_per_day: This variable likely measures the average number of hours the individual works each day. It can be used to assess work-life balance and overall workload.
job_workload: This variable might represent the overall workload associated with the individual's job. This could be subjective (based on the individual's perception) or objective (based on quantifiable measures like hours worked or tasks completed).
job_qualitative_load: This variable likely assesses the qualitative aspects of the job's workload, such as the level of mental or emotional stress, complexity of tasks, or level of responsibility.
job_control: This variable probably measures the degree of control or autonomy the individual has in their job. It could assess how much freedom they have in making decisions, planning their work, or the flexibility in how they perform their duties.
hirou_1–hirou_7: Working Conditions of Fatigue Accumulation Checklist
hirou_kinmu: Sum of Working Conditions of Fatigue Accumulation Checklist
WH_1–WH_2: Items related to workaholic
workaholic: Sum of items related to workaholic
WE_1–WE_3: Items related to work engagement
engagement: Sum of items related to work engagement
relationship_stress: This variable likely measures stress stemming from personal relationships, possibly including family, romantic partners, or friends.
future_uncertainty_stress: This variable probably captures stress related to uncertainties about the future, such as career prospects, financial stability, or life goals.
discrimination_stress: This variable indicates stress experienced due to discrimination, possibly based on factors like race, gender, age, or other personal characteristics.
financial_stress: This variable measures stress related to financial matters, such as income, expenses, debt, or overall financial security.
health_stress: This variable likely assesses stress concerning personal health or the health of loved ones.
commuting_stress: This variable measures stress associated with daily commuting, such as traffic, travel time, or transportation issues.
irregular_lifestyle: This variable probably indicates the presence of an irregular lifestyle, potentially including erratic sleep patterns, eating habits, or work schedules.
living_env_stress: This variable likely measures stress related to the living environment, which could include housing conditions, neighborhood safety, or noise levels.
unrewarded_efforts: This variable probably assesses feelings of stress or dissatisfaction due to efforts that are perceived as unrewarded or unacknowledged.
other_stressors: This variable might capture additional stress factors not covered by other specific variables.
coping: This variable likely assesses the individual's coping mechanisms or strategies in response to stress.
support: This variable measures the level of support the individual perceives or receives, possibly from friends, family, or professional services.
weekday_bedtime: This variable likely indicates the typical bedtime of the individual on weekdays.
weekday_wakeup: This variable represents the typical time the individual wakes up on weekdays.
holiday_bedtime: This variable indicates the typical bedtime of the individual on holidays or non-workdays.
holiday_wakeup: This variable measures the typical wake-up time of the individual on holidays or non-workdays.
avg_sleep_duration: This variable likely represents the average duration of sleep the individual gets, possibly averaged over a certain period.
weekday_bedtime_posix: This variable might represent the weekday bedtime in POSIX time format.
weekday_wakeup_posix: Similar to bedtime, this represents the weekday wakeup time in POSIX time format.
holiday_bedtime_posix: This variable likely indicates the holiday bedtime in POSIX time format.
holiday_wakeup_posix: This represents the holiday wakeup time in POSIX time format.
weekday_bedtime_posix_hms: This variable could be the weekday bedtime in POSIX time format, specifically in hours, minutes, and seconds.
weekday_wakeup_posix_hms: This variable might represent the weekday wakeup time in POSIX time format in hours, minutes, and seconds.
holiday_bedtime_posix_hms: The holiday bedtime in POSIX time format, detailed to hours, minutes, and seconds.
holiday_wakeup_posix_hms: The holiday wakeup time in POSIX time format, in hours, minutes, and seconds.
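A minimal sketch of working with these variables in pandas, assuming the records are shipped as a CSV with the column names from the data dictionary above (the file name and the HH:MM bedtime format are assumptions):

```python
import pandas as pd

# Hypothetical file name; columns follow the data dictionary above.
df = pd.read_csv("stress_assessment.csv")

# Recompute BMI from height (cm) and weight (kg) as a consistency check.
df["BMI_check"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# One way to derive a numeric bedtime (seconds since midnight) from an HH:MM string,
# analogous to the *_posix_hms columns.
bedtime = pd.to_datetime(df["weekday_bedtime"], format="%H:%M", errors="coerce")
df["weekday_bedtime_sec"] = bedtime.dt.hour * 3600 + bedtime.dt.minute * 60

# Change in self-reported stress load between baseline ("total") and the one-month follow-up.
df["delta_load"] = df["T2_loadAll"] - df["total"]
print(df[["meanStressEv", "delta_load"]].describe())
```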
Since the publication of the Adverse Outcome Pathway (AOP) for skin sensitization, there have been many efforts to develop systematic approaches to integrate the information generated from different key events for decision making. The types of information characterizing key events in an AOP can be generated from in silico, in chemico, in vitro or in vivo approaches. Integration of this information and its interpretation for decision making are known as integrated approaches to testing and assessment, or IATA. One such IATA, published by Jaworska et al. (2013), describes a Bayesian network model known as ITS-2. The current work evaluated the performance of ITS-2 using a stratified cross-validation approach. We also characterized the impact of refinements to the network by replacing its most significant component, the output from the commercial expert system TIMES-SS, with structural alert information readily generated from the freely available OECD QSAR Toolbox. Lacking any structural alert flags or TIMES-SS predictions, the network yielded a sensitization potential predictivity of 79% (+3%/-4%). If the TIMES-SS prediction was replaced by an indicator for the presence of a structural alert, the network predictivity increased to 84% (+2%/-4%), which was only slightly less than that of the original network (89% ±2%). The local applicability domain of the original ITS-2 network was also evaluated using reaction mechanistic domains to better understand which types of chemicals ITS-2 made the best predictions for, i.e. a local validity domain analysis. We ultimately found that the original network was successful at predicting which chemicals would be sensitizers, but not at predicting their relative potency.

This dataset is associated with the following publication: Fitzpatrick, J., and G. Patlewicz. Application of IATA - A case study in evaluating the global and local performance of a Bayesian network model for skin sensitization. SAR and QSAR in Environmental Research, 28(4): 297-310 (2017).
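The stratified cross-validation procedure used to evaluate predictivity can be illustrated generically as below (scikit-learn with synthetic data and a placeholder classifier; this is not the ITS-2 Bayesian network itself):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical feature matrix X (key-event readouts) and binary sensitizer labels y.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = rng.integers(0, 2, size=120)

# Stratified folds preserve the sensitizer / non-sensitizer ratio in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```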
A dataset within the Harmonized Database of Western U.S. Water Rights (HarDWR). For a detailed description of the database, please see the meta-record v2.0.

Changelog

v2.0
- Recalculated based on data sourced from WestDAAT
- Changed from using a Site ID column to identify unique records to using a combination of Site ID and Allocation ID
- Removed the Water Management Area (WMA) column from the harmonized records. The replacement is a separate file which stores the relationship between allocations and WMAs. This allows allocations to contribute water right amounts to multiple WMAs during the subsequent cumulative process.
- Added a column describing a water right's legal status
- Added "Unspecified" as a water source category
- Added an acre-foot (AF) column
- Added a column for the classification of the right's owner

v1.02
- Added a .RData file to the dataset as a convenience for anyone exploring our code. This is an internal file, and the one referenced in analysis scripts, as the data objects are already in R data objects.

v1.01
- Updated the names of each file with an ID number less than 3 digits to include leading 0s

v1.0
- Initial public release

Description

Here we present an updated database of Western U.S. water right records. This database provides consistent unique identifiers for each water right record, and a consistent categorization scheme that puts each water right record into one of seven broad use categories. These data were instrumental in conducting a study of the multi-sector dynamics of inter-sectoral water allocation changes through water markets (Grogan et al., in review). Specifically, the data were formatted for use as input to a process-based hydrologic model, Water Balance Model (WBM), with a water rights module (Grogan et al., in review). While this specific study motivated the development of the database presented here, water management in the U.S. West is a rich area of study (e.g., Anderson and Woosly, 2005; Tidwell, 2014; Null and Prudencio, 2016; Carney et al., 2021), so releasing this database publicly with documentation and usage notes will enable other researchers to do further work on water management in the U.S. West. We produced the water rights database presented here in four main steps: (1) data collection, (2) data quality control, (3) data harmonization, and (4) generation of cumulative water rights curves. Each of steps (1)-(3) had to be completed in order to produce (4), the final product that was used in the modeling exercise in Grogan et al. (in review). All data in each step are associated with a spatial unit called a Water Management Area (WMA), which is the unit of water right administration used by the state from which the right came. Steps (2) and (3) required us to make assumptions and interpretations, and to remove records from the raw data collection. We describe each of these assumptions and interpretations below so that other researchers can choose to implement alternative assumptions and interpretations as fits their research aims.

Motivation for Changing Data Sources

The most significant change has been a switch from collecting the raw water rights directly from each state to using the water rights records presented in WestDAAT, a product of the Water Data Exchange (WaDE) Program under the Western States Water Council (WSWC). One of the main reasons for this is that each state of interest is a member of the WSWC, meaning that WaDE is partially funded by these states, as well as many universities.
As WestDAAT is also a database with consistent categorization, it has allowed us to spend less time on data collection and quality control and more time on answering research questions. This has included records from water right sources we had previously not known about when creating v1.0 of this database. The only major downside to using the WestDAAT records as our raw data is that further updates are tied to when WestDAAT is updated, as some states update their public water right records daily. However, as our focus is on cumulative water amounts at the regional scale, it is unlikely that most record updates would have a significant effect on our results. The structure of WestDAAT led to several important changes to how HarDWR is formatted. The most significant change is that WaDE has calculated a field known as SiteUUID
, which is a unique identifier for the Point of Diversion (POD), or where the water is drawn from. This is separate from AllocationNativeID
, which is the identifier for the allocation of water, or the amount of water associated with the water right. It should be noted that it is possible for a single site to have multiple allocations associated with it, and for an allocation to be extracted from multiple sites. The site-allocation structure has allowed us to adopt a more consistent, and hopefully more realistic, approach to organizing the water right records than we had with HarDWR v1.0. This was incredibly helpful, as the raw data from many states had multiple water uses within a single field within a single row, and it was not always clear whether the first water use was the most important or simply first alphabetically. WestDAAT has already addressed this data quality issue. Furthermore, with v1.0, when there were multiple records with the same water right ID, we selected the largest volume or flow amount and disregarded the rest. As WestDAAT was already a common structure for disparate data formats, we were better able to identify sites with multiple allocations and, perhaps more importantly, allocations with multiple sites. This is particularly helpful when an allocation has sites which cross WMA boundaries: instead of just assigning the full water amount to a single WMA, we are now able to divide the amount of water between the relevant WMAs. As it is now possible to identify allocations with water used in multiple WMAs, it is no longer practical to store this information within a single column. Instead, the stAllocationToWMATab.csv file was created, which is an allocation-by-WMA matrix containing the percent Place of Use area overlap with each WMA. We then use this percentage to divide the allocation's flow amount between the given WMAs during the cumulation process, to hopefully provide more realistic totals of water use in each area. However, not every state provides areas of water use, so, as in HarDWR v1.0, a hierarchical decision tree was used to assign each allocation to a WMA. First, if a WMA could be identified based on the allocation ID, then that WMA was used; typically, when available, this applied to the entire state and no further steps were needed. Second was the spatial analysis of Place of Use to WMAs. Third was a spatial analysis of the POD locations to WMAs, with the assumption that an allocation's POD is within the WMA it should belong to; if an allocation still had multiple WMAs based on its POD locations, then the allocation's flow amount was divided equally between all WMAs. The fourth, and final, process was to include water allocations which spatially fell outside of the state WMA boundaries. This could be due to several reasons, such as coordinate errors / imprecision in the POD location, imprecision in the WMA boundaries, or rights attached to features, such as a reservoir, which cross state boundaries. To include these records, we decided that any POD within one kilometer of the state's edge would be assigned to the nearest WMA.

Other Changes WestDAAT Has Allowed

In addition to a more nuanced and consistent method of assigning water rights data to WMAs, there are other benefits gained from using the WestDAAT dataset. Among these is a consistent categorization of a water right's legal status. In HarDWR v1.0, legal status was effectively ignored, which led to many valid concerns about the quality of the database related to the amounts of water the rights allowed to be claimed.
The main issue was that rights with legal statuses such as "application withdrawn", "non-active", or "cancelled" were included within HarDWR v1.0. These and other water right statuses which were deemed not to be in use have been removed from this version of the database. Another major change has been the addition of the "Unspecified" water source category. This is water that can come from either surface water or groundwater, or the source of which is unknown. The addition of this source category brings the total number of categories to three. Due to reviewer feedback, we decided to add the acre-foot (AF) column so that the data may be more applicable to a wider audience. We added the ownerClassification column for the same reason.

File Descriptions

The dataset is a series of files organized by state sub-directories. In addition, each file begins with the state's name, in case the file is separated from its sub-directory for some reason. After the state name is text which describes the contents of the file. Each file is described in detail below. Note that st is a placeholder for the state's name.

stFullRecords_HarmonizedRights.csv: A file of the complete water records for each state. The column headers for this type of file are:

state - The name of the state to which the allocations belong.

FIPS - The two-digit numeric state ID code.

siteID - The site location ID for POD locations. A site may have multiple allocations, which are the actual amounts of water which can be drawn. In a simplified hypothetical, a farmstead may have an allocation for "irrigation" and an allocation for "domestic" water use, but the water is drawn from the same pumping equipment. It should be noted that many of the site IDs appear to have been added by WaDE, and therefore may not be recognized by a given state's water rights database.

allocationID - The allocation ID for the water right. For most states this is the water right ID, and what is
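A rough sketch of the allocation-to-WMA apportionment described above, assuming stAllocationToWMATab.csv stores the percent Place-of-Use overlap of each allocation with each WMA (the flow and overlap column names here are assumptions, not the documented schema):

```python
import pandas as pd

# Harmonized rights and the allocation-by-WMA overlap table named in the text.
rights = pd.read_csv("stFullRecords_HarmonizedRights.csv")   # includes allocationID and a flow amount
overlap = pd.read_csv("stAllocationToWMATab.csv")            # allocationID, WMA, pct_overlap (assumed names)

# Divide each allocation's flow amount between WMAs in proportion to area overlap.
merged = overlap.merge(rights[["allocationID", "flow"]], on="allocationID")
merged["flow_in_wma"] = merged["flow"] * merged["pct_overlap"] / 100.0

# Cumulative water amount per WMA.
print(merged.groupby("WMA")["flow_in_wma"].sum())
```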
Audio and video recordings of experimental/read and spontaneous speech from adult speakers of Porteño Spanish in Argentina. Speakers are 18-69 years old and from two geographic areas. For the intonational experiments, there are audio recordings only, whereas some of the free interviews and map tasks feature video recordings. The material used as stimuli in the experiments is available with references encoded in the transcriptions. The Hamburg Corpus of Argentinean Spanish (HaCASpa) was compiled in December 2008 and November/December 2009 within the context of the research project The intonation of Spanish in Argentina (H9, director: Christoph Gabriel), part of the Collaborative Research Centre "Multilingualism", funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) and hosted by the University of Hamburg. It comprises data from two varieties of Argentinean Spanish, i.e. a) the dialect spoken in the capital, Buenos Aires (also called Porteño, derived from puerto 'harbor'), and b) the variety of the Neuquén/Comahue area (Northern Patagonia). The seven parts of HaCASpa correspond to the seven tasks described below in more detail: five experiments were carried out in order to elicit specific data for research in prosody (Tasks 1–5); in addition, several speakers took part in a free interview (Task 6) and a map task experiment (Task 7). The Task is encoded as a metadata attribute for each communication. HaCASpa comprises three different types of spoken data, depending on the Task, i.e. spontaneous, semi-spontaneous, and scripted speech. This information corresponds to the metadata attribute Speech type. The regional dimension of the corpus is represented through the attribute Area (i.e. Buenos Aires or Neuquén/Comahue), its diachronic dimension through the attribute Age group (i.e. Under 25/Over 25). The subjects are 60 native speakers of the relevant variety of Argentinean Spanish, i.e. Buenos Aires (Porteño) or Neuquén/Comahue Spanish. For each speaker, the following information is available: Age, Education, Occupation, Year of school enrollment, Year of school graduation and Parents' mother tongue. The current version 0.2 contains mainly orthographic transcriptions of verbal behaviour (141,000 transcribed words) and codes that relate utterances to the materials used for the experimental tasks. Experimental design: Task (1) consists of two subparts: reading a story (1a) and retelling it (1b). For (1a), the subjects were asked to read the short story "The North Wind and the Sun", which was presented on a computer screen, two times. The fable is well known for its use in phonetic descriptions of different languages (see Handbook of the International Phonetic Association, International Phonetic Association. Cambridge: Cambridge University Press, 2005); the Latin American version we used in our data stems from the Dialectoteca del español (coordination: C.-E. Piñeros). For (1b), the speakers were instructed to retell the story in their own words without being able to consult the text. With the help of these two parts, data of scripted (part 1a) as well as of semi-spontaneous speech (part 1b) could be collected. Task (2) was designed to collect data of semi-spontaneous speech by asking the subjects to answer questions pertaining to a given picture story. In a first step, the speakers were familiarized with the story, which was presented as two pictures displayed on a computer screen.
In a second step, they were asked to answer specific questions about the story. The questions were also presented on the computer screen and varied in their design in order to elicit answers with different information-structural readings (such as broad vs. narrow focus or different focus types). In general, the speakers were free to answer as they wished. However, in order to avoid single word answers, they were asked to utter complete sentences. Task (3) consisted of reading question-answer pairs, the content of which was based on the picture stories already familiar from task (2). The answers were given together with the questions on the computer screen (i.e. one question / one answer) and the speakers simply had to read both the question and the answer. Task (4) was a reading task in which the subjects were asked to utter 10 simple subject-verb-object (SVO) sentences, presented on a computer screen. The speakers were instructed to read them at both normal and fast speech rate. Along the lines proposed in D´Imperio et al. 2005 ("Intonational Phrasing in Romance: The Role of Syntactic and Prosodic Structure", in: Prosodies: With Special Reference to Iberian Languages, ed. by Frota, S. et al., Berlin: Mouton de Gruyter, 59-97), the subject and object constituents differed in their syntactic and prosodic complexity (e.g. determiner plus noun or determiner plus noun plus adjective and one or three prosodic words, respectively). The participants were instructed to read the sentences as if they contained new information. The complete experiment design is described in Gabriel, C. et al. 2011 ("Prosodic phrasing in Porteño Spanish", in: Intonational Phrasing in Romance and Germanic: Cross-Linguistic and Bilingual Studies, ed. by Gabriel, C. & Lleó, C., Amsterdam: Benjamins, 153-182). Task (5), the so-called intonation survey, consisted of 48 situations designed to elicit various intonational contours with specific pragmatic meanings. In this inductive method, the researcher confronts the speaker with a series of hypothetical situations to which he or she is supposed to react verbally. In the Argentinean version of the questionnaire, the hypothetical situations were illustrated by appropriate pictures. The experimental design is described in more detail in Prieto, P. & Roseano, P. 2010 (eds). Transcription of Intonation of the Spanish Language. Munich: Lincom; see also the Interactive atlas of Spanish intonation (coordination: P. Prieto & P. Roseano). Task (6) was conducted to collect spontaneous speech data by conducting free interviews. In this task, the subjects were asked to tell the interviewer something about a past experience, be it a vacation or memories of Argentina as it was decades ago. Even though the interviewer was still part of the conversation, it was mainly the subjects who spoke during the recordings. Task (7) consists of Map Task dialogs. Map Task is a technique employed to collect data of spontaneous speech in which two subjects cooperate to complete a specified task. It is designed to lead the subjects to produce particular interrogative patterns. Each of the two subjects receives a map of an imaginary town marked with buildings and other specific elements. A route is marked on the map of one of the two participants, who assumes the role of the instruction-giver. The version of the same map given to the other participant, who assumes the role of the instruction-follower, differs from that of the instruction-giver in that it does not show the route to be followed. 
The instruction-follower therefore must ask the instruction-giver questions in order to be able to reproduce the same route on his or her own map (see also the Interactive atlas of Spanish intonation). CLARIN Metadata summary for Hamburg Corpus of Argentinean Spanish (HaCASpa) (CMDI-based) Title: Hamburg Corpus of Argentinean Spanish (HaCASpa) Description: Audio and video recordings of experimental/read and spontaneous speech from adult speakers of Porteño Spanish in Argentina. Speakers are 18-69 years old and from two geographic areas. For the intonational experiments, there are audio recordings only, whereas some of the free interviews and map tasks feature video recordings. The material used as stimuli in the experiments is available with references encoded in the transcriptions. Publication date: 2011-06-30 Data owner: Christoph Gabriel, Institut für Romanistik / Von-Melle-Park 6 / D-20146 Hamburg, christoph.gabriel@uni-hamburg.de Contributors: Christoph Gabriel, Institut für Romanistik / Von-Melle-Park 6 / D-20146 Hamburg, christoph.gabriel@uni-hamburg.de (compiler) Project: H9 "The intonation of Spanish in Argentina", German Research Foundation (DFG) Keywords: contact variety, cross-sectional data, regional variety, language contact, EXMARaLDA Language: Spanish (spa) Size: 63 speakers (39 female, 24 male), 259 communications, 261 recordings, 1119 minutes, 261 transcriptions, 141321 words Annotation types: transcription (manual): mainly orthographic, project-specific conventions, code: reference to underlying prompts Temporal Coverage: 2008-11-01/2009-12-01 Spatial Coverage: Buenos Aires, AR; Neuquén/Comahue, AR Genre: discourse Modality: spoken
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This GIS database is a generalized land cover database designed for Regional Planning with a land use component used for forecasts and modeling at ARC. LandPro should not be taken out of its Regional context, though county-level or municipal-level analysis may be useful for transportation, environmental and land use planning.
Description

This layer was developed by the Research & Analytics Division of the Atlanta Regional Commission and is a generalized land cover database designed for regional planning with a land use component used for forecasts and modeling at ARC. LandPro2012 should not be taken out of its regional context, though county-level or municipal-level analysis may be useful for transportation, environmental and land use planning. LandPro2012 is ARC's land use/land cover GIS database for the 21-county Atlanta Region (Cherokee, Clayton, Cobb, DeKalb, Douglas, Fayette, Fulton, Gwinnett, Henry, Rockdale, the EPA non-attainment (8hr standard) counties of Carroll, Coweta, Barrow, Bartow, Forsyth, Hall, Newton, Paulding, Spalding and Walton, and Dawson, which will become part of the 2010 Urbanized Area). LandPro2012 was created by on-screen photo-interpretation and digitizing of ortho-rectified aerial photography. The primary sources for this GIS database were the local parcels and the 2009 true color imagery with 1.64-foot pixel resolution, provided by Aerials Express, Inc. 2010 is the first year we have used parcel data to help more accurately delineate the LandPro categories. For ArcGIS 10 users: see full metadata by enabling FGDC metadata in ArcCatalog via Customize > ArcCatalog Options > Metadata (tab). Though the terms are often used interchangeably, land use and land cover are not synonymous. Land cover generally refers to the natural or cultivated vegetation, rock, or water covering the land, as well as the developed surface which can be identified on aerial photography. Land use generally refers to the way that humans use or will use the land, regardless of its apparent land cover. Collateral data for the land cover mapping effort included the Aero Surveys of Georgia street atlas, the Georgia Department of Community Affairs (DCA) Community Facilities database and the USGS Digital Raster Graphics (DRGs) of 1:24,000 scale topographic maps. The land use component of this database was added after the land cover interpretation was completed, and is based primarily on ownership information provided by the 21 counties and the City of Atlanta for larger tracts of undeveloped land that meet the land use definition of "Extensive Institutional" or "Park Lands" (refer to the Code Descriptions and Discussion section below). Although some of the boundaries of these tracts may align with visible features from the aerial photography, these areas are generally "non-photo-identifiable," and thus require other sources for accurate identification. The land use/cover classification system is adapted from the USGS (Anderson) classification system, incorporating a mix of level I, II and III classes. There are a total of 25 categories in ARC's land use/cover system (described below), 2 of which are used only for land use designations: Park Lands (Code 175) and Extensive Institutional (Code 125). The other 23 categories can describe land use and/or land cover, and in most cases will be the same. The LU code will differ from the LC code only where the Park Lands (Code 175) and Extensive Institutional (Code 125) land holdings have been identified from collateral sources of land ownership. Although similar to previous eras of ARC land use/cover databases developed before 1999 (1995, 1990 etc.), "LandPro" differs in many significant ways. Originally, ARC's land use and land cover database was built from 1975 data compiled by USGS at scales of 1:100,000 and, selectively, 1:24,000.
The coverage was updated in 1990 using SPOT satellite imagery and low-altitude aerial photography and again in 1995 using 1:24,000 scale panchromatic aerial photography. Unlike these previous 5-year updates, the 1999, 2001, 2003, 2005, 2007, 2008 and 2009 LandPro databases were compiled at a larger scale (1:14,000) and do not directly reflect pre-1999 delineations. In addition, all components of LandPro were produced using digital orthophotos for on-screen photo-interpretation and digitizing, thus eliminating the use of unrectified photography and the need for data transfer and board digitizing. As a result, the positional accuracy of LandPro is much higher than in previous eras. There have also been some changes relative to the classification system used prior to 1999. Previously, three categories of Forest (41-deciduous, 42-coniferous, and 43-mixed forest) were used; this version does not distinguish between coniferous and deciduous forest, thus Code 40 is used to simply designate Forest. Likewise, two categories of Wetlands (61-forested wetland, and 62-non-forested wetland) were used before; this version does not distinguish between forested and non-forested wetlands, thus Code 60 is used to simply designate Wetlands. With regard to Wetlands, the boundaries themselves are now based on the National Wetlands Inventory (NWI) delineations along with the CIR imagery. Furthermore, Code 51 has been renamed "Rivers" from "Streams and Canals" and represents the Chattahoochee and Etowah Rivers which have been identified in the land use/cover database. In addition to these changes, Code 52 has been dropped from the system as there are no known instances of naturally occurring lakes in the Region. Finally, the land use code for Park Lands has been changed from 173 to 175 so as to minimize confusion with the Parks land cover code, 173. There has been a change in the agriculture classification for LandPro2005 and any LandPro datasets hereafter. Previously, four categories of agriculture (21 - agriculture-cropland and pasture, 22 - agriculture - orchards, 23 - agriculture - confined feeding operations and 24 - agriculture - other) were used; this version does not distinguish between the different agricultural lands. Code 20 is now used to designate agriculture. Due to new technology and the enhancements to this database, direct comparison between LandPro99, LandPro2001, LandPro2003, LandPro2005 and all successive updates is now possible, with the 1999 database serving as ARC's new baseline. Please note that as a result of the 2003 mapping effort, LandPro2001 has been adjusted for better comparison to LandPro2003 and is named "LandPro01_adj." Likewise, LandPro99 was previously adjusted when LandPro2001 was completed, but was not further adjusted following the 2003 update. Although some adjustments were originally made to the 1995 land use/cover database for modeling applications, direct comparisons to previous versions of ARC land use/cover before 1999 should be avoided in most cases. The 2010 update has moved away from using the (1:14,000) scale, as will any future updates. Due to the use of local parcels, we have begun to snap LandPro boundaries to the parcel data, making a more accurate dataset. The major change in this update was to make residential areas reflect modern zoning codes more closely. Due to these changes you will no longer be able to compare this dataset to previous years. High density (113) has changed from lots below .25 acres to lots of .25 acres and smaller.
Medium density (112) has changed from .25 to 2 acre lots, to .26 to 1 acre lots. Low density has changed from 2 to 5 acre lots to 1.1 to 2 acre lots. It must be noted that in the 2010 update, the old acreage standards are still reflected in the low density category. This will be corrected in the 2011 and 2012 updates. The main focus of the 2010 update was to make sure LandPro's residential areas reflected the local parcels and to change LandPro based on the parcel acreage. DeKalb is the only county not corrected at this time because no parcels were available. Future updates will consist of, but are not limited to, reclassifying areas in 111 that do not meet the new acreage standards, delineating and reclassifying cell towers, substations and transmission lines/power cuts from TCU (14) to a subset of this (142), reclassifying airports as 141 from TCU, and reclassifying landfills from urban other (17) to 174. Other changes are delineating more roads other than just Limited Access Highways, making sure parks match the already existing land use parks layer, and beginning to differentiate office from commercial and commercial/industrial.

Classification System:

111: Low Density Single Family Residential - Houses on 1.1 - 2 acre lots. Though 2010 still reflects the old standard of lots up to 5 acres.

112: Medium Density Single Family Residential - These areas usually occur in urban or suburban zones and are generally characterized by houses on .26 to 1 acre lots. This category accounts for the majority of residential land use in the Region and includes a wide variety of neighborhood types.

113: High Density Residential - Areas that have predominantly been developed for concentrated single family residential use. These areas occur almost exclusively in urban neighborhoods with streets on a grid network, and are characterized by houses on lots .25 acre or smaller but may also include mixed residential areas with duplexes and small apartment buildings.

117: Multifamily Residential - Residential areas comprised predominantly of apartment, condominium and townhouse complexes where net density generally exceeds eight units per acre. Typical apartment buildings are relatively easy to identify, but some high rise structures may be interpreted as, or combined with, office buildings, though many of these dwellings were identified and delineated in downtown and midtown for the first time with the 2003 update. Likewise, some smaller apartments and townhouses may be interpreted as, or combined with, medium- or high-density single family residential. Housing on military bases, campuses, resorts, agricultural properties and construction work sites is
Knowledge about the processes of perception and understanding is of paramount importance for designing means of communication like maps and charts. This is especially the case if one does not want to lose sight of the map-user and if map design is to be oriented toward the map-user's needs and preferences in order to improve the cartographic product's usability. A scientific approach to visualization can help to achieve usable results. The insights achieved by such an approach can lead to modes of visualization that are superior, in utility and efficiency, to those that have seemingly proved their value in practice (so-called "best practices"). This thesis shows this by using the example of visualizing the limits of bodies of water in the Southern Ocean. After some introductory remarks in chapter one on the chosen mode of problem solution, which simultaneously illustrate the flow of work while working on the problem, chapter two outlines the relevant information concerning the drawing of limits in the Southern Ocean.

Chapter 3 builds the theoretical framework, which is a multidisciplinary approach to representation. This theoretical framework is based on "How Maps Work" by the American cartographer MacEachren (1995/2004). His "scientific approach to visualization" is amended and adjusted, where necessary, by knowledge gained from recent findings in the social sciences. The approach suggested in this thesis thus represents a synergy of psychology, sociology, semiotics, linguistics, communication theory and cartography. It follows the tradition of interdisciplinary research that crosses the boundaries of a single scientific subject. The resulting holistic approach can help to improve the usability of cartographic products. On the one hand, it illustrates the processes taking place while perceiving and recognizing cartographic information, so-called bottom-up processes. On the other hand, it illuminates the processes that happen while understanding this information, so-called top-down processes. Bottom-up and top-down processes are interdependent and inseparably interrelated and therefore cannot be understood without each other. Regarding aspects of usability, the approach suggested in this thesis strongly focuses on the map-user. This is the reason why the phenomenon of communication gains more weight than in MacEachren's map-centered approach.

Because of this, in chapter 4 a holistic approach to communication is developed. This approach makes clear that only the map-user can evaluate the usability of a cartographic product: only if the user can extract the information relevant to them from the cartographic product is it really usable. The concept of communication is well suited to capture this. In the case of the visualization of the limits of bodies of water in the Southern Ocean, which is not complex enough to illustrate all results of the theoretical considerations, it is suggested to visualize the limits with red lines. This suggestion deviates from the commonly used mode of visualization. The thesis thereby shows how theory is able to improve practice.

Chapter 5 leads back to the task of fixing the limits of the bodies of water in the area of concern. A convention of the International Hydrographic Organization (IHO) states that those limits should be drawn using meridians, parallels, rhumb lines and bathymetric data.
Based on the available bathymetric data, both a representation model and a process model are calculated, which should support the drawing of the limits. The quality of both models, which depends on the quality of the bathymetric data at hand, leads to the decision that the representation model is better suited to support the drawing of limits. The dataset provides the limits in shapefile format and links to an overview map in bitmap format.
This dataset package is focused on U.S. construction materials and three construction companies: Cemex, Martin Marietta & Vulcan.
In this package, SpaceKnow tracks manufacturing and processing facilities for construction material products all over the US. By tracking these facilities, we are able to give you near-real-time data on spending on these materials, which helps to predict residential and commercial real estate construction and spending in the US.
The dataset includes 40 indices focused on asphalt, cement, concrete, and building materials in general. You can look forward to receiving country-level and regional data (activity in the North, East, West, and South of the country) and the aforementioned company data.
SpaceKnow uses synthetic aperture radar (SAR) satellite data to capture activity at building material manufacturing and processing facilities in the US.
Data is updated daily, has an average lag of 4-6 days, and history back to 2017.
The insights provide you with level and change data for refineries, storage, manufacturing, logistics, and employee parking-based locations.
SpaceKnow offers 3 delivery options: CSV, API, and Insights Dashboard
Available Indices

Companies:

Cemex (CX): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates

Martin Marietta (MLM): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates

Vulcan (VMC): Construction Materials (covers all manufacturing facilities of the company in the US), Concrete, Cement (refinery and storage) indices, and aggregates
USA Indices:
Aggregates USA
Asphalt USA
Cement USA
Cement Refinery USA
Cement Storage USA
Concrete USA
Construction Materials USA
Construction Mining USA
Construction Parking Lots USA
Construction Materials Transfer Hub US
Cement - Midwest, Northeast, South, West
Cement Refinery - Midwest, Northeast, South, West
Cement Storage - Midwest, Northeast, South, West
Why get SpaceKnow's U.S Construction Materials Package?
Monitor Construction Market Trends: Near-real-time insights into the construction industry allow clients to understand and anticipate market trends better.
Track Company Performance: Monitor operational activities, such as the volume of sales
Assess Risk: Use satellite activity data to assess the risks associated with investing in the construction industry.
Index Methodology Summary Continuous Feed Index (CFI) is a daily aggregation of the area of metallic objects in square meters. There are two types of CFI indices; CFI-R index gives the data in levels. It shows how many square meters are covered by metallic objects (for example employee cars at a facility). CFI-S index gives the change in data. It shows how many square meters have changed within the locations between two consecutive satellite images.
How to interpret the data SpaceKnow indices can be compared with the related economic indicators or KPIs. If the economic indicator is in monthly terms, perform a 30-day rolling sum and pick the last day of the month to compare with the economic indicator. Each data point will reflect approximately the sum of the month. If the economic indicator is in quarterly terms, perform a 90-day rolling sum and pick the last day of the 90-day to compare with the economic indicator. Each data point will reflect approximately the sum of the quarter.
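For example, in pandas the 30-day rolling sum sampled at month end could look like the sketch below (the series here is synthetic and the column name is an assumption):

```python
import pandas as pd

# Hypothetical daily CFI-S series indexed by date.
idx = pd.date_range("2023-01-01", "2023-06-30", freq="D")
daily = pd.Series(range(len(idx)), index=idx, name="cfi_s")

# 30-day rolling sum, then keep only month-end values for comparison
# with a monthly economic indicator.
rolling = daily.rolling(window=30).sum()
monthly_proxy = rolling[rolling.index.is_month_end]
print(monthly_proxy)
```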
Where the data comes from SpaceKnow brings you the data edge by applying machine learning and AI algorithms to synthetic aperture radar and optical satellite imagery. The company’s infrastructure searches and downloads new imagery every day, and the computations of the data take place within less than 24 hours.
In contrast to traditional economic data, which are released in monthly and quarterly terms, SpaceKnow data is high-frequency and available daily. It is possible to observe the latest movements in the construction industry with just a 4-6 day lag, on average.
The construction materials data help you to estimate the performance of the construction sector and the business activity of the selected companies.
The foundation of delivering high-quality data is based on the success of defining each location to observe and extract the data. All locations are thoroughly researched and validated by an in-house team of annotators and data analysts.
See below how our Construction Materials index performs against the US Non-residential construction spending benchmark
Each individual location is precisely defined to avoid noise in the data, which may arise from traffic or changing vegetation due to seasonal reasons.
SpaceKnow uses radar imagery and its own unique algorithms, so the indices do not lose their significance in bad weather conditions such as rain or heavy clouds.
→ Reach out to get free trial
...
Motivation
This dataset is derived and cleaned from the full PULSE project dataset to share with others data gathered about the users during the project.
Disclaimer
Any third party needs to respect ethics rules and the GDPR and must mention “PULSE DATA H2020 - 727816” in any dissemination activities related to the data being exploited. A link to the associated project website should also be provided: http://www.project-pulse.eu/
The data provided in the files is provided as is. Despite our best efforts at filtering out potential issues, some information could be erroneous.
Description of the dataset
The only difference from the original dataset is that the user information has been anonymised.
The dataset content is described in a dedicated JSON file:
{
"citizen_id": "pseudonymized unique key of each citizen user in the PULSE system",
"city_code": {
"description": "3-letter city codes taken by convention from IATA codebook of airports and metropolitan areas, as the codebook of global cities in most common and widespread use and therefore adopted as standard in PULSE (since there is currently - in the year 2020 - still no relevant ISO or other standardized codebook of cities uniformly globally adopted and used). Exception is Pavia which does not have its own airport,and nearby Milan/Bergamo airports are not applicable, so the 'PAI' internal code (not existing in original IATA codes) has been devised in PULSE. For cities with multiple airports, IATA metropolitan area codes are used (New York, Paris).",
"BCN": "Barcelona",
"BHX": "Birmingham",
"NYC": "New York",
"PAI": "Pavia",
"PAR": "Paris",
"SIN": "Singapore",
"TPE": "Keelung(Taipei)"
},
"zip_code": "Zip or postal code (area) within a city, basic default granular territorial/administrative subdivision unit for localization of citizen users by place of residence (in all PULSE cities)",
"models": {
"asthma_risk_score": "PULSE asthma risk consensus model score, decimal value ranging from 0 to 1",
"asthma_risk_score_category": {
"description": "Categorized value of the PULSE asthma risk consensus model score, with the following possible category options:",
"low": "low asthma risk, score value below 0,05",
"medium-low": "medium-low asthma risk, score value from 0,05 and below 0,1",
"medium": "medium asthma risk, score value from 0,1 and below 0,15",
"medium-high": "medium-high asthma risk, score value from 0,15 and below 0,2",
"high": "high asthma risk, score value from 0,2 and higher"
},
"T2D_risk_score": "PULSE diabetes type 2 (T2D) risk consensus model score, decimal value ranging from 0 to 1",
"T2D_risk_score_category": {
"description": "Categorized value of the PULSE diabetes type 2 risk consensus model score, with the following possible category options:",
"low": "low T2D risk, score value below 0,05",
"medium-low": "medium-low T2D risk, score value from 0,05 and below 0,1",
"medium": "medium T2D risk, score value from 0,1 and below 0,15",
"medium-high": "medium-high T2D risk, score value from 0,15 and below 0,2",
"high": "high T2D risk, score value from 0,2 and below 0,25",
"very_high": "very high T2D risk, score value from 0,25 and higher"
},
"well-being_score": "PULSE well-being model score, decimal value ranging from -5 to 5",
"well-being_score_category": {
"description": "Categorized value of the PULSE well-being model score, with the following possible category options:",
"low": "low well-being, score value below -0,37",
"medium-low": "medium-low well-being, score value from -0,37 and below 0,04",
"medium-high": "medium-high well-being, score value from 0,04 and below 0,36",
"high": "high well-being, score value from 0,36 and higher"
},
"computed_time": "Timestamp (UTC) when each relevant model score value/result had been computed or derived"
}
}
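For illustration, the following minimal Python sketch (hypothetical, not part of the PULSE dataset or its tooling) shows how the documented thresholds map a model score to its category label:

```python
# Illustrative helper (hypothetical, not part of the PULSE dataset): maps the
# model scores described above to category labels using the documented thresholds.
def categorize(score, thresholds, top):
    """Return the label of the first band whose upper bound exceeds the score."""
    for upper, label in thresholds:
        if score < upper:
            return label
    return top

# Threshold bands taken from the field descriptions above.
ASTHMA = [(0.05, "low"), (0.10, "medium-low"), (0.15, "medium"), (0.20, "medium-high")]
T2D = [(0.05, "low"), (0.10, "medium-low"), (0.15, "medium"),
       (0.20, "medium-high"), (0.25, "high")]
WELL_BEING = [(-0.37, "low"), (0.04, "medium-low"), (0.36, "medium-high")]

print(categorize(0.12, ASTHMA, top="high"))       # -> medium
print(categorize(0.26, T2D, top="very_high"))     # -> very_high
print(categorize(0.40, WELL_BEING, top="high"))   # -> high
```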
Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the code for Relevance and Redundancy ranking (RaR), an efficient filter-based feature ranking framework that evaluates relevance based on multi-feature interactions and redundancy on mixed datasets. Source code is in .scala and .sbt format and metadata in .xml, all of which can be accessed and edited in standard, openly accessible text editors. Diagrams are in the openly accessible .png format.
Supplementary_2.pdf: contains the results of experiments on multiple classifiers, along with parameter settings and a description of how KLD converges to mutual information based on its symmetry.
dataGenerator.zip: synthetic data generator inspired by the NIPS Workshop on Variable and Feature Selection (2001), http://www.clopinet.com/isabelle/Projects/NIPS2001/
rar-mfs-master.zip: Relevance and Redundancy Framework containing the overview diagram, example datasets, source code and metadata. Details on installing and running are provided below.
Background. Feature ranking is beneficial for gaining knowledge and identifying the relevant features in a high-dimensional dataset. However, in several datasets, some features by themselves may have little correlation with the target classes, yet become strongly correlated with the target when combined with other features; that is, multiple features exhibit interactions among themselves. It is necessary to rank the features based on these interactions for better analysis and classifier performance, but evaluating these interactions on large datasets is computationally challenging. Furthermore, datasets often have features with redundant information, and using such redundant features hinders both the efficiency and the generalization capability of the classifier. The major challenge is to efficiently rank the features based on relevance and redundancy on mixed datasets. In the related publication, we propose a filter-based framework based on Relevance and Redundancy (RaR). RaR computes a single score that quantifies feature relevance by considering interactions between features and redundancy. The top-ranked features of RaR are characterized by maximum relevance and non-redundancy. The evaluation on synthetic and real-world datasets demonstrates that our approach outperforms several state-of-the-art feature selection techniques.
# Relevance and Redundancy Framework (rar-mfs)
rar-mfs is an algorithm for feature selection and can be employed to select features from labelled data sets. The Relevance and Redundancy Framework (RaR), which is the theory behind the implementation, is a novel feature selection algorithm that
- works on large data sets (polynomial runtime),
- can handle differently typed features (e.g. nominal features and continuous features), and
- handles multivariate correlations.
## Installation
The tool is written in Scala and uses the Weka framework to load and handle data sets. You can either run it independently, providing the data as an .arff or .csv file, or you can include the algorithm as a (maven/ivy) dependency in your project. As an example data set we use heart-c.
### Project dependency
The project is published to Maven Central (link). To depend on the project use:
- maven:
```xml
<dependency>
  <groupId>de.hpi.kddm</groupId>
  <artifactId>rar-mfs_2.11</artifactId>
  <version>1.0.2</version>
</dependency>
```
- sbt:
```sbt
libraryDependencies += "de.hpi.kddm" %% "rar-mfs" % "1.0.2"
```
To run the algorithm use:
```scala
import java.io.File
import de.hpi.kddm.rar._
// ...
val dataSet = de.hpi.kddm.rar.Runner.loadCSVDataSet(new File("heart-c.csv"), isNormalized = false, "")
val algorithm = new RaRSearch(
  HicsContrastPramsFA(numIterations = config.samples, maxRetries = 1, alphaFixed = config.alpha, maxInstances = 1000),
  RaRParamsFixed(k = 5, numberOfMonteCarlosFixed = 5000, parallelismFactor = 4))
algorithm.selectFeatures(dataSet)
```
### Command line tool
- EITHER download the prebuilt binary, which requires only an installation of a recent Java version (>= 6):
  1. download the prebuilt jar from the releases tab (latest)
  2. run `java -jar rar-mfs-1.0.2.jar --help`

  Using the prebuilt jar, here is an example usage:
  ```sh
  rar-mfs > java -jar rar-mfs-1.0.2.jar arff --samples 100 --subsetSize 5 --nonorm heart-c.arff
  Feature Ranking:
    1 - age (12)
    2 - sex (8)
    3 - cp (11)
    ...
  ```
- OR build the repository on your own:
  1. make sure sbt is installed
  2. clone the repository
  3. run `sbt run`

  Simple example using sbt directly after cloning the repository:
  ```sh
  rar-mfs > sbt "run arff --samples 100 --subsetSize 5 --nonorm heart-c.arff"
  Feature Ranking:
    1 - age (12)
    2 - sex (8)
    3 - cp (11)
    ...
  ```
### [Optional]
To speed up the algorithm, consider using a fast solver such as Gurobi (http://www.gurobi.com/). Install the solver and put the provided gurobi.jar into the Java classpath.
## Algorithm
### Idea
Abstract overview of the different steps of the proposed feature selection algorithm: https://github.com/tmbo/rar-mfs/blob/master/docu/images/algorithm_overview.png ("Algorithm Overview" diagram)
The Relevance and Redundancy ranking framework (RaR) is a method able to handle large-scale data sets and data sets with mixed features. Instead of directly selecting a subset, a feature ranking gives a more detailed overview of the relevance of the features. The method consists of a multistep approach where we
1. repeatedly sample subsets from the whole feature space and examine their relevance and redundancy: exploration of the search space to gather more and more knowledge about the relevance and redundancy of features,
2. deduce scores for features based on the scores of the subsets, and
3. create the best possible ranking given the sampled insights.
### Parameters
| Parameter | Default value | Description |
| --------- | ------------- | ----------- |
| m - contrast iterations | 100 | Number of different slices to evaluate while comparing marginal and conditional probabilities |
| alpha - subspace slice size | 0.01 | Percentage of all instances to use as part of a slice which is used to compare distributions |
| n - sampling iterations | 1000 | Number of different subsets to select in the sampling phase |
| k - sample set size | 5 | Maximum size of the subsets to be selected in the sampling phase |