Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).
Training dataset: 64,665 sentence pairs Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As social media booms, abusive online practices such as hate speech have unfortunately increased as well. As letters are often repeated in words used to construct social media messages, these types of words should be eliminated or reduced in number to enhance the efficacy of hate speech detection. Although multiple models have attempted to normalize out-of-vocabulary (OOV) words with repeated letters, they often fail to determine whether the in-vocabulary (IV) replacement words are correct or incorrect. Therefore, this study developed an improved model for normalizing OOV words with repeated letters by replacing them with correct in-vocabulary (IV) replacement words. The improved normalization model is an unsupervised method that does not require the use of a special dictionary or annotated data. It combines rule-based patterns of words with repeated letters and the SymSpell spelling correction algorithm to remove repeated letters within the words by multiple rules regarding the position of repeated letters in a word, be it at the beginning, middle, or end of the word and the repetition pattern. Two hate speech datasets were then used to assess performance. The proposed normalization model was able to decrease the percentage of OOV words to 8%. Its F1 score was also 9% and 13% higher than the models proposed by two extant studies. Therefore, the proposed normalization model performed better than the benchmark studies in replacing OOV words with the correct IV replacement and improved the performance of the detection model. As such, suitable rule-based patterns can be combined with spelling correction to develop a text normalization model to correctly replace words with repeated letters, which would, in turn, improve hate speech detection in texts.
IMPORTANT! PLEASE READ DISCLAIMER BEFORE USING DATA. This dataset backcasts estimated modeled savings for a subset of 2007-2012 completed projects in the Home Performance with ENERGY STAR® Program against normalized savings calculated by an open source energy efficiency meter available at https://www.openee.io/. Open source code uses utility-grade metered consumption to weather-normalize the pre- and post-consumption data using standard methods with no discretionary independent variables. The open source energy efficiency meter allows private companies, utilities, and regulators to calculate energy savings from energy efficiency retrofits with increased confidence and replicability of results. This dataset is intended to lay a foundation for future innovation and deployment of the open source energy efficiency meter across the residential energy sector, and to help inform stakeholders interested in pay for performance programs, where providers are paid for realizing measurable weather-normalized results. To download the open source code, please visit the website at https://github.com/openeemeter/eemeter/releases
D I S C L A I M E R:
Normalized Savings using open source OEE meter. Several data elements, including, Evaluated Annual Elecric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), and Post-retrofit Usage Gas (MMBtu) are direct outputs from the open source OEE meter.
Home Performance with ENERGY STAR® Estimated Savings. Several data elements, including, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, and Estimated First Year Energy Savings represent contractor-reported savings derived from energy modeling software calculations and not actual realized energy savings. The accuracy of the Estimated Annual kWh Savings and Estimated Annual MMBtu Savings for projects has been evaluated by an independent third party. The results of the Home Performance with ENERGY STAR impact analysis indicate that, on average, actual savings amount to 35 percent of the Estimated Annual kWh Savings and 65 percent of the Estimated Annual MMBtu Savings. For more information, please refer to the Evaluation Report published on NYSERDA’s website at: http://www.nyserda.ny.gov/-/media/Files/Publications/PPSER/Program-Evaluation/2012ContractorReports/2012-HPwES-Impact-Report-with-Appendices.pdf.
This dataset includes the following data points for a subset of projects completed in 2007-2012: Contractor ID, Project County, Project City, Project ZIP, Climate Zone, Weather Station, Weather Station-Normalization, Project Completion Date, Customer Type, Size of Home, Volume of Home, Number of Units, Year Home Built, Total Project Cost, Contractor Incentive, Total Incentives, Amount Financed through Program, Estimated Annual kWh Savings, Estimated Annual MMBtu Savings, Estimated First Year Energy Savings, Evaluated Annual Electric Savings (kWh), Evaluated Annual Gas Savings (MMBtu), Pre-retrofit Baseline Electric (kWh), Pre-retrofit Baseline Gas (MMBtu), Post-retrofit Usage Electric (kWh), Post-retrofit Usage Gas (MMBtu), Central Hudson, Consolidated Edison, LIPA, National Grid, National Fuel Gas, New York State Electric and Gas, Orange and Rockland, Rochester Gas and Electric.
How does your organization use this dataset? What other NYSERDA or energy-related datasets would you like to see on Open NY? Let us know by emailing OpenNY@nyserda.ny.gov.
Splitgraph serves as an HTTP API that lets you run SQL queries directly on this data to power Web applications. For example:
See the Splitgraph documentation for more information.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
FlowMapper.org is a web-based framework for automated production and design of origin-destination flow maps. FlowMapper has four major features that contribute to the advancement of existing flow mapping systems. First, users can upload and process their own data to design and share customized flow maps. The ability to save data, cartographic design and map elements in a project file allows users to easily share their data and/or cartographic design with others. Second, users can generate customized flow symbols to support different flow map reading tasks such as comparing flow magnitudes and directions and identifying flow and location clusters that are strongly connected with each other. Third, FlowMapper supports supplementary layers such as node symbols, choropleth, and base maps to contextualize flow patterns with location references and characteristics. Finally, the web-based architecture of FlowMapper supports server-side computational capabilities to process and normalize large flow data and reveal natural patterns of flows.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This artifact accompanies the SEET@ICSE article "Assessing the impact of hints in learning formal specification", which reports on a user study to investigate the impact of different types of automated hints while learning a formal specification language, both in terms of immediate performance and learning retention, but also in the emotional response of the students. This research artifact provides all the material required to replicate this study (except for the proprietary questionnaires passed to assess the emotional response and user experience), as well as the collected data and data analysis scripts used for the discussion in the paper.
Dataset
The artifact contains the resources described below.
Experiment resources
The resources needed for replicating the experiment, namely in directory experiment:
alloy_sheet_pt.pdf: the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment. The sheet was passed in Portuguese due to the population of the experiment.
alloy_sheet_en.pdf: a version the 1-page Alloy sheet that participants had access to during the 2 sessions of the experiment translated into English.
docker-compose.yml: a Docker Compose configuration file to launch Alloy4Fun populated with the tasks in directory data/experiment for the 2 sessions of the experiment.
api and meteor: directories with source files for building and launching the Alloy4Fun platform for the study.
Experiment data
The task database used in our application of the experiment, namely in directory data/experiment:
Model.json, Instance.json, and Link.json: JSON files with to populate Alloy4Fun with the tasks for the 2 sessions of the experiment.
identifiers.txt: the list of all (104) available participant identifiers that can participate in the experiment.
Collected data
Data collected in the application of the experiment as a simple one-factor randomised experiment in 2 sessions involving 85 undergraduate students majoring in CSE. The experiment was validated by the Ethics Committee for Research in Social and Human Sciences of the Ethics Council of the University of Minho, where the experiment took place. Data is shared the shape of JSON and CSV files with a header row, namely in directory data/results:
data_sessions.json: data collected from task-solving in the 2 sessions of the experiment, used to calculate variables productivity (PROD1 and PROD2, between 0 and 12 solved tasks) and efficiency (EFF1 and EFF2, between 0 and 1).
data_socio.csv: data collected from socio-demographic questionnaire in the 1st session of the experiment, namely:
participant identification: participant's unique identifier (ID);
socio-demographic information: participant's age (AGE), sex (SEX, 1 through 4 for female, male, prefer not to disclosure, and other, respectively), and average academic grade (GRADE, from 0 to 20, NA denotes preference to not disclosure).
data_emo.csv: detailed data collected from the emotional questionnaire in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID) and the assigned treatment (column HINT, either N, L, E or D);
detailed emotional response data: the differential in the 5-point Likert scale for each of the 14 measured emotions in the 2 sessions, ranging from -5 to -1 if decreased, 0 if maintained, from 1 to 5 if increased, or NA denoting failure to submit the questionnaire. Half of the emotions are positive (Admiration1 and Admiration2, Desire1 and Desire2, Hope1 and Hope2, Fascination1 and Fascination2, Joy1 and Joy2, Satisfaction1 and Satisfaction2, and Pride1 and Pride2), and half are negative (Anger1 and Anger2, Boredom1 and Boredom2, Contempt1 and Contempt2, Disgust1 and Disgust2, Fear1 and Fear2, Sadness1 and Sadness2, and Shame1 and Shame2). This detailed data was used to compute the aggregate data in data_emo_aggregate.csv and in the detailed discussion in Section 6 of the paper.
data_umux.csv: data collected from the user experience questionnaires in the 2 sessions of the experiment, namely:
participant identification: participant's unique identifier (ID);
user experience data: summarised user experience data from the UMUX surveys (UMUX1 and UMUX2, as a usability metric ranging from 0 to 100).
participants.txt: the list of participant identifiers that have registered for the experiment.
Analysis scripts
The analysis scripts required to replicate the analysis of the results of the experiment as reported in the paper, namely in directory analysis:
analysis.r: An R script to analyse the data in the provided CSV files; each performed analysis is documented within the file itself.
requirements.r: An R script to install the required libraries for the analysis script.
normalize_task.r: A Python script to normalize the task JSON data from file data_sessions.json into the CSV format required by the analysis script.
normalize_emo.r: A Python script to compute the aggregate emotional response in the CSV format required by the analysis script from the detailed emotional response data in the CSV format of data_emo.csv.
Dockerfile: Docker script to automate the analysis script from the collected data.
Setup
To replicate the experiment and the analysis of the results, only Docker is required.
If you wish to manually replicate the experiment and collect your own data, you'll need to install:
A modified version of the Alloy4Fun platform, which is built in the Meteor web framework. This version of Alloy4Fun is publicly available in branch study of its repository at https://github.com/haslab/Alloy4Fun/tree/study.
If you wish to manually replicate the analysis of the data collected in our experiment, you'll need to install:
Python to manipulate the JSON data collected in the experiment. Python is freely available for download at https://www.python.org/downloads/, with distributions for most platforms.
R software for the analysis scripts. R is freely available for download at https://cran.r-project.org/mirrors.html, with binary distributions available for Windows, Linux and Mac.
Usage
Experiment replication
This section describes how to replicate our user study experiment, and collect data about how different hints impact the performance of participants.
To launch the Alloy4Fun platform populated with tasks for each session, just run the following commands from the root directory of the artifact. The Meteor server may take a few minutes to launch, wait for the "Started your app" message to show.
cd experimentdocker-compose up
This will launch Alloy4Fun at http://localhost:3000. The tasks are accessed through permalinks assigned to each participant. The experiment allows for up to 104 participants, and the list of available identifiers is given in file identifiers.txt. The group of each participant is determined by the last character of the identifier, either N, L, E or D. The task database can be consulted in directory data/experiment, in Alloy4Fun JSON files.
In the 1st session, each participant was given one permalink that gives access to 12 sequential tasks. The permalink is simply the participant's identifier, so participant 0CAN would just access http://localhost:3000/0CAN. The next task is available after a correct submission to the current task or when a time-out occurs (5mins). Each participant was assigned to a different treatment group, so depending on the permalink different kinds of hints are provided. Below are 4 permalinks, each for each hint group:
Group N (no hints): http://localhost:3000/0CAN
Group L (error locations): http://localhost:3000/CA0L
Group E (counter-example): http://localhost:3000/350E
Group D (error description): http://localhost:3000/27AD
In the 2nd session, likewise the 1st session, each permalink gave access to 12 sequential tasks, and the next task is available after a correct submission or a time-out (5mins). The permalink is constructed by prepending the participant's identifier with P-. So participant 0CAN would just access http://localhost:3000/P-0CAN. In the 2nd sessions all participants were expected to solve the tasks without any hints provided, so the permalinks from different groups are undifferentiated.
Before the 1st session the participants should answer the socio-demographic questionnaire, that should ask the following information: unique identifier, age, sex, familiarity with the Alloy language, and average academic grade.
Before and after both sessions the participants should answer the standard PrEmo 2 questionnaire. PrEmo 2 is published under an Attribution-NonCommercial-NoDerivatives 4.0 International Creative Commons licence (CC BY-NC-ND 4.0). This means that you are free to use the tool for non-commercial purposes as long as you give appropriate credit, provide a link to the license, and do not modify the original material. The original material, namely the depictions of the diferent emotions, can be downloaded from https://diopd.org/premo/. The questionnaire should ask for the unique user identifier, and for the attachment with each of the depicted 14 emotions, expressed in a 5-point Likert scale.
After both sessions the participants should also answer the standard UMUX questionnaire. This questionnaire can be used freely, and should ask for the user unique identifier and answers for the standard 4 questions in a 7-point Likert scale. For information about the questions, how to implement the questionnaire, and how to compute the usability metric ranging from 0 to 100 score from the answers, please see the original paper:
Kraig Finstad. 2010. The usability metric for user experience. Interacting with computers 22, 5 (2010), 323–327.
Analysis of other applications of the experiment
This section describes how to replicate the analysis of the data collected in an application of the experiment described in Experiment replication.
The analysis script expects data in 4 CSV files,
https://www.nconemap.gov/pages/termshttps://www.nconemap.gov/pages/terms
The imagery has a pixel resolution of 6 inches with a RMSE of 1.0 ft X and Y. Processing has been minimized to preserve the ability to use this data for raster analysis.The normalized difference vegetation index (NDVI) is useful for measuring the quality, density, and amount of vegetation in a particular area. It is a single band dataset that represents vegetation health, based on the difference between the red and near infrared bands. Red and orange pixels represent areas with no vegetation. Yellow pixels represent areas with low to moderate vegetation. Green pixels represent areas with high vegetation density and health. This data is provided as a web service only (no download).To view the latest imagery for any location in the state, customers should always use the "Orthoimagery_Latest" image service which can be found at https://nconemap.gov.To find specific dates the images were captured use the imagery dates app or download the data.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).
Training dataset: 64,665 sentence pairs Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.