Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The data come from a hypothetical email campaign in which 3 different Subject Lines were served over a three-month period. Three tables are provided for you to explore.
For this campaign, only emails opened on the Sent Date count as valid responses (i.e., Sent_Date should equal Responded_Date).
Build a model that predicts the Open Rate from the customer's attributes and the SubjectLine_ID received!
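A minimal sketch of one possible workflow, assuming hypothetical file and column names (customers.csv, sends.csv, responses.csv, Send_ID, Customer_ID); only Sent_Date, Responded_Date, and SubjectLine_ID come from the description above, and per-send open probability is used as a stand-in for Open Rate:

```python
# Minimal sketch, not a reference solution. File and column names other than
# Sent_Date, Responded_Date, and SubjectLine_ID are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

customers = pd.read_csv("customers.csv")
sends = pd.read_csv("sends.csv")
responses = pd.read_csv("responses.csv")

# Keep only valid responses: the email was opened on the day it was sent.
responses = responses[responses["Sent_Date"] == responses["Responded_Date"]]

# Label each send: 1 if it has a valid same-day response, 0 otherwise.
sends["opened"] = sends["Send_ID"].isin(responses["Send_ID"]).astype(int)

# Join customer attributes and one-hot encode categorical columns,
# including the subject line that was served.
data = sends.merge(customers, on="Customer_ID", how="left")
X = pd.get_dummies(data.drop(columns=["opened", "Send_ID", "Customer_ID", "Sent_Date"]))
y = data["opened"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = HistGradientBoostingClassifier().fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```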
The Delay Discounting task is the most widely used paradigm to measure the capacity to wait for a hypothetical monetary reward in children between 8-18 years old. A child is given a series of choices between a variable immediate monetary reward and 10 Euros after a certain delay. The delay of the 10 Euro reward varies between 2, 30, 180, or 365 days. Each trial starts with the question of whether the child would rather have a specific immediate reward now, or 10 Euros after a specific delay. Based on the choices of the child, the task determines an indifference point per delay, that is, the immediate reward at which it has the same subjective value as the 10 Euros at that delay. The different delays are presented in random order, as are the immediate rewards. Based on the decision of the child, the immediate reward is adapted on the next trial of that specific delay following a mathematical model until the indifference point is reached (Richards et al., 1999). The total number of trials depends on the behavior of the child. The task lasts 5 minutes on average.
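For intuition, a simplified sketch of an adjusting-amount procedure of this general kind: the immediate amount moves up after a delayed choice and down after an immediate choice, with a shrinking step, until it converges on the indifference point for one delay. The step-halving rule below is an illustrative assumption, not the exact adaptive model of Richards et al. (1999).

```python
# Simplified adjusting-amount sketch; the step-halving rule is an illustrative
# assumption, not the exact algorithm of Richards et al. (1999).
def estimate_indifference_point(choose_immediate, delayed_amount=10.0, n_trials=8):
    """choose_immediate(immediate) -> True if the child picks the immediate reward."""
    immediate = delayed_amount / 2.0   # start halfway between 0 and 10 Euros
    step = delayed_amount / 4.0
    for _ in range(n_trials):
        if choose_immediate(immediate):
            immediate -= step          # immediate preferred: offer less next time
        else:
            immediate += step          # delayed preferred: offer more next time
        step /= 2.0                    # narrow in on the indifference point
    return immediate

# Example: a child who values the delayed 10 Euros at 6 Euros for this delay.
# In the real task, this is repeated for each delay (2, 30, 180, 365 days).
subjective_value = 6.0
ip = estimate_indifference_point(lambda imm: imm > subjective_value)
print(f"Estimated indifference point: {ip:.2f} Euros")
```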
polyOne Data Set
The data set contains 100 million hypothetical polymers, each with 29 properties predicted using machine learning models. We use PSMILES strings to represent polymer structures, see here and here. The polymers are generated by decomposing previously synthesized polymers into unique chemical fragments. Random and enumerative compositions of these fragments yield 100 million hypothetical PSMILES strings. All PSMILES strings are chemically valid polymers, but most have never been synthesized before. More information can be found in the paper. Please note the license agreement in the LICENSE file.
Full data set including the properties
The data files are in Apache Parquet format and follow the naming pattern polyOne_*.parquet.
I recommend using dask (pip install dask) to load and process the data set. Pandas also works but is slower.
Load sharded data set with dask
```python
import dask.dataframe as dd

ddf = dd.read_parquet("*.parquet", engine="pyarrow")
```

For example, compute the description of the data set:

```python
df_describe = ddf.describe().compute()
df_describe
```
PSMILES strings only
generated_polymer_smiles_train.txt - 80 million PSMILES strings for training polyBERT. One string per line.
generated_polymer_smiles_dev.txt - 20 million PSMILES strings for testing polyBERT. One string per line.
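Since the text files hold one PSMILES string per line, they can be streamed without loading everything into memory; a small sketch:

```python
# Stream a PSMILES text file line by line (one polymer string per line).
from itertools import islice

def iter_psmiles(path):
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:                 # skip any blank lines
                yield line

# Example: peek at the first few strings without loading all 80 million lines.
first_five = list(islice(iter_psmiles("generated_polymer_smiles_train.txt"), 5))
print(first_five)
```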
A MODFLOW-NWT (version 1.0.9) model of a hypothetical stream-aquifer system is presented for the evaluation and characterization of capture map bias. The hypothetical model is a single-layer model with 30 rows and 100 columns. The hypothetical model is used to develop methods to create capture difference maps and to calculate and characterize capture map bias with sensitivity analyses. The hypothetical stream-aquifer system generally represents the arid western U.S.; however, the methods developed with this model can be applied to any model used to generate capture maps, regardless of location. The model simulates a hypothetical 200-year transient stress period after a steady-state stress period. This time period loosely represents 1960 – 2160; however, actual time is not represented. The MODFLOW model includes a stream (represented with the MODFLOW SFR or CHD Packages), mountain block recharge (represented with the MODFLOW RCH Package), and evapotranspiration (represented with the MODFLOW EVT Package). This USGS data release contains all of the input and output files for the simulations described in the associated journal article (https://doi.org/10.1111/gwat.12597).
The use of real decision-making incentives remains under debate after decades of economic experiments. In time preferences experiments involving future payments, real incentives are particularly problematic due to between-options differences in transaction costs, among other issues. What if hypothetical payments provide accurate data which, moreover, avoid transaction cost problems? In this paper, we test whether the use of hypothetical or one-out-of-ten-participants probabilistic—versus real—payments affects the elicitation of short-term and long-term discounting in a standard multiple price list task. We analyze data from a lab experiment in Spain and well-powered field and online experiments in Nigeria and the UK, respectively (N = 2,043). Our results indicate that the preferences elicited using the three payment methods are mostly the same: we can reject that either hypothetical or one-out-of-ten payments change any of the four preference measures considered by more than 0.18 SD wit...
This data release contains extent shapefiles for 16 hypothetical slope failure scenarios for a landslide complex at Barry Arm, western Prince William Sound, Alaska. The landslide is likely active due to debuttressing from the retreat of Barry Glacier (Dai and others, 2020) and sits above Barry Arm, posing a tsunami risk in the event of slope failure (Barnhart and others, 2021). Since discovery of the landslide by a citizen scientist in 2020, kinematic structural elements have been mapped (Coe and others, 2020) and ground-based and satellite synthetic aperture radar (SAR) have been used to track ongoing movement at a high spatial resolution (Schaefer and others, 2020; Schaefer and others, 2022). These efforts have revealed complex, zonal movement, the mechanisms of which remain unknown. To support hazard assessment, we constructed 16 different failure scenarios. The scenarios are all based on structural elements and/or remotely sensed evidence of motion but are also intended to cover a range of shapes and volumes of material such that different modes of failure and subsequent tsunami wave behavior can be modeled. Extents are presented in ESRI shapefile (.shp) format. Each shapefile has a Slip Angle field and a Sequence field. The Slip Angle field records the horizontal direction of failure (0 degrees = north). In some cases, a multi-phase failure is delineated, e.g., where the failure of part of the landslide might destabilize an additional upslope component. In these instances, an ordinal sequence of failure is specified in the Sequence field. These extents were manually digitized on lidar (1-meter horizontal resolution; Daanen and others, 2021) and SAR imagery (2–3-meter horizontal resolution; Schaefer and others, 2020; Schaefer and others, 2022) to align with either mapped kinematic components of the landslide or clear edges of motion identified by coherent synthetic aperture radar signals. As such, they are subjective, based on expert opinion and current best available data.
References Cited
Barnhart, K.R., Jones, R.P., George, D.L., Coe, J.A., and Staley, D.M., 2021, Preliminary assessment of the wave generating potential from landslides at Barry Arm, Prince William Sound, Alaska: U.S. Geological Survey Open-File Report 2021–1071, 28 p., https://doi.org/10.3133/ofr20211071.
Coe, J.A., Wolken, G.J., Daanen, R.P., and Schmitt, R.G., 2021, Map of landslide structures and kinematic elements at Barry Arm, Alaska in the summer of 2020: U.S. Geological Survey data release, https://doi.org/10.5066/P9EUCGJQ.
Daanen, R.P., Wolken, G.J., Wikstrom Jones, Katreen, and Herbst, A.M., 2021, Lidar-derived elevation data for upper Barry Arm, Southcentral Alaska, June 26, 2020: Alaska Division of Geological & Geophysical Surveys Raw Data File 2021-1, 9 p., https://doi.org/10.14509/30589.
Dai, C., Higman, B., Lynett, P.J., Jacquemart, M., Howat, I.M., Liljedahl, A.K., Dufresne, A., Freymueller, J.T., Geertsema, M., Ward Jones, M. and Haeussler, P.J., 2020, Detection and assessment of a large and potentially‐tsunamigenic periglacial landslide in Barry Arm, Alaska: Geophysical Research Letters, 47(22), e2020GL089800, https://doi.org/10.1029/2020GL089800.
Schaefer, L.N., Coe, J.A., Godt, J.W., and Wolken, G.J., 2020, Interferometric synthetic aperture radar data from 2020 for landslides at Barry Arm Fjord, Alaska: U.S. Geological Survey data release, https://doi.org/10.5066/P9Z04LNK.
Schaefer, L.N., Coe, J.A., and Wolken, G.J., 2022, Interferometric synthetic aperture radar data from 2021 for landslides at Barry Arm Fjord, Alaska: U.S. Geological Survey data release, https://doi.org/10.5066/P9QJ8IO4.
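Returning to the shapefile attributes described above, a hedged sketch of how the Slip Angle and Sequence fields could be inspected with geopandas; the file name is a placeholder, and the actual attribute field names in the release may be abbreviated differently (shapefile field names are often truncated to 10 characters):

```python
# Hedged sketch: read one failure-scenario extent and inspect its attributes.
# "scenario_01.shp" and the exact field names are placeholder assumptions.
import geopandas as gpd

extent = gpd.read_file("scenario_01.shp")
print(extent.columns)                      # confirm the real field names first
print(extent[["Slip_Angle", "Sequence"]])  # failure direction (deg from north) and order

# For a multi-phase failure, process the polygons in their specified order.
for _, row in extent.sort_values("Sequence").iterrows():
    print(f"Phase {row['Sequence']}: slip toward {row['Slip_Angle']} degrees, "
          f"area {row.geometry.area:.0f} square map units")
```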
Six hypothetical 1-dimensional models are used to verify and demonstrate new unsaturated-zone heat transport functionality added to MT3D-USGS (version 1.1.0). Because the governing equations describing groundwater solute transport and heat transport have a similar form, MT3D-USGS may be applied to heat transport problems. Published examples of MT3DMS, from which MT3D-USGS is derived, as a heat transport modeling tool have previously been limited to the saturated zone. However, with the publication of MT3D-USGS, which added unsaturated-zone solute transport capabilities, some additional support (i.e., new source code) is necessary to enable its use as a heat transport simulator where the unsaturated zone also is going to be modeled. The first five scenarios represent a 30 meter thick unsaturated zone. Of these, the first scenario is referred to as a "quasi-steady-state" model because it divides a 16-year simulation period into 4 steady flow periods. In other words, this proof-of-concept model is set up using a free drainage lower boundary and transient stress periods that repeat the same boundary conditions over a four-year period before changing to a new set of steady flow and transport conditions that are subsequently repeated for another four-year period. This approach was adopted because the period of transition from one steady flow period to the next is of most interest in this scenario. Scenarios 2 through 5 simulate a 100-year transient period, wherein both the infiltration and the temperature assigned to the infiltration vary daily. Scenario 2 serves as the baseline to which scenarios 3, 4, and 5 are compared. Briefly, scenarios 3, 4, and 5 investigate how warm-up at the model surface, intended to represent warmer atmospheric temperatures, manifests at different depths within the unsaturated zone. Scenarios 3, 4, and 5 vary in how they represent warm-up at land surface. Scenario 3 applies a gradual 3 degree Celsius (C) warm-up between the twentieth and fiftieth years. Scenarios 4 and 5 apply 'shock' (instantaneous) warm-ups of 1.5 and 3.0 degrees C, respectively, at the twentieth year. The sixth scenario returns to the scenario 1 setup, but instead of applying a free drainage condition at the bottom of the active model domain, it uses a specified-head boundary condition that fixes a water table at 12 m deep. The purpose of including the water table is to investigate the interplay between convection, conduction, and dispersion in the presence of a water table. All of the 1D models use grid cells that are 1 meter on a side and 15 cm thick. Rainfall and the temperature assigned to the rainfall are simulated with the unsaturated-zone flow (UZF1) and unsaturated-zone transport (UZT) packages, respectively. All six scenarios are simulated first with MODFLOW-NWT and then with MT3D-USGS, but the latter model does not simulate groundwater flow or variably-saturated flow. Thus, MODFLOW-NWT must be run prior to running MT3D-USGS to generate all the cell-by-cell flows required by MT3D-USGS. In addition, an identical model setup was created for VS2DH for each scenario to verify the new variably-saturated heat transport functionality within MT3D-USGS. This USGS data release contains the input and output data files for the six hypothetical 1-dimensional models used to demonstrate new functionality of MT3D-USGS. Model input files were developed from published information; no new datasets were collected as part of the modeling study associated with this data release.
Details on data sources and processing for developing model input and output files are documented in the associated journal article (https://doi.org/10.1111/gwat.13256).
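Because MT3D-USGS reads the cell-by-cell flows written by MODFLOW-NWT, the flow model has to be run first. A hedged flopy sketch of that run order is below; flopy is not part of the data release, the name-file names are placeholders, and the executables (mfnwt, mt3d-usgs) are assumed to be on the system path.

```python
# Hedged sketch of the run order only; file names are placeholder assumptions.
import flopy

# 1) Load and run the MODFLOW-NWT flow model to produce the cell-by-cell flow
#    and unsaturated-zone output needed by the transport model.
mf = flopy.modflow.Modflow.load("scenario1.nam", version="mfnwt",
                                exe_name="mfnwt", check=False)
mf.run_model()

# 2) Load and run the MT3D-USGS heat-transport model, pointing it at the flow
#    model so it can locate the flow-transport link output.
mt = flopy.mt3d.Mt3dms.load("scenario1_mt.nam", version="mt3d-usgs",
                            exe_name="mt3d-usgs", modflowmodel=mf)
mt.run_model()
```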
Twitter{"Because the sustainable development of a society is strongly related to the sustainable development of its manufacturing companies, more and more of the companies decide to install and operate on-site energy conversion (utility) systems (ECS). In consequence, many approaches for the design and operation of ECSs are developed. The provided data set contains the applied energy sources demand of different hypothetical manufacturing companies to make different approaches for the design and operation comparable. Altogether 32 company types (all having a production system with parallel machines) are considered and distinguished according to the following production-related parameters: production system size (i.e., number of machines), job size (i.e., mean processing times) and variability (i.e., processing time distribution), energy demand type (i.e., energy demand course), and energy demand variability (i.e., energy demand distribution). For each company type, we provide the energy demands of 240 production days. In addition, the energy demand data is generated with respect to two scheduling objectives: makespan and total flow time. Thus, a total of 15,360 energy demand time series (32 company types, 2 scheduling objectives, and 240 production days) is available. The data consist of three parts: First part lists the planning horizons (lengths) of the time series for each production day. The second part contains the raw data of the energy demands (one time series per company type and scheduling objective). The third part contains the aggregated values (10 minutes aggregated to one period). Note that the combination of part one and two enables the separation of the data of an individual production day. Further detail about the data can be found in “A flexible approach for the dimensioning of on-site energy conversion systems for manufacturing companies”."}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Explanatory memo (Word), Stata code (.do), and Stata dataset (.dta)
Data collection, along with hydraulic and fluvial egg transport modeling, was completed along a 70.9-mile reach of the Ohio River between Markland Locks and Dam and McAlpine Locks and Dam. Data were collected during two surveys: October 27–November 4, 2016, and June 26–29, 2017. Water-quality data collected in this reach included surface measurements and vertical profiles of water temperature, specific conductance, pH, dissolved oxygen, turbidity, relative chlorophyll, and relative phycocyanin. Streamflow and velocity data were collected simultaneously with the water-quality data at cross sections and along longitudinal lines (corresponding to the water-quality surface measurements) and at selected stationary locations (corresponding to the water-quality vertical profiles). The data were collected to understand variability of flow and water-quality conditions relative to simulated reaches of the Ohio River and to aid in identifying parts of the reach that may provide conditions favorable to spawning and recruitment habitat for bighead carp (Hypophthalmichthys nobilis). A copy of an existing hydraulic model of the Ohio River was obtained from the National Weather Service and used to simulate hydraulic conditions for four different streamflows. Streamflows used for the simulations were selected to represent a range of conditions from a high-streamflow event to a seasonal dry-weather event. Outputs from the hydraulic model were used as input to the Fluvial Egg Drift Simulator (FluEgg) along with a range of five water temperatures observed in water-quality data and four potential spawning locations to simulate the extents and quantile positions of developing bighead carp, from egg hatching to the gas bladder inflation stage, under each scenario. A total of 80 simulations were run. Results from the FluEgg scenarios (which include only the hydraulic influences on survival that result from settling, irrespective of mortality from other physical factors such as excess turbulence, or biological factors such as fertilization failure, predation or starvation) indicate that the majority of the eggs will hatch, about half will die, and a quarter of the surviving larvae will reach the gas bladder inflation stage within the modeled reach. The overall average percentage of embryos surviving to the gas bladder inflation stage was 13.1 percent. Individual simulations have embryo survival percentages as high as 49.1 percent. The highest embryo survival percentages occurred for eggs spawned at a streamflow of 38,100 cubic feet per second and water temperatures of 24°C to 30°C. Conversely, embryo survival percentages were lowest for the lowest and highest streamflows regardless of water temperature or spawn location. Under low water temperature, high-streamflow conditions, some of the eggs did not hatch nor did the larvae reach the gas bladder inflation stage until passing beyond the downstream model domain. While the final quantile positions of the eggs and larvae beyond the downstream model domain are unknown, the outcomes still provide useful information.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We have used first-principles density functional theory to relax the experimentally reported crystal structures for the low- and high-temperature phases of Mg(BH4)2, which contain 330 and 704 atoms per unit cell, respectively. The relaxed low-temperature structure was found to belong to the P6122 space group, whereas the original experimental structure has P61 symmetry. The higher symmetry identified in our calculations may be the T = 0 ground-state structure or may be the actual room-temperature structure because it is difficult to distinguish between P61 and P6122 with the available powder diffraction data. We have identified several hypothetical structures for Mg(BH4)2 that have calculated total energies that are close to the low-temperature ground-state structure, including two structures that lie within 0.2 eV per formula unit of the ground-state structure. These alternate structures are all much simpler than the experimentally observed structure. We have used Bader charge analysis to compute the charge distribution in the P6122 Mg(BH4)2 structure and have compared this with charges in the much simpler Mg(AlH4)2 structure. We find that the B−H bonds are significantly more covalent than the Al−H bonds; this difference in bond character may contribute to the very different crystal structures for these two materials. Our calculated vibrational frequencies for the P6122 structure are in good agreement with experimental Raman spectra for the low-temperature Mg(BH4)2 structure. The calculated total energy of the high-temperature structure is only about 0.1 eV per formula unit higher in energy than the low-temperature structure.
MODFLOW-NWT groundwater flow models and MODPATH6 particle tracking simulations were developed to determine contributing areas (CAs) for and advective travel times to domestic wells under extreme recharge events in a small hypothetical watershed underlain by dipping sedimentary rocks. The hypothetical models are based on hydrogeologic conditions in the Newark Basin, located in New Jersey, New York, and Pennsylvania, USA. During extreme recharge events, groundwater supply wells have increased vulnerability to contaminants or pathogens originating at land surface that are flushed into the subsurface. Fractured-rock aquifers are particularly vulnerable because transport to wells can be very fast owing to preferential flow paths through high-permeability fractures with small effective porosities. A base case (BC) scenario was developed in which the flow models simulate transient extreme recharge events and twice daily pumping, and the particle tracking uses a porosity of 0.0001. Alternate transient scenarios were developed in which the models have fewer vertical fractures (FewVF), increased recharge (Rch6.4), or larger effective porosity (Por.001), and an alternate steady-state scenario (StSt) was developed that uses long-term average recharge and pumping rates. For the BC and StSt scenarios, MODFLOW-NWT simulations were run for 48 different pumping well locations (24 shallow well locations and 24 mid-depth well locations). For the FewVF and Rch6.4 scenarios, MODFLOW-NWT simulations were run for the 24 mid-depth well locations. For all scenarios, MODPATH simulations were conducted to define the CAs, the travel times from the CAs to the well, and the arrival times at the well. Transient simulations used hourly releases of particles at the water table throughout the extreme recharge event. The StSt scenario had a single release at the beginning of the simulations. Software tools are provided in this data release to post-process the MODPATH results and produce figures similar to those in the companion journal article (https://doi.org/10.1111/gwat.13169). This USGS data release contains all the input and selected output files for the simulations described in the companion journal article (https://doi.org/10.1111/gwat.13169).
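As an illustration of the kind of post-processing involved, a hedged flopy sketch that reads a MODPATH endpoint file to get particle travel times; the file name is a placeholder, and the software tools provided in the release, not this sketch, reproduce the published figures.

```python
# Hedged sketch; "ex_bc_well01.mpend" is a placeholder endpoint-file name.
import numpy as np
import flopy.utils as fu

endpoints = fu.EndpointFile("ex_bc_well01.mpend")
ep = endpoints.get_alldata()      # structured array, one record per particle
print(ep.dtype.names)             # check field names for this MODPATH version

# Travel time from release at the water table (time0) to arrival at the well
# (time), assuming those field names; they can differ between MODPATH versions.
travel_times = ep["time"] - ep["time0"]
print("median travel time:", np.median(travel_times))
```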
Background: Participants in clinical trials frequently fail to appreciate key differences between research and clinical care. This phenomenon, known as therapeutic misconception, undermines informed consent to clinical research, but to date there have been no effective interventions to reduce it and concerns have been expressed that to do so might impede recruitment. We determined whether a scientific reframing intervention reduces therapeutic misconception without significantly reducing willingness to participate in hypothetical clinical trials. Methods: This prospective randomized trial was conducted from 2015 to 2016 to test the efficacy of an informed consent intervention based on scientific reframing compared to a traditional informed consent procedure (control) in reducing therapeutic misconception among patients considering enrollment in hypothetical clinical trials modeled on real-world studies for one of five disease categories. Patients with diabetes mellitus, hypertension, coronary artery disease, head/neck cancer, breast cancer, and major depression were recruited from medical clinics and a clinical research volunteer database. The primary outcomes were therapeutic misconception, as measured by a validated, ten-item Therapeutic Misconception Scale (range=10-50), and willingness to participate in the clinical trial. Results: 154 participants completed the study (age range, 23-87 years; 92.3% white, 56.5% female); 74 (48.1%) had been randomized to receive the experimental intervention. Therapeutic misconception was significantly lower (p=0.004) in the scientific reframing group (26.4, 95% CI [23.7 to 29.1]) compared to the control group (30.9, 95% CI [28.4 to 33.5]), and remained so after controlling for education (p=0.017). Willingness to participate in the hypothetical trial was not significantly different (p=0.603) between intervention (52.1%, 95% CI [40.2 to 62.4]) and control (56.3%, 95% CI [45.3 to 66.6]) groups. Conclusions: An enhanced educational intervention augmenting traditional informed consent led to a meaningful reduction in therapeutic misconception without a statistically significant change in willingness to enroll in hypothetical clinical trials. Additional study of this intervention is required in real-world clinical trials.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Whether evolutionary history is mostly contingent or deterministic has received much attention in the field of evolutionary biology. Studies addressing this issue have been conducted theoretically, based on models, and experimentally, based on microcosms. It has been argued that the shape of the adaptive landscape and mutation rate are major determinants of replicated phenotypic evolution. In the present study, to incorporate the effects of phenotypic plasticity, we constructed a model using tree-like organisms. In this model, the basic rules used to develop trees are genetically determined, but tree shape (described by the number and aspect ratio of the branches) is determined by both genetic components and plasticity. The results of the simulation show that the tree shapes become more deterministic under higher mutation rates. However, the tree shape became most contingent and diverse at the lower mutation rate. In this situation, the variances of the genetically determined characters were low, but the variance of the tree shape was rather high, suggesting that phenotypic plasticity produces this contingency and diversity of tree shape. The present findings suggest that plasticity cannot be ignored as a factor that increases the contingency and diversity of evolutionary outcomes.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/3.0/customlicense?persistentId=doi:10.7910/DVN/3R5TS3
This dataset consists of fake data created to illustrate the potential transfer of a study algorithm for creating CHU-9D utility mapping models to new data. Outputs in this dataset are for instructional purposes only and should not be used to inform decision-making.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/GEJGV8
Discussions around declining trust in the U.S. media can be vague about its effects. One classic answer comes from the persuasion literature, in which source credibility plays a key role. However, existing research almost universally takes credibility as a given. To overcome the potentially severe confounding that can result from this, we create a hypothetical news outlet and manipulate the extent to which it is portrayed as credible. We then randomly assign subjects to read op-eds attributed to the source. Our credibility treatments are strong, increasing trust in our mock source for up to ten days later. We find some evidence that the resulting higher perceived credibility boosts the persuasiveness of arguments about more partisan topics (but not for a less politicized issue). Though our findings are mixed, we argue that this experimental approach can fruitfully enhance our understanding of the interplay between source trust and opinion change over sustained periods. Keywords: source credibility, persuasion, media trust, attitude change
This data release complements Murray et al. (2023), which presents a framework for incorporating earthquake magnitude estimates based on real-time Global Navigation Satellite System (GNSS) data into the ShakeAlert® earthquake early warning system for the west coast of the United States. Murray et al. (2023) assess the impact of time-dependent noise in GNSS real-time position estimates on the reliability of earthquake magnitudes estimated using such data. To do so they derived peak ground displacement (PGD) estimates from time series of background noise in GNSS real-time positions. These noise-only PGD measurements were used as input to a published empirical relationship to compute magnitude for hypothetical earthquakes that are each defined by an epicentral location and origin time. The data files provided here give the locations of GNSS stations used in the study, the hypothetical epicenters and origin times, and the PGD for each GNSS station for four time windows following each hypothetical origin time. We also provide the epicenters and origin times used to simulate the impact of noisy PGD data in terms of the annual number of spuriously large magnitude estimates that would be generated in the geographic region spanned by California, Oregon, and Washington, United States, due to noise alone. Finally, we include the estimated magnitudes for the annual simulations along with the number of GNSS stations for which the measured PGD exceeded a threshold value that was defined empirically to eliminate unreliable magnitude estimates.
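For intuition about this kind of magnitude calculation, a hedged sketch that inverts a PGD scaling relationship of the commonly used form log10(PGD) = A + B*M + C*M*log10(R); the coefficients below are illustrative placeholders, not the values from the published relationship used in the study.

```python
# Illustrative sketch only: invert a generic PGD scaling law of the form
#   log10(PGD) = A + B*M + C*M*log10(R)
# for magnitude M. A, B, C are placeholder values, NOT the published
# coefficients used by Murray et al. (2023).
import math

A, B, C = -5.0, 1.2, -0.18   # placeholder coefficients (PGD in cm, R in km)

def magnitude_from_pgd(pgd_cm, hypocentral_distance_km):
    """Solve log10(PGD) = A + B*M + C*M*log10(R) for M."""
    log_r = math.log10(hypocentral_distance_km)
    return (math.log10(pgd_cm) - A) / (B + C * log_r)

# A noise-only PGD of a few centimeters at a nearby station can already map to
# a sizable magnitude, which is why a station-count threshold on PGD is useful.
print(magnitude_from_pgd(pgd_cm=3.0, hypocentral_distance_km=50.0))
```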
https://creativecommons.org/publicdomain/zero/1.0/
This folder contains the data behind the story The LeBron James Decision-Making Machine.
lebron.xslx contains the data used to create the hypothetical depth charts for what every NBA team might look like with LeBron James. Each player carries a rating on offense and on defense, based on our CARMELO projection system. For team projections, a player is also allocated a certain number of minutes per game at each position, guided by CARMELO's playing-time projections. Players with 0 allocated minutes are no longer on the team in question, and are indicated with orange cells. The spreadsheet indicates a team's projected payroll after adding LeBron, compared with the NBA salary cap and luxury-tax thresholds. Players with salaries listed under "$$$ Shed" have left the team, for a reason indicated to the right. (Players departing for special reasons, such as the stretch provision, are color coded when applicable.) A team's projected W-L record is generated from its players' ratings and playing-time projections, and that is used (along with the team's average age and the wins added by its best player) to calculate the team's odds of winning a championship over the next four seasons.
This is a dataset from FiveThirtyEight hosted on their GitHub. Explore FiveThirtyEight data using Kaggle and all of the data sources available through the FiveThirtyEight organization page!
This dataset is maintained using GitHub's API and Kaggle's API.
This dataset is distributed under the Attribution 4.0 International (CC BY 4.0) license.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview:
This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1000 hypothetical software product reviews, with the aim of producing a diversity of sentiment and text. The datasets were created as part of the research described in:
Hastings, J.D., Weitl-Harms, S., Doty, J., Myers, Z. L., and Thompson, W., “Utilizing Large Language Models to Synthesize Product Desirability Datasets,” in Proceedings of the 2024 IEEE International Conference on Big Data (BigData-24), Workshop on Large Language and Foundation Models (WLLFM-24), Dec. 2024. https://arxiv.org/abs/2411.13485
Briefly, each row in the datasets was produced as follows:
1) Word+Review: The LLM selected a word and synthesized a review that would align with a random target sentiment.
2) Review+Word: The LLM produced a review to align with the target sentiment score, and then selected a word appropriate for the review.
3) Supply-Word: A word was supplied to the LLM, which then scored the word and produced a review to align with that score.
For sentiment analysis and PDT testing, the two columns of main interest across the datasets are likely 'Selected Word' and 'Hypothetical Review'.
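A small sketch of scoring the synthesized reviews with an off-the-shelf sentiment model; the CSV file name is a placeholder assumption, while 'Selected Word' and 'Hypothetical Review' are the column names mentioned above.

```python
# Hedged sketch: score each synthesized review with VADER.
# "word_review.csv" is a placeholder file name; 'Selected Word' and
# 'Hypothetical Review' are the columns named in the description above.
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

df = pd.read_csv("word_review.csv")
df["vader_compound"] = df["Hypothetical Review"].apply(
    lambda text: sia.polarity_scores(text)["compound"]
)

# Quick look at how the chosen words line up with the scored reviews.
print(df[["Selected Word", "vader_compound"]].head())
```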
License:
This data is licensed under the CC Attribution 4.0 international license, and may be taken and used freely with credit given. Cite as:
Hastings, J., Weitl-Harms, S., Doty, J., Myers, Z., & Thompson, W. (2024). Synthetic Product Desirability Datasets for Sentiment Analysis Testing (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.14188456
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Assets: data tables for the cities of Indianapolis (IN) and Baltimore (MD) in MS Excel and MS Word, plus shapefiles in ESRI format; all files zipped.
Context (abstract of a report published in a blog at the serial Geospatial World): This case study called into question police conduct and policy injustice discovered in two large American cities. My path to learning about Indianapolis (IN) and Baltimore (MD) policing patterns and crime events was due to the availability of open data focused on use-of-force (UOF). My specific goal was to conduct geospatial data analytics aimed at these two cities using location and other key variables. Two spreadsheets captured small data that laid acceptable statistical groundwork for iterations of exploratory spatial data analysis (ESDA). Bivariate scatterplots revealed possible police misconduct. Parallel coordinate plotting – an innovative multivariate tool – was then used to display co-occurrences of plotted UOF and racial variables associated with key police districts in Indianapolis and Baltimore. A final summary visualization sought to cartographically and dramatically compare force and race variables by way of comparative plots, graphs, and maps. I closed with three action items pertaining to a “social-justice” framework for future data-visualization, to the heightening of standards for law enforcement reform, and for a need to make a hypothetical “citizen’s arrest” of police misconduct.