23 datasets found
  1. Addressing sample selection bias for machine learning methods (replication...

    • resodate.org
    Updated Oct 2, 2025
    Cite
    Dylan Brewer; Alyssa Carlson (2025). Addressing sample selection bias for machine learning methods (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9hZGRyZXNzaW5nLXNhbXBsZS1zZWxlY3Rpb24tYmlhcy1mb3ItbWFjaGluZS1sZWFybmluZy1tZXRob2RzLXJlcGxpY2F0aW9uLWRhdGE=
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW
    ZBW Journal Data Archive
    Authors
    Dylan Brewer; Alyssa Carlson
    Description

    Addressing sample selection bias for machine learning methods (replication data)

    Dylan Brewer and Alyssa Carlson

    Accepted at Journal of Applied Econometrics, 2023

    Overview

    This replication package contains the files required to reproduce the results, tables, and figures using Matlab and Stata. We divide the project into instructions to replicate the simulation, the result from Huang et al. (2006), and the application.

    Simulation

    For reproducing the simulation results.

    Included files in *\Simulation with short descriptions:

    • SSML_simfunc: function that produces individual simulation runs
    • SSML_simulation: script that loops over SSML_simfunc for different DGPs and multiple simulation runs
    • SSML_figures: script that generates all figures for the paper
    • SSML_compilefunc: function that compiles the results from SSML_simulation for the SSML_figures script

    Steps for replicating simulation:

    1. Save SSML_simfunc, SSML_simulation, SSML_figures, SSML_compilefunc to the same folder. This location will be referred to as the FILEPATH.
    2. Create OUTPUT folder inside the FILEPATH location.
    3. Change the FILEPATH location inside SSML_simulation and SSML_figures.
    4. Run SSML_simulation to produce simulation data and results.
    5. Run SSML_figures to produce figures.

    Huang et al replication

    For reproducing the Huang et al. (2006) replication results.

    Included files in *\HuangetalReplication with short descriptions:

    • SSML_huangrep: script that replicates the results from Huang et al. (2006)

    Obtaining the dataset:

    Go to https://archive.ics.uci.edu/dataset/14/breast+cancer and save file as "breast-cancer-wisconsin.data"

    Steps for replicating results:

    1. Save SSML_huangrep and the breast cancer data to the same folder. This location will be referred to as the FILEPATH.
    2. Change the FILEPATH location inside SSML_huangrep.
    3. Run SSML_huangrep to produce results and figures.

    Application

    For reproducing the application section results.

    Included program files in *\Application with short descriptions:

    • G0_main_202308.do: Stata wrapper code that will run all application replication files
    • G1_cqclean_202308.do: Cleans election outcomes data
    • G2_cqopen_202308.do: Cleans open elections data
    • G3_demographics_cainc30_202308.do: Cleans demographics data
    • G4_fips_202308.do: Cleans FIPS code data
    • G5_klarnerclean_202308.do: Cleans Klarner gubernatorial data
    • G6_merge_202308.do: Merges cleaned datasets together
    • G7_summary_202308.do: Generates summary statistics tables and figures
    • G8_firststage_202308.do: Runs L1 penalized probit for the first stage
    • G9_prediction_202308.m: Trains learners and makes predictions
    • G10_figures_202308.m: Generates figures of prediction patterns
    • G11_final_202308.do: Generates final figures and tables of results
    • r1_lasso_alwayskeepCF_202308.do: Examines the effect of requiring that the control function not be dropped from LASSO
    • latexTable.m: Code by Eli Duenisch to write LaTeX tables from Matlab (https://www.mathworks.com/matlabcentral/fileexchange/44274-latextable)

    Included non-confidential data in subdirectory `*\Application\Data`:

    Confidential data suppressed in subdirectory `*\Application\CD`:

    These data cannot be transferred under the data use agreement with CQ Press, so the files are not included.

    There is no batch download--downloads for each year must be done by hand. For each year, download as many state outcomes as possible and name the files YYYYa.csv, YYYYb.csv, etc. (Example: 1970a.csv, 1970b.csv, 1970c.csv, 1970d.csv). See line 18 of G1_cqclean_202308.do for file structure information.
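
The naming convention above can be sketched in Python. This is only an illustration: the helper names `cq_file_names` and `is_valid_cq_name` are hypothetical, and the cap of 26 files per year is an assumption that follows from using one letter per file. The source fixes only the pattern YYYYa.csv, YYYYb.csv, and so on.

```python
import re
import string

# Hypothetical helper: builds the names YYYYa.csv, YYYYb.csv, ... for one year.
def cq_file_names(year: int, n_files: int) -> list[str]:
    """Return the expected file names for n_files downloads of one year."""
    # Assumption: one lowercase letter per file, so at most 26 files per year.
    if not 1 <= n_files <= len(string.ascii_lowercase):
        raise ValueError("expected between 1 and 26 files per year")
    return [f"{year}{letter}.csv" for letter in string.ascii_lowercase[:n_files]]

# Four-digit year, one lowercase letter, then the .csv extension.
NAME_PATTERN = re.compile(r"\d{4}[a-z]\.csv")

def is_valid_cq_name(name: str) -> bool:
    """Check that a file name follows the YYYYa.csv convention."""
    return NAME_PATTERN.fullmatch(name) is not None

print(cq_file_names(1970, 4))
```

Checking downloaded files with `is_valid_cq_name` before running the cleaning script may catch misnamed files early.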

    Steps for replicating application:

    1. Download confidential data from the CQ Press.
    2. Change the working directory in G0_main_202308.do on line 18 to the application folder.
    3. Change local matlabpath in G0_main_202308.do on line 18 to the appropriate location.
    4. Set directory and file path in G9_prediction_202308.m and G10_figures_202308.m as necessary.
    5. Run G0_main_202308.do in Stata to run all programs.
    6. All output (figures and tables) will be saved to subdirectory *\Application\Output.

    Contact

    Contact Dylan Brewer (brewer@gatech.edu) or Alyssa Carlson (carlsonah@missouri.edu) for help with replication.

  2. Area Resource File (ARF)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Area Resource File (ARF) [Dataset]. http://doi.org/10.7910/DVN/8NMSFV
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the area resource file (arf) with r

    the arf is fun to say out loud. it's also a single county-level data table with about 6,000 variables, produced by the united states health resources and services administration (hrsa). the file contains health information and statistics for over 3,000 us counties. like many government agencies, hrsa provides only a sas importation script and an ascii file.

    this new github repository contains two scripts:

    2011-2012 arf - download.R
    • download the zipped area resource file directly onto your local computer
    • load the entire table into a temporary sql database
    • save the condensed file as an R data file (.rda), comma-separated value file (.csv), and/or stata-readable file (.dta)

    2011-2012 arf - analysis examples.R
    • limit the arf to the variables necessary for your analysis
    • sum up a few county-level statistics
    • merge the arf onto other data sets, using both fips and ssa county codes
    • create a sweet county-level map

    click here to view these two scripts

    for more detail about the area resource file (arf), visit:
    • the arf home page
    • the hrsa data warehouse

    notes: the arf may not be a survey data set itself, but it's particularly useful to merge onto other survey data.

    confidential to sas, spss, stata, and sudaan users: time to put down the abacus. time to transition to r. :D
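
The merge on FIPS county codes mentioned in this description can be sketched as follows. The original scripts are written in R; this Python version is only an illustration, and the function names, the `hospitals` column, and all values are hypothetical. It assumes the usual layout of a county FIPS code: a 2-digit state code followed by a 3-digit county code.

```python
def county_fips(state_code, county_code) -> str:
    """Build a 5-digit county FIPS code, zero-padding both parts."""
    return f"{int(state_code):02d}{int(county_code):03d}"

def merge_arf_onto(survey_rows, arf_rows):
    """Attach county-level columns to each survey row, matching on 'fips'.

    Survey rows with no matching county are kept without the extra columns.
    """
    arf_by_fips = {row["fips"]: row for row in arf_rows}
    merged = []
    for row in survey_rows:
        county = arf_by_fips.get(row["fips"], {})
        merged.append({**county, **row})
    return merged

# Made-up county table and survey records, for illustration only.
arf = [
    {"fips": county_fips(1, 1), "hospitals": 1},
    {"fips": county_fips(6, 37), "hospitals": 90},
]
survey = [
    {"respondent": "a", "fips": "06037"},
    {"respondent": "b", "fips": "01001"},
    {"respondent": "c", "fips": "99999"},
]

for row in merge_arf_onto(survey, arf):
    print(row)
```

Zero-padding matters here: codes read as integers lose their leading zeros and then fail to match string keys.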

  3. FGDs patients’ characteristics Stata format dataset and its do file.

    • datasetcatalog.nlm.nih.gov
    Updated Apr 7, 2023
    Cite
    Ottaru, Theresia A.; Kivuyo, Sokoine L.; Wood, Christine V.; Shayo, Elizabeth H.; Mbugi, Erasto V.; Hirschhorn, Lisa R.; Karoli, Peter M.; Kaaya, Sylvia F.; Shayo, Grace A.; Mgina, Eric J.; Hawkins, Claudia A.; Mfinanga, Sayoki G. (2023). FGDs patients’ characteristics Stata format dataset and its do file. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001058989
    Dataset updated
    Apr 7, 2023
    Authors
    Ottaru, Theresia A.; Kivuyo, Sokoine L.; Wood, Christine V.; Shayo, Elizabeth H.; Mbugi, Erasto V.; Hirschhorn, Lisa R.; Karoli, Peter M.; Kaaya, Sylvia F.; Shayo, Grace A.; Mgina, Eric J.; Hawkins, Claudia A.; Mfinanga, Sayoki G.
    Description

    We imported the Excel sheet FGD patients’ characteristics into Stata to conduct a simple descriptive analysis. The saved dataset and its do file have therefore been shared with editors and reviewers for their reference. (ZIP)

  4. Data.Evaluation Report 9-months pilot Open Science Support Desk

    • figshare.com
    • uvaauas.figshare.com
    Updated Jan 22, 2021
    Cite
    G. ter Riet; N.R. van Ulzen; F.A. van Nes (2021). Data.Evaluation Report 9-months pilot Open Science Support Desk [Dataset]. http://doi.org/10.21943/auas.13614689.v1
    Dataset updated
    Jan 22, 2021
    Dataset provided by
    University of Amsterdam / Amsterdam University of Applied Sciences
    Authors
    G. ter Riet; N.R. van Ulzen; F.A. van Nes
    License

    http://rdm.uva.nl/en/support/confidential-data.html

    Description

    Datasets related to the evaluation and report of the Urban Vitality (UV) Open Science Support Desk. Data were exported from Qualtrics, saved as STATA (.dta) files, and analyzed using STATA version 13.1. This item contains:

    1. Qualtrics exports: two tab-separated value (.tsv) files
    2. STATA: two STATA data (.dta) files
    3. STATA: three STATA log (.txt) files

    The STATA analysis files are deposited in UvA/HvA figshare separately and are publicly available. More information is available in the report.

  5. Effects of community management on user activity in online communities

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 24, 2025
    Cite
    Alberto Cottica (2025). Effects of community management on user activity in online communities [Dataset]. http://doi.org/10.5281/zenodo.1320261
    Available download formats: zip
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alberto Cottica
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Data and code needed to reproduce the results of the paper "Effects of community management on user activity in online communities", available in draft here.

    Instructions:

    1. Unzip the files.
    2. Start with JSON files obtained from calling platform APIs: each dataset consists of one file for posts, one for comments, one for users. In the paper we use two datasets, one referring to Edgeryders, the other to Matera 2019.
    3. Run them through edgesense (https://github.com/edgeryders/edgesense). Edgesense lets you set the length of the observation period. We set it to 1 week and 1 day for Edgeryders data, and to 1 day for Matera 2019 data. Edgesense stores its results in a JSON file called network.min.json, which we then rename to keep track of the data source and observation length.
    4. Launch Jupyter Notebook and run the notebook provided to convert the network.min.json files into CSV flat files, one for each network file.
    5. Launch Stata and open each flat CSV file with it, then save it in Stata format.
    6. Use the provided Stata .do scripts to replicate results.

    Please note: I use both Stata and Jupyter Notebook interactively, running a block with a few lines of code at a time. Expect to have to change directories, file names etc.
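
The JSON-to-CSV conversion in the instructions above can be sketched in Python. This is not the notebook shipped with the package. The layout of network.min.json is not described here, so the sketch assumes a hypothetical top-level key (`metrics`) holding a list of flat records; the key, the field names, and the function name `records_to_csv` are all illustrative.

```python
import csv
import io
import json

def records_to_csv(json_text: str, key: str) -> str:
    """Flatten the list of records stored under `key` into CSV text."""
    records = json.loads(json_text)[key]
    # Collect field names in first-seen order so ragged records still fit.
    fieldnames = []
    for record in records:
        for name in record:
            if name not in fieldnames:
                fieldnames.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells
    return buffer.getvalue()

# Made-up input standing in for one network file.
sample = '{"metrics": [{"ts": 1, "users": 10}, {"ts": 2, "users": 12, "posts": 3}]}'
print(records_to_csv(sample, "metrics"))
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the text would be saved to a file that Stata can import.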

  6. Repeated information of benefits reduce COVID-19 vaccination hesitancy:...

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1more
    zip
    Updated Jun 17, 2022
    Cite
    Max Burger; Matthias Mayer; Ivo Steimanis (2022). Repeated information of benefits reduce COVID-19 vaccination hesitancy: Experimental evidence from Germany [Dataset]. http://doi.org/10.5281/zenodo.6242620
    Available download formats: zip
    Dataset updated
    Jun 17, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Max Burger; Matthias Mayer; Ivo Steimanis
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Germany
    Description

    This replication package contains the raw data and code to replicate the findings reported in the paper. The data are licensed under a Creative Commons Attribution 4.0 International Public License. The code is licensed under a Modified BSD License. See LICENSE.txt for details.

    Software requirements

    All analyses were done in Stata version 16:

    • Add-on packages are included in scripts/libraries/stata and do not need to be installed by user. The names, installation sources, and installation dates of these packages are available in scripts/libraries/stata/stata.trk.

    Instructions

    1. Save the folder ‘replication_PLOS’ to your local drive.
    2. Open the master script ‘run.do’ and change the global that points to the working directory (line 20) to the location where you saved the folder on your local drive.
    3. Run the master script ‘run.do’ to replicate the analysis and generate all tables and figures reported in the paper and supplementary online materials.

    Datasets

    • Wave 1 – Survey experiment: ‘wave1_survey_experiment_raw.dta’
    • Wave 2 – Follow-up Survey: ‘wave2_follow_up_raw.dta’
    • Map: shape-files ‘plz2stellig.shp’ and ‘OSM_PLZ.shp’, area codes ‘Postleitzahlengebiete-_OSM.csv’ (all links to the sources can be found in the script ‘04_figure2_germany_map.do’)
    • Pretest: ‘pre-test_corona_raw.dta’
    • For Appendix S7: ‘alter_geschlecht_zensus_det.xlsx’, ‘vaccination_landkreis_raw.dta’, ‘census2020_age_gender.csv’ (all links to the sources can be found in the script ‘06_AppendixS7.do’)
    • For Appendix S10: ‘vaccination_landkreis_raw.dta’ (all links to the sources can be found in the script ‘07_AppendixS10.do’)

    Descriptions of scripts

    1_1_clean_wave1.do
    This script processes the raw data from wave 1, the survey experiment
    1_2_clean_wave2.do
    This script processes the raw data from wave 2, the follow-up survey
    1_3_merge_generate.do
    This script creates the datasets used in the main analysis and for robustness checks by merging the cleaned data from wave 1 and 2, tests the exclusion criteria and creates additional variables
    02_analysis.do
    This script estimates regression models in Stata, creates figures and tables, saving them to results/figures and results/tables
    03_robustness_checks_no_exclusion.do
    This script runs the main analysis using the dataset without applying the exclusion criteria. Results are saved in results/tables
    04_figure2_germany_map.do
    This script creates Figure 2 in the main manuscript using publicly available data on vaccination numbers in Germany.
    05_figureS1_dogmatism_scale.do
    This script creates Figure S1 using data from a pretest to adjust the dogmatism scale.
    06_AppendixS7.do
    This script creates the figures and tables provided in Appendix S7 on the representativity of our sample compared to the German average using publicly available data about the age distribution in Germany.
    07_AppendixS10.do
    This script creates the figures and tables provided in Appendix S10 on the external validity of vaccination rates in our sample using publicly available data on vaccination numbers in Germany.

  7. Survey of Consumer Finances (SCF)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Survey of Consumer Finances (SCF) [Dataset]. http://doi.org/10.7910/DVN/FRMKMF
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the survey of consumer finances (scf) with r

    the survey of consumer finances (scf) tracks the wealth of american families. every three years, more than five thousand households answer a battery of questions about income, net worth, credit card debt, pensions, mortgages, even the lease on their cars. plenty of surveys collect annual income, but only the survey of consumer finances captures such detailed asset data. responses are at the primary economic unit-level (peu) - the economically dominant, financially interdependent family members within a sampled household. norc at the university of chicago administers the data collection, but the board of governors of the federal reserve pay the bills and therefore call the shots.

    if you were so brazen as to open up the microdata and run a simple weighted median, you'd get the wrong answer. the five to six thousand respondents actually gobble up twenty-five to thirty thousand records in the final public use files. why oh why? well, those tables contain not one, not two, but five records for each peu. wherever missing, these data are multiply-imputed, meaning answers to the same question for the same household might vary across implicates. each analysis must account for all that, lest your confidence intervals be too tight. to calculate the correct statistics, you'll need to break the single file into five, necessarily complicating your life. this can be accomplished with the meanit sas macro buried in the 2004 scf codebook (search for meanit - you'll need the sas iml add-on). or you might blow the dust off this website referred to in the 2010 codebook as the home of an alternative multiple imputation technique, but all i found were broken links. perhaps it's time for plan c, and by c, i mean free. read the imputation section of the latest codebook (search for imputation), then give these scripts a whirl. they've got that new r smell.

    the lion's share of the respondents in the survey of consumer finances get drawn from a pretty standard sample of american dwellings - no nursing homes, no active-duty military. then there's this secondary sample of richer households to even out the statistical noise at the higher end of the income and assets spectrum. you can read more if you like, but at the end of the day the weights just generalize to civilian, non-institutional american households. one last thing before you start your engine: read everything you always wanted to know about the scf. my favorite part of that title is the word always.

    this new github repository contains three scripts:

    1989-2010 download all microdata.R
    • initiate a function to download and import any survey of consumer finances zipped stata file (.dta)
    • loop through each year specified by the user (starting at the 1989 re-vamp) to download the main, extract, and replicate weight files, then import each into r
    • break the main file into five implicates (each containing one record per peu) and merge the appropriate extract data onto each implicate
    • save the five implicates and replicate weights to an r data file (.rda) for rapid future loading

    2010 analysis examples.R
    • prepare two survey of consumer finances-flavored multiply-imputed survey analysis functions
    • load the r data files (.rda) necessary to create a multiply-imputed, replicate-weighted survey design
    • demonstrate how to access the properties of a multiply-imputed survey design object
    • cook up some descriptive statistics and export examples, calculated with scf-centric variance quirks
    • run a quick t-test and regression, but only because you asked nicely

    replicate FRB SAS output.R
    • reproduce each and every statistic provided by the friendly folks at the federal reserve
    • create a multiply-imputed, replicate-weighted survey design object
    • re-reproduce (and yes, i said/meant what i meant/said) each of those statistics, now using the multiply-imputed survey design object to highlight the statistically-theoretically-irrelevant differences

    click here to view these three scripts

    for more detail about the survey of consumer finances (scf), visit:
    • the federal reserve board of governors' survey of consumer finances homepage
    • the latest scf chartbook, to browse what's possible. (spoiler alert: everything.)
    • the survey of consumer finances wikipedia entry
    • the official frequently asked questions

    notes: nationally-representative statistics on the financial health, wealth, and assets of american households might not be monopolized by the survey of consumer finances, but there isn't much competition aside from the assets topical module of the survey of income and program participation (sipp). on one hand, the scf interview questions contain more detail than sipp. on the other hand, scf's smaller sample precludes analyses of acute subpopulations. and for any three-handed martians in the audience, there's also a few biases between these two data sources that you ought to consider. the survey methodologists at the federal reserve take their job...
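
To see why the five implicates must be analyzed separately and then combined, here is a minimal Python sketch of the textbook combining rules for multiply-imputed estimates (Rubin's rules). It is not the code from the R scripts, which by their own description apply scf-centric variance quirks. The function name and all numbers are made up for illustration.

```python
from statistics import mean, variance

def combine_implicates(estimates, variances):
    """Combine per-implicate estimates with Rubin's rules.

    estimates: one point estimate per implicate
    variances: the sampling variance of each of those estimates
    Returns (combined point estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two implicates")
    point = mean(estimates)          # average of the implicate estimates
    within = mean(variances)         # average sampling variance
    between = variance(estimates)    # spread across implicates (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return point, total

# Five hypothetical implicate-level estimates, each with sampling variance 4.
point, total = combine_implicates([100, 102, 98, 101, 99], [4, 4, 4, 4, 4])
print(point, total, total ** 0.5)
```

The between-implicate term is what a single stacked file would miss. Dropping it leaves only the within variance, which is how confidence intervals end up too tight.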

  8. Replication Data for: Partisanship and Support for Devolving Concrete Policy...

    • dataverse.harvard.edu
    Updated Oct 1, 2024
    Cite
    David Doherty (2024). Replication Data for: Partisanship and Support for Devolving Concrete Policy Decisions to the States [Dataset]. http://doi.org/10.7910/DVN/AE8KCI
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    David Doherty
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    United States
    Description

    This archive includes materials needed to replicate analysis reported in Doherty, Touchton, Lyons. 202X. "Partisanship and Support for Devolving Concrete Policy Decisions to the States." Political Behavior.

    • replication_data.dta: Stata formatted dataset with all variables used in the analysis.
    • replication.do: DO file that executes all analysis reported in the article and outputs tables and figures to a subfolder named "tables".

    Users should save these two files to a folder, create a subfolder titled "tables", and change the path on the first line of the DO file to refer to the main folder.

  9. Doherty_Schraeder_Dobbs_Replication.zip – Research Data for "Do Democratic...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Feb 25, 2019
    Cite
    Dobbs, Kirstie; Doherty, David; Schraeder, Peter (2019). Doherty_Schraeder_Dobbs_Replication.zip – Research Data for "Do Democratic Revolutions 'Activate' Participants?: The Case of Tunisia" [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000161519
    Dataset updated
    Feb 25, 2019
    Authors
    Dobbs, Kirstie; Doherty, David; Schraeder, Peter
    Area covered
    Tunisia
    Description

    This archive contains materials to replicate the analysis reported in: Doherty, David, Peter J. Schraeder, and Kirstie L. Dobbs. "Do Democratic Revolutions 'Activate' Participants?: The Case of Tunisia"

    The root directory includes five Stata DO files (run on Stata 14.1). The file replication.do calls the other four DO files. Each of these four DO files conducts analysis specific to a particular dataset. In the case of arab_barometer_w2.do and afrobarometer.do, the files simply do the necessary recoding and output summary statistics. The remaining two DO files work with the two datasets used to conduct the statistical analysis reported in the paper. They also complete recoding to ensure that variables from these two datasets are coded similarly. The file replication.do then stacks the recoded data to complete the core analysis reported.

    The directory includes four folders:

    1) prepped_data: this folder is where the two recoded datasets that are stacked for the core analysis are deposited. It is empty in this archive.
    2) private_data: Empty folder referred to in commented out code. The only file originally included in this folder was the full dataset from the original survey used in the analysis. The commented out code (top of "orig_survey.do") stripped out variables not used in the analysis and saved the resulting dataset in the raw_data folder.
    3) raw_data: Contains all datasets used in the analysis. The tunisia_2012_survey.dta file is from our original survey. The remaining files were downloaded from the Arab Barometer and AfroBarometer websites.
    4) tables: Empty folder where tables and figures are saved.

    To run the analysis, users should simply set the directory at the top of the replication.do file.

  10. Data from: Inconsistent Retirement Timing

    • figshare.com
    zip
    Updated Dec 14, 2021
    + more versions
    Cite
    Philipp Schreiber; Christoph Merkle; Martin Weber (2021). Inconsistent Retirement Timing [Dataset]. http://doi.org/10.6084/m9.figshare.17197928.v1
    Available download formats: zip
    Dataset updated
    Dec 14, 2021
    Dataset provided by
    figshare
    Authors
    Philipp Schreiber; Christoph Merkle; Martin Weber
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Abstract

    We study the effect of inconsistent time preferences on actual and planned retirement timing decisions in two independent datasets. Theory predicts that hyperbolic time preferences can lead to dynamically inconsistent retirement timing. In an online experiment with more than 2,000 participants, we find that time-inconsistent participants retire on average 1.75 years earlier than time-consistent participants do. The planned retirement age of non-retired participants decreases with age. This negative age effect is about twice as strong among time-inconsistent participants. The temptation of early retirement seems to rise in the final years of approaching retirement. Consequently, time-inconsistent participants have a higher probability of regretting their retirement decision. We find similar results for a representative household survey (German SAVE panel). Using smoking behavior and overdraft usage as time preference proxies, we confirm that time-inconsistent participants retire earlier and that non-retirees reduce their planned retirement age within the panel.

    Methods

    We conduct an online experiment in cooperation with a large and well-circulated German newspaper, the Frankfurter Allgemeine Zeitung (FAZ). Participants are recruited via a link on the newspaper's website and two announcements in the print edition. In total, 3,077 participants complete the experiment, which takes them on average 11 minutes. Participants answer questions about retirement planning, time preferences, risk preferences, financial literacy, and demographics. The initial sample for this study consists of 256 retired participants and 2,173 non-retired participants.

    Usage Notes

    Our dataset: the STATA Do File is attached.

    Additional datasets: In addition, a German household panel is used in this paper. We cannot upload these data, but they are available via the Max Planck Institute (https://www.mpisoc.mpg.de/en/social-policy-mea/research/save-2001-2013/).

    We upload the Do-Files used in the analysis and the results in Excel format (xlsx).

  11. Health Survey for England, 2000-2001: Small Area Estimation Teaching Dataset...

    • datacatalogue.ukdataservice.ac.uk
    Updated Jul 29, 2011
    Cite
    University of Manchester, Cathie Marsh Centre for Census and Survey Research, ESDS Government (2011). Health Survey for England, 2000-2001: Small Area Estimation Teaching Dataset [Dataset]. http://doi.org/10.5255/UKDA-SN-6792-1
    Dataset updated
    Jul 29, 2011
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    University of Manchester, Cathie Marsh Centre for Census and Survey Research, ESDS Government
    Area covered
    England
    Description

    The Health Survey for England, 2000-2001: Small Area Estimation Teaching Dataset was prepared as a resource for those interested in learning introductory small area estimation techniques. It was first presented as part of a workshop entitled 'Introducing small area estimation techniques and applying them to the Health Survey for England using Stata'. The data are accompanied by a guide that includes a practical case study enabling users to derive estimates of disability for districts in the absence of survey estimates. This is achieved using various models that combine information from ESDS government surveys with other aggregate data that are reliably available for sub-national areas. Analysis is undertaken using Stata statistical software; all relevant syntax is provided in the accompanying '.do' files.

    The data files included in this teaching resource contain HSE variables and data from the Census and Mid-year population estimates and projections that were developed originally by the National Statistical agencies, as follows:

    • The main data file, 'hse_data.dta', is a reduced version of the HSE for 2000 and 2001. In order to combine data from two years of the HSE in a consistent way some changes have been made to the weights in each year. Additionally, some recoding of the limiting long term illness (LLTI), disability and the age variable has also been undertaken.
    • File 'practical_1_task_5_data.dta' contains population counts and model mobility disability rates (estimated during practical 1) distinguishing single year of age and sex for the six case study districts.
    • File 'practical_2_data.dta' contains the aggregate data required for Practical 2, including age- and sex-specific rates of LLTI (Census) for six UK case study districts, age- and sex-specific rates of mobility disability for England (HSE), and population counts for the six districts.
    • File 'pop_data_practical_3.dta' contains population counts for the six districts (by age, sex and LLTI status) required for Practical 3.
    The original HSEs for 2000 and 2001 are held at the UK Data Archive under SNs 4628 and 4912 respectively. Full details of the recoding of HSE variables and how the aggregate data was produced can be found in the data documentation.

    This unrestricted access data collection is freely available to download under an Open Government Licence from the UK Data Service. Note that the files should be unzipped/saved to the C: drive of the computer to be used; all syntax assumes files are saved at this location.
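
One simple way to derive a district estimate from the kinds of inputs listed above is to apply national age- and sex-specific rates to district population counts. The Python sketch below illustrates that idea only. It is not the syntax from the accompanying '.do' files, and the age bands, rates, and counts are invented.

```python
def expected_cases(population, rates):
    """Sum population count times rate over every age-sex group."""
    return sum(count * rates[group] for group, count in population.items())

# Hypothetical national disability rates by (sex, age band).
national_rates = {
    ("F", "16-64"): 0.05,
    ("M", "16-64"): 0.04,
    ("F", "65+"): 0.30,
    ("M", "65+"): 0.25,
}

# Hypothetical population counts for one district, same groups.
district_population = {
    ("F", "16-64"): 5000,
    ("M", "16-64"): 5200,
    ("F", "65+"): 1000,
    ("M", "65+"): 800,
}

cases = expected_cases(district_population, national_rates)
district_rate = cases / sum(district_population.values())
print(round(cases), round(district_rate, 4))
```

Because the rates come from the national survey, differences between districts in this sketch reflect only their age and sex structure. Models that bring in district-level covariates such as LLTI relax that limitation.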

  12. Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1992-2016

    • openicpsr.org
    • datasearch.gesis.org
    Updated May 18, 2018
    Cite
    Jacob Kaplan (2018). Uniform Crime Reporting (UCR) Program Data: Hate Crime Data 1992-2016 [Dataset]. http://doi.org/10.3886/E103500V3
    Dataset updated
    May 18, 2018
    Dataset provided by
    University of Pennsylvania
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1992 - 2015
    Area covered
    United States
    Description

    Version 3 release notes: Adds data for 2016. Orders rows by year (descending) and ORI.
    Version 2 release notes: Fixes a bug where the Philadelphia Police Department had an incorrect FIPS county code.

    The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. The data sets here combine all data from the years 1992-2015 into a single file. Please note that the files are quite large and may take some time to open.

    Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9 character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim, religious victim, etc.).

    All the data was downloaded from NACJD as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see https://github.com/jacobkap/crime_data.

    The only changes I made to the data are the following: minor changes to column names to make all column names 32 characters or fewer (so it can be saved in a Stata format), changed the name of some UCR offense codes (e.g. from "agg asslt" to "aggravated assault"), made all character values lower case, and reordered columns. I also added state, county, and place FIPS codes from the LEAIC (crosswalk) and generated incident month, weekday, and month-day variables from the incident date variable included in the original data.

    The zip file contains a codebook and the data in the following formats:

    • .csv - Microsoft Excel
    • .dta - Stata
    • .sav - SPSS
    • .rda - R

    If you have any questions, comments, or suggestions please contact me at jkkaplan6@gmail.com.
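
    As a rough illustration of two of the cleaning steps described above (building a combined ID and keeping column names within Stata's 32-character limit), here is a minimal Python sketch. The author's actual cleaning code is in R at the repository linked above; the function names, the underscore separator, and the example values here are assumptions made for illustration.

```python
def make_unique_id(year, ori9, incident_number, sep="_"):
    """Combine year, ORI9 and incident number into one lower-case key.

    The separator is an assumption; the source only says the three
    columns were combined.
    """
    parts = (str(year), str(ori9), str(incident_number))
    return sep.join(parts).lower()


def shorten_column_names(names, max_len=32):
    """Truncate names to max_len characters; fail loudly on collisions."""
    shortened = [name[:max_len] for name in names]
    if len(set(shortened)) != len(set(names)):
        raise ValueError("truncation produced duplicate column names")
    return shortened


# Hypothetical values, for illustration only.
example_id = make_unique_id(2016, "XX0000000", 17)
example_names = shorten_column_names(["a" * 40, "incident_date"])
print(example_id, example_names)
```

    Plain truncation can make two long names identical, which is why the sketch raises an error instead of silently overwriting a column.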

  13. Replication Data for: How Do Electoral Incentives Affect Legislator...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Sep 28, 2021
    Alexander Fouirnaies; Andrew B. Hall (2021). Replication Data for: How Do Electoral Incentives Affect Legislator Behavior? Evidence from U.S. State Legislatures [Dataset]. http://doi.org/10.7910/DVN/LHTRWM
    Dataset updated
    Sep 28, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Alexander Fouirnaies; Andrew B. Hall
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States
    Description

    This folder contains the raw data and the Stata code to produce the dataset used in the paper "How Do Electoral Incentives Affect Legislator Behavior?". The code produces all tables and figures in the paper and in the appendix. To replicate the findings, download the replication folder with all materials and set the working directory to this folder. Run the file replicate_how_do_electoral_incentives_affect_legislator_behavior.do in Stata. This will produce the main dataset from the raw input data, produce all the tables and figures, and save them in the folder tables_figures. The individual results can also be replicated using the dataset termlimited.dta in the data_output folder and the relevant do files. The do file electoral_incentives.do shows which do file is needed to replicate a particular table or figure in the paper or appendix.

  14. Replication Data for "Core Political Values and the Long-Term Shaping of...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 22, 2018
    Geoffrey Evans; Anja Neundorf (2018). Replication Data for "Core Political Values and the Long-Term Shaping of Partisanship" [Dataset]. http://doi.org/10.7910/DVN/VJTN9Z
    Dataset updated
    Aug 22, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Geoffrey Evans; Anja Neundorf
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The article uses a dataset, which cannot be deposited online, but is freely available to registered users. The data of the British Household Panel Study can be requested via https://discover.ukdataservice.ac.uk/catalogue/?sn=5151. Here we provide a STATA do-file that will create the working file, recode the original data and run some robustness tests. The data was prepared in Stata and then saved as SPSS files .sav using Stattrans. This was necessary, as the main cross-lagged latent class models of the paper were estimated using LatentGOLD, which only reads .sav files. Here we also provide the syntax files that were used for estimating these models.

  15. Replication Data for: A High Court Plays the Accordion: Validating Ex Ante...

    • dataverse.azure.uit.no
    • dataverse.no
    tsv, txt
    Updated Sep 28, 2023
    Henrik L. Bentsen; Gunnar Grendstad; William R. Shaffer; Eric N. Waltenburg (2023). Replication Data for: A High Court Plays the Accordion: Validating Ex Ante Case Complexity on Oral Arguments [Dataset]. http://doi.org/10.18710/DWIX6Y
    Available download formats: tsv (235966), txt (213402), txt (6671)
    Dataset updated
    Sep 28, 2023
    Dataset provided by
    DataverseNO
    Authors
    Henrik L. Bentsen; Gunnar Grendstad; William R. Shaffer; Eric N. Waltenburg
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The data set (saved in Stata *.dta and .txt) contains all observations (Norwegian supreme court cases 2008-2018 decided in five-justice panels) and variables (independent variables measuring complexity of cases and the dependent variable measuring time in hours scheduled for oral arguments) relevant for a complete replication of the study.

    ABSTRACT OF STUDY: While high courts with fixed time for oral arguments deprive researchers of the opportunity to extract temporal variance, courts that apply the “accordion model” institutional design and adjust the time for oral arguments according to the perceived complexity of a case are a boon for research that seeks to validate case complexity well ahead of the courts’ opinion writing. We analyse an original data set of all 1,402 merits decisions of the Norwegian Supreme Court from 2008 to 2018 where the justices set time for oral arguments to accommodate the anticipated difficulty of the case. Our validation model empirically tests whether and how attributes of a case associated with ex ante complexity are linked with time allocated for oral arguments. Cases that deal with international law and civil law, have several legal players, or are cross-appeals from lower courts are indicative of greater case complexity. We argue that these results speak powerfully to the use of case attributes and/or the time reserved for oral arguments as ex ante measures of case complexity. To enhance the external validity of our findings, future studies should examine whether these results are confirmed in high courts with similar institutional design for oral arguments. Subsequent analyses should also test the degree to which complex cases and/or time for oral arguments have predictive validity on more divergent opinions among the justices and on the time courts and justices need to render a final opinion.

  16. Expropriation of the Church's wealth and political conflict in 19th century...

    • openicpsr.org
    stata
    Updated Dec 17, 2018
    Mateo Uribe-Castro (2018). Expropriation of the Church's wealth and political conflict in 19th century Colombia [Dataset]. http://doi.org/10.3886/E107803V2
    Available download formats: stata
    Dataset updated
    Dec 17, 2018
    Dataset provided by
    University of Maryland, College Park
    Authors
    Mateo Uribe-Castro
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Colombia
    Description

    Replication files for the paper "Expropriation of the Church's wealth and political violence in 19th century Colombia." It includes a complete dataset (in folder data) and a Stata do-file to replicate the tables and figures from the paper. The other folders are now empty but the program is written to save figures and tables in them.

  17. Replication package for "Religion exhibits the greatest cultural diversity...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Oct 28, 2025
    Knudsen, Anne Sofie Beck; Bentzen, Jeanet Sinding; Norenzayan, Ara; Lindbjerg Sperling, Lena (2025). Replication package for "Religion exhibits the greatest cultural diversity across 117 countries" [Dataset]. http://doi.org/10.7910/DVN/OQONVO
    Dataset updated
    Oct 28, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Knudsen, Anne Sofie Beck; Bentzen, Jeanet Sinding; Norenzayan, Ara; Lindbjerg Sperling, Lena
    Description

    Replication package for: Bentzen, J.S., Knudsen, A.S.B., Sperling, L.L., & Norenzayan, A. (2025), "Religion exhibits the greatest cultural diversity across 117 countries", Nature Communications.

    FILES

    • 1_prepare_data.do – Prepares variables from the integrated EVS–WVS dataset.
    • 2_Fig_*.do – Scripts for generating all figures (main text and SI).
    • cntr_id.dta – Crosswalk file mapping country identifiers.
    • readme.txt – This file.

    INSTRUCTIONS

    1. Download the integrated European Values Study (EVS) and World Values Survey (WVS) dataset (1981–2022). Detailed instructions are available here: https://europeanvaluesstudy.eu/methodology-data-documentation/integrated-values-surveys/data-and-documentation/ Access requires free registration.
    2. Save the integrated file as: Integrated_values_surveys_1981-2022.dta
    3. Open Stata (version 17 or higher) and set your working directory at the top of each script (see line 5 in 1_prepare_data.do).
    4. Run the scripts in order:
       - 1_prepare_data.do
       - All 2_Fig_*.do scripts (each produces one or more figures for the main text and Supplementary Information).
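
    The run order in the instructions above can be sketched as a small helper that lists which scripts to run and in what sequence. This is only an illustration in Python (the package itself is run in Stata); the function name and the example figure-script names are hypothetical, and only the 2_Fig_*.do pattern and 1_prepare_data.do come from the readme.

```python
from fnmatch import fnmatchcase


def run_order(filenames):
    """Return the scripts in the order the readme asks for:
    the preparation script first, then every figure script."""
    prepare = [f for f in filenames if f == "1_prepare_data.do"]
    figures = sorted(f for f in filenames if fnmatchcase(f, "2_Fig_*.do"))
    return prepare + figures


# Hypothetical folder listing, for illustration only.
folder_listing = [
    "2_Fig_2.do",
    "readme.txt",
    "1_prepare_data.do",
    "2_Fig_1.do",
    "cntr_id.dta",
]
ordered_scripts = run_order(folder_listing)
print(ordered_scripts)
```

    The readme does not state an order among the figure scripts, so sorting them alphabetically is just a convenient, reproducible choice.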

  18. service trade data by mode of supply and service type.dta

    • figshare.com
    bin
    Updated Jul 22, 2022
    riina kerner (2022). service trade data by mode of supply and service type.dta [Dataset]. http://doi.org/10.6084/m9.figshare.20337501.v4
    Available download formats: bin
    Dataset updated
    Jul 22, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    riina kerner
    License

    CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This is an extraction of services trade statistics from the WTO, used for modes of supply analysis. The data are saved in Stata format and in Excel. The data were extracted from the WTO database available at https://www.wto.org/english/news_e/news19_e/serv_31jul19_e.htm

  19. Merged data set

    • figshare.com
    txt
    Updated Jan 21, 2025
    Huafeng Zhang (2025). Merged data set [Dataset]. http://doi.org/10.6084/m9.figshare.28246769.v1
    Available download formats: txt
    Dataset updated
    Jan 21, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Huafeng Zhang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data we use in this paper were gathered in the 6th round of the Multiple Indicator Cluster Surveys (MICS6), which can be downloaded from https://mics.unicef.org/surveys. The MICS6 surveys are conducted by UNICEF (United Nations International Children's Emergency Fund). We merged the original data from 11 countries and saved the resulting user data in Stata format. In addition, the do-file for the analysis is also published here.

  20. Data from: Ghana EMBRACE Implementation Research

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated May 18, 2021
    Williams, John; Asante, Kwaku Poku; Owusu-Agyei, Seth; Yeji, Francis; okawa, Sumiyo; Addei, Sheila; Kikuchi, Kimiyo; Shibanuma, Akira; Ansah, Evelyn Korkor; Asare, Gloria Quansah; Gyapong, Margaret; Hodgson, Abraham; Tawiah, Charlotte; Oduro, Abraham; Nanishi, Keiko; Jimba, Masamine; Yasuoka, Junko (2021). Ghana EMBRACE Implementation Research [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000931489
    Dataset updated
    May 18, 2021
    Authors
    Williams, John; Asante, Kwaku Poku; Owusu-Agyei, Seth; Yeji, Francis; okawa, Sumiyo; Addei, Sheila; Kikuchi, Kimiyo; Shibanuma, Akira; Ansah, Evelyn Korkor; Asare, Gloria Quansah; Gyapong, Margaret; Hodgson, Abraham; Tawiah, Charlotte; Oduro, Abraham; Nanishi, Keiko; Jimba, Masamine; Yasuoka, Junko
    Area covered
    Ghana
    Description

    The dataset contains the pooled data from the baseline survey (conducted from July 1 to September 30, 2014) and the follow-up survey (conducted from October 1 to December 31, 2015). It is saved in .dta format (Stata 13 or later).

Dylan Brewer; Alyssa Carlson (2025). Addressing sample selection bias for machine learning methods (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9hZGRyZXNzaW5nLXNhbXBsZS1zZWxlY3Rpb24tYmlhcy1mb3ItbWFjaGluZS1sZWFybmluZy1tZXRob2RzLXJlcGxpY2F0aW9uLWRhdGE=

Addressing sample selection bias for machine learning methods (replication data)

Dataset updated
Oct 2, 2025
Dataset provided by
Journal of Applied Econometrics
ZBW
ZBW Journal Data Archive
Authors
Dylan Brewer; Alyssa Carlson
Description

Addressing sample selection bias for machine learning methods (replication data)

Dylan Brewer and Alyssa Carlson

Accepted at Journal of Applied Econometrics, 2023

Overview

This replication package contains files required to reproduce results, tables, and figures using Matlab and Stata. We divide the project into instructions to replicate the simulation, the result from Huang et al. (2006), and the application.

Simulation

For reproducing the simulation results.

Included files in *\Simulation with short descriptions:

  • SSML_simfunc: function that produces individual simulation runs
  • SSML_simulation: script that loops over the SSML_simfunc for different DGPs and multiple simulation runs
  • SSML_figures: script that generates all figures for the paper
  • SSML_compilefunc: function that compiles the results from SSML_simulation for the SSML_figures script

Steps for replicating simulation:

  1. Save SSML_simfunc, SSML_simulation, SSML_figures, SSML_compilefunc to the same folder. This location will be referred to as the FILEPATH.
  2. Create OUTPUT folder inside the FILEPATH location.
  3. Change the FILEPATH location inside SSML_simulation and SSML_figures.
  4. Run SSML_simulation to produce simulation data and results.
  5. Run SSML_figures to produce figures.
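
Steps 1 and 2 above can be checked before opening Matlab. The sketch below is a hypothetical Python pre-flight helper, not part of the replication package: it reports which of the four named files are missing from FILEPATH and creates the OUTPUT folder. The `.m` extension used in the demonstration is an assumption, since the package description gives the file names without extensions.

```python
import tempfile
from pathlib import Path

# File names taken from the list above (extensions are not given there).
REQUIRED_STEMS = (
    "SSML_simfunc",
    "SSML_simulation",
    "SSML_figures",
    "SSML_compilefunc",
)


def missing_files(filepath):
    """Return the required file names not found in the folder,
    matching on the name without its extension."""
    folder = Path(filepath)
    stems = {p.stem for p in folder.iterdir() if p.is_file()}
    return [name for name in REQUIRED_STEMS if name not in stems]


def ensure_output_folder(filepath):
    """Create the OUTPUT folder inside FILEPATH if it does not exist."""
    output = Path(filepath) / "OUTPUT"
    output.mkdir(exist_ok=True)
    return output


# Demonstration in a temporary folder with one file deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    for stem in REQUIRED_STEMS[:3]:
        (Path(tmp) / f"{stem}.m").touch()  # ".m" is an assumed extension
    demo_missing = missing_files(tmp)
    demo_output_exists = ensure_output_folder(tmp).is_dir()

print(demo_missing, demo_output_exists)
```

Steps 3 to 5 (editing FILEPATH inside the scripts and running them) still have to be done in Matlab itself.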

Huang et al replication

For reproducing the Huang et al. (2006) replication results.

Included files in *\HuangetalReplication with short descriptions:

  • SSML_huangrep: script that replicates the results from Huang et al. (2006)

Obtaining the dataset:

Go to https://archive.ics.uci.edu/dataset/14/breast+cancer and save file as "breast-cancer-wisconsin.data"

Steps for replicating results:

  1. Save SSML_huangrep and the breast cancer data to the same folder. This location will be referred to as the FILEPATH.
  2. Change the FILEPATH location inside SSML_huangrep.
  3. Run SSML_huangrep to produce results and figures.

Application

For reproducing the application section results.

Included program files in *\Application with short descriptions:

  • G0_main_202308.do: Stata wrapper code that will run all application replication files
  • G1_cqclean_202308.do: Cleans election outcomes data
  • G2_cqopen_202308.do: Cleans open elections data
  • G3_demographics_cainc30_202308.do: Cleans demographics data
  • G4_fips_202308.do: Cleans FIPS code data
  • G5_klarnerclean_202308.do: Cleans Klarner gubernatorial data
  • G6_merge_202308.do: Merges cleaned datasets together
  • G7_summary_202308.do: Generates summary statistics tables and figures
  • G8_firststage_202308.do: Runs L1 penalized probit for the first stage
  • G9_prediction_202308.m: Trains learners and makes predictions
  • G10_figures_202308.m: Generates figures of prediction patterns
  • G11_final_202308.do: Generates final figures and tables of results
  • r1_lasso_alwayskeepCF_202308.do: Examines the effect of requiring that the control function not be dropped from LASSO
  • latexTable.m: Code by Eli Duenisch to write LaTeX tables from Matlab (https://www.mathworks.com/matlabcentral/fileexchange/44274-latextable)

Included non-confidential data in subdirectory `*\Application\Data`:

Confidential data suppressed in subdirectory `*\Application\CD`:

These data cannot be transferred as part of the data use agreement with the CQ Press. Thus, the files are not included.

There is no batch download--downloads for each year must be done by hand. For each year, download as many state outcomes as possible and name the files YYYYa.csv, YYYYb.csv, etc. (Example: 1970a.csv, 1970b.csv, 1970c.csv, 1970d.csv). See line 18 of G1_cqclean_202308.do for file structure information.
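
The naming scheme described above (YYYYa.csv, YYYYb.csv, and so on) can be sketched as a small helper for checking or planning the manual downloads. This is a hypothetical Python illustration, not part of the replication package; the function name and the 26-file cap (one letter per file) are assumptions.

```python
import string


def cq_filenames(year, n_files):
    """Return the expected file names for one year's downloads,
    e.g. 1970a.csv, 1970b.csv, ... in download order."""
    if not 1 <= n_files <= len(string.ascii_lowercase):
        raise ValueError("n_files must be between 1 and 26")
    letters = string.ascii_lowercase[:n_files]
    return [f"{year}{letter}.csv" for letter in letters]


names_1970 = cq_filenames(1970, 4)
print(names_1970)
```

For the example given in the text, four downloads for 1970 yield the names 1970a.csv through 1970d.csv.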

Steps for replicating application:

  1. Download confidential data from the CQ Press.
  2. Change the working directory in G0_main_202308.do on line 18 to the application folder.
  3. Change local matlabpath in G0_main_202308.do on line 18 to the appropriate location.
  4. Set directory and file path in G9_prediction_202308.m and G10_figures_202308.m as necessary.
  5. Run G0_main_202308.do in Stata to run all programs.
  6. All output (figures and tables) will be saved to subdirectory *\Application\Output.

Contact

Contact Dylan Brewer (brewer@gatech.edu) or Alyssa Carlson (carlsonah@missouri.edu) for help with replication.
