A MOCK dataset used to show how to import Qualtrics metadata into the codebook R package.
This table contains variable names, labels, and number of missing values. See the complete codebook for more.
| name | label | n_missing |
|---|---|---|
| ResponseSet | NA | 0 |
| Q7 | NA | 0 |
| Q10 | NA | 0 |
This dataset was automatically described using the codebook R package (version 0.9.5).
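For readers who want to reproduce a table like the one above, here is a minimal sketch, assuming the qualtRics and codebook packages; the survey ID is a placeholder, not part of the original description.

```r
# Sketch: pull a survey with its Qualtrics metadata, then summarize it.
# fetch_survey() keeps question labels as variable attributes, which
# codebook_table() turns into a metadata table (name, label, n_missing, ...).
library(qualtRics)
library(codebook)

survey_data <- fetch_survey(surveyID = "SV_xxxxxxxxxxxxxxx")  # placeholder ID
codebook_table(survey_data)
```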
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information on whether festival run information is available through the IMDb data.
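As an illustration of the long-to-wide reduction just described, here is a hedged dplyr sketch; only `fest` is named in the description, so the `film_id` and `fest_date` column names are assumptions.

```r
# Hedged sketch of collapsing the long table to one row per unique film,
# keeping the first sample festival per film; `film_id` and `fest_date`
# are assumed column names, not taken from the codebook.
library(dplyr)

film_long <- read.csv("1_film-dataset_festival-program_long.csv")

film_wide <- film_long %>%
  arrange(film_id, fest_date) %>%  # order festival appearances chronologically
  group_by(film_id) %>%
  slice(1) %>%                     # keep the earliest sample festival per film
  ungroup()
```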
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset presents the survey data in wide format, i.e. all information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts used for web scraping. They were written using R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
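The two methods named above are both implemented in the stringdist R package, so the fuzzy matching step can be illustrated with it; the titles below are invented examples, not taken from the scripts.

```r
# Illustration of the two string-matching methods named above ("cosine", "osa")
# using the stringdist package; the inputs are invented examples.
library(stringdist)

a <- "The Watermelon Woman"
b <- "Watermelon Woman, The"

stringsim(a, b, method = "cosine", q = 2)  # q-gram cosine similarity in [0, 1]
stringsim(a, b, method = "osa")            # optimal string alignment: tolerant of typos
```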
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definitions of variables, such as location, festival name, and festival categories.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The GAPs Data Repository provides a comprehensive overview of available qualitative and quantitative data on national return regimes, now accessible through an advanced web interface at https://data.returnmigration.eu/.
This updated guideline outlines the complete process, starting from the initial data collection for the return migration data repository to the development of a comprehensive web-based platform. Through iterative development, participatory approaches, and rigorous quality checks, we have ensured a systematic representation of return migration data at both national and comparative levels.
The Repository organizes data into five main categories, covering diverse aspects and offering a holistic view of return regimes: country profiles, legislation, infrastructure, international cooperation, and descriptive statistics. These categories, further divided into subcategories, are based on insights from a literature review, existing datasets, and empirical data collection from 14 countries. The selection of categories prioritizes relevance for understanding return and readmission policies and practices, data accessibility, reliability, clarity, and comparability. Raw data is meticulously collected by the national experts.
The transition to a web-based interface builds upon the Repository’s original structure, which was initially developed using REDCap (Research Electronic Data Capture), a secure web application for building and managing online surveys and databases. REDCap ensures systematic data entry and stores the data on Uppsala University’s servers, while significantly improving accessibility and usability as well as data security. It also enables users to export any or all data from the Project when granted full data export privileges. Data can be exported in various formats, including Microsoft Excel, SAS, Stata, R, or SPSS for analysis. At this stage, the Data Repository design team also converted tailored records of available data into public reports accessible to anyone with a unique URL, without the need to log in to REDCap or obtain permission to access the GAPs Project Data Repository. Public reports can be used to share information with stakeholders or external partners without granting them access to the Project or requiring them to set up a personal account. Currently, all public report links inserted in this report are also available on the Repository’s webpage, allowing users to export the original data.
This report also includes a detailed codebook to help users understand the structure, variables, and methodologies used in data collection and organization. This addition ensures transparency and provides a comprehensive framework for researchers and practitioners to effectively interpret the data.
The GAPs Data Repository is committed to providing accessible, well-organized, and reliable data by moving to a centralized web platform and incorporating advanced visuals. This Repository aims to contribute inputs for research, policy analysis, and evidence-based decision-making in the return and readmission field.
Explore the GAPs Data Repository at https://data.returnmigration.eu/.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
The data are in long form, with some studies having multiple lines, and include a sample of children ranging from 3.54 to 13.75 years old. The main effect size is r, the correlation coefficient, and the accompanying sample size is also included. Each article is coded to include a study number, the article name, and its authors, as well as X moderators; a usage sketch follows the moderator list. The moderators are as follows:
- grade_new2 = sample grade category, where 1 = preschool/kindergarten, 2 = secondary
- HME_comp_new = HME component, where 1 = direct activities, 2 = indirect activities, 3 = combination direct and indirect activities, 4 = parent attitudes and/or beliefs, 5 = parent math expectations, 6 = spatial activities, 7 = math talk
- hme_type_nocombo = HME measurement method, where 1 = frequency-based scale, 2 = rating scale, 3 = checklist, 4 = observation
- obs_pr = two-level HME measurement method variable, where 1 = observation-based, 2 = parent-report
- math_dom_nospat = math domain, where 1 = arithmetic operations, 2 = relations, 3 = numbering, 4 = multiple domains
- symbolic_nonsymbolic = refers to math assessment, where 1 = symbolic, 2 = non-symbolic, 3 = combination symbolic and non-symbolic
- timed_new = refers to math assessment, where 1 = timed, 2 = untimed, . = combination timed and untimed
- composite = refers to math assessment, where 1 = composite, 2 = single math assessment
- std_new = refers to math assessment, where 1 = standardized, 2 = unstandardized, . = combination standardized and unstandardized
- hme_calc = HME calculation method, where 1 = latent factor score, 2 = sum score, 3 = single item
- age = sample age in years
- long_new = refers to effect size, where 1 = longitudinal relation, 2 = concurrent relation
- low_SES = sample SES, where 1 = low SES (50% or more), 2 = average SES, 3 = high SES (50% or more)
- parent_ed = sample SES in terms of parent education level, based on the percentage of parents reported to have completed any post-secondary education (included a vocational certification, attended some college, and/or completed an associate’s, bachelor’s, or graduate degree program). The percentage was converted into a decimal value ranging from .00 to 1.00.
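A typical use of a file like this is a moderated meta-analysis with the metafor package. The sketch below is hedged: the file name and the lowercase `r`, `n` column names are assumptions, while `HME_comp_new` is one of the moderators listed above.

```r
# Hedged sketch: convert r to Fisher's z and fit a random-effects model
# with one of the listed moderators; `r`, `n`, and the file name are assumed.
library(metafor)

dat <- read.csv("hme_meta_data.csv")  # hypothetical file name

dat <- escalc(measure = "ZCOR", ri = r, ni = n, data = dat)  # yi = z, vi = 1/(n-3)
res <- rma(yi, vi, mods = ~ factor(HME_comp_new), data = dat)
summary(res)
```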
analyze the health and retirement study (hrs) with r. the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.
this new github repository contains five scripts:

- 1992 - 2010 download HRS microdata.R: loop through every year and every file, download, then unzip everything in one big party
- import longitudinal RAND contributed files.R: create a SQLite database (.db) on the local disk, then load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)
- longitudinal RAND - analysis examples.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program, create two database-backed complex sample survey objects using a taylor-series linearization design, and perform a mountain of analysis examples with wave weights from two different points in the panel
- import example HRS file.R: load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html), parse through the IF block at the bottom of the sas importation script, blank out a number of variables, then save the file as an R data file (.rda) for fast loading later
- replicate 2002 regression.R: connect to the sql database created by the 'import longitudinal RAND contributed files' program, create a database-backed complex sample survey object using a taylor-series linearization design, and exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

click here to view these five scripts. for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage, rand's hrs homepage, the hrs wikipedia page, and a running list of publications using hrs. notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
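to give a flavor of the database-backed survey object those scripts build, here's a rough sketch using the survey package; the rand hrs variable names and the table name are assumptions following rand's naming conventions, not a copy of the scripts.

```r
# rough sketch of a database-backed, taylor-series linearization design;
# variable names (raehsamp, raestrat, r10wtresp, r10shlt) follow rand hrs
# conventions but should be treated as assumptions here
library(survey)
library(RSQLite)

hrs_design <-
  svydesign(
    id      = ~raehsamp,    # sampling error computation unit
    strata  = ~raestrat,    # sampling error stratum
    weights = ~r10wtresp,   # respondent-level weight for one wave
    nest    = TRUE,
    dbtype  = "SQLite",
    dbname  = "rand_hrs.db",       # database built by the import script
    data    = "rand_contributed"   # assumed table name inside the database
  )

svymean(~r10shlt, hrs_design, na.rm = TRUE)  # e.g., wave-10 self-reported health
```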
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
Note: while the synthetic versions of the datasets resemble the real ones in several aspects, the users should be aware that these data are fake and must not be used for testing and making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The original datasets are described in the article by Vanoli et al in Epidemiology (2024) (DOI: 10.1097/EDE.0000000000001796) [freely available here], which also provides information about the data sources.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
Content
The series of synthetic datasets (stored in two versions with csv and RDS formats) are the following:
synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
synthbdbasevar: baseline variables, mostly collected at recruitment.
synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.
In addition, this repository provides these additional files (a usage sketch follows the list):
codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].
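For illustration, the two geographic files can be combined for a quick map with the sf package; the coordinate column names in asscentre.csv are assumptions, and the shapefile must be unzipped first.

```r
# Hedged sketch: plot the GB country boundaries and the assessment centres.
# Column names `easting`/`northing` in asscentre.csv are assumed.
library(sf)

gb <- st_read("Countries_December_2022_GB_BUC.shp")  # after unzipping the archive
centres <- read.csv("asscentre.csv")

# British National Grid coordinates correspond to EPSG:27700
centres_sf <- st_as_sf(centres, coords = c("easting", "northing"), crs = 27700)

plot(st_geometry(gb))
plot(st_geometry(centres_sf), add = TRUE, pch = 19)
```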
Generation of the synthetic data
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data including the annual PM2.5 levels in a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazard model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables as well as the mortality risks resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are used only for illustrative purposes, and they must not be used to test other research hypotheses.
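A compressed sketch of those two steps, assuming the synthpop and survival packages; the object and variable names below are illustrative, not taken from the repo.

```r
# Step 1: synthesize the merged wide-format data (one row per subject).
library(synthpop)
library(survival)

syn_obj <- syn(ukb_wide, seed = 42)  # ukb_wide: merged real data (illustrative name)
ukb_syn <- syn_obj$syn               # synthetic version with the same structure

# Step 2: fit a Cox model on the original data, then use the estimated
# relationships to simulate death events among the synthetic subjects.
cox_fit <- coxph(Surv(time, death) ~ pm25 + age + sex, data = ukb_wide)
risk <- predict(cox_fit, newdata = ukb_syn, type = "risk")  # basis for simulated events
```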
Updated 30 January 2023
There has been some confusion around licensing for this data set. Dr. Carla Patalano and Dr. Rich Huebner are the original authors of this dataset.
We provide a license to anyone who wishes to use this dataset for learning or teaching. For the purposes of sharing, please follow this license:
CC-BY-NC-ND: This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
https://rpubs.com/rhuebner/hrd_cb_v14
PLEASE NOTE -- I recently updated the codebook - please use the above link. A few minor discrepancies were identified between the codebook and the dataset. Please feel free to contact me through LinkedIn (www.linkedin.com/in/RichHuebner) to report discrepancies and make requests.
HR data can be hard to come by, and HR professionals generally lag behind with respect to analytics and data visualization competency. Thus, Dr. Carla Patalano and I set out to create our own HR-related dataset, which is used in one of our graduate MSHRM courses called HR Metrics and Analytics, at New England College of Business. We created this data set ourselves. We use the data set to teach HR students how to use and analyze the data in Tableau Desktop - a data visualization tool that's easy to learn.
This version provides a variety of features that are useful for both data visualization AND creating machine learning / predictive analytics models. We are working on expanding the data set even further by generating even more records and a few additional features. We will be keeping this as one file/one data set for now. There is a possibility of creating a second file perhaps down the road where you can join the files together to practice SQL/joins, etc.
Note that this dataset isn't perfect. By design, there are some issues that are present. It is primarily designed as a teaching data set - to teach human resources professionals how to work with data and analytics.
We have reduced the complexity of the dataset down to a single data file (v14). The CSV revolves around a fictitious company and the core data set contains names, DOBs, age, gender, marital status, date of hire, reasons for termination, department, whether they are active or terminated, position title, pay rate, manager name, and performance score.
Recent additions to the data include:
- Absences
- Most Recent Performance Review Date
- Employee Engagement Score
Dr. Carla Patalano provided the baseline idea for creating this synthetic data set, which has been used now by over 200 Human Resource Management students at the college. Students in the course learn data visualization techniques with Tableau Desktop and use this data set to complete a series of assignments.
We've included some open-ended questions that you can explore and try to address through creating Tableau visualizations, or R or Python analyses. Good luck and enjoy the learning!
There are many other interesting questions that could be addressed through this data set. Dr. Patalano and I look forward to seeing what we can come up with.
If you have any questions or comments about the dataset, please do not hesitate to reach out to me on LinkedIn: http://www.linkedin.com/in/RichHuebner
You can also reach me via email at: Richard.Huebner@go.cambridgecollege.edu
https://doi.org/10.23668/psycharchives.4988
Citizen Science (CS) projects play a crucial role in engaging citizens in conservation efforts. While implicitly mostly considered as an outcome of CS participation, citizens may also have a certain attitude toward engagement in CS when starting to participate in a CS project. Moreover, there is a lack of CS studies that consider changes over longer periods of time. Therefore, this research presents two-wave data from four field studies of a CS project about urban wildlife ecology using cross-lagged panel analyses. We investigated the influence of attitudes toward engagement in CS on self-related, ecology-related, and motivation-related outcomes. We found that positive attitudes toward engagement in CS at the beginning of the CS project had positive influences on participants’ psychological ownership and pride in their participation, their attitudes toward and enthusiasm about wildlife, and their internal and external motivation two months later. We discuss the implications for CS research and practice. Dataset for: Greving, H., Bruckermann, T., Schumann, A., Stillfried, M., Börner, K., Hagen, R., Kimmig, S. E., Brandt, M., & Kimmerle, J. (2023). Attitudes Toward Engagement in Citizen Science Increase Self-Related, Ecology-Related, and Motivation-Related Outcomes in an Urban Wildlife Project. BioScience, 73(3), 206–219. https://doi.org/10.1093/biosci/biad003: Codebook (CSV format) of the variables of all field studies
https://www.usa.gov/government-works
The Current Population Survey Civic Engagement and Volunteering (CEV) Supplement is the most robust longitudinal survey about volunteerism and other forms of civic engagement in the United States. Produced by AmeriCorps in partnership with the U.S. Census Bureau, the CEV takes the pulse of our nation’s civic health every two years. The data on this page was collected in September 2023. The next wave of the CEV will be administered in September 2025.
The CEV can generate reliable estimates at the national level, within states and the District of Columbia, and in the largest twelve Metropolitan Statistical Areas to support evidence-based decision making and efforts to understand how people make a difference in communities across the country.
Click on "Export" to download and review an excerpt from the 2023 CEV Analytic Codebook that shows the variables available in the analytic CEV datasets produced by AmeriCorps.
Click on "Show More" to download and review the following 2023 CEV data and resources provided as attachments:
1) 2023 CEV Dataset Fact Sheet – brief summary of technical aspects of the 2023 CEV dataset
2) CEV FAQs – answers to frequently asked technical questions about the CEV
3) Constructs and measures in the CEV
4) 2023 CEV Analytic Data and Setup Files – analytic dataset in Stata (.dta), R (.rdata), SPSS (.sav), and Excel (.csv) formats, codebook for the analytic dataset, and Stata code (.do) to convert the raw dataset to the analytic formatting produced by AmeriCorps. These files were updated on January 16, 2025 to correct erroneous missing values for the ssupwgt variable
5) 2023 CEV Technical Documentation – codebook for the raw dataset and full supplement documentation produced by U.S. Census Bureau
6) 2023 CEV Raw Data and Read In Files – raw dataset in Stata (.dta) format, Stata code (.do) and dictionary file (.dct) to read the ASCII dataset (.dat) into Stata using layout files (.lis)
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

Version 3 release notes: Adds data in the following formats: Excel. Changes project name to avoid confusing this data for the ones done by NACJD.

Version 2 release notes: Adds data for 2017. Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in the data.

There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto", as they are common data entry error values (e.g. are larger than the agency's population, or are much larger than other crimes or months in the same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) codes and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
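A hedged sketch of the import-and-clean approach described above, using the asciiSetupReader package's read_ascii_setup(); the file names are placeholders, and the column-name prefixes are taken from the description.

```r
# Sketch: read a fixed-width FBI file via a SPSS/SAS setup file, then replace
# the common data-entry error values with NA in the "offenses"/"auto" columns.
# The two file names are placeholders.
library(asciiSetupReader)

prop <- read_ascii_setup(
  data       = "property_stolen_1975.txt",   # raw fixed-width file
  setup_file = "property_stolen_setup.sps"
)

# the 20 error values listed in the description
error_vals <- c(seq(1000, 10000, by = 1000), seq(20000, 100000, by = 10000), 99942)

target <- grep("^(offenses|auto)", names(prop), value = TRUE)
prop[target] <- lapply(prop[target], function(x) replace(x, x %in% error_vals, NA))
```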
https://dataverse.nl/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.34894/JX2FYB
The file describes the data of the SNARE project (Social Networks and Risk Behavior in Early Adolescence). This was a longitudinal research project on the social development of early adolescents with a specific focus on adolescents’ peer relationships and their involvement in risky behavior, including both self-reported data and peer-nomination data. All first-year and second-year students in two secondary schools in the Netherlands were approached to take part in the project (Cohort 1) at the beginning of the academic year 2011–2012. A second cohort of students entering their first year in these secondary schools was asked to take part in the project the following academic year 2012–2013 (Cohort 2). Data were collected three times in one academic year (in the fall, winter, and spring), starting in 2011–2012 (Cohort 1) and 2012–2013 (Cohort 2), respectively. In total, 12 waves of data have been collected. Before data collection started, students received an information letter describing the goal of the study and offering the possibility to refrain from participation. Parents who did not wish their children to participate in the study were asked to indicate this and students were made aware that they could cease their participation at any time. The survey was completed in the classroom by computer, supervised by a researcher, using Bright Answer socio software (SNARE software 2011). The privacy and anonymity of the students were warranted, and the study was approved by the Internal Review Board (IRB) of Utrecht University (see also Franken et al. 2016; the project name is “Social Network Processes and Social Development of Children and Adolescents”).
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
The WIC Infant and Toddler Feeding Practices Study–2 (WIC ITFPS-2) (also known as the “Feeding My Baby Study”) is a national, longitudinal study that captures data on caregivers and their children who participated in the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) around the time of the child’s birth. The study addresses a series of research questions regarding feeding practices, the effect of WIC services on those practices, and the health and nutrition outcomes of children on WIC. Additionally, the study assesses changes in behaviors and trends that may have occurred over the past 20 years by comparing findings to the WIC Infant Feeding Practices Study–1 (WIC IFPS-1), the last major study of the diets of infants on WIC. This longitudinal cohort study has generated a series of reports. These datasets include data from caregivers and their children during the prenatal period and during the children’s first five years of life (child ages 1 to 60 months). A full description of the study design and data collection methods can be found in Chapter 1 of the Second Year Report (https://www.fns.usda.gov/wic/wic-infant-and-toddler-feeding-practices-st...). A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).

Processing methods and equipment used: Data in this dataset were primarily collected via telephone interview with caregivers. Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers. The study team cleaned the raw data to ensure the data were as correct, complete, and consistent as possible.

Study date(s) and duration: Data collection occurred between 2013 and 2019.

Study spatial scale (size of replicates and spatial scale of study area): Respondents were primarily the caregivers of children who received WIC services around the time of the child’s birth. Data were collected from 80 WIC sites across 27 State agencies.

Level of true replication: Unknown.

Sampling precision (within-replicate sampling or pseudoreplication): This dataset includes sampling weights that can be applied to produce national estimates. A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).

Level of subsampling (number and repeat or within-replicate sampling): A full description of the sampling and weighting procedures can be found in Appendix B-1 of the Fourth Year Report (https://fns-prod.azureedge.net/sites/default/files/resource-files/WIC-IT...).

Study design (before–after, control–impacts, time series, before–after-control–impacts): Longitudinal cohort study.

Description of any data manipulation, modeling, or statistical analysis undertaken: Each entry in the dataset contains caregiver-level responses to telephone interviews. Also available in the dataset are children’s length/height and weight data, which were objectively collected while at the WIC clinic or during visits with healthcare providers. In addition, the file contains derived variables used for analytic purposes. The file also includes weights created to produce national estimates. The dataset does not include any personally-identifiable information for the study children and/or for individuals who completed the telephone interviews.
Description of any gaps in the data or other limiting factors: Please refer to the series of annual WIC ITFPS-2 reports (https://www.fns.usda.gov/wic/infant-and-toddler-feeding-practices-study-2-fourth-year-report) for detailed explanations of the study’s limitations.

Outcome measurement methods and equipment used: The majority of outcomes were measured via telephone interviews with children’s caregivers. Dietary intake was assessed using the USDA Automated Multiple Pass Method (https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-h...). Children’s length/height and weight data were objectively collected while at the WIC clinic or during visits with healthcare providers.

Resources in this dataset:
- Resource Title: ITFP2 Year 5 Enroll to 60 Months Public Use Data CSV. File Name: itfps2_enrollto60m_publicuse.csv
- Resource Title: ITFP2 Year 5 Enroll to 60 Months Public Use Data Codebook. File Name: ITFPS2_EnrollTo60m_PUF_Codebook.pdf
- Resource Title: ITFP2 Year 5 Enroll to 60 Months Public Use Data SAS SPSS STATA R Data. File Name: ITFP@_Year5_Enroll60_SAS_SPSS_STATA_R.zip
- Resource Title: ITFP2 Year 5 Ana to 60 Months Public Use Data CSV. File Name: ampm_1to60_ana_publicuse.csv
- Resource Title: ITFP2 Year 5 Tot to 60 Months Public Use Data Codebook. File Name: AMPM_1to60_Tot Codebook.pdf
- Resource Title: ITFP2 Year 5 Ana to 60 Months Public Use Data Codebook. File Name: AMPM_1to60_Ana Codebook.pdf
- Resource Title: ITFP2 Year 5 Ana to 60 Months Public Use Data SAS SPSS STATA R Data. File Name: ITFP@_Year5_Ana_60_SAS_SPSS_STATA_R.zip
- Resource Title: ITFP2 Year 5 Tot to 60 Months Public Use Data CSV. File Name: ampm_1to60_tot_publicuse.csv
- Resource Title: ITFP2 Year 5 Tot to 60 Months Public Use SAS SPSS STATA R Data. File Name: ITFP@_Year5_Tot_60_SAS_SPSS_STATA_R.zip
- Resource Title: ITFP2 Year 5 Food Group to 60 Months Public Use Data CSV. File Name: ampm_foodgroup_1to60m_publicuse.csv
- Resource Title: ITFP2 Year 5 Food Group to 60 Months Public Use Data Codebook. File Name: AMPM_FoodGroup_1to60m_Codebook.pdf
- Resource Title: ITFP2 Year 5 Food Group to 60 Months Public Use SAS SPSS STATA R Data. File Name: ITFP@_Year5_Foodgroup_60_SAS_SPSS_STATA_R.zip
- Resource Title: WIC Infant and Toddler Feeding Practices Study-2 Data File Training Manual. File Name: WIC_ITFPS-2_DataFileTrainingManual.pdf
Dataset and Codebook of Experiment 3 for: Jung, A. C., Lück, I., & Fischer, R. (2024). The costs of shifting from dual-task to single-task processing: Applying the fade-out paradigm to dual tasking. Journal of Experimental Psychology: Learning, Memory, and Cognition. Advance online publication. https://dx.doi.org/10.1037/xlm0001414 Cognitive control processes mirror fast and dynamic adaptation toward a change in the environment. When performing dual tasks, mental representations of dual-task-specific control requirements and the task-pair set are established that help to manage dual-task processing (Hirsch et al., 2017, 2018; Hommel, 2004, 2020). In the present study, we investigated to what extent such higher order representations of dual-task processing persist even if major characteristics of the task context change, for example, if one of the tasks of a dual task becomes irrelevant. For this, we adapted the fade-out paradigm (Mayr & Liebscher, 2001) to a dual-task setting and tested whether fade-out costs appear. Performance of pure Task 1 single tasking was compared to the performance of Task 1 processing right after dual-task trials (fade-out phase). Results showed that performance in this fade-out block did not immediately drop to single-task performance (fade-out costs), indicating the persistence of task-pair set representations (Experiments 1 and 3, N = 40 each). In addition, automatic stimulus–response translation processes continued within the fade-out phase, resulting in ongoing between-task interference. Furthermore, the frequency of between-task interference in dual-task blocks was manipulated (75% vs. 25% incongruence) between participants to establish conflict-biased control states of increased versus relaxed task shielding. These different control states, however, did not modulate fade-out costs (Experiment 2, N = 80). Nevertheless, the persistence of these control adaptations was reflected in manipulation-dependent between-task interference during fade-out trials. Implications of this new evidence are discussed.
Danila, I., Balaszi, R., Taut, D., Baban, A., & Foran, H. (2023). Development and validation of the Parental Expectations Scale (PES) in parents from Romania. European Journal of Psychological Assessment. https://econtent.hogrefe.com/doi/10.1027/1015-5759/a000774 Social cognitive models of parenting consider the role of unrealistic parental expectations (UE) regarding children’s abilities and behaviors as antecedents to the occurrence of child abuse. However, existing self-report measures of UE yield inconsistent results, often failing to differentiate aggressive and non-aggressive parents, raising questions about their validity and utility in understanding maladaptive parenting. To address these concerns, we developed and tested a new measure of parental UE in two samples of parents. The first sample (N = 179) was used to test the initial structure of the scale, while the second sample (N = 249) was used to replicate the structure and examine the concurrent validity, criterion validity, and internal consistency of the new measure. The final scale demonstrated adequate internal consistency, criterion, and concurrent validity. More unrealistic expectations predicted unique variance in parental negative behavior after controlling for other related variables. The current study provides preliminary evidence for the reliability and validity of the Parental Expectations Scale (PES), discussing its utility in the clinical assessment of parents at risk for child abuse and in tailoring parenting interventions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Description of contents
This repository contains the codebook, dataset and analysis scripts in R used for the following publication: Pavlo Burda, Luca Allodi, Nicola Zannone. "Cognition in Social Engineering Empirical Research: a Systematic Literature Review". ACM Transactions on Computer-Human Interaction (TOCHI).
The repository consists of the following files:
## dataset_and_codebook.xlsx contains the dataset, the codebook and a detailed description of contents.
## scripts/ contains the R scripts used for the analysis
## readme.txt contains this readme
# Dataset and codebook
The dataset_and_codebook.xlsx contains the following sheets:
## Codebook
Contains a detailed description of the dataset (tables, columns, fields, etc.).
The codebook describes the concepts and variables that are present in the dataset. This includes explanations on meaning, numerical values, classification schemes and labels.
## Hypotheses table
Contains the hypotheses for all analyzed papers and is used in the results of the paper.
Each row is a hypothesis of an included paper; cells contain one or more values (e.g., value1,value2,...) with or without sub-values (e.g., value1,value2(sub-value1, ...)). Any content in square brackets [] is ignored in the analysis. Empty cells mean that there is no applicable value for that column.
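A small base-R sketch of how a cell can be parsed under these rules; the cell content is invented, and this is not the repository's own parsing code.

```r
# Hedged sketch: drop bracketed content, then split on top-level commas
# (commas inside parentheses belong to sub-values and are kept together).
cell <- "value1,value2(sub-value1, sub-value2),[ignored note]"

cell <- gsub("\\[[^]]*\\]", "", cell)                        # remove [...] content
values <- strsplit(cell, ",(?![^(]*\\))", perl = TRUE)[[1]]  # split on top-level commas
values <- trimws(values[values != ""])
values
#> "value1"  "value2(sub-value1, sub-value2)"
```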
## Papers table
Contains the analyzed papers and is used in the results of the paper. It is also used in the overview table (Table 6) in Appendix C.
Each row is an included paper; cells contain up to two values (e.g., value1, value2) or the word 'multiple' in case of more than two values. Empty cells mean that there is no applicable value for that column.
## Values table
Contains the description of cell values of 'Hypotheses' and 'Papers' and sub-values (specific variables in a study) that belong to a value.
It has a hierarchical structure from left to right: each field in a column falls under the first non-empty field above it in the column immediately to its left.
# Reproducing results with scripts/ (generate figures)
To run the R scripts included in the 'scripts' directory, it is sufficient to follow the included instructions and run 'RUN_ALL_SCRIPTS.R' in the 'scripts' directory.
The scripts use the TSV (Tab Separated Values) format of the dataset, which is the exact copy of the 'Papers' and 'Hypotheses' tables in the 'dataset_and_codebook.xlsx' file.
The resulting figures are stored in the 'scripts/results' directory.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository includes two datasets for 17,493 New York Times articles covering protest events: a Document-Term Matrix and associated metadata, both saved as single R objects.
These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text for the news articles was not included in the original DoCA data.
We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.
We then subset the original DoCA dataset to include only rows that match a recollected article. The "20231006_dca_metadata_subset.Rds" file contains all of the metadata variables from the original DoCA dataset (see Codebook), with the addition of "pdf_file" and "pub_title" (the title of the recollected article, which may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (noting that a row is a distinct protest event and one article may cover more than one protest event).
Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: we removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its right (e.g., turning "JohnKennedy" into "John Kennedy"); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to "Basic Latin" ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's" into "it is"); removed punctuation; removed capitalization; removed numbers; fixed word kerning; and applied a final extra round of whitespace removal.
We then tokenized them following the rule that each word is a character string surrounded by a single space. At this step, each document is a list of tokens. We count each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times each token occurred in each article. Finally, we removed words (i.e., columns in the DTM) that occurred less than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens, and 17,493 documents. The "20231006_dca_dtm.Rds" file is a sparse matrix class object from the Matrix R package.
In R, use the load() function to load the objects `dca_dtm` and `dca_meta`. To associate the `dca_meta` to the `dca_dtm` , match the "pdf_file" variable in`dca_meta` to the rownames of `dca_dtm`.
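In code, that workflow looks roughly like this (file names as distributed above; load() is used because the description specifies it):

```r
library(Matrix)  # dca_dtm is a sparse matrix class object from this package

load("20231006_dca_dtm.Rds")              # provides `dca_dtm`
load("20231006_dca_metadata_subset.Rds")  # provides `dca_meta`

# a row of dca_meta is a protest event; one article can cover several events,
# so match metadata rows onto DTM rows via the article file name
idx <- match(dca_meta$pdf_file, rownames(dca_dtm))
dtm_by_event <- dca_dtm[idx, ]
```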
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Background: In 1986, the Congress enacted Public Laws 99-500 and 99-591, requiring a biennial report on the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC). In response to these requirements, FNS developed a prototype system that allowed for the routine acquisition of information on WIC participants from WIC State Agencies. Since 1992, State Agencies have provided electronic copies of these data to FNS on a biennial basis. FNS and the National WIC Association (formerly National Association of WIC Directors) agreed on a set of data elements for the transfer of information. In addition, FNS established a minimum standard dataset for reporting participation data. For each biennial reporting cycle, each State Agency is required to submit a participant-level dataset containing standardized information on persons enrolled at local agencies for the reference month of April. The 2020 Participant and Program Characteristics (PC2020) is the 17th to be completed using the prototype PC reporting system. In April 2020, there were 89 State agencies: the 50 States, American Samoa, the District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, the U.S. Virgin Islands, and 33 Indian Tribal Organizations (ITOs).

Processing methods and equipment used: Specifications on formats (“Guidance for States Providing Participant Data”) were provided to all State agencies in January 2020. This guide specified 20 minimum dataset (MDS) elements and 11 supplemental dataset (SDS) elements to be reported on each WIC participant. Each State Agency was required to submit all 20 MDS items and any SDS items collected by the State agency.

Study date(s) and duration: The information for each participant was from the participant’s most current WIC certification as of April 2020.

Study spatial scale (size of replicates and spatial scale of study area): In April 2020, there were 89 State agencies: the 50 States, American Samoa, the District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, the U.S. Virgin Islands, and 33 Indian Tribal Organizations (ITOs).

Level of true replication: Unknown.

Sampling precision (within-replicate sampling or pseudoreplication):
State Agency Data Submissions. PC2020 is a participant dataset consisting of 7,036,867 active records. The records, submitted to USDA by the State Agencies, comprise a census of all WIC enrollees, so there is no sampling involved in the collection of this data.
PII Analytic Datasets. State agency files were combined to create a national census participant file of approximately 7 million records. The census dataset contains potentially personally identifiable information (PII) and is therefore not made available to the public.
National Sample Dataset. The public use SAS analytic dataset made available to the public has been constructed from a nationally representative sample drawn from the census of WIC participants, selected by participant category. The national sample consists of 1 percent of the total number of participants, or 70,368 records. The distribution by category is 5,469 pregnant women, 6,131 breastfeeding women, 4,373 postpartum women, 16,817 infants, and 37,578 children.

Level of subsampling (number and repeat or within-replicate sampling): The proportionate (or self-weighting) sample was drawn by WIC participant category: pregnant women, breastfeeding women, postpartum women, infants, and children. In this type of sample design, each WIC participant has the same probability of selection across all strata.
Sampling weights are not needed when the data are analyzed. In a proportionate stratified sample, the largest stratum accounts for the highest percentage of the analytic sample.

Study design (before–after, control–impacts, time series, before–after-control–impacts): None; non-experimental.

Description of any data manipulation, modeling, or statistical analysis undertaken: Each entry in the dataset contains all MDS and SDS information submitted by the State agency on the sampled WIC participant. In addition, the file contains constructed variables used for analytic purposes. To protect individual privacy, the public use file does not include State agency, local agency, or case identification numbers.

Description of any gaps in the data or other limiting factors: All State agencies provided data on a census of their WIC participants.

Resources in this dataset:
- Resource Title: WIC PC 2020 National Sample File Public Use Codebook. File Name: PC2020 National Sample File Public Use Codebook.docx
- Resource Title: WIC PC 2020 Public Use CSV Data. File Name: wicpc2020_public_use.csv
- Resource Title: WIC PC 2020 Data Set SAS, R, SPSS, Stata. File Name: PC2020 Ag Data Commons.zip; Resource Description: One dataset in multiple formats
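For illustration, a 1 percent proportionate stratified draw like the one described above can be expressed with dplyr; `wic_census` and `participant_category` are assumed names, since the census file itself is not public.

```r
# Hedged sketch: proportionate stratified sample by participant category;
# every participant has the same selection probability in every stratum.
library(dplyr)

set.seed(2020)
national_sample <- wic_census %>%
  group_by(participant_category) %>%  # pregnant, breastfeeding, postpartum, infant, child
  slice_sample(prop = 0.01) %>%       # 1 percent within each stratum
  ungroup()
```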
This version (V3) fixes a bug in Version 2 where the 1993 data did not properly handle missing values, leading to enormously inflated crime counts.
This is a collection of Offenses Known and Clearances by Arrest data from 1960 to 2016. The monthly zip files contain one data file per year (57 total, 1960-2016) as well as a codebook for each year. These files have been read into R using the ASCII and setup files from ICPSR (or from the FBI for the 2016 data) with the asciiSetupReader package. The end of each zip folder's name indicates the data format (R, SPSS, SAS, Microsoft Excel CSV, feather, Stata). Due to file size limits on openICPSR, not all file types were included for all of the data.
The files are lightly cleaned: column names and value labels are standardized. In the original data, column names differed between years (e.g. the December burglaries cleared column is "DEC_TOT_CLR_BRGLRY_TOT" in 1975 and "DEC_TOT_CLR_BURG_TOTAL" in 1977). The data here have standardized columns so you can compare between years and combine years together. The same is done for values inside of columns; for example, the state column gave state names in some years and abbreviations in others. For the code used to clean and read the data, please see my GitHub file here: https://github.com/jacobkap/crime_data/blob/master/R_code/offenses_known.R
The zip files labeled "yearly" contain yearly rather than monthly data. These also contain far fewer descriptive columns about the agencies, in an attempt to decrease file size. Each zip folder contains two files: a data file in whichever format you choose and a codebook. The data file is aggregated yearly and already combines every year 1960-2016. For the code I used to do this, see here: https://github.com/jacobkap/crime_data/blob/master/R_code/yearly_offenses_known.R
If you find any mistakes in the data or have any suggestions, please email me at jkkaplan6@gmail.com
As a description of what UCR Offenses Known and Clearances by Arrest data contains, the following is copied from ICPSR's 2015 page for the data. The Uniform Crime Reporting Program Data: Offenses Known and Clearances By Arrest dataset is a compilation of offenses reported to law enforcement agencies in the United States. Due to the vast number of categories of crime committed in the United States, the FBI has limited the types of crime included in this compilation to those which people are most likely to report to police and those which occur frequently enough to be analyzed across time. Crimes included are criminal homicide, forcible rape, robbery, aggravated assault, burglary, larceny-theft, and motor vehicle theft. Much information about these crimes is provided in this dataset. The number of times an offense has been reported, the number of reported offenses that have been cleared by arrests, and the number of cleared offenses which involved offenders under the age of 18 are the major items of information collected.
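A rough sketch of the reading step described above, assuming asciiSetupReader's read_ascii_setup() interface; the file paths and names are hypothetical stand-ins for the ICPSR downloads, not files shipped with this collection.

```r
library(asciiSetupReader)

# Read one year's fixed-width ASCII file using its matching SPSS setup file
# (hypothetical local file names).
offenses_1975 <- read_ascii_setup(
  data       = "offenses_known_1975.txt",  # fixed-width ASCII data file
  setup_file = "offenses_known_1975.sps"   # SPSS setup file from ICPSR
)

# The cleaning then standardizes names across years so files can be stacked,
# e.g. mapping 1975's burglary-clearance column onto the 1977 spelling
# (shown here with base R; the actual cleaning code is in the GitHub link above):
names(offenses_1975)[names(offenses_1975) == "DEC_TOT_CLR_BRGLRY_TOT"] <-
  "DEC_TOT_CLR_BURG_TOTAL"
```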
This dataset contains the original quantitative datafiles, analysis data, a codebook, R scripts, syntax for replication, the original output from RStudio, and figures from a statistical program. The analyses can be found in Chapter 5 of my PhD dissertation, 'Political Factors Affecting the EU Legislative Decision-Making Speed'. The data supporting the findings of this study are accessible and replicable; restrictions apply to the availability of these data, which were used under license for this study. The datafiles include:
File name of R script: Chapter 5 script.R
File name of syntax: Syntax for replication 5.0.docx
File name of the original output from RStudio: The original output 5.0.pdf
File name of codebook: Codebook 5.0.txt
File name of the analysis data: data5.0.xlsx
File name of the dataset: Original quantitative data for Chapter 5.xlsx
File name of the dataset: Codebook of policy responsiveness.pdf
File name of figures: Chapter 5 Figures.zip
Data analysis software: RStudio, with R version 4.1.0 (2021-05-18) -- "Camp Pontanezen", Copyright (C) 2021 The R Foundation for Statistical Computing, Platform: x86_64-apple-darwin17.0 (64-bit)
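A minimal sketch of one way to load the deposited files for replication, assuming they sit in the current working directory; this is not part of the original deposit, and the replication steps themselves live in the deposited script and syntax files.

```r
library(readxl)

# Analysis data and raw data, using the file names listed above.
dat5 <- read_excel("data5.0.xlsx")
raw5 <- read_excel("Original quantitative data for Chapter 5.xlsx")
str(dat5)   # compare variables against Codebook 5.0.txt

# Rerun the deposited analysis script.
source("Chapter 5 script.R")
```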
Background: Exercise interventions are efficacious in reducing disorder-specific symptoms in various mental disorders. However, little is known about the long-term transdiagnostic efficacy of exercise across heterogeneous mental disorders and the potential mechanisms underlying treatment effects.
Methods: Physically inactive outpatients with depressive disorders, anxiety disorders, insomnia, or attention deficit hyperactivity disorder were randomized to a standardized 12-week exercise intervention combining moderate exercise with behaviour change techniques (BCTs) (n = 38) or a passive control group (n = 36). The primary outcome was global symptom severity (Symptom Checklist-90, SCL-90-R); secondary outcomes were self-reported exercise (Physical Activity, Exercise, and Sport Questionnaire), exercise-specific affect regulation (Physical Activity-related Health Competence Questionnaire), and depression (SCL-90-R), assessed at baseline (T1), post-treatment (T2), and one year after post-treatment (T3). Intention-to-treat analyses were conducted using linear mixed models and structural equation modeling.
Results: From T1 to T3, the intervention group significantly improved on global symptom severity (d = -0.43, p = .031), depression among a depressed subsample (d = -0.62, p = .014), exercise (d = 0.45, p = .011), and exercise-specific affect regulation (d = 0.44, p = .028) relative to the control group. The intervention group was also more likely to show clinically significant changes from T1 to T3 (p = .033). Increases in exercise-specific affect regulation mediated intervention effects on global symptom severity (β = -0.28, p = .037) and clinically significant changes (β = -0.24, p = .042).
Conclusions: The exercise intervention showed long-term efficacy in a diagnostically heterogeneous outpatient sample and led to long-lasting exercise behaviour change. Long-term increases in exercise-specific affect regulation within exercise interventions seem to be essential for lasting symptom reduction beyond the intervention period.
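A hedged sketch of the kind of intention-to-treat linear mixed model the abstract describes, in R with lme4; the variable names and simulated data are illustrative only, and the authors' exact model specification is not part of this record.

```r
library(lme4)

# Simulated toy data in the trial's layout: 74 outpatients, three waves.
set.seed(1)
dat <- expand.grid(id = factor(1:74), time = factor(c("T1", "T2", "T3")))
dat$group <- ifelse(as.integer(dat$id) <= 38, "intervention", "control")
dat$scl90 <- rnorm(nrow(dat), mean = 1, sd = 0.4) -
  0.2 * (dat$group == "intervention") * (dat$time == "T3")

# Random intercept per participant; the group-by-time interaction terms
# carry the intervention effect at each follow-up.
fit <- lmer(scl90 ~ group * time + (1 | id), data = dat)
summary(fit)
```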