https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164: ------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem **Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well. Content description =================== - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below - **settings.py** - settings template for the code archive. - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI) - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB. - **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**) - **Interview protocol.pdf** - approximate protocol used for semistructured interviews. - LICENSE - text of GPL v3, under which this dataset is published - INSTALL.md - replication guide (~2 pages)
Replication guide ================= Step 0 - prerequisites ---------------------- - Unix-compatible OS (Linux or OS X) - Python interpreter (2.7 was used; Python 3 compatibility is highly likely) - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible) Depending on detalization level (see Step 2 for more details): - up to 2Tb of disk space (see Step 2 detalization levels) - at least 16Gb of RAM (64 preferable) - few hours to few month of processing time Step 1 - software ---------------- - unpack **ghd-0.1.0.zip**, or clone from gitlab: git clone https://gitlab.com/user2589/ghd.git git checkout 0.1.0 `cd` into the extracted folder. All commands below assume it as a current directory. - copy `settings.py` into the extracted folder. Edit the file: * set `DATASET_PATH` to some newly created folder path * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` - install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose` - install libarchive and headers: `sudo apt-get install libarchive-dev` - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools` Without this dependency, you might get an error on the next step, but it's safe to ignore. - install Python libraries: `pip install --user -r requirements.txt` . - disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`. Step 2 - obtaining the dataset ----------------------------- The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file: # copy and paste into a Python console from common import utils survival_data = utils.survival_data('pypi', '2008', smoothing=6) survival_data.to_csv('survival_data.csv') Since full replication will take several months, here are some ways to speedup the process: ####Option 2.a, difficulty level: easiest Just use the precomputed data. Step 1 is not necessary under this scenario. - extract **dataset_minimal_Jan_2018.zip** - get `survival_data.csv`, go to the next step ####Option 2.b, difficulty level: easy Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes. - create a folder `
The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis. For dataset 1: Organizations were selected from from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites and that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020). For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "READ ME" file for further details. References: R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3. Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1. USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/. This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Publication
will_INF.txt
and go_INF.txt
). They represent the co-occurrence frequency of top-200 infinitival collocates for will and be going to respectively across the twenty decades of Corpus of Historical American English (from the 1810s to the 2000s).1-script-create-input-data-raw.r
. The codes preprocess and combine the two files into a long format data frame consisting of the following columns: (i) decade
, (ii) coll
(for "collocate"), (iii) BE going to
(for frequency of the collocates with be going to) and (iv) will
(for frequency of the collocates with will); it is available in the input_data_raw.txt
. 2-script-create-motion-chart-input-data.R
processes the input_data_raw.txt
for normalising the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt
). The output from the second script is input_data_futurate.txt
.input_data_futurate.txt
contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R
), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R
).Future Constructions.Rproj
file to open an RStudio session whose working directory is associated with the contents of this repository.Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
If using this dataset, please cite the following paper above and the current Zenodo repository:A. Mundanad Narayanan, R. Zink, and A. Bertrand, "EEG miniaturization limits for stimulus decoding with EEG sensor networks", Journal of Neural Engineering, vol. 18, 2021, doi: 10.1088/1741-2552/ac2629
Experiment*************
This dataset contains 255-channel electroencephalography (EEG) data collected during an auditory attention decoding experiment (AAD). The EEG was recorded using a SynAmps RT device (Compumedics, Australia) at a sampling rate of 1 kHz and using active Ag/Cl electrodes. The electrodes were placed on the head according to the international 10-5 (5%) system. 30 normal hearing male subjects between 22 and 35 years old participated in the experiment. All of them signed an informed consent form approved by the KU Leuven ethical committee.
Two Dutch stories narrated by different male speakers divided into two parts of 6 minutes each were used as the stimuli in the experiment [1]. A single trial of the experiment involved the presentation of these two parts (one of both stories) to the subject through insert phones (Etymotic ER3A) at 60dBA. These speech stimuli were filtered using a head-related transfer function (HRTF) such that the stories seemed to arrive from two distinct spatial locations, namely left and right with respect to the subject with 180 degrees separation. In each trial, the subjects were asked to attend to only one ear while ignoring the other. Four trials of 6 minutes each were carried out, in which each story part is used twice. The order of presentations was randomized and balanced over different subjects. Thus approximately 24 minutes of EEG data was recorded per subject.
File organization and details********************************
The EEG data of each of the 30 subjects are uploaded as a ZIP file with the name Sx.tar.gzip here x=0,1,2,..,29. When a zip file is extracted, the EEG data are in their original raw format as recorded by the CURRY software [2]. The data files of each recording consist of four files with the same name but different extensions, namely, .dat, ,dap, .rs3 and .ceo. The name of each file follows the following convention: Sx_AAD_P. With P taking one of the following values for each file:1. 1L2. 1R3. 2L4. 2R
The letter 'L' or 'R' in P indicates the attended direction of each subject in a recording: left and right respectively. A MATLAB function to read the software is provided in the directory called scripts. A python function to read the file is available in this Github repository [3].The original version of stimuli presented to subjects, i.e. without the HRTF filtering, can be found after extracting the stimuli.zip file in WAV format. There are 4 WAV files corresponding to the two parts of each of the two stories. These files have been sampled at 44.1 kHz. The order of presentation of these WAV files is given in the table below: Stimuli presentation and attention information of files
Trial (P) Stimuli: Left-ear Stimuli: Right-ear Attention
1L part1_track1_dry part1_track2_dry Left
1R part1_track1_dry part1_track2_dry Right
2L part2_track2_dry part2_track1_dry Left
2R part2_track2_dry part2_track1_dry Right
Additional files (after extracting scripts.zip and misc.zip):
scripts/sample_script.m: Demonstrates reading an EEG-AAD recording and extracting the start and end of the experiment.
misc/channel-layout.jpeg: The 255-channel EEG cap layout
misc/eeg255ch_locs.csv: The channel names, numbers and their spherical (theta and phi) scalp coordinates.
[1] Radioboeken voor kinderen, http://radioboeken.eu/kinderradioboeken.php?lang=NL, 2007 (Accessed: 8 Feb 2021)
[2] CURRY 8 X – Data Acquisition and Online Processing, https://compumedicsneuroscan.com/product/curry-data-acquisition-online-processing-x/ (Accessed: 8, Feb, 2021)
[3] Abhijith Mundanad Narayanan, "EEG analysis in python", 2021. https://github.com/mabhijithn/eeg-analyse , (Accessed: 8 Feb, 2021)
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
analyze the health and retirement study (hrs) with r the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death d o us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking arou nd on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked. this new github repository contains five scripts: 1992 - 2010 download HRS microdata.R loop through every year and every file, download, then unzip everything in one big party impor t longitudinal RAND contributed files.R create a SQLite database (.db) on the local disk load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram) longitudinal RAND - analysis examples.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create tw o database-backed complex sample survey object, using a taylor-series linearization design perform a mountain of analysis examples with wave weights from two different points in the panel import example HRS file.R load a fixed-width file using only the sas importation script directly into ram with < a href="http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html">SAScii parse through the IF block at the bottom of the sas importation script, blank out a number of variables save the file as an R data file (.rda) for fast loading later replicate 2002 regression.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create a database-backed complex sample survey object, using a taylor-series linearization design exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document . click here to view these five scripts for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage rand's hrs homepage the hrs wikipedia page a running list of publications using hrs notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you c an think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column excel file as shown in Powerpoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See Powerpoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Cover crops provide many agroecosystem services, including weed suppression, which is partially exerted through release of allelopathic benzoxazinoid (BX) compounds. This research characterizes (1) changes in concentrations of BX compounds in shoots, roots, and soil at three growth stages (GS) of cereal rye (Secale cereale L.), and (2) their degradation over time following termination. Concentrations of shoot dominant BX compounds, DIBOA-glc and DIBOA, were least at GS 83 (boot). The root dominant BX compound, HMBOA-glc, concentration was least at GS 54 (elongation). Rhizosphere soil BX concentrations were 1000 times smaller than in root tissues. Dominant compounds in soil were HMBOA-glc and HMBOA. Concentrations of BX compounds were similar for soil near root crowns and between-rows. Soil BX concentrations following cereal rye termination declined exponentially over time in three of four treatments: incorporated shoots (S) and roots (R), no-till S+R (cereal rye rolled flat), and no-till R (shoots removed), but not in no-till S. On the day following cereal rye termination, soil concentrations of HMBOA-glc and HMBOA in these three treatments increased above initial concentrations. Concentrations of these two compounds decreased the fastest while DIBOA-glc declined the slowest (half-life of 4 d in no-till S+R soil). Placement of shoots on the surface of an area where cereal rye had not grown (no-till S) did not increase soil concentrations of BX compounds. The short duration and complex dynamics of BX compounds in soil prior to and following termination illustrate the limited window for enhancing weed suppression by cereal rye allelochemicals; valuable information for programs breeding for enhanced weed suppression. In addition to the data analyzed for this article, we also include the R code. Resources in this dataset:Resource Title: BX data following termination. File Name: FinalBXsForMatt-20200908.csvResource Description: For each sample, gives the time, depth, location, and plot treatment, and then the compound concentrations. This is the principal data set analyzed with the R (anal2-cleaned.r) code, see that code for use.Resource Title: BX compounds from 3rd sampling time before termination. File Name: soil2-20201123.csvResource Description: These data are for comparison with the post termination data. They were taken at the 3rd sampling time (pre-termination), a day prior to termination. Each sample is identified with a treatment, date, and plot location, in addition to the BX concentrations. See R code (anal2-cleaned.r) for how this file is used.Resource Title: Soil location (within row versus between row) values of BX compounds. File Name: s2b.csvResource Description: Each row gives the average BX compound for each soil location (within row versus between row) for the second sample for each plot. These data are combined with bx3 (the data set read in from the file , "FinalBXsForMatt-20200908.csv"). See R (anal2-cleaned.r) code for use.Resource Title: R code for analysis of the decay (post-termination) BX data.. File Name: anal2-cleaned.rResource Description: This is the R code used to analyze the termination data. It also creates and writes out some data subsets (used for analysis and plots) that are later read in.Resource Software Recommended: R version 3.6.3,url: https://www.R-project.org/ Resource Title: Tissue BX compounds. File Name: tissues20210728b.csvResource Description: Data file holding results from a tissue analysis for BX compounds, in ug, from shoots and roots, and at various sampling times. Read into the R file, anal1-cleaned.r where it is used in a statistical analysis and to create figures.Resource Title: BX compounds from soil with a live rye cover crop. File Name: soil2-20201214.csvResource Description: BX compounds (in ng/g dry wt), by treatment, sampling time, date, and plot ID. These are data are read into the R program, anal1-cleaned.r, for analysis and to create figures. These are soil samples taken from locations with a live rye plant cover crop.Resource Title: R code for BX analyses of soil under rye and plant tissues. File Name: anal1-cleaned.rResource Description: R code for analysis of the soil BX compounds under a live rye cover crop at different growing stages, and for the analysis of tissue BX compounds. In addition to statistical analyses, code in this file creates figures, also some statistical output that is used to create a file that is later read in for figure creation (s2-CLD20220730-Stage.csv).Resource Software Recommended: R version 3.6.3,url: https://www.R-project.org/ Resource Title: Description of data files for anal2-cleaned.r. File Name: readme2.txtResource Description: Describes the input files used in the R code in anal2-cleaned.r, including descriptions and formats for each field. The file also describes some output (results) files that were uploaded to this site. This is a plain ASCII text file.Resource Title: Estimates produced by anal2-cleaned.r from statistical modeling.. File Name: Estimates20201110.csvResource Description: Estimates produced by anal2-cleaned.r from statistical modeling (see readme2.txt)Resource Title: Summary statistics from anal2-cleaned.r. File Name: CV20210412.csvResource Description: Summary statistics from anal2-cleaned.r, used for plotsResource Title: Data summaries (same as CV20210412.csv), rescaled. File Name: RESCALE-20210412.csvResource Description: Same as "CV20210412.csv" except log of data have been rescaled to minimum at least zero and maximum one, see readme2.txtResource Title: Statistical summaries for different stages. File Name: s2-CLD20220730-Stage.csvResource Description: Statistical summaries used for creating a figure (not used in paper), used in anal1-cleaned.r; data for soil BX under living rye.Resource Title: Description of data files for anal1-cleaned.r. File Name: readme1.txtResource Description: Contains general descriptions of data imported into anal1-cleaned.r, and a description of each field. Also contains some descriptions of files output by anal1-cleaned.r, used to create tables or figures.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions on itemset that a customer is most likely to purchase .I was given dataset contains data of a retailer; the transaction data provides data around all the transactions that have happened over a period of time. Retailer will use result to grove in his industry and provide for customer suggestions on itemset, we be able increase customer engagement and improve customer experience and identify customer behavior. I will solve this problem with use Association Rules type of unsupervised learning technique that checks for the dependency of one data item on another data item.
Association Rule is most used when you are planning to build association in different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you what items do customers frequently buy together and it allows retailer to identify relationships between the items.
Assume there are 100 customers, 10 of them bought Computer Mouth, 9 bought Mat for Mouse and 8 bought both of them. - bought Computer Mouth => bought Mat for Mouse - support = P(Mouth & Mat) = 8/100 = 0.08 - confidence = support/P(Mat for Mouse) = 0.08/0.09 = 0.89 - lift = confidence/P(Computer Mouth) = 0.89/0.10 = 8.9 This just simple example. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png">
First, we need to load required libraries. Shortly I describe all libraries.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png">
Next, we need to upload Assignment-1_Data. xlsx to R to read the dataset.Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png">
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png">
After we will clear our data frame, will remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png">
To apply Association Rule mining, we need to convert dataframe into transaction data to make all items that are bought together in one invoice will be in ...
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
César E. Corona-González, Claudia Rebeca De Stefano-Ramos, Juan Pablo Rosado-Aíza, Fabiola R Gómez-Velázquez, David I. Ibarra-Zarate, Luz María Alonso-Valerdi
César E. Corona-González
https://orcid.org/0000-0002-7680-2953
a00833959@tec.mx
Psychophysiological data from Mexican children with learning difficulties who strengthen reading and math skills by assistive technology
2023
The current dataset consists of psychometric and electrophysiological data from children with reading or math learning difficulties. These data were collected to evaluate improvements in reading or math skills resulting from using an online learning method called Smartick.
The psychometric evaluations from children with reading difficulties encompassed: spelling tests, where 1) orthographic and 2) phonological errors were considered, 3) reading speed, expressed in words read per minute, and 4) reading comprehension, where multiple-choice questions were given to the children. The last 2 parameters were determined according to the standards from the Ministry of Public Education (Secretaría de Educación Pública in Spanish) in Mexico. On the other hand, group 2 assessments embraced: 1) an assessment of general mathematical knowledge, as well as 2) the hits percentage, and 3) reaction time from an arithmetical task. Additionally, selective attention and intelligence quotient (IQ) were also evaluated.
Then, individuals underwent an EEG experimental paradigm where two conditions were recorded: 1) a 3-minute eyes-open resting state and 2) performing either reading or mathematical activities. EEG recordings from the reading experiment consisted of reading a text aloud and then answering questions about the text. Alternatively, EEG recordings from the math experiment involved the solution of two blocks with 20 arithmetic operations (addition and subtraction). Subsequently, each child was randomly subcategorized as 1) the experimental group, who were asked to engage with Smartick for three months, and 2) the control group, who were not involved with the intervention. Once the 3-month period was over, every child was reassessed as described before.
The dataset contains a total of 76 subjects (sub-), where two study groups were assessed: 1) reading difficulties (R) and 2) math difficulties (M). Then, each individual was subcategorized as experimental subgroup (e), where children were compromised to engage with Smartick, or control subgroup (c), where they did not get involved with any intervention.
Every subject was followed up on for three months. During this period, each subject underwent two EEG sessions, representing the PRE-intervention (ses-1) and the POST-intervention (ses-2).
The EEG recordings from the reading difficulties group consisted of a resting state condition (run-1) and while performing active reading and reading comprehension activities (run-2). On the other hand, EEG data from the math difficulties group was collected from a resting state condition (run-1) and when solving two blocks of 20 arithmetic operations (run-2 and run-3). All EEG files were stored in .set format. The nomenclature and description from filenames are shown below:
Nomenclature | Description |
---|---|
sub- | Subject |
M | Math group |
R | Reading group |
c | Control subgroup |
e | Experimental subgroup |
ses-1 | PRE-intervention |
ses-2 | POST-Intervention |
run-1 | EEG for baseline |
run-2 | EEG for reading activity, or the first block of math |
run-3 | EEG for the second block of math |
Example: the file sub-Rc11_ses-1_task-SmartickDataset_run-2_eeg.set is related to: - The 11th subject from the reading difficulties group, control subgroup (sub-Rc11). - EEG recording from the PRE-intervention (ses-1) while performing the reading activity (run-2)
Psychometric data from the reading difficulties group:
Psychometric data from the math difficulties group:
Psychometric data can be found in the 01_Psychometric_Data.xlsx file
Engagement percentage be found in the 05_SessionEngagement.xlsx file
Seventy-six Mexican children between 7 and 13 years old were enrolled in this study.
The sample was recruited through non-profit foundations that support learning and foster care programs.
g.USBamp RESEARCH amplifier
The stimuli nested folder contains all stimuli employed in the EEG experiments.
Level 1 - Math: Images used in the math experiment. - Reading: Images used in the reading experiment.
Level 2
- Math
* POST_Operations: arithmetic operations from the POST-intervention.
* PRE_Operations: arithmetic operations from the PRE-intervention.
- Reading
* POST_Reading1: text 1 and text-related comprehension questions from the POST-intervention.
* POST_Reading2: text 2 and text-related comprehension questions from the POST-intervention.
* POST_Reading3: text 3 and text-related comprehension questions from the POST-intervention.
* PRE_Reading1: text 1 and text-related comprehension questions from the PRE-intervention.
* PRE_Reading2: text 2 and text-related comprehension questions from the PRE-intervention.
* PRE_Reading3: text 3 and text-related comprehension questions from the PRE-intervention.
Level 3 - Math * Operation01.jpg to Operation20.jpg: arithmetical operations solved during the first block of the math
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of 5269 reaction profiles computed in a high-throughput manner at B3LYP-D3(BJ)/def2-TZVP//B3LYP-D3(BJ)/def2-SVP level of theory with the help of autodE and Gaussian 16. Reaction IDs and SMILES, activation energies (G_act; in kcal/mol) and reaction energies (G_r; in kcal/mol) for each computed reaction profile are provided in CSV format (full_dataset.csv). XYZ-files for each reactant (both the original and stereo-constrained versions), TS and product species as well as a CSV file containing computed electronic energies and thermal corrections are available in a compressed archive file, full_dataset_profiles.tar.gz.
The files have been organized per reaction profile, identified through the reaction ID. Within each directory, reactant XYZ-files are of the form r#####.xyz, product XYZ-files are of the form p#####.xyz, and transition state XYZ-files are of the form TS_#####.xyz. If the reactant dipole conformer had to be corrected to enforce stereochemical compatibility, the latter XYZ-files are included under to form of r#####_alt.xyz. The energies for all of these species are summarized per directory in energies.csv.
Additionally, all the benchmarking data are made available in the benchmarking_data.tar.gz directory
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
This data release presents the data, JAGS models, and R code used to manipulate data and to produce results and figures presented in the USGS Open File Report, "Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species, (https://doi.org/10.5066/P93YTR3X). The zip folder is provided so that other can reproduce results from the integrated population model, inspect model structure and posterior simulations, conduct analyses not present in the report, and use and modify the code. Raw source data can be sourced from the USGS Bird Banding Laboratory, USFWS Surveys and Monitoring Branch, National Oceanic and Atmospheric administration, and Ducks Unlimited Canada. The zip file contains the following objects when extracted: Readme.txt: A plain text file describing each file in this directory. Figures-Pintail-IPM.r: R code that generates report figures in png, pdf, and eps format. Generates Figures 2-11 and calls source code for figures 12 and 13 found in other files. * get pintail IPM data.r: R source code that must be run to format data for the IPM code file. * getbandrecovs.r: R code that takes Bird Banding Lab data for pintail band releases and recoveries and formats for analysis. This file is called by 'get pintail IPM data.r'. File was originally written by Scott Boomer (USFWS) and modified by Erik Osnas for use for the IPM. * Model_1_post.txt: Text representation of the posterior simulations from Model 1. This file can be read by the R function dget() to produce an R list object that contain posterior draws from Model 1. The list is the BUGSoutput$sims.list object from a call to rjags::jags. * Model_2_post.txt: As above but for Model 2. * Model_S1_post.txt: As above but for Model S1. * Pintail IPM.r: This is the main file that defines the IPM models in JAGS, structures the data for JAGS, defines initial values, and calls runs the models. Outputs are text files that contains JAGS model files, R work spaces that contains all data models, and results, include the output from the jags() function. From this the BUGSoutput$sims.list object was written to text for each model. * MSY_metrics.txt: Summary of results produced from running code in source_figure_12.R. This table is a text representation of a summary of the maximum sustained yield analysis at various mean rainfall levels, used for Table 1 of report and can be reproduced by running the code in source_figure_12.R. To understand the structure of this file, you must consult the code file and understand the structure of the R objects created from that code. Otherwise, consult Figure 12 and Table 1 in report. * source_figure_12.R: R code to produce Figure 12. Code is written to work with Rworkspace output from Model 1, but can be modified to use the Model_1_post.txt file without re-running the model. This would allow use of the same posterior realizations as used in the report. * source_figure_13.R: This is the code used to product the results for Figure 13. Required here is the posterior from Model 1 and data for the Prairie Parkland Model based on Jim Devries/Ducks Unlimited data. These are described in the report text. * Data: A directory that contains the raw data used for this report. * Data/2015_LCC_Networks_shapefile: A directory that contain ESRI shapefiles for used in Figure 1 and to define the boundaries of the Landscape Conservation Cooperatives. Found at (https://www.sciencebase.gov/catalog/item/55b943ade4b09a3b01b65d78) * Data/bndg_1430_yr1960up_DBISC_03042014.csv: A comma delimited file for banded pintail from 1960 to 2014. Obtained from the USGS Bird Banding Lab. This file is used by 'getbandrecovs.r' to produce and 'm-array' used in the Integrated Population Model (IPM). A data dictionary describing the codes for each field can be found here, https://www.pwrc.usgs.gov/BBL/manual/summary.cfm * Data/cponds.csv: A comma delimited file of estimated Canadian ponds based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given is the year, point estimate, and estimated standard error. * Data/enc_1430_yr1960up_DBISC_03042014.csv: A comma delimited file for encounters of banded pintail. Obtained from the USGS Bird Banding Lab. This file is use by 'getbandrecovs.r' to produce and 'm-array' used in the Integrated Population Model (IPM). A data dictionary describing the codes for each field can be found here, (https://www.pwrc.usgs.gov/BBL/manual/enc.cfm) * Data/nopiBPOP19552014.csv: A comma delimited file of estimated northern pintail based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given is the year, pintail point estimate (bpop), and pintail estimated standard error (bpopSE), mean latitude of the pintail population (lat), latitude variance of the pintail population (latVAR), mean longitude of the pintail population (lon), and the variance in longitude of the pintail population (lonVAR). * Data/Summary Climate Data California CV 2.csv: Rainfall data for the California central valley downloaded from National Climate Data Center (www.ncdc.noaa.gov/cdo-web/) as described in report text (https://doi.org/10.5066/P93YTR3X) and publication found at https://doi.org/10.1002/jwmg.21124 . Used in 'get pintail IPM data.r' for IPM. * Data/Summary data MAV.csv: Rainfall data for the Mississippi Aluvial valley downloaded from National Climate Data Center (www.ncdc.noaa.gov/cdo-web/) as described in report text (https://doi.org/10.5066/P93YTR3X) and publication found at https://doi.org/10.1002/jwmg.21124 . Used in 'get pintail IPM data.r' for IPM. * Data/Wing data 1961 2011 NOPI.txt: Comma delimited text file for pintail wing age data for 1961 to 2011 from the Parts Collection Survey. Each row is an individual wing with sex cohorts 4 = male, 5 = female and age cohorts 1 = After Hatch Year and 2 = Hatch Year. Wt is a weighting factor that determines how many harvested pintails this wing represent. See USFWS documentation for the Part Collection survey for descriptions. Summing Wt for each age, sex, and year gives an estimate of the number of pintail harvested. Used in 'get pintail IPM data.r' for IPM. * Data/Wing data 2012 2013 NOPI.csv: Same as 'Wing data 1961 2011 NOPI.txt' but for years 2012 and 2013.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:
“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
!!!WARNING~~~This dataset has a large number of flaws and is unable to properly answer many questions that people generally use it to answer, such as whether national hate crimes are changing (or at least they use the data so improperly that they get the wrong answer). A large number of people using this data (academics, advocates, reporting, US Congress) do so inappropriately and get the wrong answer to their questions as a result. Indeed, many published papers using this data should be retracted. Before using this data I highly recommend that you thoroughly read my book on UCR data, particularly the chapter on hate crimes (https://ucrbook.com/hate-crimes.html) as well as the FBI's own manual on this data. The questions you could potentially answer well are relatively narrow and generally exclude any causal relationships. ~~~WARNING!!!For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.comVersion 9 release notes:Adds 2021 data.Version 8 release notes:Adds 2019 and 2020 data. Please note that the FBI has retired UCR data ending in 2020 data so this will be the last UCR hate crime data they release. Changes .rda file to .rds.Version 7 release notes:Changes release notes description, does not change data.Version 6 release notes:Adds 2018 dataVersion 5 release notes:Adds data in the following formats: SPSS, SAS, and Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Adds data for 1991.Fixes bug where bias motivation "anti-lesbian, gay, bisexual, or transgender, mixed group (lgbt)" was labeled "anti-homosexual (gay and lesbian)" prior to 2013 causing there to be two columns and zero values for years with the wrong label.All data is now directly from the FBI, not NACJD. The data initially comes as ASCII+SPSS Setup files and read into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. Version 4 release notes: Adds data for 2017.Adds rows that submitted a zero-report (i.e. that agency reported no hate crimes in the year). This is for all years 1992-2017. Made changes to categorical variables (e.g. bias motivation columns) to make categories consistent over time. Different years had slightly different names (e.g. 'anti-am indian' and 'anti-american indian') which I made consistent. Made the 'population' column which is the total population in that agency. Version 3 release notes: Adds data for 2016.Order rows by year (descending) and ORI.Version 2 release notes: Fix bug where Philadelphia Police Department had incorrect FIPS county code. The Hate Crime data is an FBI data set that is part of the annual Uniform Crime Reporting (UCR) Program data. This data contains information about hate crimes reported in the United States. Please note that the files are quite large and may take some time to open.Each row indicates a hate crime incident for an agency in a given year. I have made a unique ID column ("unique_id") by combining the year, agency ORI9 (the 9 character Originating Identifier code), and incident number columns together. Each column is a variable related to that incident or to the reporting agency. Some of the important columns are the incident date, what crime occurred (up to 10 crimes), the number of victims for each of these crimes, the bias motivation for each of these crimes, and the location of each crime. It also includes the total number of victims, total number of offenders, and race of offenders (as a group). Finally, it has a number of columns indicating if the victim for each offense was a certain type of victim or not (e.g. individual victim, business victim religious victim, etc.). The only changes I made to the data are the following. Minor changes to column names to make all column names 32 characters or fewer (so it can be saved in a Stata format), made all character values lower case, reordered columns. I also generated incident month, weekday, and month-day variables from the incident date variable included in the original data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the content of the subset of all files with a correct publication date from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release). This content was extracted from the source PDF files using the R OCR libraries tesseract and pdftools.
The code to derive the dataset is given as follows:
### BEGIN R DATA PROCESSING SCRIPT
library(tesseract)
library(pdftools)
pdfs <- list.files("[path to your output directory containing all PDF files]")
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",header = T,sep = ',') #the meta file containing all metadata for the PDF files (e.g. publication date)
meta$Doc.Date <- as.character(meta$Doc.Date)
meta.clean <- meta[-which(meta$Doc.Date=="" | grepl("/0000",meta$Doc.Date)),]
for(i in 1:nrow(meta.clean)){
meta.clean$Doc.Date[i] <- gsub("00","01",meta.clean$Doc.Date[i])
if(nchar(meta.clean$Doc.Date[i])<10){
meta.clean$Doc.Date[i]<-format(strptime(meta.clean$Doc.Date[i],format = "%d/%m/%y"),"%m/%d/%Y")
}
}
meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date,format = "%m/%d/%Y")
meta.clean <- meta.clean[order(meta.clean$Doc.Date),]
docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F)
for(i in 1:nrow(meta.clean)){
#for(i in 1:3){
pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])))
tmp_files <- c()
for(k in 1:pdf_prop$pages){
tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k))
}
img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])), format = 'tiff', pages = NULL, dpi = 700,filenames = tmp_files)
txt <- ""
for(j in 1:length(img_file)){
extract <- ocr(img_file[j], engine = tesseract("eng"))
#unlink(img_file)
txt <- paste(txt,extract,collapse = " ")
}
docs <- rbind(docs,data.frame(content=iconv(tolower(gsub("\\s+"," ",gsub("[[:punct:]]|[
]"," ",txt))),to="UTF-8"),dpub=format(meta.clean$Doc.Date[i],"%Y/%m/%d"),stringsAsFactors = F),stringsAsFactors = F)
}
write.table(docs,"[path to your output directory]/documents.csv", row.names = F)
### END R DATA PROCESSING SCRIPT
I offer here both dataset and computing code related to a stem analysis algorithm to reconstruct height growth of trees. First, the dataset has time series records of tree height for Nothofagus alpina ("rauli"), N. dombeyi ("coigue"), N. obliqua ("roble"), and Pseudotsuga menziesii ("Douglas-fir"). The data come from stem analysis sample trees in both southern Chile and the Inland Northwest, USA. These trees are part of the ones used in an article about a new algorithm for reconstructing tree height growth. The article is published in Methods in Ecology and Evolution (https://doi.org/10.1111/2041-210x.13616). Second, I provide an R code implementing the proposed algorithm for a given dataset as example.
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 4 release notes:Adds data for 2018Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
ABSTRACT: The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the ‘WoSIS snapshot 2019’ many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers, therefore special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement). We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total Nitrogen, Phosphorus (extractable-P, total-P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable. For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases. Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation for the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data. DATA SET DESCRIPTION: The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies (greatly) between profiles and with depth, this generally depending on the objectives of the initial soil sampling programmes. The data are provided in TSV (tab separated values) format and as GeoPackage. The zip-file (446 Mb) contains the following files: - Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for downloading the TSV files into R respectively Excel. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' in the next section. - wosis_202312_observations.tsv: This file lists the four to six letter codes for each observation, whether the observation is for a site/profile or layer (horizon), the unit of measurement and the number of profiles respectively layers represented in the snapshot. It also provides an estimate for the inferred accuracy for the laboratory measurements. - wosis_202312_sites.tsv: This file characterizes the site location where profiles were sampled. - wosis_2023112_profiles: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary . - wosis_202312_layers: This file characterises the layers (or horizons) per profile, and lists their upper and lower depths (cm). - wosis_202312_xxxx.tsv : This type of file presents results for each observation (e.g. “xxxx” = “BDFIOD” ), as defined under “code” in file wosis_202312_observation.tsv. (e.g. wosis_202311_bdfiod.tsv). - wosis_202312.gpkg: Contains the above datafiles in GeoPackage format (which stores the files within an SQLite database). HOW TO READ TSV FILES INTO R AND PYTHON: A) To read the data in R, please uncompress the ZIP file and specify the uncompressed folder. setwd("/YourFolder/WoSIS_2023_December/") ## For example: setwd('D:/WoSIS_2023_December/') Then use read_tsv to read the TSV files, specifying the data types for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time). observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid') observations ## show columns and first 10 rows sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc') sites profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci') profiles layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc') layers ## Do this for each observation 'XXXX', e.g. file 'Wosis_202312_orgc.tsv': orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='... Visit https://dataone.org/datasets/sha256%3Aae94fefb74f928a3d482eee20abf33cf04d988555ef2beef2977eba7d5504bd7 for complete metadata about this dataset.
This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843 Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html file of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process. Notebook files 01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output including marginal means for different varieties and contrasts between crop years. 02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output including marginal means by categorical trait and marginal trends by continuous trait. HTML files These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15. 01_boring_analysis.html 02_trait_covariate_analysis.html CSV data files These files contain the raw data. To recreate the notebook output the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below. BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage Columns A-C are the year, date, and _location. All _location values are the same. Column D identifies which experiment the data point was collected from. Column E, Stubble, indicates the crop year (plant cane or first stubble) Column F indicates the variety Column G indicates the plot (integer ID) Column H indicates the stalk within each plot (integer ID) Column I, # Internodes, indicates how many internodes were on the stalk Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk) Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment Column AO contains notes variety_lookup.csv: summary information for the 16 varieties analyzed in this study Column A is the variety name Column B is the total number of stalks assessed for SCB damage for that variety across all years Column C is the number of years that variety is present in the data Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years) Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant Column F is the literature reference for the SCB resistance value Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study Column A is the variety name Column B is the SCB resistance designation as an integer Column C is the categorical SCB resistance designation (see above) Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent). Columns J-O are the same continuous traits from year 2 (first stubble) Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair. ZIP file of intermediate R objects To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run. intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce final output including figures and tables without having to refit the computationally intensive statistical models. binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content Resources in this dataset:Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csvResource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csvResource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csvResource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.RmdResource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.htmlResource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.RmdResource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.htmlResource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164: ------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem **Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well. Content description =================== - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below - **settings.py** - settings template for the code archive. - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI) - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB. - **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**) - **Interview protocol.pdf** - approximate protocol used for semistructured interviews. - LICENSE - text of GPL v3, under which this dataset is published - INSTALL.md - replication guide (~2 pages)
Replication guide ================= Step 0 - prerequisites ---------------------- - Unix-compatible OS (Linux or OS X) - Python interpreter (2.7 was used; Python 3 compatibility is highly likely) - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible) Depending on detalization level (see Step 2 for more details): - up to 2Tb of disk space (see Step 2 detalization levels) - at least 16Gb of RAM (64 preferable) - few hours to few month of processing time Step 1 - software ---------------- - unpack **ghd-0.1.0.zip**, or clone from gitlab: git clone https://gitlab.com/user2589/ghd.git git checkout 0.1.0 `cd` into the extracted folder. All commands below assume it as a current directory. - copy `settings.py` into the extracted folder. Edit the file: * set `DATASET_PATH` to some newly created folder path * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` - install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose` - install libarchive and headers: `sudo apt-get install libarchive-dev` - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools` Without this dependency, you might get an error on the next step, but it's safe to ignore. - install Python libraries: `pip install --user -r requirements.txt` . - disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`. Step 2 - obtaining the dataset ----------------------------- The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file: # copy and paste into a Python console from common import utils survival_data = utils.survival_data('pypi', '2008', smoothing=6) survival_data.to_csv('survival_data.csv') Since full replication will take several months, here are some ways to speedup the process: ####Option 2.a, difficulty level: easiest Just use the precomputed data. Step 1 is not necessary under this scenario. - extract **dataset_minimal_Jan_2018.zip** - get `survival_data.csv`, go to the next step ####Option 2.b, difficulty level: easy Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes. - create a folder `