28 datasets found
  1. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** the link to the data artifacts is already included in the paper. 
    A link to the code will be included in the camera-ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**); see the R sketch after this list
    - **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
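
    For orientation only, here is a minimal R sketch of loading `survival_data.csv`
    and fitting a survival model to it; the column names (`time`, `dead`,
    `n_contributors`) are hypothetical placeholders, and the actual model
    specification lives in **build_model.r** and **helpers.r**.

        # Minimal sketch -- the real analysis is in build_model.r / helpers.r.
        # Column names below (time, dead, n_contributors) are hypothetical.
        library(survival)

        surv <- read.csv("survival_data.csv")
        fit  <- coxph(Surv(time, dead) ~ n_contributors, data = surv)
        summary(fit)
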
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2TB of disk space
    - at least 16GB of RAM (64GB preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15-30 minutes.
    
    - create a folder `
  2. Data from: HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE...

    • scielo.figshare.com
    • datasetcatalog.nlm.nih.gov
    tiff
    Updated Jun 4, 2023
    Cite
    Diego Ariel de Lima; Camilo Partezani Helito; Lana Lacerda de Lima; Renata Clazzer; Romeu Krause Gonçalves; Olavo Pires de Camargo (2023). HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE USING R SOFTWARE AND RSTUDIO [Dataset]. http://doi.org/10.6084/m9.figshare.19899537.v1
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Diego Ariel de Lima; Camilo Partezani Helito; Lana Lacerda de Lima; Renata Clazzer; Romeu Krause Gonçalves; Olavo Pires de Camargo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    ABSTRACT Meta-analysis is an adequate statistical technique to combine results from different studies, and its use has been growing in the medical field. Thus, not only knowing how to interpret meta-analysis, but also knowing how to perform one, is fundamental today. Therefore, the objective of this article is to present the basic concepts and serve as a guide for conducting a meta-analysis using R and RStudio software. For this, the reader has access to the basic commands in the R and RStudio software, necessary for conducting a meta-analysis. The advantage of R is that it is a free software. For a better understanding of the commands, two examples were presented in a practical way, in addition to revising some basic concepts of this statistical technique. It is assumed that the data necessary for the meta-analysis has already been collected, that is, the description of methodologies for systematic review is not a discussed subject. Finally, it is worth remembering that there are many other techniques used in meta-analyses that were not addressed in this work. However, with the two examples used, the article already enables the reader to proceed with good and robust meta-analyses. Level of Evidence V, Expert Opinion.
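
    As an illustration of the kind of R workflow the guide covers (this is not the
    article's own code), a random-effects meta-analysis can be run with the metafor
    package; the effect sizes and variances below are hypothetical.

        # Illustrative sketch only -- hypothetical effect sizes (yi) and
        # sampling variances (vi) for five studies.
        library(metafor)

        dat <- data.frame(yi = c(0.20, 0.35, -0.10, 0.48, 0.15),
                          vi = c(0.04, 0.05, 0.03, 0.06, 0.04))
        res <- rma(yi, vi, data = dat, method = "REML")  # random-effects model
        summary(res)   # pooled estimate and heterogeneity statistics
        forest(res)    # forest plot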

  3. Data and R code for "New methods for quantifying the effects of catchment...

    • smithsonian.figshare.com
    txt
    Updated Jul 13, 2024
    Cite
    Donald Weller; Matthew Baker; Ryan King (2024). Data and R code for "New methods for quantifying the effects of catchment spatial patterns on aquatic responses" [Dataset]. http://doi.org/10.25573/serc.23557056.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Jul 13, 2024
    Dataset provided by
    Smithsonian Environmental Research Center
    Authors
    Donald Weller; Matthew Baker; Ryan King
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This figshare item provides data and R code to reproduce the analysis in the following paper: Weller, DE; ME Baker, and RS King. 2023. New methods for quantifying the effects of catchment spatial patterns on aquatic responses. Landscape Ecology. https://doi.org/10.1007/s10980-023-01706-x

    This figshare item provides 14 files: five data files (.csv files), a list of models to be fitted by the R code (Modlist.csv), and seven files of R code (.R files). The file 0SpatialAnalysis.txt provides more information on the spatial analysis we used to generate distance distributions.

    Data files
    The five data files are:
    - subestPCB.csv
    - cdist.csv
    - hdist.csv
    - ldist.csv
    - tdist.csv
    The file subestPCB.csv provides catchment id numbers, names, and average measured PCB concentrations from fish tissues for 14 study subestuaries. The remaining four files provide the distance distributions for commercial land, high-density residential land, low-density residential land, and all land. Each distance file has four columns: junk, count, catchment id, and distance. Information in the junk column is not used. Count provides land area as the number of 30 by 30 meter (0.09 hectare) pixels. The variable called distance provides the distance to the subestuary shoreline in decameters.

    R code
    The R codes reproduce the statistical analysis and most of the tables and figures from the published paper. We ran the codes using RStudio. We invoked RStudio's New Project … > Existing Directory option to establish the directory containing the data files and R code files as an RStudio project. Then we ran five R codes in sequence according to the initial numbers in the file names (1ReadData.R, 2FitModels.R, 3Tables.R, 4Figures.R, and 5FigureS3.R). Each program adds to the objects saved in the R workspace within the RStudio project. Figures and tables are saved in the subdirectory FiguresTables. The five numbered R files also use functions from two other files: DistWeightFunctionsV01.R and AuxillaryFunctionsV01.R. The first R program expects the five data files (subestPCB.csv, cdist.csv, hdist.csv, ldist.csv, and tdist.csv) to reside in the same directory as the program and the RStudio project. Comments in the R files provide additional information on how each one works.
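
    Assuming the project layout described above, the numbered scripts can be run in
    order from the RStudio project with a short driver like this (a sketch; the
    scripts themselves read the CSV files and write outputs to FiguresTables):

        # Run the numbered scripts in sequence from the RStudio project directory.
        # Sketch only; file names are taken from the description above.
        scripts <- c("1ReadData.R", "2FitModels.R", "3Tables.R",
                     "4Figures.R", "5FigureS3.R")
        for (s in scripts) source(s, echo = TRUE)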

  4. Research data supporting 'Lithic Technological Change and Behavioral...

    • repository.cam.ac.uk
    bin, docx, xlsx
    Updated Sep 8, 2020
    Cite
    Carroll, Peyton (2020). Research data supporting 'Lithic Technological Change and Behavioral Responses to the Last Glacial Maximum Across Southwestern Europe' [Dataset]. http://doi.org/10.17863/CAM.56697
    Explore at:
    Available download formats: xlsx (56230 bytes), bin (6066 bytes), bin (46471 bytes), xlsx (542779 bytes), docx (347181 bytes)
    Dataset updated
    Sep 8, 2020
    Dataset provided by
    Apollo
    University of Cambridge
    Authors
    Carroll, Peyton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was used to collect and analyze data for the MPhil Thesis, "Lithic Technological Change and Behavioral Responses to the Last Glacial Maximum Across Southwestern Europe." This dataset contains the raw data collected from published literature, and the R code used to run correspondence analysis on the data and create graphical representations of the results. It also contains notes to aid in interpreting the dataset, and a list detailing how variables in the dataset were grouped for use in analysis.

    The file "Diss Data.xlsx" contains the raw data collected from publications on Upper Paleolithic archaeological sites in France, Spain, and Italy. This data is the basis for all other files included in the repository. The document "Diss Data Notes.docx" contains detailed information about the raw data, and is useful for understanding its context. "Revised Variable Groups.docx" lists all of the variables from the raw data considered "tool types" and the major categories into which they were sorted for analysis. "Group Definitions.docx" provides the criteria considered to make the groups listed in the "Revised Variable Groups" document. "r_diss_data.xlsx" contains only the variables from the raw data that were considered for correspondence analysis carried out in RStudio.

    The document "ca_barplot.R" contains the RStudio code written to perform correspondence analysis and percent composition analysis on the data from "R_Diss_Data.xlsx". This file also contains code for creating scatter plots and bar graphs displaying the results from the CA and Percent Comp tests. The RStudio packages used to carry out the analysis and to create graphical representations of the analysis results are listed under "Software/Usage Instructions." "climate_curve.R" contains the RStudio code used to create climate curves from NGRIP and GRIP data available open-access from the Niels Bohr Institute Center of Ice and Climate. The link to access this data is provided in "Related Resources" below.
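
    For readers unfamiliar with correspondence analysis in R, here is a minimal
    sketch using the FactoMineR package; it is not the code from ca_barplot.R, and
    the contingency table below is hypothetical.

        # Minimal correspondence-analysis sketch (not the archived ca_barplot.R).
        # Rows = hypothetical sites, columns = hypothetical tool-type groups.
        library(FactoMineR)

        counts <- matrix(c(12,  4,  9,
                            3, 15,  7,
                            8,  6, 11),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("SiteA", "SiteB", "SiteC"),
                                         c("Blades", "Burins", "Scrapers")))
        res_ca <- CA(as.data.frame(counts), graph = TRUE)  # CA biplot
        summary(res_ca)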

  5. R scripts used to analyze rodent call statistics generated by 'DeepSqueak'

    • figshare.com
    zip
    Updated May 28, 2021
    Cite
    Mathijs Blom (2021). R scripts used to analyze rodent call statistics generated by 'DeepSqueak' [Dataset]. http://doi.org/10.6084/m9.figshare.14696304.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 28, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mathijs Blom
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The scripts in this folder were used to combine all call statistic files per day into one file, resulting in nine files containing all call statistics per day. The script ‘merging_dataset.R’ was used to combine all days' worth of call statistics and create subsets of two frequency ranges (18-32 and 32-96). The script ‘camera_data’ was used to combine all camera and observation data.
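
    A sketch of the kind of per-day merging and frequency subsetting the scripts
    perform is shown below; the folder layout and the column name are assumptions,
    and the actual logic lives in 'merging_dataset.R'.

        # Sketch only: combine per-day call-statistics CSVs and split by frequency.
        # The folder name and the 'PrincipalFrequency' column (kHz) are assumptions.
        files <- list.files("call_stats", pattern = "\\.csv$", full.names = TRUE)
        calls <- do.call(rbind, lapply(files, read.csv))

        low_band  <- subset(calls, PrincipalFrequency >= 18 & PrincipalFrequency < 32)
        high_band <- subset(calls, PrincipalFrequency >= 32 & PrincipalFrequency <= 96)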

  6. Data from: Working with a linguistic corpus using R: An introductory note...

    • researchdata.edu.au
    • bridges.monash.edu
    Updated May 5, 2022
    Cite
    Gede Primahadi Wijaya Rajeg; I Made Rajeg; Karlina Denistia (2022). Working with a linguistic corpus using R: An introductory note with Indonesian Negating Construction [Dataset]. http://doi.org/10.4225/03/5a7ee2ac84303
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg; I Made Rajeg; Karlina Denistia
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is a repository for codes and datasets for the open-access paper in Linguistik Indonesia, the flagship journal for the Linguistic Society of Indonesia (Masyarakat Linguistik Indonesia [MLI]) (cf. the link in the references below).


    To cite the paper (in APA 6th style):

    Rajeg, G. P. W., Denistia, K., & Rajeg, I. M. (2018). Working with a linguistic corpus using R: An introductory note with Indonesian negating construction. Linguistik Indonesia, 36(1), 1–36. doi: 10.26499/li.v36i1.71


    To cite this repository:
    Click on Cite (the dark-pink button on the top left) and select the citation style from the dropdown button (the default style is the DataCite option, on the right-hand side).

    This repository consists of the following files:
    1. Source R Markdown Notebook (.Rmd file) used to write the paper and containing the R codes to generate the analyses in the paper.
    2. Tutorial to download the Leipzig Corpus file used in the paper. It is freely available on the Leipzig Corpora Collection Download page.
    3. Accompanying datasets as images and .rds format so that all code-chunks in the R Markdown file can be run.
    4. BibLaTeX and .csl files for the referencing and bibliography (with APA 6th style).
    5. A snippet of the R session info after running all codes in the R Markdown file.
    6. RStudio project file (.Rproj). Double click on this file to open an RStudio session associated with the content of this repository. See here and here for details on Project-based workflow in RStudio.
    7. A .docx template file following the basic stylesheet for Linguistik Indonesia

    Put all these files in the same folder (including the downloaded Leipzig corpus file)!

    To render the R Markdown into MS Word document, we use the bookdown R package (Xie, 2018). Make sure this package is installed in R.

    Yihui Xie (2018). bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.6.
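
    Rendering the R Markdown source to MS Word with bookdown can be done from the R
    console with a call along these lines (the .Rmd file name here is a placeholder):

        # Render the R Markdown source to a Word document via bookdown.
        # "paper.Rmd" is a placeholder for the actual .Rmd file in this repository.
        install.packages("bookdown")   # if not already installed
        rmarkdown::render("paper.Rmd",
                          output_format = bookdown::word_document2())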


  7. Large Landslide Exposure in Metropolitan Cities

    • zenodo.org
    bin, csv
    Updated Sep 27, 2024
    Cite
    Joaquin V. Ferrer (2024). Large Landslide Exposure in Metropolitan Cities [Dataset]. http://doi.org/10.5281/zenodo.13842843
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Sep 27, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Joaquin V. Ferrer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Sep 27, 2024
    Description

    These datasets (.Rmd, .Rproj, .rds) are ready to use within the R software for statistical programming with the RStudio graphical user interface (https://posit.co/download/rstudio-desktop/). Please copy the folder structure into one single directory and follow the instructions given in the .Rmd file (see the loading sketch after the file listing below). Files and data are listed and described as follows:

    Main directory files: results_fpath

    • Code containing statistical analysis and plotting: 20240927_code.Rmd
    • 1_melted_lan_df.rds: Landslide time series database covering 1,085 landslides intersected with settlement footprints from 1985-2015.
    • 4_cities_lan.df.rds: City and landslide data for these 1,085 landslides intersected with settlement footprints from 1985-2015.
    • 7_zoib_nested_pop_pressure_model: brms statistical model file.
    • ghs_stat_fua_comb.gpkg: Urban center data from the GHSL - Global Human Settlement Layer.

    Population estimation files: wpop_files

    • 2015_ls_pop.csv: Estimates of population on landslides using the 100x100 population density grid from the WorldPop dataset.

    Steepness and elevation analysis derived from SRTM and processed in Google Earth Engine for landslides, mountain regions and urban centers in cities: gee_files

    • 1_mr_met.csv: Elevation and mean slope for mountain region areas in cities
    • 2_uc_met.csv: Elevation and mean slope for urban centers (defined in the GHSL data) in cities

    Standard deviation analysis derived from SRTM and processed in Google Earth Engine for mean slope in mountain regions and urban centers in cities: gee_sd

    • gee_mr.csv: Mean slope and standard deviation for mountain region
    • gee_uc.csv: Mean slope and standard deviation for urban centers (defined in the GHSL data)
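
    A minimal sketch of loading the .rds files listed above into an R session (file
    names from the listing; the full workflow is documented in 20240927_code.Rmd):

        # Load the landslide time-series and city-level tables described above.
        melted_lan <- readRDS("1_melted_lan_df.rds")
        cities_lan <- readRDS("4_cities_lan.df.rds")
        str(melted_lan)   # inspect before following the .Rmd instructions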
  8. Political Analysis Using R: Example Code and Data, Plus Data for Practice...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 28, 2020
    Cite
    Jamie Monogan (2020). Political Analysis Using R: Example Code and Data, Plus Data for Practice Problems [Dataset]. http://doi.org/10.7910/DVN/ARKOTI
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 28, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Jamie Monogan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.

  9. R codes and dataset for Visualisation of Diachronic Constructional Change...

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication
    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository
    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consist of two files (i.e. will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for the frequency of the collocates with be going to) and (iv) will (for the frequency of the collocates with will); it is available in input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
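
    The per-million-words normalisation described above amounts to something like
    the following sketch; the column names follow the repository description, and
    the exact structure of coha_size.txt is an assumption.

        # Sketch of the per-million-words normalisation (not the repository's code).
        # Column names follow the description above; coha_size.txt is assumed to
        # hold a decade column and a size column (corpus words per decade).
        raw  <- read.delim("input_data_raw.txt")
        coha <- read.delim("coha_size.txt")

        futurate <- merge(raw, coha, by = "decade")
        futurate$will_pmw     <- futurate$will     / futurate$size * 1e6
        futurate$going_to_pmw <- futurate$going_to / futurate$size * 1e6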

  10. Codes in R for spatial statistics analysis, ecological response models and...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Apr 24, 2025
    Cite
    D. W. Rössel-Ramírez; J. Palacio-Núñez; S. Espinosa; J. F. Martínez-Montoya (2025). Codes in R for spatial statistics analysis, ecological response models and spatial distribution models [Dataset]. http://doi.org/10.5281/zenodo.7603557
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    D. W. Rössel-Ramírez; J. Palacio-Núñez; S. Espinosa; J. F. Martínez-Montoya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in Rstudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), lattice (Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), modelmetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).

    It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance-similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization for the code used in the GLM and DOMAIN runs:

    First, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend using 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (Pers. Obs.). Then, we extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) as a function of the presence and background points (e.g., Hijmans and Elith, 2017).

    Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data each, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For the training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009). A sketch of this split and training control is shown below.
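
    A minimal sketch of that split and training control using the caret package
    (the object 'occ' and its 'presence' column are hypothetical):

        # Sketch of the 75/25 split and 10-fold cross-validation control (caret).
        # 'occ' is a hypothetical data frame with a factor column 'presence' (0/1).
        library(caret)

        set.seed(42)
        idx   <- createDataPartition(occ$presence, p = 0.75, list = FALSE)
        train <- occ[idx, ]
        test  <- occ[-idx, ]
        ctrl  <- trainControl(method = "cv", number = 10)   # 10-fold CV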

    After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and cross iterations of 5,000 repetitions (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The resulting plots were the partial dependence plots, as a function of each predictor variable.

    Subsequently, the correlation between variables is computed with Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).

    Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the significance (p-value) of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots include the probability of occurrence and the values for continuous variables or the categories for discrete variables. The points of the presence and background training groups are also included.
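
    A minimal sketch of a per-variable polynomial GLM of that kind (the predictor
    name 'bio1' and the data frames 'train'/'test' are hypothetical):

        # Per-variable polynomial (linear + quadratic) GLM sketch.
        # 'presence' is the 0/1 response; 'bio1' is a hypothetical predictor.
        m_bio1 <- glm(presence ~ poly(bio1, 2), family = binomial, data = train)
        summary(m_bio1)   # p-values used to screen variables (alpha <= 0.05)
        prob <- predict(m_bio1, newdata = test, type = "response")  # occurrence probability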

    On the other hand, a global GLM was also run, from which the generalized model is evaluated by means of a 2 x 2 contingency matrix, including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we select an arbitrary boundary of 0.5 to obtain better modeling performance and avoid high percentage of bias in type I (omission) or II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).

    Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).

                     Validation set
    Model            True            False
    Presence         A               B
    Background       C               D

    We then calculated the Overall and True Skill Statistics (TSS) metrics. The first is used to assess the proportion of correctly predicted cases, while the second metric assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). This metric also gives equal importance to the prevalence of presence prediction as to the random performance correction (Fielding and Bell, 1997; Allouche et al., 2006).

    The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background group subdivided into 75% training and 25% test, each. We only included the presence training subset and the predictor variables stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
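
    A sketch of fitting and evaluating a DOMAIN model with the dismo package
    (object names are hypothetical; 'predictors' is a RasterStack of environmental
    layers and the point sets are two-column longitude/latitude matrices):

        # DOMAIN (Gower similarity) sketch with dismo -- not the archived code.
        library(dismo)

        dm   <- domain(predictors, pres_train)      # fit on presence training points
        suit <- predict(predictors, dm)             # habitat-suitability surface
        ev   <- evaluate(p = pres_test, a = bg_test, model = dm, x = predictors)
        ev                                          # AUC and threshold diagnostics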

    Regarding the model evaluation and estimation, we selected the following estimators:

    1) partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).

    2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated to have an expected confidence of 75% to 99% probability (De Long et al., 1988).

  11. Population Pyramid Data and R Script for the US, States, and Counties 1970 -...

    • openicpsr.org
    delimited
    Updated Jan 23, 2020
    + more versions
    Cite
    Nathanael Rosenheim (2020). Population Pyramid Data and R Script for the US, States, and Counties 1970 - 2017 [Dataset]. http://doi.org/10.3886/E117081V2
    Explore at:
    Available download formats: delimited
    Dataset updated
    Jan 23, 2020
    Dataset provided by
    Texas A&M University
    Authors
    Nathanael Rosenheim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States, Counties, States
    Description

    Population pyramids provide a way to visualize the age and sex composition of a geographic region, such as a nation, state, or county. A standard population pyramid divides sex into two bar charts or histograms, one for the male population and one for the female population. The two charts mirror each other and divide age into 5-year cohorts. The shape of a population pyramid provides insights into a region’s fertility, mortality, and migration patterns. When a region has high fertility and mortality, but low migration, the visualization will look like a pyramid, with the youngest age cohort (0-4 years) representing the largest percent of the population and each older cohort representing a progressively smaller percent of the population.

    In many regions fertility and mortality have decreased significantly since 1970, as people live longer and women have fewer children. With lower fertility and mortality, population pyramids are shaped more like a pillar.

    While population pyramids can be made for any geographic region, when interpreting population pyramids for smaller areas (like counties) the most important force that shapes the pyramid is often in- and out-migration (Wang and vom Hofe, 2006, p. 65). For smaller regions, population pyramids can have unique shapes.

    This data archive provides the resources needed to generate population pyramids for the United States, individual states, and any county within the United States. Population pyramids usually require significant data cleaning and graph making skills to generate one pyramid. With this data archive the data cleaning has been completed and the R script provides reusable code to quickly generate graphs. The final output is an image file with six graphs on one page. The final layout makes it easy to compare changes in population age and sex composition for any state and any county in the US for 1970, 1980, 1990, 2000, 2010, and 2017.
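
    As an illustration of the kind of plot the archived R script produces (this is
    not the archived script itself), a basic pyramid can be drawn with ggplot2 by
    negating one sex's values so the bars mirror each other:

        # Basic population-pyramid sketch with ggplot2 (illustrative data only).
        library(ggplot2)

        pyr <- data.frame(
          age = factor(rep(c("0-4", "5-9", "10-14", "15-19"), times = 2),
                       levels = c("0-4", "5-9", "10-14", "15-19")),
          sex = rep(c("Male", "Female"), each = 4),
          pct = c(9.5, 9.8, 10.1, 10.3, 9.1, 9.4, 9.7, 10.0)  # percent of population
        )
        pyr$pct <- ifelse(pyr$sex == "Male", -pyr$pct, pyr$pct)  # mirror the male bars

        ggplot(pyr, aes(x = age, y = pct, fill = sex)) +
          geom_col(width = 0.9) +
          coord_flip() +
          labs(x = "Age cohort", y = "Percent of population")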

  12. Course "Data Analysis for Medical Research using R"

    • osf.io
    Updated Jan 15, 2024
    Cite
    Monika Hebeisen (2024). Course "Data Analysis for Medical Research using R" [Dataset]. http://doi.org/10.17605/OSF.IO/R73BV
    Explore at:
    Dataset updated
    Jan 15, 2024
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Monika Hebeisen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Short R training course for medical students and other clinical researchers. Course duration is 2x3.5 hours. Basic concepts of R for data analysis are introduced. Components: setup of an R Project in R Studio, data loading from comma-separated value files or Excel files, data preparation, creation of descriptive tables, graphics, and simple statistical tests. Course structure: sandwich principle, where short receptive phases of learning are alternated with expressive phases. Receptive phases are performed with a simple real medical data example on slides and in demonstrations in R Studio, expressive phases with a more complex real dataset from a published RCT, with which the same basic data analysis steps are trained in simple exercises in the lectures and the primary outcome of the RCT is recalculated. Along the way, basic concepts of the R programming language are taught, such as the definition of objects and classes, data frames, and subsetting, functions and their arguments, together with help files, how to find and load relevant packages, and handling of missing values. After the course, participants should be able to perform an analysis with own data in R, potentially by consulting additional sources for help such as R help files and internet searches.
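
    The basic workflow taught in the course corresponds to R calls along these
    lines (the file, variable, and group names are hypothetical):

        # Hypothetical example of the basic steps covered in the course.
        dat <- read.csv("trial_data.csv")        # data loading (CSV)
        summary(dat)                             # descriptive table
        hist(dat$age)                            # simple graphic
        t.test(outcome ~ group, data = dat)      # simple statistical test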

  13. Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio

    • qubeshub.org
    Updated Jul 16, 2020
    Cite
    Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
    Explore at:
    Dataset updated
    Jul 16, 2020
    Dataset provided by
    QUBES
    Authors
    Shelly Gaynor
    Description

    Access and clean an open source herbarium dataset using Excel or RStudio.

  14. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
    Available download formats: application/x-sqlite3
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets
    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format
    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.
    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label Data type Description

    !total_1grams int The total number of words in the corpus

    !total_volumes int The total number of volumes (individual sources) in the corpus

    !total_isograms int The total number of isograms found in the corpus (before compacting)

    !total_palindromes int How many of the isograms found are palindromes

    !total_tautonyms int How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format
    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts
    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data
    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and [https://www.kilgarriff.co.uk/bnc-readme.html] (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation
    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

        python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
        python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction
    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

        python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database
    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db
    5. This will create a database called "isograms.db".
    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing
    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
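
    On the R side, querying the SQLite database from R/RStudio can look like the
    following sketch (the table name is an assumption; statistics.r contains the
    actual queries):

        # Sketch of querying isograms.db with DBI/RSQLite; the table name
        # 'ngrams_isograms' is an assumption -- see statistics.r for the real code.
        library(DBI)

        con <- dbConnect(RSQLite::SQLite(), "isograms.db")
        second_order <- dbGetQuery(con,
          "SELECT word, length, count FROM ngrams_isograms
           WHERE isogramy = 2 LIMIT 10")
        dbDisconnect(con)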

  15. ANN dual-fluid PV/T data and code in R programming - Architecture of the...

    • data.mendeley.com
    Updated Mar 8, 2021
    Cite
    Hasila Jarimi (2021). ANN dual-fluid PV/T data and code in R programming - Architecture of the Artificial Neural Network [Dataset]. http://doi.org/10.17632/gxxszgy85t.1
    Explore at:
    Dataset updated
    Mar 8, 2021
    Authors
    Hasila Jarimi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides the artificial neural network architecture for a dual-fluid photovoltaic thermal (PV/T) collector which was experimentally tested in the outdoor environment of Malaysia. The system was set up and tested in three modes: (i) air mode, (ii) water mode, and (iii) simultaneous mode. In modes (i), (ii), and (iii), respectively, air flows through the cooling channels, water flows through the cooling channels, and both air and water flow together.

    To create this dataset, the following steps were carried out:

    1. Select input variables: 5 data inputs were selected, which are Ambient temperature, wind speed, solar irradiance, inlet air temperature and inlet water temperature.
    2. Select Algorithm: for training, the Backpropagation neural network (BPNN) was used.
    3. Select output variables: 6 data output were selected, which are PV surface temperature, PV temperature, temperature of the back plate, the temperature of the outlet air and outlet water, in addition to the electrical efficiency.

    Step 1: Import the data.
    Step 2: Normalize the data.
    Step 3: Split the dataset into training and testing data.
    Step 4: Create the NN model in RStudio.

    The 'neuralnet' package in the R programming language was used. The code, written in RStudio, is provided in the attached file.
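
    A minimal sketch of a backpropagation network with the neuralnet package,
    following the 5-input/6-output structure described above (the data frame 'pvt'
    and its column names are hypothetical, and the data are assumed to be
    normalised already):

        # BPNN sketch with the 'neuralnet' package -- not the attached code.
        # 'pvt' is a hypothetical, already-normalised data frame holding the five
        # inputs and six outputs listed in the description above.
        library(neuralnet)

        set.seed(1)
        idx   <- sample(nrow(pvt), 0.8 * nrow(pvt))
        train <- pvt[idx, ]
        test  <- pvt[-idx, ]

        nn <- neuralnet(
          t_pv_surface + t_pv + t_backplate + t_air_out + t_water_out + eff_el ~
            t_ambient + wind_speed + irradiance + t_air_in + t_water_in,
          data = train, hidden = c(10), linear.output = TRUE)

        pred <- compute(nn, test[, c("t_ambient", "wind_speed", "irradiance",
                                     "t_air_in", "t_water_in")])$net.result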

  16. Replication Data for: Quantifying landscape-level land use intensity...

    • b2find.eudat.eu
    Updated Oct 15, 2017
    + more versions
    Cite
    (2017). Replication Data for: Quantifying landscape-level land use intensity patterns through radar-based remote sensing - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/cdc81d26-261f-5d7b-882a-b1691a13ee97
    Explore at:
    Dataset updated
    Oct 15, 2017
    Description

    The data package consists of four (zipped) folders and this metadata file:

    FINAL DATA
    --> Godwit Early Establishment - Data collected by a large team led by Professor Theunis Piersma (permission to use this data). The folder contains the locations of black-tailed godwit counts from 1 March to 1 May 2016. Data were used to calculate habitat selection indices of early-establishing black-tailed godwits.
    --> LogRatio_Date_Pairs - Log ratio change detection analyses carried out on Sentinel SAR1 data clipped out for the south-west Friesland study site. Used to produce Table 1, Figure 1c & d, Fig 3 and Fig 6.
    --> Processed Modis EVI - Comparison between Sentinel Radar and optical MODIS EVI data, used to produce figure 1a & b.
    --> Processed_Radar_Imagery - Contains the variance in Sentinel Radar and the standard deviation of change in Sentinel Radar values between 31 March and 17 July 2016.
    --> Shapefiles - Contains all point, line and polygon shapefiles used to intersect the spatial analysis data.
    --> StDev_LogRatioChange31Mar_17Jul2016 - Standard deviation of change in Sentinel Radar values between 31 March and 17 July 2016; used in the final ESRI mxd file to produce figures 3a and 6a.
    --> Temperature_Stavoren - Downloaded from the KNMI website and used to produce the temperature variation in Figure 1a & 1c.
    --> Two data files, parcel55_combined_veg_radar_evi.txt and parcel55_detailed_transects_2016.txt: compiled data used to produce figures 4 and 5.

    FINAL FIGURES
    Final high quality figures produced for the final print.

    FINAL MAP PROJECTS
    Esri (ArcMap 10.3) map project used to produce figures 3a & 6a.

    FINAL SCRIPTS
    R version 3.4.0, RStudio 5.4.1. Analysis scripts used to produce the figures for this manuscript and labelled as such. In addition, a pre-processing script that allows the user to stack Sentinel SAR1 data after it has been downloaded from the Sentinel Scientific Hub and processed with SNAP v6.0 (add orbit file > radiometric calibration > geometric terrain correction).

    Accepted by Journal of Applied Ecology, October 2017.

  17. data and codes

    • figshare.com
    zip
    Updated Feb 26, 2021
    Cite
    Qingsong Liu (2021). data and codes [Dataset]. http://doi.org/10.6084/m9.figshare.14102627.v5
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 26, 2021
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Qingsong Liu
    License

    https://www.gnu.org/copyleft/gpl.html

    Description

    Steps to run the code:
    1. Download our material.
    2. Unzip the data and put it in the same folder as code.rmd.
    3. Run code.rmd in RStudio. Alternatively, you can copy the code blocks from "R codes and results.html" into the R console to get the results.
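
    Step 3 can also be run non-interactively from the R console (a sketch, assuming
    the unzipped data sits next to code.rmd):

        # Render code.rmd from the console instead of clicking "Knit" in RStudio.
        rmarkdown::render("code.rmd")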

  18. Google Capstone Project - BellaBeats

    • kaggle.com
    Updated Jan 5, 2023
    Cite
    Jason Porzelius (2023). Google Capstone Project - BellaBeats [Dataset]. https://www.kaggle.com/datasets/jasonporzelius/google-capstone-project-bellabeats
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jason Porzelius
    Description

    Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted database program, Excel for both my data analysis and visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access. Therefore, completing a capstone project using web-based programs such as R Studio, SQL Workbench, or Google Sheets was not a feasible choice. I was further limited in which option to choose as the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in hopes that the findings from that dataset might reveal insights which will assist in Bellabeat's marketing strategies for future growth. My task is to provide data driven insights to business tasks provided by the Bellabeats, Inc.'s executive and data analysis team. In order to accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). In addition, I will break each part of the Data Analysis Process down into three sections to provide clarity and accountability. Those three sections are: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task using an asterisk (*) as an identifier.

    Section 1 - Ask:
    A. Guiding Questions:
    - Who are the key stakeholders and what are their goals for the data analysis project?
    - What is the business task that this data analysis project is attempting to solve?

    B. Key Tasks:
    1. Identify key stakeholders and their goals for the data analysis project.
       *The key stakeholders for this project are as follows:
       -Urška Sršen and Sando Mur - co-founders of Bellabeats, Inc.
       -Bellabeats marketing analytics team. I am a member of this team.
    2. Identify the business task.
       *The business task is:
       -As provided by co-founder Urška Sršen, the business task for this project is to gain insight into how consumers are using their non-BellaBeats smart devices in order to guide upcoming marketing strategies for the company which will help drive future growth. Specifically, the researcher was tasked with applying insights driven by the data analysis process to 1 BellaBeats product and presenting those insights to BellaBeats stakeholders.

    Section 2 - Prepare:
    A. Guiding Questions:
    - Where is the data stored and organized?
    - Are there any problems with the data?
    - How does the data help answer the business question?

    B. Key Tasks:
    1. Research and communicate the source of the data, and how it is stored/organized, to stakeholders.
       *The data source used for our case study is FitBit Fitness Tracker Data. This dataset is stored in Kaggle and was made available through user Mobius in an open-source format. Therefore, the data is public and available to be copied, modified, and distributed, all without asking the user for permission. These datasets were generated by respondents to a distributed survey via Amazon Mechanical Turk reportedly (see credibility section directly below) between 03/12/2016 thru 05/12/2016.
       *Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals. This data is stored in 18 CSV documents. I downloaded all 18 documents into my local laptop and decided to use 2 documents for the purposes of this project as they were files which had merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were:
       -sleepDay_merged.csv
       -dailyActivity_merged.csv
    2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
       *As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, upon my initial data processing, I found that only 1 month of data was reported.
       *As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual IDs in the dailyActivity_merged dataset.
       *Due to the small number of participants (...

  19. Data collection materials (including Bibliometric and thematic analysis)

    • scidb.cn
    Updated Jul 25, 2025
    Cite
    QIAN ZHANG (2025). Data collection materials (including Bibliometric and thematic analysis) [Dataset]. http://doi.org/10.57760/sciencedb.28312
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 25, 2025
    Dataset provided by
    Science Data Bank
    Authors
    QIAN ZHANG
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    220 articles were found in the first database search, 123 of which came from Scopus and 97 from Web of Science. However, the Bibliometrix R tool (Biblioshiny) is limited to analysing literature from a single online database. Specifically, it can analyse only the two databases, Web of Science and Scopus, individually and cannot combine them. We therefore used RStudio to combine the data obtained in txt format from Web of Science and the data obtained in bib format from Scopus, and to eliminate duplicate entries, using code. After running the R code, 170 articles remained for additional screening and selection, after 48 duplicate articles were removed and 2 articles were missing. The 170 records were then exported as xlsx files and added to the "Biblioshiny" Bibliometrix R tool for a comprehensive bibliometric analysis.

    The systematic approach to literature identification and screening was conducted following the PRISMA 2020 paradigm (Page et al., 2021), as depicted in Figure 3. After duplicate entries were removed, 170 records were left for further filtering based on the selection criteria listed in Table 1. 33 records that did not meet the inclusion requirements—aside from the second criterion—were eliminated after a preliminary review of each record's title and abstract. Out of the remaining 137 entries, we were able to collect the whole texts of 120 reports for analysis. Despite our best efforts, which included both automatic (Zotero) and human (internet search) methods, we were unable to overcome the limitations that prevented us from obtaining the remaining 17 reports. These limitations included limited availability, restricted access, and/or technical issues.

    Afterwards, the writers conducted a thorough examination of the complete text of each report based on specific criteria for inclusion and exclusion. This was done to identify any discussions that were not mentioned in the abstract. A total of 67 studies were excluded from the analysis. Out of these, 35 studies did not discuss tourist identity and instead focused on related identities that could be influenced by heritage tourism, 10 examined tourist identity in the context of heritage tourism but did not specifically focus on its construction, and 22 studies explored the construction of identity in heritage tourism but primarily relied on surveys of local residents rather than focusing on the tourist group. Despite doing a thorough search for supplementary records by examining citations, no additional papers were discovered. Consequently, the total count of publications that were considered for thematic analysis in the review was 53. Throughout the entire selection process, the writers carefully reviewed every publication and discussed any discrepancies in their findings until they all agreed.
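
    The merge-and-deduplicate step described above corresponds roughly to the
    following sketch with the bibliometrix package (the export file names are
    placeholders):

        # Sketch of combining Web of Science (.txt) and Scopus (.bib) exports and
        # removing duplicates with bibliometrix; file names are placeholders.
        library(bibliometrix)

        wos    <- convert2df("wos_records.txt",    dbsource = "wos",    format = "plaintext")
        scopus <- convert2df("scopus_records.bib", dbsource = "scopus", format = "bibtex")
        merged <- mergeDbSources(wos, scopus, remove.duplicated = TRUE)
        # biblioshiny() can then be used on the merged data frame for the
        # full bibliometric analysis.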

  20. Data from: Improving taxonomic practices and enhancing its extensibility - an example from araneology

    • figshare.com
    xlsx
    Updated Jan 1, 2022
    Cite
    Jason Bond; Rebecca Godwin; jordan colby; Lacie Newton; Xavier Zahnle; Ingi Agnarsson; Chris Hamilton; Matjaž Kuntner (2022). Improving taxonomic practices and enhancing its extensibility - an example from araneology [Dataset]. http://doi.org/10.6084/m9.figshare.17263835.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jan 1, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jason Bond; Rebecca Godwin; jordan colby; Lacie Newton; Xavier Zahnle; Ingi Agnarsson; Chris Hamilton; Matjaž Kuntner
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We downloaded nearly all the taxonomic works documented in the WSC during the period 2008-2018. Each investigator documented authorship and the number of new species described per publication; our review focused exclusively on taxa newly described during the study period. Only a few non-English works were omitted, namely those for which we could not find a translation that allowed confident data scoring.

    Table 1 below lists the parameters reviewed and how they were scored. Binary parameters were scored as NO/YES (0/1) responses, whereas others were quantitative (absolute numbers of observations or counts). In general, we assessed: 1) what type of data were used to establish species constructs; 2) how species were illustrated; 3) whether raw specimen data were downloadable; 4) how many specimens were examined for each species; and 5) which sexes were available for each species. The number of specimens available was tabulated as 1, 2, or >2; the >2 category is somewhat arbitrary and underestimates the paucity of data associated with some species, but it objectively captures the variation in the data set without documenting the absolute number of specimens for every species. A study was classified as 'integrative' if morphological data (genitalic or other) were used in combination with at least one other data source. For species concept, we assessed each paper to determine whether the authors stated explicitly which species concept they used to delineate taxa.

    The data were tabulated in an MS Excel spreadsheet; summary statistics and bar graphs were produced using the base R statistics packages in RStudio [29].
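    A rough sketch of this kind of base-R summary is shown below; the file and column names are hypothetical and do not correspond to the released spreadsheet.

        # Hypothetical sketch of the tabulation/summary workflow described above.
        # File and column names are placeholders, not those of the actual dataset.
        library(readxl)   # to read the xlsx spreadsheet (not part of base R)

        scores <- read_excel("taxonomic_scoring.xlsx")

        # Binary (0/1) parameters can be summarised as proportions
        prop_integrative <- mean(scores$integrative, na.rm = TRUE)

        # Counts of categorical scorings (e.g. 1, 2, >2 specimens examined) as a base-R bar graph
        barplot(table(scores$specimens_examined),
                xlab = "Specimens examined per species",
                ylab = "Number of species")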

Cite
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788

Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
bin, application/gzip, zip, text/x-python (available download formats)
Dataset updated
Aug 2, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
License

https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

Description
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
A Case Study of the PyPI Ecosystem

**Note:** the link to the data artifacts is already included in the paper. 
The link to the code will be included in the camera-ready version as well.


Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
 described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
 This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
 statistics. It is ~34 GB unpacked. This dataset still doesn't include the PyPI packages
 themselves, which take around 2 TB.
- **build_model.r, helpers.r** - R files to process the survival data 
  (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
  `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
  **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)

Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for details):
- up to 2 TB of disk space
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from GitLab:

   git clone https://gitlab.com/user2589/ghd.git
   git checkout 0.1.0
 
 `cd` into the extracted folder. 
 All commands below assume it is the current directory.
  
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
- install docker. For Ubuntu Linux, the command is 
  `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
 Without this dependency, you might get an error on the next step, 
 but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`.
- disable all APIs except GitHub (Bitbucket and GitLab support were
 not yet implemented when this study was in progress): edit
 `scraper/__init__.py` and comment out everything except GitHub support
 in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get the output of the Python function 
`common.utils.survival_data()` and save it into a CSV file:

  # copy and paste into a Python console
  from common import utils
  survival_data = utils.survival_data('pypi', '2008', smoothing=6)
  survival_data.to_csv('survival_data.csv')

Since full replication would take several months, here are some ways to speed up
the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv` and go to the next step (a quick sanity check of this file is sketched below)
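
If you use this precomputed file, a quick sanity check in R before running the provided
model scripts (`build_model.r`, `helpers.r`) might look like the following purely
illustrative sketch (it makes no assumptions about the column layout):

    # Sketch only: the actual survival models are built by build_model.r / helpers.r.
    survival_data <- read.csv("survival_data.csv", stringsAsFactors = FALSE)
    str(survival_data)    # inspect column names and types
    nrow(survival_data)   # number of records
    # source("helpers.r"); source("build_model.r")  # then run the provided scripts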

#### Option 2.b, difficulty level: easy

Use the precomputed longitudinal feature values to build the final table.
The whole process takes about 15 to 30 minutes.

- create a folder `