100+ datasets found
  1. Collection of example datasets used for the book - R Programming -...

    • figshare.com
    txt
    Updated Dec 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.

  2. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Explore at:
    bin, application/gzip, zip, text/x-pythonAvailable download formats
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Marat Valiev; Marat Valiev; Bogdan Vasilescu; James Herbsleb; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on detalization level (see Step 2 for more details):
    - up to 2Tb of disk space (see Step 2 detalization levels)
    - at least 16Gb of RAM (64 preferable)
    - few hours to few month of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       git checkout 0.1.0
     
     `cd` into the extracted folder. 
     All commands below assume it as a current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/init.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speedup
    the process:
    
    ####Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
    
    ####Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15..30 minutes.
    
    - create a folder `
  3. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Networkhttp://www.hptn.org/
    HIV Vaccine Trials Networkhttp://www.hvtn.org/
    National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results. Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program. To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper. Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.

  4. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Explore at:
    zip(59313797 bytes)Available download formats
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and delete many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both lyft and uber but it is still a cleaned version from the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the csv into R, here is the code for how you do this:

    This loads the file into R

    df<-read.csv('uber.csv')

    The next codes is to subset the data into specific car types. The example below only has Uber 'Black' car types.

    df_black<-subset(uber_df, uber_df$name == 'Black')

    This next portion of code will be to load it into R. First, we must write this dataframe into a csv file on our computer in order to load it into R.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in you working directory. If you are not familiar with your working directory. Run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  5. Data from: Data and code from: Environmental influences on drying rate of...

    • catalog.data.gov
    • datasetcatalog.nlm.nih.gov
    • +2more
    Updated Apr 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Agricultural Research Service (2025). Data and code from: Environmental influences on drying rate of spray applied disinfestants from horticultural production services [Dataset]. https://catalog.data.gov/dataset/data-and-code-from-environmental-influences-on-drying-rate-of-spray-applied-disinfestants-
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Servicehttps://www.ars.usda.gov/
    Description

    This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript:Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials were quickly absorbed into the body of the material, such as wood and concrete. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min. Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:Drying2022.csv: drying rate data for the 2022 experimental runWeather2022.csv: weather data for the 2022 experimental runDrying2023.csv: drying rate data for the 2023 experimental runWeather2023.csv: weather data for the 2023 experimental rundisinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation codedisinfestant_drying_analysis.html: rendered output of notebookMS_figures.R: additional R code to create figures formatted for journal requirementsfit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing clusterfit2023_discretetime_weather_solar.rds: fitted brms model object for 2023data_dictionary.xlsx: descriptions of each column in the CSV data files

  6. Cyclistic_Divvy_data

    • kaggle.com
    zip
    Updated Jun 11, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rami Ghaith (2023). Cyclistic_Divvy_data [Dataset]. https://www.kaggle.com/datasets/ramighaith/cyclistic-divvy-data
    Explore at:
    zip(21440758 bytes)Available download formats
    Dataset updated
    Jun 11, 2023
    Authors
    Rami Ghaith
    Description

    The following data shows riding information for members vs casual riders at the company Cyclistic(made up name). This is a dataset used as a case study for the google data analytics certificate.

    The Changes Done to the Data in Excel: - Removed all duplicated (none were found) - Added a ride_length column by subtracting ended_at by started_at using the following formula "=C2-B2" and then turned that type into a Time, 37:30:55 - Added a day_of_week column using the following formula "=WEEKDAY(B2,1)" to display the day the ride took place on, 1= sunday through 7=saturday. - There was data that can be seen as ########, that data was left the same with no changes done to it, this data simply represents negative data and should just be looked at as 0.

    Processing the Data in RStudio: - Installed required packages such as tidyverse for data import and wrangling, lubridate for date functions and ggplot for visualization. - Step 1: I read the csv files into R to collect the data - Step 2: Made sure the data all contained the same column names because I want to merge them into one - Step 3: Renamed all column names to make sure they align, then merged them into one combined data - Step 4: More data cleaning and analyzing - Step 5: Once my data was cleaned and clearly telling a story, I began to visualize it. The visualizations done can be seen below.

  7. n

    Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jan 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

    Methods eLAB Development and Source Code (R statistical software):

    eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

    eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown ((https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.

    Data Dictionary (DD)

    EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

    Study Cohort

    This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.

  8. d

    Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death d o us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking arou nd on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you. the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked. this new github repository contains five scripts: 1992 - 2010 download HRS microdata.R loop through every year and every file, download, then unzip everything in one big party impor t longitudinal RAND contributed files.R create a SQLite database (.db) on the local disk load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram) longitudinal RAND - analysis examples.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create tw o database-backed complex sample survey object, using a taylor-series linearization design perform a mountain of analysis examples with wave weights from two different points in the panel import example HRS file.R load a fixed-width file using only the sas importation script directly into ram with < a href="http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html">SAScii parse through the IF block at the bottom of the sas importation script, blank out a number of variables save the file as an R data file (.rda) for fast loading later replicate 2002 regression.R connect to the sql database created by the 'import longitudinal RAND contributed files' program create a database-backed complex sample survey object, using a taylor-series linearization design exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document . click here to view these five scripts for more detail about the health and retirement study (hrs), visit: michigan's hrs homepage rand's hrs homepage the hrs wikipedia page a running list of publications using hrs notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you c an think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself. confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D

  9. g

    Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Explore at:
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.

  10. d

    Data from: Reference transcriptomics of porcine peripheral immune cells...

    • datasets.ai
    • agdatacommons.nal.usda.gov
    • +2more
    45, 47, 57, 8
    Updated Mar 30, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Department of Agriculture (2024). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://datasets.ai/datasets/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Explore at:
    8, 45, 57, 47Available download formats
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Department of Agriculture
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826.

    Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).


    Resources in this dataset:

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format.

      File Name: PBMC7_AllCells.zip

      Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows:

      • matrix of gene counts* (matrix.mtx.gx)
      • gene names (features.tsv.gz)
      • cell IDs (barcodes.tsv.gz)

      *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation.

      Data can be read into R using the function Read10X().


    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata.

      File Name: PBMC7_AllCells_meta.csv

      Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include:

      • nCount_RNA = the number of transcripts detected in a cell
      • nFeature_RNA = the number of genes detected in a cell
      • Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells
      • prcntMito = percent mitochondrial reads in a cell
      • Scrublet = doublet probability score assigned to a cell
      • seurat_clusters = cluster ID assigned to a cell
      • PaperIDs = sample ID for a cell
      • celltypes = cell type ID assigned to a cell

      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates.

        File Name: PBMC7_AllCells_PCAcoord.csv

        Resource Description: .csv file containing first 100 PCA coordinates for cells.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates.

        File Name: PBMC7_AllCells_tSNEcoord.csv

        Resource Description: .csv file containing t-SNE coordinates for all cells.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates.

        File Name: PBMC7_AllCells_UMAPcoord.csv

        Resource Description: .csv file containing UMAP coordinates for all cells.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates.

        File Name: PBMC7_CD4only_tSNEcoord.csv

        Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates.

        File Name: PBMC7_CD4only_UMAPcoord.csv

        Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates.

        File Name: PBMC7_GDonly_UMAPcoord.csv

        Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates.

        File Name: PBMC7_GDonly_tSNEcoord.csv

        Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information.

        File Name: UnfilteredGeneInfo.txt

        Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.


      • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat.

        File Name: PBMC7.tar

        Resource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().

  11. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Dec 16, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Michael Kempf; Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Michael Kempf; Michael Kempf
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:

    “code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


    This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.


    2_MERGE_MODIS_tiles.R


    In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").


    3_CROP_MODIS_merged_tiles.R


    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
    The repository provides the already clipped and merged NDVI datasets.


    4_TREND_analysis_NDVI.R


    Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
    To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.


    5_BUILT_UP_change_raster.R


    Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.


    7_YIELD_plot.R


    In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.


    8_GLDAS_read_extract_trend


    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
    From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
    From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.

  12. H

    Survey of Consumer Finances (SCF)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anthony Damico (2013). Survey of Consumer Finances (SCF) [Dataset]. http://doi.org/10.7910/DVN/FRMKMF
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    analyze the survey of consumer finances (scf) with r the survey of consumer finances (scf) tracks the wealth of american families. every three years, more than five thousand households answer a battery of questions about income, net worth, credit card debt, pensions, mortgages, even the lease on their cars. plenty of surveys collect annual income, only the survey of consumer finances captures such detailed asset data. responses are at the primary economic unit-level (peu) - the economically dominant, financially interdependent family members within a sampled household. norc at the university of chicago administers the data collection, but the board of governors of the federal reserve pay the bills and therefore call the shots. if you were so brazen as to open up the microdata and run a simple weighted median, you'd get the wrong answer. the five to six thousand respondents actually gobble up twenty-five to thirty thousand records in the final pub lic use files. why oh why? well, those tables contain not one, not two, but five records for each peu. wherever missing, these data are multiply-imputed, meaning answers to the same question for the same household might vary across implicates. each analysis must account for all that, lest your confidence intervals be too tight. to calculate the correct statistics, you'll need to break the single file into five, necessarily complicating your life. this can be accomplished with the meanit sas macro buried in the 2004 scf codebook (search for meanit - you'll need the sas iml add-on). or you might blow the dust off this website referred to in the 2010 codebook as the home of an alternative multiple imputation technique, but all i found were broken links. perhaps it's time for plan c, and by c, i mean free. read the imputation section of the latest codebook (search for imputation), then give these scripts a whirl. they've got that new r smell. the lion's share of the respondents in the survey of consumer finances get drawn from a pretty standard sample of american dwellings - no nursing homes, no active-duty military. then there's this secondary sample of richer households to even out the statistical noise at the higher end of the i ncome and assets spectrum. you can read more if you like, but at the end of the day the weights just generalize to civilian, non-institutional american households. one last thing before you start your engine: read everything you always wanted to know about the scf. my favorite part of that title is the word always. this new github repository contains t hree scripts: 1989-2010 download all microdata.R initiate a function to download and import any survey of consumer finances zipped stata file (.dta) loop through each year specified by the user (starting at the 1989 re-vamp) to download the main, extract, and replicate weight files, then import each into r break the main file into five implicates (each containing one record per peu) and merge the appropriate extract data onto each implicate save the five implicates and replicate weights to an r data file (.rda) for rapid future loading 2010 analysis examples.R prepare two survey of consumer finances-flavored multiply-imputed survey analysis functions load the r data files (.rda) necessary to create a multiply-imputed, replicate-weighted survey design demonstrate how to access the properties of a multiply-imput ed survey design object cook up some descriptive statistics and export examples, calculated with scf-centric variance quirks run a quick t-test and regression, but only because you asked nicely replicate FRB SAS output.R reproduce each and every statistic pr ovided by the friendly folks at the federal reserve create a multiply-imputed, replicate-weighted survey design object re-reproduce (and yes, i said/meant what i meant/said) each of those statistics, now using the multiply-imputed survey design object to highlight the statistically-theoretically-irrelevant differences click here to view these three scripts for more detail about the survey of consumer finances (scf), visit: the federal reserve board of governors' survey of consumer finances homepage the latest scf chartbook, to browse what's possible. (spoiler alert: everything.) the survey of consumer finances wikipedia entry the official frequently asked questions notes: nationally-representative statistics on the financial health, wealth, and assets of american hous eholds might not be monopolized by the survey of consumer finances, but there isn't much competition aside from the assets topical module of the survey of income and program participation (sipp). on one hand, the scf interview questions contain more detail than sipp. on the other hand, scf's smaller sample precludes analyses of acute subpopulations. and for any three-handed martians in the audience, ther e's also a few biases between these two data sources that you ought to consider. the survey methodologists at the federal reserve take their job...

  13. d

    Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics ( bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups b y state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be t reated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show. this new github repository contains three scripts: 2005-2012 asec - download all microdata.R down load the fixed-width file containing household, family, and person records import by separating this file into three tables, then merge 'em together at the person-level download the fixed-width file containing the person-level replicate weights merge the rectangular person-level file with the replicate weights, then store it in a sql database create a new variable - one - in the data table 2012 asec - analysis examples.R connect to the sql database created by the 'download all microdata' progr am create the complex sample survey object, using the replicate weights perform a boatload of analysis examples replicate census estimates - 2011.R connect to the sql database created by the 'download all microdata' program create the complex sample survey object, using the replicate weights match the sas output shown in the png file below 2011 asec replicate weight sas output.png statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page the bureau of labor statistics' current population survey page the current population survey's wikipedia article notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current populat ion survey to talk about america, subract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D

  14. g

    Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2018 [Dataset]. http://doi.org/10.3886/E105403
    Explore at:
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 4 release notes:Adds data for 2018Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.

  15. s

    Velcro R Import Data India – Buyers & Importers List

    • seair.co.in
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Seair Exim Solutions, Velcro R Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in/velcro-r-import-data.aspx
    Explore at:
    .text/.csv/.xml/.xls/.binAvailable download formats
    Dataset authored and provided by
    Seair Exim Solutions
    Area covered
    India
    Description

    Access updated Velcro R import data India with HS Code, price, importers list, Indian ports, exporting countries, and verified Velcro R buyers in India.

  16. v

    Global import data of Code R Black

    • volza.com
    csv
    Updated Nov 14, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Volza FZ LLC (2025). Global import data of Code R Black [Dataset]. https://www.volza.com/imports-united-states/united-states-import-data-of-code+r+black-from-italy
    Explore at:
    csvAvailable download formats
    Dataset updated
    Nov 14, 2025
    Dataset authored and provided by
    Volza FZ LLC
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Count of importers, Sum of import value, 2014-01-01/2021-09-30, Count of import shipments
    Description

    32 Global import shipment records of Code R Black with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.

  17. d

    Data and scripts associated with a manuscript investigating impacts of solid...

    • search.dataone.org
    • data.nceas.ucsb.edu
    • +1more
    Updated Aug 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg (2023). Data and scripts associated with a manuscript investigating impacts of solid phase extraction on freshwater organic matter optical signatures and mass spectrometry pairing [Dataset]. http://doi.org/10.15485/1995543
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg
    Time period covered
    Aug 30, 2021 - Sep 15, 2021
    Area covered
    Description

    This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system” submitted to “Limnology and Oceanography: Methods.” This data is an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914). Other associated data and field metadata can be found at the link provided. The goal of this manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected from within the Yakima River Basin, Washington were analyzed dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement. The extraction efficiency for the DOC and common optical indices were calculated. In addition, SPE samples were subject to ultra-high resolution mass spectrometry and compared with the ambient and SPE generated optical data. Finally, in addition to this cross-platform inter-comparison, we further performed and intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of the samples (ranging from ~30 to ~75%) compared to the common practice of assuming the DOC extraction efficiency of freshwater samples at 60%. This data package folder consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) readme; (2) data dictionary (dd); (3) file-level metadata (flmd); (4) final data summary output from processing script; and (5) the processing script. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce manuscript statistics and figures (with the exception of that stated below). The Data_Input folder has two subfolders: (1) FTICR and (2) Optics. Additionally, the Data_Input folder contains dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC) data (SPS_NPOC_Summary.csv) and relevant supporting Solid Phase Extraction Volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. In addition, the data dictionary (SPS_SPE_dd.csv), file level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are provided. The FTICR subfolder contains all raw FTICR data as well as instructions for processing. In addition, post processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) is provided that can be directly read into R with the associated R-markdown file. The Optics subfolder contains all Absorbance and Fluorescence Spectra. Fluorescence spectra have been blank corrected, inner filter corrected, and undergone scatter removal. In addition, this folder contains Matlab code used to make a portion of Figure 1 within the manuscript, derive various spectral parameters used within the manuscript, and used for parallel factor analysis (PARAFAC) modeling. Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are directly read into the associated R-markdown file.

  18. Life Expectancy WHO

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
    Explore at:
    zip(121472 bytes)Available download formats
    Dataset updated
    Jun 19, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The objective behind attempting this dataset was to understand the predictors that contribute to the life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose. Steps Involved: - Read the csv file - Data Cleaning: - Variables Country and Status were showing as having character data types. These had to be converted to factor - 2563 missing values were encountered with Population variable having the most of the missing values i.e 652 - Missing rows were dropped before we could run the analysis. 3) Run Linear Regression - Before running linear regression, 3 variables were dropped as they were not found to be having that much of an effect on the dependent variable i.e Life Expectancy. These 3 variables were Country, Year & Status. This meant we are now working with 19 variables (1 dependent and 18 independent variables) - We run the linear regression. Multiple R squared is 83% which means that independent variables can explain 83% change or variance in the dependent variable. - OULTLIER DETECTION. We check for outliers using IQR and find 54 outliers. These outliers are then removed before we run the regression analysis once again. Multiple R squared increased from 83% to 86%. - MULTICOLLINEARITY. We check for multicollinearity using the VIF model(Variance Inflation Factor). This is being done in case when two or more independent variables showing high correlation. The thumb rule is that absolute VIF values above 5 should be removed. We find 6 variables that have a VIF value higher than 5 namely Infant.deaths, percentage.expenditure,Under.five.deaths,GDP,thinness1.19,thinness5.9. Infant deaths and Under Five deaths have strong collinearity so we drop infant deaths(which has the higher VIF value). - When we run the linear regression model again, VIF value of Under.Five.Deaths goes down from 211.46 to 2.74 while the other variable's VIF values reduce very less. Variable thinness1.19 is now dropped and we run the regression once more. - Variable thinness5.9 whose absolute VIF value was 7.61 has now dropped to 1.95. GDP and Population are still having VIF value more than 5 but I decided against dropping these as I consider them to be important independent variables. - SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We run the train data and get multiple R squared of 86% and p value less than that of alpha which states that it is statistically significant. We use the train data to predict the test data to find out the RMSE and MAPE. We run the library(Metrics) for this purpose. - In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on an average, the predicted values have an error of 3.2 years as compared to the actual life expectancy values. - MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037). - MAE (Mean Absolute Error) is 2.55. This indicates that on an average, the predicted values deviate by approximately 2.83 years from the actual values.

    We use DECISION TREE MODEL for the analysis.

    • Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).
    • We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.
    • We use 5 fold cross validation method with CP (complexity parameter) being 0.01.
    • In Decision Tree , RMSE (Root Mean Squared Error) is 3.06. This indicates that on an average, the predicted values have an error of 3.06 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).
    • MAE (Mean Absolute Error) is 2.35. This indicates that on an average, the predicted values deviate by approximately 2.35 years from the actual values.

    We use RANDOM FOREST for the analysis.

    • Run library(randomForest)
    • We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.
    • Predict Life expectancy through random forest model.
    • In Random Forest , RMSE (Root Mean Squared Error) is 1.73. This indicates that on an average, the predicted values have an error of 1.73 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).
    • MAE (Mean Absolute Error) is 1.14. This indicates that on an average, the predicted values deviate by approximately 1.14 years from the actual values.

    Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.

  19. WoSIS snapshot - December 2023

    • data.isric.org
    • repository.soilwise-he.eu
    Updated Dec 20, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    ISRIC - World Soil Information (2023). WoSIS snapshot - December 2023 [Dataset]. https://data.isric.org/geonetwork/srv/api/records/e50f84e1-aa5b-49cb-bd6b-cd581232a2ec
    Explore at:
    www:link-1.0-http--related, www:link-1.0-http--link, www:download-1.0-ftp--downloadAvailable download formats
    Dataset updated
    Dec 20, 2023
    Dataset provided by
    International Soil Reference and Information Centre
    Authors
    ISRIC - World Soil Information
    Time period covered
    Jan 1, 1918 - Dec 1, 2022
    Area covered
    Description

    ABSTRACT: The World Soil Information Service (WoSIS) provides quality-assessed and standardized soil profile data to support digital soil mapping and environmental applications at broad scale levels. Since the release of the ‘WoSIS snapshot 2019’ many new soil data were shared with us, registered in the ISRIC data repository, and subsequently standardized in accordance with the licenses specified by the data providers. The source data were contributed by a wide range of data providers, therefore special attention was paid to the standardization of soil property definitions, soil analytical procedures and soil property values (and units of measurement). We presently consider the following soil chemical properties (organic carbon, total carbon, total carbonate equivalent, total Nitrogen, Phosphorus (extractable-P, total-P, and P-retention), soil pH, cation exchange capacity, and electrical conductivity) and physical properties (soil texture (sand, silt, and clay), bulk density, coarse fragments, and water retention), grouped according to analytical procedures (aggregates) that are operationally comparable. For each profile we provide the original soil classification (FAO, WRB, USDA, and version) and horizon designations as far as these have been specified in the source databases. Three measures for 'fitness-for-intended-use' are provided: positional uncertainty (for site locations), time of sampling/description, and a first approximation for the uncertainty associated with the operationally defined analytical methods. These measures should be considered during digital soil mapping and subsequent earth system modelling that use the present set of soil data. DATA SET DESCRIPTION: The 'WoSIS 2023 snapshot' comprises data for 228k profiles from 217k geo-referenced sites that originate from 174 countries. The profiles represent over 900k soil layers (or horizons) and over 6 million records. The actual number of measurements for each property varies (greatly) between profiles and with depth, this generally depending on the objectives of the initial soil sampling programmes. The data are provided in TSV (tab separated values) format and as GeoPackage. The zip-file (446 Mb) contains the following files: - Readme_WoSIS_202312_v2.pdf: Provides a short description of the dataset, file structure, column names, units and category values (this file is also available directly under 'online resources'). The pdf includes links to tutorials for downloading the TSV files into R respectively Excel. See also 'HOW TO READ TSV FILES INTO R AND PYTHON' in the next section. - wosis_202312_observations.tsv: This file lists the four to six letter codes for each observation, whether the observation is for a site/profile or layer (horizon), the unit of measurement and the number of profiles respectively layers represented in the snapshot. It also provides an estimate for the inferred accuracy for the laboratory measurements. - wosis_202312_sites.tsv: This file characterizes the site location where profiles were sampled. - wosis_2023112_profiles: Presents the unique profile ID (i.e. primary key), site_id, source of the data, country ISO code and name, positional uncertainty, latitude and longitude (WGS 1984), maximum depth of soil described and sampled, as well as information on the soil classification system and edition. Depending on the soil classification system used, the number of fields will vary . - wosis_202312_layers: This file characterises the layers (or horizons) per profile, and lists their upper and lower depths (cm). - wosis_202312_xxxx.tsv : This type of file presents results for each observation (e.g. “xxxx” = “BDFIOD” ), as defined under “code” in file wosis_202312_observation.tsv. (e.g. wosis_202311_bdfiod.tsv). - wosis_202312.gpkg: Contains the above datafiles in GeoPackage format (which stores the files within an SQLite database). HOW TO READ TSV FILES INTO R AND PYTHON: A) To read the data in R, please uncompress the ZIP file and specify the uncompressed folder. setwd("/YourFolder/WoSIS_2023_December/") ## For example: setwd('D:/WoSIS_2023_December/') Then use read_tsv to read the TSV files, specifying the data types for each column (c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time). observations = readr::read_tsv('wosis_202312_observations.tsv', col_types='cccciid') observations ## show columns and first 10 rows sites = readr::read_tsv('wosis_202312_sites.tsv', col_types='iddcccc') sites profiles = readr::read_tsv('wosis_202312_profiles.tsv', col_types='icciccddcccccciccccicccci') profiles layers = readr::read_tsv('wosis_202312_layers.tsv', col_types='iiciciiilcc') layers ## Do this for each observation 'XXXX', e.g. file 'Wosis_202312_orgc.tsv': orgc = readr::read_tsv('wosis_202312_orgc.tsv', col_types='iicciilccdccddccccc') orgc Note: One may also use the following R code (example is for file 'observations.tsv'): observations <- read.table("wosis_202312_observations.tsv", sep = "\t", header = TRUE, quote = "", comment.char = "", stringsAsFactors = FALSE ) B) To read the files into python first decompress the files to your selected folder. Then in python: # import the required library import pandas as pd # Read the observations data observations = pd.read_csv("wosis_202312_observations.tsv", sep="\t") # print the data frame header and some rows observations.head() # Read the sites data sites = pd.read_csv("wosis_202312_sites.tsv", sep="\t") # Read the profiles data profiles = pd.read_csv("wosis_202312_profiles.tsv", sep="\t") # Read the layers data layers = pd.read_csv("wosis_202312_layers.tsv", sep="\t") # Read the soil property data, e.g. 'cfvo' (do this for each observation) cfvo = pd.read_csv("wosis_202312_cfvo.tsv", sep="\t") CITATION: Calisto, L., de Sousa, L.M., Batjes, N.H., 2023. Standardised soil profile data for the world (WoSIS snapshot – December 2023), https://doi.org/10.17027/isric-wdcsoils-20231130 Supplement to: Batjes N.H., Calisto, L. and de Sousa L.M., 2023. Providing quality-assessed and standardised soil data to support global mapping and modelling (WoSIS snapshot 2023). Earth System Science Data, https://doi.org/10.5194/essd-16-4735-2024.

  20. o

    Jacob Kaplan's Concatenated Files: Uniform Crime Reporting Program Data: Law...

    • openicpsr.org
    Updated Mar 25, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting Program Data: Law Enforcement Officers Killed and Assaulted (LEOKA) 1960-2024 [Dataset]. http://doi.org/10.3886/E102180V15
    Explore at:
    Dataset updated
    Mar 25, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1960 - 2024
    Area covered
    United States
    Description

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.comVersion 15 release notes:Adds .parquet file formatVersion 14 release notes:Adds 2023 and 2024 dataVersion 13 release notes:Adds 2022 dataVersion 12 release notes:Adds 2021 data.Version 11 release notes:Adds 2020 data. Please note that the FBI has retired UCR data ending in 2020 data so this will (probably, I haven't seen confirmation either way) be the last LEOKA data they release. Changes .rda file to .rds.Version 10 release notes:Changes release notes description, does not change data.Version 9 release notes:Adds data for 2019.Version 8 release notes:Fix bug for years 1960-1971 where the number of months reported variable was incorrectly down by 1 month. I recommend caution when using these years as they only report either 0 or 12 months of the year, which differs from every other year in the data. Added the variable officers_killed_total which is the sum of officers_killed_by_felony and officers_killed_by_accident.Version 7 release notes:Adds data from 2018Version 6 release notes:Adds data in the following formats: SPSS and Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 5 release notes: Adds data for 1960-1974 and 2017. Note: many columns (including number of female officers) will always have a value of 0 for years prior to 1971. This is because those variables weren't collected prior to 1971. These should be NA, not 0 but I'm keeping it as 0 to be consistent with the raw data. Removes support for .csv and .sav files.Adds a number_of_months_reported variable for each agency-year. A month is considered reported if the month_indicator column for that month has a value of "normal update" or "reported, not data."The formatting of the monthly data has changed from wide to long. This means that each agency-month has a single row. The old data had each agency being a single row with each month-category (e.g. jan_officers_killed_by_felony) being a column. Now there will just be a single column for each category (e.g. officers_killed_by_felony) and the month can be identified in the month column. This also results in most column names changing. As such, be careful when aggregating the monthly data since some variables are the same every month (e.g. number of officers employed is measured annually) so aggregating will be 12 times as high as the real value for those variables. Adds a date column. This date column is always set to the first of the month. It is NOT the date that a crime occurred or was reported. It is only there to make it easier to create time-series graphs that require a date input.All the data in this version was acquired from the FBI as text/DAT files and read into R using the package asciiSetupReader. The FBI also provided a PDF file explaining how to create the setup file to read the data. Both the FBI's PDF and the setup file I made are included in the zip files. Data is the same as from NACJD but using all FBI files makes cleaning easier as all column names are already identical. Version 4 release notes: Add data for 2016.Order rows by year (descending) and ORI.Version 3 release notes: Fix bug where Philadelphia Police Department had incorrect FIPS county code. The LEOKA data sets contain highly detailed data about the number of officers/civilians employed by an agency and how many officers were killed or assaulted. All the data was acquired from the FBI as text/DAT files and read into R using the package asciiSetupReader. The FBI also provided a PDF file explaining how to create the setup file to read the data. Both the FBI's PDF and the setup file I made are included in the zip files. About 7% of all agencies in the data report more officers or civilians than population. As such, I removed the officers/civilians per 1,000 population variables. You should exercise caution if deciding to generate and use these variables yourself. Several agency had impossible large (>15) officer deaths in a single month. For those months I changed the value to NA. The UCR Handbook (https://ucr.fbi.gov/additional-ucr-publications/ucr_handbook.pdf/view) describes the LEOKA data as follows:"The UCR Program collects data from all contributing agencies ... on officer line-of-duty deaths and assaults. Reporting agencies must submit data on ... their own duly sworn officers f

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
Organization logo

Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research

Explore at:
txtAvailable download formats
Dataset updated
Dec 4, 2023
Dataset provided by
Figsharehttp://figshare.com/
Authors
Kingsley Okoye; Samira Hosseini
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.

Search
Clear search
Close search
Google apps
Main menu