100+ datasets found
  1. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    zip
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    HIV Prevention Trials Network (http://www.hptn.org/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR. The use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing, producing a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline, or the CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer). The pipeline uses it to further demultiplex the reads and to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file of the same name, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.

    Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
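
The comparison of matched sUMI and dUMI consensus sequences described above can be illustrated with a short sketch. This is not the authors' compare_seqs.py, whose contents are not shown here. It assumes that both FASTA files use the same record ID for the same UMI family, and the function names `parse_fasta` and `find_discordant` and the family IDs are hypothetical.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {record_id: sequence}."""
    records: Dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record ID.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line.upper()
    return records


def find_discordant(sumi: Dict[str, str], dumi: Dict[str, str]) -> List[str]:
    """Return IDs present in both collections whose sequences differ."""
    shared = sorted(set(sumi) & set(dumi))
    return [seq_id for seq_id in shared if sumi[seq_id] != dumi[seq_id]]


# Toy input with hypothetical family IDs; real input would be read from the two fasta files.
sumi_records = parse_fasta(">fam1\nACGT\n>fam2\nACGT\n>fam3\nTTTT\n")
dumi_records = parse_fasta(">fam1\nACGT\n>fam2\nACGA\n")
discordant = find_discordant(sumi_records, dumi_records)
print(discordant)
```

Families found only in one collection (fam3 here) are ignored, since only matched pairs can be compared.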

  2. Data from: Ecosystem-Level Determinants of Sustained Activity in Open-Source...

    • zenodo.org
    application/gzip, bin +2
    Updated Aug 2, 2024
    Cite
    Marat Valiev; Bogdan Vasilescu; James Herbsleb (2024). Ecosystem-Level Determinants of Sustained Activity in Open-Source Projects: A Case Study of the PyPI Ecosystem [Dataset]. http://doi.org/10.5281/zenodo.1419788
    Available download formats: bin, application/gzip, zip, text/x-python
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Marat Valiev; Bogdan Vasilescu; James Herbsleb
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Description
    Replication pack, FSE2018 submission #164:
    ------------------------------------------
    
    **Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: 
    A Case Study of the PyPI Ecosystem
    
    **Note:** link to data artifacts is already included in the paper. 
    Link to the code will be included in the Camera Ready version as well.
    
    
    Content description
    ===================
    
    - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files 
     described below
    - **settings.py** - settings template for the code archive.
    - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset.
     This dataset only includes stats aggregated by the ecosystem (PyPI)
    - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level
     statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages
     themselves, which take around 2TB.
    - **build_model.r, helpers.r** - R files to process the survival data 
      (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, 
      `common.cache/survival_data.pypi_2008_2017-12_6.csv` in 
      **dataset_full_Jan_2018.tgz**)
    - **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
    - LICENSE - text of GPL v3, under which this dataset is published
    - INSTALL.md - replication guide (~2 pages)
    
    Replication guide
    =================
    
    Step 0 - prerequisites
    ----------------------
    
    - Unix-compatible OS (Linux or OS X)
    - Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
    - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)
    
    Depending on the level of detail (see Step 2 for more details):
    - up to 2Tb of disk space (see the Step 2 options)
    - at least 16Gb of RAM (64 preferable)
    - a few hours to a few months of processing time
    
    Step 1 - software
    ----------------
    
    - unpack **ghd-0.1.0.zip**, or clone from gitlab:
    
       git clone https://gitlab.com/user2589/ghd.git
       cd ghd
       git checkout 0.1.0
     
     If you unpacked the archive instead, `cd` into the extracted folder. 
     All commands below assume it as the current directory.
      
    - copy `settings.py` into the extracted folder. Edit the file:
      * set `DATASET_PATH` to some newly created folder path
      * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` 
    - install docker. For Ubuntu Linux, the command is 
      `sudo apt-get install docker-compose`
    - install libarchive and headers: `sudo apt-get install libarchive-dev`
    - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`
     Without this dependency, you might get an error on the next step, 
     but it's safe to ignore.
    - install Python libraries: `pip install --user -r requirements.txt` . 
    - disable all APIs except GitHub (Bitbucket and Gitlab support were
     not yet implemented when this study was in progress): edit
     `scraper/__init__.py`, comment out everything except GitHub support
     in `PROVIDERS`.
    
    Step 2 - obtaining the dataset
    -----------------------------
    
    The ultimate goal of this step is to get output of the Python function 
    `common.utils.survival_data()` and save it into a CSV file:
    
      # copy and paste into a Python console
      from common import utils
      survival_data = utils.survival_data('pypi', '2008', smoothing=6)
      survival_data.to_csv('survival_data.csv')
    
    Since full replication will take several months, here are some ways to speed up
    the process:
    
    #### Option 2.a, difficulty level: easiest
    
    Just use the precomputed data. Step 1 is not necessary under this scenario.
    
    - extract **dataset_minimal_Jan_2018.zip**
    - get `survival_data.csv`, go to the next step
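
For readers who prefer to script Option 2.a, the following is a minimal sketch of pulling a single file out of a zip archive with Python's standard `zipfile` module. The helper name `extract_by_name` is hypothetical, and the location of `survival_data.csv` inside the archive is not known from this description, so the sketch searches members by base name. The demo builds a throwaway archive instead of touching the real dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_by_name(archive_path, file_name, dest_dir):
    """Extract the first archive member whose base name equals file_name.

    Returns the path of the extracted file; raises FileNotFoundError otherwise.
    """
    with zipfile.ZipFile(archive_path) as archive:
        for member in archive.namelist():
            if Path(member).name == file_name:
                return Path(archive.extract(member, path=dest_dir))
    raise FileNotFoundError(f"{file_name} not found in {archive_path}")


# Demo on a throwaway archive (the real call would point at
# dataset_minimal_Jan_2018.zip; the nested folder below is made up).
work_dir = Path(tempfile.mkdtemp())
demo_archive = work_dir / "demo.zip"
with zipfile.ZipFile(demo_archive, "w") as archive:
    archive.writestr("some_folder/survival_data.csv", "col_a,col_b\n1,2\n")

extracted = extract_by_name(demo_archive, "survival_data.csv", work_dir)
print(extracted.name)
```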
    
    #### Option 2.b, difficulty level: easy
    
    Use precomputed longitudinal feature values to build the final table.
    The whole process will take 15 to 30 minutes.
    
    - create a folder `
  3. Friends - R Package Dataset

    • kaggle.com
    zip
    Updated Nov 11, 2024
    Cite
    Lucas Yukio Imafuko (2024). Friends - R Package Dataset [Dataset]. https://www.kaggle.com/datasets/lucasyukioimafuko/friends-r-package-dataset
    Available download formats: zip (2018791 bytes)
    Dataset updated
    Nov 11, 2024
    Authors
    Lucas Yukio Imafuko
    Description

    The whole data and source can be found at https://emilhvitfeldt.github.io/friends/

    "The goal of friends to provide the complete script transcription of the Friends sitcom. The data originates from the Character Mining repository which includes references to scientific explorations using this data. This package simply provides the data in tibble format instead of json files."

    Content

    • friends.csv - Contains the scenes and lines for each character, including season and episodes.
    • friends_emotions.csv - Contains sentiments for each scene - for the first four seasons only.
    • friends_info.csv - Contains information regarding each episode, such as imdb_rating, views, episode title and directors.

    Uses

    • Text mining, sentiment analysis and word statistics.
    • Data visualizations.
  4. NYC STEW-MAP Staten Island organizations' website hyperlink webscrape

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 21, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). NYC STEW-MAP Staten Island organizations' website hyperlink webscrape [Dataset]. https://catalog.data.gov/dataset/nyc-stew-map-staten-island-organizations-website-hyperlink-webscrape
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    New York, Staten Island
    Description

    The data represent web-scraping of hyperlinks from a selection of environmental stewardship organizations that were identified in the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017). There are two data sets: 1) the original scrape containing all hyperlinks within the websites and associated attribute values (see "README" file); 2) a cleaned and reduced dataset formatted for network analysis.

    For dataset 1: Organizations were selected from the 2017 NYC Stewardship Mapping and Assessment Project (STEW-MAP) (USDA 2017), a publicly available, spatial data set about environmental stewardship organizations working in New York City, USA (N = 719). To create a smaller and more manageable sample to analyze, all organizations that intersected (i.e., worked entirely within or overlapped) the NYC borough of Staten Island were selected for a geographically bounded sample. Only organizations with working websites that the web scraper could access were retained for the study (n = 78). The websites were scraped between 09 and 17 June 2020 to a maximum search depth of ten using the snaWeb package (version 1.0.1, Stockton 2020) in the R computational language environment (R Core Team 2020).

    For dataset 2: The complete scrape results were cleaned, reduced, and formatted as a standard edge-array (node1, node2, edge attribute) for network analysis. See "README" file for further details.

    References:
    R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/. Version 4.0.3.
    Stockton, T. (2020). snaWeb Package: An R package for finding and building social networks for a website, version 1.0.1.
    USDA Forest Service. (2017). Stewardship Mapping and Assessment Project (STEW-MAP). New York City Data Set. Available online at https://www.nrs.fs.fed.us/STEW-MAP/data/.

    This dataset is associated with the following publication: Sayles, J., R. Furey, and M. Ten Brink. How deep to dig: effects of web-scraping search depth on hyperlink network analysis of environmental stewardship organizations. Applied Network Science. Springer Nature, New York, NY, 7: 36, (2022).
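
To make the edge-array layout (node1, node2, edge attribute) concrete, here is a small sketch that turns such rows into an adjacency map and counts incoming links per node. It assumes the edges are directed from node1 to node2, which fits hyperlinks but is not confirmed by the description. The domain names and attribute values are made up, and `build_adjacency` and `in_degree` are hypothetical helper names.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (node1, node2, attribute) rows into a directed adjacency map."""
    adjacency = defaultdict(dict)
    for source, target, attribute in edges:
        adjacency[source][target] = attribute
    return dict(adjacency)


def in_degree(adjacency):
    """Count how many distinct nodes link to each target node."""
    counts = defaultdict(int)
    for targets in adjacency.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)


# Made-up rows; the attribute column stands in for whatever the dataset records per edge.
edges = [
    ("org-a.example", "org-b.example", 3),
    ("org-a.example", "org-c.example", 1),
    ("org-b.example", "org-c.example", 2),
]
adjacency = build_adjacency(edges)
incoming = in_degree(adjacency)
print(incoming)
```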

  5. Health and Retirement Study (HRS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the health and retirement study (hrs) with r

    the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original.

    figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need them for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle.

    but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

    the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

    this new github repository contains five scripts:

    1992 - 2010 download HRS microdata.R
    • loop through every year and every file, download, then unzip everything in one big party

    import longitudinal RAND contributed files.R
    • create a SQLite database (.db) on the local disk
    • load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)

    longitudinal RAND - analysis examples.R
    • connect to the sql database created by the 'import longitudinal RAND contributed files' program
    • create two database-backed complex sample survey objects, using a taylor-series linearization design
    • perform a mountain of analysis examples with wave weights from two different points in the panel

    import example HRS file.R
    • load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html)
    • parse through the IF block at the bottom of the sas importation script, blank out a number of variables
    • save the file as an R data file (.rda) for fast loading later

    replicate 2002 regression.R
    • connect to the sql database created by the 'import longitudinal RAND contributed files' program
    • create a database-backed complex sample survey object, using a taylor-series linearization design
    • exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

    click here to view these five scripts

    for more detail about the health and retirement study (hrs), visit:
    • michigan's hrs homepage
    • rand's hrs homepage
    • the hrs wikipedia page
    • a running list of publications using hrs

    notes: exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

    confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
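
the "load into the database in chunks (to prevent overloading ram)" idea can be sketched outside of r as well. the python below is only an illustration of the technique using the standard `sqlite3` module, not a translation of the repository's scripts. the table name, column names, and generated rows are all made up, and `load_in_chunks` is a hypothetical helper.

```python
import sqlite3
from itertools import islice


def load_in_chunks(connection, table, columns, rows, chunk_size=1000):
    """Insert rows into table in batches so the full input never sits in memory.

    table and columns are interpolated into SQL, so they must come from trusted code.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f"CREATE TABLE IF NOT EXISTS {table} ({column_list})")
    insert_sql = f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"
    iterator = iter(rows)
    total = 0
    while True:
        # Pull at most chunk_size rows from the (possibly lazy) source.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        connection.executemany(insert_sql, chunk)
        connection.commit()
        total += len(chunk)
    return total


# Demo with an in-memory database and a generator standing in for a large file.
connection = sqlite3.connect(":memory:")
demo_rows = ((person_id, 1992 + 2 * (person_id % 3)) for person_id in range(2500))
inserted = load_in_chunks(connection, "demo_panel", ["person_id", "wave_year"], demo_rows)
row_count = connection.execute("SELECT COUNT(*) FROM demo_panel").fetchone()[0]
print(inserted, row_count)
```

because the rows come from a generator and are consumed with `islice`, only one chunk is held in memory at a time.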

  6. Data from: Decision-Support Framework for Linking Regional-Scale Management...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 20, 2025
    Cite
    U.S. Geological Survey (2025). Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species [Dataset]. https://catalog.data.gov/dataset/decision-support-framework-for-linking-regional-scale-management-actions-to-continental-sc
    Dataset updated
    Nov 20, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This data release presents the data, JAGS models, and R code used to manipulate data and to produce results and figures presented in the USGS Open File Report, "Decision-Support Framework for Linking Regional-Scale Management Actions to Continental-Scale Conservation of Wide-Ranging Species" (https://doi.org/10.5066/P93YTR3X). The zip folder is provided so that others can reproduce results from the integrated population model, inspect model structure and posterior simulations, conduct analyses not present in the report, and use and modify the code. Raw source data can be sourced from the USGS Bird Banding Laboratory, USFWS Surveys and Monitoring Branch, National Oceanic and Atmospheric Administration, and Ducks Unlimited Canada.

    The zip file contains the following objects when extracted:

    * Readme.txt: A plain text file describing each file in this directory.
    * Figures-Pintail-IPM.r: R code that generates report figures in png, pdf, and eps format. Generates Figures 2-11 and calls source code for Figures 12 and 13 found in other files.
    * get pintail IPM data.r: R source code that must be run to format data for the IPM code file.
    * getbandrecovs.r: R code that takes Bird Banding Lab data for pintail band releases and recoveries and formats it for analysis. This file is called by 'get pintail IPM data.r'. The file was originally written by Scott Boomer (USFWS) and modified by Erik Osnas for use with the IPM.
    * Model_1_post.txt: Text representation of the posterior simulations from Model 1. This file can be read by the R function dget() to produce an R list object that contains posterior draws from Model 1. The list is the BUGSoutput$sims.list object from a call to rjags::jags.
    * Model_2_post.txt: As above but for Model 2.
    * Model_S1_post.txt: As above but for Model S1.
    * Pintail IPM.r: This is the main file that defines the IPM models in JAGS, structures the data for JAGS, defines initial values, and runs the models. Outputs are text files that contain the JAGS model files and R workspaces that contain all data, models, and results, including the output from the jags() function. From this the BUGSoutput$sims.list object was written to text for each model.
    * MSY_metrics.txt: Summary of results produced from running code in source_figure_12.R. This table is a text representation of a summary of the maximum sustained yield analysis at various mean rainfall levels, used for Table 1 of the report, and can be reproduced by running the code in source_figure_12.R. To understand the structure of this file, you must consult the code file and understand the structure of the R objects created from that code. Otherwise, consult Figure 12 and Table 1 in the report.
    * source_figure_12.R: R code to produce Figure 12. Code is written to work with the R workspace output from Model 1, but can be modified to use the Model_1_post.txt file without re-running the model. This would allow use of the same posterior realizations as used in the report.
    * source_figure_13.R: This is the code used to produce the results for Figure 13. Required here are the posterior from Model 1 and data for the Prairie Parkland Model based on Jim Devries/Ducks Unlimited data. These are described in the report text.
    * Data: A directory that contains the raw data used for this report.
    * Data/2015_LCC_Networks_shapefile: A directory that contains ESRI shapefiles used in Figure 1 and to define the boundaries of the Landscape Conservation Cooperatives. Found at https://www.sciencebase.gov/catalog/item/55b943ade4b09a3b01b65d78
    * Data/bndg_1430_yr1960up_DBISC_03042014.csv: A comma delimited file for banded pintail from 1960 to 2014. Obtained from the USGS Bird Banding Lab. This file is used by 'getbandrecovs.r' to produce an 'm-array' used in the Integrated Population Model (IPM). A data dictionary describing the codes for each field can be found at https://www.pwrc.usgs.gov/BBL/manual/summary.cfm
    * Data/cponds.csv: A comma delimited file of estimated Canadian ponds based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given are the year, point estimate, and estimated standard error.
    * Data/enc_1430_yr1960up_DBISC_03042014.csv: A comma delimited file for encounters of banded pintail. Obtained from the USGS Bird Banding Lab. This file is used by 'getbandrecovs.r' to produce an 'm-array' used in the Integrated Population Model (IPM). A data dictionary describing the codes for each field can be found at https://www.pwrc.usgs.gov/BBL/manual/enc.cfm
    * Data/nopiBPOP19552014.csv: A comma delimited file of estimated northern pintail based on counts from the North American Breeding Waterfowl and Habitat Survey, 1955-2014. Given are the year, pintail point estimate (bpop), pintail estimated standard error (bpopSE), mean latitude of the pintail population (lat), latitude variance of the pintail population (latVAR), mean longitude of the pintail population (lon), and the variance in longitude of the pintail population (lonVAR).
    * Data/Summary Climate Data California CV 2.csv: Rainfall data for the California Central Valley downloaded from the National Climate Data Center (www.ncdc.noaa.gov/cdo-web/) as described in the report text (https://doi.org/10.5066/P93YTR3X) and the publication found at https://doi.org/10.1002/jwmg.21124 . Used in 'get pintail IPM data.r' for the IPM.
    * Data/Summary data MAV.csv: Rainfall data for the Mississippi Alluvial Valley downloaded from the National Climate Data Center (www.ncdc.noaa.gov/cdo-web/) as described in the report text (https://doi.org/10.5066/P93YTR3X) and the publication found at https://doi.org/10.1002/jwmg.21124 . Used in 'get pintail IPM data.r' for the IPM.
    * Data/Wing data 1961 2011 NOPI.txt: Comma delimited text file for pintail wing age data for 1961 to 2011 from the Parts Collection Survey. Each row is an individual wing with sex cohorts 4 = male, 5 = female and age cohorts 1 = After Hatch Year and 2 = Hatch Year. Wt is a weighting factor that determines how many harvested pintails this wing represents. See USFWS documentation for the Parts Collection Survey for descriptions. Summing Wt for each age, sex, and year gives an estimate of the number of pintail harvested. Used in 'get pintail IPM data.r' for the IPM.
    * Data/Wing data 2012 2013 NOPI.csv: Same as 'Wing data 1961 2011 NOPI.txt' but for years 2012 and 2013.
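
The harvest estimate described for the wing data (summing Wt for each age, sex, and year) can be sketched as a grouped sum. This is an illustration only and not the code in 'get pintail IPM data.r'. Apart from Wt, the column names (Year, Sex, Age) are assumptions, since the actual header of the wing files is not given here, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def harvest_by_group(rows):
    """Sum the Wt weighting factor for each (year, sex, age) combination."""
    totals = defaultdict(float)
    for row in rows:
        key = (int(row["Year"]), int(row["Sex"]), int(row["Age"]))
        totals[key] += float(row["Wt"])
    return dict(totals)


# Invented sample rows; sex 4 = male, 5 = female; age 1 = After Hatch Year, 2 = Hatch Year.
sample = io.StringIO(
    "Year,Sex,Age,Wt\n"
    "1961,4,1,10.5\n"
    "1961,4,1,4.5\n"
    "1961,5,2,7.0\n"
)
totals = harvest_by_group(csv.DictReader(sample))
print(totals)
```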

  7. Data Mining Project - Boston

    • kaggle.com
    zip
    Updated Nov 25, 2019
    Cite
    SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston
    Available download formats: zip (59313797 bytes)
    Dataset updated
    Nov 25, 2019
    Authors
    SophieLiu
    Area covered
    Boston
    Description

    Context

    To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.

    Use of Data Files

    You can easily subset the data into the car types that you will be modeling by first loading the csv into R. Here is the code:

    This loads the file into R

    uber_df <- read.csv('uber.csv')

    The next code subsets the data into specific car types. The example below keeps only Uber 'Black' car types.

    df_black <- subset(uber_df, name == 'Black')

    Next, write this dataframe to a csv file on your computer so that it can be loaded into R again later.

    write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

    The file will appear in your working directory. If you are not familiar with your working directory, run this code:

    getwd()

    The output will be the file path to your working directory. You will find the file you just created in that folder.


  8. Data from: A dataset to model Levantine landcover and land-use change...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Dec 16, 2023
    Cite
    Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
    Available download formats: zip
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Kempf
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Dec 16, 2023
    Area covered
    Levant
    Description

    Overview

    This dataset is the repository for the following paper submitted to Data in Brief:

    Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

    The Data in Brief article contains the supplement information and is the related data paper to:

    Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

    Description/abstract

    The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which have strained neighbouring countries like Jordan due to the influx of Syrian refugees and have increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

    Folder structure

    The main folder after download contains all data, in which the following subfolders are stored as zipped files:

    “code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

    “MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

    “mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

    “yield_productivity” contains .csv files of yield information for all countries listed above.

    “population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

    “GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

    “built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

    Code structure

    1_MODIS_NDVI_hdf_file_extraction.R


    This is the first code chunk, which extracts MODIS data from the .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download the MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.


    2_MERGE_MODIS_tiles.R


    In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, from which we merge the first two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
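
    The reason file order matters can be sketched in base R. The snippet below is not the repository code: it is a stand-in for the numeric ordering that the text obtains through gtools, and the file names are hypothetical examples of the NDVI_final_*consecutivenumber*.tif pattern.

```r
# Hypothetical file names following the NDVI_final_<number>.tif pattern.
files <- c("NDVI_final_10.tif", "NDVI_final_2.tif",
           "NDVI_final_1.tif", "NDVI_final_21.tif")

# Plain alphabetical sorting places 10 before 2, which would scramble a time series.
print(sort(files))

# Extract the running number and order the files by its numeric value instead.
idx <- as.integer(sub("^NDVI_final_([0-9]+)\\.tif$", "\\1", files))
files_ordered <- files[order(idx)]
print(files_ordered)
```

    With the files in numeric order, each layer of the stack corresponds to the correct 16-day time step.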


    3_CROP_MODIS_merged_tiles.R


    Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
    The repository provides the already clipped and merged NDVI datasets.


    4_TREND_analysis_NDVI.R


    Now, we want to perform trend analysis on the derived data. The data we load is tricky, as it contains images at a 16-day return period across each year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values that are significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
    To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
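
    The slope, p-value, and z-score steps can be illustrated on a single synthetic series. This is only a sketch under stated assumptions: the values are simulated, the slope is estimated by ordinary least squares with lm(), and the repository script (which works on raster stacks) may use a different estimator.

```r
set.seed(1)

# Synthetic annual growing-season sums for 22 years (values are made up).
years      <- 2001:2022
season_sum <- 3 + 0.02 * (years - 2001) + rnorm(length(years), sd = 0.05)

# Linear trend: slope per year and its p-value.
fit     <- lm(season_sum ~ years)
slope   <- unname(coef(fit)["years"])
p_value <- summary(fit)$coefficients["years", "Pr(>|t|)"]

# Flag the trend if it is significant at the 0.05 level.
significant <- p_value < 0.05

# z-scores: deviation of each value from the mean, in units of standard deviation.
z <- (season_sum - mean(season_sum)) / sd(season_sum)

print(c(slope = slope, p_value = p_value))
print(round(z, 2))
```

    Because z-scores are unitless, the same transformation makes NDVI and climate variables comparable on one axis.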


    5_BUILT_UP_change_raster.R


    Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3rd of March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage of interest and, after cropping to the individual study area, reclassify it using the code. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.


    6_POPULATION_numbers_plot.R


    For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.


    7_YIELD_plot.R


    In this section, we are using the country productivity data from the supplement in the repository folder “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.


    8_GLDAS_read_extract_trend


    The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to the variable of interest. To find the variable names, see the abbreviations at https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary (last accessed 9th of October 2023), use the respective code chunk for reading a .nc file with the ncdf4 package in R, run print(nc) from the code, or use names(the spatraster collection).
    Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
    From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
    From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
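
    The difference between annual sums and annual means, and the handling of a DJF season that spans two calendar years, can be sketched on a plain monthly vector. The snippet below is an illustration in base R, not the repository code, and the monthly values are made up.

```r
# Synthetic monthly series, Jan 2020 to Dec 2022 (values are made up).
dates <- seq(as.Date("2020-01-01"), as.Date("2022-12-01"), by = "month")
value <- rep(c(30, 25, 20, 10, 5, 0, 0, 0, 5, 15, 25, 35), times = 3)

month <- as.integer(format(dates, "%m"))
year  <- as.integer(format(dates, "%Y"))

# Annual sums (suitable for rainfall) versus annual means (sum divided by 12).
annual_sum  <- tapply(value, year, sum)
annual_mean <- annual_sum / 12

# Season label for each month.
season <- c("DJF", "DJF", "MAM", "MAM", "MAM", "JJA",
            "JJA", "JJA", "SON", "SON", "SON", "DJF")[month]

# Jan/Feb are counted towards the DJF season that started in the previous December.
season_year <- ifelse(month <= 2, year - 1, year)

# Seasonal sums; seasons with fewer than 3 months in the record are set to NA.
season_sum <- tapply(value, list(season_year, season), sum)
n_months   <- tapply(value, list(season_year, season), length)
season_sum[is.na(n_months) | n_months < 3] <- NA

print(annual_sum)
print(season_sum)
```

    Incomplete seasons at the edges of the record (here the first and last DJF) are dropped rather than summed from fewer months, which would bias the trend.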

  9. Life Expectancy WHO

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
    Explore at:
    Available download formats: zip (121472 bytes)
    Dataset updated
    Jun 19, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose.

    Steps involved:

    • Read the csv file.
    • Data cleaning: the variables Country and Status were showing as character data types and had to be converted to factors. 2563 missing values were encountered, with the Population variable having the most missing values, i.e. 652. Rows with missing values were dropped before we could run the analysis.

    We use LINEAR REGRESSION for the analysis.

    • Before running linear regression, 3 variables were dropped as they were not found to have much of an effect on the dependent variable, i.e. Life Expectancy. These 3 variables were Country, Year and Status. This meant we were now working with 19 variables (1 dependent and 18 independent variables).
    • We run the linear regression. Multiple R squared is 83%, which means that the independent variables can explain 83% of the variance in the dependent variable.
    • OUTLIER DETECTION. We check for outliers using IQR and find 54 outliers. These outliers are then removed before we run the regression analysis once again. Multiple R squared increased from 83% to 86%.
    • MULTICOLLINEARITY. We check for multicollinearity using the VIF (Variance Inflation Factor). This is done in case two or more independent variables show high correlation. The rule of thumb is that variables with a VIF value above 5 should be removed. We find 6 variables that have a VIF value higher than 5, namely Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19 and thinness5.9. Infant deaths and Under Five deaths have strong collinearity, so we drop Infant deaths (which has the higher VIF value).
    • When we run the linear regression model again, the VIF value of Under.Five.Deaths goes down from 211.46 to 2.74, while the VIF values of the other variables change very little. Variable thinness1.19 is now dropped and we run the regression once more.
    • Variable thinness5.9, whose VIF value was 7.61, has now dropped to 1.95. GDP and Population still have VIF values of more than 5, but I decided against dropping these as I consider them to be important independent variables.
    • SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We fit the model on the train data and get a multiple R squared of 86% and a p-value less than alpha, which indicates that the model is statistically significant. We use the model fitted on the train data to predict the test data and find the RMSE and MAPE. We use library(Metrics) for this purpose.
    • In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on average, the predicted values have an error of 3.2 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037).
    • MAE (Mean Absolute Error) is 2.55. This indicates that on average, the predicted values deviate by approximately 2.83 years from the actual values.
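
    The multicollinearity check can be sketched in base R without any extra package. The snippet below is an illustration on simulated predictors, not the analysis code for this dataset: it computes the VIF from its definition, 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The variable names x1, x2, x3 and the helper vif_all are hypothetical.

```r
set.seed(42)

# Simulated predictors: x2 is almost a copy of x1, so the two are strongly collinear.
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF for every column: regress it on the remaining columns and use 1 / (1 - R^2).
vif_all <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

vif_before <- vif_all(X)
print(round(vif_before, 2))

# Dropping one variable of the collinear pair brings the other one back below 5.
vif_after <- vif_all(X[, c("x2", "x3")])
print(round(vif_after, 2))
```

    This mirrors the pattern described above, where removing one of two collinear variables sharply lowers the VIF of the other.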

    We use DECISION TREE MODEL for the analysis.

    • Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).
    • We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.
    • We use 5 fold cross validation method with CP (complexity parameter) being 0.01.
    • In Decision Tree, RMSE (Root Mean Squared Error) is 3.06. This indicates that on average, the predicted values have an error of 3.06 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).
    • MAE (Mean Absolute Error) is 2.35. This indicates that on an average, the predicted values deviate by approximately 2.35 years from the actual values.

    We use RANDOM FOREST for the analysis.

    • Run library(randomForest)
    • We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.
    • Predict Life expectancy through random forest model.
    • In Random Forest, RMSE (Root Mean Squared Error) is 1.73. This indicates that on average, the predicted values have an error of 1.73 years as compared to the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).
    • MAE (Mean Absolute Error) is 1.14. This indicates that on an average, the predicted values deviate by approximately 1.14 years from the actual values.

    Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
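
    For reference, the three error measures used to compare the models can be written out in a few lines of base R. The numbers below are made up for illustration and are not taken from this dataset; the Metrics package mentioned above provides ready-made functions for the same quantities.

```r
# Made-up actual and predicted life expectancy values (years).
actual    <- c(70, 65, 80, 75)
predicted <- c(68, 66, 83, 75)

# Root Mean Squared Error: penalises large errors more strongly.
rmse <- sqrt(mean((actual - predicted)^2))

# Mean Absolute Error: average size of the error, in years.
mae <- mean(abs(actual - predicted))

# Mean Absolute Percentage Error, expressed as a fraction of the actual value.
mape <- mean(abs((actual - predicted) / actual))

# The "accuracy" quoted in the text is 1 - MAPE.
accuracy <- 1 - mape

print(c(rmse = rmse, mae = mae, mape = mape, accuracy = accuracy))
```

    All three measures are computed on the test data, so a lower value means better out-of-sample predictions.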

  10. Data from: Benzoxazinoids in roots and shoots of cereal rye (Secale cereale)...

    • agdatacommons.nal.usda.gov
    • s.cnmilf.com
    • +1more
    application/csv
    Updated Nov 21, 2025
    Cite
    Clifford P. Rice; Briana A. Otte; Matthew Kramer; Harry H. Schomberg; Steven B. Mirsky; Katherine L. Tully (2025). Data from: Benzoxazinoids in roots and shoots of cereal rye (Secale cereale) and their fates in soil after cover crop termination [Dataset]. http://doi.org/10.15482/USDA.ADC/1526330
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Clifford P. Rice; Briana A. Otte; Matthew Kramer; Harry H. Schomberg; Steven B. Mirsky; Katherine L. Tully
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Cover crops provide many agroecosystem services, including weed suppression, which is partially exerted through release of allelopathic benzoxazinoid (BX) compounds. This research characterizes (1) changes in concentrations of BX compounds in shoots, roots, and soil at three growth stages (GS) of cereal rye (Secale cereale L.), and (2) their degradation over time following termination. Concentrations of shoot dominant BX compounds, DIBOA-glc and DIBOA, were least at GS 83 (boot). The root dominant BX compound, HMBOA-glc, concentration was least at GS 54 (elongation). Rhizosphere soil BX concentrations were 1000 times smaller than in root tissues. Dominant compounds in soil were HMBOA-glc and HMBOA. Concentrations of BX compounds were similar for soil near root crowns and between-rows. Soil BX concentrations following cereal rye termination declined exponentially over time in three of four treatments: incorporated shoots (S) and roots (R), no-till S+R (cereal rye rolled flat), and no-till R (shoots removed), but not in no-till S. On the day following cereal rye termination, soil concentrations of HMBOA-glc and HMBOA in these three treatments increased above initial concentrations. Concentrations of these two compounds decreased the fastest while DIBOA-glc declined the slowest (half-life of 4 d in no-till S+R soil). Placement of shoots on the surface of an area where cereal rye had not grown (no-till S) did not increase soil concentrations of BX compounds. The short duration and complex dynamics of BX compounds in soil prior to and following termination illustrate the limited window for enhancing weed suppression by cereal rye allelochemicals; valuable information for programs breeding for enhanced weed suppression. In addition to the data analyzed for this article, we also include the R code.

    Resources in this dataset:

    • Resource Title: BX data following termination. File Name: FinalBXsForMatt-20200908.csv. Resource Description: For each sample, gives the time, depth, location, and plot treatment, and then the compound concentrations. This is the principal data set analyzed with the R code (anal2-cleaned.r); see that code for use.
    • Resource Title: BX compounds from 3rd sampling time before termination. File Name: soil2-20201123.csv. Resource Description: These data are for comparison with the post-termination data. They were taken at the 3rd sampling time (pre-termination), a day prior to termination. Each sample is identified with a treatment, date, and plot location, in addition to the BX concentrations. See the R code (anal2-cleaned.r) for how this file is used.
    • Resource Title: Soil location (within row versus between row) values of BX compounds. File Name: s2b.csv. Resource Description: Each row gives the average BX compound for each soil location (within row versus between row) for the second sample for each plot. These data are combined with bx3 (the data set read in from the file "FinalBXsForMatt-20200908.csv"). See the R code (anal2-cleaned.r) for use.
    • Resource Title: R code for analysis of the decay (post-termination) BX data. File Name: anal2-cleaned.r. Resource Description: This is the R code used to analyze the termination data. It also creates and writes out some data subsets (used for analysis and plots) that are later read in. Resource Software Recommended: R version 3.6.3, url: https://www.R-project.org/
    • Resource Title: Tissue BX compounds. File Name: tissues20210728b.csv. Resource Description: Data file holding results from a tissue analysis for BX compounds, in ug, from shoots and roots, and at various sampling times. Read into the R file anal1-cleaned.r, where it is used in a statistical analysis and to create figures.
    • Resource Title: BX compounds from soil with a live rye cover crop. File Name: soil2-20201214.csv. Resource Description: BX compounds (in ng/g dry wt), by treatment, sampling time, date, and plot ID. These data are read into the R program anal1-cleaned.r for analysis and to create figures. These are soil samples taken from locations with a live rye plant cover crop.
    • Resource Title: R code for BX analyses of soil under rye and plant tissues. File Name: anal1-cleaned.r. Resource Description: R code for analysis of the soil BX compounds under a live rye cover crop at different growing stages, and for the analysis of tissue BX compounds. In addition to statistical analyses, code in this file creates figures and some statistical output that is used to create a file that is later read in for figure creation (s2-CLD20220730-Stage.csv). Resource Software Recommended: R version 3.6.3, url: https://www.R-project.org/
    • Resource Title: Description of data files for anal2-cleaned.r. File Name: readme2.txt. Resource Description: Describes the input files used in the R code in anal2-cleaned.r, including descriptions and formats for each field. The file also describes some output (results) files that were uploaded to this site. This is a plain ASCII text file.
    • Resource Title: Estimates produced by anal2-cleaned.r from statistical modeling. File Name: Estimates20201110.csv. Resource Description: Estimates produced by anal2-cleaned.r from statistical modeling (see readme2.txt).
    • Resource Title: Summary statistics from anal2-cleaned.r. File Name: CV20210412.csv. Resource Description: Summary statistics from anal2-cleaned.r, used for plots.
    • Resource Title: Data summaries (same as CV20210412.csv), rescaled. File Name: RESCALE-20210412.csv. Resource Description: Same as "CV20210412.csv" except log of data have been rescaled to minimum at least zero and maximum one; see readme2.txt.
    • Resource Title: Statistical summaries for different stages. File Name: s2-CLD20220730-Stage.csv. Resource Description: Statistical summaries used for creating a figure (not used in paper), used in anal1-cleaned.r; data for soil BX under living rye.
    • Resource Title: Description of data files for anal1-cleaned.r. File Name: readme1.txt. Resource Description: Contains general descriptions of data imported into anal1-cleaned.r, and a description of each field. Also contains some descriptions of files output by anal1-cleaned.r, used to create tables or figures.
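
    The relation between an exponential decline and a half-life can be sketched as follows. This is not the model in anal2-cleaned.r: it is a minimal log-linear fit on noise-free synthetic concentrations, generated under the assumption of a 4-day half-life, to show how a half-life is read off a decay rate.

```r
# Synthetic concentrations decaying with a half-life of 4 days (values are made up).
days   <- c(0, 1, 2, 4, 8, 12)
k_true <- log(2) / 4
conc   <- 100 * exp(-k_true * days)

# Exponential decay is linear on the log scale: log(conc) = log(c0) - k * days.
fit <- lm(log(conc) ~ days)
k   <- -unname(coef(fit)["days"])

# Half-life: the time after which the concentration has halved.
half_life <- log(2) / k
print(half_life)
```

    With real measurements the fit would carry noise and uncertainty, so the readme files and the R code in the dataset remain the reference for how the published estimates were obtained.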

  11. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

    • datasearch.gesis.org
    • openicpsr.org
    Updated Feb 19, 2020
    Cite
    Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
    Explore at:
    Dataset updated
    Feb 19, 2020
    Dataset provided by
    da|ra (Registration agency for social science and economic data)
    Authors
    Kaplan, Jacob
    Description

    For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

    Version 3 release notes: Adds data in the following formats: Excel. Changes project name to avoid confusing this data for the ones done by NACJD.

    Version 2 release notes: Adds data for 2017. Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

    Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered are provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

    All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data, which I used to create the setup file to read in data.

    There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

    For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

    To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
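
    The replacement of suspicious values with NA can be sketched in base R. The snippet below is an illustration, not the cleaning script from the linked repository: the column names and the toy values are hypothetical, and only the list of replaced values and the "offenses"/"auto"/"value" column prefixes come from the description above.

```r
# Toy data with hypothetical column names using the prefixes named in the text.
ucr <- data.frame(
  offenses_robbery = c(3, 1000, 12),
  auto_stolen      = c(99942, 4, 20000),
  value_robbery    = c(1000, 250, 5000)
)

# The 20 values that the description lists as common data entry errors.
bad_values <- c(seq(1000, 10000, by = 1000),
                seq(20000, 100000, by = 10000),
                99942)

# Only columns starting with "offenses" or "auto" are cleaned; "value" columns are left alone.
target_cols <- grep("^(offenses|auto)", names(ucr), value = TRUE)
ucr[target_cols] <- lapply(ucr[target_cols],
                           function(x) replace(x, x %in% bad_values, NA))

print(ucr)
```

    Restricting the replacement to selected columns matters here, because a dollar value of exactly 1000 is plausible while an offense count of exactly 1000 usually is not.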

  12. Modeling data and data for figures and text

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Modeling data and data for figures and text [Dataset]. https://catalog.data.gov/dataset/modeling-data-and-data-for-figures-and-text
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    The data in this archive are in a zipped R data binary format (https://cran.r-project.org/doc/manuals/r-release/R-data.html). These data can be read using the free and open-source statistical software package R (https://www.r-project.org/). The data are organized following the figure numbering in the manuscript, e.g. Figure 1a is fig1a, and contain the same labeling as the figures, including units and variable names. For a full explanation of each figure, please see the captions in the manuscript.

    To open this data file, use the following command in R:

    load('JKelly_NH4NO3_JGR_2018.rdata')

    To list the contents of the file, use the following command in R:

    ls()

    The data for each figure are contained in the data object with the figure's name. To list the data, simply type the name of the figure returned from the ls() command.

    The original model output and emissions used for this study are located on the ASM archived storage at /asm/ROMO/finescale/sjv2013. These data are in NetCDF format with self-contained metadata and descriptive headers containing variable names, units, and simulation times.

    This dataset is associated with the following publication: Kelly, J., C. Parworth, Q. Zhang, D. Miller, K. Sun, M. Zondlo, K. Baker, A. Wisthaler, J. Nowak, S. Pusede, R. Cohen, A. Weinheimer, A. Beyersdorf, G. Tonnesen, J. Bash, L. Valin, J. Crawford, A. Fried, and J. Walega. Modeling NH4NO3 Over the San Joaquin Valley During the 2013 DISCOVER‐AQ Campaign. JOURNAL OF GEOPHYSICAL RESEARCH-ATMOSPHERES. American Geophysical Union, Washington, DC, USA, 123(9): 4727-4745, (2018).
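
    How load() and ls() behave can be tried without the archive itself. The sketch below creates a small stand-in .rdata file in a temporary folder; the object name fig1a follows the naming described above, but its columns and values are made up. Loading into a separate environment is an optional choice made here so that existing objects in the workspace are not overwritten.

```r
# Stand-in for one figure object; columns and values are made up.
fig1a <- data.frame(hour = 0:2, nh4no3 = c(1.2, 1.5, 1.1))

# Save it in R's binary format to a temporary file (hypothetical file name).
rdata_file <- file.path(tempdir(), "example_figures.rdata")
save(fig1a, file = rdata_file)

# Load into a fresh environment; load() returns the names of the restored objects.
fig_env <- new.env()
loaded  <- load(rdata_file, envir = fig_env)
print(loaded)

# List the contents and look at one figure's data.
print(ls(fig_env))
print(head(fig_env$fig1a))
```

    If the downloaded archive is a zip file, it may first need to be extracted, for example with unzip(), before load() can read the .rdata file inside it.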

  13. Cyclistic_data_visualization

    • kaggle.com
    Updated Jun 12, 2021
    Cite
    Mark Woychick (2021). Cyclistic_data_visualization [Dataset]. https://www.kaggle.com/markwoychick/cyclistic-data-visualization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mark Woychick
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I created these files and analysis as part of working on a case study for the Google Data Analyst certificate.

    Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know?: Knowing bike usage/behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.

    Content

    I used the script noted below to clean the files and then added some additional steps to create the visualizations to complete my analysis. The additional steps are noted in the corresponding R Markdown file for this data set.

    Acknowledgements

    Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html

    Data cleaning script: followed this script to clean and merge files https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy

    Note: Combined data set has 3,876,042 rows, so you will likely need to run R analysis on your computer (e.g., R Console) rather than in the cloud (e.g., RStudio Cloud)

    Inspiration

    This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.

    One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.

    I am also interested to see what others did with this analysis - what were the findings and insights you found?

  14. Data from: Reference transcriptomics of porcine peripheral immune cells...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +2more
    Updated Jun 5, 2025
    Cite
    Agricultural Research Service (2025). Data from: Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing [Dataset]. https://catalog.data.gov/dataset/data-from-reference-transcriptomics-of-porcine-peripheral-immune-cells-created-through-bul-e667c
    Explore at:
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).

    Resources in this dataset:

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zip. Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx), gene names (features.tsv.gz), cell IDs (barcodes.tsv.gz). *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csv. Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include:

      • nCount_RNA = the number of transcripts detected in a cell
      • nFeature_RNA = the number of genes detected in a cell
      • Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells
      • prcntMito = percent mitochondrial reads in a cell
      • Scrublet = doublet probability score assigned to a cell
      • seurat_clusters = cluster ID assigned to a cell
      • PaperIDs = sample ID for a cell
      • celltypes = cell type ID assigned to a cell

    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csv. Resource Description: .csv file containing first 100 PCA coordinates for cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for all cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for all cells.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csv. Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csv. Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txt. Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.
    • Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tar. Resource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
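The description says subset datasets (for example CD4 T cells, clusters 0, 3, 4, 28) can be re-created from the full object. As a rough illustration only, the sketch below picks out the matching cell barcodes from a metadata table shaped like PBMC7_AllCells_meta.csv. It assumes the file has a header row using the column names listed above (Loupe, seurat_clusters) and that cluster IDs are stored as plain integers; the sample rows and the `select_barcodes` helper are made up for the example and are not part of the dataset or of the authors' scripts.

```python
import csv
import io

# Cluster IDs the description lists for CD4 T cells.
CD4_CLUSTERS = {"0", "3", "4", "28"}


def select_barcodes(meta_file, clusters):
    """Return barcodes (Loupe column) of cells whose cluster ID is in `clusters`.

    Hypothetical helper: assumes a header row with 'Loupe' and
    'seurat_clusters' columns, as named in the dataset description.
    """
    reader = csv.DictReader(meta_file)
    return [
        row["Loupe"]
        for row in reader
        if row["seurat_clusters"].strip() in clusters
    ]


# Made-up rows standing in for PBMC7_AllCells_meta.csv.
sample = io.StringIO(
    "Loupe,seurat_clusters,celltypes\n"
    "AAAC-1,0,placeholder\n"
    "AAAG-1,6,placeholder\n"
    "AAAT-1,28,placeholder\n"
    "AACC-1,3,placeholder\n"
    "AACG-1,12,placeholder\n"
)

cd4_barcodes = select_barcodes(sample, CD4_CLUSTERS)
print(cd4_barcodes)
```

The resulting barcode list could then be matched against the barcode column of the CD4-only coordinate files; how those files are keyed is not stated in the description, so that step is left out.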

  15. Dispa-SET Output files for the JRC report "Power System Flexibility in a...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    De Felice, Matteo (2024). Dispa-SET Output files for the JRC report "Power System Flexibility in a variable climate" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3778132
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    JRC
    Authors
    De Felice, Matteo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find the model results of the report:

    De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.

    This dataset contains the raw GDX files generated by the GAMS optimiser for the Dispa-SET model. Details on the output format and the names of the variables can be found in the Dispa-SET documentation. A markdown notebook in R (and the rendered PDF) contains an example of how to read the GDX files in R.

    We also include in this dataset a data frame saved in the Apache Parquet format that can be read both in R and Python.

    A description of the methodology and the data sources, with references, can be found in the report.

    Linked resources

    Input files: https://zenodo.org/record/3775569#.XqqY3JpS-fc

    Source code for the figures: https://github.com/energy-modelling-toolkit/figures-JRC-report-power-system-and-climate-variability

    Update

    [29/06/2020] Updated new version of the Parquet file with the right data in the column climate_year

  16. Data from: A new algorithm for reconstructing tree height growth with stem...

    • datadryad.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated May 20, 2021
    Cite
    Christian Salas-Eljatib (2021). A new algorithm for reconstructing tree height growth with stem analysis data [Dataset]. http://doi.org/10.5061/dryad.qnk98sfgc
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 20, 2021
    Dataset provided by
    Dryad
    Authors
    Christian Salas-Eljatib
    Time period covered
    May 10, 2021
    Description

    I offer here both a dataset and computing code related to a stem analysis algorithm to reconstruct height growth of trees. First, the dataset has time series records of tree height for Nothofagus alpina ("rauli"), N. dombeyi ("coigue"), N. obliqua ("roble"), and Pseudotsuga menziesii ("Douglas-fir"). The data come from stem analysis sample trees in both southern Chile and the Inland Northwest, USA. These trees are part of the ones used in an article about a new algorithm for reconstructing tree height growth. The article is published in Methods in Ecology and Evolution (https://doi.org/10.1111/2041-210x.13616). Second, I provide R code implementing the proposed algorithm, applied to a given dataset as an example.

  17. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip(23875170 bytes)Available download formats
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to suggest itemsets to customers, which should increase customer engagement, improve customer experience, and help identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

    Introduction

    Association rules are most often used when you want to find associations between different objects in a set, for example frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

    • support = P(Mouse & Mat) = 8/100 = 0.08
    • confidence = support/P(Computer Mouse) = 0.08/0.10 = 0.80
    • lift = confidence/P(Mat for Mouse) = 0.80/0.09 = 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
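The arithmetic for the mouse and mat rule can be checked with a few lines of Python. This is a minimal sketch using the standard definitions for a rule A => B: support = P(A and B), confidence = P(A and B)/P(A), and lift = confidence/P(B). The function name `rule_metrics` is made up for illustration; the dataset's own analysis uses the R package arules, not this code.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for the rule antecedent => consequent.

    All arguments are transaction counts.
    """
    support = n_both / n_total                    # P(A and B)
    confidence = n_both / n_antecedent            # P(B | A)
    lift = confidence / (n_consequent / n_total)  # P(B | A) / P(B)
    return support, confidence, lift


# 100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.08 confidence=0.80 lift=8.89
```

A lift well above 1, as here, indicates that the two items are bought together far more often than would be expected if the purchases were independent.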

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Libraries in R

    First, we need to load the required libraries. Each one is briefly described below.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and item-sets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    Next, we clean our data frame by removing missing values.

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...

  18. Server-side I/O request arrival traces

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, csv +1
    Updated Jan 24, 2020
    Cite
    Francieli ZANON BOITO; Jean Luca BEZ (2020). Server-side I/O request arrival traces [Dataset]. http://doi.org/10.5281/zenodo.3340631
    Explore at:
    application/gzip, csv, text/x-pythonAvailable download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Francieli ZANON BOITO; Jean Luca BEZ
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset generated for the "On server-side file access pattern matching" paper (Boito et al., HPCS 2019).

    The traces were obtained following the methodology described in the paper. In addition to the two data sets discussed in the paper, we are also making available an extra data set of server traces.

    Traces from I/O nodes

    • IOnode_traces/output/commands has the list of commands used to generate them. Each test is identified by a label, and the test_info.csv file contains the mapping of labels to access patterns. Some files include information about experiments with 8 I/O nodes, but these were removed from the data set because they had some errors.
    • IOnode_traces/output contains .map files that detail the mapping of clients to I/O nodes for each experiment, and .out files, which contain the output of the benchmark.
    • IOnode_traces/ contains one folder per experiment. Inside this folder, there is one folder per I/O node, and inside these folders there are tracefiles for the read and write portions of the experiments. Due to a mistake during the integration between IOFSL and AGIOS, read requests appear as "W", and writes as "R". Once accounted for when processing the traces, that has no impact on results.
    • pattern_length.csv contains the average pattern length for each experiment and operation (average number of requests per second), obtained with the get_pattern_length.py script.

    Each line of a trace looks like this:

    277004729325 00000000eaffffffffffff1f729db77200000000000000000000000000000000 W 0 262144

    The first number is an internal timestamp in nanoseconds, the second value is the file handle, and the third is the type of the request (inverted, "W" for reads and "R" for writes). The last two numbers give the request offset and size in bytes, respectively.
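A short sketch of how one I/O node trace line might be parsed, following the field order described above and undoing the inverted "W"/"R" labels. The `IONodeRequest` type and `parse_ionode_line` function are illustrative names, not part of the dataset's scripts, and the sketch assumes fields are separated by whitespace as in the sample line.

```python
from typing import NamedTuple


class IONodeRequest(NamedTuple):
    timestamp_ns: int  # internal timestamp in nanoseconds
    file_handle: str   # file handle, kept as the raw hex string
    op: str            # "read" or "write", after undoing the inversion
    offset: int        # request offset in bytes
    size: int          # request size in bytes


# In these traces reads are labelled "W" and writes "R".
_INVERTED = {"W": "read", "R": "write"}


def parse_ionode_line(line: str) -> IONodeRequest:
    """Parse one whitespace-separated I/O node trace line."""
    ts, handle, kind, offset, size = line.split()
    return IONodeRequest(int(ts), handle, _INVERTED[kind], int(offset), int(size))


req = parse_ionode_line(
    "277004729325 "
    "00000000eaffffffffffff1f729db77200000000000000000000000000000000 "
    "W 0 262144"
)
print(req.op, req.offset, req.size)  # prints: read 0 262144
```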

    Traces from parallel file system data servers

    These traces are inside the server_traces/ folder. Each experiment has two concurrent applications, "app1" and "app2", and its traces are inside a folder named accordingly:

    NOOP_app1_(identification of app1)_app2_(identification of app2)_(repetition)_pvfstrace/

    Each application is identified by:

    (contig/noncontig)_(number and size of requests per process)_(number of processes)_(number of client machines)_(nto1/nton regarding the number of files)

    Inside each folder there are eight trace files, two per data server, one for the read portion and another for the write portion. Each line looks like this:

    [D 02:54:58.386900] REQ SCHED SCHEDULING, handle: 5764607523034231596, queue_element: 0x2a11360, type: 0, offset: 458752, len: 32768

    The part between [] is a timestamp, "handle" gives the file handle, "type" is 0 for reads and 1 for writes, "offset" and "len" (length) are in bytes.
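The data server trace lines can be picked apart with a regular expression. The sketch below is written against the single sample line shown above, so the pattern is an assumption about the general format; `parse_server_line` is a hypothetical helper, not one of the scripts shipped with the dataset.

```python
import re

# Pattern inferred from the one sample line in the description.
_SERVER_LINE = re.compile(
    r"\[\w (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\] .*?"
    r"handle: (?P<handle>\d+), "
    r"queue_element: (?P<queue>0x[0-9a-fA-F]+), "
    r"type: (?P<type>[01]), "
    r"offset: (?P<offset>\d+), "
    r"len: (?P<len>\d+)"
)


def parse_server_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = _SERVER_LINE.search(line)
    if m is None:
        return None
    return {
        "time": m["time"],
        "handle": int(m["handle"]),
        # Unlike the I/O node traces, type 0 is a read and 1 is a write.
        "op": "read" if m["type"] == "0" else "write",
        "offset": int(m["offset"]),  # bytes
        "len": int(m["len"]),        # bytes
    }


record = parse_server_line(
    "[D 02:54:58.386900] REQ SCHED SCHEDULING, handle: 5764607523034231596, "
    "queue_element: 0x2a11360, type: 0, offset: 458752, len: 32768"
)
print(record["op"], record["offset"], record["len"])  # prints: read 458752 32768
```

Returning None for non-matching lines lets a caller skip any other log lines that may be present in a trace file.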

    • server_traces/pattern_length.csv contains the average pattern length for each experiment and operation, obtained with the server_traces/count_pattern_length.py script.

    Extra traces from data servers

    These traces were not used for the paper because we do not have performance measurements for them with different scheduling policies, so it would not be possible to estimate the results of using the pattern matching approach to select scheduling policies. Still, we share them in the extra_server_traces/ folder in the hope they will be useful. They were obtained in the same experimental campaign as the other data server traces, and have the same format. The difference is that these traces are for single-application scenarios.

  19. Data files and code on the comparison of SARS-CoV-2 with non-segmented RNA...

    • springernature.figshare.com
    pptx
    Updated Jun 1, 2023
    Cite
    Xiaodi Chen (2023). Data files and code on the comparison of SARS-CoV-2 with non-segmented RNA viruses [Dataset]. http://doi.org/10.6084/m9.figshare.12482813.v1
    Explore at:
    pptxAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Xiaodi Chen
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This fileset contains 15 data files and 1 ReadMe file. The data files are as follows:

    • Five Results files in .fasta file format. These are: Result_MacroDomain_.fasta, Result_Spike1_.fasta, Result_Spike2_.fasta, Result_Spike2protein_.fasta and Result_Viroporin_.fasta.
    • Two PowerPoint presentations (.pptx file format). These are: Analysis by MegaX_.pptx and Open Reading Frames_Conserved Domain Found in ORF and CDD_Children_.pptx.
    • Three data files in .nwk file format. These are: NeurotropicRNA.noSegmentR1_original tree.nwk, MacroDomainGen_BootstrapTree.nwk and ViroporinGen_BootstrapTree.nwk.
    • One code file in .R file format. This is: Protein Alignment RStudio_msa package_.R.
    • One file in .tex file format. This is: Covid19_.tex.
    • Two files in .txt file format. These are: Covid19_.txt and texshade.sty package_.txt.
    • One file in .sty file format. This is: texshade_.sty.

    The 5 fasta files contain the results of the multiple protein sequence alignment.

    The PowerPoint presentation Open Reading Frames_Conserved Domain Found in ORF and CDD_Children.pptx contains the search results (snapshot figures) obtained by the Open Reading Frame (ORF) finder and Conserved Domains Database (CDD) database (NCBI). This file provides evidence to show how Figures 2 and 3 were made.

    The PowerPoint presentation Analysis by MegaX_.pptx contains the evidence (parameters) to show how the sequences were aligned and how the tree files were made in Figure 1 by MegaX software.

    The three .nwk files (in Newick tree format) were produced using the MEGAX software. These files contain the data used to construct the phylogenetic trees shown in Figures 1, 2B and 2C of the article.

    The R file contains all the code required to produce Figures 2 and 3 in the article.

    The Covid19.tex file works together with RStudio, containing the msa package (an R package for Multiple Sequence Alignment), to make Figures 2 and 3 in the article. The sty file is a system file for LaTeX and contains code.

    Study aims and methodology: The primary objective of the current study was to determine the possible evolutionary and molecular relationships between SARS-CoV-2 and non-segmented RNA viruses, especially the viruses that can infect the nervous system in infants and children. The whole-genome sequences of 35 non-segmented RNA viruses including 13 CoVs were retrieved from the National Center for Biotechnology Information (NCBI), for the purpose of phylogenetic analysis, which was conducted with MEGAX (Penn State University, PA, USA). All genomic sequences were aligned with the ClustalW algorithm and phylogenetic prediction inferred by the maximum likelihood method and Tamura-Nei model. RStudio (RStudio, Inc., Boston, MA, USA) with msa package was used for multiple protein sequence alignment. For more details on the methodology, please read the related article.

  20. Jacob Kaplan's Concatenated Files: Uniform Crime Reporting Program Data: Law...

    • openicpsr.org
    Updated Mar 25, 2018
    Cite
    Jacob Kaplan (2018). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting Program Data: Law Enforcement Officers Killed and Assaulted (LEOKA) 1960-2024 [Dataset]. http://doi.org/10.3886/E102180V15
    Explore at:
    Dataset updated
    Mar 25, 2018
    Dataset provided by
    Princeton University
    Authors
    Jacob Kaplan
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    1960 - 2024
    Area covered
    United States
    Description

    For a comprehensive guide to this data and other UCR data, please see my book at ucrbook.com

    • Version 15 release notes: Adds .parquet file format.
    • Version 14 release notes: Adds 2023 and 2024 data.
    • Version 13 release notes: Adds 2022 data.
    • Version 12 release notes: Adds 2021 data.
    • Version 11 release notes: Adds 2020 data. Please note that the FBI has retired UCR data ending in 2020 data so this will (probably, I haven't seen confirmation either way) be the last LEOKA data they release. Changes .rda file to .rds.
    • Version 10 release notes: Changes release notes description, does not change data.
    • Version 9 release notes: Adds data for 2019.
    • Version 8 release notes: Fix bug for years 1960-1971 where the number of months reported variable was incorrectly down by 1 month. I recommend caution when using these years as they only report either 0 or 12 months of the year, which differs from every other year in the data. Added the variable officers_killed_total which is the sum of officers_killed_by_felony and officers_killed_by_accident.
    • Version 7 release notes: Adds data from 2018.
    • Version 6 release notes: Adds data in the following formats: SPSS and Excel. Changes project name to avoid confusing this data for the ones done by NACJD.
    • Version 5 release notes: Adds data for 1960-1974 and 2017. Note: many columns (including number of female officers) will always have a value of 0 for years prior to 1971. This is because those variables weren't collected prior to 1971. These should be NA, not 0 but I'm keeping it as 0 to be consistent with the raw data. Removes support for .csv and .sav files. Adds a number_of_months_reported variable for each agency-year. A month is considered reported if the month_indicator column for that month has a value of "normal update" or "reported, not data." The formatting of the monthly data has changed from wide to long. This means that each agency-month has a single row. The old data had each agency being a single row with each month-category (e.g. jan_officers_killed_by_felony) being a column. Now there will just be a single column for each category (e.g. officers_killed_by_felony) and the month can be identified in the month column. This also results in most column names changing. As such, be careful when aggregating the monthly data since some variables are the same every month (e.g. number of officers employed is measured annually) so aggregating will be 12 times as high as the real value for those variables. Adds a date column. This date column is always set to the first of the month. It is NOT the date that a crime occurred or was reported. It is only there to make it easier to create time-series graphs that require a date input. All the data in this version was acquired from the FBI as text/DAT files and read into R using the package asciiSetupReader. The FBI also provided a PDF file explaining how to create the setup file to read the data. Both the FBI's PDF and the setup file I made are included in the zip files. Data is the same as from NACJD but using all FBI files makes cleaning easier as all column names are already identical.
    • Version 4 release notes: Add data for 2016. Order rows by year (descending) and ORI.
    • Version 3 release notes: Fix bug where Philadelphia Police Department had incorrect FIPS county code.

    The LEOKA data sets contain highly detailed data about the number of officers/civilians employed by an agency and how many officers were killed or assaulted. All the data was acquired from the FBI as text/DAT files and read into R using the package asciiSetupReader. The FBI also provided a PDF file explaining how to create the setup file to read the data. Both the FBI's PDF and the setup file I made are included in the zip files. About 7% of all agencies in the data report more officers or civilians than population. As such, I removed the officers/civilians per 1,000 population variables. You should exercise caution if deciding to generate and use these variables yourself.
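The aggregation caveat for the long (agency-month) format can be sketched as follows: monthly counts are summed over the year, while annually measured values are taken once rather than summed. In this sketch only officers_killed_by_felony, year and month are column names taken from the notes above; `ori`, `officers_employed` and the sample rows are hypothetical stand-ins, and the real files should be checked for the actual names.

```python
def aggregate_yearly(rows, monthly_cols, annual_cols):
    """Collapse agency-month rows into one record per (agency, year).

    Columns in `monthly_cols` are summed; columns in `annual_cols` repeat
    the same value every month, so the first value seen is kept as is.
    """
    yearly = {}
    for row in rows:
        key = (row["ori"], row["year"])
        if key not in yearly:
            yearly[key] = {col: 0 for col in monthly_cols}
            yearly[key].update({col: row[col] for col in annual_cols})
        for col in monthly_cols:
            yearly[key][col] += row[col]
    return yearly


# Made-up agency-month rows (three months only, for brevity).
rows = [
    {"ori": "X1", "year": 2017, "month": "january",
     "officers_killed_by_felony": 1, "officers_employed": 50},
    {"ori": "X1", "year": 2017, "month": "february",
     "officers_killed_by_felony": 0, "officers_employed": 50},
    {"ori": "X1", "year": 2017, "month": "march",
     "officers_killed_by_felony": 2, "officers_employed": 50},
]

yearly = aggregate_yearly(
    rows,
    monthly_cols=["officers_killed_by_felony"],
    annual_cols=["officers_employed"],
)
print(yearly[("X1", 2017)])
# prints: {'officers_killed_by_felony': 3, 'officers_employed': 50}
```

Summing `officers_employed` across the rows instead would give 150 here, and 12 times the true value over a full year, which is the error the release notes warn about.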
    Several agencies had impossibly large (>15) numbers of officer deaths in a single month. For those months I changed the value to NA. The UCR Handbook (https://ucr.fbi.gov/additional-ucr-publications/ucr_handbook.pdf/view) describes the LEOKA data as follows: "The UCR Program collects data from all contributing agencies ... on officer line-of-duty deaths and assaults. Reporting agencies must submit data on ... their own duly sworn officers f

Cite
Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
Data from: Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies

Related Article
Explore at:
zipAvailable download formats
Dataset updated
Dec 7, 2023
Dataset provided by
National Institute of Allergy and Infectious Diseaseshttp://www.niaid.nih.gov/
HIV Prevention Trials Networkhttp://www.hptn.org/
HIV Vaccine Trials Networkhttp://www.hvtn.org/
PEPFAR
Authors
Dylan Westfall; Mullins James
License

https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

Description

Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. 
"Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies". Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
