15 datasets found
  1. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
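    as a rough illustration of the import route described above (parse.SAScii plus an RSQLite database), here's a minimal R sketch. the nber script url and file name below are placeholders rather than the real 2012 asec locations, and the actual download-all-microdata script does considerably more.

    library(SAScii)
    library(DBI)
    library(RSQLite)

    # placeholder locations for the NBER SAS importation script and the ASEC fixed-width file
    sas_script <- "http://www.nber.org/data/progs/cps/cpsmar2012.sas"
    fwf_file   <- "asec2012_pubuse.dat"

    # parse.SAScii() reads the SAS INPUT block to recover column names, widths, and types
    layout <- parse.SAScii(sas_script)

    # read.SAScii() uses that same SAS code to pull the fixed-width file into a data frame
    asec <- read.SAScii(fwf_file, sas_script)

    # stash the person-level table in a SQLite database for memory-friendly analysis
    con <- dbConnect(SQLite(), "cps_asec.sqlite")
    dbWriteTable(con, "asec12", asec, overwrite = TRUE)
    dbDisconnect(con)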

  2. Merger of BNV-D data (2008 to 2019) and enrichment

    • data.europa.eu
    zip
    Updated Jan 16, 2025
    Cite
    Patrick VINCOURT (2025). Merger of BNV-D data (2008 to 2019) and enrichment [Dataset]. https://data.europa.eu/data/datasets/5f1c3eca9d149439e50c740f?locale=en
    Available download formats: zip (18530465 bytes)
    Dataset updated
    Jan 16, 2025
    Dataset authored and provided by
    Patrick VINCOURT
    Description

    Merging (into an R table) the data published on https://www.data.gouv.fr/fr/datasets/ventes-de-pesticides-par-departement/, and joining two other sources of information associated with marketing authorisations (MAs):
    • uses: https://www.data.gouv.fr/fr/datasets/usages-des-produits-phytosanitaires/
    • information on the "Biocontrol" status of the product, from document DGAL/SDQSPV/2020-784 published on 18/12/2020 at https://agriculture.gouv.fr/quest-ce-que-le-biocontrole

    All the initial files (.csv transformed into .txt), the R code used to merge the data, and the different output files are collected in a zip.

    NB:
    1) "YASCUB" stands for {year, AMM, Substance_active, Classification, Usage, Statut_"BioControl"}, substances not on the DGAL/SDQSPV list being coded NA.
    2) The file of biocontrol products has been cleaned of the duplicates generated by marketing authorisations leading to several trade names.
    3) The BNVD_BioC_DY3 table and the output file BNVD_BioC_DY3.txt contain the fields {Code_Region, Region, Dept, Code_Dept, Anne, Usage, Classification, Type_BioC, Quantite_substance}
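    A minimal sketch of the join logic described above, assuming the three source tables have been read into R and share an AMM (marketing authorisation) key; file and column names are illustrative only, not the actual ones in the zip.

    library(dplyr)

    bnvd       <- read.delim("bnvd_2008_2019.txt")                    # BNV-D sales by department
    usages     <- read.delim("usages_produits_phytosanitaires.txt")   # uses per AMM
    biocontrol <- read.delim("liste_biocontrole_DGAL_SDQSPV.txt")     # biocontrol status per AMM

    # left joins keep every BNV-D row; products absent from the DGAL/SDQSPV list end up as NA
    bnvd_enriched <- bnvd %>%
      left_join(usages, by = "AMM") %>%
      left_join(biocontrol, by = "AMM")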

  3. Effect of data source on estimates of regional bird richness in northeastern...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 4, 2021
    Cite
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk (2021). Effect of data source on estimates of regional bird richness in northeastern United States [Dataset]. http://doi.org/10.5061/dryad.m905qfv0h
    Available download formats: zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    Agricultural Research Service
    Massachusetts Audubon Society
    Hebrew University of Jerusalem
    University of Michigan
    Columbia University
    Gettysburg College
    University of Vermont
    New York State Department of Environmental Conservation
    Authors
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Northeastern United States, United States
    Description

    Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.

    Methods Overview

    This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation & mean summer temperature (from PRISM), and NLCD 2006 land-use data.

    Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.

    The methods for compilation are contained in the supplementary information of the manuscript but also here:

    Bird data

    For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).

    Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.

    The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
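    For orientation, the reshaping step described above might look roughly like the following in R; the long-format column names (unit_id, species, presence) and per-state object names are stand-ins, and tidyr::pivot_wider is used here in place of whatever reshaping call the authors actually applied in R 3.6.2.

    library(dplyr)
    library(tidyr)

    # long table of observations -> one row per sampling unit, one column per species
    state_wide <- state_long %>%
      pivot_wider(names_from = species,
                  values_from = presence,
                  values_fill = 0)

    # stack the per-state tables into a single BBA (or BBS) dataset, as described in the text
    bba_all <- rbind.data.frame(ma_wide, mi_wide, ny_wide, pa_wide, vt_wide)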

    Environmental data

    Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.

  4. ReCount - A multi-experiment resource of analysis-ready RNA-seq gene count...

    • neuinfo.org
    • scicrunch.org
    • +2more
    Updated Jun 14, 2021
    Cite
    (2021). ReCount - A multi-experiment resource of analysis-ready RNA-seq gene count datasets [Dataset]. http://identifiers.org/RRID:SCR_001774
    Dataset updated
    Jun 14, 2021
    Description

    RNA-seq gene count datasets built using the raw data from 18 different studies. The raw sequencing data (.fastq files) were processed with Myrna to obtain tables of counts for each gene. For ease of statistical analysis, each count table was combined with sample phenotype data to form an R object of class ExpressionSet. The count tables, ExpressionSets, and phenotype tables are ready to use and freely available. By taking care of several preprocessing steps and combining many datasets into one easily accessible website, the resource makes finding and analyzing RNA-seq data considerably more straightforward.
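    For readers unfamiliar with the ExpressionSet container, a minimal sketch of combining a count table with its phenotype table is shown below; the file names are illustrative, not those used by ReCount.

    library(Biobase)

    counts <- as.matrix(read.table("gene_counts.txt", header = TRUE, row.names = 1))
    pheno  <- read.table("phenotype.txt", header = TRUE, row.names = 1)

    # sample names must line up between the count matrix columns and the phenotype rows
    stopifnot(identical(colnames(counts), rownames(pheno)))

    eset <- ExpressionSet(assayData = counts,
                          phenoData = AnnotatedDataFrame(pheno))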

  5. Replication Data for 'Big G'

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jan 8, 2024
    Cite
    Gernot Mueller; Ernesto Pasten; Raphael Schoenle; Michael Weber (2024). Replication Data for 'Big G' [Dataset]. http://doi.org/10.7910/DVN/8RCMZP
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Gernot Mueller; Ernesto Pasten; Raphael Schoenle; Michael Weber
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The files described below replicate the results of "Big G". They are divided into three parts, which can be found in three different sub-folders: (1) FiveFacts, (2) ModelSimulation, and (3) VAR.

    PART 1: Five Facts on Government spending (folder: FiveFacts)

    This folder contains code to replicate Figures 1-4 and Tables 1-4 in Section 3 of the paper.

    Data Set-Up
    In order to run the included script files, the main dataset needs to be assembled. The data on federal procurement contracts used in this paper is all publicly available from USASpending.gov. The base dataset used for all of the empirical results in this paper consists of the universe of procurement contract transactions from 2001-2019---around 30 GB of data. Due to its size, the data requires a substantial amount of computing power to work with. Our approach was to load the data into a SQL database on a server, following the instructions provided by USASpending.gov, which can be found here: https://files.usaspending.gov/database_download/usaspending-db-setup.pdf. As a result, the replication code cannot feasibly start with the raw dataset, though we have provided the raw files at an annual basis at [INSERT URL FOR SITE HERE]. The files "setup_data_1.R", "setup_data_2.R", "setup_data_3.R", and "setup_data_4.R" pull from the SQL database and create intermediate files that are provided with this replication package. You will NOT be able to run the "set_up" files without setting up your own SQL database, but you CAN run the Figure and Table replication code (described below) using the intermediate files created in the setup files.

    Figures
    Figure 1
    + Step 1: Run 'create_contract_proxy.R,' which creates a dataset called 'contracts_for_ramey_merge.dta'
    + Step 2: Run ramey_zubairy_replication.do, which is a file TAKEN DIRECTLY FROM THE REPLICATION PACKAGE for Ramey & Zubairy (JPE, 2018), found at the link below. We merge our dataset into theirs, and re-run their regressions on our data. Ramey & Zubairy (2018) replication: https://econweb.ucsd.edu/~vramey/research/Ramey_Zubairy_replication_codes.zip.
    Figure 2
    + 'Figure_2a.R' produces Figure 2a using 'intermediate_file_1.RData'
    + 'Figure_2b.R' produces Figure 2b using 'intermediate_file_2.RData'
    Figure 3
    + 'Figure_3a.R' produces Figure 3a using 'intermediate_file_3.RData'
    + 'Figure_3b.R' produces Figure 3b using 'intermediate_file_2.RData'
    Figure 4
    + 'Figure_4.R' produces Figures 4a and 4b using 'intermediate_file_3.RData'

    Tables
    Table 1
    + 'Table_1.do' produces Table 1 using 'contracts_for_ramey_merge.dta'
    Table 2
    + 'Table_2_upper' produces the top portion of Table 2 using the 'sectors_unbalanced.dta' file created in 'setup_data_4.R'
    + 'Table_2_lower' produces the lower portion of Table 2 using the 'firms_unbalanced.dta' file created in 'setup_data_4.R'
    Table 3
    + 'Table_3.R' produces Table 3 using 'intermediate_file_1.RData'
    Table 4
    + Components for Table 4 can be found in 'Figure_3a.R' and 'Figure_3b.R' (noted in those files).

    PART 2: Model Simulation (folder: ModelSimulation)
    + Matlab file MAIN_generateIRFs.m generates Figures 5 and 6 in the paper. It calls the mod file modelG.mod
    + Matlab file MAIN_generateIRFs_htm.m generates Figure A.21 in the Appendix. It calls the mod file modelG_htm.mod
    + Both files run on Dynare 5.4.

    PART 3: VAR (folder: VAR; see README in VAR folder for more detail)
    Data Setup: "setup_var_data.R," like the files in the FiveFacts folder, will not run. They create a dataset of contracts by month and naics2 sector from the SQL database.
    + 'VAR.do' runs the VAR that produces Figure 7.
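    For context, the kind of pull performed by the setup_data scripts could be sketched as below; the connection settings, table name, and column names are placeholders and are not taken from the replication package, which must be pointed at your own USASpending database.

    library(DBI)
    library(RPostgres)

    con <- dbConnect(Postgres(), dbname = "usaspending", host = "localhost")

    # placeholder table/column names standing in for the USASpending schema
    contracts <- dbGetQuery(con, "
      SELECT action_date, obligation_amount, agency_code
      FROM contract_transactions
      WHERE action_date BETWEEN '2001-01-01' AND '2019-12-31'
    ")

    saveRDS(contracts, "intermediate_contracts.rds")
    dbDisconnect(con)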

  6. RPS Galilee Basin: Report on the Hydrogeological Investigations - Appendix...

    • researchdata.edu.au
    • data.gov.au
    • +1more
    Updated Nov 15, 2016
    Cite
    Bioregional Assessment Program (2016). RPS Galilee Basin: Report on the Hydrogeological Investigations - Appendix tables B to F (Spatial) [Dataset]. https://researchdata.edu.au/rps-galilee-basin-f-spatial/2989381
    Dataset updated
    Nov 15, 2016
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 2.5 (CC BY 2.5), https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Area covered
    Galilee Basin
    Description

    Abstract

    The Galilee Basin Operators' Forum (GBOF) is a group of petroleum companies exploring the Galilee Basin for commercial quantities of hydrocarbons. Exploration activities include the search for conventional hydrocarbons, and increasingly non-conventional hydrocarbon sources such as coal seam gas (CSG). The CSG target is the Permian coal measures as shown in Figure 1.1. Understanding and protecting groundwater is a key issue and community concern. As part of the early exploration activities in the Galilee Basin, the GBOF companies have initiated this study to assist in developing a regional and consistent subsurface description, and to document the existing data for the groundwater systems in the Galilee Basin study area. RPS, as an independent company, was contracted to perform the study and prepare a report.

    This initial study should not be confused with a "baseline assessment" or "underground water impact report", which are specific requirements under the Water Act 2000, triggered once production testing is underway or production has commenced. This study gathers and assembles all the base historical data which may be used in further studies. For the Galilee Basin study area, this investigation is specifically designed to:
    • Review stratigraphy and identify possible aquifers beneath the GBOF member company tenures;
    • Delineate aquifers that warrant further monitoring; and
    • Obtain and tabulate current Department of Environment and Resource Management Groundwater Database (DERM GWDB) (now the Department of Environment and Heritage, EHP) registered bore data, including:
      » Water bore location and summary statistics;
      » Groundwater levels and artesian flow data; and
      » Groundwater quality.

    Data sources for this report include:
    • Groundwater data available in the DERM GWDB;
    • Petroleum exploration wells recorded in Queensland Petroleum Exploration Data (QPED);
    • DERM groundwater data logger/tipping bucket rain gauge program;
    • Springs of Queensland Dataset (version 4.0) held by DERM;
    • PressurePlot Version 2 developed by CSIRO and linked to a Pressure-Hydrodynamics database; and
    • Direct communication with GBOF members.

    Data was sourced in January 2011. Since then there has been considerable additional drilling by GBOF members, which is not incorporated in this report. All data has been used by RPS as provided, without independent investigations to validate the data. It is recognised that historical data may be subject to inaccuracies; however, as work progresses in the region, an improvement in data integrity should be realised.

    Dataset History

    Tables as taken from Appendix B to F of the Galilee Basin: Report on the Hydrogeological Investigations, prepared by RPS Australia PTY LTD for RLMS. PR102603-1: Rev 1 / December 2012. Spatial datasets were created for each appendix table using supplied coordinate values (MGA Zone 54, MGA Zone 55, GDA94 Geographics) where available, or by spatially referencing (spatial join) the NGIS QLD core bores dataset via the unique DERM Registered Bore Numbers attribute field.
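    As a hedged illustration of the two georeferencing routes just described, a sketch using the R sf package (rather than the GIS software actually used for the derived product) is shown below; file, layer, and key-column names are placeholders.

    library(sf)
    library(dplyr)

    # route 1: build point geometry from supplied MGA Zone 55 coordinates (GDA94 / EPSG:28355)
    appendix_b    <- read.csv("appendix_B.csv")
    appendix_b_sf <- st_as_sf(appendix_b, coords = c("easting", "northing"), crs = 28355)

    # route 2: attach geometry by joining to the NGIS QLD core bores layer on the registered bore number
    ngis_bores    <- st_read("NGIS_QLD_core_bores.shp")
    appendix_c    <- read.csv("appendix_C.csv")
    appendix_c_sf <- ngis_bores %>%
      inner_join(appendix_c, by = "RegisteredBoreNo")   # placeholder key name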

    Dataset Citation

    Geoscience Australia (XXXX) RPS Galilee Basin: Report on the Hydrogeological Investigations - Appendix tables B to F (Spatial). Bioregional Assessment Derived Dataset. Viewed 16 November 2016, http://data.bioregionalassessments.gov.au/dataset/d3d92616-c0b8-4cfb-9eb5-4031915e5e41.

    Dataset Ancestors

    • Derived From National Groundwater Information System, Queensland Core dataset (superseded)
    • Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)

  7. Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 10, 2023
    Cite
    Trujillo, Amaury; Cresci, Stefano (2023). Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6250576
    Dataset updated
    Jan 10, 2023
    Dataset provided by
    IIT-CNR
    Authors
    Trujillo, Amaury; Cresci, Stefano
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.

    An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again

    If you use this dataset please cite the related article.

    The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.

    The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe upper limit. It only has two tables: submissions and comments. It should be noted that the IDs of contents are on base 10 (numeric integer), unlike the original base 36 (alphanumeric) used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another.
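    For example, a base-36 Reddit ID can be converted to the numeric form used in these databases with a few lines of R; note that base::strtoi() overflows 32-bit integers for recent IDs, so this sketch accumulates the value as a double instead.

    # convert a Reddit base-36 ID (e.g. "gjacwx5") to the base-10 form stored in the databases
    base36_to_decimal <- function(id) {
      digits <- strsplit(tolower(id), "")[[1]]
      values <- match(digits, c(0:9, letters)) - 1   # "0"-"9" -> 0-9, "a"-"z" -> 10-35
      Reduce(function(acc, v) acc * 36 + v, values)
    }

    base36_to_decimal("gjacwx5")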

    The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wide (i.e., both within and outside the subreddit) during the dataset timeframe. Core users are defined as those who authored at least one submission or comment per week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also on base 10). The perspective_scores table contains comment toxicity scores.

    The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.

    A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored on a Likert-like manner. One can associate submissions to MBFC scores by doing a join by the domain column.
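    A minimal sketch of that join, assuming the submissions table exposes a domain column matching the one in mbfc_scores.csv:

    library(dplyr)
    library(readr)

    mbfc <- read_csv("mbfc_scores.csv")

    # attach bias and factual-reporting scores to each submission by outlet domain
    submissions_scored <- submissions %>%
      left_join(mbfc, by = "domain")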

  8. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network (http://www.hptn.org/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz).
    More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.

  9. Metrical, morphosyntactic, and syntactic analysis of the Rigveda

    • swissubase.ch
    Updated Sep 22, 2025
    Cite
    (2025). Metrical, morphosyntactic, and syntactic analysis of the Rigveda [Dataset]. http://doi.org/10.48656/yc4z-sa04
    Dataset updated
    Sep 22, 2025
    Description

    The dataset contains: • the main data table, RV_data.csv, with morphosyntactic, syntactic and metrical information on each Rigvedic word form, and • a script, disticha.rmd, for the analysis of disticha in the main types of Rigvedic stanzas which were studied as an example for the application of the data table, resulting in the published article: Salvatore Scarlata and Paul Widmer, Syntactic evidence for metrical structure in Rigvedic stanzas, Indo-European Linguistics 13 (2025), 1-21, doi:10.1163/22125892-bja10041, issn: 2212-5892.

    In addition the dataset contains: • a further data table, RV-polylex.csv, in which all compounded word forms are analyzed, and • some ancillary basic scripts for linking the two tables (join.r) and for simplified representations (pivot01–03.r).

    Finally, the dataset contains: • a data table, RV-polylexREJECTS.csv, containing word forms that could not be assessed as compounded.

  10. Area Resource File (ARF)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Area Resource File (ARF) [Dataset]. http://doi.org/10.7910/DVN/8NMSFV
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the area resource file (arf) with r. the arf is fun to say out loud. it's also a single county-level data table with about 6,000 variables, produced by the united states health services and resources administration (hrsa). the file contains health information and statistics for over 3,000 us counties. like many government agencies, hrsa provides only a sas importation script and an ascii file.

    this new github repository contains two scripts:

    2011-2012 arf - download.R
    • download the zipped area resource file directly onto your local computer
    • load the entire table into a temporary sql database
    • save the condensed file as an R data file (.rda), comma-separated value file (.csv), and/or stata-readable file (.dta)

    2011-2012 arf - analysis examples.R
    • limit the arf to the variables necessary for your analysis
    • sum up a few county-level statistics
    • merge the arf onto other data sets, using both fips and ssa county codes
    • create a sweet county-level map

    click here to view these two scripts

    for more detail about the area resource file (arf), visit:
    • the arf home page
    • the hrsa data warehouse

    notes: the arf may not be a survey data set itself, but it's particularly useful to merge onto other survey data. confidential to sas, spss, stata, and sudaan users: time to put down the abacus. time to transition to r. :D
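    a hedged sketch of the "save it three ways, then merge by county code" workflow described above - every object and column name here is illustrative rather than taken from the scripts:

    library(haven)   # write_dta() for the stata copy

    # save the condensed county-level table in the three formats mentioned above
    save(arf, file = "arf2011.rda")
    write.csv(arf, "arf2011.csv", row.names = FALSE)
    write_dta(arf, "arf2011.dta")

    # merge county-level arf statistics onto another data set by fips county code
    merged <- merge(other_county_data, arf, by = "fips", all.x = TRUE)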

  11. FIRST Catalog of FR I Radio Galaxies - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). FIRST Catalog of FR I Radio Galaxies - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/first-catalog-of-fr-i-radio-galaxies
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The authors have built a catalog of 219 Fanaroff and Riley class I edge-darkened radio galaxies (FR Is), called FRICAT, that is selected from a published sample and obtained by combining observations from the NVSS, FIRST, and SDSS surveys. They included in the catalog the sources with an edge-darkened radio morphology, redshift <= 0.15, and extending (at the sensitivity of the FIRST images) to a radius r larger than 30 kpc from the center of the host. The authors also selected an additional sample (sFRICAT) of 14 smaller (10 < r < 30 kpc) FR Is, limiting to z < 0.05. The hosts of the FRICAT sources are all luminous (-21 >~ Mr >~ -24), red early-type galaxies with black hole masses in the range 10^8 <~ M_BH <~ 3 x 10^9 solar masses; the spectroscopic classification based on the optical emission line ratios indicates that they are all low excitation galaxies. Sources in the FRICAT are then indistinguishable from the FR Is belonging to the Third Cambridge Catalogue of Radio Sources (3C) on the basis of their optical properties. Conversely, while the 3C-FR Is show a strong positive trend between radio and [O III] emission line luminosity, these two quantities are unrelated in the FRICAT sources; at a given line luminosity, they show radio luminosities spanning about two orders of magnitude and extending to much lower ratios between radio and line power than 3C-FR Is. The authors' main conclusion is that the 3C-FR Is represent just the tip of the iceberg of a much larger and diverse population of FR Is. This HEASARC table contains both the 219 radio galaxies in the main FRICAT sample listed in Table B.1 of the reference paper and the 14 radio galaxies in the additional sFRICAT sample listed in Table B.2 of the reference paper. To enable users to distinguish from which sample an entry has been taken, the HEASARC created a parameter galaxy_sample which is set to 'M' for galaxies from the main sample, and to 'S' for galaxies from the supplementary sFRICAT sample. Throughout the paper, the authors adopted a cosmology with H0 = 67.8 km s^-1 Mpc^-1, OmegaM = 0.308, and OmegaLambda = 0.692 (Planck Collaboration XIII 2016). This table was created by the HEASARC in February 2017 based on electronic versions of Tables B.1 and B.2 that were obtained from the Astronomy & Astrophysics website. This is a service provided by NASA HEASARC.

  12. Updated Australian bathymetry: merged 250m bathyTopo

    • data.csiro.au
    • researchdata.edu.au
    Updated Sep 15, 2021
    Cite
    Julian O'Grady; Claire Trenham; Ron Hoeke (2021). Updated Australian bathymetry: merged 250m bathyTopo [Dataset]. http://doi.org/10.25919/cm17-xc81
    Dataset updated
    Sep 15, 2021
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Julian O'Grady; Claire Trenham; Ron Hoeke
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2009 - Aug 31, 2021
    Area covered
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    Accurate coastal wave and hydrodynamic modelling relies on quality bathymetric input. Many national scale modelling studies, hindcast and forecast products have used, or are currently using, a 2009 digital elevation model (DEM), which does not include recently available bathymetric surveys and is now out of date. There are immediate needs for an updated national product, preceding the delivery of the AusSeabed program’s Global Multi-Resolution Topography for Australian coastal and ocean models. There are also challenges in stitching coarse resolution DEMs, which are often too shallow where they meet high-resolution information (e.g. LiDAR surveys) and require supervised/manual modifications (e.g. NSW, Perth, and Portland VIC bathymetries). This report updates the 2009 topography and bathymetry with a selection of nearshore surveys and demonstrates where the 2009 dataset and nearshore bathymetries do not match up.

    Lineage: All of the datasets listed in Table 1 (see supporting files) were used in previous CSIRO internal projects or downloaded from online data portals and processed using QGIS and R’s ‘raster’ package. The Perth LiDAR surveys were provided as points and gridded in R using raster::rasterFromXYZ(). The Macquarie Harbour contour lines were regridded in QGIS using the TIN interpolator. Each dataset was mapped with an accompanying Type Identifier (TID) following the conventions of the GEBCO dataset. The mapping went through several iterations; at each iteration the blending was checked for inconsistency, i.e. where the GA250m DEM was too shallow when it met the high-resolution LiDAR surveys. QGIS v3.16.4 was used to draw masks over inconsistent blending, and GA250 values falling within the mask and between two depths were assigned NA (no-data). LiDAR datasets were projected to +proj=longlat +datum=WGS84 +no_defs using raster::projectRaster(), resampled to the GA250 grid using raster::resample() and then merged with raster::merge(). Nearest neighbour resampling was performed for all datasets except for the GEBCO ~500m product, which used the bilinear method. The order of the mapping overlay is sequential from TID = 1 being the base, through to 107, where 0 is the gap-filled values.

    Permissions are required for all code and internal datasets (Contact Julian OGrady).
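    A condensed sketch of the gridding, reprojection, resampling, and merge chain named in the lineage; file names are placeholders and only the raster calls mentioned above are used.

    library(raster)

    ga250     <- raster("ga250_bathytopo.tif")        # base 250 m national grid
    lidar_pts <- read.csv("perth_lidar_points.csv")   # assumed columns: x, y, z

    lidar  <- rasterFromXYZ(lidar_pts)                                          # grid the LiDAR points
    lidar  <- projectRaster(lidar, crs = "+proj=longlat +datum=WGS84 +no_defs") # match the GA250 projection
    lidar  <- resample(lidar, ga250, method = "ngb")                            # nearest neighbour, as in the text
    merged <- merge(lidar, ga250)                                               # first layer wins where both have values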

  13. Microclimate Sensor Locations - Historical

    • data.melbourne.vic.gov.au
    csv, excel, geojson +1
    Updated Nov 13, 2022
    Cite
    (2022). Microclimate Sensor Locations - Historical [Dataset]. https://data.melbourne.vic.gov.au/explore/dataset/microclimate-sensor-locations/
    Available download formats: geojson, json, csv, excel
    Dataset updated
    Nov 13, 2022
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: these sensors have been removed and new microclimate sensors have been installed in the city. Please check https://data.melbourne.vic.gov.au/explore/dataset/microclimate-sensors-data/table/

    This dataset contains the historical location and location description for each microclimate sensor device installed throughout the city. Each microclimate sensor device will have several climate sensors embedded inside. Sensor devices are typically installed on a street pole, and locations are selected based on relevant project criteria.

    Since the beginning of the Microclimate Sensor Readings dataset in 2019, some sensor devices have been relocated. This may be for various reasons such as construction works or re-assignment to new climate projects. Any changes to sensor locations are important to consider when analysing and interpreting historical data.

    The site_id column can be used to merge the data with related dataset (linked below). Site_id refers to the location of the unique sensor device which may have changed over time.

    Related datasets: Microclimate Sensor - Sensor Readings
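    A small sketch of the site_id merge suggested above, assuming both datasets have been exported as CSV and share the site_id column:

    library(dplyr)

    locations <- read.csv("microclimate-sensor-locations.csv")
    readings  <- read.csv("microclimate-sensors-data.csv")

    # attach the historical location details to each sensor reading
    readings_with_location <- readings %>%
      left_join(locations, by = "site_id")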

  14. Data from: Using Deep Learning to Fill Data Gaps in Environmental Footprint...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 10, 2023
    Cite
    Bu Zhao; Chenyang Shuai; Shen Qu; Ming Xu (2023). Using Deep Learning to Fill Data Gaps in Environmental Footprint Accounting [Dataset]. http://doi.org/10.1021/acs.est.2c01640.s001
    Available download formats: xlsx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    ACS Publications
    Authors
    Bu Zhao; Chenyang Shuai; Shen Qu; Ming Xu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Environmental footprint accounting relies on economic input–output (IO) models. However, the compilation of IO models is costly and time-consuming, leading to a lack of timely, detailed IO data. The RAS method is traditionally used to predict future IO tables but is often questioned for producing unreliable estimates. Here we develop a machine learning-augmented method to improve the accuracy of IO table prediction, using the US summary-level tables as a demonstration. The model is constructed by combining the RAS method with a deep neural network (DNN) model, in which the RAS method provides a baseline prediction and the DNN model makes further improvements in the areas where RAS tends to have poor performance. Our results show that the DNN model can significantly improve performance in those areas of IO tables for short-term prediction (one year) where RAS alone performs poorly: R2 improved from 0.6412 to 0.8726, and median APE decreased from 37.49% to 11.35%. For long-term prediction (5 years), the improvements are even more significant: R2 improved from 0.5271 to 0.7893, and median average percentage error decreased from 51.12% to 18.26%. Our case study evaluating the US carbon footprint accounts based on the estimated IO table also demonstrates the applicability of the model. Our method can help generate timely IO tables to provide fundamental data for a variety of environmental footprint analyses.
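    For readers unfamiliar with RAS balancing, the baseline step can be sketched in a few lines of R: alternately rescale a prior IO matrix until its row and column sums match target margins. This is a generic illustration, not the authors' code, and the toy numbers are made up.

    # RAS / iterative proportional fitting: alternately rescale rows and columns
    ras_balance <- function(A0, u, v, tol = 1e-8, max_iter = 1000) {
      A <- A0
      for (i in seq_len(max_iter)) {
        A <- A * (u / rowSums(A))          # match target row sums u
        A <- t(t(A) * (v / colSums(A)))    # match target column sums v
        if (max(abs(rowSums(A) - u), abs(colSums(A) - v)) < tol) break
      }
      A
    }

    # toy 3x3 prior table with target margins that share the same total (50)
    A0 <- matrix(c(10, 5, 2,
                    4, 8, 3,
                    1, 2, 6), nrow = 3, byrow = TRUE)
    u  <- c(20, 18, 12)   # target row sums
    v  <- c(17, 18, 15)   # target column sums
    A_ras <- ras_balance(A0, u, v)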

  15. AO3 2021 Snapshot Dataset

    • kaggle.com
    zip
    Updated Jul 22, 2025
    Cite
    Zaynab Badawy (2025). AO3 2021 Snapshot Dataset [Dataset]. https://www.kaggle.com/datasets/zaynabbadawy/ao3-2021-snapshot-dataset
    Available download formats: zip (94887935 bytes)
    Dataset updated
    Jul 22, 2025
    Authors
    Zaynab Badawy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📊 Case Study: Analysis of Archive Warnings in AO3 Fanfiction Works

    1. Ask

    📌 Business Task

    As a junior data analyst at a fanfiction analytics consultancy, I was tasked with analyzing how archive warnings are distributed across fanfiction works on Archive of Our Own (AO3). The client is interested in understanding:

    • The prevalence of content warnings
    • How they inform content tagging accuracy
    • Ways to improve reader experience and moderation

    🔍 Key Questions

    • How many works contain each type of archive warning?
    • What percentage of overall works each warning represents?
    • Are there trends or overlaps in warning usage?

    👥 Stakeholders

    • AO3 content moderation team
    • Fanfiction readers and communities
    • Client-side data/product teams for content safety and discovery

    🎯 How Insights Help

    Better understanding of archive warnings can:

    • Enhance tagging algorithms
    • Improve content filtering
    • Guide moderation policies
    • Promote transparency and safety

    2. Prepare

    📁 Data Sources

    Dataset includes ~600,000 AO3 fanfiction works, organized across three tables:

    • works: Metadata on fanfiction works
    • tags: Includes tag types like archive warnings and fandoms
    • work_tag: Many-to-many mapping of works and tags

    🧹 Data Cleaning

    • Imported into RStudio
    • Tags linked via work_id and tag_id
    • Filtered for type == "ArchiveWarning" and type == "Fandom"
    • Converted relevant columns to correct types (integer, character)
    • Removed incomplete or inconsistent entries

    ✅ Data Quality

    • Validated counts against known AO3 stats
    • Confirmed warning totals made logical sense
    • Dataset is anonymized and public — no privacy concerns

    3. Process

    🧰 Tools Used

    • MySQL — for initial slicing of large tables
    • R (tidyverse) — for transformation, filtering, summarizing
    • Tableau — for interactive visualizations

    🔧 Key Transformations

    • Parsed concatenated tag ID strings
    • Filtered tags by type
    • Grouped by archive warning and counted distinct work IDs
    • Calculated percentages based on total works

    4. Analyze

    📈 Summary Statistics

    • Total works in dataset: 601,286
    • Works with at least one archive warning tag: 61,576 (~10.2%)

    🏷️ Top 5 Archive Warnings by Frequency

    Warning Name                       | Total Works | % of All Works
    -----------------------------------|-------------|---------------
    No Archive Warnings Apply          | 32,051      | 5.33%
    Choose Not To Use Archive Warnings | 21,591      | 3.59%
    Graphic Depictions Of Violence     | 5,281       | 0.88%
    Major Character Death              | 3,009       | 0.50%
    Rape/Non-Con                       | 1,650       | 0.27%

    🔍 Key Findings

    • Most works don’t use explicit warnings, or authors choose not to specify them
    • Tags for violence and major character death are more common than other sensitive tags
    • Multiple warnings can appear in the same work
    • Dataset reflects historical snapshot, not current live AO3 stats

    5. Share

    📊 Visualizations Created

    • Bar charts showing archive warning frequency
    • Tableau dashboard to explore archive warnings by fandom

    💡 Communicated Insights

    • Clear breakdown of which warnings are most prevalent
    • Patterns highlight gaps in author tagging practices
    • Supports better decisions for content filtering and moderation policies

    6. Act

    ✔️ Recommendations

    • Improve author tagging UX to encourage accurate warnings
    • Educate authors about importance of content warnings
    • Focus moderation resources on works tagged with higher-risk warnings
    • Repeat analysis with newer data to track changes over time

    ➕ Future Work

    • Analyze overlaps in warnings for nuanced content safety flags
    • Compare warning usage by fandom for genre-specific trends
    • Use reader engagement or feedback data to evaluate warning effectiveness

    📎 Appendix: R Code Snippets

    # Filter archive warning tags
    archive_warnings <- tags %>%
     filter(type == "ArchiveWarning") %>%
     select(warning_id = id, warning_name = name)
    
    # Filter tag mapping for works that use archive warnings
    work_warnings <- work_tag %>%
     filter(tag_id %in% archive_warnings$warning_id)
    
    # Total number of works with at least one archive warning
    total_works_with_warning <- work_warnings %>%
     summarise(total = n_distinct(work_id)) %>%
     pull(total)
    
    # Count per warning and join with tag names
    warning_summary <- work_warnings %>%
     group_by(tag_id) %>%
     summarise(total_works_with_warning = n_distinct(work_id)) %>%
     mutate(percent_of_all_works = (total_works_with_warning / 601286) * 100) %>%
     rename(warning_id =...
    