15 datasets found
  1. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 21, 2023
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

    the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
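    as a rough illustration of the import route described above (parse.SAScii plus an RSQLite database), here's a minimal R sketch. the nber script url and file name below are placeholders rather than the real 2012 asec locations, and the actual download-all-microdata script does considerably more.

    library(SAScii)
    library(DBI)
    library(RSQLite)

    # placeholder locations for the NBER SAS importation script and the ASEC fixed-width file
    sas_script <- "http://www.nber.org/data/progs/cps/cpsmar2012.sas"
    fwf_file   <- "asec2012_pubuse.dat"

    # parse.SAScii() reads the SAS INPUT block to recover column names, widths, and types
    layout <- parse.SAScii(sas_script)

    # read.SAScii() uses that same SAS code to pull the fixed-width file into a data frame
    asec <- read.SAScii(fwf_file, sas_script)

    # stash the person-level table in a SQLite database for memory-friendly analysis
    con <- dbConnect(SQLite(), "cps_asec.sqlite")
    dbWriteTable(con, "asec12", asec, overwrite = TRUE)
    dbDisconnect(con)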

  2. Merger of BNV-D data (2008 to 2019) and enrichment

    • data.europa.eu
    zip
    Updated Jan 16, 2025
    Cite
    Patrick VINCOURT (2025). Merger of BNV-D data (2008 to 2019) and enrichment [Dataset]. https://data.europa.eu/data/datasets/5f1c3eca9d149439e50c740f?locale=en
    Available download formats: zip (18530465 bytes)
    Dataset updated
    Jan 16, 2025
    Dataset authored and provided by
    Patrick VINCOURT
    Description

    Merging (into an R table) the data published on https://www.data.gouv.fr/fr/datasets/ventes-de-pesticides-par-departement/, and joining two other sources of information associated with marketing authorisations (MAs):
    • uses: https://www.data.gouv.fr/fr/datasets/usages-des-produits-phytosanitaires/
    • information on the "Biocontrol" status of the product, from document DGAL/SDQSPV/2020-784 published on 18/12/2020 at https://agriculture.gouv.fr/quest-ce-que-le-biocontrole

    All the initial files (.csv transformed into .txt), the R code used to merge the data, and the different output files are collected in a zip.

    NB:
    1) "YASCUB" stands for {year, AMM, Substance_active, Classification, Usage, Statut_"BioControl"}, substances not on the DGAL/SDQSPV list being coded NA.
    2) The file of biocontrol products has been cleaned of the duplicates generated by marketing authorisations leading to several trade names.
    3) The BNVD_BioC_DY3 table and the output file BNVD_BioC_DY3.txt contain the fields {Code_Region, Region, Dept, Code_Dept, Anne, Usage, Classification, Type_BioC, Quantite_substance}
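    A minimal sketch of the join logic described above, assuming the three source tables have been read into R and share an AMM (marketing authorisation) key; file and column names are illustrative only, not the actual ones in the zip.

    library(dplyr)

    bnvd       <- read.delim("bnvd_2008_2019.txt")                    # BNV-D sales by department
    usages     <- read.delim("usages_produits_phytosanitaires.txt")   # uses per AMM
    biocontrol <- read.delim("liste_biocontrole_DGAL_SDQSPV.txt")     # biocontrol status per AMM

    # left joins keep every BNV-D row; products absent from the DGAL/SDQSPV list end up as NA
    bnvd_enriched <- bnvd %>%
      left_join(usages, by = "AMM") %>%
      left_join(biocontrol, by = "AMM")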

  3. Effect of data source on estimates of regional bird richness in northeastern...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 4, 2021
    Cite
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk (2021). Effect of data source on estimates of regional bird richness in northeastern United States [Dataset]. http://doi.org/10.5061/dryad.m905qfv0h
    Available download formats: zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    Agricultural Research Service
    Massachusetts Audubon Society
    Hebrew University of Jerusalem
    University of Michigan
    Columbia University
    Gettysburg College
    University of Vermont
    New York State Department of Environmental Conservation
    Authors
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Northeastern United States, United States
    Description

    Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.

    Methods Overview

    This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation & mean summer temperature (from PRISM), and NLCD 2006 land-use data.

    Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.

    The methods for compilation are contained in the supplementary information of the manuscript but also here:

    Bird data

    For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).

    Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.

    The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
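    For orientation, the reshaping step described above might look roughly like the following in R; the long-format column names (unit_id, species, presence) and per-state object names are stand-ins, and tidyr::pivot_wider is used here in place of whatever reshaping call the authors actually applied in R 3.6.2.

    library(dplyr)
    library(tidyr)

    # long table of observations -> one row per sampling unit, one column per species
    state_wide <- state_long %>%
      pivot_wider(names_from = species,
                  values_from = presence,
                  values_fill = 0)

    # stack the per-state tables into a single BBA (or BBS) dataset, as described in the text
    bba_all <- rbind.data.frame(ma_wide, mi_wide, ny_wide, pa_wide, vt_wide)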

    Environmental data

    Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.

  4. ReCount - A multi-experiment resource of analysis-ready RNA-seq gene count...

    • neuinfo.org
    • scicrunch.org
    • +2more
    Updated Jun 14, 2021
    Cite
    (2021). ReCount - A multi-experiment resource of analysis-ready RNA-seq gene count datasets [Dataset]. http://identifiers.org/RRID:SCR_001774
    Dataset updated
    Jun 14, 2021
    Description

    RNA-seq gene count datasets built using the raw data from 18 different studies. The raw sequencing data (.fastq files) were processed with Myrna to obtain tables of counts for each gene. For ease of statistical analysis, each count table was combined with sample phenotype data to form an R object of class ExpressionSet. The count tables, ExpressionSets, and phenotype tables are ready to use and freely available. By taking care of several preprocessing steps and combining many datasets into one easily accessible website, the resource makes finding and analyzing RNA-seq data considerably more straightforward.
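    For readers unfamiliar with the ExpressionSet container, a minimal sketch of combining a count table with its phenotype table is shown below; the file names are illustrative, not those used by ReCount.

    library(Biobase)

    counts <- as.matrix(read.table("gene_counts.txt", header = TRUE, row.names = 1))
    pheno  <- read.table("phenotype.txt", header = TRUE, row.names = 1)

    # sample names must line up between the count matrix columns and the phenotype rows
    stopifnot(identical(colnames(counts), rownames(pheno)))

    eset <- ExpressionSet(assayData = counts,
                          phenoData = AnnotatedDataFrame(pheno))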

  5. Replication Data for 'Big G'

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jan 8, 2024
    Cite
    Gernot Mueller; Ernesto Pasten; Raphael Schoenle; Michael Weber (2024). Replication Data for 'Big G' [Dataset]. http://doi.org/10.7910/DVN/8RCMZP
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Gernot Mueller; Ernesto Pasten; Raphael Schoenle; Michael Weber
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The files described below replicate the results of "Big G". They are divided into three parts, which can be found in three different sub-folders: (1) FiveFacts, (2) ModelSimulation, and (3) VAR.

    PART 1: Five Facts on Government spending (folder: FiveFacts)

    This folder contains code to replicate Figures 1-4 and Tables 1-4 in Section 3 of the paper.

    Data Set-Up
    In order to run the included script files, the main dataset needs to be assembled. The data on federal procurement contracts used in this paper is all publicly available from USASpending.gov. The base dataset used for all of the empirical results in this paper consists of the universe of procurement contract transactions from 2001-2019---around 30 GB of data. Due to its size, the data requires a substantial amount of computing power to work with. Our approach was to load the data into a SQL database on a server, following the instructions provided by USASpending.gov, which can be found here: https://files.usaspending.gov/database_download/usaspending-db-setup.pdf. As a result, the replication code cannot feasibly start with the raw dataset, though we have provided the raw files at an annual basis at [INSERT URL FOR SITE HERE]. The files "setup_data_1.R", "setup_data_2.R", "setup_data_3.R", and "setup_data_4.R" pull from the SQL database and create intermediate files that are provided with this replication package. You will NOT be able to run the "set_up" files without setting up your own SQL database, but you CAN run the Figure and Table replication code (described below) using the intermediate files created in the setup files.

    Figures
    Figure 1
    + Step 1: Run 'create_contract_proxy.R,' which creates a dataset called 'contracts_for_ramey_merge.dta'
    + Step 2: Run ramey_zubairy_replication.do, which is a file TAKEN DIRECTLY FROM THE REPLICATION PACKAGE for Ramey & Zubairy (JPE, 2018), found at the link below. We merge our dataset into theirs, and re-run their regressions on our data. Ramey & Zubairy (2018) replication: https://econweb.ucsd.edu/~vramey/research/Ramey_Zubairy_replication_codes.zip.
    Figure 2
    + 'Figure_2a.R' produces Figure 2a using 'intermediate_file_1.RData'
    + 'Figure_2b.R' produces Figure 2b using 'intermediate_file_2.RData'
    Figure 3
    + 'Figure_3a.R' produces Figure 3a using 'intermediate_file_3.RData'
    + 'Figure_3b.R' produces Figure 3b using 'intermediate_file_2.RData'
    Figure 4
    + 'Figure_4.R' produces Figures 4a and 4b using 'intermediate_file_3.RData'

    Tables
    Table 1
    + 'Table_1.do' produces Table 1 using 'contracts_for_ramey_merge.dta'
    Table 2
    + 'Table_2_upper' produces the top portion of Table 2 using the 'sectors_unbalanced.dta' file created in 'setup_data_4.R'
    + 'Table_2_lower' produces the lower portion of Table 2 using the 'firms_unbalanced.dta' file created in 'setup_data_4.R'
    Table 3
    + 'Table_3.R' produces Table 3 using 'intermediate_file_1.RData'
    Table 4
    + Components for Table 4 can be found in 'Figure_3a.R' and 'Figure_3b.R' (noted in those files).

    PART 2: Model Simulation (folder: ModelSimulation)
    + Matlab file MAIN_generateIRFs.m generates Figures 5 and 6 in the paper. It calls the mod file modelG.mod
    + Matlab file MAIN_generateIRFs_htm.m generates Figure A.21 in the Appendix. It calls the mod file modelG_htm.mod
    + Both files run on Dynare 5.4.

    PART 3: VAR (folder: VAR; see README in VAR folder for more detail)
    Data Setup: "setup_var_data.R," like the files in the FiveFacts folder, will not run. They create a dataset of contracts by month and naics2 sector from the SQL database.
    + 'VAR.do' runs the VAR that produces Figure 7.
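    For context, the kind of pull performed by the setup_data scripts could be sketched as below; the connection settings, table name, and column names are placeholders and are not taken from the replication package, which must be pointed at your own USASpending database.

    library(DBI)
    library(RPostgres)

    con <- dbConnect(Postgres(), dbname = "usaspending", host = "localhost")

    # placeholder table/column names standing in for the USASpending schema
    contracts <- dbGetQuery(con, "
      SELECT action_date, obligation_amount, agency_code
      FROM contract_transactions
      WHERE action_date BETWEEN '2001-01-01' AND '2019-12-31'
    ")

    saveRDS(contracts, "intermediate_contracts.rds")
    dbDisconnect(con)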

  6. RPS Galilee Basin: Report on the Hydrogeological Investigations - Appendix...

    • researchdata.edu.au
    • data.gov.au
    • +1more
    Updated Nov 15, 2016
    Cite
    Bioregional Assessment Program (2016). RPS Galilee Basin: Report on the Hydrogeological Investigations - Appendix tables B to F (Spatial) [Dataset]. https://researchdata.edu.au/rps-galilee-basin-f-spatial/2989381
    Dataset updated
    Nov 15, 2016
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 2.5 (CC BY 2.5), https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Area covered
    Galilee Basin
    Description

    Abstract

    The Galilee Basin Operators' Forum (GBOF) is a group of petroleum companies exploring the Galilee Basin for commercial quantities of hydrocarbons. Exploration activities include the search for conventional hydrocarbons, and increasingly non-conventional hydrocarbon sources such as coal seam gas (CSG). The CSG target is the Permian coal measures as shown in Figure 1.1. Understanding and protecting groundwater is a key issue and community concern. As part of the early exploration activities in the Galilee Basin, the GBOF companies have initiated this study to assist in developing a regional and consistent subsurface description, and to document the existing data for the groundwater systems in the Galilee Basin study area. RPS, as an independent company, was contracted to perform the study and prepare a report.

    This initial study should not be confused with a "baseline assessment" or "underground water impact report", which are specific requirements under the Water Act 2000, triggered once production testing is underway or production has commenced. This study gathers and assembles all the base historical data which may be used in further studies. For the Galilee Basin study area, this investigation is specifically designed to:
    • Review stratigraphy and identify possible aquifers beneath the GBOF member company tenures;
    • Delineate aquifers that warrant further monitoring; and
    • Obtain and tabulate current Department of Environment and Resource Management Groundwater Database (DERM GWDB) (now the Department of Environment and Heritage, EHP) registered bore data, including:
      » Water bore location and summary statistics;
      » Groundwater levels and artesian flow data; and
      » Groundwater quality.

    Data sources for this report include:
    • Groundwater data available in the DERM GWDB;
    • Petroleum exploration wells recorded in Queensland Petroleum Exploration Data (QPED);
    • DERM groundwater data logger/tipping bucket rain gauge program;
    • Springs of Queensland Dataset (version 4.0) held by DERM;
    • PressurePlot Version 2 developed by CSIRO and linked to a Pressure-Hydrodynamics database; and
    • Direct communication with GBOF members.

    Data was sourced in January 2011. Since then there has been considerable additional drilling by GBOF members, which is not incorporated in this report. All data has been used by RPS as provided, without independent investigations to validate the data. It is recognised that historical data may be subject to inaccuracies; however, as work progresses in the region, an improvement in data integrity should be realised.

    Dataset History

    Tables as taken from Appendix B to F of the Galilee Basin: Report on the Hydrogeological Investigations, prepared by RPS Australia PTY LTD for RLMS. PR102603-1: Rev 1 / December 2012. Spatial datasets were created for each appendix table using supplied coordinate values (MGA Zone 54, MGA Zone 55, GDA94 Geographics) where available, or by spatially referencing (spatial join) the NGIS QLD core bores dataset via the unique DERM Registered Bore Numbers attribute field.
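    As a hedged illustration of the two georeferencing routes just described, a sketch using the R sf package (rather than the GIS software actually used for the derived product) is shown below; file, layer, and key-column names are placeholders.

    library(sf)
    library(dplyr)

    # route 1: build point geometry from supplied MGA Zone 55 coordinates (GDA94 / EPSG:28355)
    appendix_b    <- read.csv("appendix_B.csv")
    appendix_b_sf <- st_as_sf(appendix_b, coords = c("easting", "northing"), crs = 28355)

    # route 2: attach geometry by joining to the NGIS QLD core bores layer on the registered bore number
    ngis_bores    <- st_read("NGIS_QLD_core_bores.shp")
    appendix_c    <- read.csv("appendix_C.csv")
    appendix_c_sf <- ngis_bores %>%
      inner_join(appendix_c, by = "RegisteredBoreNo")   # placeholder key name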

    Dataset Citation

    Geoscience Australia (XXXX) RPS Galilee Basin: Report on the Hydrogeological Investigations - Appendix tables B to F (Spatial). Bioregional Assessment Derived Dataset. Viewed 16 November 2016, http://data.bioregionalassessments.gov.au/dataset/d3d92616-c0b8-4cfb-9eb5-4031915e5e41.

    Dataset Ancestors

    • Derived From National Groundwater Information System, Queensland Core dataset (superseded)
    • Derived From RPS Galilee Hydrogeological Investigations - Appendix tables B to F (original)

  7. Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 10, 2023
    Cite
    Trujillo, Amaury; Cresci, Stefano (2023). Dataset — Make Reddit Great Again: Assessing Community Effects of Moderation Interventions on r/The_Donald [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6250576
    Dataset updated
    Jan 10, 2023
    Dataset provided by
    IIT-CNR
    Authors
    Trujillo, Amaury; Cresci, Stefano
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.

    An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again

    If you use this dataset please cite the related article.

    The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.

    The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe upper limit. It only has two tables: submissions and comments. It should be noted that the IDs of contents are on base 10 (numeric integer), unlike the original base 36 (alphanumeric) used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another.
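    For example, a base-36 Reddit ID can be converted to the numeric form used in these databases with a few lines of R; note that base::strtoi() overflows 32-bit integers for recent IDs, so this sketch accumulates the value as a double instead.

    # convert a Reddit base-36 ID (e.g. "gjacwx5") to the base-10 form stored in the databases
    base36_to_decimal <- function(id) {
      digits <- strsplit(tolower(id), "")[[1]]
      values <- match(digits, c(0:9, letters)) - 1   # "0"-"9" -> 0-9, "a"-"z" -> 10-35
      Reduce(function(acc, v) acc * 36 + v, values)
    }

    base36_to_decimal("gjacwx5")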

    The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wide (i.e., both within and outside the subreddit) during the dataset timeframe. Core users are defined as those who authored at least one submission or comment per week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also on base 10). The perspective_scores table contains comment toxicity scores.

    The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.

    A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored on a Likert-like manner. One can associate submissions to MBFC scores by doing a join by the domain column.
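    A minimal sketch of that join, assuming the submissions table exposes a domain column matching the one in mbfc_scores.csv:

    library(dplyr)
    library(readr)

    mbfc <- read_csv("mbfc_scores.csv")

    # attach bias and factual-reporting scores to each submission by outlet domain
    submissions_scored <- submissions %>%
      left_join(mbfc, by = "domain")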

  8. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network (http://www.hptn.org/)
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz).
    More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.

  9. Metrical, morphosyntactic, and syntactic analysis of the Rigveda

    • swissubase.ch
    Updated Sep 22, 2025
    Cite
    (2025). Metrical, morphosyntactic, and syntactic analysis of the Rigveda [Dataset]. http://doi.org/10.48656/yc4z-sa04
    Dataset updated
    Sep 22, 2025
    Description

    The dataset contains: • the main data table, RV_data.csv, with morphosyntactic, syntactic and metrical information on each Rigvedic word form, and • a script, disticha.rmd, for the analysis of disticha in the main types of Rigvedic stanzas which were studied as an example for the application of the data table, resulting in the published article: Salvatore Scarlata and Paul Widmer, Syntactic evidence for metrical structure in Rigvedic stanzas, Indo-European Linguistics 13 (2025), 1-21, doi:10.1163/22125892-bja10041, issn: 2212-5892.

    In addition the dataset contains: • a further data table, RV-polylex.csv, in which all compounded word forms are analyzed, and • some ancillary basic scripts for linking the two tables (join.r) and for simplified representations (pivot01–03.r).

    Finally, the dataset contains: • a data table, RV-polylexREJECTS.csv, containing word forms that could not be assessed as compounded.

  10. Area Resource File (ARF)

    • dataverse.harvard.edu
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Area Resource File (ARF) [Dataset]. http://doi.org/10.7910/DVN/8NMSFV
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the area resource file (arf) with r. the arf is fun to say out loud. it's also a single county-level data table with about 6,000 variables, produced by the united states health services and resources administration (hrsa). the file contains health information and statistics for over 3,000 us counties. like many government agencies, hrsa provides only a sas importation script and an ascii file.

    this new github repository contains two scripts:

    2011-2012 arf - download.R
    • download the zipped area resource file directly onto your local computer
    • load the entire table into a temporary sql database
    • save the condensed file as an R data file (.rda), comma-separated value file (.csv), and/or stata-readable file (.dta)

    2011-2012 arf - analysis examples.R
    • limit the arf to the variables necessary for your analysis
    • sum up a few county-level statistics
    • merge the arf onto other data sets, using both fips and ssa county codes
    • create a sweet county-level map

    click here to view these two scripts

    for more detail about the area resource file (arf), visit:
    • the arf home page
    • the hrsa data warehouse

    notes: the arf may not be a survey data set itself, but it's particularly useful to merge onto other survey data. confidential to sas, spss, stata, and sudaan users: time to put down the abacus. time to transition to r. :D
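    a hedged sketch of the "save it three ways, then merge by county code" workflow described above - every object and column name here is illustrative rather than taken from the scripts:

    library(haven)   # write_dta() for the stata copy

    # save the condensed county-level table in the three formats mentioned above
    save(arf, file = "arf2011.rda")
    write.csv(arf, "arf2011.csv", row.names = FALSE)
    write_dta(arf, "arf2011.dta")

    # merge county-level arf statistics onto another data set by fips county code
    merged <- merge(other_county_data, arf, by = "fips", all.x = TRUE)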

  11. FIRST Catalog of FR I Radio Galaxies - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). FIRST Catalog of FR I Radio Galaxies - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/first-catalog-of-fr-i-radio-galaxies
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The authors have built a catalog of 219 Fanaroff and Riley class I edge-darkened radio galaxies (FR Is), called FRICAT, that is selected from a published sample and obtained by combining observations from the NVSS, FIRST, and SDSS surveys. They included in the catalog the sources with an edge-darkened radio morphology, redshift <= 0.15, and extending (at the sensitivity of the FIRST images) to a radius r larger than 30 kpc from the center of the host. The authors also selected an additional sample (sFRICAT) of 14 smaller (10 < r < 30 kpc) FR Is, limiting to z < 0.05. The hosts of the FRICAT sources are all luminous (-21 >~ Mr >~ -24), red early-type galaxies with black hole masses in the range 10^8 <~ M_BH <~ 3 x 10^9 solar masses; the spectroscopic classification based on the optical emission line ratios indicates that they are all low excitation galaxies. Sources in the FRICAT are then indistinguishable from the FR Is belonging to the Third Cambridge Catalogue of Radio Sources (3C) on the basis of their optical properties. Conversely, while the 3C-FR Is show a strong positive trend between radio and [O III] emission line luminosity, these two quantities are unrelated in the FRICAT sources; at a given line luminosity, they show radio luminosities spanning about two orders of magnitude and extending to much lower ratios between radio and line power than 3C-FR Is. The authors' main conclusion is that the 3C-FR Is represent just the tip of the iceberg of a much larger and diverse population of FR Is. This HEASARC table contains both the 219 radio galaxies in the main FRICAT sample listed in Table B.1 of the reference paper and the 14 radio galaxies in the additional sFRICAT sample listed in Table B.2 of the reference paper. To enable users to distinguish from which sample an entry has been taken, the HEASARC created a parameter galaxy_sample which is set to 'M' for galaxies from the main sample, and to 'S' for galaxies from the supplementary sFRICAT sample. Throughout the paper, the authors adopted a cosmology with H0 = 67.8 km s^-1 Mpc^-1, OmegaM = 0.308, and OmegaLambda = 0.692 (Planck Collaboration XIII 2016). This table was created by the HEASARC in February 2017 based on electronic versions of Tables B.1 and B.2 that were obtained from the Astronomy & Astrophysics website. This is a service provided by NASA HEASARC.

  12. Updated Australian bathymetry: merged 250m bathyTopo

    • data.csiro.au
    • researchdata.edu.au
    Updated Sep 15, 2021
    Cite
    Julian O'Grady; Claire Trenham; Ron Hoeke (2021). Updated Australian bathymetry: merged 250m bathyTopo [Dataset]. http://doi.org/10.25919/cm17-xc81
    Dataset updated
    Sep 15, 2021
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Julian O'Grady; Claire Trenham; Ron Hoeke
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2009 - Aug 31, 2021
    Area covered
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    Accurate coastal wave and hydrodynamic modelling relies on quality bathymetric input. Many national scale modelling studies, hindcast and forecast products have used, or are currently using, a 2009 digital elevation model (DEM), which does not include recently available bathymetric surveys and is now out of date. There are immediate needs for an updated national product, preceding the delivery of the AusSeabed program’s Global Multi-Resolution Topography for Australian coastal and ocean models. There are also challenges in stitching coarse resolution DEMs, which are often too shallow where they meet high-resolution information (e.g. LiDAR surveys) and require supervised/manual modifications (e.g. NSW, Perth, and Portland VIC bathymetries). This report updates the 2009 topography and bathymetry with a selection of nearshore surveys and demonstrates where the 2009 dataset and nearshore bathymetries do not match up.

    Lineage: All of the datasets listed in Table 1 (see supporting files) were used in previous CSIRO internal projects or downloaded from online data portals and processed using QGIS and R’s ‘raster’ package. The Perth LiDAR surveys were provided as points and gridded in R using raster::rasterFromXYZ(). The Macquarie Harbour contour lines were regridded in QGIS using the TIN interpolator. Each dataset was mapped with an accompanying Type Identifier (TID) following the conventions of the GEBCO dataset. The mapping went through several iterations; at each iteration the blending was checked for inconsistency, i.e. where the GA250m DEM was too shallow when it met the high-resolution LiDAR surveys. QGIS v3.16.4 was used to draw masks over inconsistent blending, and GA250 values falling within the mask and between two depths were assigned NA (no-data). LiDAR datasets were projected to +proj=longlat +datum=WGS84 +no_defs using raster::projectRaster(), resampled to the GA250 grid using raster::resample() and then merged with raster::merge(). Nearest neighbour resampling was performed for all datasets except for the GEBCO ~500m product, which used the bilinear method. The order of the mapping overlay is sequential from TID = 1 being the base, through to 107, where 0 is the gap-filled values.

    Permissions are required for all code and internal datasets (Contact Julian OGrady).
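    A condensed sketch of the gridding, reprojection, resampling, and merge chain named in the lineage; file names are placeholders and only the raster calls mentioned above are used.

    library(raster)

    ga250     <- raster("ga250_bathytopo.tif")        # base 250 m national grid
    lidar_pts <- read.csv("perth_lidar_points.csv")   # assumed columns: x, y, z

    lidar  <- rasterFromXYZ(lidar_pts)                                          # grid the LiDAR points
    lidar  <- projectRaster(lidar, crs = "+proj=longlat +datum=WGS84 +no_defs") # match the GA250 projection
    lidar  <- resample(lidar, ga250, method = "ngb")                            # nearest neighbour, as in the text
    merged <- merge(lidar, ga250)                                               # first layer wins where both have values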

  13. Microclimate Sensor Locations - Historical

    • data.melbourne.vic.gov.au
    csv, excel, geojson +1
    Updated Nov 13, 2022
    Cite
    (2022). Microclimate Sensor Locations - Historical [Dataset]. https://data.melbourne.vic.gov.au/explore/dataset/microclimate-sensor-locations/
    Available download formats: geojson, json, csv, excel
    Dataset updated
    Nov 13, 2022
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Note: these sensors have been removed and new microclimate sensors have been installed in the city. Please check https://data.melbourne.vic.gov.au/explore/dataset/microclimate-sensors-data/table/

    This dataset contains the historical location and location description for each microclimate sensor device installed throughout the city. Each microclimate sensor device will have several climate sensors embedded inside. Sensor devices are typically installed on a street pole, and locations are selected based on relevant project criteria.

    Since the beginning of the Microclimate Sensor Readings dataset in 2019, some sensor devices have been relocated. This may be for various reasons such as construction works or re-assignment to new climate projects. Any changes to sensor locations are important to consider when analysing and interpreting historical data.

    The site_id column can be used to merge the data with related dataset (linked below). Site_id refers to the location of the unique sensor device which may have changed over time.

    Related datasets: Microclimate Sensor - Sensor Readings
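    A small sketch of the site_id merge suggested above, assuming both datasets have been exported as CSV and share the site_id column:

    library(dplyr)

    locations <- read.csv("microclimate-sensor-locations.csv")
    readings  <- read.csv("microclimate-sensors-data.csv")

    # attach the historical location details to each sensor reading
    readings_with_location <- readings %>%
      left_join(locations, by = "site_id")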

  14. Data from: Using Deep Learning to Fill Data Gaps in Environmental Footprint...

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    xlsx
    Updated Jun 10, 2023
    Cite
    Bu Zhao; Chenyang Shuai; Shen Qu; Ming Xu (2023). Using Deep Learning to Fill Data Gaps in Environmental Footprint Accounting [Dataset]. http://doi.org/10.1021/acs.est.2c01640.s001
    Available download formats: xlsx
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    ACS Publications
    Authors
    Bu Zhao; Chenyang Shuai; Shen Qu; Ming Xu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Environmental footprint accounting relies on economic input–output (IO) models. However, the compilation of IO models is costly and time-consuming, leading to a lack of timely, detailed IO data. The RAS method is traditionally used to predict future IO tables but is often questioned for producing unreliable estimates. Here we develop a machine learning-augmented method to improve the accuracy of IO table prediction, using the US summary-level tables as a demonstration. The model is constructed by combining the RAS method with a deep neural network (DNN) model, in which the RAS method provides a baseline prediction and the DNN model makes further improvements in the areas where RAS tends to have poor performance. Our results show that the DNN model can significantly improve performance in those areas of IO tables for short-term prediction (one year) where RAS alone performs poorly: R2 improved from 0.6412 to 0.8726, and median APE decreased from 37.49% to 11.35%. For long-term prediction (5 years), the improvements are even more significant: R2 improved from 0.5271 to 0.7893, and median average percentage error decreased from 51.12% to 18.26%. Our case study evaluating the US carbon footprint accounts based on the estimated IO table also demonstrates the applicability of the model. Our method can help generate timely IO tables to provide fundamental data for a variety of environmental footprint analyses.
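    For readers unfamiliar with RAS balancing, the baseline step can be sketched in a few lines of R: alternately rescale a prior IO matrix until its row and column sums match target margins. This is a generic illustration, not the authors' code, and the toy numbers are made up.

    # RAS / iterative proportional fitting: alternately rescale rows and columns
    ras_balance <- function(A0, u, v, tol = 1e-8, max_iter = 1000) {
      A <- A0
      for (i in seq_len(max_iter)) {
        A <- A * (u / rowSums(A))          # match target row sums u
        A <- t(t(A) * (v / colSums(A)))    # match target column sums v
        if (max(abs(rowSums(A) - u), abs(colSums(A) - v)) < tol) break
      }
      A
    }

    # toy 3x3 prior table with target margins that share the same total (50)
    A0 <- matrix(c(10, 5, 2,
                    4, 8, 3,
                    1, 2, 6), nrow = 3, byrow = TRUE)
    u  <- c(20, 18, 12)   # target row sums
    v  <- c(17, 18, 15)   # target column sums
    A_ras <- ras_balance(A0, u, v)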

  15. AO3 2021 Snapshot Dataset

    • kaggle.com
    zip
    Updated Jul 22, 2025
    Cite
    Zaynab Badawy (2025). AO3 2021 Snapshot Dataset [Dataset]. https://www.kaggle.com/datasets/zaynabbadawy/ao3-2021-snapshot-dataset
    Available download formats: zip (94887935 bytes)
    Dataset updated
    Jul 22, 2025
    Authors
    Zaynab Badawy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📊 Case Study: Analysis of Archive Warnings in AO3 Fanfiction Works

    1. Ask

    📌 Business Task

    As a junior data analyst at a fanfiction analytics consultancy, I was tasked with analyzing how archive warnings are distributed across fanfiction works on Archive of Our Own (AO3). The client is interested in understanding:

    • The prevalence of content warnings
    • How they inform content tagging accuracy
    • Ways to improve reader experience and moderation

    🔍 Key Questions

    • How many works contain each type of archive warning?
    • What percentage of overall works each warning represents?
    • Are there trends or overlaps in warning usage?

    👥 Stakeholders

    • AO3 content moderation team
    • Fanfiction readers and communities
    • Client-side data/product teams for content safety and discovery

    🎯 How Insights Help

    Better understanding of archive warnings can:

    • Enhance tagging algorithms
    • Improve content filtering
    • Guide moderation policies
    • Promote transparency and safety

    2. Prepare

    📁 Data Sources

    Dataset includes ~600,000 AO3 fanfiction works, organized across three tables:

    • works: Metadata on fanfiction works
    • tags: Includes tag types like archive warnings and fandoms
    • work_tag: Many-to-many mapping of works and tags

    🧹 Data Cleaning

    • Imported into RStudio
    • Tags linked via work_id and tag_id
    • Filtered for type == "ArchiveWarning" and type == "Fandom"
    • Converted relevant columns to correct types (integer, character)
    • Removed incomplete or inconsistent entries

    ✅ Data Quality

    • Validated counts against known AO3 stats
    • Confirmed warning totals made logical sense
    • Dataset is anonymized and public — no privacy concerns

    3. Process

    🧰 Tools Used

    • MySQL — for initial slicing of large tables
    • R (tidyverse) — for transformation, filtering, summarizing
    • Tableau — for interactive visualizations

    🔧 Key Transformations

    • Parsed concatenated tag ID strings
    • Filtered tags by type
    • Grouped by archive warning and counted distinct work IDs
    • Calculated percentages based on total works

    4. Analyze

    📈 Summary Statistics

    • Total works in dataset: 601,286
    • Works with at least one archive warning tag: 61,576 (~10.2%)

    🏷️ Top 5 Archive Warnings by Frequency

    Warning Name                       | Total Works | % of All Works
    -----------------------------------|-------------|---------------
    No Archive Warnings Apply          | 32,051      | 5.33%
    Choose Not To Use Archive Warnings | 21,591      | 3.59%
    Graphic Depictions Of Violence     | 5,281       | 0.88%
    Major Character Death              | 3,009       | 0.50%
    Rape/Non-Con                       | 1,650       | 0.27%

    🔍 Key Findings

    • Most works don’t use explicit warnings, or authors choose not to specify them
    • Tags for violence and major character death are more common than other sensitive tags
    • Multiple warnings can appear in the same work
    • Dataset reflects historical snapshot, not current live AO3 stats

    5. Share

    📊 Visualizations Created

    • Bar charts showing archive warning frequency
    • Tableau dashboard to explore archive warnings by fandom

    💡 Communicated Insights

    • Clear breakdown of which warnings are most prevalent
    • Patterns highlight gaps in author tagging practices
    • Supports better decisions for content filtering and moderation policies

    6. Act

    ✔️ Recommendations

    • Improve author tagging UX to encourage accurate warnings
    • Educate authors about importance of content warnings
    • Focus moderation resources on works tagged with higher-risk warnings
    • Repeat analysis with newer data to track changes over time

    ➕ Future Work

    • Analyze overlaps in warnings for nuanced content safety flags
    • Compare warning usage by fandom for genre-specific trends
    • Use reader engagement or feedback data to evaluate warning effectiveness

    📎 Appendix: R Code Snippets

    # Filter archive warning tags
    archive_warnings <- tags %>%
     filter(type == "ArchiveWarning") %>%
     select(warning_id = id, warning_name = name)
    
    # Filter tag mapping for works that use archive warnings
    work_warnings <- work_tag %>%
     filter(tag_id %in% archive_warnings$warning_id)
    
    # Total number of works with at least one archive warning
    total_works_with_warning <- work_warnings %>%
     summarise(total = n_distinct(work_id)) %>%
     pull(total)
    
    # Count per warning and join with tag names
    warning_summary <- work_warnings %>%
     group_by(tag_id) %>%
     summarise(total_works_with_warning = n_distinct(work_id)) %>%
     mutate(percent_of_all_works = (total_works_with_warning / 601286) * 100) %>%
     rename(warning_id =...
    