CC0 1.0 Universal https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline, or CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer); the pipeline further demultiplexes the reads by sample and prepares data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz), as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, the different sequence collections were saved and viewed in the Geneious program.
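A minimal R sketch of the decompress-and-combine step described above, assuming the Bioconductor Biostrings package; the folder layout and the "sUMI" file pattern are illustrative assumptions, not the pipeline's actual names:

```r
# Hedged sketch (not the paper's exact code): unpack each consensus_*.tar.gz
# and pool the sUMI consensus sequences into one fasta file.
library(Biostrings)

tarballs <- list.files("Pipeline_Outputs", pattern = "^consensus_.*\\.tar\\.gz$",
                       full.names = TRUE)
for (tb in tarballs) untar(tb, exdir = "consensus_unpacked")

fastas <- list.files("consensus_unpacked", pattern = "sUMI.*\\.fasta$",
                     recursive = TRUE, full.names = TRUE)
all_sumi <- do.call(c, lapply(fastas, readDNAStringSet))
writeXStringSet(all_sumi, "all_sUMI_sequences.fasta")
```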
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
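A minimal base-R sketch of this collation step; the unpacked folder layout and the folder-name-as-sample rule are assumptions for illustration:

```r
# Hedged sketch: gather every per-sample dUMI_ranked.csv, tag each with a
# sample ID inferred from its parent folder, and write one combined table.
files <- list.files("tagged_unpacked", pattern = "dUMI_ranked\\.csv$",
                    recursive = TRUE, full.names = TRUE)
dfs <- lapply(files, function(f) {
  df <- read.csv(f)
  df$sample <- basename(dirname(f))  # parent folder assumed to name the sample
  df
})
dUMI_df <- do.call(rbind, dfs)
write.csv(dUMI_df, "dUMI_df.csv", row.names = FALSE)
```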
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into the Prism software to create the final figures for the paper.
Introduction
The Annual Survey of Industries (ASI) is the principal source of industrial statistics in India. It provides statistical information to assess changes in the growth, composition and structure of the organised manufacturing sector, comprising activities related to manufacturing processes, repair services, gas and water supply and cold storage. The industrial sector occupies an important position in the State economy and has a pivotal role to play in rapid and balanced economic development. The Survey is conducted annually under the statutory provisions of the Collection of Statistics Act, 1953, and the Rules framed thereunder in 1959, except in the State of Jammu & Kashmir, where it is conducted under the State Collection of Statistics Act, 1961 and the rules framed thereunder in 1964.
Coverage of the Annual Survey of Industries extends to the entire Factory Sector, comprising industrial units (called factories) registered under sections 2(m)(i) and 2(m)(ii) of the Factories Act, 1948, wherein a "Factory", the primary statistical unit of enumeration for the ASI, is defined as "any premises including the precincts thereof: (i) wherein ten or more workers are working or were working on any day of the preceding twelve months, and in any part of which a manufacturing process is being carried on with the aid of power or is ordinarily so carried on, or (ii) wherein twenty or more workers are working or were working on any day of the preceding twelve months, and in any part of which a manufacturing process is being carried on without the aid of power." In addition to sections 2(m)(i) and 2(m)(ii) of the Factories Act, 1948, electricity units registered with the Central Electricity Authority and bidi and cigar units registered under the Bidi & Cigar Workers (Conditions of Employment) Act, 1966 are also covered in the ASI.
The primary unit of enumeration in the survey is a factory in the case of manufacturing industries, a workshop in the case of repair services, an undertaking or a licensee in the case of electricity, gas and water supply undertakings, and an establishment in the case of the bidi and cigar industries. The owner of two or more establishments located in the same State, pertaining to the same industry group and belonging to the same scheme (census or sample), is, however, permitted to furnish a single consolidated return. Such consolidated returns are a common feature in the case of bidi and cigar establishments, electricity undertakings and certain public sector undertakings.
The survey covers factories registered under the Factories Act, 1948. Establishments under the control of the Defence Ministry, oil storage and distribution units, restaurants and cafes, and technical training institutions not producing anything for sale or exchange were kept outside the coverage of the ASI.
Census and Sample survey data [cen/ssd]
Sampling Procedure
All the factories in the updated frame (universe) are divided into two sectors, viz., Census and Sample.
Census Sector: Census Sector is defined as follows:
a) All industrial units belonging to the 12 less industrially developed states/UTs, viz. Goa, Himachal Pradesh, J&K, Manipur, Meghalaya, Nagaland, Tripura, Andaman & Nicobar Islands, Chandigarh, Dadra & Nagar Haveli, Daman & Diu and Pondicherry, were completely enumerated every year along with census units.
b) For the rest of the states/UTs: (i) units having 50 or more workers and using power, or 100 or more workers without using power, and all electricity undertakings; (ii) all the industry groups for which the total number of units did not exceed 50 at the all-India level.
c) The remaining units, excluding those of the Census Sector, called the Sample Sector, were covered in two consecutive years (50% samples in alternate years). The sampling strategy was stratified uni-stage with State × NIC 3-digit as the stratum. The strata were formed by grouping factories within each State/UT by the industry group at the ultimate digit level of the NIC; thus, in each state, each industry group constitutes a stratum. Within each stratum the districts were first arranged in ascending order of district codes, and within each district the factories were then listed in descending order of their employment size. The factories within each stratum, having been arranged in the above manner, were allotted a running serial number. Factories with odd serial numbers were surveyed in the first year and those with even numbers in the second year of a two-year cycle.
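A small R illustration of the within-stratum ordering and odd/even split described above; the column names are assumptions, not the ASI frame's actual field names:

```r
# Sketch of the sample-sector allocation: districts ascending, employment
# descending, then odd serials to year 1 and even serials to year 2.
allocate_stratum <- function(frame) {
  frame <- frame[order(frame$district_code, -frame$employment), ]
  frame$serial <- seq_len(nrow(frame))
  frame$survey_year <- ifelse(frame$serial %% 2 == 1, "Year 1", "Year 2")
  frame
}
```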
There was no deviation from the sample design in ASI 1983-84.
Statutory returns submitted by factories, as well as face-to-face interviews.
The Annual Survey of Industries 1983-84 questionnaire is divided into different blocks; however, only summarised data is available for processing and analysis. Therefore, there is only one merged data file for ASI Summary 1983-84. The record layout of the merged file is provided.
Pre-data-entry scrutiny was carried out on the schedules for inter- and intra-block consistency checks. Such editing was mostly manual, although some editing was automatic. For major inconsistencies, the schedules were referred back to NSSO (FOD) for clarifications/modifications.
The code list, State code list and NIC-70 code list, which were used for editing and data processing, may be referred to in the External Resources.
Relative Standard Error (RSE) is calculated for the number of workers, wages to workers and GVA using the prescribed formula. Programs developed in Visual FoxPro are used to compute the RSE of estimates.
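The prescribed formula itself is not reproduced in this release; for reference, the standard definition of the relative standard error of an estimate $\hat{Y}$ is

$$\mathrm{RSE}(\hat{Y}) \;=\; 100 \times \frac{\sqrt{\widehat{\mathrm{Var}}(\hat{Y})}}{\hat{Y}},$$

where $\widehat{\mathrm{Var}}(\hat{Y})$ is the estimated sampling variance under the stratified design; the ASI-specific variance estimator is not given here.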
To check the consistency and reliability of the data, they are compared with the NIC 2-digit level growth rates from the all-India Index of Industrial Production (IIP) and with the growth rates obtained from the National Accounts Statistics at current and constant prices for the registered manufacturing sector.
The Bureau of the Census has released Census 2000 Summary File 1 (SF1) 100-percent data. The file includes the following population items: sex, age, race, Hispanic or Latino origin, household relationship, and household and family characteristics. Housing items include occupancy status and tenure (whether the unit is owner or renter occupied). SF1 does not include information on incomes, poverty status, overcrowded housing or age of housing. These topics will be covered in Summary File 3. Data are available for states, counties, county subdivisions, places, census tracts, block groups, and, where applicable, American Indian and Alaskan Native Areas and Hawaiian Home Lands. The SF1 data are available on the Bureau's web site and may be retrieved from American FactFinder as tables, lists, or maps. Users may also download a set of compressed ASCII files for each state via the Bureau's FTP server.

There are over 8,000 data items available for each geographic area. The full listing of these data items is available here as a downloadable compressed database file named TABLES.ZIP. The uncompressed file is in FoxPro database file (dbf) format and may be imported into ACCESS, EXCEL, and other software formats. While all of this information is useful, the Office of Community Planning and Development has downloaded selected information for all states and areas and is making this information available on the CPD web pages. The tables and data items selected are those items used in the CDBG and HOME allocation formulas plus topics most pertinent to the Comprehensive Housing Affordability Strategy (CHAS), the Consolidated Plan, and similar overall economic and community development plans.

The information is contained in five compressed (zipped) dbf tables for each state. When uncompressed, the tables are ready for use with FoxPro and they can be imported into ACCESS, EXCEL, and other spreadsheet, GIS and database software. The data are at the block group summary level. The first two characters of the file name are the state abbreviation. The next two letters are BG for block group. Each record is labeled with the code and name of the city and county in which it is located so that the data can be summarized to higher-level geography. The last part of the file name describes the contents. The GEO file contains standard Census Bureau geographic identifiers for each block group, such as the metropolitan area code and congressional district code; the only data included in this table are total population and total housing units. POP1 and POP2 contain selected population variables, and selected housing items are in the HU file. The MA05 table data is only for use by State CDBG grantees for the reporting of the racial composition of beneficiaries of Area Benefit activities. The complete package for a state consists of the dictionary file named TABLES and the five data files for the state. The logical record number (LOGRECNO) links the records across tables.
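A minimal R sketch of the LOGRECNO join described above, assuming the foreign package for reading dbf files; the Maryland ("MDBG...") file names follow the naming pattern described but are assumptions:

```r
# Hedged sketch: join a state's block-group tables on the logical record number.
library(foreign)
geo  <- read.dbf("MDBGGEO.dbf")   # geographic identifiers per block group
pop1 <- read.dbf("MDBGPOP1.dbf")  # selected population variables
hu   <- read.dbf("MDBGHU.dbf")    # selected housing items
bg <- merge(geo, pop1, by = "LOGRECNO")
bg <- merge(bg,  hu,   by = "LOGRECNO")
```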
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 11/15/2024
This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has evolved throughout the peer-review process.
#Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
#Code information
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
List of the data tables as part of the Immigration system statistics Home Office release. Summary and detailed data tables covering the immigration system, including out-of-country and in-country visas, asylum, detention, and returns.
If you have any feedback, please email MigrationStatsEnquiries@homeoffice.gov.uk.
The Microsoft Excel .xlsx files may not be suitable for users of assistive technology.
If you use assistive technology (such as a screen reader) and need a version of these documents in a more accessible format, please email MigrationStatsEnquiries@homeoffice.gov.uk
Please tell us what format you need. It will help us if you say what assistive technology you use.
Immigration system statistics, year ending September 2025
Immigration system statistics quarterly release
Immigration system statistics user guide
Publishing detailed data tables in migration statistics
Policy and legislative changes affecting migration to the UK: timeline
Immigration statistics data archives
Passenger arrivals summary tables, year ending September 2025 (ODS, 31.5 KB): https://assets.publishing.service.gov.uk/media/691afc82e39a085bda43edd8/passenger-arrivals-summary-sep-2025-tables.ods
‘Passengers refused entry at the border summary tables’ and ‘Passengers refused entry at the border detailed datasets’ have been discontinued. The latest published versions of these tables are from February 2025 and are available in the ‘Passenger refusals – release discontinued’ section. A similar data series, ‘Refused entry at port and subsequently departed’, is available within the Returns detailed and summary tables.
Electronic travel authorisation detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 58.6 KB): https://assets.publishing.service.gov.uk/media/691b03595a253e2c40d705b9/electronic-travel-authorisation-datasets-sep-2025.xlsx
ETA_D01: Applications for electronic travel authorisations, by nationality
ETA_D02: Outcomes of applications for electronic travel authorisations, by nationality
Entry clearance visas summary tables, year ending September 2025 (ODS, 53.3 KB): https://assets.publishing.service.gov.uk/media/6924812a367485ea116a56bd/visas-summary-sep-2025-tables.ods
Entry clearance visa applications and outcomes detailed datasets, year ending September 2025 (MS Excel Spreadsheet, 30.2 MB): https://assets.publishing.service.gov.uk/media/691aebbf5a253e2c40d70598/entry-clearance-visa-outcomes-datasets-sep-2025.xlsx
Vis_D01: Entry clearance visa applications, by nationality and visa type
Vis_D02: Outcomes of entry clearance visa applications, by nationality, visa type, and outcome
Additional data relating to in-country and overseas
Name: GoiEner smart meters data

Summary: The dataset contains hourly time series of electricity consumption (kWh) provided by the Spanish electricity retailer GoiEner. The time series are arranged in four compressed files plus a metadata table:
raw.tzst: raw time series of all GoiEner clients (any date, any length, may have missing samples).
imp-pre.tzst: processed time series (imputation of missing samples), longer than one year, collected before March 1, 2020.
imp-in.tzst: processed time series (imputation of missing samples), longer than one year, collected between March 1, 2020 and May 30, 2021.
imp-post.tzst: processed time series (imputation of missing samples), longer than one year, collected after May 30, 2021.
metadata.csv: relevant information for each time series.

License: CC-BY-SA

Acknowledge: These data have been collected in the framework of the WHY project. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 891943.

Disclaimer: The sole responsibility for the content of this publication lies with the authors. It does not necessarily reflect the opinion of the Executive Agency for Small and Medium-sized Enterprises (EASME) or the European Commission (EC). EASME and the EC are not responsible for any use that may be made of the information contained therein.

Collection Date: From November 2, 2014 to June 8, 2022.
Publication Date: December 1, 2022.
DOI: 10.5281/zenodo.7362094
Other repositories: None.
Author: GoiEner, University of Deusto.

Objective of collection: This dataset was originally used to establish a methodology for clustering households according to their electricity consumption.

Description: The meaning of each column is described next for each file. raw.tzst (no column names provided): timestamp; electricity consumption in kWh. imp-pre.tzst, imp-in.tzst, imp-post.tzst: "timestamp": timestamp; "kWh": electricity consumption in kWh; "imputed": binary value indicating whether the row has been obtained by imputation.
metadata.csv: "user": 64-character hash identifying a user; "start_date": initial timestamp of the time series; "end_date": final timestamp of the time series; "length_days": number of days elapsed between the initial and the final timestamps; "length_years": number of years elapsed between the initial and the final timestamps; "potential_samples": number of samples that would lie between the initial and the final timestamps of the time series if there were no missing values; "actual_samples": number of actual samples of the time series; "missing_samples_abs": number of potential samples minus actual samples; "missing_samples_pct": potential samples minus actual samples as a percentage; "contract_start_date": contract start date; "contract_end_date": contract end date; "contracted_tariff": type of tariff contracted (2.X: households and SMEs, 3.X: SMEs with high consumption, 6.X: industries, large commercial areas, and farms); "self_consumption_type": the type of self-consumption to which the users are subscribed; "p1" through "p6": contracted power (in kW) for each of the six time slots; "province": province where the user is located; "municipality": municipality where the user is located (municipalities below 50,000 inhabitants have been removed); "zip_code": post code (post codes of municipalities below 50,000 inhabitants have been removed); "cnae": CNAE (Clasificación Nacional de Actividades Económicas) code for economic activity classification.

5-star rating: ⭐⭐⭐

Preprocessing steps: Data cleaning (imputation of missing values using the Last Observation Carried Forward algorithm with weekly seasons); data integration (combination of multiple SIMEL files, i.e. the data sources); data transformation (anonymization, unit conversion, metadata generation).

Reuse: This dataset is related to the datasets "A database of features extracted from different electricity load profiles datasets" (DOI 10.5281/zenodo.7382818), where time series feature extraction has been performed, and "Measuring the flexibility achieved by a change of tariff" (DOI 10.5281/zenodo.7382924), where the metadata has been extended to include the results of a socio-economic characterization and the answers to a survey about barriers to adapting to a change of tariff.

Update policy: There might be a single update in mid-2023.

Ethics and legal aspects: The data provided by GoiEner contained values of the CUPS (Meter Point Administration Number), which are personal data. A pre-processing step has been carried out to replace the CUPS with random 64-character hashes.

Technical aspects: raw.tzst contains a 15.1 GB folder with 25,559 CSV files; imp-pre.tzst contains a 6.28 GB folder with 12,149 CSV files; imp-in.tzst contains a 4.36 GB folder with 15,562 CSV files; and imp-post.tzst contains a 4.01 GB folder with 17,519 CSV files.

Other: None.
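A hedged R sketch of the imputation step named above: "LOCF using weekly seasons" is read here as filling a missing hour from the same hour one week (168 hours) earlier; the WHY project's actual implementation may differ.

```r
# Sketch: week-seasonal last-observation-carried-forward on an hourly kWh vector.
impute_weekly_locf <- function(kwh, period = 168L) {
  for (i in seq_along(kwh)) {
    if (is.na(kwh[i]) && i > period && !is.na(kwh[i - period]))
      kwh[i] <- kwh[i - period]  # copy the value from one week earlier
  }
  kwh
}
```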
In-situ chemical oxidation (ICO) is a remediation technology that involves the addition of chemicals to the substrate that degrade contaminants through oxidation processes. This series of field experiments, conducted at the Old Casey Powerhouse/Workshop, investigates the potential for the use of ICO technology in Antarctica on petroleum hydrocarbon contaminated sediments.
Surface application was made using 12.5% sodium hypochlorite, 6.25% sodium hypochlorite, 30% hydrogen peroxide and Fenton's Reagent (sodium hypochlorite with an iron catalyst) on five separate areas of petroleum hydrocarbon contaminated sediments. Sampling was conducted before and after chemical application from the top soil section (0 - 5 cm) and at depth (10 - 15 cm).
The data are stored in an Excel file.
This work was completed as part of ASAC project 1163 (ASAC_1163).
The spreadsheet is divided up as follows:
The first 51 sheets are the raw GC-FID data for the 99/00 field season, labelled by sample name. These sheets use the same format as the radiometric GC-FID spreadsheet in the metadata record entitled 'Mineralisation results using 14C octadecane at a range of temperatures'. Sample name format consists of a location or experiment indicator (CW=Casey Workshop, BR= Small-scale field trial), the year the sample was collected (00=2000), the sample type (S=Soil) and a sequence number.
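A hedged R sketch of splitting a sample name into the documented parts; the exact layout is an assumption (a name like "CW00S01" is assumed: two-letter indicator, two-digit year, type letter, sequence number):

```r
# Sketch: parse a sample name into location/experiment, year, type, sequence.
parse_sample <- function(name) {
  m <- regmatches(name, regexec("^([A-Z]{2})(\\d{2})([A-Z])(\\d+)$", name))[[1]]
  if (length(m) == 0) return(NULL)  # name does not match the assumed layout
  list(location = m[2],             # e.g. "CW" = Casey Workshop
       year     = m[3],             # two-digit year, e.g. "00" = 2000
       type     = m[4],             # e.g. "S" = Soil
       sequence = as.integer(m[5]))
}
parse_sample("CW00S01")  # hypothetical: Casey Workshop, 2000, Soil, sample 1
```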
SUMMARY and PRINTABLE VERSION contain the same data in different formats; PRINTABLE VERSION is printer-friendly. The summary data includes the hydrocarbon concentrations corrected for dry weight of soil, and biodegradation and weathering indices.
GRAPHS are graphs.
FIELD MEASUREMENTS shows the results of the measurements taken in the field, including PID (ppm), soil temperature (C), air temperature (C), pH and MC (moisture content, %).
NOTES shows the chemicals added to each trial, and a short summary of the samples.
The next 21 sheets show the raw GC-FID data for the 00/01 field season, labelled according to the previously explained method. PRINTABLE (0001) is a summary of the raw GC-FID data.
The next 3 sheets show the raw GC-FID data for the 01/02 field season, labelled according to the previously explained method. PRINTABLE (0102) is a summary of the raw GC-FID data.
MPN-NOTES shows lab book references and set up summary for the Most Probable Number (MPN) analysis.
MPN-DETAILS shows the set up details, calculations and results for each MPN analysis.
MPN-RESULTS shows the raw MPN data.
MPN-Calculations shows the results from the MPN Calculator.
The fields in the dataset are: Retention Time; Area; % Area; Height of peak; Amount; Int Type; Units; Peak Type; Codes.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
A dataset providing information on the vehicle types and counts in several locations in Leeds.

Purpose of the project: The aim of this work was to examine the profile of vehicle types in Leeds, in order to compare local emissions with national predictions. Traffic was monitored for a period of one week at two Inner Ring Road locations in April 2016 and at seven sites around the city in June 2016. The vehicle registration data was then sent to the Department for Transport (DfT), who combined it with their vehicle type data, replacing the registration number with an anonymised 'Unique ID'.

The data is provided in three folders:
Raw Data – contains the data in the format it was received, and a sample of each format.
Processed Data – the data after processing by LCC, lookup tables, and sample data.
Outputs – Excel spreadsheets summarising the data for each site, for various times/dates.

Initially a dataset was received for the Inner Ring Road (see file "IRR ANPR matched to DFT vehicle type list.csv") with vehicle details, but with missing/uncertain data on the vehicles' emissions Eurostandard class. Of the 820,809 recorded journeys, from the pseudo registration number field (UniqueID) it was determined that there were 229,891 unique vehicles and 31,912 unique "vehicle types" based on the unique concatenated vehicle description fields. It was therefore decided to import the data into an MS Access database, create a table of vehicle types, and add the necessary fields/data so that, combined with the year of manufacture/vehicle registration, the appropriate Eurostandard could be determined for the particular vehicle. The criteria for the Eurostandards were derived mainly from www.dieselnet.com and summarised in a spreadsheet ("EuroStandards.xlsx"). Vehicle types were assigned to a "VehicleClass" (see "Lookup Tables.xlsx") and "EU class", with additional fields being added for any modified data (Gross Vehicle Weight – "GVM_Mod"; engine capacity – "EngineCC_mod"; number of passenger seats – "PassSeats"; and kerb weight – "KerbWt"). Missing data was added from internet lookups, extrapolation from known data, and by association – e.g. 99% of cars with an engine size

Additional data was then received from the Inner Ring Road site, giving journey date/time and incorporating the taxi data for licensed taxis in Leeds. Similar data for Sites 1-7 was also then received, and processed to determine the "VehicleClass" and "EU class". A mixture of update queries and VBA processing was then used to provide the Level 1-6 breakdown of vehicle types (see "Lookup Tables.xlsx"). The data was then combined into one database, so that the required Excel spreadsheets could be exported for the required time/date periods (see "Outputs" folder).
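A hedged R illustration of assigning an approximate Eurostandard class from a car's year of first registration. The breakpoints are indicative EU introduction dates only; the project's actual criteria are in "EuroStandards.xlsx" (derived from www.dieselnet.com) and differ by vehicle class and fuel.

```r
# Sketch: map registration year to an approximate Eurostandard class.
euro_class <- function(reg_year) {
  cut(reg_year,
      breaks = c(-Inf, 1992, 1996, 2000, 2005, 2009, 2014, Inf),  # indicative
      labels = c("Pre-Euro", "Euro 1", "Euro 2", "Euro 3",
                 "Euro 4", "Euro 5", "Euro 6"),
      right = FALSE)
}
euro_class(c(1995, 2003, 2016))  # "Euro 1" "Euro 3" "Euro 6"
```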
The Annual Survey of Industries (ASI) is the principal source of industrial statistics in India. It provides statistical information to assess changes in the growth, composition and structure of the organised manufacturing sector, comprising activities related to manufacturing processes, repair services, gas and water supply and cold storage. The industrial sector occupies an important position in the State economy and has a pivotal role to play in rapid and balanced economic development. The Survey is conducted annually under the statutory provisions of the Collection of Statistics Act, 1953, and the Rules framed thereunder in 1959, except in the State of Jammu & Kashmir, where it is conducted under the State Collection of Statistics Act, 1961 and the rules framed thereunder in 1964.
National and State level.
Coverage of the Annual Survey of Industries extends to the entire Factory Sector, comprising industrial units (called factories) registered under sections 2(m)(i) and 2(m)(ii) of the Factories Act, 1948, wherein a "Factory", the primary statistical unit of enumeration for the ASI, is defined as "any premises including the precincts thereof: (i) wherein ten or more workers are working or were working on any day of the preceding twelve months, and in any part of which a manufacturing process is being carried on with the aid of power or is ordinarily so carried on, or (ii) wherein twenty or more workers are working or were working on any day of the preceding twelve months, and in any part of which a manufacturing process is being carried on without the aid of power." In addition to sections 2(m)(i) and 2(m)(ii) of the Factories Act, 1948, electricity units registered with the Central Electricity Authority and bidi and cigar units registered under the Bidi & Cigar Workers (Conditions of Employment) Act, 1966 are also covered in the ASI.
The primary unit of enumeration in the survey is a factory in the case of manufacturing industries, a workshop in the case of repair services, an undertaking or a licensee in the case of electricity, gas and water supply undertakings, and an establishment in the case of the bidi and cigar industries. The owner of two or more establishments located in the same State, pertaining to the same industry group and belonging to the same scheme (census or sample), is, however, permitted to furnish a single consolidated return. Such consolidated returns are a common feature in the case of bidi and cigar establishments, electricity undertakings and certain public sector undertakings.
The survey covers factories registered under the Factories Act, 1948. Establishments under the control of the Defence Ministry, oil storage and distribution units, restaurants and cafes, and technical training institutions not producing anything for sale or exchange were kept outside the coverage of the ASI.
Sample survey data [ssd]
Sampling Procedure
All the factories in the updated frame (universe) are divided into two sectors, viz., Census and Sample.
Census Sector: Census Sector is defined as follows:
a) All industrial units belonging to the 12 less industrially developed states/UTs, viz. Goa, Himachal Pradesh, J&K, Manipur, Meghalaya, Nagaland, Tripura, Andaman & Nicobar Islands, Chandigarh, Dadra & Nagar Haveli, Daman & Diu and Pondicherry, were completely enumerated every year along with census units.
b) For the rest of the states/UTs: (i) units having 50 or more workers and using power, or 100 or more workers without using power, and all electricity undertakings; (ii) all the industry groups for which the total number of units did not exceed 50 at the all-India level.
c) The remaining units, excluding those of the Census Sector, called the Sample Sector, were covered in two consecutive years (50% samples in alternate years). The sampling strategy was stratified uni-stage with State × NIC 3-digit as the stratum. The strata were formed by grouping factories within each State/UT by the industry group at the ultimate digit level of the NIC; thus, in each state, each industry group constitutes a stratum. Within each stratum the districts were first arranged in ascending order of district codes, and within each district the factories were then listed in descending order of their employment size. The factories within each stratum, having been arranged in the above manner, were allotted a running serial number. Factories with odd serial numbers were surveyed in the first year and those with even numbers in the second year of a two-year cycle.
There was no deviation from the sample design in ASI 1974-75.
Face-to-face [f2f]
The Annual Survey of Industries 1978-79 questionnaire is divided into different blocks; however, only summarised data is available for processing and analysis. Therefore, there is only one merged data file for ASI Summary 1978-79. The record layout of the merged file is provided.
Pre-data-entry scrutiny was carried out on the schedules for inter- and intra-block consistency checks. Such editing was mostly manual, although some editing was automatic. For major inconsistencies, the schedules were referred back to NSSO (FOD) for clarifications/modifications.
The code list, State code list and NIC-70 code list, which were used for editing and data processing, may be referred to in the External Resources.
Relative Standard Error (RSE) is calculated for the number of workers, wages to workers and GVA using the prescribed formula.
To check the consistency and reliability of the data, they are compared with the NIC 2-digit level growth rates from the all-India Index of Industrial Production (IIP) and with the growth rates obtained from the National Accounts Statistics at current and constant prices for the registered manufacturing sector.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Key Table Information
Table Title: Annual Business Survey: Statistics for Employer Firms by Industry, Sex, Ethnicity, Race, and Veteran Status for the U.S., States, Metro Areas, Counties, and Places: 2022
Table ID: ABSCS2022.AB2200CSA01
Survey/Program: Economic Surveys
Year: 2022
Dataset: ECNSVY Annual Business Survey Company Summary
Release Date: 2024-12-19
Release Schedule: The Annual Business Survey (ABS) occurs every year, beginning in reference year 2017. For more information about ABS planned data product releases, see the Tentative ABS Schedule.
Dataset Universe: The dataset universe consists of employer firms that are in operation for at least some part of the reference year, are located in one of the 50 U.S. states, associated offshore areas, or the District of Columbia, have paid employees and annual receipts of $1,000 or more, and are classified in one of nineteen in-scope sectors defined by the 2022 North American Industry Classification System (NAICS), except for NAICS 111, 112, 482, 491, 521, 525, 813, 814, and 92, which are not covered.
Sponsor: National Center for Science and Engineering Statistics, U.S. National Science Foundation

Methodology
Data Items and Other Identifying Records: Number of employer firms (firms with paid employees); sales and receipts of employer firms (reported in $1,000s of dollars); number of employees (during the March 12 pay period); annual payroll (reported in $1,000s of dollars). These data are aggregated by sex, ethnicity, race, and veteran status when classifiable. Definitions can be found by clicking on the column header in the table or by accessing the Economic Census Glossary.
Unit(s) of Observation: The reporting units for the ABS are employer companies or firms rather than establishments. A company or firm is comprised of one or more in-scope establishments that operate under the ownership or control of a single organization.
Geography Coverage: The 2022 reference year data are shown for the total for all sectors (00) and the 2-digit NAICS code levels for: United States; States and the District of Columbia; Metropolitan Statistical Areas; Micropolitan Statistical Areas; Metropolitan Divisions; Combined Statistical Areas; Counties; Economic Places. Data are also shown for the 3- to 6-digit NAICS code levels for: United States; States and the District of Columbia. For information about geographies, see Geographies.
Industry Coverage: The data are shown for the total of all sectors ("00"), and at the 2- through 6-digit NAICS code levels depending on geography. Sector "00" is not an official NAICS sector but is rather a way to indicate a total for multiple sectors. Note: other programs outside of ABS may use sector 00 to indicate when multiple NAICS sectors are being displayed within the same table and/or dataset. The following are excluded from the total of all sectors: Crop and Animal Production (NAICS 111 and 112); Rail Transportation (NAICS 482); Postal Service (NAICS 491); Monetary Authorities-Central Bank (NAICS 521); Funds, Trusts, and Other Financial Vehicles (NAICS 525); Office of Notaries (NAICS 541120); Religious, Grantmaking, Civic, Professional, and Similar Organizations (NAICS 813); Private Households (NAICS 814); Public Administration (NAICS 92). For information about NAICS, see North American Industry Classification System.
Sampling: The ABS sample includes firms that are selected with certainty if they have known research and development activities, were included in the 2022 BERD sample, or have high receipts, payroll, or employment. Total sample size is 850,000 firms. The universe is stratified by state, industry group, and expected demographic group. Firms selected for the sample receive a questionnaire. For all data on this table, firms not selected into the sample are represented with administrative, 2022 Economic Census, or other economic survey records. For more information about the sample design, see Annual Business Survey Methodology.
Confidentiality: The Census Bureau has reviewed this data product to ensure appropriate access, use, and disclosure avoidance protection of the confidential source data (Project No. P-7504866, Disclosure Review Board (DRB) approval number: CBDRB-FY24-0351). To protect confidentiality, the U.S. Census Bureau suppresses cell values to minimize the risk of identifying a particular business' data or identity. To comply with data quality standards, data rows with high relative standard errors (RSE) are not presented. Additionally, firm counts are suppressed when other select statistics in the same row are suppressed. More information on disclosure avoidance is available in the Annual Business Survey Methodology.
Technical Documentation/Methodology: For detailed information about the methods used to collect data and produce statistics, survey questionnaires, Primary Business Activity/NAICS codes, and more, see Technical Documentation.
Weights: For more information about weighting, see Annual Business Survey Methodology.

Table Information
FTP Download: https://www2.census.gov/programs...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Full CNA data identified by array-CGH in the 50 Wilms tumor samples. CNA calling was performed using the software Nexus Copy Number 7.0 (Biodiscovery). Table S2. Array-CGH data summary of Wilms tumor samples: total number of copy number alterations, number of each type of copy number event (gain, loss, high-copy gain, and homozygous loss), and statistical analysis. (XLS 872 kb)
Tags
survey, environmental behaviors, lifestyle, status, PRIZM, Baltimore Ecosystem Study, LTER, BES
Summary
BES Research, Applications, and Education
Description
Geocoded for Baltimore County. The BES Household Survey 2003 is a telephone survey of metropolitan Baltimore residents consisting of 29 questions. The survey research firm, Hollander, Cohen, and McBride conducted the survey, asking respondents questions about their outdoor recreation activities, watershed knowledge, environmental behavior, neighborhood characteristics and quality of life, lawn maintenance, satisfaction with life, neighborhood, and the environment, and demographic information. The data from each respondent is also associated with a PRIZM classification, census block group, and latitude-longitude. PRIZM classifications categorize the American population using Census data, market research surveys, public opinion polls, and point-of-purchase receipts. The PRIZM classification is spatially explicit, allowing the survey data to be viewed and analyzed spatially and allowing specific neighborhood types to be identified and compared based on the survey data. The census block group and latitude-longitude data also allow us additional methods of presenting and analyzing the data spatially.
The household survey is part of the core data collection of the Baltimore Ecosystem Study to classify and characterize social and ecological dimensions of neighborhoods (patches) over time and across space. This survey is linked to other core data including US Census data, remotely-sensed data, and field data collection, including the BES DemSoc Field Observation Survey.
The BES 2003 telephone survey was conducted by Hollander, Cohen, and McBride from September 1-30, 2003. The sample was obtained from the professional sampling firm Claritas, so that their "PRIZM" encoding would be appended to each piece of sample (telephone number) supplied. Mailing addresses were also obtained so that a postcard could be sent in advance of interviewers calling. The postcard briefly informed potential respondents about the survey, who was conducting it, and that they might receive a phone call in the next few weeks. A stratified sampling method was used to obtain between 50 and 150 respondents in each of the 15 main PRIZM classifications. This allows direct comparison of PRIZM classifications. Analysis of the data for the general metropolitan Baltimore area must be weighted to match the population proportions normally found in the region. A total of 9,000 telephone numbers was obtained in the sample. All 9,000 numbers were dialed, but contact was only made on 4,880: 1,508 completed an interview, 2,524 refused immediately, 147 broke off/incomplete, 84 respondents had moved and were no longer in the correct location, and a qualified respondent was not available on 617 calls. This resulted in a response rate of 36.1%, compared with a response rate of 28.2% in 2000. The CATI (Computer Assisted Terminal Interviewing) software randomized the random sample supplied, and was programmed for at least 3 attempted callbacks per number, with emphasis on pulling available callback sample prior to accessing uncalled numbers. Calling was conducted only during evening and weekend hours, when most heads of household are home. The use of CATI facilitated stratified sampling on PRIZM classifications, centralized data collection, standardized interviewer training, and reduced the overall cost of primary data collection. Additionally, to reduce respondent burden, the questionnaire was revised to be concise, easy to understand, minimize the use of open-ended responses, and require an average of 15 minutes to complete.
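A hedged R sketch of the weighting implied above: for region-wide analysis, each PRIZM class is weighted by its share of the regional population relative to its share of the 1,508 completed interviews. The class shares below are hypothetical, not taken from the survey.

```r
# Sketch of post-stratification weights for the PRIZM-stratified sample.
pop_share    <- c(class_A = 0.18, class_B = 0.55, class_C = 0.27)  # hypothetical
samp_share   <- c(class_A = 0.33, class_B = 0.34, class_C = 0.33)  # hypothetical
prizm_weight <- pop_share / samp_share  # >1 under-sampled, <1 over-sampled
```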
Additional documentation of this database is attached to this metadata and includes four documents: 1) the telephone survey, 2) documentation of the telephone survey, 3) metadata for the telephone survey, and 4) a description of the attribute data in the BES survey 2003.
This database was created by joining the GDT geographic database of US Census Block Group geographies for the Baltimore Metropolitan Statistical Area (MSA) with the Claritas PRIZM database, 2003, of unique classifications of each Census Block Group, and the unique PRIZM code for each respondent from the BES Household Telephone Survey, 2003. The GDT database is preferred and used because
This map compares the number of people living above the poverty line to the number of people living below.

Why do this? There are people living below the poverty line everywhere. Nearly every area of the country has a balance of people living above the poverty line and people living below it. There is not an "ideal" balance, so this map makes good use of the national ratio of 6 persons living above the poverty line for every 1 person living below it. Please consider that there is constant movement of people above and below the poverty threshold, as they gain better employment or lose a job; as they encounter a new family situation, natural disaster, health issue, major accident or other crisis. There are areas that suffer chronic poverty year after year. This map does not indicate how long people in the area have been below the poverty line.

"The poverty rate is one of several socioeconomic indicators used by policy makers to evaluate economic conditions. It measures the percentage of people whose income fell below the poverty threshold. Federal and state governments use such estimates to allocate funds to local communities. Local communities use these estimates to identify the number of individuals or families eligible for various programs." Source: U.S. Census Bureau

In the U.S. overall, there are 6 people living above the poverty line for every 1 person living below. Green areas on the map have a higher than normal number of people living above the poverty line compared to below; orange areas have a higher than normal number of people living below the poverty line compared to above in that same area.

The map is featured in a simple viewing app. It shows the ratio for states, counties, and census tracts, using layers created directly from the U.S. Census Bureau's American Community Survey (ACS). For comparison, an older layer using 2013 ACS data is also provided. The layers are updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contain estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. Poverty status is based on income in the past 12 months of the survey.

Current Vintage: 2014-2018
ACS Table(s): B17020
Data downloaded from: Census Bureau's API for American Community Survey
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates.

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value.
In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes: This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. Click here to learn more about ACS data releases. Boundaries come from the US Census TIGER geodatabases. Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines clipped for cartographic purposes. For census tracts, the water cutouts are derived from a subset of the 2010 AWATER (Area Water) boundaries offered by TIGER. For state and county boundaries, the water and coastlines are derived from the coastlines of the 500k TIGER Cartographic Boundary Shapefiles. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters). The States layer contains 52 records: all US states, Washington D.C., and Puerto Rico. Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (census tracts beginning with 99). Percentages and derived counts, and associated margins of error, are calculated values (identifiable by the "_calc_" stub in the field name) and abide by the specifications defined by the American Community Survey. Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.

Negative values (e.g., -555555...) have been set to null. These negative values exist in the raw API data to indicate the following situations:
- The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
- Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
- The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
- The estimate is controlled. A statistical test for sampling variability is not appropriate.
- The data for this geographic area cannot be displayed because the number of sample cases is too small.

NOTE: any calculated percentages or counts that contain estimates that have null margins of error yield null margins of error for the calculated fields.
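A small R illustration of the two quantities behind this map: the above/below-poverty ratio and the 90 percent interval implied by an ACS estimate and its published margin of error. The counts below are hypothetical, not ACS values.

```r
# Sketch: the mapped ratio, plus a 90% interval from an estimate and its MOE.
above <- 5400; below <- 900        # hypothetical tract counts
above / below                      # 6 matches the national balance
est <- 900; moe <- 120             # hypothetical ACS estimate and 90% MOE
c(lower = est - moe, upper = est + moe)
```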
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, the decennial census is the official source of population totals for April 1st of each decennial year. In between censuses, the Census Bureau's Population Estimates Program produces and disseminates the official estimates of the population for the nation, states, counties, cities, and towns and estimates of housing units and the group quarters population for states and counties.

Information about the American Community Survey (ACS) can be found on the ACS website. Supporting documentation including code lists, subject definitions, data accuracy, and statistical testing, and a full list of ACS tables and table shells (without estimates) can be found on the Technical Documentation section of the ACS website. Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.

Source: U.S. Census Bureau, 2023 American Community Survey 1-Year Estimates.

ACS data generally reflect the geographic boundaries of legal and statistical areas as of January 1 of the estimate year. For more information, see Geography Boundaries by Year.

Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see ACS Technical Documentation). The effect of nonsampling error is not represented in these tables.

Users must consider potential differences in geographic boundaries, questionnaire content or coding, or other methodological issues when comparing ACS data from different years. Statistically significant differences shown in ACS Comparison Profiles, or in data users' own analysis, may be the result of these differences and thus might not necessarily reflect changes to the social, economic, housing, or demographic characteristics being compared. For more information, see Comparing ACS Data.

The age dependency ratio is derived by dividing the combined under-18 and 65-and-over populations by the 18-to-64 population and multiplying by 100. The old-age dependency ratio is derived by dividing the population 65 and over by the 18-to-64 population and multiplying by 100. The child dependency ratio is derived by dividing the population under 18 by the 18-to-64 population and multiplying by 100.

When information is missing or inconsistent, the Census Bureau logically assigns an acceptable value using the response to a related question or questions. If a logical assignment is not possible, data are filled using a statistical process called allocation, which uses a similar individual or household to provide a donor value. The "Allocated" section is the number of respondents who received an allocated value for a particular subject.

Estimates of urban and rural populations, housing units, and characteristics reflect boundaries of urban areas defined based on 2020 Census data. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.

Explanation of Symbols:
* "-" The estimate could not be computed because there were an insufficient number of sample observations. For a ratio of medians estimate, one or both of the median estimates falls in the lowest interval or highest interval of an open-ended distribution. For a 5-year median estimate, the margin of error associated with a median was larger than the median itself.
* "N" The estimate or margin of error cannot be displayed because there were an insufficient number of sample cases in the selected geographic area.
* "(X)" The estimate or margin of error is not applicable or not available.
* "median-" The median falls in the lowest interval of an open-ended distribution (for example "2,500-").
* "median+" The median falls in the highest interval of an open-ended distribution (for example "250,000+").
* "**" The margin of error could not be computed because there were an insufficient number of sample observations.
* "***" The margin of error could not be computed because the median falls in the lowest interval or highest interval of an open-ended distribution.
* "*****" A margin of error is not appropriate because the corresponding estimate is controlled to an independent population or housing estimate. Effectively, the corresponding estimate has no sampling error and the margin of error may be treated as zero.
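Where a quick check is useful, the ratio definitions and the margin-of-error interpretation above translate directly into code. The following is a minimal sketch with hypothetical population counts and MOE values; none of these numbers come from the ACS.

# Sketch of the dependency-ratio formulas and the 90 percent
# margin-of-error interpretation described above. All counts are
# hypothetical, chosen only to illustrate the arithmetic.

def age_dependency_ratio(under_18, age_18_to_64, age_65_plus):
    # (under-18 + 65-and-over) / 18-to-64, times 100
    return (under_18 + age_65_plus) / age_18_to_64 * 100

def old_age_dependency_ratio(age_18_to_64, age_65_plus):
    return age_65_plus / age_18_to_64 * 100

def child_dependency_ratio(under_18, age_18_to_64):
    return under_18 / age_18_to_64 * 100

def ninety_pct_bounds(estimate, moe):
    # The published ACS margin of error defines a 90 percent interval.
    return estimate - moe, estimate + moe

print(age_dependency_ratio(22_000, 60_000, 18_000))  # 66.67
print(old_age_dependency_ratio(60_000, 18_000))      # 30.0
print(child_dependency_ratio(22_000, 60_000))        # 36.67
print(ninety_pct_bounds(50_000, 1_200))              # (48800, 51200)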
This layer shows Asian alone or in any combination by selected groups. This is shown by tract, county, and state centroids. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. The numbers by detailed Asian groups do not add to the total population. This is because the detailed Asian groups are tallies of the number of Asian responses rather than the number of Asian respondents. Responses that include more than one race and/or Asian group are counted several times. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2019-2023
ACS Table(s): B02001, B02011, B02018 (Not all lines of ACS table B02001 are available in this layer.)
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: December 12, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates.

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

Data Note from the Census:
Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. More information about ACS data releases is available from the Census Bureau.

Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).

The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico. Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99). Percentages and derived counts, and associated margins of error, are calculated values (identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey. Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.

Negative values (e.g., -4444...) have been set to null, with the exception of -5555..., which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
* The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
* Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
* The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
* The estimate is controlled. A statistical test for sampling variability is not appropriate.
* The data for this geographic area cannot be displayed because the number of sample cases is too small.
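The sentinel recoding in the processing notes can be summarized in a short sketch. The exact sentinel codes are truncated above (-4444..., -5555...), so the values and column handling below are illustrative assumptions, not the service's actual implementation.

# Hedged sketch of the sentinel recoding described above: negative
# codes become null, except the -5555... code, which becomes zero.
# The sentinel lists are placeholders for the truncated codes.
import numpy as np
import pandas as pd

NULL_SENTINELS = [-4444]   # placeholder for the truncated -4444... codes
ZERO_SENTINELS = [-5555]   # placeholder for the truncated -5555... code

def recode_sentinels(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    numeric = out.select_dtypes(include=[np.number]).columns
    out[numeric] = out[numeric].replace(ZERO_SENTINELS, 0)
    out[numeric] = out[numeric].replace(NULL_SENTINELS, np.nan)
    return out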
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The survey on financial literacy among the citizens of Bosnia and Herzegovina was conducted within a larger project that aims at creating the Action Plan for Consumer Protection in Financial Services.
The conclusion about the need for an Action Plan was reached by the representatives of the World Bank, the Federal Ministry of Finance, the Central Bank of Bosnia and Herzegovina, supervisory authorities for entity financial institutions and non-governmental organizations for the protection of consumer rights, based on the Diagnostic Review on Consumer Protection and Financial Literacy in Bosnia and Herzegovina conducted by the World Bank in 2009-2010. This diagnostic review was conducted at the request of the Federal Ministry of Finance, as part of a larger World Bank pilot program to assess consumer protection and financial literacy in developing countries and middle-income countries. The diagnostic review in Bosnia and Herzegovina was the eighth within this project.
The financial literacy survey, whose results are presented in this report, aims at establishing the basic situation with respect to financial literacy, serving on the one hand as a preparation for the educational activities plan, and on the other as a basis for measuring the efficiency of activities undertaken.
Data collection was based on a random, nation-wide sample of citizens of Bosnia and Herzegovina aged 18 or older (N = 1036).
Household, individual
Population aged 18 or older
Sample survey data [ssd]
SUMMARY
In Bosnia and Herzegovina, as is well known, there is no completely reliable sample frame or information about the universe. The main reasons for this situation are migrations caused by the war and the lack of recent census data. The last census dates back to 1991, and since then the size and distribution of the population have changed significantly. In such a situation, researchers have to combine all available sources of population data to estimate the present size and structure of the population: estimates by official statistical offices and international organizations, voters' lists, lists of polling stations, registries of passport and ID holders, data from large random surveys, etc.
The sample was stratified in three stages: in the first stage by entity, in the second by county/region and in the third by type of settlement (urban/rural). This means that, in the first stage, the total sample size was divided into two parts proportionally to the number of inhabitants of each entity; in the second stage, the subsample size for each entity was further divided by region/county; and in the third stage, the subsample for each region/county was divided into two categories according to settlement type (rural/urban).
Taking into account the lack of a reliable and complete list of citizens to be used as a sample frame, a multistage sampling method was applied. The list of polling stations was used as a frame for the selection of primary sampling units (PSUs). Polling station territories are a good choice for such a procedure since they had recently been updated for the general elections held in October 2010. The list of polling station territories contains a list of addresses of housing units that are certainly occupied.
In the second stage, households were used as the secondary sampling units. Households were selected by a random route technique. In total, 104 PSUs were selected, with an average of 10 respondents per PSU. The respondent within the selected household was chosen randomly using the Troldahl-Carter-Bryant scheme.
In total, 1036 citizens were interviewed with a satisfactory response rate of around 60% (table 1). A higher refusal rate is recorded among middle-age groups (table 2). The theoretical margin of error for a random sample of this size is +/-3.0%.
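As a check on the quoted figure, the theoretical margin of error follows from the usual simple-random-sampling formula with p = 0.5 and a 95% confidence level; these are standard assumptions, not ones stated in the report itself.

# Quick check of the quoted +/-3.0% theoretical margin of error for
# n = 1036, assuming simple random sampling, a 95% confidence level,
# and the conservative p = 0.5.
import math

n, p, z = 1036, 0.5, 1.96
moe = z * math.sqrt(p * (1 - p) / n)
print(f"{moe:.3f}")  # ~0.030, i.e. +/-3.0 percentage points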
Due to refusals, the sample structure deviated from the estimated population structure by gender, age and education level. These deviations were corrected by a RIM weighting procedure.
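RIM weighting is iterative proportional fitting (raking) of weights to known marginal targets. The following is a minimal sketch; the target shares and data are hypothetical, and production raking adds convergence checks and weight caps.

# Minimal sketch of RIM (raking) weighting: iteratively adjust weights
# so the weighted margins match known population shares.
import pandas as pd

def rake(df, targets, iters=50):
    # targets: {column: {category: population share}}
    w = pd.Series(1.0, index=df.index)
    for _ in range(iters):
        for col, shares in targets.items():
            current = w.groupby(df[col]).sum() / w.sum()
            factors = {k: shares[k] / current[k] for k in shares}
            w = w * df[col].map(factors)
    return w

df = pd.DataFrame({"gender": ["m", "f", "f", "m", "f"],
                   "age": ["18-34", "35-54", "18-34", "55+", "55+"]})
targets = {"gender": {"m": 0.49, "f": 0.51},
           "age": {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}}
df["weight"] = rake(df, targets)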
MORE DETAILED INFORMATION
IPSOS designed a representative sample of approximately 1,000 residents aged 18 and over, proportional to the adult population of each region, based on age, sex, region and town (settlement) type.
For this research a three-stage stratified representative sample was designed: the sample was stratified first at the entity level, then at the regional level, and then at the settlement-type level within each region.
Sample universe:
Population of B&H aged 18+; 1991 census figures and estimated population dynamics; 1996 census figures of refugees and IDPs; Central Election Commission (2008); CIPS (2008).
Sampling frame:
Polling station territories (approximately the size of census units) within strata defined by region and type of settlement (urban and rural). Polling station territories were chosen as primary units because they enable the most reliable sample selection: for these units the most complete data (a dwelling register with addresses) are available.
Type of sample:
Three stage random representative stratified sample
Definition and number of PSU, SSU, TSU, and sampling points
Stratification, purpose and method
Method: The strata are defined by criteria of optimal geographical and cultural uniformity
Selection procedure of PSU, SSU, and respondent
PSU - Type of sampling: polling station territories chosen with probability proportional to size (PPS). Method of selection: cumulative totals (Lahiri method); a sketch of cumulative PPS selection appears after this list.
SSU - Type of sampling: simple random sampling without replacement. Method of selection: random walk with a random choice of the starting point.
TSU (respondent) - Type of sampling: simple random sampling without replacement. Method of selection: Troldahl-Carter-Bryant (TCB) scheme.
Sample size: N = 1036 respondents
Sampling error: margin of error +/-3.0%
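As referenced above, the following is a minimal sketch of cumulative PPS selection of PSUs, in which each polling station territory is drawn with probability proportional to a size measure; the station names and size measures are hypothetical.

# Sketch of cumulative PPS selection: a random point on the cumulative
# size scale picks a unit with probability proportional to its size.
import bisect
import itertools
import random

def pps_cumulative(units, sizes, n_draws, seed=0):
    rng = random.Random(seed)
    cum = list(itertools.accumulate(sizes))   # cumulative size totals
    total = cum[-1]
    picks = []
    for _ in range(n_draws):
        r = rng.uniform(0, total)             # random point on [0, total)
        picks.append(units[bisect.bisect_left(cum, r)])
    return picks

stations = ["PS-01", "PS-02", "PS-03", "PS-04"]
voters = [1200, 300, 800, 450]                # hypothetical size measures
print(pps_cumulative(stations, voters, n_draws=2))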
Face-to-face [f2f]
The survey was modelled on the equivalent survey conducted in Romania. The questionnaire used in the Financial Literacy Survey in Romania was localized for Bosnia and Herzegovina, including adaptations to the Bosnian context and methodological improvements in the wording of questions.
Before data entry, 100% logic and consistency controls were performed, first by local supervisors and again by staff in the central office.
Verification of correct data entry was assured by using the BLAISE data entry system (a commercial product of Statistics Netherlands), in which criteria for logic and consistency control are defined in advance.
Over time there have been a number of tide gauges deployed at Mawson Station, Antarctica. The data download files contain further information about the gauges, but some of the information has been summarised here. Note that this metadata record only describes tide gauge data from 1992 to 2016. More recent data are described elsewhere.
Tide Gauge 1 (TG001) 1992-03-05 - 1992-05-13 This folder contains monthly download files from the first deployment of a submerged tide gauge at Mawson in March 1992. These files are ASCII hexadecimal files. They need to be converted to decimal. The resultant values are absolute seawater pressures in mbar.
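The hex-to-decimal step can be sketched as follows. The exact record layout of the download files is not described here, so a one-value-per-line format is assumed purely for illustration.

# Hedged sketch of the conversion the download notes describe: parse
# each record of an ASCII hexadecimal file into a decimal value, read
# as absolute seawater pressure in mbar.

def hex_lines_to_mbar(path):
    pressures = []
    with open(path) as fh:
        for line in fh:
            token = line.strip()
            if token:
                pressures.append(int(token, 16))  # hex -> decimal mbar
    return pressures

# e.g. a line "27D8" would decode to 10200 mbar of absolute pressure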
Tide Gauge 4 (TG004)
1993-03-22 - 1999-12-29
This folder contains the following folders:
* old_tidedata - monthly download files from the second deployment of a submerged tide gauge at Mawson in March 1993. These files are ASCII hexadecimal files. They need to be converted to decimal. The resultant values are absolute seawater pressures in mbar.
* raw - raw memory images from the submerged tide gauge; the file extension is the memory bank number. These files are processed by a utility called tgxtract.exe, which creates files in the same format as those in the old_tidedata folder; these files have the extension .srt. They are then converted to decimal pressure values.
* interim - interim files produced during processing of the .raw files.
* output - output files (.srt) which have been sent to BoM.
Tide Gauge 13 (TG013) 2014-06-04 - 2016-11-04
Tide Gauge 20 (TG020) 1999-11-05 - 2009-12-21
This folder contains the following folders:
* raw - raw memory images from the submerged tide gauge; the file extension is the memory bank number. These files are processed by a utility called tgxtract.exe, which creates files in the same format as the original download format; these files have the extension .srt. These files are ASCII hexadecimal files. They need to be converted to decimal. The resultant values are absolute seawater pressures in mbar.
* interim - interim files produced during processing of the .raw files.
* output - output files (.srt) which have been sent to BoM.
Tide Gauge 41 (TG041) 2008-03-02 - 2010-11-16
This folder contains the following folders:
* raw - raw memory images from the submerged tide gauge; the file extension is the memory bank number. These files are processed by a utility called tgxtract.exe, which creates files in the same format as the original download format; these files have the extension .srt. These files are ASCII hexadecimal files. They need to be converted to decimal. The resultant values are absolute seawater pressures in mbar.
* interim - interim files produced during processing of the .raw files.
* output - output files (.srt) which have been sent to BoM.
Documentation from older metadata record: Documentation dated 2001-03-26
Mawson Submerged Tide Gauge
The gauge used at Mawson was designed in 1991/92 by Platypus Engineering, Hobart, Tasmania. It was intended to be submerged in about 7 metres of water in a purpose-made concrete mooring in the shape of a truncated pyramid. The gauge measures pressure using a Paroscientific Digiquartz pressure transducer with a full scale pressure of 30 psi absolute. The accuracy of the transducer is 1 in 10,000 of full scale over the calibrated temperature range. The overall accuracy of the system is better than +/- 3 mm for a known water density.

Data are retrieved from the gauges by lowering a coil assembly on the end of a cable over a projecting knob on the top of the gauge; by use of an interface unit, a serial connection can then be established to the gauge, allowing time setting and data retrieval.

The first of these gauges was deployed at Mawson in early 1992 in a mooring in Horseshoe Harbour. The gauge was found to have some communications problems and was removed in May 1992. Tidal records from 6/3/92 onwards were retrieved from it. A new gauge was deployed at Mawson in March 1993. Data have been retrieved from these gauges irregularly since then. The records are complete since deployment except for a few days in late 1995; the loss was caused by a fault in the software which allows directory entries to overwrite data when the directory memory has been filled. The first gauge used at Mawson in 1992 was refitted with a higher pressure transducer and was later deployed at Atlas Cove, Heard Island. Conversion of raw data to tidal records is done as detailed in the document DATAFORMAT1.DOC. As the current gauge is expected to require a new battery soon, a new mooring has been placed close to the original and a new gauge has been deployed.

Levelling
Several attempts have been made at precise levelling of the gauge. The first was in the summer of 1995/96, when Roger Handsworth, Tom Gordon and Natasha Adams physically measured the level of the top of the gauge in its mooring and derived a reading when a known column of water was over the gauge. The next attempt was in the summer of 1996/97, when Roger Handsworth and Paul Delaney made timed water level measurements close to the gauge and the tide gauge benchmark. From this work, and from tidal records, a value for MSL at Mawson was derived.
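As a worked check of the quoted figures, 1 part in 10,000 of a 30 psi absolute full scale corresponds to roughly 2 mm of seawater at a nominal density, consistent with the stated overall accuracy of better than +/- 3 mm. The density value below is an assumption.

# Worked check: transducer accuracy of 1 in 10,000 of a 30 psi
# (absolute) full scale, expressed as an equivalent height of seawater.
PSI_TO_PA = 6894.76
RHO_SEAWATER = 1027.0   # kg/m^3, assumed nominal density
G = 9.81                # m/s^2

full_scale_pa = 30 * PSI_TO_PA          # ~206,843 Pa
accuracy_pa = full_scale_pa / 10_000    # ~20.7 Pa
accuracy_mm = accuracy_pa / (RHO_SEAWATER * G) * 1000
print(f"{accuracy_mm:.1f} mm")          # ~2.1 mm, within the +/- 3 mm overall figure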
Permanent Gauge
In the summer of 1995/96 two possible sites for a permanent Aquatrak-type tide gauge were identified. As neither of these sites was approved, a survey in the summer of 1996/97 identified two more suitable sites. One of these, the site at the base of East Arm near the Variometer Building, was approved, and a bore hole was drilled to exit about 6 metres below MSL. A power cable was run from the Variometer Building to provide two-phase 240 V power to the site. A heated borehole liner containing an Aquatrak wave guide and a Druck pressure transducer was inserted into the bore hole. Two dataloggers will be added in 2001 to complete the installation. A radio modem will be used to link the dataloggers to the AAD network.
Documentation dated 2008-10-17
Mawson
A new submerged gauge, TG41, was deployed at Mawson on 2008-03-03. Submerged tide gauge TG20 was removed on 2008-08-26. There is a useful overlap of data between the gauges of about 104 days.
The dataloggers used in the shore-based tide gauge installation have been replaced with Campbell Scientific CR1000 dataloggers.
The Aquatrak shore-based gauge at Mawson has not been operating since March 2008. The shore-based pressure gauge is still operating.
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE PALESTINIAN CENTRAL BUREAU OF STATISTICS
The basic goal of the Household Expenditure and Consumption Survey is to provide the necessary database for formulating national policies at various levels. The survey provides the contribution of the household sector to the Gross National Product (GNP), determines the incidence of poverty, and provides weighted data reflecting the relative importance of the consumption items to be employed in determining the benchmark for rates and prices of items and services. Furthermore, the survey is a fundamental cornerstone in the process of studying the nutritional status in the Palestinian Territory.
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts were exerted to acquire, clean, harmonize, preserve and disseminate microdata of existing household surveys in several Arab countries.
The data are representative at the region level (West Bank, Gaza Strip), locality type (urban, rural, camp) and governorate level.
1- Household/family. 2- Individual/person.
All Palestinian households who are usually resident in the Palestinian Territory during 2011.
Sample survey data [ssd]
Sample and Frame: The sampling frame consists of all enumeration areas enumerated in 2007. Each enumeration area consists of buildings and housing units, with an average of about 120 households. These enumeration areas were used as primary sampling units (PSUs) in the first stage of sample selection.
Sample Size: The calculated sample size for the Expenditure and Consumption Survey 2011 is about 4,317 households: 2,834 in the West Bank and 1,483 in the Gaza Strip.
Sample Design: The sample is a stratified cluster systematic random sample with two stages: First stage: selection of a systematic random sample of 215 enumeration areas. Second stage: selection of a systematic random sample of 24 households from each enumeration area selected in the first stage.
Note: in Jerusalem Governorate (J1), 14 enumeration areas were selected. In the second stage, a group of households from each of these enumeration areas was chosen using the 2007 census method of delineation and enumeration so as to obtain 24 responding households, keeping household response as high as possible in line with the non-response percentage set in the sample design.
Enumeration areas were distributed across the twelve months, and the sample for each quarter covers all sample strata (governorate, locality type).
Sample strata: The population was divided by: 1- Governorate; 2- Type of locality (urban, rural, refugee camps).
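The two-stage systematic selection described above can be sketched as follows; the frame sizes and identifiers are hypothetical, and the real design stratifies by governorate and locality type before selection.

# Sketch of the two-stage design: a systematic random sample of
# enumeration areas (stage 1), then a systematic random sample of 24
# households within each selected area (stage 2).
import random

def systematic_sample(frame, n, seed=0):
    k = len(frame) / n                        # sampling interval
    start = random.Random(seed).uniform(0, k) # random start in [0, k)
    return [frame[int(start + i * k)] for i in range(n)]

enumeration_areas = [f"EA-{i:04d}" for i in range(1, 3001)]
stage1 = systematic_sample(enumeration_areas, 215)
households = {ea: [f"{ea}-HH-{j:03d}" for j in range(1, 121)] for ea in stage1}
stage2 = {ea: systematic_sample(households[ea], 24, seed=1) for ea in stage1}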
Face-to-face [f2f]
The PECS questionnaire consists of two main sections:
First: Survey Questionnaire
Part of the questionnaire is filled in during the visit at the beginning of the month, while the other part is filled in at the end of the month. The questionnaire includes:
* Control Sheet: includes the household's identification data, date of visit, data on the fieldwork and data processing team, and a summary of the household's members by gender.
* Household Roster: includes demographic, social, and economic characteristics of the household's members.
* Housing Characteristics: includes data such as type of housing unit, number of rooms, value of rent, and connection of the housing unit to basic services like water, electricity and sewage. In addition, this section includes the source of energy used for cooking and heating, distance of the housing unit from transportation, education, and health centers, and sources of income generation such as ownership of farm land or animals.
* Food and Non-Food Items: includes food and non-food items; the household records its expenditure on these for one month.
* Durable Goods Schedule: includes a list of main goods such as washing machine, refrigerator and TV.
* Assistances and Poverty: includes data about cash and in-kind assistance (assistance value, assistance source), the household's situation, and the procedures used to cover expenses.
* Monthly and Annual Income: data pertinent to the household's income from different sources are collected at the end of the registration period.
Second: List of Goods
The classification of the list of goods is based on the United Nations recommendation for the SNA, under the name Classification of Personal Consumption by Purpose. The list includes 55 groups of expenditure and consumption, each given a sequence number based on its importance to the household, starting with food goods, then clothing groups, housing, medical treatment, transportation and communication, and lastly durable goods. Each group consists of important goods. The total number of goods in all groups amounts to 667 items for goods and services. Groups 1-21 include goods pertinent to food, drinks and cigarettes. Group 22 includes goods that are home-produced and consumed by the household. Groups 23-45 include all items except food, drinks and cigarettes. Groups 50-55 include durable goods. The data are collected based on different reference periods to represent expenditure during the whole year, except for cars, where data are collected for the last three years.
Registration Form
The registration form includes instructions and examples on how to record consumption and expenditure items. The form includes the following columns:
* Monetary (if the good is purchased) or in-kind (if the item is self-produced)
* Title of the good or service
* Unit of measurement (kilogram, liter, number)
* Quantity
* Value
The pages of the registration form are colored differently for the weeks of the month. The footer of each page includes remarks that encourage households to participate in the survey. The following instructions illustrate the nature of the items that should be recorded:
* Monetary expenditures at the time of purchase
* Purchases based on debts
* Monetary gifts once presented
* Interest at pay
* Self-produced food and goods once consumed
* Food and merchandise from a commercial project once consumed
* Merchandise received as a wage or part of a wage from the employer
Data editing took place through a number of stages, including:
1. Office editing and coding
2. Data entry
3. Structure and completeness checking
4. Structural checking of SPSS data files
The survey sample consisted of 5,272 households; weights were modified to account for the non-response rate. The response rate was 88%.
Total sample size = 5,272 households
Households completed = 4,317
Traveling households = 66
Unit does not exist = 48
No one at home = 135
Refused to cooperate = 347
Vacant housing unit = 222
No available information = 6
Other = 30
Response and non-response rates formulas:
Percentage of over-coverage errors = (total cases of over-coverage x 100%) / number of cases in original sample = 5%
Non-response rate = (total cases of non-response x 100%) / net sample size = 12%
Net sample = original sample - cases of over-coverage
Response rate = 100% - non-response rate = 88%
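These formulas can be verified against the disposition counts above. Which dispositions count as over-coverage versus non-response is an assumption, chosen here to be consistent with the reported 5%, 12% and 88% figures.

# Check of the response-rate arithmetic using the disposition counts.
original = 5272
over_coverage = 48 + 222                 # unit does not exist + vacant unit
non_response = 66 + 135 + 347 + 6 + 30   # traveling, not home, refused,
                                         # no information, other
net = original - over_coverage                    # 5002
print(round(over_coverage / original * 100))      # 5
print(round(non_response / net * 100))            # 12
print(100 - round(non_response / net * 100))      # 88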
The impact of errors on data quality was reduced to a minimum due to the high efficiency and outstanding selection, training, and performance of the fieldworkers.
Procedures adopted during the fieldwork of the survey were considered a necessity to ensure the collection of accurate data, notably:
1- Schedules were developed for field visits to households during survey fieldwork; the objectives of the visits and the data collected on each visit were predetermined.
2- Fieldwork editing rules were applied during data collection to ensure corrections were implemented before the end of fieldwork activities.
3- Fieldworkers were instructed to provide details in cases of extreme expenditure or consumption by the household.
4- Questions on income were postponed until the final visit at the end of the month.
5- Validation rules were embedded in the data processing systems, along with procedures to verify data entry and data editing.
This layer shows computer ownership and internet access by education. This is shown by tract, county, and state centroids. This service is updated annually to contain the most currently released American Community Survey (ACS) 5-year data, and contains estimates and margins of error. There are also additional calculated attributes related to this topic, which can be mapped or used within analysis. This layer is symbolized to show the count of people age 25+ in households with no computer and the percent of the population age 25+ who are high school graduates (includes equivalency) and have some college or associate's degree in households that have no computer. To see the full list of attributes available in this service, go to the "Data" tab, and choose "Fields" at the top right.

Current Vintage: 2019-2023
ACS Table(s): B28006
Data downloaded from: Census Bureau's API for American Community Survey
Date of API call: December 12, 2024
National Figures: data.census.gov

The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates.

This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. For more information about ACS layers, visit the FAQ. Please cite the Census and ACS when using this data.

Data Note from the Census:
Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

Data Processing Notes:
This layer is updated automatically when the most current vintage of ACS data is released each year, usually in December. The layer always contains the latest available ACS 5-year estimates. It is updated annually within days of the Census Bureau's release schedule. More information about ACS data releases is available from the Census Bureau.

Boundaries come from the US Census TIGER geodatabases, specifically, the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2023 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters).

The States layer contains 52 records - all US states, Washington D.C., and Puerto Rico. Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99). Percentages and derived counts, and associated margins of error, are calculated values (identified by the "_calc_" stub in the field name), and abide by the specifications defined by the American Community Survey. Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.

Negative values (e.g., -4444...) have been set to null, with the exception of -5555..., which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
* The margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
* Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
* The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution. A statistical test is not appropriate.
* The estimate is controlled. A statistical test for sampling variability is not appropriate.
* The data for this geographic area cannot be displayed because the number of sample cases is too small.
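For the calculated fields, the ACS documentation specifies the margin of error of a derived proportion p = num/den as MOE(p) = sqrt(MOE_num^2 - p^2 * MOE_den^2) / den, with the ratio form (a plus sign under the radical) used when the radicand is negative. A minimal sketch with hypothetical inputs:

# Sketch of the ACS margin-of-error formula for a derived proportion.
import math

def moe_proportion(num, moe_num, den, moe_den):
    p = num / den
    radicand = moe_num**2 - p**2 * moe_den**2
    if radicand < 0:                      # fall back to the ratio form
        radicand = moe_num**2 + p**2 * moe_den**2
    return math.sqrt(radicand) / den

print(moe_proportion(1200, 110, 5400, 220))  # MOE of a ~22.2% share, ~0.018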
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file of the same name, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample passing various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections; a sketch of this comparison appears below. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
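As referenced above, the following is a hedged sketch of that comparison step; the internals of compare_seqs.py itself are not reproduced here, and the file names and matching record IDs are assumptions.

# Pair each rank 1 dUMI consensus with the sUMI consensus from the same
# template and report any that differ. Uses Biopython.
from Bio import SeqIO

sumi = {r.id: str(r.seq) for r in SeqIO.parse("sUMI.fasta", "fasta")}
dumi = {r.id: str(r.seq) for r in SeqIO.parse("dUMI.fasta", "fasta")}

for name in sorted(set(sumi) & set(dumi)):
    if sumi[name] != dumi[name]:
        print(f"{name}: sUMI and dUMI consensus sequences differ")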
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file for each sample was extracted from all of the tagged.tar.gz files and combined to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
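The extraction-and-combine step was done in R within the .Rmd; a Python sketch of the same idea, with the archive layout and file naming assumed, might look like:

# Pull each sample's dUMI_ranked.csv out of every tagged archive and
# concatenate into one table (dUMI_df.csv).
import glob
import tarfile
import pandas as pd

frames = []
for archive in glob.glob("Pipeline_Outputs/tagged_*.tar.gz"):
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            if member.name.endswith("dUMI_ranked.csv"):
                df = pd.read_csv(tar.extractfile(member))
                df["source"] = member.name      # keep sample provenance
                frames.append(df)

pd.concat(frames, ignore_index=True).to_csv("dUMI_df.csv", index=False)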
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.