Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Accurately predicting molecular properties requires learning expressive molecular representations. Graph neural networks (GNNs) have made significant advances in this area, but they often face limitations such as neighbors-explosion, under-reaching, oversmoothing, and oversquashing. Additionally, GNNs tend to have high computational costs due to their large number of parameters. These limitations emerge or worsen with larger graphs or deeper GNN models. One potential solution is to simplify the molecular graph into a smaller, richer, and more informative graph on which GNNs are easier to train. Our proposed molecular graph coarsening framework, called FunQG, uses functional groups as building blocks to determine a molecule’s properties and is based on a graph-theoretic concept called the quotient graph. We show through experiments that the resulting informative graphs are much smaller than the original molecular graphs and are thus more suitable for training GNNs. We apply FunQG to popular molecular property prediction benchmarks and compare the performance of popular baseline GNNs on the resulting data sets to that of state-of-the-art baselines on the original data sets. Our experiments demonstrate that FunQG yields notable results on various data sets while dramatically reducing the number of parameters and computational costs. By utilizing functional groups, we can achieve an interpretable framework that indicates their significant role in determining the properties of molecular quotient graphs. Consequently, FunQG is a straightforward, computationally efficient, and generalizable solution for addressing the molecular representation learning problem.
The data consist of two parts: Time trade-off (TTO) data with one row per TTO question (5 questions), and discrete choice experiment (DCE) data with one row per question (6 questions). The purpose of the data is the calculation of a Swedish value set for the capability-adjusted life years (CALY-SWE) instrument. To protect the privacy of the study participants and to comply with GDPR, access to the data is given upon request.
The data is provided in 4 .csv files with the names: tto.csv, dce.csv, weight_final_model.csv, coefs_final_model.csv.
The first two files (tto.csv, dce.csv) contain the time trade-off (TTO) answers and discrete choice experiment (DCE) answers of participants. The latter two files (weight_final_model.csv, coefs_final_model.csv) contain the generated value set of CALY-SWE weights, and the pertaining coefficients of the main effects additive model.
Background:
CALY-SWE is a capability-based instrument for studying Quality of Life (QoL). It consists of 6 attributes (health, social relations, financial situation & housing, occupation, security, political & civil rights), each of which can be answered on 3 levels (Agree, Agree partially, Do not agree). A configuration or state is one of the 3^6 = 729 possible situations that the instrument describes. Here, a configuration is denoted in the form xxxxxx, with one x for each attribute in the order above. Each x is a digit corresponding to the level of the respective attribute, with 3 being the highest (Agree) and 1 being the lowest (Do not agree). For example, 222222 encodes a configuration with all attributes on level 2 (Agree partially). The purpose of this dataset is to support the publication of the CALY-SWE value set and to enable reproduction of the calculations (due to privacy concerns we abstain from publishing individual-level characteristics). A value set consists of values on the 0 to 1 scale for all 729 configurations, each of which represents a quality weighting where 1 is the highest capability-related QoL and 0 the lowest capability-related QoL.
The data contains answers to two types of questions: TTO and DCE.
In TTO questions, participants iteratively chose a number of years between 1 and 10. A choice of x years means that living x years with full capability (state configuration 333333) is considered equivalent to living 10 years in the capability state that the TTO question describes. The answer on the 0 to 1 scale is then calculated as x/10. In the DCE questions, participants were given two states and chose the state that they found to be better. We used a hybrid model with a linear regression component and a logit model component, where the coefficients were linked through a multiplicative factor, to obtain the weights (weights_final_model.csv). Each weight is calculated as the constant plus the coefficients for the respective configuration. Coefficients for level 3 encode the difference to level 2, and coefficients for level 2 the difference to the constant. For example, the weight for 123112 is calculated as constant + socrel2 + finhou2 + finhou3 + polciv2 (no coefficients for health, occupation, and security are involved because these attributes are on level 1, which is captured in the constant/intercept).
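The weight rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the coefficient naming (`health2`, `socrel3`, and so on) follows the abbreviations listed in the variable description, while the constant and coefficient values below are placeholders chosen only to make the example run.

```python
# Attribute abbreviations in the order used by the configuration string.
ATTRIBUTES = ["health", "socrel", "finhou", "occu", "secu", "polciv"]


def coefficient_names(config: str) -> list[str]:
    """List the coefficients that contribute to a configuration's weight."""
    if len(config) != len(ATTRIBUTES) or any(c not in "123" for c in config):
        raise ValueError(f"invalid configuration: {config!r}")
    names = []
    for attribute, level in zip(ATTRIBUTES, config):
        # Level 1 is absorbed by the constant, level 2 adds one coefficient,
        # level 3 adds both the level-2 and the level-3 coefficient.
        for step in range(2, int(level) + 1):
            names.append(f"{attribute}{step}")
    return names


def weight(config: str, constant: float, coefs: dict[str, float]) -> float:
    """Constant plus all coefficients that apply to the configuration."""
    return constant + sum(coefs[name] for name in coefficient_names(config))


def tto_answer(years_full_capability: int) -> float:
    """Convert a TTO choice of x years (1 to 10) to the 0 to 1 scale."""
    return years_full_capability / 10


# Placeholder values (NOT the published CALY-SWE coefficients).
constant = 0.1
coefs = {f"{a}{level}": 0.05 for a in ATTRIBUTES for level in (2, 3)}

print(coefficient_names("123112"))  # ['socrel2', 'finhou2', 'finhou3', 'polciv2']
print(round(weight("123112", constant, coefs), 3))
```

With the real files, the constant and coefficients would come from coefs_final_model.csv instead of the placeholder dictionary.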
To assess the quality of TTO answers, we calculated a score per participant that takes into account inconsistencies in answering the TTO question. We then excluded 20% of participants with the worst score to improve the TTO data quality and signal strength for the model (this is indicated by the 'included' variable in the TTO dataset). Details of the entire survey are described in the preprint “CALY-SWE value set: An integrated approach for a valuation study based on an online-administered TTO and DCE survey” by Meili et al. (2023). Please check this document for updated versions.
IDs have been randomized, with linkage preserved between the DCE and TTO datasets.
Data files and variables:
Below is a description of the variables in each CSV file.
- tto.csv:
config: 6 numbers representing the attribute levels.
position: The number of the asked TTO question.
tto_block: The design block of the TTO question.
answer: The equivalence value indicated by the participant, ranging from 0.1 to 1 in steps of 0.1.
included: Whether the answer was included in the data for the model to generate the value set.
id: Randomized id of the participant.
- dce.csv:
config1: Configuration of the first state in the question.
config2: Configuration of the second state in the question.
position: The number of the asked DCE question.
answer: Whether state 1 or 2 was preferred.
id: Randomized id of the participant.
- weight_final_model.csv:
config: 6 numbers representing the attribute levels.
weight: The weight calculated with the final model.
ciu: The upper bound of the 95% credible interval.
cil: The lower bound of the 95% credible interval.
- coefs_final_model.csv:
name: Name of the coefficient, composed of an abbreviation for the attribute and a level number (abbreviations in the same order as above: health, socrel, finhou, occu, secu, polciv).
value: Continuous, weight on the 0 to 1 scale.
ciu: The upper bound of the 95% credible interval.
cil: The lower bound of the 95% credible interval.
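Because the randomized ids preserve the linkage between the two answer files, TTO and DCE answers of the same participant can be combined. The sketch below uses only the Python standard library and a few made-up rows that mimic the documented column names. The coding of `included` as `1`/`0` is an assumption, since the description does not state how the variable is encoded.

```python
import csv
import io
from collections import defaultdict

# Made-up rows using the documented column names; the real files are
# available on request only.
TTO_CSV = """config,position,tto_block,answer,included,id
222222,1,1,0.8,1,17
123112,2,1,0.4,1,17
311111,1,2,0.9,0,42
"""
DCE_CSV = """config1,config2,position,answer,id
222222,123112,1,1,17
333333,111111,1,1,42
"""


def read_rows(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Keep only TTO answers flagged for the model (assumption: "1" = included).
tto_rows = [row for row in read_rows(TTO_CSV) if row["included"] == "1"]

# Index DCE answers by participant id.
dce_by_id = defaultdict(list)
for row in read_rows(DCE_CSV):
    dce_by_id[row["id"]].append(row)

# Link both answer types through the shared randomized id.
linked = {}
for row in tto_rows:
    pid = row["id"]
    entry = linked.setdefault(pid, {"tto": [], "dce": dce_by_id.get(pid, [])})
    entry["tto"].append(float(row["answer"]))

for pid, entry in linked.items():
    print(pid, entry["tto"], len(entry["dce"]))
```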
https://www.nist.gov/open/license
This data set is captured from a robot workcell that is performing activities representative of several manufacturing operations. The workcell contains two, 6-degree-of-freedom robot manipulators where one robot is performing material handling operations (e.g., transport parts into and out of a specific work space) while the other robot is performing a simulated precision operation (e.g., the robot touching the center of a part with a tool tip that leaves a mark on the part). This precision operation is intended to represent a precise manufacturing operation (e.g., welding, machining). The goal of this data set is to provide robot level and process level measurements of the workcell operating in nominal parameters. There are no known equipment or process degradations in the workcell. The material handling robot will perform pick and place operations, including moving simulated parts from an input area to in-process work fixtures. Once parts are placed in/on the work fixtures, the second robot will interact with the part in a specified precise manner. In this specific instance, the second robot has a pen mounted to its tool flange and is drawing the NIST logo on a surface of the part. When the precision operation is completed, the material handling robot will then move the completed part to an output. This suite of data includes process data and performance data, including timestamps. Timestamps are recorded at predefined state changes and events on the PLC and robot controllers, respectively. Each robot controller and the PLC have their own internal clocks and, due to hardware limitations, the timestamps recorded on each device are relative to their own internal clocks. All timestamp data collected on the PLC is available for real-time calculations and is recorded. The timestamps collected on the robots are only available as recorded data for post-processing and analysis. 
The timestamps collected on the PLC correspond to 14 part state changes throughout the processing of a part. Timestamps are recorded when PLC-monitored triggers are activated by internal processing (PLC trigger origin) or after the PLC receives an input from a robot controller (robot trigger origin). Records generated from PLC-originated triggers include parts entering the work cell, assignment of robot tasks, and parts leaving the work cell. PLC-originating triggers are activated by either internal algorithms or sensors which are monitored directly in the PLC Inputs/Outputs (I/O). Records generated from a robot-originated trigger include when a robot begins operating on a part, when the task operation is complete, and when the robot has physically cleared the fixture area and is ready for a new task assignment. Robot-originating triggers are activated by PLC I/O. Process data collected in the workcell are the variable pieces of process information. This includes the input location (single option in the initial configuration presented in this paper), the output location (single option in the initial configuration presented in this paper), the work fixture location, the part number counted from startup, and the part type (task number for drawing robot). Additional information on the context of the workcell operations and the captured data can be found in the attached files, which includes a README.txt, along with several noted publications. Disclaimer: Certain commercial entities, equipment, or materials may be identified or referenced in this data, or its supporting materials, in order to illustrate a point or concept. Such identification or reference is not intended to imply recommendation or endorsement by NIST; nor does it imply that the entities, materials, equipment or data are necessarily the best available for the purpose. The user assumes any and all risk arising from use of this dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data collection contains the tabular data, R scripts and methods used to generate three indicators specific to vascular plants for the NSW Biodiversity Indicator Program's first assessment (prior to the date of commencement of the Biodiversity Conservation Act 2016): 1.2a expected survival of all known species; 2.1a within-species genetic diversity (for all known species); 2.1b extant area occupied (for all known species). These indicators use representative species sets (provided in a related data collection). The habitat condition indicators (related data collections) are used to infer reduction in geographic range size. These indicators are an application of the ‘expected diversity’ framework. Reduction in the geographic range size of a species due to habitat loss, alteration and fragmentation is well known to decrease within-species genetic diversity and increase extinction risk. Therefore, current range size and proportion of range lost from habitat loss, alteration and fragmentation were estimated for vascular plant species known to occur naturally in New South Wales. The area of effective habitat (i.e. high quality habitat able to support biodiversity) remaining for each species was estimated from two alternative habitat condition indicators (Love et al. 2020): ecological condition of terrestrial habitat and ecological carrying capacity of terrestrial habitat. Because most species in New South Wales have not been formally assessed for possible threatened status (i.e. at heightened risk of extinction), a provisional risk assessment using a limited set of criteria was completed for all NSW vascular plant species for which adequate data were available from the Atlas of Living Australia. For consistency with IUCN recommended Red List methods, the expected survival of all known species uses area of occupancy within 2km grids to classify all species into four categories: lowest risk, lower risk, higher risk and highest risk. 
Each category was assigned a probability of survival, allowing the proportion of NSW vascular plant species expected to survive in 100 years to be estimated. Extrapolating trends in the rate of biodiversity loss requires that the list of species used in analyses are representative of the overall biodiversity of New South Wales. A subset of NSW vascular plant species that uniformly represent the full variety of natural habitats for vascular plants in New South Wales (called the representative species set) was selected to represent all vascular plant species, including those yet to be discovered. Ecological environments defined by a generalised dissimilarity model of vascular plants were used as a surrogate for the variety of natural habitats. Based on the proportion of remaining effective habitat in each species’ original range, within-species genetic diversity is also estimated. A range of values is given because each species will respond to loss of range size differently, depending on factors like dispersal ability and degree of adaptation to local environmental conditions, and these differences are not precisely known. The data and scripts provided in the data collection will allow the pre-commencement analyses of these indicators to be re-run. The method as applied in the scripts is designed to allow future iterations of the indicators to be run using updated input data. Guidelines on how to re-run the analyses using the scripts and adapt the data package for future iterations of the indicators is provided in the implementation report (Nipperess DA, Faith DP, Williams KJ, King D, Manion G, Ware C, Schmidt R, Love J, Drielsma M, Allen S & Gallagher R 2020. Expected survival and state of all known species, first assessment. Department of Planning, Industry and Environment NSW, Sydney, Australia.). The relevant guidelines extracted from that report are provided with this data package. 
Lineage: This Indicator uses a representative sample of vascular plant species (data and method of derivation described in a separate data collection - see ‘related links’ Representative species sets for vascular plants generated for the Biodiversity Indicator Program, first assessment: expected survival and state of all known species - supplementary data package) to derive three indicators: 1.2a expected survival; 2.1a within-species genetic diversity; 2.1b extant area occupied. Expected survival estimates extinction risk of all biodiversity (both known and undiscovered species) beyond those formally assessed by the NSW Scientific Threatened Species Committee. Species from a biological (i.e. taxonomic) group, in this case vascular plants, are sampled to uniformly represent the full range of natural habitats for that group. The representative species are provisionally assigned to risk of extinction categories based on the estimated proportion of their original habitat that remains intact. This is a limited, provisional assessment of risk using commonly available species occupancy data. The method uses species occurrence observations since 1950 in 2km map grids (each being 4km2) and area of occupancy (AOO) thresholds specified by the two criteria to discriminate four risk of extinction categories. Each species is further assessed for a reduction in AOO determined from the ecological condition indicator as a measure of habitat condition and, for comparison, the ecological carrying capacity measure. The reduction in AOO in four classes (<30%, 30-50%, 50-80% and > 80%), and the AOO thresholds (also using ecological condition) provide the dimensions of the risk categorisation. Each category is given a probability of survival which is applied to all representative species in that category. 
The Indicator is calculated by summing the probabilities of survival for the representative species across all categories and is expressed as a proportion of the total number of species representing the biological group. This serves as an indicator for all known species within the biological group expected to survive in 100 years and, by logic, extends to undiscovered species in that group. Change in the value of the Indicator reflects a change in survival probability due to a change in habitat condition. If sufficient habitat is lost or degraded for a particular species, its extinction risk category will also change. The AOO data are used to estimate the proportion of within-species genetic diversity that still exists and extant area occupied, after considering loss of suitable habitats. Genetic diversity is inferred from species diversity using geographic range and occupancy. A power curve relates the intact fraction of a species’ AOO to the respective fraction of genetic diversity remaining. Two forms of the curve are used: one that simulates spatially high genetic diversity due to high rates of population divergence and the other low. The two curves equate to an upper and lower estimate of fractional within-species genetic diversity. The Indicator is calculated by separately summing the upper and lower fractions of genetic diversity remaining for all species representing the biological group. This serves as an indicator of within-species genetic diversity for all known species within the biological group and, by logic, extends to undiscovered species in that group. It is also used to show the variation in genetic diversity loss across the categories of species survival for indicator 1.2a. The extant area occupied by all known species is the average fraction of original habitat occupied by the representative species. It is also used to show variation in reductions in AOOs across the categories of species survival for indicator 1.2a. 
Change in the value of the Indicators reflects a change in habitat condition. Details are given in the explanatory notes attached with this package.
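The three indicator calculations described in the lineage can be sketched as follows. This is an illustrative outline only: the survival probabilities per risk category and the two power-curve exponents are hypothetical placeholders, not the values used in the implementation report, and normalising the summed genetic diversity fractions by the number of species is an assumption made here so that the result reads as a proportion.

```python
from dataclasses import dataclass

# Hypothetical survival probabilities per risk category (placeholders).
SURVIVAL_PROBABILITY = {
    "lowest risk": 0.99,
    "lower risk": 0.9,
    "higher risk": 0.5,
    "highest risk": 0.1,
}

# Hypothetical exponents for the two forms of the power curve (placeholders).
CURVE_EXPONENTS = (0.1, 0.5)


@dataclass
class Species:
    name: str
    risk_category: str
    intact_fraction: float  # fraction of original AOO that remains effective


def expected_survival(species: list[Species]) -> float:
    """Sum of survival probabilities as a proportion of all species (1.2a)."""
    total = sum(SURVIVAL_PROBABILITY[s.risk_category] for s in species)
    return total / len(species)


def genetic_diversity_range(species: list[Species]) -> tuple[float, float]:
    """Lower and upper estimate of remaining genetic diversity (2.1a)."""
    estimates = []
    for exponent in CURVE_EXPONENTS:
        # Power curve: fraction of diversity = (intact fraction) ** exponent.
        summed = sum(s.intact_fraction ** exponent for s in species)
        estimates.append(summed / len(species))
    return min(estimates), max(estimates)


def extant_area_occupied(species: list[Species]) -> float:
    """Average fraction of original habitat still occupied (2.1b)."""
    return sum(s.intact_fraction for s in species) / len(species)


# Made-up representative species set.
sample = [
    Species("species A", "lowest risk", 1.0),
    Species("species B", "lower risk", 0.81),
    Species("species C", "higher risk", 0.49),
    Species("species D", "highest risk", 0.16),
]

print(round(expected_survival(sample), 4))
print(tuple(round(x, 4) for x in genetic_diversity_range(sample)))
print(round(extant_area_occupied(sample), 4))
```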
Access to up-to-date socio-economic data is a widespread challenge in Tonga and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 as a way to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details. For Tonga, after two rounds of data collection in 2022, monthly HFPS data collection commenced in April 2023 and continued until November 2024 (with some gaps in the months of collection). The survey collected socio-economic data on topics including employment, income, food security, health, food prices, assets and well-being. Each month of collection has approximately 415 households in the sample and is representative of urban and rural areas. This dataset contains combined monthly survey data for all months of the continuous HFPS in Tonga.
National urban and rural areas (5 islands): Tongatapu, Vava'u, Ha'apai, Eua, Ongo Niua
Individual and household.
Sample survey data [ssd]
The Tonga High Frequency Phone Survey (HFPS) monthly sample was generated in three ways. The first method is a Random Digit Dialing (RDD) process covering all cell telephone numbers active at the time of the sample selection. The RDD methodology generates virtually all possible telephone numbers in the country under the national telephone numbering plan and then draws a random sample of numbers. This method guarantees full coverage of the population with a phone.
First, a large first-phase sample of cell phone numbers was selected and screened through an automated process to identify the active numbers. Then, a smaller second-phase sample was selected from the active residential numbers identified in the first-phase sample and was delivered to the data collection team to be called by the interviewers. When a cell phone was called, the call answerer was interviewed as long as he or she was 18 years of age or above and knowledgeable about the household activities.
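The two-phase RDD selection described above can be sketched as below. The prefixes, the number length, and the `is_active` check are hypothetical stand-ins: the real numbering plan and the automated screening process are not specified in this description.

```python
import random
from typing import Callable, Iterator


def all_numbers(prefixes: list[str], subscriber_digits: int) -> Iterator[str]:
    """Enumerate every possible number under a (hypothetical) numbering plan."""
    for prefix in prefixes:
        for n in range(10 ** subscriber_digits):
            yield f"{prefix}{n:0{subscriber_digits}d}"


def two_phase_sample(
    prefixes: list[str],
    subscriber_digits: int,
    is_active: Callable[[str], bool],
    n_first: int,
    n_second: int,
    seed: int = 0,
) -> tuple[list[str], list[str], list[str]]:
    """Draw a large first-phase sample, screen it, then subsample active numbers."""
    rng = random.Random(seed)
    frame = list(all_numbers(prefixes, subscriber_digits))
    first_phase = rng.sample(frame, n_first)
    # Screening step: keep only numbers identified as active.
    active = [number for number in first_phase if is_active(number)]
    # Second phase: a smaller random sample handed to the interviewers.
    second_phase = rng.sample(active, min(n_second, len(active)))
    return first_phase, active, second_phase


# Toy setup: two made-up prefixes, 3 subscriber digits, and a stand-in
# activity check that treats every third number as active.
first, active, second = two_phase_sample(
    prefixes=["77", "88"],
    subscriber_digits=3,
    is_active=lambda number: int(number) % 3 == 0,
    n_first=200,
    n_second=20,
)
print(len(first), len(active), len(second))
```

In practice the frame would be far too large to materialise as a list; the list is used here only to keep the sketch short.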
It was initially planned to stratify the sample by island group based on the phone number prefixes. However, this was not feasible given the high internal migration across islands and the atypical assignment of phone number prefixes across islands in Tonga. The raw sample overrepresents urban areas and the population of Tongatapu.
Computer Assisted Telephone Interview [cati]
The questionnaire was developed in both English and Tongan and can be found in this documentation in Excel format. Sections of the Questionnaire are provided below:
1. Interview information and Basic information
2. Household roster
3. Labor
4. Food security and food prices
5. Household income
6. Agriculture
7. Social protection
8. Access to services
9. Assets
10. Education
11. Follow up
At the end of data collection, the raw dataset was cleaned by the survey firm and the World Bank team. Data cleaning mainly included formatting, relabeling, and excluding survey monitoring variables (e.g., interview start and end times). Data was edited using the software Stata.
A data set of cross-nationally comparable microdata samples for 15 Economic Commission for Europe (ECE) countries (Bulgaria, Canada, Czech Republic, Estonia, Finland, Hungary, Italy, Latvia, Lithuania, Romania, Russia, Switzerland, Turkey, UK, USA) based on the 1990 national population and housing censuses in countries of Europe and North America to study the social and economic conditions of older persons. These samples have been designed to allow research on a wide range of issues related to aging, as well as on other social phenomena. A common set of nomenclatures and classifications, derived on the basis of a study of census data comparability in Europe and North America, was adopted as a standard for recoding. This series was formerly called Dynamics of Population Aging in ECE Countries. The recommendations regarding the design and size of the samples drawn from the 1990 round of censuses envisaged: (1) drawing individual-based samples of about one million persons; (2) progressive oversampling with age in order to ensure sufficient representation of various categories of older people; and (3) retaining information on all persons co-residing in the sampled individual's dwelling unit. Estonia, Latvia and Lithuania provided the entire population over age 50, while Finland sampled it with progressive over-sampling. Canada, Italy, Russia, Turkey, UK, and the US provided samples that had not been drawn specially for this project, and cover the entire population without over-sampling. Given its wide user base, the US 1990 PUMS was not recoded. Instead, PAU offers mapping modules, which recode the PUMS variables into the project's classifications, nomenclatures, and coding schemes.
Because of the high sampling density, these data cover various small groups of older people; contain as much geographic detail as possible under each country's confidentiality requirements; include more extensive information on housing conditions than many other data sources; and provide information for a number of countries whose data were not accessible until recently.
Data Availability: Eight of the fifteen participating countries have signed the standard data release agreement making their data available through NACDA/ICPSR (see links below). Hungary and Switzerland require a clearance to be obtained from their national statistical offices for the use of microdata, however the documents signed between the PAU and these countries include clauses stipulating that, in general, all scholars interested in social research will be granted access. Russia requested that certain provisions for archiving the microdata samples be removed from its data release arrangement. The PAU has an agreement with several British scholars to facilitate access to the 1991 UK data through collaborative arrangements. Statistics Canada and the Italian Institute of statistics (ISTAT) provide access to data from Canada and Italy, respectively.
* Dates of Study: 1989-1992
* Study Features: International, Minority Oversamples
* Sample Size: Approx. 1 million/country
Links:
* Bulgaria (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02200
* Czech Republic (1991), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06857
* Estonia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06780
* Finland (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06797
* Romania (1992), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06900
* Latvia (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/02572
* Lithuania (1989), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03952
* Turkey (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/03292
* U.S. (1990), http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/06219
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This collection contains the data, processes and descriptions of workflows required to produce the representative species sets for vascular plants used in the NSW Biodiversity Indicator Program first assessment. The labels given to the datasets in this collection are defined in the workflow diagram and data links spreadsheet. This is a supplementary dataset that was used as an input to the three derived indicators for vascular plants: 1.2a expected survival of all known species; 2.1a within-species genetic diversity (for all known species); 2.1b extant area occupied (for all known species). Details are given in the explanatory notes attached with this package and the method implementation report (Nipperess DA, Faith DP, Williams KJ, King D, Manion G, Ware C, Schmidt R, Love J, Drielsma M, Allen S & Ware C 2019, Expected survival and state of all known species: Data packages for the Biodiversity Indicator Program, first assessment.) accessed through the NSW Biodiversity Indicator Program website (see related links).
Lineage: Biological data used in this collection are a list of vascular plant species for NSW, generated from all NSW occurrence records (kingdom Plantae) downloaded from the Atlas of Living Australia (https://www.ala.org.au). Taxa were excluded from this list if they were not from a major subclass of vascular plants (Cycadidae, Pinidae, Magnoliidae, or a fern subclass); not a species-level or subspecific taxon; not listed as a valid name in both the Australian Plant Census (https://biodiversity.org.au/nsl/services/APC) and NSW PlantNet (http://plantnet.rbgsyd.nsw.gov.au/); or not listed as native in either the Australian Plant Census or NSW PlantNet. Data cleaning reduced the original list of 16,501 unique names to 5,528 species. All available occurrence records for this set of 5,528 species of vascular plants were downloaded from the Atlas of Living Australia.
Records were then removed if they were not a preserved specimen; if spatial (latitude / longitude) or temporal (year of collection) data were incomplete; if they were a cultivated specimen; if they occurred more than 5 km outside the coastline of Australia; if they were collected prior to 1950; or if their coordinate uncertainty (if known) was greater than or equal to 3000 meters. Cleaning reduced the initial dataset of 7,802,849 records for 5,528 species to 1,243,554 records for 5,506 species. These included occurrences anywhere in Australia, of which 4,859 species were represented by at least one occurrence within NSW (including ACT and Commonwealth properties). A project-specific unique ID was assigned to each species to enable traceability in subsequent analyses. For other datasets within this repository, the process of how they were derived is described in the method implementation report. An existing generalised dissimilarity model (GDM) for vascular plants was used to represent the diversity of ecological environments occupied by vascular plants in New South Wales. This model is based on vascular plant survey data from across south-eastern continental Australia (‘NARCliM Domain’) and used 25 environmental predictors (19 climate, 6 substrate). Representative sets of species to be selected were developed using the .NET Survey Gap Analysis Tool. Diagnostic measures of environmental representation and range size were used to select an optimal number of demand points (1149) which were equated with the names of the species selected as representative. These species were then used as an input to the three derived indicators for vascular plants: 1.2a expected survival of all known species; 2.1a within-species genetic diversity (for all known species); 2.1b extant area occupied (for all known species). Details are given in the explanatory notes attached with this package.
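The occurrence-record cleaning rules listed at the start of this paragraph can be expressed as a single filter function. The sketch below is an illustration under stated assumptions, not the project's R scripts: the field names, the `"PreservedSpecimen"` label, and the precomputed distance outside the coastline are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Occurrence:
    basis_of_record: str                 # assumed label, e.g. "PreservedSpecimen"
    latitude: Optional[float]
    longitude: Optional[float]
    year: Optional[int]
    cultivated: bool
    km_outside_coastline: float          # 0.0 on land (assumed precomputed)
    coordinate_uncertainty_m: Optional[float]  # None when unknown


def keep_record(rec: Occurrence) -> bool:
    """Return True if the record passes every cleaning rule in the text."""
    if rec.basis_of_record != "PreservedSpecimen":
        return False
    # Spatial and temporal data must be complete.
    if rec.latitude is None or rec.longitude is None or rec.year is None:
        return False
    if rec.cultivated:
        return False
    if rec.km_outside_coastline > 5:
        return False
    if rec.year < 1950:
        return False
    # Uncertainty is only checked when it is known.
    if rec.coordinate_uncertainty_m is not None and rec.coordinate_uncertainty_m >= 3000:
        return False
    return True


# Made-up records to exercise the rules.
good = Occurrence("PreservedSpecimen", -33.8, 151.2, 1987, False, 0.0, 250.0)
too_old = Occurrence("PreservedSpecimen", -33.8, 151.2, 1949, False, 0.0, 250.0)
too_vague = Occurrence("PreservedSpecimen", -33.8, 151.2, 1987, False, 0.0, 3000.0)
unknown_uncertainty = Occurrence("PreservedSpecimen", -33.8, 151.2, 1987, False, 0.0, None)

records = [good, too_old, too_vague, unknown_uncertainty]
cleaned = [rec for rec in records if keep_record(rec)]
print(len(cleaned))  # 2
```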
Access to up-to-date socio-economic data is a widespread challenge in Vanuatu and other Pacific Island Countries. To increase data availability and promote evidence-based policymaking, the Pacific Observatory provides innovative solutions and data sources to complement existing survey data and analysis. One of these data sources is a series of High Frequency Phone Surveys (HFPS), which began in 2020 to monitor the socio-economic impacts of the COVID-19 Pandemic, and since 2023 has grown into a series of continuous surveys for socio-economic monitoring. See https://www.worldbank.org/en/country/pacificislands/brief/the-pacific-observatory for further details.
For Vanuatu, data were collected from December 2023 to January 2025, with approximately 1,000 households in the sample each month. The sample is representative of urban and rural areas but is not representative at the province level. This dataset contains combined monthly survey data for all months of the continuous HFPS in Vanuatu. There is one data file for household-level data, with a unique household ID, and a separate file for individual-level data within each household. The individual file can be matched to the household file using the household ID, and it also has a unique individual ID within the household, which can be used to track individuals over time within households where the data are panel data.
National, urban and rural. Six provinces were covered by this survey: Sanma, Shefa, Torba, Penama, Malampa and Tafea.
Household and individuals.
Sample survey data [ssd]
The Vanuatu High Frequency Phone Survey (HFPS) sample is drawn from the list of customer phone numbers (MSISDNs) provided by Digicel Vanuatu, one of the country’s two main mobile providers. Digicel’s customer base spans all regions of Vanuatu. For the initial data collection, Digicel filtered their MSISDN database to ensure a representative distribution across regions. Recognizing the challenge of reaching low-income respondents, Digicel also included low-income areas and customers with a low-income profile (defined by monthly spending between 50 and 150 VT), as well as those with only incoming calls or using the IOU service without repayment. These filtered lists were then randomized, and enumerators began calling the numbers.
This approach was used to complete the first round of 1,000 interviews. The respondents from this first round formed a panel to be surveyed monthly. Each month, phone numbers from the panel are contacted until all have been interviewed, at which point new phone numbers (fresh MSISDNs from Digicel’s database) are used to replace those that have been exhausted. These new respondents are then added to the panel for future surveys.
Computer Assisted Telephone Interview [cati]
The questionnaire was developed in both English and Bislama. Sections of the Questionnaire:
-Interview Information
-Household Roster (separate modules for new households and returning households)
-Labor (separate modules for new households and returning households)
-Food Security
-Household Income
-Agriculture
-Social Protection
-Access to Services
-Assets
-Perceptions
-Follow-up
At the end of data collection, the raw dataset was cleaned by the survey firm and the World Bank team. Data cleaning mainly included formatting, relabeling, and excluding survey monitoring variables (e.g., interview start and end times). Data was edited using the software STATA.
The data are presented in two datasets: a household dataset and an individual dataset. The total number of observations is 13,779 in the household dataset and 77,501 in the individual dataset. The individual dataset contains information on individual demographics and labor market outcomes of all household members aged 15 and above, and the household data set contains information about household demographics, education, food security, household income, agriculture activities, social protection, access to services, and durable asset ownership. The household identifier (hhid) is available in both the household dataset and the individual dataset. The individual identifier (hhid_mem) can be found in the individual dataset.
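The linkage between the two files can be sketched in a few lines of Python. This is a minimal illustration on toy records, not the survey team's procedure: only the identifier names `hhid` and `hhid_mem` come from the description above, while the `month` key, the other fields, and the helper names are assumptions introduced for the example. Because the files combine several monthly rounds, a round key is assumed to be needed alongside `hhid` when matching.

```python
# Toy records. Only the identifier names hhid and hhid_mem come from the
# dataset description; "month" and the other fields are hypothetical.
households = [
    {"hhid": 101, "month": "2024-11", "urban": 1},
    {"hhid": 101, "month": "2024-12", "urban": 1},
    {"hhid": 102, "month": "2024-11", "urban": 0},
]
individuals = [
    {"hhid": 101, "hhid_mem": 1, "month": "2024-11", "age": 34},
    {"hhid": 101, "hhid_mem": 2, "month": "2024-11", "age": 17},
    {"hhid": 101, "hhid_mem": 1, "month": "2024-12", "age": 34},
    {"hhid": 102, "hhid_mem": 1, "month": "2024-11", "age": 52},
]


def attach_household(individuals, households, keys=("hhid", "month")):
    """Merge household fields into each individual row that has a match."""
    lookup = {tuple(h[k] for k in keys): h for h in households}
    merged = []
    for person in individuals:
        household = lookup.get(tuple(person[k] for k in keys))
        if household is not None:
            # Individual fields win if a field name exists in both rows.
            merged.append({**household, **person})
    return merged


def track_member(rows, hhid, hhid_mem):
    """Return all rows for one household member, in input order."""
    return [r for r in rows if r["hhid"] == hhid and r["hhid_mem"] == hhid_mem]


merged = attach_household(individuals, households)
history = track_member(merged, 101, 1)
```

Here `history` holds the two monthly records of member 1 in household 101, which is the kind of within-household tracking the individual identifier is meant to support.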
In November 2024, a total of 7,874 calls were made. Of these, 2,251 calls were successfully connected, and 1,000 respondents completed the survey. By February 2024, the sample was fully comprised of returning respondents, with a re-contact rate of 99.9 percent.
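The call figures above imply a few simple rates. The short calculation below uses only the three counts quoted for November 2024; the rate names are labels chosen for this example and are not terms defined by the survey documentation.

```python
# Counts quoted in the text for November 2024.
calls_made = 7874
calls_connected = 2251
surveys_completed = 1000

# Share of dialled calls that connected.
connection_rate = calls_connected / calls_made
# Share of connected calls that ended in a completed interview.
completion_rate = surveys_completed / calls_connected
# Completed interviews per call attempt.
overall_yield = surveys_completed / calls_made

print(f"connection rate: {connection_rate:.1%}")
print(f"completion rate: {completion_rate:.1%}")
print(f"overall yield:   {overall_yield:.1%}")
```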
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context: This data set originates from a practice-relevant degradation process, which is representative of Prognostics and Health Management (PHM) applications. The observed degradation process is the clogging of filters when separating solid particles from gas. A test bench is used for this purpose, which performs automated life testing of filter media by loading them. For testing, dust complying with ISO standard 12103-1 and with a known particle size distribution is employed. The employed filter media is made of randomly oriented non-woven fibre material. Further data sets are generated for various practice-relevant data situations which do not correspond to the ideal conditions of full data coverage. These data sets are uploaded to Kaggle by the user "Prognostics @ HSE" in a continuous process. In order to avoid carryover between two data sets, a different configuration of the filter tests is used for each uploaded practice-relevant data situation, for example by selecting a different filter media.
Detailed specification: For more information about the general operation and the components used, see the provided description file Random Recording Condition Data Data Set.pdf
Given data situation: In order to implement a predictive maintenance policy, knowledge of the time of failure or of the remaining useful life (RUL) of the technical system is necessary. The time of failure or the RUL can be predicted on the basis of condition data that indicate the damage progression of a technical system over time. However, the collection of condition data in typical industrial PHM applications is often only possible in an incomplete manner. An example is the collection of data during defined test cycles with specific loads, carried out at intervals. For instance, this approach is often used with machining centers, where test cycles are only carried out between finished machining jobs or work shifts. Due to different work pieces, the machining time varies and the test cycle with the recording of condition data is not performed equidistantly. This results in a data characteristic that is comparable to a random sample of continuously recorded condition data. Another example that may result in such a data characteristic comes from the effort to reduce data volumes when recording condition data. Attempts can be made to keep the amount of data with unchanged damage as small as possible. One possible measure is not to transmit and store the continuous sensor readings, but rather sections of them, which also leads to gaps in the data available for prognosis. In the present data set, the life cycle of filters or rather their condition data, represented by the differential pressure, is considered. Failure of the filter occurs when the differential pressure across the filter exceeds 600 Pa. The time until a filter failure occurs depends especially on the amount of dust supplied per unit time, which is constant within a run-to-failure cycle. The previously explained data characteristics are addressed by means of corresponding training and test data. The training data is structured as follows: A run-to-failure cycle contains n batches of data. 
The number n varies between the cycles and depends on the duration of the batches and the time interval between the individual batches. The duration and time interval of the batches are random variables. A data batch includes the sensor readings of differential pressure and flow rate for the filter, the start and end time of the batch, and RUL information related to the end time of the batch. The sensor readings of the differential pressure and flow rate are recorded at a constant sampling rate. Figure 6 shows an illustrative run-to-failure cycle with multiple batches. The test data are randomly right-censored. They are also made of batches with a random duration and time interval between the batches. For each batch contained, the start and end time are given, as well as the sensor readings within the batch. The RUL is not given for each batch but only for the last data point of the right-censored run-to-failure cycle.
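The relationship between the 600 Pa failure criterion and the RUL attached to each batch can be illustrated with a small sketch. It assumes a complete, continuously recorded run-to-failure series is available, which is exactly what the data set withholds, so it only shows how the labels are defined and is not a prediction method. The toy readings, the batch tuples, and the function names are hypothetical; only the 600 Pa threshold comes from the description.

```python
FAILURE_THRESHOLD_PA = 600.0  # failure criterion stated in the description


def failure_time(times, pressures, threshold=FAILURE_THRESHOLD_PA):
    """Return the first time at which the pressure exceeds the threshold."""
    for t, p in zip(times, pressures):
        if p > threshold:
            return t
    return None  # the cycle never reached failure (e.g. right-censored)


def rul_at(batch_end, t_fail):
    """Remaining useful life at the end of a batch, never negative."""
    return max(t_fail - batch_end, 0.0)


# Hypothetical, fully observed cycle (time in seconds, pressure in Pa).
times = [0, 10, 20, 30, 40, 50]
pressures = [100.0, 180.0, 290.0, 430.0, 560.0, 640.0]

# Hypothetical batches as (start, end) times with gaps between them.
batches = [(0, 10), (30, 40)]

t_fail = failure_time(times, pressures)
rul_per_batch = [rul_at(end, t_fail) for _, end in batches]
```

With these toy values the filter fails at t = 50, so the two batches carry RUL labels of 40 and 10 at their end times.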
Task: The aim is to predict the RUL of the censored filter test cycles given in the test data. In order to predict the RUL, training and test data are given, consisting of 60 and 40 run-to-failure cycles, respectively. The test data contains random right-censored run-to-failure cycles and the respective RUL for the prediction task. The main challenge is to make the best use of the incompletely recorded training and test data to provide the most accurate prediction possible. Due to the detailed description of the setup and the various physical filter models described in literature, it is possible to support the actual data-driven models by integrating physical knowledge or models in the sense of theory-guided data science or informed machi...
This fifth cycle of the NMIS focuses on care for women during pregnancy and delivery and the relationship between this and the outcome of pregnancy, in terms of estimates of low birth weight and survival of the baby. It is timely in view of the recent publication of the National Maternity Care Guidelines for Nepal. It is intended to provide information on the current situation in relation to the targets set in the guidelines and some insights about what might help to improve matters, for use by service planners and providers at national and local levels. No attempt was made to estimate maternal mortality. Estimates of maternal mortality in Nepal are available from other sources.
The NMIS employs Sentinel Community Surveillance (SCS). Features of this method include: the focus of each cycle on a small group of issues; the combination of quantitative and qualitative data from the same communities in a meso-analysis; data analysis and risk analysis to produce results in a form useful for planning; revisiting of the same sites, making estimation of impact of interventions easier.
National; urban/rural areas; development regions; ecological zones; eco-development regions.
Household and ever married woman aged 15-49 years
Households and ever married women aged 15-49 years
Sample survey data [ssd]
The NMIS uses a methodology known as Sentinel Community Surveillance (SCS). It has the underlying aim of 'building the community voice into planning'. SCS can be described as a multi-sectoral community-based information management system. There are a number of particular features of the SCS methodology:
Transfer of skills of data collection, analysis and communication over a number of cycles is an explicit aim of the methodology.
A key feature of SCS is the ability to do risk analysis to look at causes. NMIS cycle five focuses on care for women during pregnancy and delivery and the relationship between this and the outcome of pregnancy, in terms of estimates of low birth weight and survival of the baby. SCS is deliberately designed to concentrate data collection efforts: in time (a series of cycles in the sentinel sites, at approximately 6 monthly intervals); in space (representative communities are surveyed rather than collecting data from all communities); and in subject matter (each cycle focuses on one area at a time, rather than trying to collect all possible data on every occasion). SCS employs a type of cluster survey methodology, but the clusters are larger than in many cluster surveys: typically 100-120 households per site, rather than the 10-50 used in most cluster surveys. And in the SCS method, there is no sampling within each site; every household is included. This gives greater statistical power in the data analysis and also allows the linkage of data from the household questionnaires to other, mainly qualitative, data from the same sites. These data relating to the whole site are combined with the household data in a meso-analysis.
A key issue in the SCS methodology and in the NMIS is the selection of sites so as to be representative. In some countries, random sampling is not a possibility because no adequate sampling frame exists. In these situations, purposive selection is used, drawing on local knowledge of conditions to choose sites as representative as possible of the situation in a district, region or country. When possible, random sampling methods are used and this is the case in Nepal, where a reasonably good census sampling frame exists. In both cases, stratification is first used to ensure that certain types of sites are included in proportion to their occurrence in the population. For example, stratification can be by urban and rural sites, or by ecological zones. In the NMIS, the sample sites were drawn by the Central Bureau of Statistics (CBS), after stratification into development regions, ecological zones and urban/rural sites. The details of the sampling method and the selected sites are given in the report of the first NMIS cycle and the annexes to that report.
The sites in NMIS cycle 5 are selected by a multistage random sampling method. The sites are representative of the country, of the five development regions, of the three ecological zones, of the 15 eco-development regions, and of urban and rural situations. The rural sites were selected primarily to give representation of the 15 eco-development regions but in 18 districts there are sufficient sites (four or more) to ensure reasonable district representativeness. In a further 19 districts, only 1-2 sites were selected so they cannot be relied upon to be representative of that district. Note that representation of the 15 eco-development regions is among the rural sites only; the urban sites are stratified separately and are not intended to be part of the representation of the different eco-development regions. This reflects the high proportion of the population living in rural communities (around 90%) and the difficulty of having a large enough urban sample to stratify separately among the 15 eco-development regions.
There are a total of 144 sites in the sample: 126 rural and 18 urban. A total of 18,996 households and 106,160 household members were interviewed in the survey.
Face-to-face [f2f]
The following instruments were used for data collection for NMIS Five cycle surveillance:
The questionnaire and guides for interview were published in Nepali language. An English version has been provided in the Report on Care during Pregnancy and Delivery: Implications for Protecting the Health of Mothers and their Babies, Fifth Cycle (JUNE 1998).
Data editing took place at a number of stages throughout the processing, including:
The household data were entered twice and validated using Epi Info.
A total of 18,996 households were visited in 144 sites. Information was available for 18,653 households (99%). Only 1% of households refused the interview. The total population in the households interviewed is 106,160 people. More detailed information was collected from ever-married women aged 15-49 years: a total of 19,557 women. They reported on their last pregnancy and data on a total of 17,609 pregnancies were collected.
Standard deviations and 95% Confidence Intervals were calculated for specific variables. These estimates are provided in the Report on Care during Pregnancy And Delivery: Implications for Protecting the Health of Mothers and their Babies, Fifth Cycle (1998).
The data collection instruments were piloted several times to ensure that they were appropriate to the households, health facility workers and focus groups concerned and that the coding and data entry arrangements were satisfactory.
This dataset (TOVSAMNG) contains the TIROS Operational Vertical Sounder (TOVS) level 3 geophysical parameters derived using data from NOAA-10 and the physical retrieval method of Susskind et al. (1984) and processed by the Satellite Data Utilization Office of the Goddard Laboratory for Atmospheres at NASA/GSFC. This method, which is hydrodynamic model- and a priori data-dependent, is designated as the so-called Path A scheme by the TOVS Pathfinder Science Working Group. The 20 channel High resolution Infrared Radiation Sounder 2 (HIRS2) and the 4 channel Microwave Sounding Unit (MSU) aboard the NOAA-xx series of Polar Orbiting Satellites are used to produce global fields of the 3-dimensional temperature-moisture structure of the atmosphere. In addition to profiles of temperature and moisture, the HIRS2/MSU data are used to derive important quantities such as land and sea surface temperature, outgoing longwave radiation, cloud fraction, cloudtop height, total ozone overburden and precipitation estimates.
The Path A system steps through an interactive forecast-retrieval-analysis cycle. In each 6 hour synoptic period, a 2nd order General Circulation Model (Takacs et al., 1994) is used to generate the 6 hour forecast fields of temperature and humidity. These global fields are used as the first guess for all soundings occurring within a 6 hour time window centered upon the forecast time. These retrievals are then assimilated with all available in situ measurements (such as radiosonde and ship reports) in the 6 hour interval using an Optimal Interpolation (OI) analysis scheme developed by the Data Assimilation Office of the Goddard Laboratory for Atmospheres. This analysis is then used to specify the initial conditions for the next 6 hour forecast, thus completing the cycle.
The retrieval algorithm itself is a physical method based on the iterative relaxation technique originally proposed by Chahine (1968). The basic approach consists of modifying the temperature profile from the previous iteration by an amount proportional to the difference between the observed brightness temperatures and the brightness temperatures computed from the trial parameters using the full radiative transfer equation applied at the observed satellite zenith angle. For the case of the temperature profile, the updated layer mean temperatures are given as a linear combination of multichannel brightness temperature differences with the coefficients given by the channel weighting functions. Constraints are imposed upon the solution in order to ensure stability and convergence of the iterative process. For more details see Susskind et al. (1984).
These Level 3 monthly mean products are in the netCDF format. Each data set is representative of a different monthly average time period and for one of nine satellites. All files contain the same number of geophysical parameter arrays with the AM and PM portions of the orbits treated separately. All data are mapped to a 1 degree longitude by 1 degree latitude global grid.
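The relaxation idea described above can be illustrated with a deliberately simplified toy. This is not the Path A algorithm: the full radiative transfer equation is replaced by an assumed linear forward model in which each channel's brightness temperature is a weighted mean of three layer temperatures, and the weighting-function matrix, the temperatures, and all function names are hypothetical. The sketch only shows the structure of the update, in which each layer temperature is corrected by a weighting-function-weighted combination of the channel residuals.

```python
def forward(weights, temps):
    """Toy forward model: brightness temperature per channel."""
    return [sum(w * t for w, t in zip(row, temps)) for row in weights]


def relax(weights, observed, first_guess, n_iter=200):
    """Iteratively correct layer temperatures from channel residuals."""
    temps = list(first_guess)
    n_channels, n_layers = len(weights), len(temps)
    # Normalise the correction for each layer by its total channel weight.
    col_sums = [sum(row[j] for row in weights) for j in range(n_layers)]
    for _ in range(n_iter):
        computed = forward(weights, temps)
        residuals = [o - c for o, c in zip(observed, computed)]
        temps = [
            temps[j]
            + sum(weights[i][j] * residuals[i] for i in range(n_channels))
            / col_sums[j]
            for j in range(n_layers)
        ]
    return temps


# Hypothetical weighting functions: rows are channels, columns are layers.
W = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
]
true_temps = [220.0, 250.0, 285.0]   # "truth" used to simulate observations
observed = forward(W, true_temps)
first_guess = [240.0, 240.0, 240.0]  # stands in for the forecast first guess

retrieved = relax(W, observed, first_guess)
```

In this noise-free linear toy the iteration converges to the temperatures used to simulate the observations. A real retrieval would need the constraints mentioned in the text to stay stable with noisy radiances and a nonlinear forward model.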
Since the beginning of the 1960s, Statistics Sweden, in collaboration with various research institutions, has carried out follow-up surveys in the school system. These surveys have taken place within the framework of the IS project (Individual Statistics Project) at the University of Gothenburg and the UGU project (Evaluation through follow-up of students) at the University of Teacher Education in Stockholm, which since 1990 have been merged into a research project called 'Evaluation through Follow-up'. The follow-up surveys are part of the central evaluation of the school and are based on large nationally representative samples from different cohorts of students.
Evaluation through follow-up (UGU) is one of the country's largest research databases in the field of education. UGU is part of the central evaluation of the school and is based on large nationally representative samples from different cohorts of students. The longitudinal database contains information on nationally representative samples of school pupils from ten cohorts, born between 1948 and 2004. The sampling process was based on the student's birthday for the first two and on the school class for the other cohorts.
For each cohort, data of mainly two types are collected. School administrative data is collected annually by Statistics Sweden during the time that pupils are in the general school system (primary and secondary school), for most cohorts starting in compulsory school year 3. This information is provided by the school offices and, among other things, includes characteristics of school, class, special support, study choices and grades. Information obtained has varied somewhat, e.g. due to changes in curricula. A more detailed description of this data collection can be found in reports published by Statistics Sweden and linked to datasets for each cohort.
Survey data from the pupils is collected for the first time in compulsory school year 6 (for most cohorts). The year 6 questionnaire includes questions related to self-perception and interest in learning, attitudes to school, hobbies, school motivation and future plans. For some cohorts, questionnaire data are also collected in year 3 and year 9 in compulsory school and in upper secondary school.
Furthermore, results from various intelligence tests and standardized knowledge tests are included in the year 6 data collection. The intelligence tests have been identical for all cohorts (except the cohort born in 1987, from which questionnaire data were first collected in year 9). The intelligence test consists of a verbal, a spatial and an inductive test, each containing 40 tasks and specially designed for the UGU project. The verbal test is a vocabulary test of the opposite type. The spatial test is a so-called ‘sheet metal folding test’ and the inductive test is made up of series of numbers. The reliability of the test, intercorrelations and connection with school grades are reported by Svensson (1971).
For the first three cohorts (1948, 1953 and 1967), the standardized knowledge tests in year 6 consist of the standard tests in Swedish, mathematics and English that up to and including the beginning of the 1980s were offered to all pupils in compulsory school year 6. For the cohort 1972, specially prepared tests in reading and mathematics were used. The test in reading consists of 27 tasks and aimed to identify students with reading difficulties. The mathematics test, which was also offered for the fifth cohort (1977), includes 19 assignments. A revised version of the test was used for the cohort born in 1982, because the previously used test was judged to be somewhat too simple. Results on the mathematics test are not available for the 1987 cohort. The mathematics test was not offered to the students in the 1992 cohort, as the test did not seem to fully correspond with current curriculum intentions in mathematics. For further information, see the description of the dataset for each cohort.
For several of the samples, questionnaires were also collected from the students' parents and teachers in year 6. The teacher questionnaire contains questions about the teacher, class size and composition, the teacher's assessments of the class's knowledge level, etc., school resources, working methods and parental involvement and questions about the existence of evaluations. The questionnaire for the guardians includes questions about the child's upbringing conditions, ambitions and wishes regarding the child's education, views on the school's objectives and the parents' own educational and professional situation.
The students are followed up even after they have left primary school. Among other things, data collection is done during the time they are in high school. School administrative data, such as choice of upper secondary school line / program and grades after completing studies, are then collected. For some of the cohorts, in addition to school administrative data, questionnaire data were also collected from the students.
The sample consisted of students born on the 5th, 15th and 25th of any month in 1953, a total of 10,723 students.
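The birthday-based sampling rule is simple enough to express as a date check. The sketch below is only an illustration of that rule: the function name and the enumeration of the calendar year are assumptions added for the example, while the qualifying days and the birth year come from the text.

```python
from datetime import date, timedelta

SAMPLING_DAYS = {5, 15, 25}  # days of the month named in the description
COHORT_YEAR = 1953


def in_sample(birth_date):
    """True if a birth date falls on one of the sampling days of 1953."""
    return birth_date.year == COHORT_YEAR and birth_date.day in SAMPLING_DAYS


# Enumerate every day of 1953 to see how many birth dates qualify.
first_day = date(COHORT_YEAR, 1, 1)
days_in_year = (date(COHORT_YEAR + 1, 1, 1) - first_day).days
qualifying_days = sum(
    in_sample(first_day + timedelta(days=offset)) for offset in range(days_in_year)
)
sampling_fraction = qualifying_days / days_in_year
```

Three days in each of twelve months gives 36 qualifying birth dates, so the rule selects roughly one birth date in ten, independently of school or region.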
The data obtained in 1966 were: 1. School administrative data (school form, class type, year and grades). 2. Information about the parents' profession and education, number of siblings, the distance between home and school, etc.
This information was collected for 93% of all students born on the days in question. The shortfall is due to reduced resources at Statistics Sweden for follow-up work (reminders, etc.). Annual data for the 1953 cohort were collected by Statistics Sweden up to and including academic year 1972/73.
The response rate for test and questionnaire data is 88%. Standard test results were received for just over 85% of those who took the tests.
The sample included a total of 9955 students, for whom some form of information was obtained.
Part of the "Individual Statistics Project" together with cohort 1953.
https://www.law.cornell.edu/uscode/text/17/106
Medical image analysis is critical to biological studies, health research, computer-aided diagnoses, and clinical applications. Recently, deep learning (DL) techniques have achieved remarkable successes in medical image analysis applications. However, these techniques typically require large amounts of annotations to achieve satisfactory performance. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for medical image analysis while reducing annotation efforts? To address this problem, we have outlined two specific aims: (A1) Utilize existing annotations effectively from advanced models; (A2) extract generic knowledge directly from unannotated images.
To achieve the aim (A1): First, we introduce a new data representation called TopoImages, which encodes the local topology of all the image pixels. TopoImages can be complemented with the original images to improve medical image analysis tasks. Second, we propose a new augmentation method, SAMAug-C, that leverages the Segment Anything Model (SAM) to augment raw image input and enhance medical image classification. Third, we propose two advanced DL architectures, kCBAC-Net and ConvFormer, to enhance the performance of 2D and 3D medical image segmentation. We also present a gate-regularized network training (GrNT) approach to improve multi-scale fusion in medical image segmentation. To achieve the aim (A2), we propose a novel extension of known Masked Autoencoders (MAEs) for self pre-training, i.e., models pre-trained on the same target dataset, specifically for 3D medical image segmentation.
Scientific visualization is a powerful approach for understanding and analyzing various physical or natural phenomena, such as climate change or chemical reactions. However, the cost of scientific simulations is high when factors like time, ensemble, and multivariate analyses are involved. Additionally, scientists can only afford to sparsely store the simulation outputs (e.g., scalar field data) or visual representations (e.g., streamlines) or visualization images due to limited I/O bandwidths and storage space. Therefore, in this dissertation, we seek to address this critical problem: How can we develop efficient and effective DL algorithms for scientific data generation and compression while reducing simulation and storage costs?
To tackle this problem: First, we propose a DL framework that generates unsteady vector fields data from a set of streamlines. Based on this method, domain scientists only need to store representative streamlines at simulation time and reconstruct vector fields during post-processing. Second, we design a novel DL method that translates scalar fields to vector fields. Using this approach, domain scientists only need to store scalar field data at simulation time and generate vector fields from their scalar field counterparts afterward. Third, we present a new DL approach that compresses a large collection of visualization images generated from time-varying data for communicating volume visualization results.
The main purpose of the Household Income Expenditure Survey (HIES) 2016 was to offer high quality and nationwide representative household data that provided information on incomes and expenditure in order to update the Consumer Price Index (CPI), improve National Accounts statistics, provide agricultural data and measure poverty as well as other socio-economic indicators. These statistics were urgently required for evidence-based policy making and monitoring of implementation results supported by the Poverty Reduction Strategy (I & II), the AfT and the Liberia National Vision 2030. The survey was implemented by the Liberia Institute of Statistics and Geo-Information Services (LISGIS) over a 12-month period, starting from January 2016 and was completed in January 2017. LISGIS completed a total of 8,350 interviews, thus providing sufficient observations to make the data statistically significant at the county level. The data captured the effects of seasonality, making it the first of its kind in Liberia. Support for the survey was offered by the Government of Liberia, the World Bank, the European Union, the Swedish International Development Cooperation Agency, the United States Agency for International Development and the African Development Bank. The objectives of the 2016 HIES were:
National
Sample survey data [ssd]
The original sample design for the HIES used two-stage cluster sampling, yielding a nationally representative sample of households in every quarter, drawn from the 2008 National Housing and Population Census sampling frame. The procedures used for each sampling stage are as follows:
i. First stage
Selection of sample EAs. The sample EAs for the 2016 HIES were selected within each stratum systematically with Probability Proportional to Size from the ordered list of EAs in the sampling frame. They were selected separately for each county by urban/rural stratum. The measure of size for each EA was the number of households in the sampling frame of EAs based on the 2008 Liberia Census. Within each stratum the EAs were ordered geographically by district, clan and EA codes. This provided implicit geographic stratification of the sampling frame.
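Systematic selection with probability proportional to size can be sketched as follows. This is a generic textbook version of the technique under stated assumptions, not the procedure LISGIS actually ran: the household counts, the number of EAs drawn, and the function name are hypothetical, and the EAs are assumed to be already ordered geographically within one stratum.

```python
import random


def pps_systematic(sizes, n_sample, rng):
    """Select n_sample unit indices systematically with probability
    proportional to size, walking down the ordered list of units."""
    total = sum(sizes)
    interval = total / n_sample          # sampling interval on the size scale
    start = rng.uniform(0, interval)     # random start within first interval
    selected = []
    cumulative = 0.0                     # total size of units already passed
    idx = 0
    for k in range(n_sample):
        target = start + k * interval
        # Advance until the target falls inside the current unit's range.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= target:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


# Hypothetical household counts for seven EAs in one stratum, in frame order.
ea_households = [120, 80, 200, 50, 150, 100, 300]
chosen = pps_systematic(ea_households, n_sample=3, rng=random.Random(0))
```

Because the selection points are spread evenly along the cumulated size measure, the geographic ordering of the frame carries over into the sample, which is the implicit stratification mentioned above. An EA larger than the sampling interval can be hit more than once in this simple version.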
ii. Second stage
Selection of sample households within a sample EA. A random systematic sample of 10 households was selected from the listing for each sample EA. Using this type of table, the supervisor only has to look up the total number of households listed, and a specific systematic sample of households is identified in the corresponding row of the table.
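The second-stage draw can be sketched as an equal-probability systematic sample from the household listing. The survey used a pre-computed lookup table for this step; the function below is only an assumed equivalent that computes one such row on the fly, and the listing size of 87 households is hypothetical.

```python
import random


def systematic_sample(n_listed, n_sample, rng):
    """Return 1-based listing numbers of a systematic sample of households."""
    if n_listed <= n_sample:
        # Too few households listed: take all of them.
        return list(range(1, n_listed + 1))
    interval = n_listed / n_sample        # fractional sampling interval
    start = rng.uniform(0, interval)      # random start in the first interval
    picks = []
    for k in range(n_sample):
        number = int(start + k * interval) + 1
        picks.append(min(number, n_listed))  # guard against float rounding
    return picks


# Hypothetical EA with 87 listed households; 10 are drawn as in the text.
households_drawn = systematic_sample(87, 10, random.Random(1))
```

A lookup table of the kind mentioned above would simply store the output of such a draw for every possible listing total, so that supervisors do not need to compute anything in the field.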
Face-to-face [f2f]
There were three questionnaires administered for this survey: 1. Household and Individual Questionnaire 2. Market Price Questionnaire 3. Agricultural Recall Questionnaire
The data entry clerk for each team, using data entry software called CSPro, entered data for each household in the field. For each household, an error report was generated on-site, which identified key problems with the data collected (outliers, incorrect entries, inconsistencies with skip patterns, basic filters for age and gender specific questions etc.). The Supervisor along with the Data Entry Clerk and the Enumerator that collected the data reviewed these errors. Callbacks were made to households if necessary to verify information and rectify the errors while in that EA.
Once the data were collected in each EA, they were sent to LISGIS headquarters for further processing along with EA reports for each area visited. The HIES Technical Committee converted the data into STATA, ran several consistency checks to manage overall data quality, prepared reports identifying key problems with the data set, and called the field teams to update them on these problems. Monthly reports were prepared by summarizing observations from data received from the field alongside statistics on data collection status to share with the field teams and LISGIS Management.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes data on the beliefs that a large sample of nationally representative US individuals associate with a set of first names. These beliefs include race, age, education, productivity levels, and noncognitive skills.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Description of the experiment setting: Data collection for the Study of Nutrition and Activity in Childcare Settings (SNACS) started in January 2017 and continued through September 2017. The complex study included web-based surveys, pre-interview surveys, on-site interviews, environmental observations, and telephone interviews of childcare sponsors and providers, as well as interviews of parents of some of the children from the sampled providers. The data were collected from a nationally representative sample of programs, children, and meals. The data cover a range of subjects including the provider’s characteristics, the nutritional quality of meals and snacks served, the dietary intake of children in childcare, the activities of children over the course of the childcare day, and the financial conditions of the childcare operations.
Processing methods and equipment used: SNACS data were collected via web-based surveys, pre-interview surveys, on-site interviews, environmental observations, and telephone interviews of childcare sponsors and providers, as well as interviews of parents of some of the children from the sampled providers. The study team cleaned the raw data to ensure the data were as correct, complete, and consistent as possible. They used many different methods to check the data depending on the data type. The details are described in the study document called “Appendix A: Methods” (https://fns-prod.azureedge.us/sites/default/files/resource-files/SNACS-AppendixA.pdf) available at the study website.
Study date(s) and duration: Data collection for the Study of Nutrition and Activity in Childcare Settings (SNACS) started in January 2017 and continued through September 2017. The final public data set was produced in 2021.
Study spatial scale (size of replicates and spatial scale of study area): The study is nationally representative and the sample design reflects the complexity of the sample needed to answer the research questions. The primary sampling units were 20 states randomly selected with six states selected with certainty due to their size. Secondary sampling units were selected from a random sample of metropolitan areas and clusters of non-metropolitan counties from the 20 States. Further details about the sample design are described in the “Appendix A: Methods” document available at the study website.
Level of true replication: See the document, “Appendix A: Methods,” available at the study website.
Sampling precision (within-replicate sampling or pseudoreplication): See the document, “Appendix A: Methods,” available at the study website.
Level of subsampling (number and repeat or within-replicate sampling): See the document, “Appendix A: Methods,” available at the study website.
Study design (before–after, control–impacts, time series, before–after-control–impacts): Non-experimental
Description of any data manipulation, modeling, or statistical analysis undertaken: The public use data files contain constructed variables used for analytic purposes. The files do include weights created to produce national estimates for the Study of Nutrition and Activity in Childcare Settings final reports available at the study website. The data files do not include any identifying information about childcare sponsors, providers, or individuals who completed the questionnaires or participated in the study in other ways.
Description of any gaps in the data or other limiting factors: See the document, “Appendix A: Methods,” available at the study website for a detailed explanation of the study’s limitations.
Outcome measurement methods and equipment used: The height and weight of sampled children were measured with scales provided by data collectors. See the document, “Appendix A: Methods,” available at the study website for details on other outcomes measured through statistical analysis of the survey responses about outcomes such as food insecurity.
Resources in this dataset:
Resource Title: Study of Nutrition and Activity in Childcare Settings (SNACS) - SAS Data Sets, Data Codebooks and Documentation Guides File Name: SNACS-I Public Use Files.zip Description: The zip file contains 19 Data Codebooks, 7 Data Documentation Guides, 19 SAS Datasets and one SAS Formats File.
https://creativecommons.org/publicdomain/zero/1.0/
According to Connecticut law, no manufacturer, wholesaler, or out-of-state shipper may ship, transport or deliver within Connecticut, or sell or offer for sale, any alcoholic liquor unless the following information is registered with, and approved by, the Connecticut Department of Consumer Protection: The name of the brand, trade name or other distinctive characteristics by which the alcoholic liquors are bought and sold, the name and address of the manufacturer, and the name and address of each wholesaler permittee who is authorized by the manufacturer or his authorized representative to sell such alcoholic liquors. Brand registration is valid for three (3) years. The registration and subsequent renewal fees are payable by the manufacturer or his authorized representative when such liquors are manufactured in the United States and by the importer or his authorized representative when such liquors are imported into the country.
No manufacturer, wholesaler, or out-of-state shipper may discriminate in price discounts between one permittee and another on sales or purchases of alcoholic liquors bearing the same brand or trade name and of like age, size and quality, nor shall he allow in any form any discount, rebate, free goods, allowance or other inducement for the purpose of making sales or purchases.
This fifth cycle of the NMIS focuses on care for women during pregnancy and delivery and the relationship between this and the outcome of pregnancy, in terms of estimates of low birth weight and survival of the baby. It is timely in view of the recent publication of the National Maternity Care Guidelines for Nepal. It is intended to provide information on the current situation in relation to the targets set in the guidelines and some insights about what might help to improve matters, for use by service planners and providers at national and local levels. No attempt was made to estimate maternal mortality. Estimates of maternal mortality in Nepal are available from other sources.
The NMIS employs Sentinel Community Surveillance (SCS). Features of this method include: the focus of each cycle on a small group of issues; the combination of quantitative and qualitative data from the same communities in a meso-analysis; data analysis and risk analysis to produce results in a form useful for planning; revisiting of the same sites, making estimation of impact of interventions easier.
National; urban/rural areas; development regions; ecological zones; eco-development regions
Household and ever married woman aged 15-49 years
Households and ever married women aged 15-49 years
Sample survey data [ssd]
The NMIS uses a methodology known as Sentinel Community Surveillance (SCS). It has the underlying aim of 'building the community voice into planning'. SCS can be described as a multi-sectoral community-based information management system. There are a number of particular features of the SCS methodology:
Transfer of skills of data collection, analysis and communication over a number of cycles is an explicit aim of the methodology.
A key feature of SCS is the ability to do risk analysis to look at causes. NMIS cycle five focuses on care for women during pregnancy and delivery and the relationship between this and the outcome of pregnancy, in terms of estimates of low birth weight and survival of the baby. SCS is deliberately designed to concentrate data collection efforts: in time (a series of cycles in the sentinel sites, at approximately six-monthly intervals); in space (representative communities are surveyed rather than collecting data from all communities); and in subject matter (each cycle focuses on one area at a time, rather than trying to collect all possible data on every occasion). SCS employs a type of cluster survey methodology, but the clusters are larger than in many cluster surveys: typically 100-120 households per site, rather than the 10-50 used in most cluster surveys. In the SCS method, there is no sampling within each site; every household is included. This gives greater statistical power in the data analysis and also allows the linkage of data from the household questionnaires to other, mainly qualitative, data from the same sites. These data relating to the whole site are combined with the household data in a meso-analysis.
A key issue in the SCS methodology and in the NMIS is the selection of sites so as to be representative. In some countries, random sampling is not a possibility because no adequate sampling frame exists. In these situations, purposive selection is used, drawing on local knowledge of conditions to choose sites as representative as possible of the situation in a district, region or country. When possible, random sampling methods are used and this is the case in Nepal, where a reasonably good census sampling frame exists. In both cases, stratification is first used to ensure that certain types of sites are included in proportion to their occurrence in the population. For example, stratification can be by urban and rural sites, or by ecological zones. The sample sites for the NMIS were drawn by the Central Bureau of Statistics (CBS), after stratification into development regions, ecological zones and urban/rural sites. The details of the sampling method and the selected sites are given in the report of the first NMIS cycle and the annexes to that report.
The sites in NMIS cycle 5 are selected by a multistage random sampling method. The sites are representative of the country, of the five development regions, of the three ecological zones, of the 15 eco-development regions, and of urban and rural situations. The rural sites were selected primarily to give representation of the 15 eco-development regions, but in 18 districts there are sufficient sites (four or more) to ensure reasonable district representativeness. In a further 19 districts, only 1-2 sites were selected, so they cannot be relied upon to be representative of that district. Note that representation of the 15 eco-development regions is among the rural sites only; the urban sites are stratified separately and are not intended to be part of the representation of the different eco-development regions. This reflects the high proportion of the population living in rural communities (around 90%) and the difficulty of having a large enough urban sample to stratify separately among the 15 eco-development regions.
There are a total of 144 sites in the sample: 126 rural and 18 urban. A total of 18,996 households and 106,160 household members were interviewed in the survey.
Face-to-face [f2f]
The following instruments were used for data collection for NMIS cycle five surveillance:
The questionnaire and interview guides were published in Nepali. An English version is provided in the Report on Care during Pregnancy and Delivery: Implications for Protecting the Health of Mothers and their Babies, Fifth Cycle (June 1998).
Data editing took place at a number of stages throughout the processing, including:
The household data were entered twice and validated using Epi Info.
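As an illustration of the double-entry check described above, the following minimal sketch compares two independent entry passes and lists every field where they disagree. It is not Epi Info's own routine; the record IDs, field names, and values are hypothetical.

```python
def compare_entries(first_pass, second_pass):
    """Return (record_id, field, value_1, value_2) for every disagreement
    between two independent data-entry passes."""
    mismatches = []
    for rec_id in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(rec_id, {})
        b = second_pass.get(rec_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((rec_id, field, a.get(field), b.get(field)))
    return mismatches


# Hypothetical household records keyed by household ID.
entry_1 = {"HH001": {"members": 6, "site": 12}, "HH002": {"members": 4, "site": 12}}
entry_2 = {"HH001": {"members": 6, "site": 12}, "HH002": {"members": 5, "site": 12}}

discrepancies = compare_entries(entry_1, entry_2)
for rec_id, field, v1, v2 in discrepancies:
    print(f"{rec_id}.{field}: first entry={v1}, second entry={v2}")
```

Each listed discrepancy would then be resolved against the paper questionnaire before analysis.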
A total of 18,996 households were visited in 144 sites. Information was available for 18,653 households (99%). Only 1% of households refused the interview. The total population in the households interviewed is 106,160 people. More detailed information was collected from ever-married women aged 15-49 years: a total of 19,557 women. They reported on their last pregnancy, and data on a total of 17,609 pregnancies were collected.
Standard deviations and 95% Confidence Intervals were calculated for specific variables. These estimates are provided in the Report on Care during Pregnancy And Delivery: Implications for Protecting the Health of Mothers and their Babies, Fifth Cycle (1998).
The data collection instruments were piloted several times to ensure that they were appropriate to the households, health facility workers and focus groups concerned and that the coding and data entry arrangements were satisfactory.
In the WAEMU countries, COVID-19 is expected to affect households in many ways. First, governments might reduce social transfers to households due to the decline in revenue arising from the potential COVID-19 economic recession. Second, households deriving income from vulnerable sectors such as tourism and related activities will likely face a risk of unemployment or loss of income. Third, an increase in the prices of imported goods can also negatively impact household welfare, either directly through the higher prices of these imported items or indirectly through higher prices of local goods manufactured using imported inputs. In this context, there is a need to produce high frequency data to help policy makers in monitoring the channels by which the pandemic affects households and assessing its distributional impact. To do so, the sample of the longitudinal survey will be a sub-sample of the 2018/19 household survey in each country.
For Mali, the survey, which is implemented by the National Statistical Office (INSTAT), is conducted using cell phone numbers of household members collected during the 2018/19 survey. This has the advantage of allowing cost-effective welfare analysis without collecting new consumption data. The 35-minute questionnaire covered 10 modules (knowledge, behavior, access to services, food security, employment, safety nets, shocks, etc.). Data collection is planned for six months (six rounds) and the questionnaire is designed with core modules and rotating modules. Survey data collection started on May 11th, 2020 and households are expected to be called back every three to four weeks.
The main objectives of the survey are to: • Identify type of households directly or indirectly affected by the pandemic; • Identify the main channels by which the pandemic affects households; • Provide relevant data on income and socioeconomic indicators to assess the welfare impact of the pandemic.
National coverage including rural and urban
The survey covered only households of the 2018/19 survey which excluded populations in prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
The Mali COVID-19 impact monitoring survey is a high frequency Computer Assisted Telephone Interview (CATI). The survey’s sample was drawn from the population of the 2018/19 Enquête Harmonisée des Conditions de Vie des Ménages (EHCVM), which was conducted between October 2018 and July 2019. EHCVM is itself a sample survey representative at the national and regional levels and by urban/rural area. Phone numbers were collected for about 90 percent of the 7,000 households (HHs) in EHCVM. Each HH has between one and four phone numbers. The sampling, which was similar across WAEMU, aimed at having representative estimates by three zones: the capital city of Bamako, other urban areas and the rural area. The minimum sample size was 1,809, of which 1,766 were successfully interviewed, that is, about 98% of the expected minimum sample size at the national level. Given that Mali is conducting a phone survey for the first time, a total of 2,270 households were drawn (a 25% increase) to account for unknown non-response rates and invalid numbers in the database.
The total number of completed interviews in round one is 1,766. The total number of completed interviews in round two is 1,935. The total number of completed interviews in round three is 1,901. The total number of completed interviews in round four is 1,797. The total number of completed interviews in round five is 1,766.
Computer Assisted Telephone Interview [cati]
All the interview materials were translated into French for the NSO. The questionnaire was administered in local languages, varied in length (about 30-35 minutes), and covered the following topics: 1- Household Roster 2- Knowledge of COVID-19 3- Behaviour and Social Distancing 4- Access to Basic Services 5- Employment and Income 6- Prices and Food Security 7- Other Impacts of COVID-19 8- Income Loss 9- Coping/Shocks 10- Social Safety Nets 11- Fragility 12- Governance and socio-political crisis
At the end of data collection, the raw dataset was cleaned by the NSO. This included formatting and correcting results based on monitoring issues, enumerator feedback and survey changes.
The minimum sample expected is 1,809 households (with 603 households per domain). This sample was therefore 99% covered for Bamako, about 100% for other urban areas and 91% for rural areas. Overall, the minimum sample is 98% covered. This level of coverage provides reliable data at national level and for each domain.
Round one response rate was 77.8%. Round two response rate was 85.2%. Round three response rate was 83.7%. Round four response rate was 79.2%. Round five response rate was 79.7%.
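The reported response rates can be cross-checked against the completed-interview counts given above. The sketch below assumes the denominator is the 2,270 households drawn for the phone sample; the source does not state the denominator explicitly, so this is an assumption.

```python
# Completed interviews per round, as reported in the survey description.
completed = {1: 1766, 2: 1935, 3: 1901, 4: 1797, 5: 1766}

# Assumption: response rates are computed over the 2,270 households drawn.
DRAWN_SAMPLE = 2270


def response_rate(n_completed, n_drawn=DRAWN_SAMPLE):
    """Return the response rate in percent, rounded to one decimal place."""
    return round(100 * n_completed / n_drawn, 1)


rates = {rnd: response_rate(n) for rnd, n in completed.items()}
for rnd, rate in rates.items():
    print(f"Round {rnd}: {rate}%")
```

Under this assumption, rounds one to four reproduce the reported rates (77.8%, 85.2%, 83.7%, 79.2%). Round five comes out at 77.8% rather than the reported 79.7%, so either the denominator differs for that round or one of the round-five figures is misreported.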
The Multiple Indicator Cluster Survey (MICS) is a household survey programme developed by UNICEF to assist countries in filling data gaps for monitoring human development in general and the situation of children and women in particular. MICS is capable of producing statistically sound, internationally comparable estimates of social indicators. The current round of MICS is focused on providing a monitoring tool for the Millennium Development Goals (MDGs), the World Fit for Children (WFFC), as well as for other major international commitments, such as the United Nations General Assembly Special Session (UNGASS) on HIV/AIDS and the Abuja targets for malaria.
Survey Objectives The 2005 Georgia Multiple Indicator Cluster Survey has as its primary objectives: - To provide up-to-date information for assessing the situation of children and women in Georgia; - To furnish data needed for monitoring progress toward goals established in the Millennium Declaration, the goals of A World Fit For Children (WFFC), and other internationally agreed upon goals, as a basis for future action; - To contribute to the improvement of data and monitoring systems in Georgia and to strengthen technical expertise in the design, implementation, and analysis of such systems.
Survey Content MICS questionnaires are designed in a modular fashion that can be easily customized to the needs of a country. They consist of a household questionnaire, a questionnaire for women aged 15-49 and a questionnaire for children under the age of five (to be administered to the mother or caretaker). Other than a set of core modules, countries can select which modules they want to include in each questionnaire.
Survey Implementation The survey was carried out by the State Department of Statistics of Georgia and the National Centre for Disease Control of Georgia, with the support and assistance of UNICEF.
Technical assistance and training for the MICS surveys is provided through a series of regional workshops, covering questionnaire content, sampling and survey implementation; data processing; data quality and data analysis; report writing and dissemination.
The survey is nationally representative and covers the whole of Georgia.
Households (defined as a group of persons who usually live and eat together)
De jure household members (defined as members of the household who usually live in the household, which may include people who did not sleep in the household the previous night, but does not include visitors who slept in the household the previous night but do not usually live in the household)
Women aged 15-49
Children aged 0-4
The survey covered all de jure household members (usual residents), all women aged 15-49 years resident in the household, and all children aged 0-4 years (under age 5) resident in the household.
Sample survey data [ssd]
The principal objective of the sample design was to provide current and reliable estimates on a set of indicators covering the four major areas of the World Fit for Children declaration, including promoting healthy lives; providing quality education; protecting against abuse, exploitation and violence; and combating HIV/AIDS. The population covered by the 2005 MICS is defined as the universe of all women aged 15-49 and all children aged under 5. A sample of households was selected and all women aged 15-49 identified as usual residents of these households were interviewed. In addition, the mother or the caretaker of all children aged under 5 who were usual residents of the household were also interviewed about the child.
The 2005 MICS collected data from a nationally representative sample of households, women and children. The primary focus of the 2005 MICS was to provide estimates of key population and health, education, child protection and HIV related indicators for the country as a whole, and for urban and rural areas separately. In addition, the sample was designed to provide estimates for each of the 11 regions for key indicators. Georgia is divided into 11 regions: Tbilisi, Kakheti, Mtskheta-Mtianeti, Shida Kartli, Kvemo Kartli, Samtskhe-Javakheti, Racha-Lechkhumi and Kvemo Svaneti, Imereti, Guria, Samegrelo and Zemo Svaneti, Adjara. The sample frame for this survey was based on the list of enumeration areas developed from the 2002 population census.
The primary sampling unit (PSU), the cluster for the 2005 MICS, is defined on the basis of the enumeration areas from the census frame. The minimum PSU size in Georgia is 11 households and the maximum PSU size is 188 households. The average PSU size is 70.8 households. While constructing the sampling frame, PSUs smaller than 30 households were merged with neighbouring PSUs to achieve a minimum PSU size of 30 households. Although the original sample design for the Georgia MICS 2005 called for approximately 14,000 households with an equal number of clusters (42) of households in each of the 11 regions, stratified into urban and rural areas, this sample design was changed to use a more complicated stratification design, with unequal numbers of clusters in each stratum. The rationale for this was for the selection to follow the population distribution more closely.
The sample was selected in four stages and in the first two stages, sample design was stratified according to 11 regions, 3 settlement types (Large town, Small town, and Village), and 4 geographic strata (Valley, Foothills, Mountain, and High mountain). This stratification was applied in all regions, except the city of Tbilisi where the region is stratified according to 10 districts. In total 49 separate strata were identified. The last two stages of the sample design were for the selection of clusters and households.
First stage of sampling: The number of clusters based on sample size calculations was 467, and these were allocated to regions based on the cube root of the number of households in the region. Because the number of clusters for the Racha-Lechkhumi region was small (12 clusters), it was decided to increase the number of clusters in that region by 8, giving 20 clusters in that region and 475 clusters nationwide.
Second stage of sampling: Within each region, another level of stratification was based on a combination of the following: size of settlement (large town, small town, and village) and topography (valley, foothills, mountain, and high mountain). The allocation of the number of clusters for a settlement/topography stratum was based on the square root of the number of households in each stratum. Some regions did not have each of the different settlement sizes or topographies. Also, in Tbilisi, the Rayons (districts) were used for stratification.
Third stage of sampling: Within each stratum, clusters were selected with probability proportional to population size (PPS).
Fourth stage of sampling: Within each cluster, 30 households were systematically selected, resulting in a sample of 14,250 households.
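The four stages above can be sketched in code. This is a minimal illustration, not the survey's actual procedure: the region household counts, enumeration-area sizes, rounding rule (largest remainder), and random seed are all hypothetical. Only the use of root-based allocation, PPS selection of clusters, systematic selection of 30 households, and the totals 475 and 14,250 come from the text.

```python
import random


def power_allocation(households, total_clusters, power):
    """Allocate clusters in proportion to households ** power.
    power=1/3 mimics cube-root allocation, power=1/2 square-root allocation.
    Rounding uses the largest-remainder rule (an assumption)."""
    measures = {name: count ** power for name, count in households.items()}
    total = sum(measures.values())
    exact = {name: total_clusters * m / total for name, m in measures.items()}
    allocation = {name: int(x) for name, x in exact.items()}
    leftover = total_clusters - sum(allocation.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - allocation[n], reverse=True)
    for name in by_fraction[:leftover]:
        allocation[name] += 1
    return allocation


def pps_systematic(sizes, n, rng):
    """Select n units with probability proportional to size (systematic PPS).
    A unit larger than the sampling interval can be selected more than once."""
    step = sum(sizes.values()) / n
    point = rng.random() * step
    chosen, cumulative = [], 0
    for unit, size in sizes.items():
        cumulative += size
        while len(chosen) < n and point < cumulative:
            chosen.append(unit)
            point += step
    return chosen


def systematic_sample(items, n, rng):
    """Select n items at a fixed interval from a random start."""
    step = len(items) / n
    start = rng.random() * step
    return [items[int(start + i * step)] for i in range(n)]


rng = random.Random(2005)  # fixed seed so the sketch is repeatable

# Hypothetical household counts for three regions.
regions = {"Region A": 250_000, "Region B": 90_000, "Region C": 15_000}
clusters_per_region = power_allocation(regions, total_clusters=60, power=1 / 3)

# Hypothetical enumeration areas (households per area) within one stratum.
areas = {"EA-01": 45, "EA-02": 120, "EA-03": 30, "EA-04": 75, "EA-05": 60, "EA-06": 90}
selected_areas = pps_systematic(areas, n=3, rng=rng)

# Systematic selection of 30 households from a hypothetical listing of 70.
listing = [f"household-{i:03d}" for i in range(1, 71)]
selected_households = systematic_sample(listing, n=30, rng=rng)

# Totals stated in the text: 475 clusters of 30 households each.
total_households = 475 * 30
print(clusters_per_region, selected_areas, len(selected_households), total_households)
```

Root-based allocation gives small regions more clusters than strictly proportional allocation would, which is the usual reason for choosing it when regional estimates are required.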
The Georgia Multiple Indicator Cluster Survey sample is not self-weighted. The basic weighting of the data has been done using the inverse of the probability of selection of each household.
Following standard MICS data collection rules, if a selected household turned out to be more than one household when visited, then a) if it contained two households, both were interviewed, or b) if it contained three or more households, only the household of the person named as the head was interviewed.
No replacement of households was permitted in case of non-response or non-contactable households. Adjustments were made to the sampling weights to correct for non-response, according to MICS standard procedures.
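The weighting described above can be sketched as follows, assuming a two-part selection probability (cluster, then household within cluster) and a simple non-response adjustment within one adjustment cell. The probabilities and counts are hypothetical, and the exact adjustment cells used by the MICS standard procedures are not given in the text.

```python
def base_weight(p_cluster, n_selected, n_listed):
    """Inverse of the overall selection probability of a household:
    P(cluster selected) * P(household selected within the cluster)."""
    p_household = n_selected / n_listed
    return 1.0 / (p_cluster * p_household)


def nonresponse_adjusted(weight, n_selected, n_responded):
    """Inflate the weight so responding households also represent
    the selected households that did not respond."""
    return weight * n_selected / n_responded


# Hypothetical cluster: selected with probability 0.02, 30 of 75 listed
# households sampled, 27 of the 30 completed the interview.
w_base = base_weight(p_cluster=0.02, n_selected=30, n_listed=75)
w_final = nonresponse_adjusted(w_base, n_selected=30, n_responded=27)
print(round(w_base, 2), round(w_final, 2))
```

Because no replacement households were allowed, the adjustment factor is the only mechanism in this sketch that compensates for non-response.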
The sampling procedures are more fully described in the sampling design document and the sampling appendix of the final report.
No major deviations from the original sample design were made. All sample enumeration areas were accessed and successfully interviewed with good response rates.
Face-to-face [f2f]
The questionnaires for the Georgia MICS were structured questionnaires based on the MICS3 Model Questionnaire with some modifications and additions. A household questionnaire was administered in each household, which collected various information on household members including sex, age, relationship, and orphanhood status. The household questionnaire includes household listing, education, water and sanitation, household characteristics, child labour, child discipline, disability, and salt iodization.
In addition to a household questionnaire, questionnaires were administered in each household for women age 15-49 and children under age five. For children, the questionnaire was administered to the mother or caretaker of the child.
The women's questionnaire includes child mortality, maternal and newborn health, marriage and union, contraception, attitudes towards domestic violence, HIV knowledge, cigarette smoking, and hemoglobin test.
The children's questionnaire includes birth registration and early learning, child development, breastfeeding, care of illness, immunization*, and anthropometry.
The questionnaires are based on the MICS3 model questionnaire. From the MICS3 model English and Russian versions, the questionnaires were translated into Georgian and were pre-tested in Tbilisi and in Mtskheta-Mtianeti during September 2005. Based on the results of the pre-test, modifications were made to the wording and translation of the