The joiner is a component often used in workflows to merge or join data from different sources or intermediate steps into a single output. In the context of Common Workflow Language (CWL), the joiner can be implemented as a step that combines multiple inputs into a cohesive dataset or output. This might involve concatenating files, merging data frames, or aggregating results from different computations.
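As a purely illustrative sketch (not taken from any of the datasets described below), the command wrapped by such a joiner step can be as simple as a script that concatenates its input files; the script name and argument order here are hypothetical:

```python
# joiner.py - hypothetical merge script that a CWL CommandLineTool could wrap.
import sys

def join_files(input_paths, output_path):
    """Concatenate the given input files into a single output file."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            with open(path, "r", encoding="utf-8") as src:
                out.write(src.read())

if __name__ == "__main__":
    # Usage: python joiner.py merged.txt part1.txt part2.txt ...
    join_files(sys.argv[2:], sys.argv[1])
```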
Metabolomics encounters challenges in cross-study comparisons due to diverse metabolite nomenclature and reporting practices. To bridge this gap, we introduce the Metabolites Merging Strategy (MMS), offering a systematic framework to harmonize multiple metabolite datasets for enhanced interstudy comparability. MMS has three steps. Step 1: translation and merging of the different datasets, employing InChIKeys for data integration and including the translation of metabolite names where needed. Step 2: retrieval of attributes from the InChIKey, including name descriptors (the title name from PubChem and the RefMet name from Metabolomics Workbench), chemical properties (molecular weight and molecular formula), both systematic (InChI, InChIKey, SMILES) and non-systematic identifiers (PubChem, ChEBI, HMDB, KEGG, LipidMaps, DrugBank, Bin ID and CAS number), and their ontology. Step 3: a meticulous three-part curation process to rectify disparities for conjugated base/acid compounds (optional step), missing attributes, and synonyms (duplicated information). The MMS procedure is exemplified through a case study of urinary asthma metabolites, where MMS facilitated the identification of significant pathways that remained hidden when no dataset merging strategy was followed. This study highlights the need for standardized and unified metabolite datasets to enhance the reproducibility and comparability of metabolomics studies.
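As a loose illustration of Step 1 (not the authors' published code), merging several studies' tables on InChIKey could look like the pandas sketch below; the 'InChIKey' column name and the DataFrame inputs are assumptions.

```python
import pandas as pd

def merge_on_inchikey(datasets):
    """Outer-merge several metabolite tables on their InChIKey column.

    Each input DataFrame is assumed to carry an 'InChIKey' column; all other
    columns are kept so that attributes reported by every study are retained.
    """
    merged = None
    for i, df in enumerate(datasets):
        df = df.drop_duplicates(subset="InChIKey")
        if merged is None:
            merged = df
        else:
            merged = merged.merge(df, on="InChIKey", how="outer",
                                  suffixes=("", f"_study{i}"))
    return merged
```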
Im4Sketch is a large-scale dataset with a shape-oriented set of classes for image-to-sketch generalization. It consists of a collection of natural images from 874 categories for training and validation, and sketches from 393 categories (a subset of the natural image categories) for testing.
The images and sketches are collected from existing popular computer vision datasets. The categories are selected with shape similarity in mind, so that objects with the same shape belong to the same category.
The natural-image part of the dataset is based on the ILSVRC2012 version of ImageNet. The original ImageNet categories are first merged according to the shape criterion: categories whose objects have the same shape, i.e. would be drawn the same way by a human, are merged. For this step, the semantic similarity of categories, obtained through WordNet and the category names, is used to obtain candidate categories for merging. Based on visual inspection of these candidates, the decision to merge the original ImageNet classes is made by a human. For instance, "Indian Elephant" and "African Elephant", or "Laptop" and "Notebook", are merged. An extreme case of merging is the new class "dog", which is a union of 121 original ImageNet classes of dog breeds.
In the second step, classes from datasets containing sketches are used, in particular DomainNet, Sketchy, PACS, and TU-Berlin. Merging is not necessary for classes in these datasets, because they are designed for sketches and therefore already satisfy the shape criterion. In this step, a correspondence between the merged ImageNet categories and the categories of the other datasets is found. As in the merging step, semantic similarity is used to guide the correspondence search. Sketch categories that are not present in the merged ImageNet are added to the overall category set, while training natural images for those categories are collected from either DomainNet or Sketchy. In the end, ImageNet is used for 690 classes, DomainNet for 183 classes, and Sketchy for 1 class.
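A rough sketch of how WordNet-based semantic similarity might be used to propose candidate merges (assuming NLTK with the WordNet corpus installed; the similarity measure, threshold and first-synset lookup are simplifications, and candidates would still need human visual inspection):

```python
from itertools import combinations
from nltk.corpus import wordnet as wn

def candidate_merges(category_names, threshold=0.8):
    """Propose pairs of categories whose WordNet synsets are highly similar."""
    synsets = {}
    for name in category_names:
        hits = wn.synsets(name.replace(" ", "_"), pos=wn.NOUN)
        if hits:
            synsets[name] = hits[0]  # take the most common sense

    candidates = []
    for a, b in combinations(synsets, 2):
        sim = synsets[a].wup_similarity(synsets[b])
        if sim is not None and sim >= threshold:
            candidates.append((a, b, sim))
    return sorted(candidates, key=lambda t: -t[2])

# e.g. candidate_merges(["laptop", "notebook", "tiger", "zebra"])
```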
Pre-processed mission statements and additional data from 1023-EZ approvals for 2018 and 2019. For additional information on cleaning steps, please go to the project's replication GitHub page.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RSR1.5 of ICP and CICP algorithms in two steps on US-MERGE and US-SNAP datasets.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Physically based numerical weather prediction and climate models provide useful information for a large number of end users, such as flood forecasters, water resource managers, and farmers. However, due to model uncertainties arising from, e.g., initial value and model errors, the simulation results do not match in situ or remotely sensed observations to arbitrary accuracy. Merging model-based data with observations yields promising results that benefit simultaneously from the information content of the model results and the observations. Machine learning (ML) and deep learning (DL) methods have been shown to be useful tools for closing the gap between models and observations owing to their capacity to represent the non-linear space–time correlation structure. This study focused on using UNet encoder–decoder convolutional neural networks (CNNs) to extract spatiotemporal features from model simulations and predict the actual mismatches (errors) between the simulation results and a reference data set. Here, climate simulations over Europe from the Terrestrial Systems Modeling Platform (TSMP) were used as input to the CNN. The COSMO-REA6 reanalysis data were used as a reference. The proposed merging framework was applied to mismatches in precipitation and surface pressure, representing more and less chaotic variables, respectively. The merged data show a strong average improvement in mean error (~47%), correlation coefficient (~37%), and root mean square error (~22%). To highlight the performance of the DL-based method, the results were compared with those obtained by a baseline method, quantile mapping. The proposed DL-based merging methodology can be used either during the simulation to correct model forecast output online or in a post-processing step for downstream impact applications, such as flood forecasting, water resources management, and agriculture.
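For orientation, the quantile-mapping baseline mentioned above can be sketched with a simple empirical implementation in NumPy; this is a generic version, not the study's actual setup:

```python
import numpy as np

def quantile_map(model_values, obs_reference, model_reference, n_quantiles=100):
    """Empirical quantile mapping: adjust model values so their distribution
    matches a reference observation distribution.

    obs_reference / model_reference: historical samples used to build the mapping.
    model_values: new model output to be corrected.
    """
    q = np.linspace(0.0, 1.0, n_quantiles)
    model_q = np.quantile(model_reference, q)
    obs_q = np.quantile(obs_reference, q)
    # Map each model value from the model quantiles onto the observed quantiles.
    return np.interp(model_values, model_q, obs_q)
```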
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains shapefiles showing landscape classification, including all natural and human ecosystems, for the Galilee preliminary assessment extent.
It is constructed from source data (see Lineage) to show the landscape classification systematically and define geographical areas into classes based on similarity in physical and/or biological and hydrological character. The landscape classification includes all natural and human ecosystems in the Galilee preliminary assessment extent.
A landscape classification was developed to characterise the nature of water dependency among these assets.
The aim of the landscape classification is to systematically define geographical areas into classes based on similarity in physical and/or biological and hydrological character.
The landscape classification was carried out on data layers consisting of polygons (e.g. remnant vegetation, wetlands), lines (stream network) and points (springs and spring complexes).
The layers created to contribute to the landscape classes were: GAL_Landform_LC layer; GAL_GW_SRCE_LC; GAL_FLD_r; GAL Streams; LC_WaterType_GAL; LC_WaterAvail_GAL.
The description of how these layers were created is below:
A. To make the GAL_Landform_LC layer (a GeoPandas sketch of the buffer, erase and merge steps A9-A11 follows this list):
1. Merge Queensland wetlands (QLD_WETLAND_SYSTEM_100K_A) with South Australian wetlands (Wetlands_GDE_Classification (SA))
2. Select wetlands from step A1 that intersect the Galilee_SW_PAE_v02.
3. Add a new field to the wetlands data for landform class called "Landform_LC".
4. From the merged wetland data (step A1) select Queensland wetlands ("Wetlandsys" field is not blank) and update Landform_LC to the first letter of the "Wetlandsys" value
5. From the merged wetland data (step A1) select South Australian wetlands ("WETCLASS" field is not blank) and update Landform_LC to "wetclass"
6. Compare to Landclass_Draft1 (Don Butler's data) to check that areas between the wetlands created in step A1 have the "Landform" value of "-" and cover about the same percent of the area (~63%). This is true and therefore these data match the Don Butler data for wetlands.
7. Select all wetlands as defined in steps A4-6 and eliminate errors or slivers created by slight overlaps when the data were merged
8. Select all wetlands as defined in steps A4-6 and delete all overlaps
9. Select all streams (AHGHMappedStream) within Galilee_SW_PAE_v02 and buffer to 1 m total width. This makes the area of the stream numerically equal to its length
10. Overlay wetland areas with the buffered streams (created in step A9) and erase any wetlands inside buffered stream areas. This ensures there are no overlapping polygons when wetlands and streams are merged
11. Merge the buffered streams created in step A9 and the wetlands created in step A10 to create GAL_LANDFORM_LC
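Purely as an illustration (the original processing was done in ArcGIS, and the file names below are placeholders), steps A9-A11 could be expressed with GeoPandas roughly as follows:

```python
import geopandas as gpd
import pandas as pd

# Placeholder inputs; a projected CRS in metres is assumed.
wetlands = gpd.read_file("wetlands_merged.shp")    # result of steps A1-A8
streams = gpd.read_file("AHGHMappedStream.shp")

# Step A9: buffer streams to 1 m total width (0.5 m each side).
streams_buf = streams.copy()
streams_buf["geometry"] = streams.geometry.buffer(0.5)

# Step A10: erase wetland area that falls inside the buffered streams.
wetlands_clipped = gpd.overlay(wetlands, streams_buf, how="difference")

# Step A11: merge buffered streams and clipped wetlands into one layer.
landform = pd.concat([wetlands_clipped, streams_buf], ignore_index=True)
```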
B. To make the GAL_GW_SRCE_LC (GAL Ground Water SOURCE Land Class):
Select "Aquifers assocated with springs that form saline scolds" and "Sandstone aquifers with fresh permanent groundwater connectivity regime associated with discharge springs") from GDE_Terr_Area_v01_3 that are within Galilee_SW_PAE_v02, add a new field called "GW_SRCE_LC" and update values to "Artesian"
Select NRM_Regions_2014_v01 within South Australia
Select subsurface GDEs (GDEsub) and surface GDEs (GDEsur) from GM_PED_AssetList_poly that are within South Australia (ie they intersect NRM regions in South Australia (those selected in step B2), add a new field called "GW_SRCE_LC" and update values to "Artesian"
Select springs from Topo250KSeries3_gdb (Springs) that are within South Australia (ie they intersect NRM regions in South Australia (those selected in step B2) and buffer to 20m radius, add a new field called "GW_SRCE_LC" and update values to "Artesian"
Select where LEB_Non-GAB_Springs.shp intersect NRM regions in South Australia (those selected in step B2) - there were none.
Erase GDEs from step B3 that overlap with springs from step B4 and merge remaining GDEs with springs (step B4) and aquifers (step B1) to make GAL_GW_SRCE_LC
C. To make the Topography landclasses = GAL_FLD_r:
1. Select landzone 3 from DP_Preclear_RE_DCDB_A
2. Select floodplains from QLD_Wetland_System
3. Combine LandSubjectToInundation, MarineSwamp, SalineCoastalFlat and Swamps from the GA 250K topographic flats data (GA_250K_topo(Flats))
4. Select floodplains from Don Butler's landscape classes (Landclass_Draft1)
5. Merge landzone 3 (from step C1) with the wetland floodplains (from step C2), the flats from GA (step C3) and Don Butler's floodplains (step C4), then add a new field called "LC_Code" and update values to "10,000"
6. Select floodplains (created in step C5) that are within the Galilee_SW_PAE_v02
7. Convert to raster and cut into 2 degree tiles to create GAL_FLD_1..75
D. To make the GAL Streams (no overlaps between buffered streams):
1. Select streams from AHGHMappedStream within GAL_PAE_v02
2. Buffer streams to 1 m
3. Select the first 2 buffered streams, erase the first from the area of the second, then merge the 2 together
4. Select the 3rd stream, erase from the 3rd the areas that overlap the first and second (results of D3), then merge
5. Continue for every stream until all are done
6. Check for overlaps and remove any to create GAL Streams
7. Add a new field called "LC_Code" and update values to "3" for riverine
E. To make the LC_WaterType_GAL:
1. Select Queensland terrestrial GDEs (QLD_GDETerr) within GAL_PAE_v02 where salinity of groundwater >= 3000 mg/L TDS
2. Select Queensland surface GDEs (QLD_GDETerr) within GAL_PAE_v02 where salinity of groundwater >= 3000 mg/L TDS
3. Select Queensland wetlands (QLD_Wetlands) within GAL_PAE_v02 where "SALIMOD" = "S2", "S3" or "T1"
4. Merge results from E1, E2 and E3 to create LC_WaterType_GAL
5. Add a new field called "LC_Code" and update values to 100 for fresh water and 200 for saline
F. To make the LC_WaterAvail_GAL:
1. Select Queensland terrestrial GDEs (QLD_GDETerr) within GAL_PAE_v02 where water regime (WTRRegime) = "WR0", update LC_Code to 30 (intermittent)
2. Select Queensland surface GDEs (QLD_GDETerr) within GAL_PAE_v02 where water regime (WTRRegime) = "WR0", "T1", "WT1" or "WR2", update LC_Code to 30 (intermittent)
3. Select Queensland surface GDEs (QLD_GDETerr) within GAL_PAE_v02 where water regime (WTRRegime) = "WT3" or "WR3", update LC_Code to 20 (near permanent)
4. Select Queensland wetlands (QLD_Wetlands) within GAL_PAE_v02 where water regime (WTRRegime) = "WR0", "T1", "WT1" or "WR2", update LC_Code to 30 (intermittent)
5. Select Queensland wetlands (QLD_Wetlands) within GAL_PAE_v02 where water regime (WTRRegime) = "WT3" or "WR3", update LC_Code to 20 (near permanent)
6. Combine results from F1..F5 to create LC_WaterAvail_GAL
G. To make the REMVeg:
1. Select vegetation classes [1-23, 26, 29-32] from NVIS - Australian Major Vegetation Subgroups that are within GAL_PAE_v02
2. Convert to vector data, add a new field called "LC_Code" and update values to "100,000" (remnant vegetation) to create REMVeg
H. To make the Landscape_Tile1..4:
1 = P = palustrine
2 = L = lacustrine
3 = R = riverine
4 = E = estuarine
10 = permanent water
20 = near permanent water (water there between 70 and 100% of the time)
30 = intermittent water (water there less than 70% of the time)
100 = fresh water
200 = saline water
10,000 = floodplain
100,000 = remnant vegetation
Bioregional Assessment Programme (2015) Landscape classification of the Galilee preliminary assessment extent. Bioregional Assessment Derived Dataset. Viewed 12 December 2018, http://data.bioregionalassessments.gov.au/dataset/80e7b80a-23e4-4aa1-a56c-27febe34d7db.
Derived From Queensland wetland data version 3 - wetland areas.
Derived From Geofabric Surface Cartography - V2.1
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From Queensland groundwater dependent ecosystems
Derived From GEODATA TOPO 250K Series 3
Derived From Multi-resolution Valley Bottom Flatness MrVBF at three second resolution CSIRO 20000211
Derived From Biodiversity status of pre-clearing and remnant regional ecosystems - South East Qld
This dataset contains all Wintertime Investigation of Transport, Emission, and Reactivity (WINTER) C-130 observations merged at the rate of the SAGA data. In addition to the observations, the dataset also contains results from the GEOS-Chem near-realtime simulation sampled along the flight track. Refer to the instruments dataset for instrument description. The GEOS-Chem model description can be found at www.geos-chem.org. Missing values are indicated by -99999. When generating fine time resolution data from a coarser resolution, the reported value at the original (coarse) time step is applied uniformly to all intermediate (fine) time steps - no interpolation is performed. When generating coarse time resolution data from a finer resolution, the time weighted average of the values at the intermediate (fine) time steps is used as the value at the coarser time step. Please follow the WINTER data policy. These data were updated April 7, 2016. The revised set uses the final data uploaded to the WINTER archive as of 4th April 2016.
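The resampling rules described above (no interpolation when going to a finer time step, time-weighted averaging when going to a coarser one) correspond roughly to the following pandas sketch; this is illustrative only, not the actual merge code:

```python
import pandas as pd

def to_finer(coarse: pd.Series, fine_index: pd.DatetimeIndex) -> pd.Series:
    """Coarse -> fine: repeat each coarse value over its interval (no interpolation)."""
    return coarse.reindex(fine_index, method="ffill")

def to_coarser(fine: pd.Series, coarse_step: str) -> pd.Series:
    """Fine -> coarse: average over each coarse interval.

    With equally spaced fine samples this plain mean equals the time-weighted average.
    """
    return fine.resample(coarse_step).mean()
```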
The data include a full year of logbook forms for vessels 60-124 feet in length (the partial coverage fleet) that had participated in the trawl flatfish fishery of 2005 in the Gulf of Alaska. The digitized hauls were not restricted exclusively to the population of trips to the Gulf of Alaska (GOA), since some vessels also participated in BSAI trawl fisheries. A total of 55 unique vessels' daily fishing logbooks (9 catcher-processors and 46 catcher vessels) were digitized into the Vessel Log System database. The daily production section for catcher-processors was not digitized; they were therefore excluded from the data entry procedure, and we focus on the remaining catcher vessels. These logbook records are then combined with observer and fish ticket data for the same vessels to create a more complete accounting of each vessel's activity in 2005. In order to examine the utility, uniqueness, and congruence of data contained in the logbooks with other sources, we collated vessel records from logbook data with Alaska Commercial Fisheries Entry Commission (CFEC) fish tickets (retrieved from the Alaska Fisheries Information Network (AKFIN)) and the North Pacific Groundfish Observer Program observer records. Merging of datasets was a multiple-step process. The first merge of data was between the quality-controlled observer and fish ticket data. Prior to 2007, the observer program did not track trip-level information such as the date of departure from and return to port, or the landing date. Consequently, to combine the 2005 haul-level observer data with the trip-level data from the fish tickets for a given vessel, each observer haul was merged with a fish ticket record if the haul retrieval date from the observer data was contained within the modified start and end dates derived from the fish ticket data (see above). Since the starting date on the fish ticket record represents the date fishing began, rather than the date a vessel left port, all observer haul records should be within the time frame of the fish ticket start and end dates. The observer hauls were therefore given the same trip number as determined by the fish tickets' trip numbering algorithm. The same process was then repeated to merge each logbook haul onto the combined fish ticket and observer data. Trip targets were then assigned from the North Pacific Fishery Management Council comprehensive observer database (Council.Comprehensive_obs) for observed trips, and statistical areas denoted on the fish tickets were mapped to Fishery Management Plan (FMP) areas. After quality control, the dataset was considered complete, and is referred to as the combined dataset.
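The haul-to-trip merge described above is essentially a date-containment join; a hypothetical pandas sketch (column names invented for illustration) might look like this:

```python
import pandas as pd

def assign_trip_numbers(hauls: pd.DataFrame, trips: pd.DataFrame) -> pd.DataFrame:
    """Attach a fish-ticket trip number to each observer haul when the haul
    retrieval date falls within the trip's start/end dates for the same vessel.

    hauls: columns ['vessel_id', 'retrieval_date', ...]
    trips: columns ['vessel_id', 'trip_number', 'start_date', 'end_date', ...]
    """
    merged = hauls.merge(trips, on="vessel_id", how="left")
    in_window = (merged["retrieval_date"] >= merged["start_date"]) & (
        merged["retrieval_date"] <= merged["end_date"]
    )
    return merged[in_window]
```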
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A1MetEventsTable.txt: Reported metassembly events (i.e. modifications to the primary assembly such as gaps closed, number of scaffold links, etc) for all Assemblathon1 metassemblies at each merging step. (TXT 24 kb)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset merges the groundwater recharge estimate grids for each hydrogeological formation in the Galilee Basin (GUID: d42a8497-9d67-42ad-9e7d-70a8d519875f) into a single grid for input into the Galilee groundwater model.
A second groundwater recharge estimate has been produced which reduces the estimated rate for all formations. All values of the merged grid were halved, and areas which underlie unconsolidated clay and basalt surface geology features (GUID: 3c8e66e7-6a15-47ce-853b-bbe38435d28f) are given a recharge value of 0.125 mm/year for clay and 0.25 mm/year for basalt, unless the original cell value was less.
This dataset provides a single input recharge estimate grid for the Galilee groundwater model.
All formation recharge estimate grids from the input recharge dataset were merged into a single raster layer using the raster calculator statement:
Con(IsNull("%a%"),Con(IsNull("%b%"),Con(IsNull("%c%"),Con(IsNull("%d%"),Con(IsNull("%e%"),Con(IsNull("%f%"),Con(IsNull("%g%"),Con(IsNull("%h%"),Con(IsNull("%i%"),"%j%","%i%"),"%h%"),"%g%"),"%f%"),"%e%"),"%d%"),"%c%"),"%b%"),"%a%"). Where a, b, c, d... are recharge estimates for individual formations.
Then, 'No data' gaps were filled in using the raster calculator statement: Con(IsNull("%MergeALL%"),Con(IsNull(BlockStatistics("%MergeALL%",NbrRectangle(2,2,"CELL"),"MAXIMUM")),(BlockStatistics("%MergeALL%",NbrRectangle(4,4,"CELL"),"MAXIMUM")),(BlockStatistics("%MergeALL%",NbrRectangle(2,2,"CELL"),"MAXIMUM"))),"%MergeALL%"). Where MergeALL is the output raster of the previous step.
To create the raster "Recharge_mergeAll_AlluviumCenozoicFeatures", the output of the previous step was multiplied by 0.5; then cells contained within the clay features shapefile were given a value of 0.125 and cells contained within the basalt features shapefile were given a value of 0.25 (original cell values less than the new values were retained).
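Read outside ArcGIS, the nested Con(IsNull(...)) expressions implement a priority merge (take the first formation grid that has data in each cell) and the BlockStatistics calls fill the remaining gaps from a neighbourhood maximum; a rough NumPy analogue, not the original toolchain, is:

```python
import numpy as np

def priority_merge(grids):
    """Merge rasters: for each cell, take the value from the first grid in the
    list that is not NaN (mirrors the nested Con(IsNull(...)) calls)."""
    merged = np.full_like(grids[0], np.nan, dtype=float)
    for g in grids:
        merged = np.where(np.isnan(merged), g, merged)
    return merged

def fill_gaps_with_block_max(merged, size=2):
    """Fill NaN cells with the maximum of a square neighbourhood, a rough
    analogue of BlockStatistics(..., 'MAXIMUM')."""
    filled = merged.copy()
    for r, c in zip(*np.where(np.isnan(merged))):
        window = merged[max(0, r - size):r + size + 1, max(0, c - size):c + size + 1]
        if not np.all(np.isnan(window)):
            filled[r, c] = np.nanmax(window)
    return filled
```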
Bioregional Assessment Programme (2015) Merged Galilee model recharge estimates chloride mass balance v02. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/b892b063-4df9-4199-80c8-2ed4a1077d5b.
Derived From Galilee model recharge estimates: chloride mass balance v02
Derived From Australian 0.05º gridded chloride deposition v2
Derived From Galilee Recharge Cenozoic Alluvium Regions v01
Derived From GAL Aquifer Formation Extents v01
Derived From GAL Aquifer Formation Extents v02
Derived From Surface Geology of Australia, 1:1 000 000 scale, 2012 edition
Derived From Natural Resource Management (NRM) Regions 2010
Derived From Galilee Groundwater Model, hydrogeological formation recharge (Outcrop) extents v01
Derived From Galilee - Alluvium and Cenozoic 1M surface Geology
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From GEODATA TOPO 250K Series 3
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Geological Provinces - Full Extent
Derived From Phanerozoic OZ SEEBASE v2 GIS
Derived From Galilee Hydrochemistry: Quality control for Chloride model recharge v02
Derived From Bioregional Assessment areas v03
Derived From QLD Geological Digital Data - QLD Geology, Structural Framework, November 2012
Derived From Galilee Groundwater Model, Hydrogeological Formation Extents v01
Derived From Queensland petroleum exploration data - QPED
Derived From Three-dimensional visualisation of the Great Artesian Basin - GABWRA
Derived From QLD Department of Natural Resources and Mines Groundwater Database Extract 20142808
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From Queensland Geological Digital Data - Detailed state extent, regional. November 2012
Origin Datasets: HuggingFaceTB/smoltalk
Dataset Sampling for Merge-Up SLM Training: to prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:
1. Filtering for English Only: we used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively (a rough sketch of such a filter is shown below).
2. Proportional Sampling by Token Length: starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on the resulting distribution… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/HuggingFaceTB_smoltalk_filtered_10k_sampled.
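As a rough illustration of the English-only filter (the exact regular expression used by the dataset authors is not given, so the ASCII-only pattern below is an assumption):

```python
import re

# Assumed filter: treat a sample as "English only" if it contains just ASCII
# characters (a simplification of the dataset card's description).
ASCII_ONLY = re.compile(r"^[\x00-\x7F]+$")

def is_english_only(text: str) -> bool:
    return bool(ASCII_ONLY.match(text))

samples = ["Hello, world!", "¿Qué tal?"]
print([s for s in samples if is_english_only(s)])  # -> ['Hello, world!']
```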
This dataset is an 8-year (2011-2018) global spatiotemporally consistent surface soil moisture dataset with a 25 km spatial grid resolution and a daily temporal step, in units of cm3/cm3. The dataset was developed by applying a linear weight fusion algorithm based on Triple Collocation Analysis (TCA) to merge five soil moisture data products, i.e., SMOS, ASCAT, FY3B, CCI and SMAP, in two steps. The first step fuses the SMOS, ASCAT and FY3B soil moisture products from 2011 to 2018. The second step fuses the merged soil moisture product from the first step with the CCI and SMAP products from 2015 to 2018 to obtain the final merged soil moisture product from 2011 to 2018. In addition, measured soil moisture data from seven ground observation networks around the world are used to evaluate and analyze the merged soil moisture product. The fused soil moisture product has a global spatial coverage ratio of more than 80%, with a minimum RMSE (root mean square error) of 0.036 cm3/cm3.
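One possible reading of the TCA-based linear weight fusion, sketched with the classical covariance form of triple collocation and inverse-error-variance weights (illustrative only, not the authors' implementation):

```python
import numpy as np

def tc_error_variances(x, y, z):
    """Estimate error variances of three collocated products with the
    covariance notation of triple collocation analysis."""
    c = np.cov(np.vstack([x, y, z]))
    var_x = c[0, 0] - c[0, 1] * c[0, 2] / c[1, 2]
    var_y = c[1, 1] - c[0, 1] * c[1, 2] / c[0, 2]
    var_z = c[2, 2] - c[0, 2] * c[1, 2] / c[0, 1]
    return np.array([var_x, var_y, var_z])

def merge_by_inverse_error(products):
    """Linear fusion with weights inversely proportional to estimated error variance."""
    err_var = tc_error_variances(*products)
    weights = (1.0 / err_var) / np.sum(1.0 / err_var)
    return np.average(np.vstack(products), axis=0, weights=weights)
```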
This dataset contains all Wintertime Investigation of Transport, Emission, and Reactivity (WINTER) C-130 observations merged at the 1 second time steps. In addition to the observations, the dataset also contains results from the GEOS-Chem near-realtime simulation sampled along the flight track. Refer to the instrument's dataset for instrument description. The GEOS-Chem model description can be found at www.geos-chem.org. Missing values are indicated by -99999. When generating fine time resolution data from a coarser resolution, the reported value at the original (coarse) time step is applied uniformly to all intermediate (fine) time steps - no interpolation is performed. When generating coarse time resolution data from a finer resolution, the time weighted average of the values at the intermediate (fine) time steps is used as the value at the coarser time step. Please follow the WINTER data policy. These data were updated April 7, 2016. The revised set uses the final data uploaded to the WINTER archive as of 4th April 2016.
Origin Datasets: allenai/llama-3.1-tulu-3-405b-preference-mixture
Dataset Sampling for Merge-Up SLM Training: to prepare a dataset of 100,000 samples for Merge-Up SLM training, the following steps were taken:
1. Filtering for English Only: we used a regular expression to filter the dataset, retaining only the samples that contain English alphabets exclusively.
2. Proportional Sampling by Token Length: starting from 4,000 tokens, we counted the number of samples in increments of 200 tokens. Based on… See the full description on the dataset page: https://huggingface.co/datasets/aeolian83/allenai_llama_3.1_tulu_3_405b_preference_mixture_filtered_10k_sampled.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets are results from merging three FengYun passive microwave soil moisture observations at a 15 km x 15 km spatial resolution from 2011 to 2020, with continuous extension as data become available. Here, we rely on a merging technique that minimizes the mean square error (MSE) using the signal-to-noise ratio (SNRopt) of the input parent products to first merge sub-daily soil moisture products into daily averages. These daily averages are then gap-filled using a Data INterpolating Convolutional Auto-Encoder, DINCAE (FY3_Reoconstructed_*). The advantage of this method is that it provides error variances (FY3_ErVar_*) for each pixel and time step, which are useful for several applications.
Noncommunicable diseases are the top cause of deaths. In 2008, more than 36 million people worldwide died of such diseases; ninety per cent of those lived in low-income and middle-income countries (WHO Maps Noncommunicable Disease Trends in All Countries). The STEPS Noncommunicable Disease Risk Factor Survey, part of the STEPwise approach to surveillance (STEPS) Adult Risk Factor Surveillance project by the World Health Organization (WHO), is a survey methodology to help countries begin to develop their own surveillance systems to monitor and fight noncommunicable diseases. The methodology prescribes three steps: questionnaire, physical measurements, and biochemical measurements. The steps consist of core items, core variables, and optional modules. Core topics covered by most surveys are demographics, health status, and health behaviors. These provide data on socioeconomic risk factors and metabolic, nutritional, and lifestyle risk factors. Details may differ from country to country and from year to year.
The general objective of the Zimbabwe NCD STEPS survey was to assess the risk factors of selected NCDs in the adult population of Zimbabwe using the WHO STEPwise approach to non-communicable diseases surveillance. The specific objectives were:
- To assess the distribution of life-style factors (physical activity, tobacco and alcohol use), and anthropometric measurements (body mass index and central obesity) which may impact on diabetes and cardiovascular risk factors.
- To identify dietary practices that are risk factors for selected NCDs.
- To determine the prevalence and determinants of hypertension.
- To determine the prevalence and determinants of diabetes.
- To determine the prevalence and determinants of serum lipid profile.
Mashonaland Central, Midlands and Matebeleland South Provinces.
Household Individual
The survey comprised individuals aged 25 years and over.
Sample survey data [ssd]
A multistage sampling strategy with 3 stages consisting of province, district and health centre was employed. The World Health Organization STEPwise Approach (STEPS) was used as the design basis for the survey. The 3 randomly selected provinces for the survey were Mashonaland Central, Midlands and Matebeleland South. In each province four districts were chosen and four health centres were surveyed per district. The survey comprised individuals aged 25 years and over. The survey was carried out on 3,081 respondents consisting of 1,189 from Midlands, 944 from Mashonaland Central and 948 from Matebeleland South. A detailed description of the sampling process is provided in sections 3.8-3.9 of the survey report provided under the related materials tab.
Designing a community-based survey such as this one is fraught with difficulties in ensuring representativeness of the sample chosen. In this survey there was a preponderance of female respondents because of the pattern of employment of males and females which also influences urban rural migration.
The response rate in Midlands was lower than the other two provinces in both STEP 2 and 3. This notable difference was due to the fact that Midlands had more respondents sampled from the urban communities. A higher proportion of urban respondents was formally employed and therefore did not complete STEP 2 and 3 due to conflict with work schedules.
Face-to-face [f2f]
In this survey all the core and selected expanded and optional variables were collected. In addition, a food frequency questionnaire and a UNICEF-developed questionnaire, the Fortification Rapid Assessment Tool (FRAT), were administered to elicit relevant dietary information.
Data entry for Step 1 and Step 2 data was carried out as soon as data became available to the data management team. Step 3 data became available in October and data entry was carried out when data quality checks were completed in November. Report writing started in September and a preliminary report became available in December 2005.
Training of data entry clerks: Five data entry clerks were recruited and trained for one week. The selection of data entry clerks was based on their performance during previous research carried out by the MOH&CW. The training of the data entry clerks involved the following:
- Familiarization with the NCD, FRAT and FFQ questionnaires.
- Familiarization with the data entry template.
- Development of codes for open-ended questions.
- Statistical package (EPI Info 6).
- Development of a data entry template using EPI6.
- Development of check files for each template.
- Trial runs (mock runs) to check whether the template was complete and user friendly for data entry.
- Double entry (what it involves and how to do it and why it should be done).
- Pre-primary data cleaning (check whether denominators are tallying) of the data entry template was done.
Data Entry for NCD, FRAT and FFQ questionnaires: The questionnaires were sequentially numbered and then divided among the five data entry clerks. Each of the data entry clerks had a unique identifier for quality control purposes. Hence, the data were entered into five separate files using the statistical package EPI Info version 6.0. The data entry clerks interchanged their files for double entry and validation of the data. Preliminary data cleaning was done for each of the five files. The five files were then merged to give a single file. The merged file was then transferred to STATA Version 7.0 using Stat Transfer version 5.0.
Data Cleaning: A data-cleaning workshop was held with the core research team members. The objectives of the workshop were:
1. To check all data entry errors.
2. To assess any inconsistencies in data filling.
3. To assess any inconsistencies in data entry.
4. To assess completeness of the data entered.
Data Merging: There were two datasets (the NCD questionnaire dataset and the laboratory dataset) after the data entry process. The two files were merged by joining corresponding observations from the NCD questionnaire dataset with those from the laboratory dataset into single observations using a unique identifier. The ID number was chosen as the unique identifier since it appeared in both data sets. The main aim of merging was to combine the two datasets containing information on behaviour of individuals and the NCD laboratory parameters. When the two data sets were merged, a new merge variable was created, taking the values 1, 2 and 3 (a pandas sketch of this merge follows below):
- Merge variable==1: the observation appeared in the NCD questionnaire data set but a corresponding observation was not in the laboratory data set.
- Merge variable==2: the observation appeared in the laboratory data set but a corresponding observation did not appear in the questionnaire data set.
- Merge variable==3: the observation appeared in both data sets, reflecting a complete merge of the two data sets.
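The merge variable described above behaves like the indicator column produced by a pandas outer merge; a minimal sketch, assuming the identifier column is simply named 'ID':

```python
import pandas as pd

def merge_with_status(questionnaire: pd.DataFrame, laboratory: pd.DataFrame) -> pd.DataFrame:
    """Outer-merge the two datasets on the ID number and add a merge-status
    variable: 1 = questionnaire only, 2 = laboratory only, 3 = both."""
    merged = questionnaire.merge(laboratory, on="ID", how="outer", indicator=True)
    merged["merge"] = merged["_merge"].map(
        {"left_only": 1, "right_only": 2, "both": 3}
    )
    return merged.drop(columns="_merge")
```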
Data Cleaning After Merging: Data cleaning involved identifying the observations where the merge variable values were either 1 or 2. The merge status for each observation was also changed after effecting any corrections. The other unique variables used in the cleaning were province, district and health centre, since they also appeared in both data sets.
Objectives of cleaning: 1. Match common variables in both data sets and identify inconsistencies in other matching variables e.g. province, district and health centre. 2. To check for any data entry errors.
A total of 3,081 respondents were included in the survey against an estimated sample size of 3,000. The response rate for Step 1 was 80%, and for Step 2 it was 70%, taking Step 1 accrual as 100%.
This dataset contains the partial pressure of carbon dioxide (pCO2) climatology that was created by merging 2 published and publicly available pCO2 datasets covering the open ocean (Landschützer et al. 2016) and the coastal ocean (Laruelle et al. 2017). Both fields were initially created using a 2-step neural network technique. In a first step, the global ocean is divided into 16 biogeochemical provinces using a self-organizing map. In a second step, the non-linear relationship between variables known to drive the surface ocean carbon system and gridded observations from the SOCAT open and coastal ocean datasets (Bakker et al. 2016) is reconstructed using a feed-forward neural network within each province separately. The final product is then produced by projecting driving variables, e.g., surface temperature, chlorophyll, mixed layer depth, and atmospheric CO2 onto oceanic pCO2 using these non-linear relationships (see Landschützer et al. 2016 and Laruelle et al. 2017 for more detail). This results in monthly open ocean pCO2 fields at 1°x1° resolution and coastal ocean pCO2 fields at 0.25°x0.25° resolution. To merge the products, we divided each 1°x1° open ocean bin into 16 equal 0.25°x0.25° bins without any interpolation. The common overlap area of the products has been merged by scaling the respective products by their mismatch compared to observations from the SOCAT datasets (see Landschützer et al. 2020).
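Splitting each 1°x1° open-ocean bin into sixteen 0.25°x0.25° bins without interpolation amounts to block repetition, e.g. in this illustrative NumPy sketch:

```python
import numpy as np

def split_to_quarter_degree(open_ocean_pco2: np.ndarray) -> np.ndarray:
    """Replicate each 1-degree cell into a 4x4 block of 0.25-degree cells,
    with no interpolation (values are simply repeated)."""
    return np.kron(open_ocean_pco2, np.ones((4, 4)))

# A 180x360 (1-degree) field becomes 720x1440 (0.25-degree).
```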
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
summary-map-reduce-v1
A dataset for training text-to-text models to consolidate multiple summaries from a chunked long document in the "reduce" step of map-reduce summarization
About
Each example contains chunked summaries from a long document, concatenated into a single string with a delimiter (input_summaries), and their synthetically generated consolidated/improved version (final_summary). The consolidation step focuses on merging redundant information while… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/summary-map-reduce-v1.
Version 2 of the dataset has been superseded by a newer version. Users should not use version 2 except in rare cases (e.g., when reproducing previous studies that used version 2). The International Best Track Archive for Climate Stewardship (IBTrACS) dataset was developed by the NOAA National Climatic Data Center, which took the initial step of synthesizing and merging best track data from all official Tropical Cyclone Warning Centers (TCWCs) and the WMO Regional Specialized Meteorological Centers (RSMCs) responsible for developing and archiving best track data worldwide. Recognizing the deficiency in global tropical cyclone data and the lack of a publicly available dataset, the IBTrACS dataset was produced, which, for the first time, combines existing best track data from over 10 international forecast centers. The dataset contains the position, maximum sustained winds, minimum central pressure, and storm nature for every tropical cyclone globally at 6-hr intervals in UTC. Statistics from the merge are also provided (such as the number of centers tracking the storm, range in pressure, median wind speed, etc.). The dataset period is from 1848 to the present, with dataset updates performed semi-annually: in the boreal spring following the completion of the Northern Hemisphere TC season, and in the boreal autumn following the completion of the Southern Hemisphere TC season. The dataset is archived as netCDF files but can be accessed via a variety of user-friendly formats to facilitate data analysis, including netCDF and CSV formatted files. Version 2 changes include source data updates, bug fixes, adjustments and corrections as well as additional source datasets.