CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, because NHANES data are plagued with multiple inconsistencies, the data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:
- demographics (281 variables)
- dietary consumption (324 variables)
- physiological functions (1,040 variables)
- occupation (61 variables)
- questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood)
- medications (29 variables)
- mortality information linked from the National Death Index (15 variables)
- survey weights (857 variables)
- environmental exposure biomarker measurements (598 variables)
- chemical comments indicating which measurements are below or above the lower limit of detection (505 variables)

CSV Data Record: The curated NHANES datasets and the data dictionaries comprise 23 .csv files and 1 Excel file. The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- “dictionary_drug_codes.csv” contains the dictionary of descriptors for the drug codes.
- “nhanes_inconsistencies_documentation.xlsx” is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file.
- “w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data.
- “m - nhanes_1988_2018.R” shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
- “example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.
- “example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- “example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
- “example_3 - run_multiple_regressions.Rmd” demonstrates how to run multiple regression models with and without adjusting for the sampling design.
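To illustrate what "with and without accounting for the sampling design" means for a summary statistic, the following is a minimal Python sketch (the tutorials themselves are in R and are not reproduced here). The measurement values and weights are made-up toy numbers, and `weighted_mean` is a hypothetical helper; a real analysis would also need the strata and sampling-unit information to estimate variances, which this sketch does not attempt.

```python
def weighted_mean(values, weights):
    """Weighted mean that skips records with a missing value or weight."""
    pairs = [(v, w) for v, w in zip(values, weights)
             if v is not None and w is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        raise ValueError("no usable observations")
    return sum(v * w for v, w in pairs) / total_weight


# Hypothetical toy records: one measurement and one survey weight per participant.
measurements = [10.0, 20.0, None, 40.0]
survey_weights = [1.0, 1.0, 5.0, 2.0]

observed = [v for v in measurements if v is not None]
unweighted = sum(observed) / len(observed)                # ignores the design
weighted = weighted_mean(measurements, survey_weights)    # uses the weights

print(f"unweighted mean: {unweighted:.2f}")
print(f"weighted mean:   {weighted:.2f}")
```

The two estimates differ because participants with larger weights stand in for more of the population.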
The main objectives of the survey were:
- To obtain weights for the revision of the Consumer Price Index (CPI) for Funafuti
- To provide information on the nature and distribution of household income, expenditure and food consumption patterns
- To provide data on the household sector's contribution to the National Accounts
- To provide information on the economic activity of men and women to study gender issues
- To undertake some poverty analysis
National, including Funafuti and Outer islands
All private households are included in the sampling frame. In each selected household, the current residents are surveyed, as are people who are usual residents but are currently away (for work, health or holiday reasons, or boarding students, for example). If the household has been residing in Tuvalu for less than one year:
- but intends to reside for more than 12 months => the household is included
- and does not intend to reside for more than 12 months => out of scope
Sample survey data [ssd]
It was decided that a 33% (one-third) sample was sufficient to achieve suitable levels of accuracy for key estimates in the survey. The sample selection was therefore spread proportionally across all the islands except Niulakita, which was considered too small. For selection purposes, each island was treated as a separate stratum and independent samples were selected from each. The strategy used was to list each dwelling on the island by its geographical position and run a systematic skip through the list to achieve the 33% sample. This approach ensured that the sample would be spread out across each island as much as possible and thus be more representative.
For details please refer to Table 1.1 of the Report.
Only the island of Niulakita was not included in the sampling frame, as it was considered too small.
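The selection procedure described above (one stratum per island, dwellings listed in geographical order, a systematic skip to reach a one-third sample) can be sketched as follows. This is only an illustration: the island names and dwelling identifiers are hypothetical, and the random starting point is an assumption, since the survey documentation does not say how the start was chosen.

```python
import random


def systematic_sample(dwellings, skip=3, start=None, rng=random):
    """Select every `skip`-th dwelling from an ordered list.

    `dwellings` is assumed to be sorted by geographical position already.
    If `start` is not given, a random start in [0, skip) is drawn.
    """
    if skip < 1:
        raise ValueError("skip must be at least 1")
    if start is None:
        start = rng.randrange(skip)
    return dwellings[start::skip]


# Hypothetical dwelling listings, one independent stratum per island.
listings = {
    "Island A": [f"A-{i:03d}" for i in range(1, 31)],
    "Island B": [f"B-{i:03d}" for i in range(1, 13)],
}

rng = random.Random(2010)  # fixed seed so the sketch is repeatable
sample = {
    island: systematic_sample(dwellings, skip=3, rng=rng)
    for island, dwellings in listings.items()
}

for island, selected in sample.items():
    print(island, len(selected), "of", len(listings[island]), "dwellings selected")
```

A skip of 3 yields roughly one dwelling in three on each island, and because the list follows geographical position the selected dwellings are spread across the island.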
Face-to-face [f2f]
There were three main survey forms used to collect data for the survey. Each question is written in English and translated into Tuvaluan on the same version of the questionnaire. The questionnaires were designed based on the 2004 survey questionnaire.
HOUSEHOLD FORM
- composition of the household and demographic profile of each member
- dwelling information
- dwelling expenditure
- transport expenditure
- education expenditure
- health expenditure
- land and property expenditure
- household furnishing
- home appliances
- cultural and social payments
- holidays/travel costs
- loans and savings
- clothing
- other major expenditure items

INDIVIDUAL FORM
- health and education
- labor force (individuals aged 15 and above)
- employment activity and income (individuals aged 15 and above): wages and salaries, working own business, agriculture and livestock, fishing, income from handicraft, income from gambling, small-scale activities, jobs in the last 12 months, other income, children's income, tobacco and alcohol use, other activities, and seafarer

DIARY (one diary per week over a 2-week period; 2 diaries per household were required)
- All kinds of expenses
- Home production: food and drink (eaten by the household, given away, sold)
- Goods taken from own business (consumed, given away)
- Monetary gifts (given away, received, winnings from gambling)
- Non-monetary gifts (given away, received, winnings from gambling)
Questionnaire Design Flaws
Questionnaire design flaws are problems with the way questions were worded that result in the respondent providing an incorrect answer. Despite every effort to minimize this problem during the design of the respective survey questionnaires and the diaries, problems were still identified during the analysis of the data. Some examples are provided below:

Gifts, Remittances & Donations
Collecting information on the following is a very difficult task in a HIES:
- the receipt and provision of gifts
- the receipt and provision of remittances
- the provision of donations to the church, other communities and family occasions
The extent of these activities in Tuvalu is very high, so every effort should be made to address these activities as best as possible. A key problem lies in identifying the best form (questionnaire or diary) for covering such activities. A general rule of thumb for a HIES is that if the activity occurs on a regular basis and involves the exchange of small monetary amounts or in-kind gifts, the diary is more appropriate. On the other hand, if the activity is less frequent and involves larger sums of money, the questionnaire with a recall approach is preferred. It is not always easy to distinguish between the two for the different activities, and as such, both the diary and the questionnaire were used to collect this information. Unfortunately, it probably wasn't made clear enough which types of transactions were being collected from which source, and as such some transactions might have been missed and others counted twice. The effects of these problems are hopefully minimal overall.

Defining Remittances
Because people have different interpretations of what constitutes remittances, the questionnaire needs to be very clear as to how this concept is defined in the survey. Unfortunately this wasn't explained clearly enough, so it was difficult to distinguish between a remittance, which should be of a more regular nature, and a one-off monetary gift transferred between two households.
Business Expenses Still Recorded
The aim of the survey is to measure "household" expenditure, and as such, any expenditure made by a household for an item or service which was primarily used for a business activity should be excluded. It was not always clear in the questionnaire that this was the case, and as such some business expenses were included. Efforts were made during data cleaning to remove any such business expenses which would impact significantly on survey results.

Purchased goods given away as a gift
When a household makes a gift donation of an item it has purchased, this is recorded in section 5 of the diary. Unfortunately, it was difficult to know how to treat these items, as it was not clear whether the item had already been recorded in section 1 of the diary, which covers purchases. The decision was made to exclude all information on gifts given which were considered to be purchases, as these items were assumed to have already been recorded in section 1. Ideally these items should be treated as a purchased gift given away, which in turn is not household consumption expenditure, but this was not possible.

Some key items missed in the Questionnaire
Although not a big issue, some key expenditure items were omitted from the questionnaire when it would have been best to collect them via this schedule. A key example is electric fans, which many households in Tuvalu own.
Consistency of the data:
- each questionnaire was checked by the supervisor during and after the collection
- before data entry, all the questionnaires were coded
- the CSPro data entry system included inconsistency checks, which allowed the NSO staff to identify some errors and correct them by imputing estimates based on their own knowledge (there was no time for double entry); there were 4 data entry operators
- after data entry, outliers were identified in order to check their consistency
All data entry, including editing, edit checks and queries, was done using CSPro (Census Survey Processing System) with additional data editing and cleaning taking place in Excel.
The staff from the CSD was responsible for undertaking the coding and data entry, with assistance from an additional four temporary staff to help produce results in a more timely manner.
Although enumeration was not completed until mid-June, the coding and data entry commenced as soon as forms were available from Funafuti, which was towards the end of March. The coding and data entry were then completed around the middle of July.
A visit from an SPC consultant then took place to undertake initial cleaning of the data, primarily addressing missing data items and missing schedules. Once the initial data cleaning was undertaken in CSPro, data was transferred to Excel where it was closely scrutinized to check that all responses were sensible. In the cases where unusual values were identified, original forms were consulted for these households and modifications made to the data if required.
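The documentation does not say which rule was used to identify unusual values, so the following is only a sketch of one common screening approach, the interquartile-range (IQR) rule, swapped in as an assumption. The household identifiers and expenditure amounts are hypothetical. Flagged values are candidates for checking against the original forms, not automatic corrections.

```python
import statistics


def flag_outliers(values_by_id, k=1.5):
    """Return the entries lying more than k * IQR outside the quartiles."""
    values = list(values_by_id.values())
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {key: v for key, v in values_by_id.items() if v < low or v > high}


# Hypothetical weekly food expenditure per household (made-up figures).
weekly_food_spend = {
    "HH-01": 12, "HH-02": 15, "HH-03": 14, "HH-04": 13,
    "HH-05": 16, "HH-06": 15, "HH-07": 14, "HH-08": 400,
}

to_review = flag_outliers(weekly_food_spend)
print("check original forms for:", sorted(to_review))
```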
Despite the best efforts being made to clean the data file in preparation for the analysis, no doubt errors will still exist in the data, due to its size and complexity. Having said this, they are not expected to have significant impacts on the survey results.
Under-Reporting and Incorrect Reporting as a result of Poor Field Work Procedures
The most crucial stage of any survey activity, whether it be a population census or a survey such as a HIES, is the fieldwork. It is crucial for intense checking to take place in the field before survey forms are returned to the office for data processing. Unfortunately, it became evident during the cleaning of the data that fieldwork wasn't checked as thoroughly as required, and as such some unexpected values appeared in the questionnaires, as well as unusual results in the diaries. Efforts were made to identify the main issues which would have the greatest impact on final results, and where required this information was modified, using local knowledge, to a more reasonable answer.
Data Entry Errors Data entry errors are always expected, but can be kept to a minimum with
The intention is to collect data for the calendar year 2009 (or the nearest year for which each business keeps its accounts). The survey is considered a one-off survey, although for accurate NAs, such a survey should be conducted at least every five years to enable regular updating of the ratios, etc., needed to adjust the ongoing indicator data (mainly VAGST) to NA concepts. The questionnaire will be drafted by FSD, largely following the previous BAS, updated to current accounting terminology where necessary. The questionnaire will be pilot tested, using some accountants who are likely to complete a number of the forms on behalf of their business clients, and a small sample of businesses. Consultations will also include Ministry of Finance, Ministry of Commerce, Industry and Labour, Central Bank of Samoa (CBS), Samoa Tourism Authority, Chamber of Commerce, and other business associations (hotels, retail, etc.).
The questionnaire will collect a number of items of information about the business ownership, locations at which it operates and each establishment for which detailed data can be provided (in the case of complex businesses), contact information, and other general information needed to clearly identify each unique business. The main body of the questionnaire will collect data on income and expenses, to enable value added to be derived accurately. The questionnaire will also collect data on capital formation, and will contain supplementary pages for relevant industries to collect volume of production data for selected commodities and to collect information to enable an estimate of value added generated by key tourism activities.
The principal user of the data will be FSD which will incorporate the survey data into benchmarks for the NA, mainly on the current published production measure of GDP. The information on capital formation and other relevant data will also be incorporated into the experimental estimates of expenditure on GDP. The supplementary data on volumes of production will be used by FSD to redevelop the industrial production index which has recently been transferred under the SBS from the CBS. The general information about the business ownership, etc., will be used to update the Business Register.
Outputs will be produced in a number of formats, including a printed report containing descriptive information on the survey design, data tables, and analysis of the results. The report will also be made available on the SBS website in “.pdf” format, and the tables will be available on the SBS website as Excel tables. Data by region may also be produced, although at a higher level of aggregation than the national data. All data will be fully confidentialised, to protect the anonymity of all respondents. Consideration may also be given to providing confidentialised unit record files (CURFs) for selected analytical users.
A high level of accuracy is needed because the principal purpose of the survey is to develop revised benchmarks for the NA. The initial plan was that the survey would be conducted as a stratified sample survey, with full enumeration of large establishments and a sample of the remainder.
National Coverage
The main statistical unit to be used for the survey is the establishment. For simple businesses that undertake a single activity at a single location there is a one-to-one relationship between the establishment and the enterprise. For large and complex enterprises, however, it is desirable to separate each activity of an enterprise into establishments to provide the most detailed information possible for industrial analysis. The business register will need to be developed in such a way that records the links between establishments and their parent enterprises. The business register will be created from administrative records and may not have enough information to recognize all establishments of complex enterprises. Large businesses will be contacted prior to the survey post-out to determine if they have separate establishments. If so, the extended structure of the enterprise will be recorded on the business register and a questionnaire will be sent to the enterprise to be completed for each establishment.
SBS has decided to follow the New Zealand simplified version of its statistical units model for the 2009 BAS. Future surveys may consider location units and enterprise groups if they are found to be useful for statistical collections.
It should be noted that while establishment data may enable the derivation of detailed benchmark accounts, it may be necessary to aggregate up to enterprise level data for the benchmarks if the ongoing data used to extrapolate the benchmark forward (mainly VAGST) are only available at the enterprise level.
The BASs covered all employing units and excluded small non-employing units such as the market sellers. The surveys also excluded central government agencies engaged in public administration (ministries, public education and health, etc.). The survey only covers businesses that pay the VAGST (threshold: SAT$75,000 and upwards).
Sample survey data [ssd]
- The total sample size was 1240.
- Out of the 1240, 902 successfully completed the questionnaire.
- The remaining 338 either never responded or were omitted (some businesses were omitted from the sample as they did not meet the requirement to be surveyed).
- The selection was all employing units paying VAGST (threshold: SAT$75,000 and upwards).
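The figures above imply a completion rate that can be checked with a few lines of arithmetic. This is a rough sketch only: because the 338 non-completions mix non-respondents with businesses omitted as out of scope, and the split between the two is not given, the result understates the response rate among eligible businesses.

```python
sample_size = 1240   # total sample, from the description above
completed = 902      # questionnaires successfully completed

not_completed = sample_size - completed
completion_rate = completed / sample_size

print(f"not completed: {not_completed}")
print(f"completion rate: {completion_rate:.1%}")
```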
Mail Questionnaire [mail]
Supplementary Pages
Additional pages have been prepared to collect data for a limited range of industries.
1. Production data. To rebase and redevelop the Industrial Production Index (IPI), it is intended to collect volume of production information from a selection of large manufacturing businesses. The selection of businesses and products is critical to the usefulness of the IPI. The products must be homogeneous, and be of enough importance to the economy to justify collecting the data. Significance criteria should be established for the selection of products to include in the IPI, and the 2009 BAS provides an opportunity to collect benchmark data for a range of products known to be significant (based on information in the existing IPI, CPI weights, export data, etc.) as well as open questions for respondents to provide information on other significant products.
2. Tourism. There is a strong demand for estimates of tourism value added. Estimating tourism value added using the international standard Tourism Satellite Account methodology requires the use of an input-output table, which is beyond the capacity of SBS at present. However, some indicative estimates of the main parts of the economy influenced by tourism can be derived if the necessary data are collected. Tourism is a demand concept, based on defining tourists (the international standard includes both international and domestic tourists), what products are characteristically purchased by tourists, and which industries supply those products. Some questions targeted at those industries that have significant involvement with tourists (hotels, restaurants, transport and tour operators, vehicle hire, etc.), on how much of their income is sourced from tourism, would provide valuable indicators of the size of the direct impact of tourism.
Partial imputation was done at the time of receipt of questionnaires, after the follow-up procedures to obtain fully completed questionnaires had been followed. Imputation followed a set process: ratios from responding units in the imputation cell were applied to the partial data that was supplied. Procedures were established during the editing stage (a) to preserve the integrity of the questionnaires as supplied by respondents, and (b) to record all changes made to the questionnaires during editing. If SBS staff write on the form, for example, this should only be done in red pen, to distinguish the alterations from the original information.
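The ratio approach described above can be sketched as follows. This is an illustration under assumptions, not the procedure FSD actually ran (which was done in Excel): the item names `income` and `expenses`, the figures, and the choice of a single auxiliary item are all hypothetical. The supplied record is left untouched and the imputed item is recorded, mirroring the two editing principles (a) and (b).

```python
def cell_ratio(responders, target, auxiliary):
    """Ratio of target to auxiliary totals among fully responding units."""
    auxiliary_total = sum(unit[auxiliary] for unit in responders)
    if auxiliary_total == 0:
        raise ValueError("auxiliary total is zero; ratio undefined")
    return sum(unit[target] for unit in responders) / auxiliary_total


def impute(partial, responders, target, auxiliary):
    """Return a copy of `partial` with the missing target item filled in.

    The original record is not modified, and the imputed item is listed
    under "imputed_items" so that every change stays traceable.
    """
    filled = dict(partial)
    filled[target] = cell_ratio(responders, target, auxiliary) * partial[auxiliary]
    filled["imputed_items"] = [target]
    return filled


# Hypothetical imputation cell: fully responding units in one industry.
cell = [
    {"income": 200_000, "expenses": 150_000},
    {"income": 100_000, "expenses": 90_000},
]
supplied = {"income": 50_000, "expenses": None}  # partial response

completed_record = impute(supplied, cell, target="expenses", auxiliary="income")
print(completed_record)
```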
Additional edit checks were developed, including checking against external data at enterprise/establishment level. The external data checked against include VAGST (for turnover and purchases) and SNPF (for salaries and wages, and employment data). Editing and imputation processes were undertaken by FSD using Excel.
Our goals with this dataset were to 1) isolate, culture, and identify two fungal life stages of Aspergillus flavus, 2) characterize the volatile emissions from grain inoculated by each fungal morphotype, and 3) understand how microbially-produced volatile organic compounds (MVOCs) from each fungal morphotype affect foraging, attraction, and preference by Sitophilus oryzae. This dataset includes data derived from headspace collection coupled with GC-MS, where we found the sexual life stage of A. flavus had the most unique emissions of MVOCs compared to the other semiochemical treatments. This translated to a higher arrestment with kernels containing grain with the A. flavus sexual life stage, as well as a higher cumulative time spent in those zones by S. oryzae in a video-tracking assay in comparison to the asexual life stage. While fungal cues were important for foraging at close range, the release-recapture assay indicated that grain volatiles were more important for attraction at longer distances. There was no significant preference between grain and MVOCs in a four-way olfactometer, but methodological limitations in this assay prevent broad interpretation. Overall, this study enhances our understanding of how fungal cues affect the foraging ecology of a primary stored product insect. In the assays described herein, we analyzed the behavioral response of S. oryzae to five different blends of semiochemicals found and introduced in wheat (Table 1). Briefly, these included no stimuli (negative control), UV-sanitized grain, clean grain from storage (unmanipulated, positive control), as well as grain from storage inoculated with fungal morphotype 1 (M1, identified as the asexual life stage of Aspergillus flavus) and fungal morphotype 2 (M2, identified as the sexual life stage of A. flavus). Fresh samples of semiochemicals were used for each day of testing for each assay.
In order to prevent cross-contamination, 300 g of grain (tempered to 15% grain moisture) was initially sanitized using UV for 20 min. This procedure was done before inoculating grain with either morphotype 1 or 2. The 300 g of grain was kept in a sanitized mason jar (8.5 D × 17 cm H). To inoculate grain with the two different morphologies, we scraped an entire isolation from a petri dish into the 300 g of grain. Each isolation was ~1 week old and completely colonized by the given morphotype. After inoculation, each treatment was placed in an environmental chamber (136VL, Percival Instruments, Perry, IA, USA) set at constant conditions (30°C, 65% RH, and 14:10 L:D). This procedure was the same for both morphologies and was done every 2 weeks to ensure fresh treatments for each experimental assay. See file list for descriptions of each data file.

Resources in this dataset:
- Resource Title: Ethovision Movement Assay. File Name: ponce_lizarraga_ethovision_assay_microbial_volatiles_2020.csv. Resource Software Recommended: Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
- Resource Title: Olfactometer Round 1 Assay - With Fused Air Permeable Glass. File Name: ponce_lizarraga_first_round_olfactometer_fungal_study_2020.csv. Resource Software Recommended: Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
- Resource Title: Olfactometer Round 2 Assay - With Fused Air Permeable Glass Containing Holes. File Name: ponce_lizarraga_second_round_olfactometer_fungal_study_2021.csv. Resource Software Recommended: Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
- Resource Title: Small Release-Recapture Assay. File Name: ponce_lizarraga_small_release_recapture_assay.csv. Resource Software Recommended: Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
- Resource Title: Large Release-Recapture Assay. File Name: ponce_lizarraga_large_release_recapture_assay.csv. Resource Software Recommended: Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
- Resource Title: Headspace Volatile Collection Assay. File Name: sandra_headspace_volatiles_2020.csv. Resource Software Recommended: Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
- Resource Title: README file list. File Name: file_list_stored_grain_Aspergillus_Sitophilus_oryzae.txt
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The CEPS EurLex dataset
The dataset contains 142,036 EU laws, almost the entire corpus of the EU's digitally available legal acts passed between 1952 and 2019. It encompasses the three types of legally binding acts passed by the EU institutions: 102,304 regulations, 4,070 directives, and 35,798 decisions, all in English. The dataset was scraped from the official EU legal database (Eur-lex.eu) and transformed into machine-readable CSV format with the programming languages R and Python. The dataset was collected by the Centre for European Policy Studies (CEPS) for the TRIGGER project (https://trigger-project.eu/). We hope that it will facilitate future quantitative and computational research on the EU.

Brief description:
- The dataset is organised in tabular format, with each law representing one row and the columns representing 23 variables.
- The full text of 134,633 laws is included (column "act_raw_text"). For newer laws, the text was scraped from Eur-lex.eu via the HTML pages, while for older laws, the text was extracted from (scanned) PDF documents (if available in English).
- 22 additional variables are included, such as 'Act_name', 'Act_type', 'Subject_matter', 'Authors', 'Date_document', 'ELI_link', 'CELEX' (a unique identifier for every law). Please see the "CEPS_EurLex_codebook.pdf" file for an explanation of all variables.
- Given its size, the dataset was uploaded in different batches to facilitate usage. Some Excel files are provided for non-technical users. We recommend, however, the use of the CSV files, since Excel does not save large amounts of data properly. EurLex_all.csv is the master file containing all data.

Caveats:
- The Eur-lex.eu website does not consistently provide data for all the variables. In addition, the HTML documents were not always cleanly formatted and text extraction from scanned PDFs is not entirely clean. Some data points are therefore missing for some laws and some laws were excluded entirely.
- Not all (older) laws were available in English, especially since Ireland and the UK only joined the European Communities in 1973. Non-English laws are excluded from the dataset.

Other:
- For details on the types of EU legal acts: https://ec.europa.eu/info/law/law-making-process/types-eu-law_en
- An example for an experimental analysis with this dataset: https://trigger-project.eu/2019/10/28/a-data-science-approach-to-eu-differentiated-integration/
- The TRIGGER project is funded by the EU's Horizon 2020 programme, grant number 822735
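Since the recommendation above is to work from the CSV files, here is a minimal Python sketch of streaming such a file and counting laws per act type with the standard library. The column names 'CELEX', 'Act_type' and 'act_raw_text' come from the description; the sample rows, the identifier values, and the act-type labels are made up for illustration, and the file encoding in the usage comment is an assumption (check the codebook).

```python
import csv
import io
from collections import Counter


def count_by_type(csv_file, type_column="Act_type"):
    """Count rows per act type, reading one row at a time."""
    # Full-text cells can exceed the csv module's default field size limit.
    csv.field_size_limit(10**9)
    reader = csv.DictReader(csv_file)
    return Counter(row[type_column] for row in reader)


# Made-up sample standing in for the real file; note the quoted text
# field containing commas and a line break, as full-text columns often do.
sample = io.StringIO(
    "CELEX,Act_type,act_raw_text\n"
    'X0001,Regulation,"Article 1, first paragraph\nArticle 2"\n'
    'X0002,Directive,"Short text"\n'
    'X0003,Regulation,""\n'
)

counts = count_by_type(sample)
print(counts)

# With the real master file (encoding assumed, not confirmed by the source):
# with open("EurLex_all.csv", newline="", encoding="utf-8") as f:
#     counts = count_by_type(f)
```

Opening the file with `newline=""` lets the csv module handle line breaks inside quoted text fields itself.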
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
About Dataset Safa S. Abdul-Jabbar, Alaa k. Farhan
Context
This is the first dataset for various ordinary patients in Iraq. It provides the patients' Complete Blood Count (CBC) test information, which can be used to create a hematology diagnosis/prediction system. The data were collected in 2022 from Al-Zahraa Al-Ahly Hospital. They can be cleaned and analyzed using any programming language because they are provided in an Excel file that can be accessed and manipulated easily. The user just needs to understand how the rows and columns are arranged, because the data were collected as images (CBC images) from the laboratories and the extracted data were then stored in an Excel file.
Content
This dataset contains 500 rows. For each row (patient information), there are 21 columns containing CBC test features that can be described as follows:
ID: Patients Identifier
WBC: White Blood Cell, Normal Ranges: 4.0 to 10.0, Unit: 10^9/L.
LYMp: Lymphocyte percentage (lymphocytes are a type of white blood cell), Normal Ranges: 20.0 to 40.0, Unit: %
MIDp: Combined percentage of the other types of white blood cells not classified as lymphocytes or granulocytes, Normal Ranges: 1.0 to 15.0, Unit: %
NEUTp: Neutrophil percentage (neutrophils are a type of white blood cell, or leukocyte), Normal Ranges: 50.0 to 70.0, Unit: %
LYMn: Lymphocyte number (lymphocytes are a type of white blood cell), Normal Ranges: 0.6 to 4.1, Unit: 10^9/L.
MIDn: Indicates the combined number of other white blood cells not classified as lymphocytes or granulocytes, Normal Ranges: 0.1 to 1.8, Unit: 10^9/L.
NEUTn: Neutrophils Number, Normal Ranges: 2.0 to 7.8, Unit: 10^9/L.
RBC: Red Blood Cell, Normal Ranges: 3.50 to 5.50, Unit: 10^12/L
HGB: Hemoglobin, Normal Ranges: 11.0 to 16.0, Unit: g/dL
HCT: Hematocrit is the proportion, by volume, of the Blood that consists of red blood cells, Normal Ranges: 36.0 to 48.0, Unit: %
MCV: Mean Corpuscular Volume, Normal Ranges: 80.0 to 99.0, Unit: fL
MCH: Mean Corpuscular Hemoglobin, the average amount of hemoglobin in a red blood cell, Normal Ranges: 26.0 to 32.0, Unit: pg
MCHC: Mean Corpuscular Hemoglobin Concentration, Normal Ranges: 32.0 to 36.0, Unit: g/dL
RDWSD: Red Blood Cell Distribution Width (standard deviation), Normal Ranges: 37.0 to 54.0, Unit: fL
RDWCV: Red Blood Cell Distribution Width (coefficient of variation), Normal Ranges: 11.5 to 14.5, Unit: %
PLT: Platelet Count, Normal Ranges: 100 to 400, Unit: 10^9/L
MPV: Mean Platelet Volume, Normal Ranges: 7.4 to 10.4, Unit: fL
PDW: Platelet Distribution Width, Normal Ranges: 10.0 to 17.0, Unit: %
PCT: Plateletcrit, the proportion of blood volume occupied by platelets, Normal Ranges: 0.10 to 0.28, Unit: %
PLCR: Platelet Large Cell Ratio, Normal Ranges: 13.0 to 43.0, Unit: %
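As a starting point for cleaning or screening this kind of table, the normal ranges listed above can be turned into a simple lookup. The sketch below covers three of the columns (WBC, HGB, PLT) using the ranges stated in the list; the patient record is made up, and the assumption that values may arrive as strings reflects a typical spreadsheet export rather than anything confirmed about this file. It is a data-handling illustration, not a diagnostic tool.

```python
# Normal ranges copied from the column descriptions above (low, high).
NORMAL_RANGES = {
    "WBC": (4.0, 10.0),     # 10^9/L
    "HGB": (11.0, 16.0),    # g/dL
    "PLT": (100.0, 400.0),  # 10^9/L
}


def flag_out_of_range(row, ranges=NORMAL_RANGES):
    """Return {column: "low" | "high"} for values outside the listed ranges.

    Missing or empty cells are skipped rather than flagged.
    """
    flags = {}
    for column, (low, high) in ranges.items():
        raw = row.get(column)
        if raw is None or raw == "":
            continue
        value = float(raw)
        if value < low:
            flags[column] = "low"
        elif value > high:
            flags[column] = "high"
    return flags


# Hypothetical patient record, with values as strings.
patient = {"ID": "1", "WBC": "12.3", "HGB": "9.8", "PLT": "250"}

flags = flag_out_of_range(patient)
print(flags)
```

Extending `NORMAL_RANGES` with the remaining columns follows the same pattern.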
Acknowledgements
We thank the entire Al-Zahraa Al-Ahly Hospital team, especially the hospital manager, for cooperating with us in collecting this data while maintaining patients' confidentiality.
This dataset was created by Shiva Vashishtha
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset contains the results of the real-time water quality monitoring program (RTWQM) conducted across the Russell-Mulgrave catchment (south of Cairns) for "Project 25". Project 25 spanned two (2) NESP TWQ projects: 2.1.7 (2016 - 2018) and 4.8 (2019 - 2020), with the dataset for Project 4.8 also containing the data for Project 2.1.7. Data is the result of 2-3 hourly in situ logging of stream height (in metres) and nitrate concentrations (mg/L).
* This dataset is under an embargo period for 18 months from the completion of the project extension (NESP TWQ 4.8).
The broad aim of this study was to characterise the water quality impacts and relative signatures of the distinct land-use types found across the Russell-Mulgrave catchment, and to quantify the sugarcane industry’s specific role in end-of-catchment water quality. Subcatchment waterway sites were selected to represent the major land uses of the region and were classed as sugarcane, urban, banana, or natural rainforest. Sites were also selected based on wet-season accessibility and the size of the waterway. A total of 9 sites were selected for the monitoring program over the period 2016-2018.
Water quality monitoring for Project 25 is based around integration of relatively traditional monitoring approaches (discrete sample collection for subsequent laboratory analysis) as well as emerging real-time (sensor-based) monitoring approaches. The development of real-time information and feedback on local water quality dynamics is a relatively novel approach to landholder engagement that is yet to be meaningfully explored in natural resource management programs. Project 25 will trial these new technologies from both the perspective of an engagement-extension tool, and also their reliability in water quality monitoring applications across multiple spatial scales (paddock to catchment). This program utilises emerging real time water quality monitoring (RTWQM) technologies including sensor and telemetry technologies that provide continuous measurement of nitrogen water quality concentrations.
Noting the inherent limitations associated with traditional grab sampling, such as extended analysis and holding times prior to reporting results, monitoring programs aimed at facilitating management change are increasingly shifting towards continuous measurements using in situ sensors. RTWQM equipment was deployed in three selected sub-catchments in the broader Project 25 monitoring design to provide real-time water quality information on parameters such as nutrients (nitrate) back to the local industry network. The spatial design aims to link to specific paddock management activities within the monitored catchment sites. This will eventually enable individual decision making based on real rather than hypothetical average conditions. Localised comparative data will enable growers to compare performance with neighbours. The real-time information from these systems provides a solid basis for farmers to adjust strategies at any time in a dynamic and autonomous manner.
Methods:
Real-time monitoring stations, based closely on those utilised in an earlier BBIFMAC case study (Burton et al., 2014), were installed at three sites across the Russell-Mulgrave cane-farming district, identified in discussion with cane industry steering committee personnel. Sensors were current market-ready technologies, in this case TriOS NICO and OPUS optical sensors (https://www.trios.de/en/). Discrete manual sampling for nutrient water quality was also conducted at all sites on an approximately monthly basis during dry-season low flows to ground-truth sensor nitrate readings. Sampling frequency increased to daily (and occasionally several samples a day) during wet season flood events, particularly during early wet season ‘first-flush’ events, to capture initial high-concentration run-off dynamics from the immediate catchment area. Samples were manually collected by project scientists, or by support staff trained individually in the correct sampling and quality assurance procedures developed in conjunction with the TropWATER Water Quality Laboratory. Calibration checks of each sensor were conducted at least every 3 months, using 0, 1, 5 and 10 mg/L nitrate calibration standards provided by the TropWATER Water Quality Laboratory. Station design in 2017 initially involved water being pumped into a flow-through cell with the nitrate sensor housed in the sampling station. Some early power issues and equipment failures saw sites re-designed with the sensor installed instream in a PVC pipe, and subsequent measurements taken in situ.
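A calibration check against the four nitrate standards can be summarised with a straight-line fit. The sketch below is illustrative only: the source does not state how calibration results were evaluated or whether readings were corrected, and the sensor readings shown are invented.

```python
# Nitrate calibration standards named in the text, in mg/L.
STANDARDS_MG_L = [0.0, 1.0, 5.0, 10.0]


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correct(reading, slope, intercept):
    """Map a raw sensor reading back onto the scale of the standards."""
    return (reading - intercept) / slope


# Hypothetical sensor readings for the four standards (invented values).
readings = [0.1, 1.2, 5.4, 10.7]
slope, intercept = fit_line(STANDARDS_MG_L, readings)
print(round(slope, 3), round(intercept, 3))
print(round(correct(5.4, slope, intercept), 2))
```

A slope near 1 and an intercept near 0 would indicate that the sensor agrees with the standards; how far from those values is acceptable is a judgment the source does not specify.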
Optical sensors are susceptible to reduced performance from biofouling and sedimentation of the optical lens (Steven et al., 2013). Optical sensors utilised during Project 25 were initially cleaned using an integrated compressed air blast system that automatically cleans the optical window. Early observations of optical window cleanliness, and periodic calibration testing of sensors, highlighted that at least monthly physical cleaning of the lens was also required for satisfactory performance at some sites. Recent development of automated, externally mounted lens wiper technologies by TriOS saw these new cleaning technologies added to some sites towards the end of 2018.
Other aspects of sampling station design and operation that can improve sensor performance also emerged during the early stages of Project 25 sensor deployment and monitoring. The TriOS sensors utilised can theoretically operate with power supplies spanning 12V to 24V (±10%). Early in the deployment, nitrate-N readings frequently cycled when system operating voltages approached or fluctuated around the lower 12V threshold (owing to issues such as riparian shading of solar panels, sustained cloudy weather reducing battery recharge, and voltage drop through cable lengths). Reconfiguring the system design so that nitrate sensor measurements were always taken at a nominal 24V power output reduced these effects significantly.
Format:
Data consists of an excel spreadsheet with stream height (m) and nitrate concentrations (mg/L) for each hydrological year of data recorded on separate, named spreadsheet tabs.
References:
Burton, E., T.J. McShane, and D. Stubbs. 2014. A Sub Catchment Adaptive Management Approach To Water Quality in Sugarcane. Burdekin Bowen Integrated Floodplain Management Advisory Committee (BBIFMAC). 42pp.
Steven, ADL, Hodge, J, Cannard, T, Carlin, G, Franklin, H, McJannet, D, Moeseneder, C, Searle, R, 2014. Continuous Water Quality Monitoring on the Great Barrier Reef. CSIRO Final Report to Great Barrier Reef Foundation, 158pp.
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\custodian\2016-18-NESP-TWQ-2\2.1.7_Engaging-farmers-WQ and data\custodian\2018-2021-NESP-TWQ-4\4.8_Project25 respectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, these data must be processed before new insights can be derived through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey 1. demographics (281 variables), 2. dietary consumption (324 variables), 3. physiological functions (1,040 variables), 4. occupation (61 variables), 5. questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), 6. medications (29 variables), 7. mortality information linked from the National Death Index (15 variables), 8. survey weights (857 variables), 9. environmental exposure biomarker measurements (598 variables), and 10. chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 excel file.
- The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries (but not the uncleaned modules) can be downloaded as a .zip file, which includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e. our pipeline) to curate the original NHANES data.
Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to fit a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
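To illustrate what "with and without accounting for the sampling design" means for a simple summary statistic, here is a minimal Python sketch comparing an unweighted mean with a survey-weighted mean. It is not taken from the starter code, which is written in R, and the values and weights are invented. It covers only the point estimate; design-based standard errors also require the strata and sampling-unit information and are not attempted here.

```python
def unweighted_mean(values):
    """Plain arithmetic mean, ignoring the sampling design."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Mean in which each participant counts in proportion to a survey weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical measurements and survey weights (invented for illustration).
values = [1.0, 3.0]
weights = [1.0, 3.0]
print(unweighted_mean(values), weighted_mean(values, weights))
```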
In-situ chemical oxidation (ICO) is a remediation technology that involves the addition of chemicals to the substrate that degrade contaminants through oxidation processes. This series of field experiments, conducted at the Old Casey Powerhouse/Workshop, investigates the potential for the use of ICO technology in Antarctica on petroleum hydrocarbon contaminated sediments.
Surface application was made using 12.5% sodium hypochlorite, 6.25% sodium hypochlorite, 30% hydrogen peroxide and Fenton's reagent (hydrogen peroxide with an iron catalyst) on five separate areas of petroleum hydrocarbon contaminated sediments. Sampling was conducted before and after chemical application from the top soil section (0 - 5 cm) and at depth (10 - 15 cm).
The data are stored in an excel file.
This work was completed as part of ASAC project 1163 (ASAC_1163).
The spreadsheet is divided up as follows:
The first 51 sheets are the raw GC-FID data for the 99/00 field season, labelled by sample name. These sheets use the same format as the radiometric GC-FID spreadsheet in the metadata record entitled 'Mineralisation results using 14C octadecane at a range of temperatures'. Sample name format consists of a location or experiment indicator (CW=Casey Workshop, BR= Small-scale field trial), the year the sample was collected (00=2000), the sample type (S=Soil) and a sequence number.
SUMMARY and PRINTABLE VERSION are the same data in different formats, PRINTABLE VERSION is printer friendly. This summary data includes the hydrocarbon concentrations corrected for dry weight of soil and biodegradation and weathering indices.
GRAPHS contains the graphs.
FIELD MEASUREMENTS shows the results of the measurements taken in the field and includes PID (ppm), soil temperature (C), air temperature (C), pH and MC (moisture content) (%).
NOTES shows the chemicals added to each trial, and a short summary of the samples.
The next 21 sheets show the raw GC-FID data for the 00/01 field season, labelled according to the method explained above. PRINTABLE (0001) is a summary of the raw GC-FID data.
The next 3 sheets show the raw GC-FID data for the 01/02 field season, labelled according to the method explained above. PRINTABLE (0102) is a summary of the raw GC-FID data.
MPN-NOTES shows lab book references and set up summary for the Most Probable Number (MPN) analysis.
MPN-DETAILS shows the set up details, calculations and results for each MPN analysis.
MPN-RESULTS shows the raw MPN data.
MPN-Calculations shows the results from the MPN Calculator.
The fields in the dataset are: Retention Time, Area, % Area, Height of peak, Amount, Int Type, Units, Peak Type, Codes
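The sample name format described above (location or experiment indicator, two-digit year, sample type, sequence number) can be decoded programmatically. The sketch below assumes the four parts are concatenated without separators, as in the hypothetical name "CW00S12", and that two-digit years from 90 upward belong to the 1900s; neither detail is confirmed by the metadata record.

```python
import re

# Codes given in the description above.
LOCATIONS = {"CW": "Casey Workshop", "BR": "Small-scale field trial"}
SAMPLE_TYPES = {"S": "Soil"}

# Assumed layout: two letters, two digits, one letter, then a sequence number.
NAME_PATTERN = re.compile(r"^([A-Z]{2})(\d{2})([A-Z])(\d+)$")


def parse_sample_name(name):
    """Split a sheet or sample name into its four described components."""
    match = NAME_PATTERN.match(name.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised sample name: {name!r}")
    location, year, sample_type, sequence = match.groups()
    yy = int(year)
    return {
        "location": LOCATIONS.get(location, location),
        # Assumption: 90-99 means 1990-1999, anything else means 20xx.
        "year": 1900 + yy if yy >= 90 else 2000 + yy,
        "sample_type": SAMPLE_TYPES.get(sample_type, sample_type),
        "sequence": int(sequence),
    }


# "CW00S12" is a made-up name used only to exercise the parser.
print(parse_sample_name("CW00S12"))
```

Unknown location or sample-type codes are passed through unchanged rather than rejected, so sheets using codes not listed in the description still parse.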
Infaunal marine invertebrates were collected from inside and outside of patches of white bacterial mats from several sites in the Windmill Islands, Antarctica, around Casey station during the 2006-07 summer. Samples were collected from McGrady Cove inner and outer, the tide gauge near the Casey wharf, Stevenson's Cove and Brown Bay inner. Sediment cores of 10 cm depth and 5 cm diameter were collected by divers using a PVC corer from inside (4 cores) and outside (4 cores) each bacterial patch. The size of each patch varied from site to site. Cores were sieved at 500 microns and the extracted fauna preserved in 4 percent neutral buffered formalin. All fauna were counted and identified to species where possible or assigned to morphospecies based on previous infaunal sampling around Casey.
An excel spreadsheet is available for download at the URL given below. The spreadsheet does not represent the complete dataset, and is only the bacterial mat infauna data.
Regarding the infauna dataset:
This work was completed as part of ASAC 2201 (ASAC_2201).
https://creativecommons.org/publicdomain/zero/1.0/
One of the leading retail stores in the US, Walmart, would like to predict sales and demand accurately. Certain events and holidays affect sales on each day. Sales data are available for 45 Walmart stores. The business faces a challenge from unforeseen demand and sometimes runs out of stock because its current machine learning algorithm is not well suited to the task. An ideal ML algorithm would predict demand accurately and take into account economic factors such as the CPI and the Unemployment Index.
Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.
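The five-fold weighting of holiday weeks can be made concrete with a weighted error metric. The sketch below uses a weighted mean absolute error; the description above only states the weighting, so the choice of absolute error is an assumption, and the sales figures are invented.

```python
def weighted_mae(actual, predicted, is_holiday, holiday_weight=5.0):
    """Mean absolute error with holiday weeks counted holiday_weight times."""
    weights = [holiday_weight if h else 1.0 for h in is_holiday]
    total_error = sum(
        w * abs(a - p) for w, a, p in zip(weights, actual, predicted)
    )
    return total_error / sum(weights)


# Hypothetical weekly sales for three weeks; the second is a holiday week.
actual = [100.0, 200.0, 150.0]
predicted = [110.0, 180.0, 150.0]
is_holiday = [False, True, False]
print(weighted_mae(actual, predicted, is_holiday))
```

With this weighting, an error of 20 in the holiday week contributes ten times as much to the numerator as the error of 10 in an ordinary week, which is why models that miss holiday demand score poorly.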
The dataset is taken from Kaggle.
https://cdla.io/sharing-1-0/
The Superstore Sales Data dataset, available in Excel format as "Superstore.xlsx," is a comprehensive collection of sales and customer-related information from a retail superstore. This dataset comprises three distinct tables, each providing specific insights into the store's operations and customer interactions.
It contains the raw sequences (.fasta files), quality scores (.qual files), flowgram data from the 454 instrument (.flow files), and an index of each sequence read and the sample to which it belongs (.groups file), which can be read into the software package 'mothur' (www.mothur.org). These are provided for both the bacterial fraction (16S) and the fungal fraction (ITS).
These files were generated using the Roche 454 Titanium platform on 223 soils from the 8 locations previously mentioned. The bacterial data were generated using the 16S primers 27F and 519R, and the fungal data using the primers ITS1F and ITS4.
In the referenced study the data were analysed in mothur version 1.23.1.
In addition, the chemical and geographical data are provided as an excel spreadsheet. These parameters were all recorded but not necessarily used in the structural equation modelling. The values given are the raw data; they were transformed to reduce skewness and kurtosis.
Full method details are in the manuscript.
This metadata record describes Simone Ingham's survey of vegetation sampling sites on Macquarie Island in the summer of 2001/2002. The Biolab base station needs to be accurately surveyed from the AUSLIG (now Geoscience Australia) GPS base station (AUS211) and the GPS data re-processed using new coordinates. The re-processing is required to establish the relative and absolute values of Simone's survey. Checks will need to be made in the field with dual-frequency GPS receivers on permanent markers surveyed by Simone. This will serve as a check on the survey. This report has been compiled from Simone's report and emails from Paul Standen / Ultimate Positioning. Paul's report is not definitive regarding the relative and absolute accuracies of Simone's survey.
A pdf copy of the report is available for download from the URL given below.
Copies of the GPS data, and an excel spreadsheet summarising the locations of the various monitored sites are available for download at the URL given below.
This work was completed as part of ASAC project 1015 (ASAC_1015).
The fields in the excel spreadsheet are:
Site name, Description, Location, Latitude, Longitude, Altitude, Comments, Measured, Outlines mapped
Size fractionated chlorophyll a data (total and less than 20 µm) analysed using high performance liquid chromatography (HPLC). Underway samples were taken using a seawater line in the oceanographic lab on RSV Aurora Australis (approx. depth 4 m). CTD samples were taken using Niskin bottles attached to a CTD rosette. Six depths were sampled per station, based on fluorescence profiles from the CTD. Two of the six samples were always taken near the surface (approximately 10 m) and, where applicable, at the depth of the chlorophyll maximum. HPLC analyses were conducted according to the method of Wright et al. (2010). Column chlorophylls (µg L-1) and integrated chlorophylls (mg m-2) are shown in two separate tabs within the Excel spreadsheet.
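The relationship between the two tabs can be sketched as a depth integration: since 1 µg L-1 equals 1 mg m-3, integrating concentration over depth in metres gives mg m-2. The example below uses the trapezoidal rule between sampled depths. The record does not state which integration method or depth limits were used, so this is an assumption, and the profile values are invented.

```python
def integrate_chlorophyll(depths_m, chl_ug_per_l):
    """Trapezoidal depth integration of a chlorophyll profile.

    1 µg/L is the same as 1 mg/m^3, so concentration times depth (m)
    yields mg/m^2. Only the span between the shallowest and deepest
    samples is integrated.
    """
    pairs = sorted(zip(depths_m, chl_ug_per_l))
    total = 0.0
    for (z0, c0), (z1, c1) in zip(pairs, pairs[1:]):
        total += 0.5 * (c0 + c1) * (z1 - z0)
    return total


# Hypothetical six-depth station profile (invented values).
depths = [10, 25, 50, 75, 100, 150]
chl = [0.40, 0.55, 0.80, 0.35, 0.15, 0.05]
print(round(integrate_chlorophyll(depths, chl), 3))
```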
The dataset contains the areas and perimeters of the lakes found within the Vestfold Hills near Davis Station, Antarctica. The data are held in an excel spreadsheet. Lake areas are given in square metres and perimeters in metres. The last two columns give the areas in square kilometres and in hectares.
The fields in this dataset are: lake number, area, perimeter, development of coastline
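The unit conversions mentioned above, and one common definition of shoreline development, can be sketched as follows. It is an assumption that the "development of coastline" field is the standard shoreline development index, the ratio of a lake's perimeter to the circumference of a circle of equal area; the record does not define it.

```python
import math


def lake_metrics(area_m2, perimeter_m):
    """Convert area units and compute the shoreline development index."""
    return {
        "area_km2": area_m2 / 1_000_000,  # 1 km^2 = 1,000,000 m^2
        "area_ha": area_m2 / 10_000,      # 1 ha = 10,000 m^2
        # Perimeter divided by the circumference of a circle of equal area;
        # equals 1.0 for a perfectly circular lake.
        "shoreline_development": perimeter_m / (2 * math.sqrt(math.pi * area_m2)),
    }


# Hypothetical lake: 250,000 m^2 in area with a 2,500 m perimeter.
print(lake_metrics(250_000, 2_500))
```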
The data comprise a spreadsheet with locations (latitude and longitude) and labels for moss bed quadrats. The quadrats are located at two sites: Antarctic Specially Protected Area 135 (formerly Site of Special Scientific Interest 16) near Casey Station; and on Robertson Ridge south of Casey. The letter in the quadrat label indicates the vegetation community type: B for bryophyte, M for transitional and A for lichen. The original 60 quadrats are now described in metadata record "AAS_4046_quadrat_locations". This metadata record describes nine additional locations for associated work conducted at the sites in 2007-08 by Ellen Ryan-Colton. The latitude and longitude were collected using a handheld GPS. The quadrats are identified by markers in the field comprised of small metal discs glued to rocks.
Moss beds and other vegetation are generally rare and fragile in Antarctica. Mapping the exact location of moss beds is very useful for the management of protected areas and for follow-up biological research work.
This work was completed as part of ASAC Project 1313 (ASAC_1313), Impact of global climate change on Antarctic flora: long-term monitoring and ecophysiological studies of bryophyte community dynamics in the Windmill Islands.
Animals are multicellular, eukaryotic organisms in the biological kingdom Animalia. With few exceptions, animals consume organic material, breathe oxygen, have myocytes and are able to move, can reproduce sexually, and grow from a hollow sphere of cells, the blastula, during embryonic development. As of 2022, 2.16 million living animal species have been described, of which around 1.05 million are insects, over 85,000 are mollusks, and around 65,000 are vertebrates. It has been estimated there are around 7.77 million animal species. Animals range in length from 8.5 micrometers (0.00033 in) to 33.6 meters (110 ft). They have complex interactions with each other and their environments, forming intricate food webs. The scientific study of animals is known as zoology.
This dataset encompasses a diverse array of attributes pertaining to various animal species worldwide. The dataset prominently includes fields such as Animal, Height (cm), Weight (kg), Color, Lifespan (years), Diet, Habitat, Predators, Average Speed (km/h), Countries Found, Conservation Status, Family, Gestation Period (days), Top Speed (km/h), Social Structure, and Offspring per Birth. These columns collectively offer a comprehensive understanding of animal characteristics, habitats, behaviors, and conservation statuses. Researchers and enthusiasts can utilize this dataset to analyze animal traits, study their habitats, explore dietary patterns, assess conservation needs, and conduct a wide range of ecological research and wildlife studies.
This dataset was generated using information from: https://www.wikipedia.org/. If you wish to delve deeper, you can explore the website.
Cover Photo by: Image by brgfx on Freepik
Thumbnail by: Dog icons created by Flat Icons - Flaticon