34 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio

    • qubeshub.org
    Updated Jul 16, 2020
    Cite
    Shelly Gaynor (2020). Cleaning Biodiversity Data: A Botanical Example Using Excel or RStudio [Dataset]. http://doi.org/10.25334/DRGD-F069
    Dataset updated
    Jul 16, 2020
    Dataset provided by
    QUBES
    Authors
    Shelly Gaynor
    Description

    Access and clean an open source herbarium dataset using Excel or RStudio.

  3. Excel-project: Glassdoor Data Cleaning

    • kaggle.com
    zip
    Updated Sep 26, 2023
    Cite
    Luis Lira (2023). Excel-project: Glassdoor Data Cleaning [Dataset]. https://www.kaggle.com/datasets/luisliraportfolio/excel-project-clean-dataset/suggestions?status=pending&yourSuggestions=true
    Available download formats: zip (12085049 bytes)
    Dataset updated
    Sep 26, 2023
    Authors
    Luis Lira
    Description

    Dataset

    This dataset was created by Luis Lira


  4. Project 2:Excel data cleaning & dashboard creation

    • kaggle.com
    Updated Jun 30, 2024
    Cite
    George M122 (2024). Project 2:Excel data cleaning & dashboard creation [Dataset]. https://www.kaggle.com/datasets/georgem122/project-2excel-data-cleaning-and-dashboard-creation/versions/1
    Dataset updated
    Jun 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    George M122
    Description

    Dataset

    This dataset was created by George M122


  5. Data from: Data cleaning and enrichment through data integration: networking...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1 more
    Updated Feb 26, 2025
    Cite
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar (2025). Data cleaning and enrichment through data integration: networking the Italian academia [Dataset]. https://search.dataone.org/view/sha256%3Ab583b4db2874926c7b8d8bad19da36c9a4021fea18d77573f228fad5e332f0ff
    Dataset updated
    Feb 26, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Irene Finocchi; Alessio Martino; Blerina Sinaimeri; Fariba Ranjbar
    Description

    We describe a bibliometric network characterizing co-authorship collaborations in the entire Italian academic community. The network, consisting of 38,220 nodes and 507,050 edges, is built upon two distinct data sources: faculty information provided by the Italian Ministry of University and Research and publications available in Semantic Scholar. Both nodes and edges are associated with a large variety of semantic data, including gender, bibliometric indexes, authors' and publications' research fields, and temporal information. While linking data between the two original sources posed many challenges, the network has been carefully validated to assess its reliability and to understand its graph-theoretic characteristics. By resembling several features of social networks, our dataset can be profitably leveraged in experimental studies in the wide social network analytics domain as well as in more specific bibliometric contexts.

    The proposed network is built starting from two distinct data sources:

    • the entire dataset dump from Semantic Scholar (with particular emphasis on the authors and papers datasets);
    • the entire list of Italian faculty members as maintained by Cineca (under appointment by the Italian Ministry of University and Research).

    By means of a custom name-identity recognition algorithm (details are available in the accompanying paper published in Scientific Data), the names of the authors in the Semantic Scholar dataset have been mapped against the names contained in the Cineca dataset and authors with no match (e.g., because of not being part of an Italian university) have been discarded. The remaining authors will compose the nodes of the network, which have been enriched with node-related (i.e., author-related) attributes. In order to build the network edges, we leveraged the papers dataset from Semantic Scholar: specifically, any two authors are said to be connected if there is at least one pap...

    Data cleaning and enrichment through data integration: networking the Italian academia

    https://doi.org/10.5061/dryad.wpzgmsbwj

    Description of the data and file structure

    This repository contains two main data files:

    • edge_data_AGG.csv, the full network in comma-separated edge list format (this file contains mainly temporal co-authorship information);
    • Coauthorship_Network_AGG.graphml, the full network in GraphML format.

    along with several supplementary data, listed below, useful only to build the network (i.e., for reproducibility only):

    • University-City-match.xlsx, an Excel file that maps the name of a university against the city where its respective headquarter is located;
    • Areas-SS-CINECA-match.xlsx, an Excel file that maps the research areas in Cineca against the research areas in Semantic Scholar.
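
The edge rule in the description above is cut off, but it appears to connect two authors when they share at least one paper. A minimal sketch of that idea follows, assuming a toy mapping from paper ids to author ids. The ids, the `papers` dictionary and the `coauthorship_edges` helper are hypothetical and are not taken from the dataset or its build scripts; counting shared papers as an edge weight is an added assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: paper id -> list of author ids (not real dataset content).
papers = {
    "p1": ["a1", "a2", "a3"],
    "p2": ["a2", "a3"],
    "p3": ["a4"],  # single-author paper, contributes no edge
}


def coauthorship_edges(papers):
    """Map each unordered author pair to the number of papers they share."""
    edges = Counter()
    for authors in papers.values():
        # sorted(set(...)) drops repeated names and gives each pair one canonical order
        for u, v in combinations(sorted(set(authors)), 2):
            edges[(u, v)] += 1
    return edges


edges = coauthorship_edges(papers)
for (u, v), shared in sorted(edges.items()):
    print(u, v, shared)
```

An edge exists as soon as the count is at least one, so dropping the weights recovers the plain "connected or not" relation.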

    Description of the main data files

    The Coauthorship_Network_AGG.graphml is intended to be the core file which c...

  6. Data from: Facebook Data for Sentiment Analysis

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    • +1 more
    binary format
    Updated Jul 16, 2013
    + more versions
    Cite
    (2013). Facebook Data for Sentiment Analysis [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1057
    Available download formats: binary format
    Dataset updated
    Jul 16, 2013
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Corpus consisting of 10,000 Facebook posts manually annotated on sentiment (2,587 positive, 5,174 neutral, 1,991 negative and 248 bipolar posts). The archive contains data and statistics in an Excel file (FBData.xlsx) and gold data in two text files with posts (gold-posts.txt) and labels (gols-labels.txt) on corresponding lines.
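
Because posts and labels sit on corresponding lines of two text files, they can be paired by position. The sketch below is one way to do that, assuming UTF-8 text and one post per line. The helper names are hypothetical, and the label encoding used in the files is not stated in the description, so the code treats labels as opaque strings.

```python
from collections import Counter


def read_lines(path, encoding="utf-8"):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


def load_labeled_posts(posts_path, labels_path):
    """Pair line i of the posts file with line i of the labels file."""
    posts = read_lines(posts_path)
    labels = read_lines(labels_path)
    if len(posts) != len(labels):
        # A length mismatch means the line-by-line alignment cannot be trusted.
        raise ValueError(f"{len(posts)} posts but {len(labels)} labels")
    return list(zip(posts, labels))


def label_counts(pairs):
    """Count how many posts carry each label."""
    return Counter(label for _, label in pairs)
```

The four class sizes quoted in the description (2,587 + 5,174 + 1,991 + 248) add up to 10,000, so `label_counts` on the full corpus offers a quick check that the files were read correctly.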

  7. General Household Survey, Panel 2023-2024 - Nigeria

    • microdata.worldbank.org
    • microdata.nigerianstat.gov.ng
    • +1 more
    Updated Nov 21, 2024
    + more versions
    Cite
    National Bureau of Statistics (NBS) (2024). General Household Survey, Panel 2023-2024 - Nigeria [Dataset]. https://microdata.worldbank.org/index.php/catalog/6410
    Dataset updated
    Nov 21, 2024
    Dataset provided by
    National Bureau of Statistics, Nigeria
    Authors
    National Bureau of Statistics (NBS)
    Time period covered
    2023 - 2024
    Area covered
    Nigeria
    Description

    Abstract

    The General Household Survey-Panel (GHS-Panel) is implemented in collaboration with the World Bank Living Standards Measurement Study (LSMS) team as part of the Integrated Surveys on Agriculture (ISA) program. The objectives of the GHS-Panel include the development of an innovative model for collecting agricultural data, interinstitutional collaboration, and comprehensive analysis of welfare indicators and socio-economic characteristics. The GHS-Panel is a nationally representative survey of approximately 5,000 households, which are also representative of the six geopolitical zones. The 2023/24 GHS-Panel is the fifth round of the survey with prior rounds conducted in 2010/11, 2012/13, 2015/16 and 2018/19. The GHS-Panel households were visited twice: during post-planting period (July - September 2023) and during post-harvest period (January - March 2024).

    Geographic coverage

    National

    Analysis unit

    • Households
    • Individuals
    • Agricultural plots
    • Communities

    Universe

    The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The original GHS‑Panel sample was fully integrated with the 2010 GHS sample. The GHS sample consisted of 60 Primary Sampling Units (PSUs) or Enumeration Areas (EAs), chosen from each of the 37 states in Nigeria. This resulted in a total of 2,220 EAs nationally. Each EA contributed 10 households to the GHS sample, resulting in a sample size of 22,200 households. Out of these 22,200 households, 5,000 households from 500 EAs were selected for the panel component, and 4,916 households completed their interviews in the first wave.
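
The sample sizes in this paragraph follow from simple multiplication. The short sketch below reproduces them using only the figures quoted in the text; the variable names are illustrative, and the completion rate is a derived quantity rather than one reported by the source.

```python
# Figures quoted in the sampling description.
EAS_PER_STATE = 60
STATES = 37
HOUSEHOLDS_PER_EA = 10
PANEL_EAS = 500
WAVE1_COMPLETED = 4916

total_eas = EAS_PER_STATE * STATES                 # EAs in the 2010 GHS sample
ghs_households = total_eas * HOUSEHOLDS_PER_EA     # households in the GHS sample
panel_households = PANEL_EAS * HOUSEHOLDS_PER_EA   # households selected for the panel
wave1_completion = WAVE1_COMPLETED / panel_households  # derived, not reported

print(total_eas, ghs_households, panel_households, f"{wave1_completion:.1%}")
```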

    After nearly a decade of visiting the same households, a partial refresh of the GHS‑Panel sample was implemented in Wave 4 and maintained for Wave 5. The refresh was conducted to maintain the integrity and representativeness of the sample. The refresh EAs were selected from the same sampling frame as the original GHS‑Panel sample in 2010. A listing of households was conducted in the 360 EAs, and 10 households were randomly selected in each EA, resulting in a total refresh sample of approximately 3,600 households.

    In addition to these 3,600 refresh households, a subsample of the original 5,000 GHS‑Panel households from 2010 were selected to be included in the new sample. This “long panel” sample of 1,590 households was designed to be nationally representative to enable continued longitudinal analysis for the sample going back to 2010. The long panel sample consisted of 159 EAs systematically selected across Nigeria’s six geopolitical zones.

    The combined sample of refresh and long panel EAs in Wave 5 that were eligible for inclusion consisted of 518 EAs based on the EAs selected in Wave 4. The combined sample generally maintains both the national and zonal representativeness of the original GHS‑Panel sample.

    Sampling deviation

    Although 518 EAs were identified for the post-planting visit, conflict events prevented interviewers from visiting eight EAs in the North West zone of the country. The EAs were located in the states of Zamfara, Katsina, Kebbi and Sokoto. Therefore, the final number of EAs visited both post-planting and post-harvest comprised 157 long panel EAs and 354 refresh EAs. The combined sample is also roughly equally distributed across the six geopolitical zones.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    The GHS-Panel Wave 5 consisted of three questionnaires for each of the two visits. The Household Questionnaire was administered to all households in the sample. The Agriculture Questionnaire was administered to all households engaged in agricultural activities such as crop farming, livestock rearing, and other agricultural and related activities. The Community Questionnaire was administered to the community to collect information on the socio-economic indicators of the enumeration areas where the sample households reside.

    GHS-Panel Household Questionnaire: The Household Questionnaire provided information on demographics; education; health; labour; childcare; early child development; food and non-food expenditure; household nonfarm enterprises; food security and shocks; safety nets; housing conditions; assets; information and communication technology; economic shocks; and other sources of household income. Household location was geo-referenced in order to be able to later link the GHS-Panel data to other available geographic data sets (forthcoming).

    GHS-Panel Agriculture Questionnaire: The Agriculture Questionnaire solicited information on land ownership and use; farm labour; inputs use; GPS land area measurement and coordinates of household plots; agricultural capital; irrigation; crop harvest and utilization; animal holdings and costs; household fishing activities; and digital farming information. Some information is collected at the crop level to allow for detailed analysis for individual crops.

    GHS-Panel Community Questionnaire: The Community Questionnaire solicited information on access to infrastructure and transportation; community organizations; resource management; changes in the community; key events; community needs, actions, and achievements; social norms; and local retail price information.

    The Household Questionnaire was slightly different for the two visits. Some information was collected only in the post-planting visit, some only in the post-harvest visit, and some in both visits.

    The Agriculture Questionnaire collected different information during each visit, but for the same plots and crops.

    The Community Questionnaire collected prices during both visits, and different community level information during the two visits.

    Cleaning operations

    CAPI: The Wave 5 exercise was conducted using Computer Assisted Personal Interview (CAPI) techniques. All the questionnaires (household, agriculture, and community questionnaires) were implemented in both the post-planting and post-harvest visits of Wave 5 using the CAPI software, Survey Solutions. The Survey Solutions software was developed and maintained by the Living Standards Measurement Unit within the Development Economics Data Group (DECDG) at the World Bank. Each enumerator was given a tablet which they used to conduct the interviews. Overall, implementation of the survey using Survey Solutions CAPI was highly successful, as it allowed for timely availability of the data from completed interviews.

    DATA COMMUNICATION SYSTEM: The data communication system used in Wave 5 was highly automated. Each field team was given a mobile modem which allowed for internet connectivity and daily synchronization of their tablets. This ensured that the head office in Abuja had access to the data in real time. Once the interview was completed and uploaded to the server, the data was first reviewed by the Data Editors. The data was also downloaded from the server, and a Stata do-file was run on the downloaded data to check for additional errors that were not captured by the Survey Solutions application. An Excel error file was generated each time the Stata do-file was run on the raw dataset. Information contained in the Excel error files was then communicated back to the respective field interviewers for their action. This monitoring activity was done on a daily basis throughout the duration of the survey, in both the post-planting and post-harvest visits.

    DATA CLEANING: The data cleaning process was done in three main stages. The first stage was to ensure proper quality control during the fieldwork. This was achieved in part by incorporating validation and consistency checks into the Survey Solutions application used for the data collection and designed to highlight many of the errors that occurred during the fieldwork.

    The second stage of cleaning involved the use of Data Editors and Data Assistants (Headquarters in Survey Solutions). As indicated above, once the interview is completed and uploaded to the server, the Data Editors review the completed interview for inconsistencies and extreme values. Depending on the outcome, they can either approve or reject the case. If rejected, the case goes back to the respective interviewer’s tablet upon synchronization. Special care was taken to see that the households included in the data matched the selected sample, and where there were differences, these were properly assessed and documented. The agriculture data were also checked to ensure that the plots identified in the main sections merged with the plot information identified in the other sections. Additional errors observed were compiled into error reports that were regularly sent to the teams. These errors were then corrected based on re-visits to the household on the instruction of the supervisor. The data that had gone through this first stage of cleaning was then approved by the Data Editor. After the Data Editor’s approval of the interview on the Survey Solutions server, Headquarters also reviews it and, depending on the outcome, can either reject or approve it.

    The third stage of cleaning involved a comprehensive review of the final raw data following the first and second stage cleaning. Every variable was examined individually for (1) consistency with other sections and variables, (2) out of range responses, and (3) outliers. However, special care was taken to avoid making strong assumptions when resolving potential errors. Some minor errors remain in the data where the diagnosis and/or solution were unclear to the data cleaning team.
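
The three checks named for the third stage (cross-variable consistency, out-of-range responses, outliers) can be illustrated with a small sketch. This is not the survey team's Stata code: the records, variable names, the 0 to 120 age range and the 1.5 × IQR fence are all assumptions chosen for illustration. In keeping with the caution described above, the sketch only flags records and changes nothing.

```python
from statistics import quantiles

# Hypothetical records; variable names and values are illustrative only.
records = [
    {"id": 1, "hh_size": 5, "n_children": 2, "head_age": 41, "food_exp": 120.0},
    {"id": 2, "hh_size": 3, "n_children": 4, "head_age": 36, "food_exp": 95.0},
    {"id": 3, "hh_size": 6, "n_children": 3, "head_age": 150, "food_exp": 110.0},
    {"id": 4, "hh_size": 4, "n_children": 1, "head_age": 29, "food_exp": 5000.0},
    {"id": 5, "hh_size": 2, "n_children": 0, "head_age": 63, "food_exp": 105.0},
]


def flag_records(records):
    """Return (id, reason) pairs for records that fail any of the three checks."""
    # Outlier fences from the interquartile range of food expenditure.
    q1, _, q3 = quantiles([r["food_exp"] for r in records], n=4, method="inclusive")
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

    flags = []
    for r in records:
        if r["n_children"] > r["hh_size"]:      # (1) consistency across variables
            flags.append((r["id"], "inconsistent"))
        if not 0 <= r["head_age"] <= 120:       # (2) out-of-range response
            flags.append((r["id"], "out_of_range"))
        if not low <= r["food_exp"] <= high:    # (3) outlier
            flags.append((r["id"], "outlier"))
    return flags


flags = flag_records(records)
print(flags)
```

Flagging without editing leaves the decision about each case to a reviewer, which matches the stated aim of avoiding strong assumptions when resolving potential errors.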

    Response

  8. Shanghai experiment of consequence conditions on effort - Dataset -...

    • portal.zero.govt.nz
    • catalogue.data.govt.nz
    Updated Feb 1, 2001
    + more versions
    Cite
    zero.govt.nz (2001). Shanghai experiment of consequence conditions on effort - Dataset - data.govt.nz - discover and use data [Dataset]. https://portal.zero.govt.nz/77d6ef04507c10508fcfc67a7c24be32/dataset/oai-figshare-com-article-10277999
    Dataset updated
    Feb 1, 2001
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Area covered
    Shanghai
    Description

    This data set supports the journal paper "Manipulating the consequences of tests: How Shanghai teens react to different consequences", published in Educational Research and Evaluation, v26 (n5-6), pp. 221-251. The data were obtained to test the impact of different levels of consequence for taking a test on student test-taking effort. The data are part of the PhD project of Anran Zhao, supervised by Brown & Meissel. The data set is in MS Excel format. Sheet 1 provides an anonymous wide-format data set post-cleaning and missing value analysis of the data. Sheet 2 provides a description of each variable.

  9. Cleaning Robot Market Size, Share, Growth Analysis, By Product(Vacuum...

    • skyquestt.com
    Updated Apr 17, 2024
    Cite
    SkyQuest Technology (2024). Cleaning Robot Market Size, Share, Growth Analysis, By Product(Vacuum Cleaning Robots, Floor Cleaning Robots, Window Cleaning Robots, Pool Cleaning Robots), By Application(Residential, Commercial, Industrial, and others), By Sales Channel(Online, Offline, and Others), By Region - Industry Forecast 2024-2031 [Dataset]. https://www.skyquestt.com/report/cleaning-robot-market
    Dataset updated
    Apr 17, 2024
    Dataset authored and provided by
    SkyQuest Technology
    License

    https://www.skyquestt.com/privacy/

    Time period covered
    2024 - 2031
    Area covered
    Global
    Description

    Global Cleaning Robot Market size was valued at USD 4.19 billion in 2022 and is poised to grow from USD 4.97 billion in 2023 to USD 12.81 billion by 2031, growing at a CAGR of 22.9% in the forecast period (2024-2031).

  10. Household Income and Expenditure 2010 - Tuvalu

    • catalog.ihsn.org
    • dev.ihsn.org
    Updated Mar 29, 2019
    Cite
    Central Statistics Division (2019). Household Income and Expenditure 2010 - Tuvalu [Dataset]. http://catalog.ihsn.org/catalog/3203
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    Central Statistics Division
    Time period covered
    2010
    Area covered
    Tuvalu
    Description

    Abstract

    The main objectives of the survey were:
    • To obtain weights for the revision of the Consumer Price Index (CPI) for Funafuti;
    • To provide information on the nature and distribution of household income, expenditure and food consumption patterns;
    • To provide data on the household sector's contribution to the National Accounts;
    • To provide information on economic activity of men and women to study gender issues;
    • To undertake some poverty analysis.

    Geographic coverage

    National, including Funafuti and Outer islands

    Analysis unit

    • Household
    • individual

    Universe

    All private households are included in the sampling frame. In each selected household, the current residents are surveyed, as are people who are usual residents but are currently away (for work, health, or holiday reasons, or as boarding students, for example). If the household had been residing in Tuvalu for less than one year:
    • but intends to reside for more than 12 months: the household is included;
    • and does not intend to reside for more than 12 months: the household is out of scope.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    It was decided that a 33% (one-third) sample was sufficient to achieve suitable levels of accuracy for key estimates in the survey. The sample selection was therefore spread proportionally across all the islands except Niulakita, which was considered too small. For selection purposes, each island was treated as a separate stratum and independent samples were selected from each. The strategy used was to list each dwelling on the island by its geographical position and run a systematic skip through the list to achieve the 33% sample. This approach ensured that the sample would be spread out across each island as much as possible and would thus be more representative.

    For details please refer to Table 1.1 of the Report.
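
The "systematic skip" described under the sampling procedure can be sketched as follows: within each island (stratum), dwellings are listed in geographic order and every third one is taken from a random starting point. The function name, the random start and the toy dwelling lists are assumptions for illustration; the report does not say how the starting point was chosen.

```python
import random


def systematic_sample(dwellings, step=3, seed=None):
    """Take every `step`-th dwelling from an ordered list, from a random start."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random start within the first skip interval
    return dwellings[start::step]


# Hypothetical strata: each island's dwellings, already ordered by geographic position.
strata = {
    "island_a": [f"A{i:02d}" for i in range(12)],
    "island_b": [f"B{i:02d}" for i in range(9)],
}

# Independent selection per stratum, as the description specifies.
sample = {island: systematic_sample(dw, step=3, seed=1) for island, dw in strata.items()}
for island, chosen in sample.items():
    print(island, chosen)
```

Because the list is ordered geographically, a fixed skip spreads the selected dwellings across the whole island, which is the property the description relies on for representativeness.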

    Sampling deviation

    Only the island of Niulakita was excluded from the sampling frame, as it was considered too small.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    There were three main survey forms used to collect data for the survey. Each question is written in English and translated into Tuvaluan on the same version of the questionnaire. The questionnaires were designed based on the 2004 survey questionnaire.

    HOUSEHOLD FORM
    • composition of the household and demographic profile of each member
    • dwelling information
    • dwelling expenditure
    • transport expenditure
    • education expenditure
    • health expenditure
    • land and property expenditure
    • household furnishing
    • home appliances
    • cultural and social payments
    • holidays/travel costs
    • loans and savings
    • clothing
    • other major expenditure items

    INDIVIDUAL FORM
    • health and education
    • labor force (individuals aged 15 and above)
    • employment activity and income (individuals aged 15 and above): wages and salaries, working own business, agriculture and livestock, fishing, income from handicraft, income from gambling, small-scale activities, jobs in the last 12 months, other income, children's income, tobacco and alcohol use, other activities, and seafarer

    DIARY (one diary per week over a 2-week period; 2 diaries per household were required)
    • all kinds of expenses
    • home production
    • food and drink (eaten by the household, given away, sold)
    • goods taken from own business (consumed, given away)
    • monetary gifts (given away, received, winnings from gambling)
    • non-monetary gifts (given away, received, winnings from gambling)

    Questionnaire Design Flaws

    Questionnaire design flaws address any problems with the way questions were worded which will result in an incorrect answer provided by the respondent. Despite every effort to minimize this problem during the design of the respective survey questionnaires and the diaries, problems were still identified during the analysis of the data. Some examples are provided below:

    Gifts, Remittances & Donations

    Collecting information on the receipt and provision of gifts, the receipt and provision of remittances, and the provision of donations to the church, other communities and family occasions is a very difficult task in a HIES. The extent of these activities in Tuvalu is very high, so every effort should be made to address these activities as best as possible. A key problem lies in identifying the best form (questionnaire or diary) for covering such activities. A general rule of thumb for a HIES is that if the activity occurs on a regular basis, and involves the exchange of small monetary amounts or in-kind gifts, the diary is more appropriate. On the other hand, if the activity is less frequent, and involves larger sums of money, the questionnaire with a recall approach is preferred. It is not always easy to distinguish between the two for the different activities, and as such, both the diary and questionnaire were used to collect this information. Unfortunately it probably wasn't made clear enough as to what types of transactions were being collected from the different sources, and as such some transactions might have been missed, and others counted twice. The effects of these problems are hopefully minimal overall.

    Defining Remittances

    Because people have different interpretations of what constitutes remittances, the questionnaire needs to be very clear as to how this concept is defined in the survey. Unfortunately this wasn't explained clearly enough, so it was difficult to distinguish between a remittance, which should be of a more regular nature, and a one-off monetary gift which was transferred between two households.

    Business Expenses Still Recorded

    The aim of the survey is to measure "household" expenditure, and as such, any expenditure made by a household for an item or service which was primarily used for a business activity should be excluded. It was not always clear in the questionnaire that this was the case, and as such some business expenses were included. Efforts were made during data cleaning to remove any such business expenses which would impact significantly on survey results.

    Purchased goods given away as a gift

    When a household makes a gift donation of an item it has purchased, this is recorded in section 5 of the diary. Unfortunately it was difficult to know how to treat these items, as it was not clear whether the item had already been recorded in section 1 of the diary, which covers purchases. The decision was made to exclude all information on gifts given which were considered to be purchases, as these items were assumed to have already been recorded in section 1. Ideally these items should be treated as a purchased gift given away, which in turn is not household consumption expenditure, but this was not possible.

    Some key items missed in the Questionnaire

    Although not a big issue, some key expenditure items were omitted from the questionnaire when it would have been best to collect them via this schedule. A key example is electric fans, which many households in Tuvalu own.

    Cleaning operations

    Consistency of the data:
    • each questionnaire was checked by the supervisor during and after the collection;
    • before data entry, all the questionnaires were coded;
    • the CSPro data entry system included inconsistency checks which allowed the NSO staff to identify some errors and correct them by imputation based on their own knowledge (there was no time for double entry; 4 data entry operators were used);
    • after data entry, outliers were identified in order to check their consistency.

    All data entry, including editing, edit checks and queries, was done using CSPro (Census Survey Processing System) with additional data editing and cleaning taking place in Excel.

    The staff from the CSD was responsible for undertaking the coding and data entry, with assistance from an additional four temporary staff to help produce results in a more timely manner.

    Although enumeration wasn't completed until mid-June, the coding and data entry commenced as soon as forms were available from Funafuti, which was towards the end of March. The coding and data entry was then completed around the middle of July.

    A visit from an SPC consultant then took place to undertake initial cleaning of the data, primarily addressing missing data items and missing schedules. Once the initial data cleaning was undertaken in CSPro, data was transferred to Excel where it was closely scrutinized to check that all responses were sensible. In the cases where unusual values were identified, original forms were consulted for these households and modifications made to the data if required.

    Despite the best efforts being made to clean the data file in preparation for the analysis, no doubt errors will still exist in the data, due to its size and complexity. Having said this, they are not expected to have significant impacts on the survey results.

    Under-Reporting and Incorrect Reporting as a result of Poor Field Work Procedures

    The most crucial stage of any survey activity, whether it be a population census or a survey such as a HIES, is the fieldwork. It is crucial for intense checking to take place in the field before survey forms are returned to the office for data processing. Unfortunately, it became evident during the cleaning of the data that fieldwork wasn't checked as thoroughly as required, and as such some unexpected values appeared in the questionnaires, as well as unusual results appearing in the diaries. Efforts were made to identify the main issues which would have the greatest impact on final results, and this information was modified using local knowledge, to a more reasonable answer, when required.

    Data Entry Errors Data entry errors are always expected, but can be kept to a minimum with

  11. Use of Open Government Data by Brazilian Public Institutions - Dataset

    • datarepositorium.uminho.pt
    tsv
    Updated Aug 23, 2024
    Cite
    Repositório de Dados da Universidade do Minho (2024). Use of Open Government Data by Brazilian Public Institutions - Dataset [Dataset]. http://doi.org/10.34622/datarepositorium/YSZBRR
    Available download formats: tsv(4169), tsv(47002), tsv(2825), tsv(72782), tsv(132726), tsv(3978), tsv(91790), tsv(50775)
    Dataset updated
    Aug 23, 2024
    Dataset provided by
    Repositório de Dados da Universidade do Minho
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Area covered
    Brazil
    Description

    This dataset contains the results of a survey about the use of open government data applied to public agents working in public institutions in Brazil. It has two sets, one with questionnaire responses and metadata and the second with a coding table with interview extracts:

    1) In the first dataset, each row holds a response to a questionnaire about the public agent's perceptions of the use and reuse of open government data in Brazilian public institutions. Columns store the questionnaire questions. Data were collected between 8 June and 13 July 2021, and this sample is composed of responses from 40 federal, state, and municipal public administrators. Thus, this dataset contains 40 rows and 158 columns. Data were collected on the LimeSurvey platform, where it was screened for missing values and incomplete responses. After cleaning, data were exported to Excel in tabular format. Questionnaire responses are provided in two files ResultsSurvey_OGDUseBRPubInstitutions_DataSet_PT and ResultsSurvey_OGDUseBRPubInstitutions_DataSet_EN. They contain the same information in Portuguese and English.

    2) The second dataset records the code table of the interviews about the benefits, barriers, enablers, and drivers of open government data (OGD) use in Brazilian public institutions. A questionnaire applied to public agents working in Brazilian public institutions was followed up by interviews to broaden an understanding of the use of OGD. Nine interviews were conducted between May 17-31, 2022. This dataset represents the perspective of these public agents. The dataset contains 97 lines and six columns. Each row of the dataset lists the factor code used in the questionnaire, the factor descriptions in Portuguese and English, the interviewee code, the transcription extract of an interviewee narration collected in Portuguese, and the English translation. After collection in Portuguese, interviews were automatically transcribed using the NVivo Transcription software. Then, they were anonymized, and a human reviewed the transcriptions. Interviews were coded using NVivo and used the questionnaire factors to guide coding. Coded extracts were translated to English using Google and Microsoft translators. Then, translated extracts were revised by a human and were used for reporting. The coding table was exported to Excel. Interviews extracts are provided in one file, InterviewsExtracts_OGDUseBR_PublicInstitutions_Dataset.

  12. Dataset for "Cognitive behavioural therapy self-help intervention...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 16, 2024
    Cite
    Chelsea Coumoundouros; Paul Farrand; Alexander Hamilton; Louise Von Essen; Robbert Sanderman; Joanne Woodford (2024). Dataset for "Cognitive behavioural therapy self-help intervention preferences among informal caregivers of adults with chronic kidney disease: an online cross-sectional survey" [Dataset]. http://doi.org/10.5281/zenodo.7104638
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Chelsea Coumoundouros; Paul Farrand; Alexander Hamilton; Louise Von Essen; Robbert Sanderman; Joanne Woodford
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data and R code used in the analysis for the publication: Coumoundouros et al., Cognitive behavioural therapy self-help intervention preferences among informal caregivers of adults with chronic kidney disease: an online cross-sectional survey. BMC Nephrology.

    Summary of study

    The study was an online cross-sectional survey of informal caregivers (e.g. family and friends) of people living with chronic kidney disease in the United Kingdom. It aimed to examine informal caregivers' cognitive behavioural therapy self-help intervention preferences, and to describe the caregiving situation (e.g. types of care activities) and informal caregivers' mental health (depression, anxiety and stress symptoms).

    Participants were eligible to participate if they were at least 18 years old, lived in the United Kingdom, and provided unpaid care to someone living with chronic kidney disease who was at least 18 years old.

    The online survey included questions regarding (1) informal caregivers' characteristics; (2) care recipients' characteristics; (3) intervention preferences (e.g. content, delivery format); and (4) informal caregivers' mental health. Informal caregivers' mental health was assessed using the 21-item Depression, Anxiety, and Stress Scale (DASS-21), which is composed of three subscales measuring depression, anxiety, and stress, respectively.

    Sixty-five individuals participated in the survey.

    See the published article for full study details.

    Description of uploaded files

    1. ENTWINE_ESR14_Kidney Carer Survey Data_FULL_2022-08-30: Excel file with the complete, raw survey data. Note: the first half of each participant's postal code was collected, but these data were removed from the uploaded dataset to ensure participant anonymity.

    2. ENTWINE_ESR14_Kidney Carer Survey Data_Clean DASS-21 Data_2022-08-30: Excel file with cleaned data for the DASS-21 scale. Data cleaning involved imputation of missing data if participants were missing data for one item within a subscale of the DASS-21. Missing values were imputed by finding the mean of all other items within the relevant subscale.

    3. ENTWINE_ESR14_Kidney Carer Survey_KEY_2022-08-30: Excel file with key linking item labels in uploaded datasets with the corresponding survey question.

    4. R Code for Kidney Carer Survey_2022-08-30: R file of R code used to analyse survey data.

    5. R code for Kidney Carer Survey_PDF_2022-08-30: PDF file of R code used to analyse survey data.
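
    The imputation rule described for file 2 can be sketched in a few lines. This is a minimal Python illustration, not the authors' R code: the function name impute_subscale, the example responses, and the choice to leave a subscale untouched when more than one item is missing are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Optional, Sequence


def impute_subscale(items: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Fill a single missing item with the mean of the other items in the subscale.

    Subscales with no missing item, or with more than one, are returned unchanged
    (how the latter case was handled is not stated in the description; assumption).
    """
    missing = [i for i, value in enumerate(items) if value is None]
    if len(missing) != 1:
        return list(items)
    observed = [value for value in items if value is not None]
    filled = list(items)
    filled[missing[0]] = mean(observed)
    return filled


# Hypothetical responses to the seven items of one DASS-21 subscale.
responses = [1, 2, None, 0, 1, 3, 2]
print(impute_subscale(responses))  # the third item becomes 1.5
```

    Working per subscale keeps the imputed value on the same construct (depression, anxiety, or stress) as the missing item.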

  13. KAP WASH 2019 in South Sudan's Ajuong Thok and Pamir Camps - South Sudan

    • datacatalog.ihsn.org
    • microdata.unhcr.org
    • +1 more
    Updated Oct 14, 2021
    + more versions
    Cite
    UNHCR (2021). KAP WASH 2019 in South Sudan's Ajuong Thok and Pamir Camps - South Sudan [Dataset]. https://datacatalog.ihsn.org/catalog/9787
    Explore at:
    Dataset updated
    Oct 14, 2021
    Dataset provided by
    United Nations High Commissioner for Refugees http://www.unhcr.org/
    Samaritan's Purse
    Time period covered
    2019
    Area covered
    South Sudan
    Description

    Abstract

    A Knowledge, Attitudes and Practices (KAP) survey was conducted in Ajuong Thok and Pamir Refugee Camps in October 2019 to determine the current Water, Sanitation and Hygiene (WASH) conditions as well as hygiene attitudes and practices within the households (HHs) surveyed. The assessment utilized a systematic random sampling method, and a total of 1,474 HHs (735 HHs in Ajuong Thok and 739 HHs in Pamir) were surveyed using mobile data collection (MDC) within a period of 21 days. Data was cleaned and analyzed in Excel. The summary of the results is presented in this report.

    The findings show that the overall average number of liters of water per person per day was 23.4 in both Ajuong Thok and Pamir Camps, slightly higher than the minimum standard recommended by the United Nations High Commissioner for Refugees (UNHCR) of at least 20 liters of water available per person per day. This is a slight improvement from the 21 liters reported the previous year. The average HH size was six people. Women comprised 83% of the surveyed respondents and men 17%. Almost all the respondents were refugees, constituting 99.5% (n=1,466). The refugees were aware of the key health and hygiene practices, possibly as a result of routine health and hygiene messages delivered to them by Samaritan's Purse (SP) and other health partners. Most refugees had knowledge about keeping water containers clean, washing hands at critical times, safe excreta disposal and disease prevention.

    Geographic coverage

    Ajuong Thok and Pamir Refugee Camps

    Analysis unit

    Households

    Universe

    All households in Ajuong Thok and Pamir Refugee Camps

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    Households were selected using systematic random sampling. Enumerators systematically walked through the camp block by block, row by row, in such a way as to pass each HH. Within blocks, enumerators started at one corner, then systematically used the sampling interval as they walked up and down each of the rows throughout the block, covering every block in Ajuong Thok and Pamir.

    In each location, the first HH sampled in a block was determined using an Excel tool customized by UNHCR, which generated a random start and a sampling interval.
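
    A random start and sampling interval of this kind can be sketched as follows. The logic of the UNHCR Excel tool is not described here, so this is only a generic illustration of systematic sampling; the function name systematic_sample and the block of 120 HHs with a target of 12 are hypothetical.

```python
import random
from typing import List, Optional


def systematic_sample(n_households: int, n_sample: int,
                      seed: Optional[int] = None) -> List[int]:
    """Return the 1-based positions of households picked by systematic sampling."""
    if not 0 < n_sample <= n_households:
        raise ValueError("n_sample must be between 1 and n_households")
    rng = random.Random(seed)
    interval = n_households // n_sample   # sampling interval
    start = rng.randint(1, interval)      # random start within the first interval
    return [start + i * interval for i in range(n_sample)]


# Hypothetical block of 120 households, of which 12 are to be visited.
print(systematic_sample(120, 12, seed=1))
```

    An enumerator walking the rows in a fixed order would then visit the households at the returned positions.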

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The survey questionnaire used to collect the data consists of the following sections:

    • Demographics
    • Water collection and storage
    • Drinking water hygiene
    • Hygiene
    • Sanitation
    • Messaging
    • Distribution (NFI)
    • Diarrhea prevalence, knowledge and health seeking behaviour
    • Menstrual hygiene

    Cleaning operations

    The data collected was uploaded to a server at the end of each day. IFormBuilder generated a Microsoft (MS) Excel spreadsheet dataset which was then cleaned and analyzed using MS Excel.

    Given that SP is currently implementing a WASH program in Ajuong Thok and Pamir, the assessment data collected in these camps will not only serve as the endline for UNHCR 2018 programming but also as the baseline for 2019 programming.

    Data was anonymized through decoding and local suppression.

  14. National Family Survey 2019-2021 - India

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1 more
    Updated May 12, 2022
    Cite
    National Family Survey 2019-2021 - India [Dataset]. https://microdata.worldbank.org/index.php/catalog/4482
    Explore at:
    Dataset updated
    May 12, 2022
    Dataset provided by
    Ministry of Health and Family Welfare (MoHFW)
    International Institute for Population Sciences (IIPS)
    Time period covered
    2019 - 2021
    Area covered
    India
    Description

    Abstract

    The National Family Health Survey 2019-21 (NFHS-5), the fifth in the NFHS series, provides information on population, health, and nutrition for India, each state/union territory (UT), and for 707 districts.

    The primary objective of the 2019-21 round of National Family Health Surveys is to provide essential data on health and family welfare, as well as data on emerging issues in these areas, such as levels of fertility, infant and child mortality, maternal and child health, and other health and family welfare indicators by background characteristics at the national and state levels. Similar to NFHS-4, NFHS-5 also provides information on several emerging issues including perinatal mortality, high-risk sexual behaviour, safe injections, tuberculosis, noncommunicable diseases, and the use of emergency contraception.

    The information collected through NFHS-5 is intended to assist policymakers and programme managers in setting benchmarks and examining progress over time in India’s health sector. Besides providing evidence on the effectiveness of ongoing programmes, NFHS-5 data will help to identify the need for new programmes in specific health areas.

    The clinical, anthropometric, and biochemical (CAB) component of NFHS-5 is designed to provide vital estimates of the prevalence of malnutrition, anaemia, hypertension, and high blood glucose levels, as well as of waist and hip circumference, Vitamin D3, HbA1c, and malaria parasites, through a series of biomarker tests and measurements.

    Geographic coverage

    National coverage

    Analysis unit

    • Household
    • Individual
    • Children age 0-5
    • Woman age 15-49
    • Man age 15-54

    Universe

    The survey covered all de jure household members (usual residents), all women aged 15-49, all men aged 15-54, and all children aged 0-5 resident in the household.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    A uniform sample design, which is representative at the national, state/union territory, and district levels, was adopted in each round of the survey. Each district is stratified into urban and rural areas. Each rural stratum is sub-stratified into smaller substrata, which are created considering the village population and the percentage of the population belonging to scheduled castes and scheduled tribes (SC/ST). Within each explicit rural sampling stratum, a sample of villages was selected as Primary Sampling Units (PSUs); before the PSU selection, PSUs were sorted according to the literacy rate of women age 6+ years. Within each urban sampling stratum, a sample of Census Enumeration Blocks (CEBs) was selected as PSUs. Before the PSU selection, PSUs were sorted according to the percentage of SC/ST population. In the second stage of selection, a fixed number of 22 households per cluster was selected by equal-probability systematic selection from a newly created list of households in the selected PSUs. The list of households was created as a result of the mapping and household listing operation conducted in each selected PSU before the household selection in the second stage. In all, 30,456 PSUs were selected across the country in NFHS-5, drawn from 707 districts as of 31 March 2017; fieldwork was completed in 30,198 of these PSUs.

    For further details on sample design, see Section 1.2 of the final report.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Four survey schedules/questionnaires: Household, Woman, Man, and Biomarker were canvassed in 18 local languages using Computer Assisted Personal Interviewing (CAPI).

    Cleaning operations

    Electronic data collected in the 2019-21 National Family Health Survey were received on a daily basis via the SyncCloud system at the International Institute for Population Sciences, where the data were stored on a password-protected computer. Secondary editing of the data, which required resolution of computer-identified inconsistencies and coding of open-ended questions, was conducted in the field by the Field Agencies and at the Field Agencies' central office, and IIPS checked the secondary edits before the dataset was finalized.

    Field-check tables were produced by IIPS and the Field Agencies on a regular basis to identify certain types of errors that might have occurred in eliciting information and recording question responses. Information from the field-check tables on the performance of each fieldwork team and individual investigator was promptly shared with the Field Agencies during the fieldwork so that the performance of the teams could be improved, if required.

    Response rate

    A total of 664,972 households were selected for the sample, of which 653,144 were occupied. Among the occupied households, 636,699 were successfully interviewed, for a response rate of 98 percent.

    In the interviewed households, 747,176 eligible women age 15-49 were identified for individual women’s interviews. Interviews were completed with 724,115 women, for a response rate of 97 percent. In all, there were 111,179 eligible men age 15-54 in households selected for the state module. Interviews were completed with 101,839 men, for a response rate of 92 percent.

  15. General Household Survey, Panel 2018-2019, Wave 4 - Nigeria

    • catalog.ihsn.org
    • microdata.nigerianstat.gov.ng
    • +2 more
    Updated Jan 16, 2021
    + more versions
    Cite
    National Bureau of Statistics (NBS) (2021). General Household Survey, Panel 2018-2019, Wave 4 - Nigeria [Dataset]. https://catalog.ihsn.org/catalog/8805
    Explore at:
    Dataset updated
    Jan 16, 2021
    Dataset authored and provided by
    National Bureau of Statistics (NBS)
    Time period covered
    2018 - 2019
    Area covered
    Nigeria
    Description

    Abstract

    The General Household Survey-Panel (GHS-Panel) is implemented in collaboration with the World Bank Living Standards Measurement Study (LSMS) team as part of the Integrated Surveys on Agriculture (ISA) program. The objectives of the GHS-Panel include the development of an innovative model for collecting agricultural data, interinstitutional collaboration, and comprehensive analysis of welfare indicators and socio-economic characteristics. The GHS-Panel is a nationally representative survey of approximately 5,000 households, which are also representative of the six geopolitical zones. The 2018/19 round is the fourth round of the survey, with prior rounds conducted in 2010/11, 2012/13, and 2015/16. GHS-Panel households were visited twice: first after the planting season (post-planting) between July and September 2018 and second after the harvest season (post-harvest) between January and February 2019.

    Geographic coverage

    National

    Analysis unit

    • Households
    • Individuals
    • Agricultural plots
    • Communities

    Universe

    The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The original GHS-Panel sample of 5,000 households across 500 enumeration areas (EAs) was designed to be representative at the national level as well as at the zonal level. The complete sampling information for the GHS-Panel is described in the Basic Information Document for GHS-Panel 2010/2011. However, after nearly a decade of visiting the same households, a partial refresh of the GHS-Panel sample was implemented in Wave 4.

    For the partial refresh of the sample, a new set of 360 EAs was randomly selected, consisting of 60 EAs per zone. The refresh EAs were selected from the same sampling frame as the original GHS-Panel sample in 2010 (the “master frame”). A listing of all households was conducted in the 360 EAs and 10 households were randomly selected in each EA, resulting in a total refresh sample of approximately 3,600 households.

    In addition to these 3,600 refresh households, a subsample of the original 5,000 GHS-Panel households from 2010 was selected to be included in the new sample. This “long panel” sample was designed to be nationally representative to enable continued longitudinal analysis for the sample going back to 2010. The long panel sample consisted of 159 EAs systematically selected across the 6 geopolitical Zones. The systematic selection ensured that the distribution of EAs across the 6 Zones (and urban and rural areas within) is proportional to the original GHS-Panel sample. Interviewers attempted to interview all households that originally resided in the 159 EAs and were successfully interviewed in the previous visit in 2016. This includes households that had moved away from their original location in 2010. In all, interviewers attempted to interview 1,507 households from the original panel sample.

    The combined sample of refresh and long panel EAs consisted of 519 EAs. The total number of households that were successfully interviewed in both visits was 4,976.
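
    The sample sizes quoted above can be checked with simple arithmetic. The short Python sketch below only recombines the figures given in the text; the constant names are made up for illustration, the refresh household count is approximate (as the text notes), and the final share of targeted households interviewed in both visits is derived here and is not a figure reported in the source.

```python
ZONES = 6
REFRESH_EAS_PER_ZONE = 60
HOUSEHOLDS_PER_REFRESH_EA = 10
LONG_PANEL_EAS = 159
LONG_PANEL_HOUSEHOLDS = 1_507
INTERVIEWED_BOTH_VISITS = 4_976

refresh_eas = ZONES * REFRESH_EAS_PER_ZONE                          # 360 refresh EAs
refresh_households = refresh_eas * HOUSEHOLDS_PER_REFRESH_EA        # about 3,600 households
combined_eas = refresh_eas + LONG_PANEL_EAS                         # 519 EAs in total
targeted_households = refresh_households + LONG_PANEL_HOUSEHOLDS    # about 5,107 households

# Derived for illustration only; not reported in the source.
share_interviewed = INTERVIEWED_BOTH_VISITS / targeted_households

print(refresh_eas, refresh_households, combined_eas, targeted_households)
print(f"{share_interviewed:.1%}")
```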

    Sampling deviation

    While the combined sample generally maintains both national and Zonal representativeness of the original GHS-Panel sample, the security situation in the North East of Nigeria prevented full coverage of the Zone. Due to security concerns, rural areas of Borno state were fully excluded from the refresh sample and some inaccessible urban areas were also excluded. Security concerns also prevented interviewers from visiting some communities in other parts of the country where conflict events were occurring. Refresh EAs that could not be accessed were replaced with another randomly selected EA in the Zone so as not to compromise the sample size. As a result, the combined sample is representative of areas of Nigeria that were accessible during 2018/19. The sample will not reflect conditions in areas that were undergoing conflict during that period. This compromise was necessary to ensure the safety of interviewers.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    The GHS-Panel Wave 4 consists of three questionnaires for each of the two visits. The Household Questionnaire was administered to all households in the sample. The Agriculture Questionnaire was administered to all households engaged in agricultural activities such as crop farming, livestock rearing and other agricultural and related activities. The Community Questionnaire was administered to the community to collect information on the socio-economic indicators of the enumeration areas where the sample households reside.

    GHS-Panel Household Questionnaire: The Household Questionnaire provides information on demographics; education; health (including anthropometric measurement for children); labor; food and non-food expenditure; household nonfarm income-generating activities; food security and shocks; safety nets; housing conditions; assets; information and communication technology; and other sources of household income. Household location is geo-referenced in order to be able to later link the GHS-Panel data to other available geographic data sets.

    GHS-Panel Agriculture Questionnaire: The Agriculture Questionnaire solicits information on land ownership and use; farm labor; inputs use; GPS land area measurement and coordinates of household plots; agricultural capital; irrigation; crop harvest and utilization; animal holdings and costs; and household fishing activities. Some information is collected at the crop level to allow for detailed analysis for individual crops.

    GHS-Panel Community Questionnaire: The Community Questionnaire solicits information on access to infrastructure; community organizations; resource management; changes in the community; key events; community needs, actions and achievements; and local retail price information.

    The Household Questionnaire is slightly different for the two visits. Some information was collected only in the post-planting visit, some only in the post-harvest visit, and some in both visits.

    The Agriculture Questionnaire collects different information during each visit, but for the same plots and crops.

    Cleaning operations

    CAPI: For the first time in the GHS-Panel, the Wave 4 exercise was conducted using Computer Assisted Personal Interview (CAPI) techniques. All the questionnaires (household, agriculture and community) were implemented in both the post-planting and post-harvest visits of Wave 4 using the CAPI software Survey Solutions. The Survey Solutions software was developed and maintained by the Survey Unit within the Development Economics Data Group (DECDG) at the World Bank. Each enumerator was given a tablet, which they used to conduct the interviews. Overall, implementation of the survey using Survey Solutions CAPI was highly successful, as it allowed for timely availability of the data from completed interviews.

    DATA COMMUNICATION SYSTEM: The data communication system used in Wave 4 was highly automated. Each field team was given a mobile modem to allow for internet connectivity and daily synchronization of their tablets. This ensured that the head office in Abuja had access to the data in real time. Once an interview was completed and uploaded to the server, the data were first reviewed by the Data Editors. The data were also downloaded from the server, and a Stata do-file was run on the downloaded data to check for additional errors that were not captured by the Survey Solutions application. Running the Stata do-file on the raw dataset generated an Excel error file. Information contained in the Excel error files was communicated back to the respective field interviewers for action. This was done on a daily basis throughout the duration of the survey, in both the post-planting and post-harvest visits.

    DATA CLEANING: The data cleaning process was done in three main stages. The first stage was to ensure proper quality control during the fieldwork. This was achieved in part by incorporating validation and consistency checks into the Survey Solutions application used for the data collection and designed to highlight many of the errors that occurred during the fieldwork.

    The second stage of cleaning involved the use of Data Editors and Data Assistants (Headquarters in Survey Solutions). As indicated above, once an interview is completed and uploaded to the server, the Data Editors review the completed interview for inconsistencies and extreme values. Depending on the outcome, they can either approve or reject the case. If rejected, the case goes back to the respective interviewer’s tablet upon synchronization. Special care was taken to see that the households included in the data matched the selected sample and, where there were differences, that these were properly assessed and documented. The agriculture data were also checked to ensure that the plots identified in the main sections merged with the plot information identified in the other sections. Additional errors observed were compiled into error reports that were regularly sent to the teams. These errors were then corrected based on re-visits to the household on the instruction of the supervisor. The data that had gone through this first stage of cleaning were then approved by the Data Editor. After the Data Editor’s approval of the interview on the Survey Solutions server, Headquarters also reviews it and, depending on the outcome, can either reject or approve it.
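
    The automated checks and the approve-or-reject review described above can be pictured with a small sketch. It is written in Python rather than Stata, and everything in it is hypothetical: the field names, the plausibility ranges, and the functions check_record and review are illustrations, not the survey's actual validation rules or do-file.

```python
# Hypothetical plausibility ranges; the survey's actual rules are not given in the text.
RANGES = {"household_size": (1, 40), "age_of_head": (12, 110)}


def check_record(record: dict) -> list:
    """Return error messages for one interview record (empty list if none)."""
    errors = []
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} outside {low}-{high}")
    return errors


def review(records: list) -> dict:
    """Map interview id to its errors; cases absent from the result could be approved."""
    report = {}
    for record in records:
        errors = check_record(record)
        if errors:
            report[record["interview_id"]] = errors
    return report


# Made-up records: the first passes, the second would go back to the interviewer.
records = [
    {"interview_id": "A1", "household_size": 6, "age_of_head": 45},
    {"interview_id": "A2", "household_size": 0, "age_of_head": None},
]
for interview_id, errors in review(records).items():
    print(interview_id, "REJECT:", "; ".join(errors))
```

    Keying the error report by interview makes it straightforward to route each rejected case back to the interviewer who collected it.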

    The third stage of cleaning involved a comprehensive review of the final raw data following

  16. NHANES 1988-2018

    • figshare.com
    application/gzip
    Updated Feb 18, 2025
    Cite
    Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposure of the non-institutionalized US population. Such data have considerable potential to help us understand how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as the prevalence of disease. However, these data need to be processed first before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies takes systematic cross-examination and considerable effort but is required for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets and accompanying code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 134,310 participants and 4,740 variables. The variables convey 1) demographic information, 2) dietary consumption, 3) physical examination results, 4) occupation, 5) questionnaire items (e.g., physical activity, general health status, medical conditions), 6) medications, 7) mortality status linked from the National Death Index, 8) survey weights, 9) environmental exposure biomarker measurements, and 10) chemical comments that indicate which measurements are below or above the lower limit of detection. We also provide a data dictionary listing the variables and their descriptions to help researchers browse the data, and R markdown files with example code for calculating summary statistics and running regression models to help accelerate high-throughput analysis of the exposome and secular trends in cancer mortality.

    CSV Data Record: The curated NHANES datasets and the data dictionaries include 13 .csv files and 1 Excel file. The curated NHANES datasets comprise 10 .csv files, one for each module, labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. The 11th file is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 4,740 variables in NHANES ("dictionary_nhanes.csv"). The 12th file contains the harmonized categories for the categorical variables ("dictionary_harmonized_categories.csv"). The 13th file contains the dictionary of descriptors for the drug codes (“dictionary_drug_codes.csv”). The 14th file is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES datasets (“nhanes_inconsistencies_documentation.xlsx”).

    R Data Record: For researchers who want to conduct their analysis in the R programming language, the curated NHANES datasets and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. The .RData file contains all the aforementioned datasets as R data objects (“w - nhanes_1988_2018.RData”). This .RData file also makes available all the R scripts for the customized functions that were written to curate the data. The .R file shows how we used the customized functions (i.e. our pipeline) to curate the data (“m - nhanes_1988_2018.R”).

  17. Data from: Functional morphology and efficiency of the antenna cleaner in...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Jun 26, 2015
    Cite
    Alexander Hackmann; Henry Delacave; Adam Robinson; David Labonte; Walter Federle (2015). Functional morphology and efficiency of the antenna cleaner in Camponotus rufifemur ants [Dataset]. http://doi.org/10.5061/dryad.88q18
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 26, 2015
    Dataset provided by
    Dryad
    Authors
    Alexander Hackmann; Henry Delacave; Adam Robinson; David Labonte; Walter Federle
    Time period covered
    2015
    Area covered
    UK, Cambridge
    Description

    Data for the manuscript "Functional morphology and efficiency of the antenna cleaner in Camponotus rufifemur ants". The Excel file (Manuscript data.xlsx) includes 3 data sheets, one for each experiment. The corresponding figures from the manuscript are mentioned above the actual data.

  18. Early onset preeclampsia and eclampsia in low-resource settings

    • data.mendeley.com
    Updated Oct 19, 2019
    + more versions
    Cite
    Solwayo Ngwenya (2019). Early onset preeclampsia and eclampsia in low-resource settings [Dataset]. http://doi.org/10.17632/wrkvzf567k.2
    Explore at:
    Dataset updated
    Oct 19, 2019
    Authors
    Solwayo Ngwenya
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This was a retrospective cross-sectional study carried out at Mpilo Central Hospital, a government teaching and tertiary referral centre. It covered the period from February 1, 2016 to July 30, 2018. The aim of the study was to assess the incidence of early-onset severe preeclampsia and eclampsia in a low-resource setting, and the associated factors. Early-onset severe preeclampsia was diagnosed in patients with high blood pressure (SBP ≥160, DBP ≥110 mmHg) and either severe headaches, epigastric pain or deranged biochemical/haematological blood indices. Eclampsia was diagnosed in women who had a grand mal seizure with features of preeclampsia and no previous history of a seizure disorder such as epilepsy. Women with such a history were excluded from the study. All women who were between 20 and 33+6 weeks of gestation and met the above criteria were included in the study. Early neonatal death was recorded within 7 days of birth.

    A paper data collection tool was used to collect information from the labour ward delivery registers, perinatal registers and mortality registers. Data were also collected from the neonatal intensive care unit and the special care baby unit. Hospital case notes were retrieved and data collected from there as well. The data tool collected maternal, fetal and neonatal demographic, clinical and outcome information.

    Data were entered into Microsoft Excel, then exported to SPSS 20 for analysis. Data cleaning and coding were done in SPSS Version 20 before final analysis. Simple descriptive statistics were performed and presented as frequencies and percentages for categorical variables. Continuous variables were checked for normality using the Shapiro-Wilk test. Mean and standard deviation (SD) were reported for normally distributed data. Tests of association between variables were performed using Pearson chi-square and Fisher's exact tests. A p value of <0.05 was considered statistically significant. The incidence of early-onset severe preeclampsia and eclampsia at the unit was 1.0%. There was a statistically significant association between place of dwelling and maternal complications, with urban dwellers suffering more complications. Tests of association between various variables and fetal survival to discharge home showed associations with the following: gestational age, mother's booking status, mother's systolic and diastolic blood pressure, receipt of corticosteroid therapy and fetal birth weight.
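
    The tests of association named above can be illustrated with a minimal sketch in standard-library Python. This is not the authors' SPSS procedure, and the dataset's actual counts are not given in this description: the 2x2 table below is hypothetical, as are the function names. The sketch computes the Pearson chi-square statistic (without continuity correction) and a two-sided Fisher's exact p value for a 2x2 table.

```python
from math import comb

def pearson_chi_square(table):
    """Pearson chi-square statistic for a 2x2 table of counts (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact p value: sum the probabilities of all tables
    with the same margins that are no more likely than the observed one."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_observed = prob(a)
    lowest, highest = max(0, c1 - r2), min(r1, c1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )

# Hypothetical counts, for illustration only (not taken from the dataset):
# rows = urban / rural dwelling, columns = complication / no complication.
example_table = [[3, 1], [1, 3]]
chi_square = pearson_chi_square(example_table)
p_value = fisher_exact_two_sided(example_table)
```

    With counts this small, Fisher's exact test is usually preferred over the chi-square approximation, which is one common reason for reporting both.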

  19. u

    University of Cape Town Student Admissions Data 2006-2014 - South Africa

    • datafirst.uct.ac.za
    Updated Jul 28, 2020
    + more versions
    Cite
    UCT Student Administration (2020). University of Cape Town Student Admissions Data 2006-2014 - South Africa [Dataset]. https://www.datafirst.uct.ac.za/dataportal/index.php/catalog/556
    Dataset updated
    Jul 28, 2020
    Dataset authored and provided by
    UCT Student Administration
    Time period covered
    2006 - 2014
    Area covered
    South Africa
    Description

    Abstract

    This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst, the data were ill-suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data, and the data are only made available to accredited researchers through DataFirst's Secure Data Service.

    The dataset was separated into the following data files:

    1. Application level information: the "finest" unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,714,669 applications on record.
    2. Individual level information: individuals may have multiple applications. Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. There are a total of 285,005 individuals on record.
    3. Secondary Education Information: individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific and the data allow the user to distinguish between, for example, higher grade accounting and standard grade accounting. The data also allow the user to identify the educational authority issuing the qualification, e.g. Cambridge International Examinations (CIE) versus National Senior Certificate (NSC).
    4. Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution and transcript information and can be associated with individuals.
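
    The relationship between these files can be sketched as follows. This is only an illustration: the field names (`application_id`, `individual_id`, `subject`, `grade_level`) and all row values are hypothetical, since the description above does not give the actual variable names. The sketch shows one individual ID linking several application rows and several subject rows, with the subject rows carrying no unique identifier of their own.

```python
from collections import defaultdict

# Hypothetical application-level rows: one row per application ID.
applications = [
    {"application_id": "A1", "individual_id": "P1"},
    {"application_id": "A2", "individual_id": "P1"},
    {"application_id": "A3", "individual_id": "P2"},
]

# Hypothetical secondary-education rows: one row per individual and subject,
# with no unique row identifier.
subjects = [
    {"individual_id": "P1", "subject": "Accounting", "grade_level": "Higher"},
    {"individual_id": "P1", "subject": "Mathematics", "grade_level": "Standard"},
    {"individual_id": "P2", "subject": "Accounting", "grade_level": "Standard"},
]

def group_by_individual(rows):
    """Collect rows under the individual ID they belong to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["individual_id"]].append(row)
    return dict(grouped)

apps_by_person = group_by_individual(applications)
subjects_by_person = group_by_individual(subjects)

# Number of applications on record for each individual.
applications_per_person = {
    person: len(rows) for person, rows in apps_by_person.items()
}
```

    Grouping on the individual ID is what allows application-level counts and subject-level rows to be brought together for the same person.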

    Analysis unit

    Applications, individuals

    Kind of data

    Administrative records [adm]

    Mode of data collection

    Other [oth]

    Cleaning operations

    The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services. The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.

  20. g

    Location of “video protection” cameras (BO city of Paris, 01/02/2019)

    • gimi9.com
    • data.europa.eu
    Cite
    Location of “video protection” cameras (BO city of Paris, 01/02/2019) [Dataset]. https://gimi9.com/dataset/eu_5cdede708b4c4123aa5376f2
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Paris
    Description

    The aims are to extract the list of locations of the 1,424 cameras described in the Official Bulletin of the City of Paris of 1 February 2019; to make this list available according to the principles of Open Data (open license, standard data format); and to geocode these locations so they can be viewed on a map (in progress, 707 of 1,424, see map).

    There are actually two lists of locations:

    Annex 1 (1,391 locations): the video protection plan for the Police Prefecture
    Annex 2 (890 locations): the video protection plan for Paris, listing the cameras visible to authorised agents of the City of Paris

    These two lists have many locations in common. NextInpact summarised the situation in this article. The source code is available here: https://github.com/ColinMaudry/geo-videoprotection-paris

    Modus operandi: convert the PDF to an MS Excel file using the online tool PDFtables.com, then clean non-data text in LibreOffice.

Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177

Data Cleaning Sample

Explore at:
141 scholarly articles cite this dataset (View in Google Scholar)
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 13, 2023
Dataset provided by
Borealis
Authors
Rong Luo
License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

Sample data for exercises in Further Adventures in Data Cleaning.
