94 datasets found

Project for Statistics on Living Standards and Development 1993 - South...
catalog.ihsn.org
microdata.fao.org
Updated Mar 29, 2019
Cite
Southern Africa Labour and Development Research Unit (2019). Project for Statistics on Living Standards and Development 1993 - South Africa [Dataset]. https://catalog.ihsn.org/catalog/4628
Dataset updated
Mar 29, 2019
Dataset authored and provided by
Southern Africa Labour and Development Research Unit
Time period covered
1993
Area covered
South Africa
Description
Abstract

The Project for Statistics on Living Standards and Development was a countrywide World Bank Living Standards Measurement Survey. It covered approximately 9000 households, drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect statistical information about the conditions under which South Africans live in order to provide policymakers with the data necessary for planning strategies. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.

Geographic coverage

National coverage

Analysis unit

Households

Individuals

Community

Universe

All Household members.

Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council and within each of these hostels a representative sample was drawn on a similar basis as described above for the households in ESDs.

Kind of data

Sample survey data [ssd]

Sampling procedure

Sample size is 9,000 households

The sample design adopted for the study was a two-stage self-weighting design in which the first stage units were Census Enumerator Subdistricts (ESDs, or their equivalent) and the second stage units were households.

The advantage of using such a design is that it provides a representative sample that need not be based on an accurate census population distribution. In the case of South Africa, the sample will automatically include many poor people, without the need to go beyond this and oversample the poor. Proportionate sampling, as in such a self-weighting sample design, offers the simplest possible data files for further analysis, as weights do not have to be added. However, in the end this advantage could not be retained and weights had to be added.

The sampling frame was drawn up on the basis of small, clearly demarcated area units, each with a population estimate. The nature of the self-weighting procedure adopted ensured that this population estimate was not important for determining the final sample, however. For most of the country, census ESDs were used. Where some ESDs comprised relatively large populations as for instance in some black townships such as Soweto, aerial photographs were used to divide the areas into blocks of approximately equal population size. In other instances, particularly in some of the former homelands, the area units were not ESDs but villages or village groups.

In the sample design chosen, the area stage units (generally ESDs) were selected with probability proportional to size, based on the census population. Systematic sampling was used throughout, that is, sampling at a fixed interval in a list of ESDs, starting at a randomly selected starting point. Given that sampling was self-weighting, the impact of stratification was expected to be modest. The main objective was to ensure that the racial and geographic breakdown approximated the national population distribution. This was done by listing the area stage units (ESDs) by statistical region and then within the statistical region by urban or rural. Within these sub-statistical regions, the ESDs were then listed in order of percentage African. The sampling interval for the selection of the ESDs was obtained by dividing the 1991 census population of 38,120,853 by the 300 clusters to be selected. This yielded 105,800. Starting at a randomly selected point, every 105,800th person down the cluster list was selected. This ensured both geographic and racial diversity (ESDs were ordered by statistical sub-region and proportion of the population African). In three or four instances, the ESD chosen was judged inaccessible and replaced with a similar one.
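
The first-stage selection described above (probability proportional to size, with a fixed interval and a random start down an ordered list) can be sketched roughly as follows. This is a minimal illustration, not the survey team's actual program: the function name, the ESD populations, and the cluster count are made up for the example.

```python
import random
from itertools import accumulate


def pps_systematic_sample(sizes, n_clusters, rng):
    """Pick area units with probability proportional to size.

    sizes: population estimate of each area unit, in list order.
    Returns (interval, indices of selected units). A unit larger than
    the interval can be selected more than once.
    """
    total = sum(sizes)
    interval = total / n_clusters            # people per selected cluster
    start = rng.random() * interval          # random starting point
    cumulative = list(accumulate(sizes))     # running population total
    selected = []
    unit = 0
    for k in range(n_clusters):
        target = start + k * interval        # the k-th "interval-th person"
        # Walk down the ordered list until the unit containing that person.
        while unit < len(sizes) - 1 and cumulative[unit] <= target:
            unit += 1
        selected.append(unit)
    return interval, selected


# Hypothetical ESD populations, assumed to be already ordered by region,
# urban/rural, and percentage African as the text describes.
esd_populations = [1200, 800, 1500, 400, 2100, 900, 1300, 1800]
interval, chosen = pps_systematic_sample(esd_populations, 4, random.Random(0))
```

Because the list is ordered before the walk, consecutive selections fall in different parts of the ordering, which is how the design spreads the sample across regions without explicit strata.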

In the second sampling stage the unit of analysis was the household. In each selected ESD a listing or enumeration of households was carried out by means of a field operation. From the households listed in an ESD a sample of households was selected by systematic sampling. Even though the ultimate enumeration unit was the household, in most cases "stands" were used as enumeration units. However, when a stand was chosen as the enumeration unit all households on that stand had to be interviewed.

Census population data, however, was available only for 1991. An assumption on population growth was thus made to obtain an approximation of the population size for 1993, the year of the survey. The sampling interval at the level of the household was determined in the following way: Based on the decision to have a take of 125 individuals on average per cluster (i.e. assuming 5 members per household to give an average cluster size of 25 households), the interval of households to be selected was determined as the census population divided by 118.1, i.e. allowing for population growth since the census. It was subsequently discovered that population growth was slightly over-estimated but this had little effect on the findings of the survey.
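
One reading of the household interval rule above, as a small worked sketch. The divisor 118.1 and the target of 125 individuals per cluster come from the text; the function name and the example ESD population are hypothetical, and the "implied growth" figure is simply the ratio of the two numbers rather than a value the source states.

```python
def household_interval(census_population_1991, divisor=118.1):
    """Interval between selected households in one cluster.

    Divides the 1991 census population of the area unit by the divisor
    quoted in the text.
    """
    return census_population_1991 / divisor


# With no growth allowance, the divisor would be the target take of
# 125 individuals (25 households of 5 members), so the ratio below is
# the growth factor implied by using 118.1 instead.
implied_growth = 125 / 118.1

# Hypothetical ESD with a 1991 census population of 2,362 people.
example_interval = household_interval(2362)
```

Under this reading, an ESD of 2,362 people would have roughly every 20th listed household selected.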

Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council and within each of these hostels a representative sample was drawn on a similar basis as described above for the households in ESDs.

Mode of data collection

Face-to-face [f2f]

Research instrument

The main instrument used in the survey was a comprehensive household questionnaire. This questionnaire covered a wide range of topics but was not intended to provide exhaustive coverage of any single subject. In other words, it was an integrated questionnaire aimed at capturing different aspects of living standards. The topics covered included demography, household services, household expenditure, educational status and expenditure, remittances and marital maintenance, land access and use, employment and income, health status and expenditure and anthropometry (children under the age of six were weighed and their heights measured). This questionnaire was available to households in two languages, namely English and Afrikaans. In addition, interviewers had in their possession a translation in the dominant African language/s of the region.

In addition to the detailed household questionnaire referred to above, a community questionnaire was administered in each cluster of the sample. The purpose of this questionnaire was to elicit information on the facilities available to the community in each cluster. Questions related primarily to the provision of education, health and recreational facilities. Furthermore there was a detailed section for the prices of a range of commodities from two retail sources in or near the cluster: a formal source such as a supermarket and a less formal one such as the "corner cafe" or a "spaza". The purpose of this latter section was to obtain a measure of regional price variation both by region and by retail source. These prices were obtained by the interviewer. For the questions relating to the provision of facilities, respondents were "prominent" members of the community such as school principals, priests and chiefs.

Cleaning operations

All the questionnaires were checked when received. Where information was incomplete or appeared contradictory, the questionnaire was sent back to the relevant survey organization. As soon as the data was available, it was captured using the local development platform ADE. This was completed in February 1994. Following this, a series of exploratory programs was written to highlight inconsistencies and outliers. For example, all person level files were linked together to ensure that the same person code reported in different sections of the questionnaire corresponded to the same person. The error reports from these programs were compared to the questionnaires and the necessary alterations made. This was a lengthy process, as several files were checked more than once, and it was completed at the beginning of August 1994. In some cases questionnaires would contain missing values, or comments that the respondent did not know, or refused to answer a question.

These responses are coded in the data files with the following values:

VALUE   MEANING
-1      The data was not available on the questionnaire or form
-2      The field is not applicable
-3      Respondent refused to answer
-4      Respondent did not know answer to question
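
When analysing the files, the missing-value codes listed above need to be excluded before computing statistics, otherwise they are treated as real negative numbers. A minimal sketch, assuming the variable being summarised has no legitimate values between -4 and -1 (the function names are illustrative, not part of the data release):

```python
# Codes and meanings as given in the survey documentation.
MISSING_CODES = {
    -1: "not available on the questionnaire or form",
    -2: "not applicable",
    -3: "respondent refused to answer",
    -4: "respondent did not know",
}


def split_missing(value):
    """Return (clean_value, reason); clean_value is None for a missing code."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


def mean_ignoring_missing(values):
    """Average only the values that are not missing-value codes."""
    valid = [v for v in values if v not in MISSING_CODES]
    return sum(valid) / len(valid) if valid else None
```

For example, `mean_ignoring_missing([10, -1, 20, -4])` averages only 10 and 20.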

Data appraisal

The data collected in clusters 217 and 218 should be viewed as highly unreliable and have therefore been removed from the data set. The data currently available on the web site has been revised to remove the data from these clusters. Researchers who have downloaded the data in the past should revise their data sets. For information on the data in those clusters, contact SALDRU (http://www.saldru.uct.ac.za/).
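
For researchers holding an older download, the revision amounts to dropping every record from the two clusters. A minimal sketch, assuming each record is a dictionary with a cluster identifier under a key named `cluster` (the key name is hypothetical and should be matched to the actual variable in the files):

```python
# Cluster numbers flagged as unreliable in the data appraisal note.
UNRELIABLE_CLUSTERS = {217, 218}


def drop_unreliable(records, cluster_key="cluster"):
    """Return only the records that are not in the flagged clusters."""
    return [r for r in records if r[cluster_key] not in UNRELIABLE_CLUSTERS]
```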
Data from: Expensive but Worth It: Live Projects in Statistics, Data...
tandf.figshare.com
pdf
Updated Apr 1, 2025
Cite
Christian Ritter; L. Allison Jones-Farmer; Frederick W. Faltin (2025). Expensive but Worth It: Live Projects in Statistics, Data Science, and Analytics Courses [Dataset]. http://doi.org/10.6084/m9.figshare.26813062.v1
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.26813062.v1
Dataset updated
Apr 1, 2025
Dataset provided by
Taylor & Francis (https://taylorandfrancis.com/)
Authors
Christian Ritter; L. Allison Jones-Farmer; Frederick W. Faltin
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Students in statistics, data science, analytics, and related fields study the theory and methodology of data-related topics. Some, but not all, are exposed to experiential learning courses that cover essential parts of the life cycle of practical problem-solving. Experiential learning enables students to convert real-world issues into solvable technical questions and effectively communicate their findings to clients. We describe several experiential learning course designs in statistics, data science, and analytics curricula. We present findings from interviews with faculty from the U.S., Europe, and the Middle East and surveys of former students. We observe that courses featuring live projects and coaching by experienced faculty have a high career impact, as reported by former participants. However, such courses are labor-intensive for both instructors and students. We give estimates of the required effort to deliver courses with live projects and the perceived benefits and tradeoffs of such courses. Overall, we conclude that courses offering live-project experiences, despite being more time-consuming than traditional courses, offer significant benefits for students regarding career impact and skill development, making them worthwhile investments. Supplementary materials for this article are available online.
Data for this project include human subjects PII and cannot be shared.
catalog.data.gov
s.cnmilf.com
Updated Nov 12, 2020
Cite
U.S. EPA Office of Research and Development (ORD) (2020). Data for this project include human subjects PII and cannot be shared. [Dataset]. https://catalog.data.gov/dataset/data-for-this-project-include-human-subjects-pii-and-cannot-be-shared
Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description
Data on approximately 2 million births occurring in NJ, OH, and PA from 2000 - 2005. Linked to PM2.5 and ozone concentration estimates from EPA CMAQ fused model. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: Birth data can be acquired through application to the state health statistics departments of NJ, OH, and PA. Contact author for code. rappazzo.kristen@epa.gov. Format: No data included. This dataset is associated with the following publication: Rappazzo, K., D. Lobdell, L. Messer, C. Poole, and J. Daniels. Comparison of gestational dating methods and implications for exposure-outcome associations: an example with PM2.5 and preterm birth. JOURNAL OF OCCUPATIONAL AND ENVIRONMENTAL MEDICINE. Lippincott Williams & Wilkins, Philadelphia, PA, USA, 74(2): 138-143, (2017).
Social Media and Mental Health
kaggle.com
zip
Updated Jul 18, 2023
Cite
SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
Available download formats: zip (10944 bytes)
Dataset updated
Jul 18, 2023
Authors
SouvikAhmed071
License
Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

The following is the Google Colab link to the project, done on Jupyter Notebook -

https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

The following is the GitHub Repository of the project -

https://github.com/daerkns/social-media-and-mental-health

Libraries used for the Project -

Pandas, NumPy, Matplotlib, Seaborn, scikit-learn
Grant Giving Statistics for Metro Ideas Project
instrumentl.com
Updated Jan 6, 2022
Cite
(2022). Grant Giving Statistics for Metro Ideas Project [Dataset]. https://www.instrumentl.com/990-report/metro-ideas-project
Dataset updated
Jan 6, 2022
Variables measured
Total Assets, Total Giving
Description
Financial overview and grant giving statistics of Metro Ideas Project
Fair Trade Commission 105 Annual Commissioned Research Project Research...
data.gov.tw
csv
Cite
Fair Trade Commission, EY, Fair Trade Commission 105 Annual Commissioned Research Project Research Topic Table [Dataset]. https://data.gov.tw/en/datasets/32393
Available download formats: csv
Dataset authored and provided by
Fair Trade Commission, EY
License
https://data.gov.tw/license
Description
This dataset is a list of research topics for the Fair Trade Commission's commissioned research project in the year 2016.
Dataset of book subjects that contain Statistics in research and development...
workwithdata.com
Updated Nov 7, 2024
Cite
Work With Data (2024). Dataset of book subjects that contain Statistics in research and development [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Statistics+in+research+and+development&j=1&j0=books
Dataset updated
Nov 7, 2024
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about book subjects. It has 2 rows and is filtered where the book is Statistics in research and development. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Puerto Rico Research Topics Discussed at the USGS Natural Hazards Internal...
catalog.data.gov
data.usgs.gov
Updated Nov 26, 2025
Cite
U.S. Geological Survey (2025). Puerto Rico Research Topics Discussed at the USGS Natural Hazards Internal Meeting [Dataset]. https://catalog.data.gov/dataset/puerto-rico-research-topics-discussed-at-the-usgs-natural-hazards-internal-meeting
Dataset updated
Nov 26, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
Puerto Rico
Description
Underserved communities, especially those in coastal areas in Puerto Rico, face significant threats from natural hazards such as hurricanes and rising sea levels. Limited funding hinders the investment in costly mitigation measures, increasing exposure to natural disasters. Providing coastal resources and data products through effective communication mechanisms is fundamental to improving the well-being of these underserved coastal communities. The overall objectives of the pilot effort to engage and connect with underrepresented coastal communities in Puerto Rico were the following: (1) compile a comprehensive database of the projects and resources relevant to natural hazards in Puerto Rico; (2) foster connections with Puerto Rican interested parties to better understand their priorities regarding coastal hazards and provide them with pertinent U.S. Geological Survey (USGS) resources; and (3) identify knowledge gaps to guide future USGS projects in Puerto Rico. To address these objectives, the research team held a virtual internal meeting amongst USGS colleagues (organized with a professional facilitator) to identify and gather information on existing USGS data, knowledge, and tools available for natural hazards and resources in Puerto Rico. The goals of the meeting were to: (1) exchange knowledge among colleagues, (2) broaden the network of participants, (3) foster potential collaborative relationships with researchers engaged in USGS hazards projects in Puerto Rico, and (4) document all the research taking place in Puerto Rico related to natural hazards and resources. The result was a database of USGS natural hazards projects being conducted or recently completed in Puerto Rico. For further information about this data, refer to the associated journal article (Torres-García and others, 2024).
Shopping Mall Customer Data Segmentation Analysis
kaggle.com
zip
Updated Aug 4, 2024
Cite
DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis
Available download formats: zip (5890828 bytes)
Dataset updated
Aug 4, 2024
Authors
DataZng
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Demographic Analysis of Shopping Behavior: Insights and Recommendations

Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

Cleaned Data Details: Data cleaned and standardized, 15,079 unique entries with attributes including - Customer ID, age, gender, annual income, and spending score. Can be used by marketing analysts to produce a better strategy for mall specific marketing.

Challenges Faced: 1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention. 2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort. 3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

Research Topics: 1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions. 2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

Suggestions for Project Expansion: 1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights. 2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis. 3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking. This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

References

OpenAI. (2022). ChatGPT [Computer software]. Retrieved from https://openai.com/chatgpt
Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. Retrieved from https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
Pandas-Datareader. (n.d.). Retrieved from https://pypi.org/project/pandas-datareader/
Appendix S1 - Epidemiology of Functional Abdominal Bloating and Its Impact...
plos.figshare.com
pdf
Updated Jun 3, 2023
Cite
Meijing Wu; Yanfang Zhao; Rui Wang; Wenxin Zheng; Xiaojing Guo; Shunquan Wu; Xiuqiang Ma; Jia He (2023). Appendix S1 - Epidemiology of Functional Abdominal Bloating and Its Impact on Health Related Quality of Life: Male-Female Stratified Propensity Score Analysis in a Population Based Survey in Mainland China [Dataset]. http://doi.org/10.1371/journal.pone.0102320.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0102320.s001
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Meijing Wu; Yanfang Zhao; Rui Wang; Wenxin Zheng; Xiaojing Guo; Shunquan Wu; Xiuqiang Ma; Jia He
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
China
Description
Characteristics of baseline covariates and standardized bias before and after PS adjustment using weighting by the odds in 20% of the total respondents, a cross-sectional study in five cities, China, 2007–2008 (n = 3,179). (PDF)
Dataset - Understanding the software and data used in the social sciences
eprints.soton.ac.uk
Updated Mar 30, 2023
Cite
Chue Hong, Neil; Aragon, Selina; Antonioletti, Mario; Walker, Johanna (2023). Dataset - Understanding the software and data used in the social sciences [Dataset]. http://doi.org/10.5281/zenodo.7785710
Unique identifier
https://doi.org/10.5281/zenodo.7785710
Dataset updated
Mar 30, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Chue Hong, Neil; Aragon, Selina; Antonioletti, Mario; Walker, Johanna
Description
This is a repository for a UKRI Economic and Social Research Council (ESRC) funded project to understand the software used to analyse social sciences data. Any software produced has been made available under a BSD 2-Clause license and any data and other non-software derivative is made available under a CC-BY 4.0 International License. Note that the software that analysed the survey is provided for illustrative purposes - it will not work on the decoupled anonymised data set. Exceptions to this are:

Data from the UKRI ESRC is mostly made available under a CC BY-NC-SA 4.0 Licence.
Data from Gateway to Research is made available under an Open Government Licence (Version 3.0).

Contents

Survey data & analysis: esrc_data-survey-analysis-data.zip
Other data: esrc_data-other-data.zip
Transcripts: esrc_data-transcripts.zip
Data Management Plan: esrc_data-dmp.zip

Survey data & analysis

The survey ran from 3rd February 2022 to 6th March 2023 during which 168 responses were received. Of these responses, three were removed because they were supplied by people from outside the UK without a clear indication of involvement with the UK or associated infrastructure. A fourth response was removed because two responses came from the same person, which leaves 164 responses in the data. The survey responses, Question (Q) Q1-Q16, have been decoupled from the demographic data, Q17-Q23. Questions Q24-Q28 are for follow-up and have been removed from the data. The institutions (Q17) and funding sources (Q18) have been provided in a separate file as this could be used to identify respondents. Q17, Q18 and Q19-Q23 have all been independently shuffled.

The data has been made available as Comma Separated Values (CSV) with the question number as the header of each column and the encoded responses in the column below. To see what the question and the responses correspond to you will have to consult the survey-results-key.csv which decodes the question and responses accordingly. A pdf copy of the survey questions is available on GitHub.

The survey data has been decoupled into:

survey-results-key.csv - maps a question number and the responses to the actual question values.
q1-16-survey-results.csv - the non-demographic component of the survey responses (Q1-Q16).
q19-23-demographics.csv - the demographic part of the survey (Q19-Q21, Q23).
q17-institutions.csv - the institution/location of the respondent (Q17).
q18-funding.csv - funding sources within the last 5 years (Q18).

Please note the code that has been used to do the analysis will not run with the decoupled survey data.

Other data files included

CleanedLocations.csv - normalised version of the institutions that the survey respondents volunteered.
DTPs.csv - information on the UKRI Doctoral Training Partnerships (DTPs) scraped from the UKRI DTP contacts web page in October 2021.
projectsearch-1646403729132.csv.gz - data snapshot from the UKRI Gateway to Research released on the 24th February 2022 made available under an Open Government Licence.
locations.csv - latitude and longitude for the institutions in the cleaned locations.
subjects.csv - research classifications for the ESRC projects for the 24th February data snapshot.
topics.csv - topic classification for the ESRC projects for the 24th February data snapshot.

Interview transcripts

The interview transcripts have been anonymised and converted to markdown so that they are easier to process in general.

List of interview transcripts: 1269794877.md, 1578450175.md, 1792505583.md, 2964377624.md, 3270614512.md, 40983347262.md, 4288358080.md, 4561769548.md, 4938919540.md, 5037840428.md, 5766299900.md, 5996360861.md, 6422621713.md, 6776362537.md, 7183719943.md, 7227322280.md, 7336263536.md, 75909371872.md, 7869268779.md, 8031500357.md, 9253010492.md

Data Management Plan

The study's Data Management Plan is provided in PDF format and shows the different data sets used throughout the duration of the study and where they have been deposited, as well as how long the SSI will keep these records.
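
Since the response files hold encoded values with the question number as column header, reading them means joining each value against the key file. The sketch below shows one way this could look. The column layout of the key file (`question`, `code`, `meaning`) and all sample rows are assumptions made for illustration; the real survey-results-key.csv may be organised differently.

```python
import csv
import io

# Made-up stand-ins for survey-results-key.csv and q1-16-survey-results.csv.
KEY_CSV = """question,code,meaning
Q1,1,Yes
Q1,2,No
Q2,1,Daily
Q2,2,Weekly
"""
RESULTS_CSV = """Q1,Q2
1,2
2,1
"""


def load_key(text):
    """Build a lookup from (question number, encoded response) to its meaning."""
    key = {}
    for row in csv.DictReader(io.StringIO(text)):
        key[(row["question"], row["code"])] = row["meaning"]
    return key


def decode(results_text, key):
    """Replace each encoded response with its meaning; unknown codes pass through."""
    decoded = []
    for row in csv.DictReader(io.StringIO(results_text)):
        decoded.append({q: key.get((q, code), code) for q, code in row.items()})
    return decoded


decoded_rows = decode(RESULTS_CSV, load_key(KEY_CSV))
```

Because the demographic columns were shuffled independently, decoded rows from different files cannot be matched back to one respondent, which is the intended effect of the decoupling.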
Google Certificate BellaBeats Capstone Project
kaggle.com
zip
Updated Jan 5, 2023
Cite
Jason Porzelius (2023). Google Certificate BellaBeats Capstone Project [Dataset]. https://www.kaggle.com/datasets/jasonporzelius/google-certificate-bellabeats-capstone-project
Available download formats: zip (169161 bytes)
Dataset updated
Jan 5, 2023
Authors
Jason Porzelius
Description
Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted database program, Excel for both my data analysis and visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access. Therefore, completing a capstone project using web-based programs such as R Studio, SQL Workbench, or Google Sheets was not a feasible choice. I was further limited in which option to choose as the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in hopes that the findings from that dataset might reveal insights which will assist in Bellabeat's marketing strategies for future growth. My task is to provide data driven insights to business tasks provided by the Bellabeats, Inc.'s executive and data analysis team. In order to accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). In addition, I will break each part of the Data Analysis Process down into three sections to provide clarity and accountability. Those three sections are: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task using an asterisk (*) as an identifier.

Section 1 - Ask:

A. Guiding Questions:
1. Who are the key stakeholders and what are their goals for the data analysis project? 2. What is the business task that this data analysis project is attempting to solve?

B. Key Tasks: 1. Identify key stakeholders and their goals for the data analysis project *The key stakeholders for this project are as follows: -Urška Sršen and Sando Mur - co-founders of Bellabeats, Inc. -Bellabeats marketing analytics team. I am a member of this team.

2. Identify the business task. *The business task is: -As provided by co-founder Urška Sršen, the business task for this project is to gain insight into how consumers are using their non-BellaBeats smart devices in order to guide upcoming marketing strategies for the company which will help drive future growth. Specifically, the researcher was tasked with applying insights driven by the data analysis process to 1 BellaBeats product and presenting those insights to BellaBeats stakeholders.

Section 2 - Prepare:

A. Guiding Questions: 1. Where is the data stored and organized? 2. Are there any problems with the data? 3. How does the data help answer the business question?

B. Key Tasks:

1. Research and communicate the source of the data, and how it is stored/organized, to stakeholders.
*The data source used for our case study is FitBit Fitness Tracker Data. This dataset is stored on Kaggle and was made available by user Mobius in an open-source format. Therefore, the data is public and can be copied, modified, and distributed without asking the user for permission. The datasets were generated by respondents to a survey distributed via Amazon Mechanical Turk, reportedly (see credibility section directly below) between 03/12/2016 and 05/12/2016.
*Reportedly (see credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute, hour, and day level totals. This data is stored in 18 CSV documents. I downloaded all 18 documents to my laptop and decided to use 2 of them for this project, as they were the files that merged activity and sleep data from the other documents. All unused documents were permanently deleted from the laptop. The 2 files used were:
-sleepDay_merged.csv
-dailyActivity_merged.csv

2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
*As will be more specifically presented in the Process section, the data seems to have credibility issues related to the reported time frame of the data collected. The metadata seems to indicate that the data collected covered roughly 2 months of FitBit tracking. However, during my initial data processing, I found that only 1 month of data was reported.
*As will be more specifically presented in the Process section, the data has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata communicates that 30 individual users agreed to report their tracking data. My initial data processing uncovered 33 individual ...
Central Police University independent research project topic
data.gov.tw
api, csv
Central Police University, Central Police University independent research project topic [Dataset]. https://data.gov.tw/en/datasets/152820
Explore at:
Available download formats: api, csv
Dataset authored and provided by
Central Police University
License
https://data.gov.tw/license
Description
Summary of the topics of Central Police University's independent research projects from ROC years 106 to 110 (2017 to 2021)
UDOT Region 4 - Arches Hotspot Preliminary Project Ideas Map 2018
uplan.hub.arcgis.com
Updated Jan 13, 2018
UPlan Map Center (2018). UDOT Region 4 - Arches Hotspot Preliminary Project Ideas Map 2018 [Dataset]. https://uplan.hub.arcgis.com/maps/b395fdb59fca4799976faab5b4eb7f94
Explore at:
Dataset updated
Jan 13, 2018
Dataset authored and provided by
UPlan Map Center
Area covered

Description
Purpose: This map contains project data for the Arches recreational hot spot study, PIN 16097, for the Arches Hotspot Preliminary Project Ideas App 2018 study and is embedded within that storymap. It illustrates proposed parking, cycling trail, and other recreational transportation projects. The data was completed in 2018 by Jones and DeMille Engineers. For questions on the data, please contact Adam Perschon at adam.p@jonesanddemille.com. Ownership was transferred from Paul Damron to Bracken on 6/23/23.
Go Live Date: January 2018
Project PIN: 16097
ePM Project Name: Moab Area Recreational Hot Spot Study
Owner: Bracken Davis (bdavis1@utah.gov)
Update Interval: One-time creation.
Data Location: MoabHotspotStudy hosted feature layer.
Associated Apps:
Arches Hotspot Preliminary Project storymap
UDOT Region 4 - Arches Hotspot Improvement Projects 2018 storymap
UDOT Region 4 - Arches Hotspot Additional Study Information 2018 storymap
Expected Life of Data: There is no foreseeable end date for this data.
Climate Change and Environmental Issues Dataset from Ukrainian Telegram...
search.dataone.org
Updated Oct 29, 2025
Ustyianovych, Taras; Fedushko, Solomia (2025). Climate Change and Environmental Issues Dataset from Ukrainian Telegram Channels [Dataset]. http://doi.org/10.7910/DVN/NL06IX
Explore at:
Unique identifier
https://doi.org/10.7910/DVN/NL06IX
Dataset updated
Oct 29, 2025
Dataset provided by
Harvard Dataverse
Authors
Ustyianovych, Taras; Fedushko, Solomia
Description
Overview
This repository contains two datasets that were collected and processed as part of a study on public perception of environmental issues and climate change in Ukraine. The datasets are derived from Ukrainian Telegram news channels and include metadata, raw text, and user reactions to posts related to climate events and environmental topics. These datasets are intended to support academic research on the relationship between public discourse, user sentiment, and climate indicators. The datasets are located in the data folder with respect to their extension: csv and parquet. If you decide to read the climate_text_data_final in CSV format, please set the encoding to utf-16.

Datasets

climate_text_data_final
This dataset contains raw text data from Telegram posts, along with additional metadata. It provides a comprehensive view of the content and context of climate-related discussions. The dataset can be joined with the final_reactions_data based on the channel_name and message_id. Please ensure the encoding is set to utf-16 when reading the CSV format of the dataset.
Key Features:
Post ID: Unique identifier for each Telegram post.
Channel Name: The name of the Telegram channel where the post was published.
Text: The raw text of the Telegram post.
Metadata: Includes timestamp, number of views, and number of forwards.
Purpose: This dataset is designed to support natural language processing (NLP) tasks, such as topic modeling, named entity recognition, and sentiment analysis. It provides a foundation for understanding the themes and narratives surrounding climate change and environmental issues in Ukrainian online information space.

final_reactions_data
This dataset contains user reactions to Telegram posts, represented as emoji counts. It provides a detailed view of how users engage with climate-related content.
Key Features:
Post ID: Unique identifier for each Telegram post.
Channel Name: The name of the Telegram channel where the post was published.
Emoji Reactions: Columns representing counts of various emojis used to react to the post.
Is NA: A boolean value showing whether the emoji reaction columns have NaN or at least one non-NA value.
Purpose: This dataset enables researchers to analyze user sentiment and engagement with climate-related content. It can be used to identify patterns in public reactions to environmental issues and assess the emotional tone of the discourse. The emojis can be classified into categories to reduce dimensionality and work with a combined representation of emojis. Further, statistics on particular emoji class can be generated. This will lead to a solid understanding of user engagement patterns.

Research Context
The datasets were collected as part of a study aimed at understanding public attitudes toward environmental issues and exploring the relationship between public perception and climate indicators, especially in the period of the full-scale Russian aggression against Ukraine. The study focused on Telegram channels due to their popularity and influence in Ukraine. The research objectives included:
Developing a methodology for automated data collection from Ukrainian Telegram channels on climate-related topics.
Conducting a comprehensive analysis of the collected data using natural language processing and statistical methods to identify key topics, trends, and patterns.
Investigating the relationship between message characteristics and user reactions to determine factors influencing public perception of environmental issues.
The study analyzed content from seven influential Telegram news channels: DW Ukraine, BBC Ukrainian, Ukrayinska Pravda, Voice of America, Radio Liberty, Babel, and ZN.UA. These channels were selected based on their audience size, credibility, and regularity of coverage of environmental issues.
The data collection period spanned five years (01.01.2020 - 14.01.2025), allowing for an analysis of trends over time, including the impact of the Russian war in Ukraine on public discourse.

Ethical Considerations
The datasets do not contain any personally identifiable information (PII). However, we acknowledge that the dataset may contain sensitive content due to the nature of the data. Some records may describe war-related activities, destruction, harm, or other sensitive topics. We have made every effort to remain unbiased in collecting data from the selected channels and have not censored any content. The dataset will undergo ethical clearance at Lviv Polytechnic National University to ensure compliance with ethical standards and guidelines for data collection, processing, and usage. This process aims to address potential concerns related to sensitive content and ensure the responsible use of the dataset in academic research.
Recommendations for Ethical Use:
Fairness and Bias: Evaluate results with fairness metrics to ensure that analyses are not biased or discriminatory.
Transparency: Use tools for interpretability and explainability to ensure...
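The description above asks readers to open the CSV version with utf-16 encoding and to join the two tables on channel_name and message_id. A minimal sketch of both steps with the Python standard library follows. The file locations are not given in the description, so any path passed to `read_utf16_csv` is an assumption, and both helper names are hypothetical.

```python
import csv


def read_utf16_csv(path):
    """Read a UTF-16 encoded CSV file into a list of dict rows."""
    # encoding="utf-16" follows the dataset description; the path itself
    # is supplied by the caller and is not specified by the source.
    with open(path, encoding="utf-16", newline="") as fh:
        return list(csv.DictReader(fh))


def join_posts_with_reactions(posts, reactions):
    """Inner-join two row lists on (channel_name, message_id)."""
    by_key = {(r["channel_name"], r["message_id"]): r for r in reactions}
    joined = []
    for post in posts:
        match = by_key.get((post["channel_name"], post["message_id"]))
        if match is not None:
            # Post columns win if a column name appears in both tables.
            joined.append({**match, **post})
    return joined
```

Keys are compared as strings here because `csv.DictReader` does not convert types; if one table is loaded from parquet instead, message_id may need to be cast first.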
Academy of Program/Project & Engineering Leadership ASK the Academy Past...
catalog.data.gov
datasets.ai
+4more
Updated Apr 11, 2025
National Aeronautics and Space Administration (2025). Academy of Program/Project & Engineering Leadership ASK the Academy Past Issues [Dataset]. https://catalog.data.gov/dataset/academy-of-program-project-engineering-leadership-ask-the-academy-past-issues
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
Academy of Program/Project & Engineering Leadership's Ask the Academy magazine past issues.
HCUP Fast Stats
catalog.data.gov
healthdata.gov
+2more
Updated Jul 16, 2025
Agency for Healthcare Research and Quality, Department of Health & Human Services (2025). HCUP Fast Stats [Dataset]. https://catalog.data.gov/dataset/hcup-fast-stats
Explore at:
Dataset updated
Jul 16, 2025
Dataset provided by
United States Department of Health and Human Services (http://www.hhs.gov/)
Agency for Healthcare Research and Quality (http://www.ahrq.gov/)
Description
Healthcare Cost and Utilization Project (HCUP) Fast Stats provides easy access to the latest HCUP-based statistics for health care information topics. HCUP Fast Stats uses visual statistical displays in stand-alone graphs, trend figures, or simple tables to convey complex information at a glance. Fast Stats is updated regularly for timely, topic-specific national and State-level statistics. Fast Stats offers topics and graphics on hospital stays and emergency department visits, including information at the national and state levels, trends over time, and selected priority topics such as:
State Trends in Hospital Use by Payer
National Hospital Utilization and Costs
Hurricane Impact on Hospital Use
Opioids & Neonatal Abstinence Syndrome
Severe Maternal Morbidity
Academy of Program/Project & Engineering Leadership ASK Magazine Past Issues...
data.nasa.gov
datasets.ai
+3more
Updated Mar 31, 2025
nasa.gov (2025). Academy of Program/Project & Engineering Leadership ASK Magazine Past Issues [Dataset]. https://data.nasa.gov/dataset/academy-of-program-project-engineering-leadership-ask-magazine-past-issues
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
Academy of Program/Project & Engineering Leadership's ASK Magazine archive.
Approved research projects by the Committee for the Protection of Human...
catalog.data.gov
data.chhs.ca.gov
+2more
Updated Nov 27, 2024
California Health and Human Services Agency (2024). Approved research projects by the Committee for the Protection of Human Subjects [Dataset]. https://catalog.data.gov/dataset/approved-research-projects-by-the-committee-for-the-protection-of-human-subjects-4b600
Explore at:
Dataset updated
Nov 27, 2024
Dataset provided by
California Health and Human Services Agency (https://www.chhs.ca.gov/)
Description
This dataset contains research projects approved by the California Health and Human Services Agency (CalHHS) Committee for the Protection of Human Subjects (CPHS). CPHS is the CalHHS institutional review board and reviews all research involving human participants conducted or supported by the CalHHS and all research using private information held by CalHHS and all other state agencies.
Synthetic Administrative Data: Census 1991, 2023
datacatalogue.ukdataservice.ac.uk
Updated Feb 21, 2024
Shlomo, N, University of Manchester; Kim, M, University of Manchester (2024). Synthetic Administrative Data: Census 1991, 2023 [Dataset]. http://doi.org/10.5255/UKDA-SN-856310
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-856310
Dataset updated
Feb 21, 2024
Authors
Shlomo, N, University of Manchester; Kim, M, University of Manchester
Area covered
United Kingdom
Description
We create a synthetic administrative dataset to be used in the development of the R package for calculating quality indicators for administrative data (see: https://github.com/sook-tusk/qualadmin). The dataset mimics the properties of a real administrative dataset according to specifications by the ONS. Taking over 1 million records from a synthetic 1991 UK census dataset, we deleted records, moved records to a different geography and duplicated records to a different geography according to pre-specified proportions for each broad ethnic group (White, Non-white) and gender (males, females). The final size of the synthetic administrative data was 1,033,664 individuals.
National Statistical Institutes (NSIs) are directing resources into advancing the use of administrative data in official statistics systems. This is a top priority for the UK Office for National Statistics (ONS) as they are undergoing transformations in their statistical systems to make more use of administrative data for future censuses and population statistics. Administrative data are defined as secondary data sources since they are produced by other agencies as a result of an event or a transaction relating to administrative procedures of organisations, public administrations and government agencies. Nevertheless, they have the potential to become important data sources for the production of official statistics by significantly reducing the cost and burden of response and improving the efficiency of such systems. Embedding administrative data in statistical systems is not without costs and it is vital to understand where potential errors may arise. The Total Administrative Data Error Framework sets out all possible sources of error when using administrative data as statistical data, depending on whether it is a single data source or integrated with other data sources such as survey data. For a single administrative data source, one of the main sources of error is coverage and representation of the target population of interest. This is particularly relevant when administrative data is delivered over time, such as tax data for maintaining the Business Register. For sub-project 1 of this research project, we develop quality indicators that allow the statistical agency to assess whether the administrative data is representative of the target population and which sub-groups may be missing or over-covered. This is essential for producing unbiased estimates from administrative data.
Another priority at statistical agencies is to produce a statistical register for population characteristic estimates, such as employment statistics, from multiple sources of administrative and survey data. Using administrative data to build a spine, survey data can be integrated using record linkage and statistical matching approaches on a set of common matching variables. This will be the topic for sub-project 2, which will be split into several topics of research. The first topic is whether adding statistical predictions and correlation structures improves the linkage and data integration. The second topic is to research a mass imputation framework for imputing missing target variables in the statistical register where the missing data may be due to multiple underlying mechanisms. Therefore, the third topic will aim to improve the mass imputation framework to mitigate against possible measurement errors, for example by adding benchmarks and other constraints into the approaches. On completion of a statistical register, estimates for key target variables at local areas can easily be aggregated. However, it is essential to also measure the precision of these estimates through mean square errors and this will be the fourth topic of the sub-project. Finally, this new way of producing official statistics is compared to the more common method of incorporating administrative data through survey weights and model-based estimation approaches. In other words, we evaluate whether it is better 'to weight' or 'to impute' for population characteristic estimates - a key question under investigation by survey statisticians in the last decade.


Southern Africa Labour and Development Research Unit (2019). Project for Statistics on Living Standards and Development 1993 - South Africa [Dataset]. https://catalog.ihsn.org/catalog/4628

Project for Statistics on Living Standards and Development 1993 - South Africa

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

Mar 29, 2019

Dataset authored and provided by

Southern Africa Labour and Development Research Unit

Time period covered

1993

Area covered

South Africa

Description

Abstract

The Project for Statistics on Living Standards and Development was a countrywide World Bank Living Standards Measurement Survey. It covered approximately 9000 households, drawn from a representative sample of South African households. The fieldwork was undertaken during the nine months leading up to the country's first democratic elections at the end of April 1994. The purpose of the survey was to collect statistical information about the conditions under which South Africans live in order to provide policymakers with the data necessary for planning strategies. This data would aid the implementation of goals such as those outlined in the Government of National Unity's Reconstruction and Development Programme.

Geographic coverage

National coverage

Analysis unit

Households
Individuals
Community

Universe

All Household members.

Individuals in hospitals, old age homes, hotels and hostels of educational institutions were not included in the sample. Migrant labour hostels were included. In addition to those that turned up in the selected ESDs, a sample of three hostels was chosen from a national list provided by the Human Sciences Research Council and within each of these hostels a representative sample was drawn on a similar basis as described above for the households in ESDs.

Kind of data

Sample survey data [ssd]

Sampling procedure

Sample size is 9,000 households

The sample design adopted for the study was a two-stage self-weighting design in which the first stage units were Census Enumerator Subdistricts (ESDs, or their equivalent) and the second stage units were households.

The advantage of using such a design is that it provides a representative sample that need not be based on an accurate census population distribution. In the case of South Africa, the sample will automatically include many poor people, without the need to go beyond this and oversample the poor. Proportionate sampling, as in such a self-weighting sample design, offers the simplest possible data files for further analysis, as weights do not have to be added. However, in the end this advantage could not be retained and weights had to be added.

The sampling frame was drawn up on the basis of small, clearly demarcated area units, each with a population estimate. The nature of the self-weighting procedure adopted ensured that this population estimate was not important for determining the final sample, however. For most of the country, census ESDs were used. Where some ESDs comprised relatively large populations as for instance in some black townships such as Soweto, aerial photographs were used to divide the areas into blocks of approximately equal population size. In other instances, particularly in some of the former homelands, the area units were not ESDs but villages or village groups.

In the sample design chosen, the area stage units (generally ESDs) were selected with probability proportional to size, based on the census population. Systematic sampling was used throughout, that is, sampling at a fixed interval in a list of ESDs, starting at a randomly selected point. Given that sampling was self-weighting, the impact of stratification was expected to be modest. The main objective was to ensure that the racial and geographic breakdown approximated the national population distribution. This was done by listing the area stage units (ESDs) by statistical region and then, within each statistical region, by urban or rural. Within these sub-statistical regions, the ESDs were then listed in order of percentage African. The sampling interval for the selection of the ESDs was obtained by dividing the 1991 census population of 38,120,853 by the 300 clusters to be selected. This yielded 105,800. Starting at a randomly selected point, every 105,800th person down the cluster list was selected. This ensured both geographic and racial diversity (ESDs were ordered by statistical sub-region and proportion of the population African). In three or four instances, the ESD chosen was judged inaccessible and replaced with a similar one.
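The first-stage selection described above (an ordered list of area units, a fixed interval, a random start) can be sketched as follows. This is an illustration of systematic sampling with probability proportional to size, not the survey team's actual program; the function name and the toy inputs are hypothetical. As an arithmetic aside, 38,120,853 divided by 300 is about 127,070, whereas the quoted interval of 105,800 corresponds to roughly 360 clusters; the text does not say which figure is the intended one, so the sketch takes the number of clusters as a parameter.

```python
import random
from bisect import bisect_left
from itertools import accumulate

CENSUS_1991 = 38_120_853  # population figure quoted in the text


def pps_systematic(sizes, n_clusters, rng=random):
    """Pick n_clusters units with probability proportional to size.

    sizes must already be in stratification order (statistical region,
    urban/rural, percentage African), as the text describes.
    Returns (interval, indices of the selected units).
    """
    total = sum(sizes)
    interval = total / n_clusters
    start = rng.random() * interval      # random starting point in [0, interval)
    cumulative = list(accumulate(sizes))  # running population total per unit
    picks = []
    for k in range(n_clusters):
        target = start + k * interval    # position of the k-th selected "person"
        idx = bisect_left(cumulative, target)
        picks.append(min(idx, len(sizes) - 1))  # guard against float rounding
    return interval, picks


# Arithmetic check of the quoted figures.
print(CENSUS_1991 / 300)      # interval implied by 300 clusters
print(CENSUS_1991 / 105_800)  # number of clusters implied by the quoted interval
```

A unit larger than the interval can be selected more than once under this scheme, which is one reason the text mentions splitting large ESDs such as those in Soweto into blocks of roughly equal population.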

In the second sampling stage the unit of analysis was the household. In each selected ESD a listing or enumeration of households was carried out by means of a field operation. From the households listed in an ESD a sample of households was selected by systematic sampling. Even though the ultimate enumeration unit was the household, in most cases "stands" were used as enumeration units. However, when a stand was chosen as the enumeration unit all households on that stand had to be interviewed.

Census population data, however, was available only for 1991. An assumption on population growth was thus made to obtain an approximation of the population size for 1993, the year of the survey. The sampling interval at the level of the household was determined in the following way: Based on the decision to have a take of 125 individuals on average per cluster (i.e. assuming 5 members per household to give an average cluster size of 25 households), the interval of households to be selected was determined as the census population divided by 118.1, i.e. allowing for population growth since the census. It was subsequently discovered that population growth was slightly over-estimated but this had little effect on the findings of the survey.
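The household-level interval above can be worked through numerically. The sketch below assumes that "the census population" means the 1991 census population of the selected ESD, which is one reading of the paragraph; the function names and the example ESD size are hypothetical. Under that reading, a divisor of 118.1 rather than 125 implies an assumed population growth of about 5.8% between 1991 and 1993.

```python
TARGET_PERSONS_PER_CLUSTER = 125  # from the text
ASSUMED_HOUSEHOLD_SIZE = 5        # from the text
INTERVAL_DIVISOR = 118.1          # from the text


def implied_growth_factor():
    """Growth since the 1991 census implied by using 118.1 instead of 125."""
    return TARGET_PERSONS_PER_CLUSTER / INTERVAL_DIVISOR


def household_interval(esd_census_population):
    """Select every k-th listed household, with k = census population / 118.1.

    Derivation: households in 1993 ~ population * growth / 5, and 25 of
    them are wanted, so k = population * growth / 125 = population / 118.1.
    """
    return esd_census_population / INTERVAL_DIVISOR


# Hypothetical ESD with 2,362 people in the 1991 census.
print(round(household_interval(2362), 2))
print(round(implied_growth_factor(), 4))
```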

Mode of data collection

Face-to-face [f2f]

Research instrument

The main instrument used in the survey was a comprehensive household questionnaire. This questionnaire covered a wide range of topics but was not intended to provide exhaustive coverage of any single subject. In other words, it was an integrated questionnaire aimed at capturing different aspects of living standards. The topics covered included demography, household services, household expenditure, educational status and expenditure, remittances and marital maintenance, land access and use, employment and income, health status and expenditure and anthropometry (children under the age of six were weighed and their heights measured). This questionnaire was available to households in two languages, namely English and Afrikaans. In addition, interviewers had in their possession a translation in the dominant African language/s of the region.

In addition to the detailed household questionnaire referred to above, a community questionnaire was administered in each cluster of the sample. The purpose of this questionnaire was to elicit information on the facilities available to the community in each cluster. Questions related primarily to the provision of education, health and recreational facilities. Furthermore there was a detailed section for the prices of a range of commodities from two retail sources in or near the cluster: a formal source such as a supermarket and a less formal one such as the "corner cafe" or a "spaza". The purpose of this latter section was to obtain a measure of regional price variation both by region and by retail source. These prices were obtained by the interviewer. For the questions relating to the provision of facilities, respondents were "prominent" members of the community such as school principals, priests and chiefs.

Cleaning operations

All the questionnaires were checked when received. Where information was incomplete or appeared contradictory, the questionnaire was sent back to the relevant survey organization. As soon as the data was available, it was captured using the local development platform ADE. This was completed in February 1994. Following this, a series of exploratory programs were written to highlight inconsistencies and outliers. For example, all person level files were linked together to ensure that the same person code reported in different sections of the questionnaire corresponded to the same person. The error reports from these programs were compared to the questionnaires and the necessary alterations made. This was a lengthy process, as several files were checked more than once, and was completed at the beginning of August 1994. In some cases questionnaires would contain missing values, or comments that the respondent did not know, or refused to answer a question.

These responses are coded in the data files with the following values:

VALUE  MEANING
-1     The data was not available on the questionnaire or form
-2     The field is not applicable
-3     Respondent refused to answer
-4     Respondent did not know answer to question
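Because these codes are stored as ordinary negative numbers, they will distort means and totals unless they are set aside before analysis. A minimal sketch of one way to handle them is shown below; the helper names are hypothetical, and the approach assumes the variable being analysed cannot legitimately take the values -1 to -4.

```python
MISSING_CODES = {
    -1: "not available on the questionnaire or form",
    -2: "not applicable",
    -3: "refused to answer",
    -4: "did not know",
}


def split_missing(value):
    """Return (value, None) for real data or (None, reason) for a code."""
    if value in MISSING_CODES:
        return None, MISSING_CODES[value]
    return value, None


def mean_ignoring_missing(values):
    """Average only the values that are not missing-value codes."""
    valid = [v for v in values if v not in MISSING_CODES]
    return sum(valid) / len(valid) if valid else None


# Toy example: two real values and two coded non-responses.
print(mean_ignoring_missing([10, -1, 20, -4]))
```

Keeping the reason alongside the blanked value makes it possible to report refusals and "don't know" responses separately rather than pooling them.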

Data appraisal

The data collected in clusters 217 and 218 should be viewed as highly unreliable and therefore removed from the data set. The data currently available on the web site has been revised to remove the data from these clusters. Researchers who have downloaded the data in the past should revise their data sets. For information on the data in those clusters, contact SALDRU http://www.saldru.uct.ac.za/.
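For researchers holding an older copy of the files, the revision described above amounts to dropping every record from the two clusters. A minimal sketch follows; the column name "cluster" is an assumption, since the text does not name the variable that holds the cluster number.

```python
UNRELIABLE_CLUSTERS = {217, 218}  # cluster numbers given in the text


def drop_unreliable(records, cluster_key="cluster"):
    """Return only the records that fall outside the unreliable clusters.

    cluster_key is a hypothetical column name; replace it with the
    actual cluster identifier used in the data files.
    """
    return [r for r in records if r[cluster_key] not in UNRELIABLE_CLUSTERS]


# Toy records standing in for household rows.
rows = [{"cluster": 216, "hh": 1}, {"cluster": 217, "hh": 2}, {"cluster": 218, "hh": 3}]
print(drop_unreliable(rows))
```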



Data from: Expensive but Worth It: Live Projects in Statistics, Data...

Data for this project include human subjects PII and cannot be shared.

Social Media and Mental Health

Grant Giving Statistics for Metro Ideas Project

Fair Trade Commission 105 Annual Commissioned Research Project Research...

Dataset of book subjects that contain Statistics in research and development...

Puerto Rico Research Topics Discussed at the USGS Natural Hazards Internal...

Shopping Mall Customer Data Segmentation Analysis

Appendix S1 - Epidemiology of Functional Abdominal Bloating and Its Impact...

Dataset - Understanding the software and data used in the social sciences

Google Certificate BellaBeats Capstone Project

Central Police University independent research project topic

UDOT Region 4 - Arches Hotspot Preliminary Project Ideas Map 2018

Climate Change and Environmental Issues Dataset from Ukrainian Telegram...

Academy of Program/Project & Engineering Leadership ASK the Academy Past...

HCUP Fast Stats

Academy of Program/Project & Engineering Leadership ASK Magazine Past Issues...

Approved research projects by the Committee for the Protection of Human...

Synthetic Administrative Data: Census 1991, 2023
