License: GNU GPL v2.0 (http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html)
DATA EXPLORATION Understand the characteristics of the given fields in the underlying data, such as variable distributions, whether the dataset is skewed towards a certain demographic, and the validity of the fields. For example, a training dataset may be heavily skewed towards the younger age bracket; if so, how will this affect your results when using it to predict over the remaining customer base? Identify limitations in the data and gather external data that may be useful for modelling purposes. This may include bringing in ABS data at different geographic levels and creating additional features for the model. For example, the geographic remoteness of different postcodes may be used as a proximity indicator when considering whether a customer needs a bike to ride to work.
MODEL DEVELOPMENT Determine a hypothesis related to the business question that can be answered with the data, and perform statistical testing to determine whether the hypothesis holds. Create calculated fields based on existing data; for example, convert the D.O.B. into an age bracket. Other fields that may be engineered include ‘High Margin Product’, an indicator of whether the product purchased by the customer in the past three months falls into a high-margin category, based on the fields ‘list_price’ and ‘standard cost’. Another example is calculating the distance from office to home address as a factor in determining whether customers may purchase a bicycle for transportation purposes. This stage also includes deciding what the predicted variable actually is: for example, is it ordinal, nominal, binary, or continuous? Test the performance of the model using measures relevant to the chosen model (e.g. residual deviance, AIC, ROC curves, R-squared). Appropriately document model performance, assumptions, and limitations.
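One of the calculated fields above, converting a D.O.B. into an age bracket, can be sketched in Python; the bracket cut-points and the reference date are assumptions for illustration, not prescribed by the brief:

```python
from datetime import date

def age_bracket(dob, as_of=date(2017, 12, 31)):
    """Convert a date of birth into a coarse age bracket.

    The cut-points and the as-of date are illustrative assumptions.
    """
    age = as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))
    if age < 25:
        return "Under 25"
    if age < 45:
        return "25-44"
    if age < 65:
        return "45-64"
    return "65+"

print(age_bracket(date(1990, 6, 15)))  # 25-44
```

A derived categorical field like this is often easier for a model (and a business audience) to work with than a raw date.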
INTERPRETATION AND REPORTING Visualise and present your findings. This may involve interpreting the significant variables and coefficients from a business perspective. The slides should tell a compelling story around the business issue and support your case with quantitative and qualitative observations. Please refer to the module below for further details.
The dataset is easy to understand and self-explanatory!
It is important to keep in mind the business context when presenting your findings: 1. What are the trends in the underlying data? 2. Which customer segment has the highest customer value? 3. What do you propose should be the marketing and growth strategy?
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background
Many technological, biological, social, and information networks fall into the broad class of ‘small-world’ networks: they have tightly interconnected clusters of nodes and a mean shortest path length similar to that of a matched random graph (same number of nodes and edges). This semi-quantitative definition leads to a categorical distinction (‘small/not-small’) rather than a quantitative, continuous grading of networks, and can leave a network's small-world status uncertain. Moreover, systems described by small-world networks are often studied using an equivalent canonical network model, the Watts-Strogatz (WS) model. However, the process of establishing an equivalent WS model is imprecise, and there is a pressing need to discover ways in which this equivalence may be quantified.

Methodology/Principal Findings
We defined a precise measure of ‘small-world-ness’ S based on the trade-off between high local clustering and short path length. A network is now deemed a ‘small-world’ if S > 1, an assertion which may be tested statistically. We then examined the behavior of S on a large dataset of real-world systems. We found that all these systems were linked by a linear relationship between their S values and the network size n. Moreover, we show a method for assigning a unique WS model to any real-world network, and show analytically that the WS models associated with our sample of networks also exhibit linearity between S and n. Linearity between S and n is not, however, inevitable, and neither is S maximal for an arbitrary network of given size. Linearity may, however, be explained by a common limiting growth process.

Conclusions/Significance
We have shown how the notion of a small-world network may be quantified. Several key properties of the metric are described, and the use of WS canonical models is placed on a more secure footing.
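A trade-off measure of the kind described, S = (C/C_rand)/(L/L_rand) with C the mean clustering coefficient and L the mean shortest path length, can be sketched in pure Python. The Erdős–Rényi expectations C_rand ≈ k/n and L_rand ≈ ln n / ln k are standard analytic approximations for the matched random graph, and the example graph (a ring lattice with a few shortcuts) is illustrative only:

```python
import math
from collections import deque

def mean_path_length(adj):
    """Mean shortest-path length over all reachable node pairs (BFS, unweighted)."""
    total, pairs = 0, 0
    for src in range(len(adj)):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for node, d in dist.items():
            if node != src:
                total += d
                pairs += 1
    return total / pairs

def clustering(adj):
    """Mean local clustering coefficient (fraction of neighbor pairs that are linked)."""
    acc = 0.0
    for u in range(len(adj)):
        nbrs = list(adj[u])
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        acc += 2.0 * links / (k * (k - 1))
    return acc / len(adj)

def small_world_S(adj, k):
    """S = (C/C_rand) / (L/L_rand) against an ER graph matched on n and mean degree k."""
    n = len(adj)
    C, L = clustering(adj), mean_path_length(adj)
    C_rand = k / n                       # expected ER clustering
    L_rand = math.log(n) / math.log(k)   # expected ER path length
    return (C / C_rand) / (L / L_rand)

# Illustrative network: ring lattice on 60 nodes (degree 4) plus 3 fixed shortcuts
n = 60
adj = [set() for _ in range(n)]
def link(a, b):
    adj[a].add(b); adj[b].add(a)
for i in range(n):
    for step in (1, 2):
        link(i, (i + step) % n)
for a, b in [(0, 30), (10, 40), (20, 50)]:
    link(a, b)

print(small_world_S(adj, 4))  # > 1: classed as small-world
```

The shortcut-augmented lattice keeps its high clustering while the shortcuts collapse the path length, which is exactly the trade-off the S metric captures.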
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Despite strong interest in how noise affects marine mammals, little is known about the most abundant and commonly exposed taxa. Social delphinids occur in groups of hundreds of individuals that travel quickly, change behavior ephemerally, and are not amenable to conventional tagging methods, posing challenges in quantifying noise impacts. We integrated drone-based photogrammetry, strategically-placed acoustic recorders, and broad-scale visual observations to provide complementary measurements of different aspects of behavior for short- and long-beaked common dolphins. We measured behavioral responses during controlled exposure experiments (CEEs) of military mid-frequency (3-4 kHz) active sonar (MFAS) using simulated and actual Navy sonar sources. We used latent-state Bayesian models to evaluate response probability and persistence in exposure and post-exposure phases. Changes in sub-group movement and aggregation parameters were commonly detected during different phases of MFAS CEEs but not control CEEs. Responses were more evident in short-beaked common dolphins (n=14 CEEs), and a direct relationship between response probability and received level was observed. Long-beaked common dolphins (n=20) showed less consistent responses, although contextual differences may have limited which movement responses could be detected. These are the first experimental behavioral response data for these abundant dolphins to directly inform impact assessments for military sonars.
Methods
We used complementary visual and acoustic sampling methods at variable spatial scales to measure different aspects of common dolphin behavior in known and controlled MFAS exposure and non-exposure contexts. Three fundamentally different data collection systems were used to sample group behavior. A broad-scale visual sampling of subgroup movement was conducted using theodolite tracking from shore-based stations. Assessments of whole-group and sub-group sizes, movement, and behavior were conducted at 2-minute intervals from shore-based and vessel platforms using high-powered binoculars and standardized sampling regimes. Aerial UAS-based photogrammetry quantified the movement of a single focal subgroup. The UAS consisted of a large (1.07 m diameter) custom-built octocopter drone launched and retrieved by hand from vessel platforms. The drone carried a vertically gimballed camera (at least 16MP) and sensors that allowed precise spatial positioning, allowing spatially explicit photogrammetry to infer movement speed and directionality. Remote-deployed (drifting) passive acoustic monitoring (PAM) sensors were strategically deployed around focal groups to examine both basic aspects of subspecies-specific common dolphin acoustic (whistling) behavior and potential group responses in whistling to MFAS on variable temporal scales (Casey et al., in press). This integration allowed us to evaluate potential changes in movement, social cohesion, and acoustic behavior and their covariance associated with the absence or occurrence of exposure to MFAS. The collective raw data set consists of several GB of continuous broadband acoustic data and hundreds of thousands of photogrammetry images.
Three sets of quantitative response variables were analyzed from the different data streams: directional persistence and variation in speed of the focal subgroup from UAS photogrammetry; group vocal activity (whistle counts) from passive acoustic records; and number of sub-groups within a larger group being tracked by the shore station overlook. We fit separate Bayesian hidden Markov models (HMMs) to each set of response data, with the HMM assumed to have two states: a baseline state and an enhanced state that was estimated in sequential 5-s blocks throughout each CEE. The number of subgroups was recorded during periodic observations every 2 minutes and assumed constant across time blocks between observations. The number of subgroups was treated as missing data 30 seconds before each change was noted to introduce prior uncertainty about the precise timing of the change. For movement, two parameters relating to directional persistence and variation in speed were estimated by fitting a continuous time-correlated random walk model to spatially explicit photogrammetry data in the form of location tracks for focal individuals that were sequentially tracked throughout each CEE as a proxy for subgroup movement.
Movement parameters were assumed to be normally distributed. Whistle counts were treated as normally distributed but truncated at zero, because negative counts are not possible. Subgroup counts were assumed to be Poisson distributed as they were distinct, small values. In all cases, the response variable mean was modeled as a function of the HMM with a log link:
log(Response_t) = l0 + l1 · Z_t
where at each 5-s time block t, the hidden state took values of Z_t = 0 to identify one state with a baseline response level l0, or Z_t = 1 to identify an “enhanced” state, with l1 representing the enhancement of the quantitative value of the response variable. A flat uniform (-30,30) prior distribution was used for l0 in each response model, and a uniform (0,30) prior distribution was adopted for each l1 to constrain enhancements to be positive. For whistle and subgroup counts, the enhanced state indicated increased vocal activity and more subgroups. A common indicator variable was estimated for the latent state for both the movement parameters, such that switching to the enhanced state described less directional persistence and more variation in velocity. Speed was derived as a function of these two parameters and was used here as a proxy for their joint responses, representing directional displacement over time.
To assess differences in the behavior states between experimental phases, the block-specific latent states were modeled as a function of phase-specific probabilities, Z_t ~ Bernoulli(p_phase(t)), to learn about the probability p_phase of being in an enhanced state during each phase. For each of the pre-exposure, exposure, and post-exposure phases, this probability was assigned a flat uniform (0,1) prior. The model was programmed in R (version 3.6.1; The R Foundation for Statistical Computing) with the nimble package (de Valpine et al. 2020) to estimate posterior distributions of model parameters using Markov chain Monte Carlo (MCMC) sampling. Inference was based on 100,000 MCMC samples following a burn-in of 100,000, with chain convergence determined by visual inspection of three MCMC chains and corroborated by convergence diagnostics (Brooks and Gelman, 1998). To compare behavior across phases, we compared the posterior distributions of the p_phase parameters for each response variable, specifically by monitoring the MCMC output to assess the “probability of response” as the proportion of iterations for which p_exposure was greater or less than p_pre-exposure, and the “probability of persistence” as the proportion of iterations for which p_post-exposure was greater or less than p_pre-exposure. These probabilities of response and persistence thus estimated the extent of separation (non-overlap) between the distributions of pairs of p_phase parameters: if the two distributions of interest were identical, then p = 0.5, and if the two were non-overlapping, then p = 1. Similarly, we estimated the average values of the response variables in each phase by predicting phase-specific functions of the parameters:
Mean.response_phase = exp(l0 + l1 · p_phase)
and simply derived average speed as the mean of the speed estimates for 5-second blocks in each phase.
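As a simplified illustration of the "probability of response" comparison, the sketch below compares two phase probabilities whose posteriors are approximated as conjugate Beta distributions under the uniform prior. This is not the full HMM fit (which requires nimble and treats the states as latent); the counts of "enhanced" 5-s blocks are hypothetical:

```python
import random

random.seed(42)

def prob_greater(a1, b1, a2, b2, draws=20000):
    """Monte-Carlo estimate of P(p1 > p2) for p1 ~ Beta(a1,b1), p2 ~ Beta(a2,b2)."""
    hits = sum(random.betavariate(a1, b1) > random.betavariate(a2, b2)
               for _ in range(draws))
    return hits / draws

# Hypothetical numbers of 5-s blocks classified "enhanced" (illustrative only):
pre_enhanced, pre_total = 3, 40    # pre-exposure phase
exp_enhanced, exp_total = 25, 40   # exposure phase

# Uniform(0,1) prior => Beta(1 + k, 1 + n - k) posterior for each p_phase
p_response = prob_greater(1 + exp_enhanced, 1 + exp_total - exp_enhanced,
                          1 + pre_enhanced, 1 + pre_total - pre_enhanced)
print(p_response)  # near 1: the two posteriors barely overlap
```

With identical posteriors the estimate would sit near 0.5, matching the interpretation of the separation probability given above.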
Dataset Overview
| Attribute | Details |
|---|---|
| Time Span | 2015–2025 |
| Countries Included | 20 global economies |
| Total Records | 220 rows |
| Total Features | 12 quantitative & qualitative attributes |
| Data Type | Synthetic, statistically coherent |
| Tools Used | Python (Faker, NumPy, Pandas) |
| License | CC BY-NC 4.0 – Attribution Non-Commercial |
| Creator | Emirhan Akkuş – Kaggle Expert |
This dataset provides a macro-level simulation of how artificial intelligence and automation have transformed global workforce dynamics, productivity growth, and job distribution during the last decade. It is designed for predictive analytics, forecasting, visualization, and policy research applications.
Data Generation Process

| Step | Description |
| :--- | :--- |
| 1. Initialization | A baseline AI investment and automation rate were defined for each country (between 5–80 billion USD and 10–40%). |
| 2. Temporal Simulation | Yearly values were simulated for 2015–2025 using exponential and non-linear growth models with controlled noise. |
| 3. Correlation Modeling | Employment, productivity, and salary were dynamically linked to automation and AI investment levels. |
| 4. Randomization | Gaussian noise (±2%) was introduced to prevent perfect correlation and ensure natural variability. |
| 5. Policy Simulation | Synthetic indexes were calculated for AI readiness, policy maturity, and reskilling investment efforts. |
| 6. Export | Final data were consolidated and exported to CSV using Pandas for easy reproducibility. |
The dataset was generated to maintain internal coherence — as automation and AI investment increase, employment tends to slightly decline, productivity grows, and reskilling budgets expand proportionally.
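The generation steps can be sketched as follows. The specific growth rates, coupling strengths, and placeholder country names are assumptions for illustration, not the creator's exact code; only the record count (20 countries × 11 years = 220 rows) is taken from the overview above:

```python
import random

random.seed(0)

YEARS = range(2015, 2026)                            # 2015-2025 inclusive, 11 years
COUNTRIES = [f"Country_{i:02d}" for i in range(20)]  # placeholder names

rows = []
for country in COUNTRIES:
    invest = random.uniform(5, 80)        # baseline AI investment, billion USD
    automation = random.uniform(10, 40)   # baseline automation rate, %
    employment = random.uniform(60, 80)   # starting employment rate, %
    for year in YEARS:
        noise = random.gauss(0, 0.02)                          # ~±2% Gaussian noise
        invest *= 1.10 + noise                                 # exponential growth
        automation = min(95, automation * (1.05 + noise))      # capped growth
        employment = max(50, employment - 0.15 * (automation / 100))  # slight decline
        rows.append({
            "Year": year,
            "Country": country,
            "AI_Investment_BillionUSD": round(invest, 2),
            "Automation_Rate_Percent": round(automation, 2),
            "Employment_Rate_Percent": round(employment, 2),
        })

print(len(rows))  # 220
```

The negative employment coupling and capped automation rate mirror the internal coherence the dataset description claims.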
Column Definitions

| Column | Description | Value Range / Type |
| :--- | :--- | :--- |
| Year | Observation year between 2015–2025 | Integer |
| Country | Country name | Categorical (20 unique) |
| AI_Investment_BillionUSD | Annual AI investment (in billions of USD) | Continuous (5–200) |
| Automation_Rate_Percent | Percentage of workforce automated | Continuous (10–95%) |
| Employment_Rate_Percent | Percentage of total population employed | Continuous (50–80%) |
| Average_Salary_USD | Mean annual salary in USD | Continuous (25,000–90,000) |
| Productivity_Index | Productivity score scaled 0–100 | Continuous |
| Reskilling_Investment_MillionUSD | Government/corporate reskilling investment | Continuous (100–5,000) |
| AI_Policy_Index | Policy readiness index (0–1) | Float |
| Job_Displacement_Million | Estimated number of jobs replaced by automation | Continuous (0–3 million) |
| Job_Creation_Million | New AI-driven jobs created | Continuous (0–4 million) |
| AI_Readiness_Score | Composite readiness and adoption index | Continuous (0–100) |
Each feature is designed to maintain realistic relationships between AI investments, automation, and socio-economic outcomes.
Analytical Applications

| Application Area | Example Analyses |
| :--- | :--- |
| Exploratory Data Analysis (EDA) | Study how AI investment evolves across countries, compare productivity and employment patterns, or compute correlation... |
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two types of data: phase images and trained model files.
Real phase images - these phase images are contained in the files named with the prefix "real_". The data files are ".npz" archives, to be loaded with NumPy (np.load()) as a dictionary; the data is stored under the key ["arr_0"]. The images depict cells [1], organoids [2], phantoms [3-4] and regular 3D-printed structures with high scattering properties [5]. The images have been augmented to expand the volume of the training dataset. All images are of shape (256,256,1). This large dataset contains 27,189 images of each type for training the unwrapping model:
unwrapped - continuous phase distribution (float32)
wrapped - phase wrapped modulo 2π (float32)
wrapcount - wrap count phase maps coded in the integer form (0,1,2...) (uint8)
Synthetic phase images - the phase images in these files were generated algorithmically in MATLAB. The files containing this dataset have the prefix "synthetic_". The data files are ".npz" archives, to be loaded with NumPy (np.load()) as a dictionary; the data is stored under the key ["arr_0"]. Phase images in the synthetic dataset fall into 3 types: spherical distribution, simulated cells with a spherical background, and simulated cells with an introduced linear tilt. All images are of shape (256,256,1). This dataset contains 10,000 images of each type for training the unwrapping and denoising models:
unwrapped - continuous phase distribution (float32)
wrapped - phase wrapped modulo 2π (float32)
wrapcount - wrap count phase maps coded in the integer form (0,1,2...) (uint8)
noised - wrapped phase images w/ synthetic noise (float32).
Trained models - trained model files in the ".h5" format, which contains both the model architecture and the weights. They were developed and saved with the Keras library and are loaded with the keras.models.load_model() function. The models are:
Unet_Denoising.h5 - U-Net model used for denoising as an image translation task. The input is a wrapped phase image with noise and the output is the same wrapped phase distribution, but denoised. Model is trained on the synthetic phase dataset.
Attn_Unet_Unwrapping.h5 - U-Net model with Attention Gates and Residual Blocks trained for the semantic segmentation task. The input of the model is the wrapped phase image and its output is the wrap count map. Model is trained on the real phase dataset.
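A minimal sketch of the access pattern described above. The demo file name is hypothetical, and a stand-in array is written first because the actual dataset files are not bundled with this description:

```python
import numpy as np

# Create a small stand-in .npz with the dataset's layout (shape (256, 256, 1)),
# since the real "real_*.npz" / "synthetic_*.npz" files are not bundled here:
stand_in = np.zeros((256, 256, 1), dtype=np.float32)
np.savez("synthetic_unwrapped_demo.npz", stand_in)  # positional arg -> key "arr_0"

# Loading follows the description: np.load(), then the "arr_0" key
data = np.load("synthetic_unwrapped_demo.npz")["arr_0"]
print(data.shape, data.dtype)  # (256, 256, 1) float32

# The trained models load with Keras, e.g.:
#   from keras.models import load_model
#   unwrapper = load_model("Attn_Unet_Unwrapping.h5")
```

Arrays passed positionally to np.savez are stored under the keys "arr_0", "arr_1", ..., which is why the dataset's images live under ["arr_0"].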
[1] M. Baczewska, W. Krauze, A. Kuś, P. Stępień, K. Tokarska, K. Zukowski, E. Malinowska, Z. Brzózka, and M. Kujawińska, “On-chip holographic tomography for quantifying refractive index changes of cells’ dynamics,” in Quantitative Phase Imaging VIII, vol. 11970, Y. Liu, G. Popescu, and Y. Park, eds., International Society for Optics and Photonics (SPIE, 2022), p. 1197008.
[2] P. Stępień, M. Ziemczonok, M. Kujawińska, M. Baczewska, L. Valenti, A. Cherubini, E. Casirati, and W. Krauze, “Numerical refractive index correction for the stitching procedure in tomographic quantitative phase imaging,” Biomed. Opt. Express 13, 5709–5720 (2022).
[3] M. Ziemczonok, A. Kuś, P. Wasylczyk, and M. Kujawińska, “3d-printed biological cell phantom for testing 3d quantitative phase imaging systems,” Sci. Reports 9, 1–9 (2019).
[4] M. Ziemczonok, A. Kuś, and M. Kujawińska, “Optical diffraction tomography meets metrology — measurement accuracy on cellular and subcellular level,” Measurement 195, 111106 (2022).
[5] W. Krauze, A. Kuś, M. Ziemczonok, M. Haimowitz, S. Chowdhury, and M. Kujawińska, “3d scattering microphantom sample to assess quantitative accuracy in tomographic phase microscopy techniques,” Sci. Reports 12, 1–9 (2022).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains two types of data: phase images and trained model files.
Real phase images - these phase images are contained in the files named with the prefix "real_". The data files are ".npz" archives, to be loaded with NumPy (np.load()) as a dictionary; the data is stored under the key ["arr_0"]. The images depict cells [1], organoids [2], phantoms [3-4] and regular 3D-printed structures with high scattering properties [5]. The images have been augmented to expand the volume of the training dataset. All images are of shape (256,256,1). This large dataset contains 27,189 images of each type for training the unwrapping model:
unwrapped - continuous phase distribution (float32)
wrapped - phase wrapped modulo 2π (float32)
wrapcount - wrap count phase maps coded in the integer form (0,1,2...) (uint8)
Synthetic phase images - the phase images in these files were generated algorithmically in MATLAB. The files containing this dataset have the prefix "synthetic_". The data files are ".npz" archives, to be loaded with NumPy (np.load()) as a dictionary; the data is stored under the key ["arr_0"]. Phase images in the synthetic dataset fall into 3 types: spherical distribution, simulated cells with a spherical background, and simulated cells with an introduced linear tilt. All images are of shape (256,256,1). This dataset contains 10,000 images of each type for training the unwrapping and denoising models:
unwrapped - continuous phase distribution (float32)
wrapped - phase wrapped modulo 2π (float32)
wrapcount - wrap count phase maps coded in the integer form (0,1,2...) (uint8)
noised - wrapped phase images w/ synthetic noise (float32).
Trained models - trained model files in the ".h5" format, which contains both the model architecture and the weights. They were developed and saved with the Keras library and are loaded with the keras.models.load_model() function. The models are:
Unet_Denoising_1.h5 - U-Net model used for denoising as an image translation task. The input is a wrapped phase image with noise and the output is the same wrapped phase distribution, but denoised. Model is trained on the synthetic phase dataset.
Unet_Denoising_2.h5 - Similar model to the Unet_Denoising_1.h5, which denoises wrapped phase images with equally good performance.
Attn_Unet_Unwrapping.h5 - U-Net model with Attention Gates and Residual Blocks trained for the semantic segmentation task. The input of the model is the wrapped phase image and its output is the wrap count map. Model is trained on the real phase dataset.
[1] M. Baczewska, W. Krauze, A. Kuś, P. Stępień, K. Tokarska, K. Zukowski, E. Malinowska, Z. Brzózka, and M. Kujawińska, “On-chip holographic tomography for quantifying refractive index changes of cells’ dynamics,” in Quantitative Phase Imaging VIII, vol. 11970 Y. Liu, G. Popescu, and Y. Park, eds., International Society for Optics and Photonics (SPIE, 2022), p. 1197008.
[2] P. Stępień, M. Ziemczonok, M. Kujawińska, M. Baczewska, L. Valenti, A. Cherubini, E. Casirati, and W. Krauze, “Numerical refractive index correction for the stitching procedure in tomographic quantitative phase imaging,” Biomed. Opt. Express 13, 5709–5720 (2022).
[3] M. Ziemczonok, A. Kuś, P. Wasylczyk, and M. Kujawińska, “3d-printed biological cell phantom for testing 3d quantitative phase imaging systems,” Sci. Reports 9, 1–9 (2019).
[4] M. Ziemczonok, A. Kuś, and M. Kujawińska, “Optical diffraction tomography meets metrology — measurement accuracy on cellular and subcellular level,” Measurement 195, 111106 (2022).
[5] W. Krauze, A. Kuś, M. Ziemczonok, M. Haimowitz, S. Chowdhury, and M. Kujawińska, “3d scattering microphantom sample to assess quantitative accuracy in tomographic phase microscopy techniques,” Sci. Reports 12, 1–9 (2022).
Social Impact (SI) is conducting an impact evaluation of the MCC Tanzania Water Sector Project (WSP). The impact of the WSP will be assessed through a rigorous, quasi-experimental impact evaluation design that combines a difference-in-differences (DD) approach with generalized propensity score matching (GPSM), also called continuous propensity score matching. GPSM is an extension of traditional propensity score matching that facilitates evaluating the impact of a continuous rather than binary treatment. The design reflects particular characteristics of the Tanzania WSP. First, the impacts of the upgraded water infrastructure are expected to be diffuse in each city; therefore, identifying a counterfactual through experimental methods is not feasible. Further, the main treatment is considered to be exposure to an increased supply of water due to the Water Sector Project infrastructure upgrades, and households will be affected differentially depending on their starting conditions (e.g. availability of water) and their position along the distribution grid. Thus, a continuous treatment approach is needed to measure the impacts of incremental increases in water supply. The GPSM technique (which will be carried out after the completion of end-line data collection) enables comparisons of outcomes between similar households that experience varying levels of improvement to water supply due to the intervention. The evaluation questions to be answered address a range of topics, including: the project's impact on water supply, access to water, and water quality; the project's impact on water consumption, water-related illness, and investment in human capital; differences in project impact by gender and socioeconomic status; the project's effect on businesses, schools, and health centers; project implementation; unintended consequences of the project; and the sustainability of the project over time.
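In its simplest binary-treatment form (the actual design uses a continuous treatment via GPSM, so this is a deliberately reduced sketch with hypothetical numbers), the DD logic nets out common trends by differencing the changes in the two groups:

```python
# Hypothetical group means of a water-access outcome (hours of supply per week):
treated_pre, treated_post = 20.0, 38.0   # high-exposure households
control_pre, control_post = 21.0, 27.0   # low-exposure households

# DD estimate: change in treated minus change in control removes any trend
# common to both groups (e.g. city-wide improvements unrelated to the project)
dd_effect = (treated_post - treated_pre) - (control_post - control_pre)
print(dd_effect)  # 12.0 extra hours attributable to the intervention (illustrative)
```

GPSM generalizes this by matching households on a continuous dose of treatment rather than splitting them into two groups.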
In addition to the main analysis described above, qualitative, direct observation (e.g. water quality tests), secondary data review, and geospatial data collection components were incorporated to facilitate comprehensive, context-specific responses to these evaluation questions.
Urban municipalities of the cities of Dar es Salaam (Ilala, Kinondoni, and Temeke) and Morogoro (Morogoro Urban)
Main analysis: households and individuals. Some analyses using water quality or supply data are done at the cluster level (enumeration areas). Qualitative analyses used data collected from community members, project stakeholders, enterprises, health centers, and schools.
The household and phone surveys were administered to one respondent per household, and collected information corresponding to the household as well as to each current household member (usual residents). The water quality tests were administered to up to two sources per cluster (either household tap, or other shared source in the cluster). The qualitative components included focus group discussions of residents across each city, semi-structured interviews of community-level water sector stakeholders, and key informant interviews of key project stakeholders.
Sample survey data [ssd]
Households were sampled from both cities using a two-stage cluster sampling methodology, with stratification in Dar es Salaam by the current water supply to an area. Clusters were defined as census enumeration areas (EAs). The sample frame for clusters was an inventory of enumeration areas used for the 2012 census in Tanzania, obtained from the Tanzania National Bureau of Statistics (NBS). The required sample size was 5008 households (8 households from 626 clusters), split between the two cities evenly. In Morogoro, 313 clusters were randomly sampled from the master inventory. In Dar es Salaam, with the availability of information about water supply by ward, clusters were chosen by stratified random sampling, out of 5 strata corresponding to different levels of current water access through the public distribution network. Selection of clusters was done using a random number generator in Stata 12 software. After selecting 313 clusters in each city, maps were obtained from the NBS. For each of the selected clusters, listing teams worked with local community representatives to enumerate all households in each EA and generate a complete sample frame of households. From each cluster's household list, 8 households per cluster (EA) were randomly selected for the household survey using a unique random number table for each cluster; additional households from the list could be accessed in order to replace households as needed due to non-response. After the households were interviewed, a sub-set of eligible households were selected for water quality testing (up to two per cluster). Following the household survey, the full household sample was included in three rounds of a follow-up survey administered by phone, by the EDI team.
If the listing team encountered any EAs in either city that had been demarcated strictly as an institution (e.g. hospital, school, jail) with only staff residing in the cluster, but had not been previously excluded from the sample frame, that EA was replaced by the next eligible EA from the list based on its random number, and the institutional cluster was excluded from the sample frame altogether. If community members or local officials declined to be involved in the surveying for any reason, that cluster was replaced. No deviations were made in the sampling procedures for the household survey. For the water quality testing, a much smaller sample of household taps was available for testing compared to initial expectations, so the eligibility for water quality tests was expanded at the beginning of these exercises to include shared sources in the community. Qualitative sampling was purposive and therefore was tailored to the specific objective of interviewing each type of respondent; while focus groups were initially planned to be a mix of males and females, after the first focus group the team decided to limit the participants to females only.
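The two-stage selection described above can be sketched as follows; the inventory size, EA identifiers, and listed-household counts below are placeholders, and only the counts (313 clusters per city, 8 households per cluster) come from the text:

```python
import random

random.seed(2013)

# Stage 1: randomly sample clusters (EAs) from the master inventory for one city
master_inventory = [f"EA_{i:04d}" for i in range(1, 2001)]   # placeholder EA ids
clusters = random.sample(master_inventory, 313)

# Stage 2: after listing all households in a cluster, select 8 for the survey,
# keeping the remaining listed households as an ordered replacement queue
def sample_cluster(cluster_id, n_listed, k=8):
    listed = [f"{cluster_id}_HH_{j:03d}" for j in range(1, n_listed + 1)]
    order = random.sample(listed, n_listed)   # unique random order per cluster
    return order[:k], order[k:]               # (selected, replacements for non-response)

selected, replacements = sample_cluster(clusters[0], n_listed=120)
print(len(clusters), len(selected))  # 313 8
```

Drawing a full random order per cluster mirrors the described use of a unique random number table: the first 8 households are interviewed, and later entries replace non-responders in order.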
Household Survey; Phone Survey
EDI employed a data processing and quality control team, which was tasked with ensuring the quality of data collected through the Surveybe system. Daily checks of questionnaire data were conducted in Stata using a continually updated checking do-file, which flagged discrepancies and data inconsistencies. Each supervisor was provided a set of data checks to address with each team on a continuous basis. Data processing and quality control staff were primarily based at EDI headquarters, but were present in the field for several weeks during the beginning stages of each phase of data collection. This presence allowed them to participate in feedback sessions with interviewers, demonstrating how data checks were conducted and how errors would be communicated to supervisors. The data processing team updated their checks periodically to accommodate new checks arising during the survey period, often in coordination with SI.

SI's data quality monitoring strategy included continuous technical support to EDI over the entire period of data collection, field presence at all critical junctures during preparations for data collection, and several rounds of independent data verification with interim datasets provided throughout the survey period by EDI. SI wrote do-files (using Stata 12) to monitor the quality of the data as it was received, updating them continuously and making adjustments after ongoing communication with EDI's data manager and data processing teams. SI ran through each of the datasets at numerous points between May and September 2013, communicating concerns or questions to EDI through a standard form used throughout the period of data collection. After the conclusion of data collection, SI conducted a comprehensive data quality review of all datasets, inclusive of quantitative and qualitative datasets, and submitted requests resulting from this review to EDI.
EDI responded to these requests and subsequently delivered final datasets to SI via MCA-T. EDI produced several briefs on data quality assurance during data collection, and SI produced a data quality report for internal review by MCC.
Response rates for the household survey in Dar es Salaam were above 87%, and above 92% in Morogoro. Overall, for the phone follow-up survey, response rates were 85% (round 1), 88% (round 2), and 90% (round 3); 81% of households overall participated in all 3 rounds while 90% participated in at least one round. Water quality test samples were drawn from household taps when available, or otherwise from other shared-source locations in the survey cluster (water quality results are intended to be representative at the cluster level). In Dar es Salaam, 95% of sampled clusters were covered by water quality tests, along with 99% of sampled clusters in Morogoro. Sampling for qualitative data collection components was purposive, in order to include specific types of respondents and target areas of each city with specific characteristics; this purposive sampling made extensive use of preliminary quantitative survey data and geospatial data. Qualitative research included, in total, 14 focus group discussions, 52 semi-structured interviews, and 10 key informant interviews.
Sampling errors were calculated for all estimated quantities of indicators when
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
In recent decades, supported by a number of key policy steps, the scale of China’s sports industry has achieved a new leap, and the optimization of its industrial structure has made new progress. In growing from "nascent" to "strong", China’s sports industry has gained importance within the national economy. At the same time, sport is a significant way to promote health. With the rapid growth of people’s requirements for sport and health, it is urgent to re-evaluate the past development path and formulate new directions so as to continuously improve and optimize the system. This study systematically sorts out China’s sports industry documents at different stages and describes the focus of each stage and the overall evolution track. On this basis, text mining and quantitative evaluation are used to extract high-frequency words from sports industry documents, and a sports industry document evaluation system comprising 9 first-level indicators and 47 second-level indicators is established. Text similarity analysis is used to realize intelligent PMC index analysis, which improves analysis efficiency and makes up for the deficiencies of purely qualitative analysis. The study finds that China’s sports industry policies are scientific and effective and, combined with the direction of industrial transformation, provides ideas for the future adjustment and optimization of the sports industry’s evolution path.
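The high-frequency word extraction and text-similarity steps can be illustrated with a toy bag-of-words sketch. This is only an assumed simplification of the method described above; the study's actual tooling and weighting scheme are not specified, and the example texts are invented.

```python
import math
from collections import Counter

def term_freqs(text):
    """Bag-of-words term frequencies for one policy document."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented stand-ins for two policy documents.
policy_a = term_freqs("promote sports industry development and national health")
policy_b = term_freqs("sports industry development policy evaluation and optimization")

similarity = cosine_similarity(policy_a, policy_b)
top_terms = (policy_a + policy_b).most_common(3)  # high-frequency words
```

High-frequency terms across the corpus would then seed the indicator system, while pairwise similarities support the automated PMC-style scoring.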
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Sample of family names (U.S. Census) and given names (mortgage data).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Blockchain technology is now widely used in almost every domain, including the healthcare sector. Although frameworks exist to govern healthcare data, they have limitations that reduce the effectiveness of data governance in ensuring security and privacy. This study aimed to evaluate the effectiveness of healthcare data governance frameworks, examining security and privacy concerns and limitations within the existing frameworks of ISO standards, GDPR, and HIPAA. A quantitative research approach was followed. A sample of 250 participants was selected from Islamabad, Lahore, and Karachi, comprising healthcare experts, IT specialists, blockchain researchers and developers, and administrators. The collected data were analyzed through frequencies and descriptive statistical tests with the help of SPSS. The results revealed dissatisfaction with the existing data governance frameworks (ISO standards, GDPR, and HIPAA) with respect to security concerns (data encryption, access controls, audit trails, interoperability and standards, smart contracts for compliance, data integrity, and regulatory compliance monitoring) and privacy concerns (consent management, anonymization and pseudonymization, and data minimization). Participants agreed that a reliable data governance framework needs to be integrated into healthcare data management. Based on the findings, the study proposes personalized governance techniques, targeted security upgrades, and continuous improvement within a customized data governance framework, and recommends implementing blockchain-based systems to ensure and extend the security and privacy of healthcare data management.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Continuous norming methods have seldom been subjected to scientific review. In this simulation study, we compared parametric with semi-parametric continuous norming methods in psychometric tests by constructing a fictitious population model within which a latent ability increases with age across seven age groups. We drew samples of different sizes (n = 50, 75, 100, 150, 250, 500 and 1,000 per age group) and simulated the results of an easy, medium, and difficult test scale based on Item Response Theory (IRT). We subjected the resulting data to different continuous norming methods and compared the data fit under the different test conditions with a representative cross-validation dataset of n = 10,000 per age group. The most significant differences were found in suboptimal (i.e., too easy or too difficult) test scales and in ability levels that were far from the population mean. We discuss the results with regard to the selection of the appropriate modeling techniques in psychometric test construction, the required sample sizes, and the requirement to report appropriate quantitative and qualitative test quality criteria for continuous norming methods in test manuals.
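The simulation logic (draw small norming samples from a population in which ability rises with age, fit a continuous norm model, and check its fit against a large cross-validation sample) can be sketched as follows. This is a deliberately simplified toy model with a linear norm and normally distributed scores, not the study's IRT-based setup.

```python
import random
import statistics

random.seed(1)
ages = range(7, 14)  # seven age groups, as in the study design

def draw_scores(age, n):
    """Latent ability increases linearly with age; scores are noisy observations."""
    return [2.0 * age + random.gauss(0, 3) for _ in range(n)]

# Small norming sample vs. large cross-validation sample per age group.
norm_sample = {a: draw_scores(a, 100) for a in ages}
cross_val = {a: draw_scores(a, 10000) for a in ages}

# Parametric continuous norm: fit the score mean as a linear function of age
# by ordinary least squares over the pooled norming sample.
xs = [a for a in ages for _ in norm_sample[a]]
ys = [s for a in ages for s in norm_sample[a]]
mx, my = statistics.mean(xs), statistics.mean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Compare the fitted norm means against the cross-validation group means.
errors = [abs((slope * a + intercept) - statistics.mean(cross_val[a])) for a in ages]
mean_error = statistics.mean(errors)
```

The study's comparison of parametric and semi-parametric methods replaces this linear fit with the competing norming models and repeats the exercise across sample sizes and scale difficulties.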
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Mean absolute prediction error in the N-fold cross-validation set.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Summary of principal leadership styles mentioned in the literature background paragraph.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Descriptive statistics on water, sanitation and hygiene in study villages based on data from school heads.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Stemming from the traditional use of field observers to score states and events, the study of animal behaviour often relies on analyses of discrete behavioural categories. Many studies of acoustic communication record sequences of animal sounds, classify vocalizations, and then examine how call categories are used relative to behavioural states and events. However, acoustic parameters can also convey information independent of call type, offering complementary study approaches to call classifications. Animal-attached tags can continuously sample high-resolution behavioural data on sounds and movements, which enables testing how acoustic parameters of signals relate to parameters of animal motion. Here we present this approach through case studies on wild common bottlenose dolphins (Tursiops truncatus). Using data from sound-and-movement recording tags deployed in Sarasota (FL), we parameterized dolphin vocalizations and motion to investigate how senders and receivers modified movement parameters (including vectorial dynamic body acceleration, “VeDBA”, a proxy for activity intensity) as a function of signal parameters. We show that: 1) VeDBA of one female during consortships had a negative relationship with centroid frequency of male calls, matching predictions about agonistic interactions based on motivation-structural rules; 2) VeDBA of four males had a positive relationship with modulation rate of their pulsed vocalizations, confirming predictions that click-repetition rate of these calls increases with agonism intensity. Tags offer opportunities to study animal behaviour through analyses of continuously-sampled quantitative parameters, which can complement traditional methods and facilitate research replication. 
Our case studies illustrate the value of this approach to investigate communicative roles of acoustic parameter changes.
Dataset containing acoustic and motion parameters derived as time series from Dtag data, during periods of bottlenose dolphin social interaction. The Dtag analyzed was deployed on a female bottlenose dolphin in Sarasota Bay, FL.
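VeDBA, the activity-intensity proxy used above, is conventionally computed as the vector norm of the dynamic (gravity-removed) triaxial acceleration, with a running mean serving as the estimate of the static component. A minimal sketch follows; the window length is a tuning choice, and this is not the tag manufacturer's processing pipeline.

```python
import math

def running_mean(xs, window):
    """Trailing running mean, used as an estimate of the static (gravity) component."""
    out = []
    for i in range(len(xs)):
        seg = xs[max(0, i - window + 1): i + 1]
        out.append(sum(seg) / len(seg))
    return out

def vedba(ax, ay, az, window=5):
    """Vectorial dynamic body acceleration per sample: the Euclidean norm of
    raw acceleration minus its running-mean (static) component on each axis."""
    sx = running_mean(ax, window)
    sy = running_mean(ay, window)
    sz = running_mean(az, window)
    return [math.sqrt((x - u) ** 2 + (y - v) ** 2 + (z - w) ** 2)
            for x, y, z, u, v, w in zip(ax, ay, az, sx, sy, sz)]
```

A stationary animal yields VeDBA near zero (the running mean absorbs the constant gravity vector), while bursts of movement produce positive excursions that can be regressed against acoustic parameters such as centroid frequency or modulation rate.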
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Themes are presented with example quotations from each round.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Chi-square or Fisher’s exact test for categorical variables; Wilcoxon Normal for continuous variables.
*Calculated as the ratio of the number of tablets or gel not returned over expected product use days, cumulatively across all follow-up visits.
**Estimated at 3-month visit. Sample size is smaller because 60 participants missed the first quarterly visit; calculated as percentage of days in past week with self-reported product use.
Note: ACASI = audio computer-assisted self-interview. CRF = case report form. ns = non-significant. All continuous variable summaries are median, mean (minimum, maximum).
License: GNU GPL v2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
DATA EXPLORATION Understand the characteristics of given fields in the underlying data, such as variable distributions, whether the dataset is skewed towards a certain demographic, and the data validity of the fields. For example, a training dataset may be highly skewed towards the younger age bracket; if so, how will this impact your results when using it to predict over the remaining customer base? Identify limitations surrounding the data and gather external data which may be useful for modelling purposes. This may include bringing in ABS data at different geographic levels and creating additional features for the model. For example, the geographic remoteness of different postcodes may be used as a proximity indicator when considering whether a customer needs a bike to ride to work.
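The demographic-skew check described above can be sketched in a few lines. The records, bracket labels, and the 50% dominance cut-off are all illustrative assumptions, not part of the actual dataset.

```python
from collections import Counter

# Hypothetical training records; 'age_bracket' mirrors the field discussed above.
customers = [
    {"id": 1, "age_bracket": "18-24"}, {"id": 2, "age_bracket": "18-24"},
    {"id": 3, "age_bracket": "18-24"}, {"id": 4, "age_bracket": "25-34"},
    {"id": 5, "age_bracket": "55+"},
]

counts = Counter(c["age_bracket"] for c in customers)
shares = {bracket: n / len(customers) for bracket, n in counts.items()}

# If one bracket dominates, predictions over the remaining customer base may
# not generalise; the 0.5 threshold here is an arbitrary illustration.
skewed = max(shares.values()) > 0.5
```

The same pattern extends to any categorical field (gender, state, wealth segment) when profiling the training data against the wider customer base.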
MODEL DEVELOPMENT Determine a hypothesis related to the business question that can be answered with the data, and perform statistical testing to determine whether the hypothesis is valid. Create calculated fields based on existing data; for example, convert the date of birth into an age bracket. Other fields that may be engineered include ‘High Margin Product’, an indicator of whether the product purchased by the customer in the past three months is in a high-margin category, based on the fields ‘list_price’ and ‘standard cost’. Another example is calculating the distance from office to home address as a factor in determining whether customers may purchase a bicycle for transportation purposes. This stage also includes determining what the predicted variable actually is: are results predicted in ordinal buckets, nominal categories, binary outcomes, or a continuous value? Test the performance of the model using measures relevant to the chosen model (e.g. residual deviance, AIC, ROC curves, R-squared). Appropriately document model performance, assumptions and limitations.
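The two engineered fields mentioned above can be sketched as follows. The bracket boundaries, reference date, and margin threshold are illustrative assumptions, not values from the exercise.

```python
from datetime import date

def age_bracket(dob, today=date(2024, 1, 1)):
    """Convert a date of birth into an age bracket (boundaries are illustrative)."""
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    if age < 25:
        return "under 25"
    if age < 45:
        return "25-44"
    return "45+"

def high_margin(list_price, standard_cost, threshold=0.5):
    """Flag a product whose margin ratio exceeds a (hypothetical) threshold,
    derived from the 'list_price' and 'standard cost' fields."""
    return (list_price - standard_cost) / list_price > threshold
```

Derived fields like these become candidate predictors; their usefulness is then judged with the model-performance measures listed above.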
INTERPRETATION AND REPORTING Visualisation and presentation of findings. This may involve interpreting the significant variables and coefficients from a business perspective. These slides should tell a compelling story around the business issue and support your case with quantitative and qualitative observations. Please refer to module below for further details
The dataset is easy to understand and self-explanatory!
It is important to keep in mind the business context when presenting your findings:
1. What are the trends in the underlying data?
2. Which customer segment has the highest customer value?
3. What do you propose should be the marketing and growth strategy?