MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This Kaggle dataset comes from an output dataset that powers my March Madness Data Analysis dashboard in Domo. The dashboard is also featured in a Domo blog post, "Hoops, Data, and Madness: Unveiling the Ultimate NCAA Dashboard."
This dataset offers one of the most robust resources you will find for discovering key insights through data science and analytics using historical NCAA Division 1 men's basketball data. The data, sourced from KenPom, goes as far back as 2002 and is updated with the latest 2025 data. The dataset is meticulously structured to provide every piece of information I could pull from the site as an open resource for March Madness analysis.
Key features of the dataset include:
- Historical Data: All historical KenPom data from 2002 to 2025 from the Efficiency, Four Factors (Offense & Defense), Point Distribution, Height/Experience, and Misc. Team Stats endpoints on KenPom's website. Please note that the Height/Experience data only goes back to 2007, but every other source contains data from 2002 onward.
- Data Granularity: An individual line item for every NCAA Division 1 men's basketball team in every season, containing every KenPom metric available. This dataset can serve as a single source of truth for your March Madness analysis, with the granularity necessary for nearly any type of analysis.
- 2025 Tournament Insights: All seed and region information for the 2025 NCAA March Madness tournament. Please note that I will continually update this dataset with seed and region information for previous tournaments as I continue to work on it.
These datasets were created by downloading the raw CSV files for each season from the various sections of KenPom's website (Efficiency, Offense, Defense, Point Distribution, Summary, Miscellaneous Team Stats, and Height). All of the raw files were uploaded to Domo and imported into a dataflow using Domo's Magic ETL. In these dataflows, the column headers for each previous season are standardized to the current 2025 naming structure so that all of the historical data can be viewed under the exact same field names. The cleaned datasets are then appended together, and some additional clean-up takes place before creating the intermediate (INT) datasets that are uploaded to this Kaggle dataset.

Once all of the INT datasets were created, I joined the tables together on team name and season so that all of these different metrics can be viewed in one single view. From there, I joined an NCAAM Conference & ESPN Team Name Mapping table to add each conference's full name and the acronym it is known by, as well as the team name that ESPN currently uses. Please note that this reference table is an aggregated view of all of the different conferences a team has been a part of since 2002 and the different team names that KenPom has used historically, so this mapping table is necessary to map all of the teams properly and differentiate historical conferences from current ones.

From there, I join a reference table of all current NCAAM coaches and their active coaching lengths, because active coaching length typically correlates with a team's success in the March Madness tournament. I also join a reference table of historical post-season tournament teams in the March Madness, NIT, CBI, and CIT tournaments, and another reference table that differentiates the teams ranked in the top 12 of the AP Top 25 during week 6 of the respective NCAA season.
After some additional data clean-up, all of this cleaned data is exported into the "DEV _ March Madness" file, which contains the consolidated view of all of this data.
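The standardize-append-join flow described above can be sketched in pandas. The frames, column names, and values below are illustrative stand-ins, not the actual KenPom export headers:

```python
import pandas as pd

# Hypothetical mini-versions of two seasons' efficiency files; the real
# KenPom exports have different and far more numerous columns.
eff_2024 = pd.DataFrame({"TeamName": ["Duke"], "Season": [2024], "AdjOE": [121.3]})
eff_2025 = pd.DataFrame({"Team": ["Duke"], "Year": [2025], "AdjO": [123.1]})

# Step 1: standardize historical headers to the current 2025 naming structure.
eff_2024 = eff_2024.rename(columns={"TeamName": "Team", "Season": "Year", "AdjOE": "AdjO"})

# Step 2: append the standardized seasons into one intermediate (INT) table.
efficiency_int = pd.concat([eff_2024, eff_2025], ignore_index=True)

# Step 3: join INT tables on team name and season to build the consolidated view.
four_factors_int = pd.DataFrame({"Team": ["Duke", "Duke"], "Year": [2024, 2025],
                                 "eFGPct": [55.2, 56.0]})
combined = efficiency_int.merge(four_factors_int, on=["Team", "Year"], how="left")
```

The same pattern extends to the conference-mapping and coach reference tables: each is another `merge` keyed on team name (and season where applicable).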
This dataset gives users the flexibility to export data for further analysis in platforms such as Domo, Power BI, Tableau, Excel, and more. It is designed for users who wish to conduct their own analysis, develop predictive models, or simply gain a deeper understanding of the intricacies behind the excitement that Division 1 men's college basketball provides every March. Whether you are using this dataset out of academic, personal, or professional interest, I hope it serves as a foundational tool for exploring college basketball's most riveting and anticipated event of the season.
The project likely increased the incomes of most but not all participants. The spillover effects of changes in household income were not as wide as originally anticipated. In general, we found that the program is much more effective for high-performing households. Indeed, the upper-quantile, high-performing households exhibit a 50% larger impact on their income in targeted activities, and their observed household living standards (as measured by per-capita consumption expenditures) increase significantly 2-3 years after joining the RBD program. In contrast, the lower-quantile households show no increase in living standards, even after 3-4 years in the program.
The project was delivered in two small districts in northwest Nicaragua, León and Chinandega, which together cover a rather small geographic area.
Producers: those persons living on the farm who make the decisions about the farm's production and inputs to production
The sample list contained information about potential farmer leaders, the location of their farms, the communities where the eligible farmers could be found, and a radius of coverage within which about 30 farmers could be found (using the leader's farm as the origin). The program did not have a complete list of names of potential satellite farmers. In order to get more precise information about the number and location of eligible farmers around the leader, a quasi-census of eligible farmers was carried out, using specific criteria provided by the RBD Program for each type of activity (Table 2). These criteria specified minimum and maximum farm sizes and minimum levels of farmer experience in the target crops, and also stipulated that it must be possible to reach the farm by road during all seasons. Starting at the leader's farm, the quasi-census verified the characteristics of all neighboring farmers until a sampling quota of 30 eligible farmers was reached, or until the maximum radius was reached. Using the quasi-census, 3,000 farmers were identified, spread over 140 geographical units (clusters). From every list of clusters, we expected to randomly select 12 farmers.
Sample survey data [ssd]
The challenge of this and all impact evaluation efforts is to identify a control group that is identical to the treatment group in every way except that it has not benefitted from the intervention under evaluation. The evaluation team worked with the RBD implementation team to identify all geographic clusters that would eventually receive RBD services. The evaluation team then selected a subset of these clusters for random assignment to either early or late treatment status. This strategy not only created a temporary conventional control group, it also randomized the duration of time in the program, a feature that will prove vital in the continuous treatment estimates presented below. In late treatment clusters, services were not initiated until approximately 18 months later, in early 2009, at the time of the midline survey. Because clusters were randomly allocated to early and late treatment conditions, we can anticipate that on average the late treatment group should function as a valid control group, identical to the early group in every way except early receipt of RBD services. The economic status of the late group in 2009 should thus be a good predictor of what the status of the early group would have been in the absence of RBD services. Both early and late treatment clusters were then surveyed again near the end of the program in 2011. Once the random assignment of early and late clusters was made, the impact evaluation team created a roster of all eligible producers in these clusters and randomly selected a sample of 1,600 households split between early and late areas. These 1,600 households were then invited to participate in the impact study and completed a baseline survey in late 2007, just as the RBD project was beginning in the early treatment clusters. Within these clusters, 64% of the eligible households chose to participate in the RBD project.
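The cluster-level randomization described above can be sketched as follows. The 140-cluster count comes from the sampling description; the even early/late split, the cluster IDs, and the seed are illustrative assumptions:

```python
import random

random.seed(2007)  # arbitrary seed (the baseline year), for reproducibility

# Hypothetical cluster IDs standing in for the 140 geographical units.
clusters = [f"cluster_{i:03d}" for i in range(140)]
shuffled = random.sample(clusters, k=len(clusters))

# Assume an even split for illustration: the first half receives RBD services
# early; the other half roughly 18 months later, serving as the temporary
# control group in the interim.
early = set(shuffled[:70])
assignment = {c: ("early" if c in early else "late") for c in clusters}
```

Because assignment is random at the cluster level, the late group's 2009 outcomes estimate the early group's counterfactual, which is the identification strategy the text relies on.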
A second-round survey was applied to all 1600 households in the first quarter of 2009, just as the RBD project was rolled out in the late treatment area. While it was not clear at baseline which of the eligible households in the late treatment areas would choose to participate in the project, those households made their participation decision around the time of the second-round survey. Similar to the early treatment clusters, 57% of eligible households in late treatment clusters elected to participate. Because the timing of the surveys and project rollout allow determination of farmer type in both early and late treatment areas (participants versus non-participants), the impact evaluation has the opportunity to study impacts on both eligible households as well as impacts on participating or complier households. The evaluation here will primarily focus on the complier households as we are interested in the impact of the program on the types of self-selecting individuals who adopt it.
In some cases, the number of eligible farmers within the permitted radius was insufficient for the creation of a nucleus, and these potential farmers were therefore not included in the original sample. In numerous cases, the quota of 30 farmers was difficult to reach. Combined with the fact that 4% of farmers refused to be interviewed and that some 10% were deemed ineligible at the time of the baseline survey, this resulted in slightly fewer surveys per cluster than originally planned.
In practice, there were fewer eligible farmers than we initially assumed.
Regarding the variables used to compute the aggregate expenditures, the evaluation team performed the following tasks in the cleaning process:
1) Identification of mistyped data by finding extreme values in the growth of per-capita durable and non-durable aggregate expenditures.
2) Review of every missing value to verify whether it was mistyped data.
3) Consistency checks between sections 3.C, 3.CA, and 3.CO to verify that no information was left untyped.
In most cases, we identified that the enumerator had written an incorrect code. However, enumerators were encouraged to write observations if they had any doubt about the farmer's answer. This type of information was key for the data-cleaning process.
In other cases, wrong codes for frequency or total value were evident, but there was no additional information from the enumerator (e.g., a household consuming 50 pounds of sugar per day). By comparing this information with the other survey round, and considering that the size of the household had not changed, we concluded that the household consumed the same amount of food and that the recorded frequency or value was what was incoherent.
Finally, if a household had only one missing value in only one round of the survey, we imputed a value for that unique missing value. For example, if the missing value was a food value, we took the average of the value of the same food declared by other households living in the same municipality.
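A minimal pandas sketch of two of these steps — flagging implausible values for manual review against the other round, and imputing a lone missing value with the municipality average. The records, column names, and review threshold are made up for illustration:

```python
import pandas as pd

# Illustrative household food-consumption records; names are hypothetical.
df = pd.DataFrame({
    "household": [1, 2, 3, 4, 5],
    "municipality": ["Leon", "Leon", "Leon", "Chinandega", "Chinandega"],
    "sugar_lbs_per_day": [0.5, 0.4, 50.0, 0.6, None],
})

# Flag implausible quantities for manual review (e.g., 50 pounds of sugar
# per day); the threshold of 5 is illustrative, not the team's actual rule.
df["review"] = df["sugar_lbs_per_day"] > 5

# Impute a lone missing value with the average declared by other households
# in the same municipality.
df["sugar_lbs_per_day"] = df.groupby("municipality")["sugar_lbs_per_day"] \
    .transform(lambda s: s.fillna(s.mean()))
```

Household 5's missing value is filled with the Chinandega average, mirroring the municipality-mean imputation described above.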
At the end of this second sampling stage, 1,600 farmers (and their households) were interviewed (see Table 6). There are slightly more early (treatment) farmers than late (control) farmers. Within the blocks, there is an uneven number of interviews between early and late groups, especially in the sesame activity. Some sesame areas contained fewer eligible farmers, resulting in a lower number of interviews per GU. Across departments, the largest differences are found in some bean GUs: Chinandega has twice as many bean GUs as León. This difference is mainly explained by the GUs being spread across four municipalities in Chinandega and only two municipalities in León.
Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
Analyst: Alexandra Loop Date: 12/02/2024
Business Task:
Questions to be Answered:
- What are trends in non-Bellabeat smart device usage?
- What do these trends suggest for Bellabeat customers?
- How could these trends help influence Bellabeat marketing strategy?
Description of Data Sources:
Data Set to be studied: FitBit Fitness Tracker Data: Pattern Recognition with tracker data: Improve Your Overall Health
Data privacy: Data was sourced from a public dataset available on Kaggle. Information has been anonymized prior to being posted online.
Bias: Due to the degree of anonymity in this study, the only demographic data available is weight, and other cultural differences or lifestyle requirements cannot be accounted for. The sample size is quite small, and the time period of the study is only a month, so the observer effect could conceivably still be influencing the sample groups. We also have no information on the weather in the region studied; April and May are highly variable months in terms of accessible outdoor activities.
Process:
Cleaning Process: After going through the data to find duplicates, whitespace, and nulls, I have determined that this set of data has been well-cleaned and already aggregated into several reasonably sized spreadsheets.
Trim: No issues found
Consistent length ID: No issues found
Irrelevant columns: In WLI_M, the fat column is not consistently filled in, so it is not productive to use in analysis. Sedentary_active_distance was mostly filled with nulls and could confuse the analysis. I have removed both columns.
Irrelevant Rows: 77 rows in daily_Activity_merged had 0s across the board. As there is little chance that someone would take zero steps, I decided to interpret these as days when people did not put on the Fitbit; as such, they are irrelevant rows. Removed 77 rows. 85 rows in daily_intensities_merged registered 0 minutes of sedentary activity, which I do not believe to be possible. Row 241 logged 2 minutes of sedentary activity; I have determined it to be unusable. Row 322 likewise does not add up to a day's minutes and has been deleted. Removed 85 rows. 7 rows had 1440 sedentary minutes, which I have determined to be time with the device on but not used; the implication of their presence is noted.
Scientifically debunked information: BMI as a measurement has been determined to be problematic on many grounds: it misrepresents non-white people who have different healthy body types, does not account for muscle mass or scoliosis, has been known to change definitions in accordance with business interests rather than health data, and was never meant to be used as a measure of individual health. I have removed the BMI column from the Weight Log Info chart.
Cleaning Process 1:
I have elected to see what can be found in the data as it was organized by the providers first.
Cleaning Process 2:
I calculated and removed rows where the participants did not put on the fitbit. These rows were removed, and the implications of their presence have been noted.
Found averages, minimums, and maximums of steps, distance, types of active minutes, and calories.
Found the sum of all kinds of minutes documented to check for inconsistencies.
Found the difference between total minutes and a full 1440 minutes.
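The zero-row removal and minute-sum checks in the steps above can be sketched in pandas. The toy frame stands in for daily_Activity_merged, and the column names are plausible but assumed:

```python
import pandas as pd

# Toy stand-in for daily_Activity_merged; real column names may differ.
daily = pd.DataFrame({
    "Id": [1, 1, 2],
    "TotalSteps": [8500, 0, 4200],
    "VeryActiveMinutes": [30, 0, 10],
    "FairlyActiveMinutes": [20, 0, 15],
    "LightlyActiveMinutes": [200, 0, 180],
    "SedentaryMinutes": [700, 0, 1235],
})

# Drop days where every tracked value is zero: the device was not worn.
worn = daily.loc[daily["TotalSteps"] > 0].copy()

# Sum all minute categories and compare against a full day (1440 minutes)
# to spot rows that do not add up to a day's minutes.
minute_cols = ["VeryActiveMinutes", "FairlyActiveMinutes",
               "LightlyActiveMinutes", "SedentaryMinutes"]
worn["TotalMinutes"] = worn[minute_cols].sum(axis=1)
worn["MinutesGap"] = 1440 - worn["TotalMinutes"]
```

A `MinutesGap` of 0 means the device was worn the full day (possibly left on but unused); large gaps flag partial-wear days.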
I tried to make a pie chart to convey the average minutes of activity, and so created a duplicate dataset to trim down and remove misleading data caused by different inputs.
Analysis:
Observations: On average, the participants do not seem interested in moderate physical activity, as it was the category with the fewest active minutes; perhaps advertise the effectiveness of low-impact workouts. Very few participants volunteered their weights, and none of them lost weight. The person with the highest weight volunteered it only once, near the beginning. Given evidence from the Health At Every Size movement, we cannot rule out that having to be weight conscious had negative effects on this individual. I would suggest that weight would be a counterproductive focus for our marketing campaign: it would make heavier people less likely to want to participate, and any claims of weight loss would be statistically unfounded and open us up to false-advertising lawsuits. Fully half of the participants had days where they did not put on their Fitbit at all, for a total of 77-84 lost days of data, meaning that on average the participants who did not wear their Fitbit daily lost 5 days of data, though some lost significantly more. I would suggest focusing on creating a biometric tracker that is comfortable and rarely needs to be charged, so that people will gain more reliable information from it. 400 full days of data are recorded, meaning that the participants did not take the device off to sleep, shower, or swim. 280 more have 16...
Palestinian society's access to information and communication technology tools is one of the main inputs to achieving social development and economic change, given the impact of the information and communications technology revolution that has become a feature of this era. Therefore, within the scope of the efforts exerted by the Palestinian Central Bureau of Statistics in providing official Palestinian statistics on various areas of life for the Palestinian community, PCBS implemented the household survey on information and communications technology for the year 2019. The main objective of this report is to present the trends in accessing and using information and communication technology by households and individuals in Palestine, and to enrich the information and communications technology database with indicators that meet national needs and are in line with international recommendations.
Palestine, West Bank, Gaza Strip
Household, Individual
All Palestinian households and individuals (10 years and above) whose usual place of residence in 2019 was in the State of Palestine.
Sample survey data [ssd]
Sampling Frame: The sampling frame consists of the master sample enumerated in the 2017 census. Each enumeration area consists of buildings and housing units, with an average of about 150 households. These enumeration areas are used as primary sampling units (PSUs) in the first stage of sample selection.
Sample size The estimated sample size is 8,040 households.
Sample Design: The sample is a three-stage stratified cluster (PPS) sample. The design comprised three stages:
Stage 1: Selection of a stratified sample of 536 enumeration areas with the PPS method.
Stage 2: Selection of a stratified random sample of 15 households from each enumeration area selected in the first stage.
Stage 3: Selection of one person in the 10-years-and-above age group at random, using Kish tables.
Sample Strata: The population was divided by:
1- Governorate (16 governorates, with Jerusalem considered as two statistical areas)
2- Type of locality (urban, rural, refugee camps)
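A rough sketch of the first two sampling stages, under assumed enumeration-area sizes: stage 1 selects EAs with probability proportional to size (here via systematic PPS along the cumulative size scale), and stage 2 draws 15 households per selected EA. Stage 3's Kish-table selection of one individual per household is omitted for brevity:

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

# Hypothetical EAs with household counts as the measure of size; the real
# design selects 536 EAs from the 2017 census master sample.
eas = {f"EA_{i:02d}": random.randint(100, 200) for i in range(40)}
n_select = 8  # illustrative stand-in for 536

# Stage 1: systematic PPS -- lay equally spaced selection points along the
# cumulative household-count scale; larger EAs span more of the scale and
# are therefore more likely to contain a point.
total = sum(eas.values())
interval = total / n_select
start = random.uniform(0, interval)
points = [start + k * interval for k in range(n_select)]

selected, cum, idx = [], 0.0, 0
for ea, size in eas.items():
    cum += size
    while idx < n_select and points[idx] <= cum:
        selected.append(ea)
        idx += 1

# Stage 2: simple random sample of 15 households within each selected EA.
hh_sample = {ea: random.sample(range(1, eas[ea] + 1), 15) for ea in selected}
```

Since every EA here is smaller than the sampling interval, no EA can be selected twice; in designs where some PSUs exceed the interval, they are taken with certainty first.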
Computer Assisted Personal Interview [capi]
Questionnaire The survey questionnaire consists of identification data, quality controls and three main sections: Section I: Data on household members that include identification fields, the characteristics of household members (demographic and social) such as the relationship of individuals to the head of household, sex, date of birth and age.
Section II: Household data include information regarding computer processing, access to the Internet, and possession of various media and computer equipment. This section includes information on topics related to the use of computer and Internet, as well as supervision by households of their children (5-17 years old) while using the computer and Internet, and protective measures taken by the household in the home.
Section III: Data on Individuals (10 years and over) about computer use, access to the Internet and possession of a mobile phone.
Programming Consistency Checks: The data collection program was designed in accordance with the questionnaire's design and its skips. The program was examined more than once by the project management before the training course was conducted, and their notes and modifications were incorporated into the program by the Data Processing Department after ensuring that it was free of errors before going to the field.
Using PC-tablet devices reduced the data processing stages: fieldworkers collected data and sent it directly to the server, and project management could retrieve the data at any time.
In order to work in parallel with Jerusalem (J1), a data entry program was developed using the same technology and using the same database used for PC-tablet devices.
Data Cleaning: After the completion of the data entry and audit phase, the data were cleaned by running internal tests for outlier answers and comprehensive audit rules in SPSS to extract and correct errors and discrepancies, preparing clean, accurate data ready for tabulation and publishing.
Tabulation: After the data were checked and cleaned of any errors, tables were extracted according to the prepared list of tables.
The response rate in the West Bank reached 77.6% while in the Gaza Strip it reached 92.7%.
Sampling Errors: The data of this survey are affected by sampling errors due to the use of a sample rather than a complete enumeration. Therefore, certain differences are expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators; there is no problem disseminating results at the national level and at the level of the West Bank and Gaza Strip.
Non-Sampling Errors: Non-sampling errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors, and data entry errors. To avoid errors and reduce their effects, strenuous efforts were made to train the fieldworkers intensively. They were trained on how to carry out the interview, what to discuss and what to avoid, with practical and theoretical training during the training course.
The implementation of the survey encountered non-response, with the household not being present at home during the fieldwork visit accounting for the highest percentage of non-response cases. The total non-response rate reached 17.5%. The refusal rate reached 2.9%, which is relatively low compared to other household surveys conducted by PCBS; the reason is that the survey questionnaire is clear.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
While self-report data are a staple of modern psychological studies, they rely on participants reporting accurately. Two constructs that impede accurate results are insufficient effort responding (IER) and response styles. These constructs share conceptual underpinnings, and both are utilized to reduce cognitive effort when responding to self-report scales. Little research has extensively explored the relationship between the two constructs. The current study explored their relationship across even-point and odd-point scales, as well as before and after data cleaning procedures. We utilized IRTrees, a statistical method for modeling response styles, to examine the relationship between IER and response styles. To capture the wide range of IER metrics available, we employed several forms of IER assessment in our analyses and generated IER factors based on the type of IER being detected. Our results indicated an overall modest relationship between IER and response styles, which varied depending on the type of IER metric considered or the type of scale evaluated. As expected, data cleaning also changed the relationships of some of the variables. We posit that the difference between the constructs may be the degree of cognitive effort participants are willing to expend. Future research and applications are discussed.
The General Household Survey-Panel (GHS-Panel) is implemented in collaboration with the World Bank Living Standards Measurement Study (LSMS) team as part of the Integrated Surveys on Agriculture (ISA) program. The objectives of the GHS-Panel include the development of an innovative model for collecting agricultural data, interinstitutional collaboration, and comprehensive analysis of welfare indicators and socio-economic characteristics. The GHS-Panel is a nationally representative survey of approximately 5,000 households, which are also representative of the six geopolitical zones. The 2018/19 is the fourth round of the survey with prior rounds conducted in 2010/11, 2012/13, and 2015/16. GHS-Panel households were visited twice: first after the planting season (post-planting) between July and September 2018 and second after the harvest season (post-harvest) between January and February 2019.
National
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
The original GHS-Panel sample comprised 5,000 households across 500 enumeration areas (EAs) and was designed to be representative at the national level as well as at the zonal level. The complete sampling information for the GHS-Panel is described in the Basic Information Document for GHS-Panel 2010/2011. However, after nearly a decade of visiting the same households, a partial refresh of the GHS-Panel sample was implemented in Wave 4.
For the partial refresh of the sample, a new set of 360 EAs was randomly selected, consisting of 60 EAs per zone. The refresh EAs were selected from the same sampling frame as the original GHS-Panel sample in 2010 (the "master frame"). A listing of all households was conducted in the 360 EAs, and 10 households were randomly selected in each EA, resulting in a total refresh sample of approximately 3,600 households.
In addition to these 3,600 refresh households, a subsample of the original 5,000 GHS-Panel households from 2010 were selected to be included in the new sample. This “long panel” sample was designed to be nationally representative to enable continued longitudinal analysis for the sample going back to 2010. The long panel sample consisted of 159 EAs systematically selected across the 6 geopolitical Zones. The systematic selection ensured that the distribution of EAs across the 6 Zones (and urban and rural areas within) is proportional to the original GHS-Panel sample. Interviewers attempted to interview all households that originally resided in the 159 EAs and were successfully interviewed in the previous visit in 2016. This includes households that had moved away from their original location in 2010. In all, interviewers attempted to interview 1,507 households from the original panel sample.
The combined sample of refresh and long panel EAs consisted of 519 EAs. The total number of households that were successfully interviewed in both visits was 4,976.
While the combined sample generally maintains both national and Zonal representativeness of the original GHS-Panel sample, the security situation in the North East of Nigeria prevented full coverage of the Zone. Due to security concerns, rural areas of Borno state were fully excluded from the refresh sample and some inaccessible urban areas were also excluded. Security concerns also prevented interviewers from visiting some communities in other parts of the country where conflict events were occurring. Refresh EAs that could not be accessed were replaced with another randomly selected EA in the Zone so as not to compromise the sample size. As a result, the combined sample is representative of areas of Nigeria that were accessible during 2018/19. The sample will not reflect conditions in areas that were undergoing conflict during that period. This compromise was necessary to ensure the safety of interviewers.
Computer Assisted Personal Interview [capi]
The GHS-Panel Wave 4 consists of three questionnaires for each of the two visits. The Household Questionnaire was administered to all households in the sample. The Agriculture Questionnaire was administered to all households engaged in agricultural activities such as crop farming, livestock rearing and other agricultural and related activities. The Community Questionnaire was administered to the community to collect information on the socio-economic indicators of the enumeration areas where the sample households reside.
GHS-Panel Household Questionnaire: The Household Questionnaire provides information on demographics; education; health (including anthropometric measurement for children); labor; food and non-food expenditure; household nonfarm income-generating activities; food security and shocks; safety nets; housing conditions; assets; information and communication technology; and other sources of household income. Household location is geo-referenced in order to be able to later link the GHS-Panel data to other available geographic data sets.
GHS-Panel Agriculture Questionnaire: The Agriculture Questionnaire solicits information on land ownership and use; farm labor; inputs use; GPS land area measurement and coordinates of household plots; agricultural capital; irrigation; crop harvest and utilization; animal holdings and costs; and household fishing activities. Some information is collected at the crop level to allow for detailed analysis for individual crops.
GHS-Panel Community Questionnaire: The Community Questionnaire solicits information on access to infrastructure; community organizations; resource management; changes in the community; key events; community needs, actions and achievements; and local retail price information.
The Household Questionnaire is slightly different for the two visits. Some information was collected only in the post-planting visit, some only in the post-harvest visit, and some in both visits.
The Agriculture Questionnaire collects different information during each visit, but for the same plots and crops.
CAPI: For the first time in the GHS-Panel, Wave 4 was conducted using Computer-Assisted Personal Interviewing (CAPI) techniques. All three questionnaires (household, agriculture, and community) were implemented in both the post-planting and post-harvest visits of Wave 4 using the CAPI software Survey Solutions, which is developed and maintained by the Survey Unit within the Development Economics Data Group (DECDG) at the World Bank. Each enumerator was given a tablet on which to conduct interviews. Overall, implementation of the survey using Survey Solutions CAPI was highly successful, as it allowed for timely availability of data from completed interviews.
DATA COMMUNICATION SYSTEM: The data communication system used in Wave 4 was highly automated. Each field team was given a mobile modem to allow for internet connectivity and daily synchronization of their tablets, which ensured that the head office in Abuja had access to the data in real time. Once an interview was completed and uploaded to the server, the data was first reviewed by the Data Editors. The data was also downloaded from the server, and a Stata dofile was run on it to check for additional errors not captured by the Survey Solutions application. Running the Stata dofile on the raw dataset generated an Excel error file, and the information it contained was communicated back to the respective field interviewers for action. This was done daily throughout the duration of the survey, in both the post-planting and post-harvest visits.
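The daily error-report loop described above can be sketched in miniature. This is a hypothetical Python analogue only: the actual checks are implemented in Stata dofiles against the Survey Solutions export, and every field name and check below is invented for illustration.

```python
import csv
import io

# Hypothetical interview records; in the real workflow these come from the
# daily Survey Solutions server export. Field names are illustrative only.
interviews = [
    {"interviewer": "INT-014", "hh_id": "101", "hh_size": 5, "rooms": 2},
    {"interviewer": "INT-014", "hh_id": "102", "hh_size": 0, "rooms": 1},
    {"interviewer": "INT-022", "hh_id": "201", "hh_size": 3, "rooms": -1},
]

# Example checks of the kind a dofile might apply beyond the in-app validations.
CHECKS = [
    ("household size must be positive", lambda r: r["hh_size"] > 0),
    ("room count cannot be negative", lambda r: r["rooms"] >= 0),
]

def build_error_report(records):
    """Return one error row per failed check, keyed to the interviewer."""
    errors = []
    for rec in records:
        for message, passes in CHECKS:
            if not passes(rec):
                errors.append({"interviewer": rec["interviewer"],
                               "hh_id": rec["hh_id"],
                               "error": message})
    return errors

report = build_error_report(interviews)

# Write the report as CSV text; the real workflow produces an Excel file
# that is sent back to each interviewer for action.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["interviewer", "hh_id", "error"])
writer.writeheader()
writer.writerows(report)
```

Each error row carries the interviewer ID, so the report can be split per field team before being sent back for correction.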
DATA CLEANING: The data cleaning process was done in three main stages. The first stage was to ensure proper quality control during the fieldwork. This was achieved in part by incorporating validation and consistency checks into the Survey Solutions application used for the data collection and designed to highlight many of the errors that occurred during the fieldwork.
The second stage of cleaning involved the Data Editors and Data Assistants (Headquarters in Survey Solutions). As indicated above, once an interview was completed and uploaded to the server, the Data Editors reviewed it for inconsistencies and extreme values and, depending on the outcome, could either approve or reject the case. If rejected, the case was returned to the respective interviewer's tablet upon synchronization. Special care was taken to ensure that the households included in the data matched the selected sample; where there were differences, these were properly assessed and documented. The agriculture data were also checked to ensure that the plots identified in the main sections merged with the plot information identified in the other sections. Additional errors were compiled into error reports that were regularly sent to the teams and then corrected through re-visits to the household on the instruction of the supervisor. Data that had gone through this stage of cleaning was then approved by the Data Editor. After the Data Editor's approval on the Survey Solutions server, Headquarters also reviewed the interview and, depending on the outcome, could either reject or approve it.
The third stage of cleaning involved a comprehensive review of the final raw data following
Syngenta is committed to increasing crop productivity and to using limited resources such as land, water and inputs more efficiently. Since 2014, Syngenta has been measuring trends in agricultural input efficiency on a global network of real farms. The Good Growth Plan dataset shows aggregated productivity and resource efficiency indicators by harvest year. The data has been collected from more than 4,000 farms and covers more than 20 different crops in 46 countries. The data (except USA data and for Barley in UK, Germany, Poland, Czech Republic, France and Spain) was collected, consolidated and reported by Kynetec (previously Market Probe), an independent market research agency. It can be used as benchmarks for crop yield and input efficiency.
National coverage
Agricultural holdings
Sample survey data [ssd]
A. Sample design Farms are grouped in clusters, which represent a crop grown in an area with homogenous agro- ecological conditions and include comparable types of farms. The sample includes reference and benchmark farms. The reference farms were selected by Syngenta and the benchmark farms were randomly selected by Kynetec within the same cluster.
B. Sample size Sample sizes for each cluster are determined with the aim to measure statistically significant increases in crop efficiency over time. This is done by Kynetec based on target productivity increases and assumptions regarding the variability of farm metrics in each cluster. The smaller the expected increase, the larger the sample size needed to measure significant differences over time. Variability within clusters is assumed based on public research and expert opinion. In addition, growers are also grouped in clusters as a means of keeping variances under control, as well as distinguishing between growers in terms of crop size, region and technological level. A minimum sample size of 20 interviews per cluster is needed. The minimum number of reference farms is 5 of 20. The optimal number of reference farms is 10 of 20 (balanced sample).
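The size-versus-effect trade-off described above (the smaller the expected increase, the larger the required sample) can be illustrated with the standard two-group normal-approximation sample-size formula. This is a textbook sketch, not Kynetec's actual procedure; the effect sizes and spreads are invented.

```python
import math

def sample_size(delta, sigma, z_alpha=1.96, z_beta=0.84):
    """Per-group n to detect a mean increase `delta` given spread `sigma`,
    using a two-sided z-test at 5% significance with 80% power.
    An illustrative approximation, not Kynetec's actual method."""
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Halving the expected increase roughly quadruples the required sample.
n_large_effect = sample_size(delta=1.0, sigma=2.0)  # larger expected increase
n_small_effect = sample_size(delta=0.5, sigma=2.0)  # smaller expected increase
```

With the variability held fixed, the required n scales with the inverse square of the expected increase, which is why clusters targeting small productivity gains need far more interviews.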
C. Selection procedure The respondents were picked randomly using a “quota based random sampling” procedure. Growers were first randomly selected and then checked for compliance with the quotas for crop, region, farm size, etc. To avoid clustering a high number of interviews at one sampling point, interviewers were instructed to conduct a maximum of 5 interviews in one village.
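The quota-based random draw with the per-village cap might look like the following sketch. This is hypothetical Python; the grower pool, quota fields, and seed are invented for illustration and do not reflect Kynetec's actual implementation.

```python
import random

# Hypothetical grower pool spread over 8 villages; fields are illustrative.
growers = [{"id": i, "village": f"V{i % 8}", "crop": "barley"}
           for i in range(200)]

def quota_sample(pool, n, max_per_village=5, crop="barley", seed=42):
    """Randomly draw growers, enforcing the crop quota and the
    5-interviews-per-village cap described above (illustrative sketch)."""
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    picked, per_village = [], {}
    for g in shuffled:
        if g["crop"] != crop:
            continue  # fails the crop quota; skip
        if per_village.get(g["village"], 0) >= max_per_village:
            continue  # village cap reached; skip
        picked.append(g)
        per_village[g["village"]] = per_village.get(g["village"], 0) + 1
        if len(picked) == n:
            break
    return picked

sample = quota_sample(growers, n=20)
```

The random shuffle happens first and the quota/cap checks act as filters afterwards, matching the "randomly selected, then checked against quotas" order described above.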
Screened benchmark farms (BF) from Italy were selected based on the following criteria:
(a) Barley growers in the North/Centre of Italy
North and Centre of Italy (regions: Piemonte, Veneto, Emilia Romagna, Umbria and Marche)
High to medium mechanized farming (small to large farms with a high number of machines, e.g. harvesters)
Good level of tech adoption (i.e. mechanized harvesting, a good level of CP products/modern CP technologies and/or genetics)
Adopt Syngenta products and services (only for RF)
Background info: Rotation with several crops (sunflower, corn, sorghum, beans & veggies);
Grain is sold to collectors or used on the farm (cows or bio-digester);
Farm with an in-house technician.
Need to identify benchmark farms that have similar size and use the same practices and input.
Winter barley
(b) Grain corn growers on irrigated fields in the North of Italy (Po Valley)
Region: Po Valley
Irrigated fields
Professional grower willing to invest in irrigation technologies and inputs to maximize yield;
Takes decisions using ROI as the KPI rather than minimizing costs;
Adopt Syngenta products and services (only for RF)
Areas where water is available but limited. We exclude areas with no irrigation and areas with plentiful availability via submersion. Ideally areas where pivot, rotolone or drip irrigation is used.
Background info: Destination: Grain and cob
(c) Wine grape integrated producer in the North & Centre of Italy
Wine grapes
North & Centre of Italy (Regions: Piemonte, Veneto, Friuli, Toscana, Marche, Abruzzo)
High-tech farm (i.e. mechanized farming, a good level of CP products/modern CP technologies, or a good level of irrigation technology, etc.)
Wine Grape Integrated Producer;
Farms that export wine worldwide and pay attention to the needs of different customers (not just Italian/European but also American);
Farms seeking elements of differentiation that may represent a plus value for their products.
Need to identify benchmark farms that have similar size and use the same practices and input. No organic farms.
(d) Tomato growers for processing industry in the North of Italy (Po Valley: Parma, Piacenza, Ferrara)
Region: Po Valley (Cities: Parma, Piacenza, Ferrara)
North-West of Italy (Cities: Parma, Piacenza)
North-East of Italy (City: Ferrara)
Commercial grower for processing
Mechanized farming
High level of tech adoption (i.e. transplanting machines, mechanized harvesting, a good level of CP products/modern CP technologies and genetics)
Irrigated by drip irrigation
Adopt Syngenta products and services (only for RF)
Rotation with wheat
Need to identify benchmark farms that have similar size but use local practices. Rotation with wheat.
(e) Tomato growers for processing industry in the South of Italy (Puglia)
Region: Puglia
Commercial grower for processing
Mechanized farming
High level of tech adoption (i.e. transplanting machines, mechanized harvesting, a good level of CP products/modern CP technologies and genetics)
Irrigated by drip irrigation
Adopt Syngenta products and services (only for RF)
Rotation with wheat and leguminosae
Need to identify benchmark farms that have similar size but use local practices. Rotation with wheat and leguminosae.
(f) Highly mechanized cereal growers in Puglia (Foggia)
Winter wheat growers
Region: Puglia (city: Foggia)
Adopt Syngenta products and services (only for RF)
Rotation with beans and/or vegetables is common;
Takes decisions using ROI as the KPI rather than minimizing costs;
Need to identify benchmark farms that have similar size and use the same practices and input. No organic farm.
Producing for the pasta industry (durum wheat)
Face-to-face [f2f]
Data collection tool for 2019 covered the following information:
(A) PRE-HARVEST INFORMATION
PART I: Screening
PART II: Contact Information
PART III: Farm Characteristics
a. Biodiversity conservation
b. Soil conservation
c. Soil erosion
d. Description of growing area
e. Training on crop cultivation and safety measures
PART IV: Farming Practices - Before Harvest
a. Planting and fruit development - Field crops
b. Planting and fruit development - Tree crops
c. Planting and fruit development - Sugarcane
d. Planting and fruit development - Cauliflower
e. Seed treatment
(B) HARVEST INFORMATION
PART V: Farming Practices - After Harvest
a. Fertilizer usage
b. Crop protection products
c. Harvest timing & quality per crop - Field crops
d. Harvest timing & quality per crop - Tree crops
e. Harvest timing & quality per crop - Sugarcane
f. Harvest timing & quality per crop - Banana
g. After harvest
PART VI: Other inputs - After Harvest
a. Input costs
b. Abiotic stress
c. Irrigation
See all questionnaires in external materials tab
Data processing:
Kynetec uses SPSS (Statistical Package for the Social Sciences) for data entry, cleaning, analysis, and reporting. After collection, the farm data is entered into a local database, reviewed, and quality-checked by the local Kynetec agency. In the case of missing values or inconsistencies, farmers are re-contacted. In some cases, grower data is verified with local experts (e.g. retailers) to ensure data accuracy and validity. After country-level cleaning, the farm-level data is submitted to the global Kynetec headquarters for processing. In the case of missing values or inconsistencies, the local Kynetec office is re-contacted to clarify and resolve issues.
Quality assurance Various consistency checks and internal controls are implemented throughout the entire data collection and reporting process in order to ensure unbiased, high quality data.
• Screening: Each grower is screened and selected by Kynetec based on cluster-specific criteria to ensure a comparable group of growers within each cluster. This helps keep variability low.
• Evaluation of the questionnaire: The questionnaire aligns with the global objective of the project and is adapted to the local context (e.g. interviewers and growers should understand what is asked). Each year the questionnaire is evaluated based on several criteria, and updated where needed.
• Briefing of interviewers: Each year, local interviewers, familiar with the local context of farming, are thoroughly briefed to fully comprehend the questionnaire and obtain unbiased, accurate answers from respondents.
• Cross-validation of the answers:
o Kynetec captures all growers' responses through a digital data-entry tool. Various logical and consistency checks are automated in this tool (e.g. total crop size in hectares cannot be larger than farm size)
o Kynetec cross validates the answers of the growers in three different ways:
1. Within the grower (check if growers respond consistently during the interview)
2. Across years (check if growers respond consistently throughout the years)
3. Within cluster (compare a grower's responses with those of others in the group)
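The three kinds of cross-validation can be sketched as simple predicate checks. This is an illustrative Python sketch: the field names, thresholds, and check logic are invented, not taken from Kynetec's actual tool.

```python
# Illustrative versions of the automated consistency checks described above;
# all field names and thresholds here are assumptions, not Kynetec's rules.

def check_within_grower(response):
    """Within the grower: crop area cannot exceed total farm area."""
    return response["crop_area_ha"] <= response["farm_area_ha"]

def check_across_years(this_year, last_year, max_ratio=3.0):
    """Across years: flag implausible year-on-year jumps in reported yield."""
    if last_year["yield_t_ha"] == 0:
        return True  # no prior baseline to compare against
    ratio = this_year["yield_t_ha"] / last_year["yield_t_ha"]
    return 1 / max_ratio <= ratio <= max_ratio

def check_within_cluster(response, cluster, max_z=3.0):
    """Within cluster: compare a grower's yield against the cluster mean
    via a simple z-score."""
    yields = [r["yield_t_ha"] for r in cluster]
    mean = sum(yields) / len(yields)
    sd = (sum((y - mean) ** 2 for y in yields) / len(yields)) ** 0.5
    return sd == 0 or abs(response["yield_t_ha"] - mean) <= max_z * sd

# A response whose crop area exceeds the farm area fails the first check.
r = {"crop_area_ha": 12.0, "farm_area_ha": 10.0, "yield_t_ha": 6.0}
flag_area = not check_within_grower(r)
```

In the real workflow, responses that fail any of these checks are followed up with the grower for verification, and all updates are tracked.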
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Cyclistic Bikes: A Comparison Between Casual and Annual Memberships
As part of the Google Data Analytics Certificate, I have been asked to complete a case study on the maximisation of Annual memberships vs those who choose the single and day-pass options.
The business goal of Cyclistic is clear, convert more members to Annual in an attempt to boost profits. The question is whether such a goal is truly profitable in the long term.
For this task, I will take the previous 12 months of data available from a public AWS server, https://divvy-tripdata.s3.amazonaws.com/index.html, and use it to build a forecast for the following years, looking for trends and possible problems that may impede Cyclistic’s ultimate goal.
Sources and Tools
Rstudio: Tidyverse - Lubridate https://divvy-tripdata.s3.amazonaws.com/index.html
Business Goal
Under the direction of Lily Moreno and, by extension, Cyclistic, the aim of this case study is to analyse the differences in usage between Casual and Annual members.
For clarity, Casual members will be those who use the Day and Single Use options when using Cyclistic, whilst Annual refers to those who purchase a 12 month subscription to the service.
The ultimate goal is to determine whether there is a clear business reason to push forward with a marketing campaign to convert Casual users into Annual memberships.
Tasks and Data Storage
The data I will be using was previously stored on an AWS server at https://divvy-tripdata.s3.amazonaws.com/index.html. This location is publicly accessible but the data within can only be downloaded and edited locally.
For the purposes of this task, I have downloaded the data for the year 2022: 12 separate files that I then collated into a single zip file to upload to RStudio for cleaning, arranging and studying the information. The original files remain on my PC and at the AWS link. As part of the process, a backup file will be created within RStudio to ensure that the original data is always available.
Process
After uploading the data to RStudio and applying a naming convention (Month), the next step was to compare the names of the columns. As the information came from 2022, two years after Cyclistic updated its naming conventions, this step was more of a formality to ensure that the files could later be joined into one. No irregularities were found at this stage.
As all column names matched, there was no need to rename them. Furthermore, all ride_id fields were already in character format.
Once this check was complete, all tables were compiled into one, named all_trips.
Cleaning
The first issue found was the number of labels used to identify the different member types. The files used a four-label approach, with "member" and "subscriber" for Annual and "Customer" and "casual" for the casual users. These four labels were aggregated into two: Member and Casual.
As the original files only recorded data at the ride level, more fields were added (day, week, month, year) to enable more opportunities to aggregate the data.
ride_length was added for consistency and to provide a clearer output. After adding this column, the field was converted from Factor to numeric so that the final output could be measured.
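The label consolidation and type conversion described above can be sketched as follows. This is a Python analogue of the R cleaning steps (the case study itself works in RStudio with dplyr); the sample rows are invented for illustration.

```python
# Map the four historical labels onto the two current ones, as described
# above; a Python analogue of the recode step done in R for the case study.
LABEL_MAP = {
    "member": "Member",
    "subscriber": "Member",
    "Customer": "Casual",
    "casual": "Casual",
}

# Invented sample rows standing in for the combined all_trips table.
rides = [
    {"ride_id": "A1", "member_casual": "subscriber", "ride_length": 420},
    {"ride_id": "A2", "member_casual": "Customer", "ride_length": 1310},
]

for ride in rides:
    # Consolidate four member-type labels into two.
    ride["member_casual"] = LABEL_MAP[ride["member_casual"]]
    # Mirror the Factor-to-numeric conversion so durations can be averaged.
    ride["ride_length"] = float(ride["ride_length"])
```

After this recode, every aggregation can group on exactly two member types, which is what the analysis below relies on.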
Analysis
Here, I will provide the final code used to describe the final process:
mean(all_trips_v2$ride_length) #straight average (total ride length / rides)
median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length) #longest ride
min(all_trips_v2$ride_length) #shortest ride
summary(all_trips_v2$ride_length)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
  group_by(member_casual, weekday) %>% #groups by usertype and weekday
  summarise(number_of_rides = n(), #calculates the number of rides
            average_duration = mean(ride_length)) %>% #calculates the average duration
  arrange(member_casual, weekday) #sorts by usertype and weekday
Cyclistic, a bike-sharing company, wants to analyze its user data to find the main differences in behavior between its two types of users: Casual Riders, who pay for each ride, and Annual Members, who pay a yearly subscription to the service.
Key objectives: 1. Identify The Business Task: Cyclistic wants to analyze the data to find the key differences between Casual Riders and Annual Members. The goal of this project is to reach out to the casual riders and incentivize them to pay for the annual subscription.
Key objectives: 1. Download Data And Store It Appropriately - Downloaded the data as .csv files, which were saved in their own folder to keep everything organized. I then uploaded those files into BigQuery for cleaning and analysis. For this project I downloaded all of 2022 and up to May of 2023, as this is the most recent data that I have access to.
Identify How It's Organized
Sort and Filter The Data and Determine The Credibility of The Data
Key objectives: 1. Clean The Data and Prepare The Data For Analysis: I used some simple SQL code to determine that no member records were missing, that no information was repeated, and that there were no misspellings in the data.
--No misspellings in either member or casual; this ensures that results will not have missing information.
SELECT
DISTINCT member_casual
FROM
table
--This shows how many casual riders and members used the service; it should add up to the number of rows in the dataset
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM table
GROUP BY member_type
--Shows that every ride has a distinct ID.
SELECT DISTINCT ride_id FROM table
--Shows that there are no typos in the types of bikes, so no data will be missing from results.
SELECT DISTINCT rideable_type FROM table
Key objectives: 1. Aggregate Your Data So It's Useful and Accessible: I wrote some SQL code to combine all the data from the different files I had uploaded to BigQuery.
SELECT rideable_type, started_at, ended_at, member_casual FROM table_1
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_2
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_3
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_4
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_5
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_6
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_7
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_8
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_9
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_10
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_11
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_12
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_13
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_14
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_15
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_16
UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_17
--This shows how many casual and annual members used bikes
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM aggregate_data_table
GROUP BY member_type
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Phylogenomics dataset and the generated transcriptomic data for the study of 7 ancyromonads, 14 apusomonads and Meteora sporadica CRO19MET.
Markers and supermatrices: phylogenomics_101_flagellates_97171aa.tar.gz
Raw transcripts and peptides used for phylogenomics: 22_transcriptomes_brut.tar.gz
Transcripts and peptides without cross-contamination due to batch extraction/sequencing: 22_transcriptomes_croco.tar.gz
Peptides without bacterial contamination and redundancy: 22_transcriptomes_eukpep.tar.gz
SRA in BioProject: PRJNA908224.
Detailed explanation, read carefully before using these datasets: The scope of this study was to generate enough conserved phylogenomic markers to resolve the species phylogeny of Apusomonadida and Ancyromonadida in the tree of eukaryotes (with the additional inclusion of the incertae sedis protist Meteora sporadica). For that, the original sets of de novo assembled transcripts from SPAdes (folder 01_transcripts_brut) were translated to proteins using TransDecoder and CD-HIT at 1% identity (folder 02_peptides_brut), and used to fill the phylogenomic dataset using BLASTp. As explained in the main text, all 22 transcriptomes filled the dataset well (Table S1) and had a high percentage of BUSCO completeness (Table S2), higher even than the reference apusomonad genome of Thecamonas trahens. We do not encourage using these raw ("brut") sets unless all further analyses can be carefully checked on a case-by-case basis. Hence, with the aim of providing good-quality data to the research community, we implemented the decontamination pipeline discussed below. From the original sets of de novo assembled transcripts, CroCo detected the most cross-contamination within the 1st sequencing batch (Table S3), which was also the one with the most reads: > 10 million reads, compared to < 8 million reads in the 2nd and 3rd batches (Table S2).
From the de-cross-contaminated transcripts (folder 03_transcripts_croco), the number of predicted peptides was much larger (from 26.19% to 68.81% more), except for Ancyromonas kenti, which had around ten times more transcripts than the other species (Table S2). This is because TransDecoder produces multiple peptides per transcript, which might not all be real. After removing cross-contamination, the percentage of BUSCO completeness did not decrease for any species. There were some observed differences between taxa, such as apusomonads having more transcripts and peptides than ancyromonads, although it might be irrelevant to scrutinize partial transcriptomic data without genomic data backing up the results. Similarly, the 1st batch provided more transcripts and peptides than the 2nd and 3rd ones, probably because it had more reads to begin with. From that, we proceeded with only the peptides (folder 04_peptides_croco). Then, the supervised cleaning process with BAUVdb (Bacteria, Archaea, eUkaryotes and Viruses; Table S4) detected a low percentage of eukaryotic peptides: from 6.65% in Ancyromonas kenti up to 17.11% in Fabomonas mesopelagica (folder 05_eukaryotic_peptides). The percentage of BUSCO completeness decreased for the subset with only eukaryotic hits, from only 0.4% in Chelonemonas dolani up to 15.6% in Mylnikovia oxoniensis (the transcriptome with the most peptides). Apusomonas proboscidea, due to being co-sequenced with a stramenopile, lost 27% of BUSCO completeness. On average, completeness decreased by 7.2% after cleaning the data of non-eukaryotic contaminants, which might represent a loss of truly eukaryotic peptides due to the limited taxon sampling of BAUVdb (Tables S2 and S4). Regarding the eggNOG-mapper analysis, only about half of the peptides were annotated (55.63% on average), from 48.78% in Mylnikovia oxoniensis up to 62.07% in Chelonemonas dolani.
Altogether, the BUSCO completeness decreased by between 4.2% in Chelonemonas geobuk and 19.4% in Ancyromonas mediterranea. Overall, we encourage anyone to use the subset of eukaryotic peptides for comparative genomics studies, in which the proteins under study can be easily checked. Since de novo transcriptomes are prone to show artificially duplicated peptides in comparative genomics analyses, we tested the peptide redundancy using CD-HIT at 90% identity. This procedure removed few peptides for most species (6.48% on average), except for the highly duplicated Mylnikovia oxoniensis (~42.1%), as well as Multimonas media (20.51%), Apusomonas australiensis (15.5%) and Cavaliersmithia chaoae (9.57%). These four apusomonad species from the 1st sequencing batch are the ones with the most transcripts and predicted peptides, but, like the other species from that batch, they have a similar number of sequencing reads. As of now, it is not possible to discern between methodological issues and a biological explanation, such as genome duplication or a high level of alternative splicing, for these differences. Interestingly, the BUSCO completeness value was identical for all species. Although the 255 BUSCO markers are just a small subset of peptides, we suspect this redundancy-reduction step removed not information but errors introduced during data processing. We suggest users of these data use this set (folder 06_eukpep_cdhit90pid) for high-throughput comparative genomics analyses, always taking into account the information given here. Also, we did not observe any differences in the number of proteins, percentage of BUSCO completeness, or number of eggNOG-annotated peptides between the apusomonad and ancyromonad lineages, nor between marine and freshwater organisms, nor between large and small apusomonads.
Interestingly, we found that the subset of eukaryote-only peptides reported from ~30% BUSCO completeness (using the bacteria db10) in Chelonemonas geobuk up to 50% in Mylnikovia oxoniensis, a value similar to that found for the previously sequenced Thecamonas trahens RefSeq proteins (47.5%). In future studies, it would be interesting to compare these numbers with genomic data and see how well suited RNA-seq is for further comparative genomics analyses.
Syngenta is committed to increasing crop productivity and to using limited resources such as land, water and inputs more efficiently. Since 2014, Syngenta has been measuring trends in agricultural input efficiency on a global network of real farms. The Good Growth Plan dataset shows aggregated productivity and resource efficiency indicators by harvest year. The data has been collected from more than 4,000 farms and covers more than 20 different crops in 46 countries. The data (except USA data and for Barley in UK, Germany, Poland, Czech Republic, France and Spain) was collected, consolidated and reported by Kynetec (previously Market Probe), an independent market research agency. It can be used as benchmarks for crop yield and input efficiency.
National Coverage
Agricultural holdings
Sample survey data [ssd]
A. Sample design Farms are grouped in clusters, which represent a crop grown in an area with homogenous agro- ecological conditions and include comparable types of farms. The sample includes reference and benchmark farms. The reference farms were selected by Syngenta and the benchmark farms were randomly selected by Kynetec within the same cluster.
B. Sample size Sample sizes for each cluster are determined with the aim to measure statistically significant increases in crop efficiency over time. This is done by Kynetec based on target productivity increases and assumptions regarding the variability of farm metrics in each cluster. The smaller the expected increase, the larger the sample size needed to measure significant differences over time. Variability within clusters is assumed based on public research and expert opinion. In addition, growers are also grouped in clusters as a means of keeping variances under control, as well as distinguishing between growers in terms of crop size, region and technological level. A minimum sample size of 20 interviews per cluster is needed. The minimum number of reference farms is 5 of 20. The optimal number of reference farms is 10 of 20 (balanced sample).
C. Selection procedure The respondents were picked randomly using a “quota based random sampling” procedure. Growers were first randomly selected and then checked for compliance with the quotas for crop, region, farm size, etc. To avoid clustering a high number of interviews at one sampling point, interviewers were instructed to conduct a maximum of 5 interviews in one village.
Screened Bangladesh benchmark farms (BF) were from Jessore, Rajshahi, Rangpur, Bogra, Comilla and Mymensingh and were selected based on the following criteria:
- Rice growers
- Partly smallholder
- Professional farmer with rice being main income source
- Manual planting and harvesting, but mechanized land preparation and threshing
- Receive tech supports from SYT FFs, CP suppliers or dealers
- Hire labor
- Leading local farmer
- Using SYT products
- Loyal to SYT (only for RF)
- Rice to rice rotation
Face-to-face [f2f]
Data collection tool for 2019 covered the following information:
(A) PRE-HARVEST INFORMATION
PART I: Screening
PART II: Contact Information
PART III: Farm Characteristics
a. Biodiversity conservation
b. Soil conservation
c. Soil erosion
d. Description of growing area
e. Training on crop cultivation and safety measures
PART IV: Farming Practices - Before Harvest
a. Planting and fruit development - Field crops
b. Planting and fruit development - Tree crops
c. Planting and fruit development - Sugarcane
d. Planting and fruit development - Cauliflower
e. Seed treatment
(B) HARVEST INFORMATION
PART V: Farming Practices - After Harvest
a. Fertilizer usage
b. Crop protection products
c. Harvest timing & quality per crop - Field crops
d. Harvest timing & quality per crop - Tree crops
e. Harvest timing & quality per crop - Sugarcane
f. Harvest timing & quality per crop - Banana
g. After harvest
PART VI: Other inputs - After Harvest
a. Input costs
b. Abiotic stress
c. Irrigation
See all questionnaires in external materials tab
Data processing:
Kynetec uses SPSS (Statistical Package for the Social Sciences) for data entry, cleaning, analysis, and reporting. After collection, the farm data is entered into a local database, reviewed, and quality-checked by the local Kynetec agency. In the case of missing values or inconsistencies, farmers are re-contacted. In some cases, grower data is verified with local experts (e.g. retailers) to ensure data accuracy and validity. After country-level cleaning, the farm-level data is submitted to the global Kynetec headquarters for processing. In the case of missing values or inconsistencies, the local Kynetec office is re-contacted to clarify and resolve issues.
B. Quality assurance
Various consistency checks and internal controls are implemented throughout the entire data collection and reporting process in order to ensure unbiased, high-quality data.
• Screening: Each grower is screened and selected by Kynetec based on cluster-specific criteria to ensure a comparable group of growers within each cluster. This helps keep variability low.
• Evaluation of the questionnaire: The questionnaire aligns with the global objective of the project and is adapted to the local context (e.g. interviewers and growers should understand what is asked). Each year the questionnaire is evaluated based on several criteria, and updated where needed.
• Briefing of interviewers: Each year, local interviewers, who are familiar with the local context of farming, are thoroughly briefed to fully comprehend the questionnaire and obtain unbiased, accurate answers from respondents.
• Cross-validation of the answers:
o Kynetec captures all growers' responses through a digital data-entry tool. Various logical and consistency checks are automated in this tool (e.g. total crop size in hectares cannot be larger than farm size).
o Kynetec cross validates the answers of the growers in three different ways:
1. Within the grower (check if growers respond consistently during the interview)
2. Across years (check if growers respond consistently throughout the years)
3. Within cluster (compare a grower's responses with those of others in the group)
o All of the above-mentioned inconsistencies are followed up by contacting the growers and asking them to verify their answers. The data are updated after verification, and all updates are tracked.
• Check and discuss evolutions and patterns: Global evolutions are calculated, discussed and reviewed on a monthly basis jointly by Kynetec and Syngenta.
• Sensitivity analysis: A sensitivity analysis is conducted to evaluate the global results in terms of outliers, retention rates and overall statistical robustness. The results of the sensitivity analysis are discussed jointly by Kynetec and Syngenta.
• It is recommended that users interested in using the administrative level 1 variable in the location dataset use this variable with care and crosscheck it with the postal code variable.
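The cross-validation logic described above can be sketched as follows. This is a hypothetical illustration only: the field names, thresholds, and check rules are invented for the example, not Kynetec's actual tooling.

```python
# Hypothetical sketch of within-grower and across-year consistency checks
# (names and thresholds are illustrative assumptions).

def check_within_grower(record):
    """Logical checks within a single interview."""
    issues = []
    # e.g. total crop size in hectares cannot be larger than farm size
    if record["total_crop_ha"] > record["farm_size_ha"]:
        issues.append("total crop area exceeds farm size")
    return issues

def check_across_years(current, previous, max_rel_change=0.5):
    """Flag answers that jump implausibly between survey years."""
    issues = []
    for key in ("farm_size_ha", "total_crop_ha"):
        prev = previous.get(key)
        if prev and abs(current[key] - prev) / prev > max_rel_change:
            issues.append(f"{key} changed by more than {max_rel_change:.0%} vs last year")
    return issues

record_2019 = {"farm_size_ha": 4.0, "total_crop_ha": 5.2}
record_2018 = {"farm_size_ha": 4.0, "total_crop_ha": 3.8}
print(check_within_grower(record_2019))   # inconsistency: crop area > farm size
print(check_across_years(record_2019, record_2018))
```

In the process described above, any record flagged this way would be followed up with the grower and the correction tracked.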
The above-mentioned checks uncovered irregularities in the fertilizer usage data that had to be corrected:
For data collection wave 2014, respondents were asked to give a total estimate of the fertilizer NPK rates applied in the fields. From 2015 onwards, the questionnaire was redesigned to be more precise and to obtain data for individual fertilizer products. The new method of measuring fertilizer inputs leads to more accurate results, but also makes year-on-year comparison difficult. After evaluating several solutions to this problem, 2014 fertilizer usage (NPK input) was re-estimated by calculating a weighted average of fertilizer usage in the following years.
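The re-estimation step can be sketched as below. The NPK values and the weighting scheme are assumptions for illustration; the source does not specify which years or weights were used.

```python
# Illustrative re-estimation of the 2014 NPK rate as a weighted average of
# the following years' values (values and weights are hypothetical).
npk_kg_per_ha = {2015: 92.0, 2016: 88.0, 2017: 85.0}   # assumed observed rates
weights       = {2015: 0.5, 2016: 0.3, 2017: 0.2}      # assumed: nearer years weigh more

estimate_2014 = sum(npk_kg_per_ha[y] * weights[y] for y in weights)
print(f"Re-estimated 2014 NPK input: {estimate_2014:.1f} kg/ha")
```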
Background: Adolescent girls in Kenya are disproportionately affected by early and unintended pregnancies, unsafe abortion and HIV infection. The In Their Hands (ITH) programme in Kenya aims to increase adolescents' use of high-quality sexual and reproductive health (SRH) services through targeted interventions. The ITH programme aims to: 1) promote the use of contraception and testing for sexually transmitted infections (STIs), including HIV, or pregnancy among sexually active adolescent girls; 2) provide information, products and services on the adolescent girl's terms; and 3) promote community support for girls and boys to access SRH services.
Objectives: The objectives of the evaluation are to assess: a) to what extent and how the new Adolescent Reproductive Health (ARH) partnership model and integrated system of delivery is working to meet its intended objectives and the needs of adolescents; b) adolescent user experiences across key quality dimensions and outcomes; c) how ITH programme has influenced adolescent voice, decision-making autonomy, power dynamics and provider accountability; d) how community support for adolescent reproductive and sexual health initiatives has changed as a result of this programme.
Methodology: The ITH programme is being implemented in two phases: a formative planning and experimentation phase in the first year, from April 2017 to March 2018, and a national roll-out and implementation phase from April 2018 to March 2020. The second phase is informed by an Annual Programme Review and thorough benchmarking and assessment, which informed critical changes to performance and capacity so that ITH is fit for scale. It is expected that ITH will cover approximately 250,000 adolescent girls aged 15-19 in Kenya by April 2020. The programme is implemented by a consortium of Marie Stopes Kenya (MSK), Well Told Story, and Triggerise. ITH's key implementation strategies seek to increase adolescent motivation for service use; create a user-defined ecosystem and platform to provide girls with a network of accessible, subsidized and discreet SRH services; and launch and sustain a national discourse campaign around adolescent sexuality and rights. The 3-year study will employ a mixed-methods approach with multiple data sources, including secondary data and qualitative and quantitative primary data from various stakeholders, to explore their perceptions of and attitudes towards adolescents' SRH services. Quantitative data analysis will be done using STATA to provide descriptive statistics and statistical associations/correlations on key variables. All qualitative data will be analyzed using NVIVO software.
Study Duration: 36 months - between 2018 and 2020.
Narok and Homabay counties
Households
All adolescent girls aged 15-19 years resident in the household.
The sampling of adolescents for the household survey was based on expected changes in adolescents' intention to use contraception in the future. According to the Kenya Demographic and Health Survey 2014, 23.8% of adolescents and young women reported not intending to use contraception in the future. This was used as the baseline proportion, as the intervention aimed to increase demand and reduce the proportion of sexually active adolescents who did not intend to use contraception in the future. Assuming that the project was to achieve an impact of at least 2.4 percentage points in the intervention counties (i.e. a reduction by 10%), a design effect of 1.5 and a non-response rate of 10%, a sample size of 1,885, estimated using Cochran's sample size formula for categorical data, was adequate to detect this difference between the baseline and endline time points. Based on data from the 2009 Kenya census, there were approximately 0.46 adolescent girls per household, which meant that the study was to include approximately 4,876 households from the two counties at both the baseline and endline surveys.
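A calculation of this general shape can be sketched with Cochran's formula plus the design-effect and non-response adjustments described above. The exact variant and parameters the study used are not fully specified, so the number produced here is indicative rather than a reproduction of the reported 1,885.

```python
import math

# Sketch of a Cochran-style sample-size calculation (assumed parameters;
# not necessarily the study's exact specification).
z = 1.96            # 95% confidence
p = 0.238           # baseline proportion (KDHS 2014)
e = 0.024           # detectable change: 2.4 percentage points
deff = 1.5          # design effect
nonresponse = 0.10  # expected non-response rate

n0 = (z**2 * p * (1 - p)) / e**2            # Cochran's formula for a proportion
n = math.ceil(n0 * deff / (1 - nonresponse))  # inflate for clustering and non-response
print(n)
```

With these assumed inputs the result lands in the low thousands, the same order as the reported sample size; small differences in the chosen variant of the formula change the exact figure.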
We collected data among a representative sample of adolescent girls living in both urban and rural ITH areas to understand adolescents' access to information, use of SRH services and SRH-related decision-making autonomy before the implementation of the intervention. Based on the number of ITH health facilities in the two study counties, we purposively sampled three sub-counties in Homa Bay (West Kasipul, Ndhiwa and Kasipul) and three sub-counties in Narok (Narok Town, Narok South and Narok East). In each of the ITH intervention counties, data collection focused on the sub-counties that had been prioritized for the project. A stratified sampling procedure was used to select wards within the sub-counties and villages within the wards. Households were then selected from each village after all households in the village were listed. The purposive selection of sub-counties closer to ITH intervention facilities meant that urban and semi-urban areas were oversampled, due to the concentration of health facilities in urban areas.
Qualitative Sampling
Focus Group Discussion participants were recruited from the villages where the ITH adolescent household survey was conducted in both counties. A convenience sample of consenting adults living in the villages was invited to participate in the FGDs. A facilitator and a note-taker were trained on how to use the focus group guide, how to facilitate the group to elicit the information sought, and how to take detailed notes. All focus group discussions took place in the local language and were tape-recorded, and the consent process included permission to tape-record the session. Participants were identified only by their first names and were asked not to share what was discussed outside of the focus group. Participants were read an informed consent form and asked to give written consent. In-depth interviews were conducted with a purposively selected sample of consenting adolescent girls who had participated in the adolescent survey. We conducted a total of 45 in-depth interviews with adolescent girls (20 in Homa Bay County and 25 in Narok County). In addition, 8 FGDs (4 per county) were conducted with mothers of adolescent girls who were usual residents of the villages identified for the interviews, and another 4 FGDs (2 per county) with CHVs.
N/A
Face-to-face [f2f] for quantitative data collection; Focus Group Discussions and In-Depth Interviews for qualitative data collection
The questionnaire covered: socio-demographic and household information; SRH knowledge and sources of information; sexual activity and relationships; family planning knowledge, access, choice and use when needed; exposure to family planning messages; and voice and decision-making autonomy and quality of care for those who had visited health facilities in the 12 months before the survey. The questionnaire was piloted before data collection, and the questions were reviewed for appropriateness, comprehension and flow. The pilot involved a sample of 42 adolescent girls aged 15-19 (two per field interviewer) from a community outside the study counties.
The questionnaire was originally developed in English and later translated into Kiswahili. It was programmed on the ODK-based SurveyCTO platform for data collection and management and was administered through face-to-face interviews.
The survey tools were programmed using the ODK-based SurveyCTO platform for data collection and management. During programming, consistency checks were in-built into the data capture software which ensured that there were no cases of missing or implausible information/values entered into the database by the field interviewers. For example, the application included controls for variables ranges, skip patterns, duplicated individuals, and intra- and inter-module consistency checks. This reduced or eliminated errors usually introduced at the data capture stage. Once programmed, the survey tools were tested by the programming team who in conjunction with the project team conducted further testing on the application's usability, in-built consistency checks (skips, variable ranges, duplicating individuals etc.), and inter-module consistency checks. Any issues raised were documented and tracked on the Issue Tracker and followed up to full and timely resolution. After internal testing was done, the tools were availed to the project and field teams to perform user acceptance testing (UAT) so as to verify and validate that the electronic platform worked exactly as expected, in terms of usability, questions design, checks and skips etc.
Data cleaning was performed to ensure that the data were free of errors and that indicators generated from them were accurate and consistent. This process began on the first day of data collection, as the first records were uploaded into the database. The data manager used data collected during pilot testing to begin writing scripts in Stata 14 to check the variables in the data in 'real time'. This ensured the resolution of any inconsistencies that could be addressed by the data collection teams during fieldwork activities. The Stata 14 scripts that perform real-time checks and clean data also wrote to an .rtf file that detailed every check performed against each variable, any inconsistencies encountered, and all steps taken to address them. The .rtf files also reported when a variable was
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
| Demonstration Case Name | Multi-Hazards in the Downstream Area of the Adige River Basin. |
| Dataset Name/Title | NDVI observations for Common and Winter Wheat, Maize and Soybean – Adige River-Fed Downstream Irrigated Plain, 2022 |
| Dataset Description |
NDVI (Normalized Difference Vegetation Index) and Bare Soil Index (BSI) observations at the crop field level for Common and Winter Wheat, Maize, and Soybean. The dataset contains one row per crop field representing the average crop field NDVI, covering 10 distinct dates between March 9 and August 26, 2022. All observations are located in a plain area that relies on the Adige River for cropland irrigation, and are integrated with Hydrologic Soil Group data. Dataset column names:
|
| Key Methodologies |
Crop field-level NDVI values were calculated by averaging pixel-level NDVI derived from Sentinel-2 L2A observations within the irrigated districts fed by the Adige River waters (data courtesy of ANBI Veneto). Crop field-level NDVI values were linked to specific crops using in situ crop type information (data obtained from Regional Local Agencies). The observations in the dataset result from a multi-step data cleaning process. Raw observations were excluded if more than 50% of their pixels were unavailable (e.g., due to cloud cover); remaining observations were filtered using a Bare Soil Index (BSI) threshold of 0.08 (Mzid et al., 2021) to distinguish vegetated from non-vegetated (soil) pixels. Finally, fields associated with alternating crops required further processing to disentangle double entries and ensure that each satellite observation referred to the correct crop. Temporal NDVI profiles were inspected to identify two green-up periods separated by at least one observation identified as bare soil (BSI > 0.08). To disentangle alternating crop fields, the following assumptions were made: if soybean was one of the reported crops, the vegetation period following the bare-soil break was attributed to soybean, while the preceding growth phase was assigned to the initial crop (e.g., winter wheat or maize); in cases where no double cropping was reported but a second green-up phase was evident, it was assumed to result either from an unreported second crop or from spontaneous vegetation regrowth after harvesting. In such instances, only observations preceding the bare-soil interval and considered relevant to the declared crop growing season were retained for analysis.
|
| Temporal Domain | 2022 |
| Spatial Domain | The dataset is provided over the [10.7, 45.0, 12.3, 45.6] spatial domain (min longitude, min latitude, max longitude, max latitude in WGS84, EPSG:4326). |
| Key Variables/Indicators | Normalized Difference Vegetation Index (NDVI); Bare Soil Index (BSI) |
| Data Format | csv |
| Source Data |
|
| Accessibility | https://doi.org/10.5281/zenodo.15189872 |
| Stakeholder Relevance | The NDVI is a valuable indicator of vegetation health due to its ability to capture chlorophyll activity and biomass dynamics through the differential reflectance between red and near-infrared wavelengths. This index is widely recognized for its robustness in monitoring vegetation vigor, phenological stages, and crop stress. Associating field-level NDVI values with specific crops may enable the identification of patterns linked to co-occurring natural hazards, such as extreme dry and hot events. Moreover, the availability of information on soil hydraulic properties (retrieved from the Hydrologic Soil Group) allows relationships between average crop NDVI values and soil properties to be identified, potentially supporting adaptation strategies related to crop performance during dry and hot events. The reliability of the dataset has been further enhanced through a dedicated data cleaning methodology that distinguishes vegetated from non-vegetated crop fields and identifies alternating crops. |
| Limitations/Assumptions | In cases where a field was associated with more than one crop, a disaggregation technique was applied based on assumptions about crop growth phases. |
| Additional Outputs/Information | The dataset access is currently restricted due to pending related publication. |
| Contact Information | Albergo, Edoardo (CMCC Foundation - Euro-Mediterranean Center on Climate Change, National Biodiversity Future Center) - Data curator Furlanetto, Jacopo (CMCC Foundation - Euro-Mediterranean Center on Climate Change, National Biodiversity Future Center) - Data curator Masina, Marinella (CMCC Foundation - Euro-Mediterranean Center on Climate Change)- Data curator Maraschini, Margherita (CMCC Foundation - Euro-Mediterranean Center on Climate Change) - Data curator Ferrario, Davide Mauro (CMCC Foundation - Euro-Mediterranean Center on Climate Change) - Data curator Torresan, Silvia (CMCC Foundation - Euro-Mediterranean Center on Climate Change, National Biodiversity Future Center) - Data manager |
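The cleaning rules described under Key Methodologies (field-level averaging of pixel NDVI, dropping observations with more than 50% unavailable pixels, and the 0.08 BSI threshold) can be sketched as follows. The data structures and band values are illustrative assumptions, not the actual processing chain.

```python
# Hypothetical sketch of field-level NDVI aggregation and BSI-based
# filtering (pixel values and structures are invented for illustration).

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

def field_mean_ndvi(pixels, max_missing=0.5):
    """pixels: list of (nir, red) reflectance tuples; None = unavailable pixel."""
    valid = [p for p in pixels if p is not None]
    if len(valid) < len(pixels) * (1 - max_missing):
        return None                      # >50% of pixels unavailable: discard observation
    return sum(ndvi(nir, red) for nir, red in valid) / len(valid)

BSI_THRESHOLD = 0.08                     # Mzid et al. (2021)

def is_bare_soil(bsi):
    """Observations with BSI above the threshold count as bare soil."""
    return bsi > BSI_THRESHOLD

# One mostly-clear field with a single cloudy pixel:
field = [(0.45, 0.10), (0.50, 0.12), None, (0.48, 0.11)]
print(field_mean_ndvi(field))
```

Bare-soil flags produced this way are what allow the green-up periods of alternating crops to be separated, as described in the methodology.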
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE PALESTINIAN CENTRAL BUREAU OF STATISTICS
The Palestinian Central Bureau of Statistics (PCBS) carried out four rounds of the Labor Force Survey 2017 (LFS). The survey rounds covered a total sample of about 23,120 households (5,780 households per quarter).
The main objective of collecting data on the labour force and its components, including employment, unemployment and underemployment, is to provide basic information on the size and structure of the Palestinian labour force. Data collected at different points in time provide a basis for monitoring current trends and changes in the labour market and in the employment situation. These data, supported with information on other aspects of the economy, provide a basis for the evaluation and analysis of macro-economic policies.
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve and disseminate micro data from existing labor force surveys in several Arab countries.
The sample is representative at the region level (West Bank, Gaza Strip), by locality type (urban, rural, camp) and at the governorate level.
1- Household/family. 2- Individual/person.
The survey covered all Palestinian households whose usual residence is in the Palestinian Territory.
Sample survey data [ssd]
The methodology was designed according to the context of the survey, international standards, data processing requirements and comparability of outputs with other related surveys.
---> Target Population: All individuals aged 10 years and above who normally stay with their households in the State of Palestine during 2017.
---> Sampling Frame: The sampling frame consists of the master sample, which was updated in 2011: each enumeration area consists of buildings and housing units with an average of about 124 households. The master sample consists of 596 enumeration areas; we used 494 enumeration areas as a framework for the labor force survey sample in 2017 and these units were used as primary sampling units (PSUs).
---> Sampling Size: The estimated sample size is 5,780 households in each quarter of 2017.
---> Sample Design: The sample is a two-stage stratified cluster sample. First stage: a systematic random sample of 494 enumeration areas was selected for the whole round, excluding enumeration areas with fewer than 40 households. Second stage: a systematic random sample of households was selected from each enumeration area chosen in the first stage: 16 households from enumeration areas with 80 or more households, and 8 households from enumeration areas with fewer than 80 households.
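The second-stage rule can be illustrated with a simple systematic selection sketch. This is a generic illustration of systematic sampling under the 16/8 rule described above, not the actual PCBS selection procedure.

```python
import random

# Illustrative systematic random selection of households within an
# enumeration area: 16 households where the EA has 80+ households,
# 8 otherwise (actual PCBS procedure may differ in detail).

def systematic_sample(n_households, seed=None):
    take = 16 if n_households >= 80 else 8
    step = n_households / take           # sampling interval
    rng = random.Random(seed)
    start = rng.uniform(0, step)         # random start within the first interval
    return [int(start + i * step) for i in range(take)]

print(systematic_sample(100, seed=1))    # 16 selected household positions
print(systematic_sample(56, seed=1))     # 8 selected household positions
```

A systematic design like this spreads selected households evenly through the EA listing, which is why it is preferred over simple random selection for field listings.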
---> Sample strata: The population was divided by: 1- Governorate (16 governorates); 2- Type of locality (urban, rural, refugee camps).
---> Sample Rotation: Each round of the Labor Force Survey covers all 494 master sample enumeration areas. The areas remain fixed over time, but households in 50% of the EAs were replaced in each round. The same households remain in the sample for two consecutive rounds, are left out for the next two rounds, and are then selected again for another two consecutive rounds before being dropped from the sample. An overlap of 50% is thus achieved both between consecutive rounds and between consecutive years (making the sample efficient for monitoring purposes).
Face-to-face [f2f]
The survey questionnaire was designed according to the International Labour Organization (ILO) recommendations. The questionnaire includes four main parts:
---> 1. Identification Data: The main objective for this part is to record the necessary information to identify the household, such as, cluster code, sector, type of locality, cell, housing number and the cell code.
---> 2. Quality Control: This part involves groups of control standards to monitor the field and office operations and to keep the questionnaire stages in order (data collection, field and office coding, data entry, editing after entry, and data storage).
---> 3. Household Roster: This part involves demographic characteristics about the household, like number of persons in the household, date of birth, sex, educational level…etc.
---> 4. Employment Part: This part covers the major research indicators: a questionnaire was answered by every household member aged 15 years and over, to explore their labour force status and identify their major characteristics with respect to employment status, economic activity, occupation, place of work, and other employment indicators.
---> Raw Data: PCBS started collecting data in the first quarter of 2017 using handheld devices (HHDs) in Palestine, excluding Jerusalem inside the borders (J1) and the Gaza Strip. The HHD program, based on SQL Server and Microsoft .NET, was developed by the General Directorate of Information Systems. Using HHDs reduced the number of data processing stages: fieldworkers collect data and send it directly to the server, and the project manager can retrieve the data at any time. To work in parallel with the Gaza Strip and Jerusalem inside the borders (J1), an office program was developed using the same techniques and the same database as the HHDs.
---> Harmonized Data - The SPSS package is used to clean and harmonize the datasets. - The harmonization process starts with a cleaning process for all raw data files received from the Statistical Agency. - All cleaned data files are then merged to produce one data file on the individual level containing all variables subject to harmonization. - A country-specific program is generated for each dataset to generate/ compute/ recode/ rename/ format/ label harmonized variables. - A post-harmonization cleaning process is then conducted on the data. - Harmonized data is saved on the household as well as the individual level, in SPSS and then converted to STATA, to be disseminated.
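The country-specific generate/recode/rename step described above can be sketched in miniature. The variable names and code lists below are invented for illustration; the actual harmonization programs run in SPSS over the full variable set.

```python
# Minimal sketch of a harmonization step: raw variables are renamed and
# recoded into a harmonized scheme (names and code lists are hypothetical).

RENAME = {"sx": "sex", "emp_stat": "employment_status"}
RECODE = {"employment_status": {1: "employed", 2: "unemployed", 3: "out_of_labour_force"}}

def harmonize(record):
    # Rename raw variables to the harmonized names...
    out = {RENAME.get(k, k): v for k, v in record.items()}
    # ...then recode their values to harmonized labels.
    for var, mapping in RECODE.items():
        if var in out:
            out[var] = mapping.get(out[var], out[var])
    return out

raw = {"sx": 1, "emp_stat": 2, "age": 34}
print(harmonize(raw))
```

Running the same mapping over every country's raw file is what makes the resulting datasets comparable across surveys and years.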
The survey sample consists of about 30,230 households, of which 23,120 completed the interview: 14,682 households in the West Bank and 8,438 in the Gaza Strip. Weights were adjusted to account for non-response. The response rate reached 82.4% in the West Bank and 92.7% in the Gaza Strip.
---> Sampling Errors Data of this survey may be affected by sampling errors due to use of a sample and not a complete enumeration. Therefore, certain differences can be expected in comparison with the real values obtained through censuses. Variances were calculated for the most important indicators: the variance table is attached with the final report. There is no problem in disseminating results at national or governorate level for the West Bank and Gaza Strip.
---> Non-Sampling Errors Non-statistical errors are possible at all stages of the project, during data collection or processing. These include non-response errors, response errors, interviewing errors, and data entry errors. To avoid errors and reduce their effects, great efforts were made to train the fieldworkers intensively: they were trained on how to carry out the interview, what to discuss and what to avoid, and received practical and theoretical training during the training course, including a pilot survey. Data entry staff were also trained on the data entry program, which was examined before the data entry process started. To monitor the progress of fieldwork activities and limit obstacles, continuous contact was maintained with the fieldwork team through regular visits to the field and regular meetings during the different field visits. Problems faced by fieldworkers were discussed to clarify any issues. Non-sampling errors can occur at the various stages of survey implementation, whether in data collection or in data processing, and are generally difficult to evaluate statistically.
They cover a wide range of errors, including errors resulting from non-response, sampling frame coverage, coding and classification, data processing, and survey response (both respondent- and interviewer-related). Effective training and supervision and the careful design of questions have a direct bearing on limiting the magnitude of non-sampling errors, and hence on enhancing the quality of the resulting data. The survey encountered non-response mainly in cases where the household was not present at home during the fieldwork visit or the housing unit was vacant. The total non-response rate reached 14.2%, which is very low compared to household surveys conducted by PCBS. The refusal rate reached 3.0%, which is a very low percentage compared to the
The General Household Survey-Panel (GHS-Panel) is implemented in collaboration with the World Bank Living Standards Measurement Study (LSMS) team as part of the Integrated Surveys on Agriculture (ISA) program. The objectives of the GHS-Panel include the development of an innovative model for collecting agricultural data, interinstitutional collaboration, and comprehensive analysis of welfare indicators and socio-economic characteristics. The GHS-Panel is a nationally representative survey of approximately 5,000 households, which are also representative of the six geopolitical zones. The 2023/24 GHS-Panel is the fifth round of the survey, with prior rounds conducted in 2010/11, 2012/13, 2015/16 and 2018/19. The GHS-Panel households were visited twice: during the post-planting period (July - September 2023) and during the post-harvest period (January - March 2024).
National
• Households • Individuals • Agricultural plots • Communities
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
The original GHS‑Panel sample was fully integrated with the 2010 GHS sample. The GHS sample consisted of 60 Primary Sampling Units (PSUs) or Enumeration Areas (EAs), chosen from each of the 37 states in Nigeria. This resulted in a total of 2,220 EAs nationally. Each EA contributed 10 households to the GHS sample, resulting in a sample size of 22,200 households. Out of these 22,200 households, 5,000 households from 500 EAs were selected for the panel component, and 4,916 households completed their interviews in the first wave.
After nearly a decade of visiting the same households, a partial refresh of the GHS‑Panel sample was implemented in Wave 4 and maintained for Wave 5. The refresh was conducted to maintain the integrity and representativeness of the sample. The refresh EAs were selected from the same sampling frame as the original GHS‑Panel sample in 2010. A listing of households was conducted in the 360 EAs, and 10 households were randomly selected in each EA, resulting in a total refresh sample of approximately 3,600 households.
In addition to these 3,600 refresh households, a subsample of the original 5,000 GHS‑Panel households from 2010 were selected to be included in the new sample. This “long panel” sample of 1,590 households was designed to be nationally representative to enable continued longitudinal analysis for the sample going back to 2010. The long panel sample consisted of 159 EAs systematically selected across Nigeria’s six geopolitical zones.
The combined sample of refresh and long panel EAs in Wave 5 that were eligible for inclusion consisted of 518 EAs based on the EAs selected in Wave 4. The combined sample generally maintains both the national and zonal representativeness of the original GHS‑Panel sample.
Although 518 EAs were identified for the post-planting visit, conflict events prevented interviewers from visiting eight EAs in the North West zone of the country. The EAs were located in the states of Zamfara, Katsina, Kebbi and Sokoto. Therefore, the final number of EAs visited both post-planting and post-harvest comprised 157 long panel EAs and 354 refresh EAs. The combined sample is also roughly equally distributed across the six geopolitical zones.
Computer Assisted Personal Interview [capi]
The GHS-Panel Wave 5 consisted of three questionnaires for each of the two visits. The Household Questionnaire was administered to all households in the sample. The Agriculture Questionnaire was administered to all households engaged in agricultural activities such as crop farming, livestock rearing, and other agricultural and related activities. The Community Questionnaire was administered to the community to collect information on the socio-economic indicators of the enumeration areas where the sample households reside.
GHS-Panel Household Questionnaire: The Household Questionnaire provided information on demographics; education; health; labour; childcare; early child development; food and non-food expenditure; household nonfarm enterprises; food security and shocks; safety nets; housing conditions; assets; information and communication technology; economic shocks; and other sources of household income. Household location was geo-referenced in order to be able to later link the GHS-Panel data to other available geographic data sets (forthcoming).
GHS-Panel Agriculture Questionnaire: The Agriculture Questionnaire solicited information on land ownership and use; farm labour; inputs use; GPS land area measurement and coordinates of household plots; agricultural capital; irrigation; crop harvest and utilization; animal holdings and costs; household fishing activities; and digital farming information. Some information is collected at the crop level to allow for detailed analysis for individual crops.
GHS-Panel Community Questionnaire: The Community Questionnaire solicited information on access to infrastructure and transportation; community organizations; resource management; changes in the community; key events; community needs, actions, and achievements; social norms; and local retail price information.
The Household Questionnaire was slightly different for the two visits. Some information was collected only in the post-planting visit, some only in the post-harvest visit, and some in both visits.
The Agriculture Questionnaire collected different information during each visit, but for the same plots and crops.
The Community Questionnaire collected prices during both visits, and different community level information during the two visits.
CAPI: The Wave 5 exercise was conducted using Computer Assisted Personal Interview (CAPI) techniques. All three questionnaires (household, agriculture, and community) were implemented in both the post-planting and post-harvest visits of Wave 5 using the CAPI software Survey Solutions. The Survey Solutions software was developed and maintained by the Living Standards Measurement Unit within the Development Economics Data Group (DECDG) at the World Bank. Each enumerator was given a tablet on which to conduct the interviews. Overall, implementation of the survey using Survey Solutions CAPI was highly successful, as it allowed for timely availability of the data from completed interviews.
DATA COMMUNICATION SYSTEM: The data communication system used in Wave 5 was highly automated. Each field team was given a mobile modem, which allowed for internet connectivity and daily synchronization of their tablets and ensured that the head office in Abuja had access to the data in real time. Once an interview was completed and uploaded to the server, the data were first reviewed by the Data Editors. The data were also downloaded from the server, and a Stata dofile was run on them to check for additional errors not captured by the Survey Solutions application. Running the dofile on the raw dataset generated an Excel error file, and the information it contained was then communicated back to the respective field interviewers for action. This monitoring was done daily throughout the duration of the survey, in both the post-planting and post-harvest visits.
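The daily check-and-report loop described above can be illustrated with a minimal sketch (written in Python rather than Stata, with hypothetical field names and rules; the actual checks lived in the project's Stata dofile):

```python
# Minimal sketch of a post-synchronization consistency check. The field
# names ("head_age", "plots_planted", "harvest_kg") and thresholds are
# hypothetical illustrations, not the survey's actual variables.
def check_interview(record):
    """Return a list of error messages for one completed interview."""
    errors = []
    # Range check: household head age should be plausible.
    if not (15 <= record.get("head_age", 0) <= 110):
        errors.append("head_age out of range")
    # Consistency check: a reported harvest requires a planted plot.
    if record.get("harvest_kg", 0) > 0 and record.get("plots_planted", 0) == 0:
        errors.append("harvest reported but no plots planted")
    return errors

interviews = [
    {"id": "HH001", "head_age": 42, "plots_planted": 2, "harvest_kg": 350},
    {"id": "HH002", "head_age": 7, "plots_planted": 0, "harvest_kg": 120},
]
# Compile an error report, keyed by household, to send back to field teams.
report = {r["id"]: check_interview(r) for r in interviews}
```

In the actual workflow the analogous report was written to an Excel file and communicated to the responsible interviewer for correction.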
DATA CLEANING: The data cleaning process was done in three main stages. The first stage was to ensure proper quality control during the fieldwork. This was achieved in part by incorporating validation and consistency checks into the Survey Solutions application used for data collection, designed to highlight many of the errors that occurred during fieldwork.
The second stage of cleaning involved the Data Editors and Data Assistants (Headquarters in Survey Solutions). As indicated above, once an interview was completed and uploaded to the server, the Data Editors reviewed it for inconsistencies and extreme values. Depending on the outcome, they could either approve or reject the case; if rejected, the case went back to the respective interviewer's tablet upon synchronization. Special care was taken to ensure that the households included in the data matched the selected sample, and any differences were properly assessed and documented. The agriculture data were also checked to ensure that the plots identified in the main sections merged with the plot information identified in the other sections. Additional errors observed were compiled into error reports that were regularly sent to the teams and corrected through re-visits to the household on the instruction of the supervisor. The data that had gone through this stage of cleaning were then approved by the Data Editor. After the Data Editor's approval of the interview on the Survey Solutions server, Headquarters also reviewed it and, depending on the outcome, could either reject or approve it.
The third stage of cleaning involved a comprehensive review of the final raw data following the first and second stages. Every variable was examined individually for (1) consistency with other sections and variables, (2) out-of-range responses, and (3) outliers. However, special care was taken to avoid making strong assumptions when resolving potential errors, and some minor errors remain in the data where the diagnosis and/or solution were unclear to the data cleaning team.
License: https://www.bco-dmo.org/dataset/651880/license
Dissolved lead data collected from the R/V Pourquoi pas (GEOVIDE) in the North Atlantic and Labrador Sea (section GA01) during 2014. access_formats=.htmlTable,.csv,.json,.mat,.nc,.tsv,.esriCsv,.geoJson acquisition_description=Sample storage bottle lids and threads were soaked overnight in 2N reagent grade HCl, then filled with 1N reagent grade HCl and heated in an oven at 60 degrees Celsius overnight, inverted, heated for a second day, and rinsed 5X with pure distilled water. The bottles were then filled with trace metal clean dilute HCl (0.01N HCl) and again heated in the oven for one day on either end. Clean sample bottles were emptied and double-bagged prior to rinsing and filling with sample.
As stated in the cruise report, trace metal clean seawater samples were collected using the French GEOTRACES clean rosette (General Oceanics Inc. Model 1018 Intelligent Rosette), equipped with twenty-two new 12L GO-FLO bottles (two bottles were leaking and were never deployed during the cruise). The 22 new GO-FLO bottles were initially cleaned in the LEMAR laboratory following the GEOTRACES procedures (Cutter and Bruland, 2012). The rosette was deployed on a 6mm Kevlar cable with a dedicated custom-designed clean winch. Immediately after recovery, GO-FLO bottles were individually covered at each end with plastic bags to minimize contamination. They were then transferred into a clean container (class-100) for sampling. On each trace metal cast, nutrient and/or salinity samples were taken to check for potential leakage of the GO-FLO bottles. Prior to filtration, GO-FLO bottles were mixed manually three times and pressurized to less than 8 psi with 0.2-um filtered N2 (Air Liquide). For Stations 1, 11, 15, 17, 19, 21, 25, 26, 29 and 32, GO-FLO spigots were fitted with an acid-cleaned piece of Bev-a-Line tubing that fed into a 0.2 um capsule filter (SARTOBRAN 300, Sartorius). For all other stations (13, 34, 36, 38, 40, 42, 44, 49, 60, 64, 68, 69, 71, 77), seawater was filtered directly through paired filters (Pall Gelman Supor 0.45um polyethersulfone, and Millipore mixed ester cellulose MF 5 um) mounted in Swinnex polypropylene filter holders, following the Planquette and Sherrell (2012) method. Filters were cleaned following the protocol described in Planquette and Sherrell (2012) and kept in acid-cleaned 1L LDPE bottles (Nalgene) filled with ultrapure water (Milli-Q, 18.2 megaohm/cm) until use. Subsamples were taken into acid-cleaned (see above) Nalgene HDPE bottles after a triple rinse with the sample. All samples were acidified back in the Boyle laboratory at 2mL per liter of seawater (pH 2) with trace metal clean 6N HCl.
On this cruise, only the particulate samples were assigned GEOTRACES numbers. In this dataset, the dissolved Pb samples collected at the same depth (sometimes on a different cast) as the particulate samples have been assigned identifiers as "SAMPNO", which correspond to the particulate GEOTRACES numbers. In cases where there was no corresponding particulate sample, a number was generated as "PI_SAMPNO".
Upon examining the data, we observed that the sample taken from rosette position 1 (usually the near-bottom sample) was always higher in [Pb] than the sample taken immediately above it, and that the excess decreased as the cruise proceeded. The Pb isotope ratios of these samples were higher than those of the comparison bottles as well. A similar situation was seen for the samples taken from rosette positions 5, 20 and 21 when compared to the depth-interpolated [Pb] from the samples immediately above and below. Also, at two stations where our near-bottom sample was taken from rosette position 2, there was no [Pb] excess over the samples immediately above. We believe that this evidence points to sampler-induced contamination that was being slowly, but never completely, washed out during the cruise. We have therefore flagged all of these analyses with a "3", indicating that we do not believe these samples should be trusted as reflecting the true ocean [Pb].
In addition, we observed high [Pb] in the samples at Station 1 and very scattered Pb isotope ratios. The majority of these concentrations were far in excess of the values observed at nearby Station 11 and at the nearby USGT10-01. Discussion among other cruise participants revealed similarly anomalous data for other trace metals (e.g., Hg species). After discussion at the 2016 GEOVIDE Workshop, we concluded that this is evidence of GO-FLO bottles not having had sufficient time to "clean up" prior to use, and that most or all bottles from Station 1 were contaminated. We flagged all Station 1 data with a "3", indicating that we do not believe these values reflect the true ocean [Pb].
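The flagging rules described in the two paragraphs above can be summarized in a short sketch (a simplified illustration of the stated logic, not the authors' actual processing code; the "good" flag value of 1 is an assumption):

```python
# Flag = 3 (do not trust as true ocean [Pb]) for: all samples from Station 1,
# and samples drawn from the suspect rosette positions at any station.
SUSPECT_POSITIONS = {1, 5, 20, 21}

def quality_flag(station, rosette_position):
    if station == 1 or rosette_position in SUSPECT_POSITIONS:
        return 3  # flagged: suspected sampler- or bottle-induced contamination
    return 1  # hypothetical "good" flag for unaffected samples
```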
Samples were analyzed at least 1 month after acidification over 36 analytical sessions by a resin pre-concentration method. This method utilized the isotope-dilution ICP-MS method described in Lee et al. (2011), which includes pre-concentration on nitrilotriacetate (NTA) resin and analysis on a Fisons PQ2+ using a 400 uL/min nebulizer. Briefly, samples were poured into 30mL subsample bottles. Then, triplicate 1.5mL polypropylene vials (Nalgene) were rinsed three times with the 30mL subsample, and 1.3mL was pipetted from the 30mL subsample into each 1.5mL vial. Pipettes were calibrated daily to the desired volume. 25 uL of a 204Pb spike was added to each sample, and the pH was raised to 5.3 using a trace metal clean ammonium acetate buffer prepared at a pH of between 7.95 and 7.98. 2400 beads of NTA Superflow resin (Qiagen Inc., Valencia, CA) were added to the mixture, and the vials were set to shake on a shaker for 3-6 days to allow the sample to equilibrate with the resin. After equilibration, the beads were centrifuged and washed 3 times with pure distilled water, using a trace metal clean siphon tip to remove the water wash from the sample vial following centrifugation. After the last wash, 350 uL of a 0.1N solution of trace metal clean HNO3 was added to the resin to elute the metals, and the samples were set to shake on a shaker for 1-2 days prior to analysis by ICP-MS.
NTA Superflow resin was cleaned by batch rinsing with 0.1N trace metal clean HCl for a few hours, followed by multiple washes until the pH of the solution was above 4. Resin was stored at 4 degrees Celsius in the dark until use, though it was allowed to equilibrate to room temperature prior to addition to the sample.
Nalgene polypropylene (PPCO) vials were cleaned by heated submersion for 2 days at 60 degrees Celsius in 1N reagent grade HCl, followed by a bulk rinse and a 4X individual rinse of each vial with pure distilled water. Each vial was then filled with trace metal clean dilute HCl (0.01N HCl) and heated in the oven at 60 degrees Celsius for one day on either end. Vials were kept filled until just before usage.
On each day of sample analysis, procedure blanks were determined using 12 replicates of 300 uL of an in-house standard reference seawater (low-Pb surface water), where the amount of Pb in the 300 uL was verified as negligible. The procedural blank over the relevant sessions for the resin pre-concentration method ranged from 2.2-9.9 pmol/kg, averaging 4.6 +/- 1.7 pmol/kg. Within a day, procedure blanks were very reproducible, with an average standard deviation of 0.7 pmol/kg, resulting in a detection limit (3x this standard deviation) of 2.1 pmol/kg. Replicate analyses of three different large-volume seawater samples (one with 11 pmol/kg, another with 24 pmol/kg, and a third with 38 pmol/kg) indicated that the precision of the analysis is 4% or 1.6 pmol/kg, whichever is larger.
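The detection limit quoted above follows directly from the replicate blanks (three times the within-day blank standard deviation); a sketch with invented blank values, not the actual measurements:

```python
import statistics

# Hypothetical within-day procedure blanks in pmol/kg. (The real blanks
# averaged 4.6 pmol/kg with a typical within-day sd of 0.7 pmol/kg.)
blanks = [4.2, 4.9, 5.1, 4.4, 4.7, 3.9]

mean_blank = statistics.mean(blanks)       # blank value subtracted from samples
sd_blank = statistics.stdev(blanks)        # within-day blank reproducibility
detection_limit = 3 * sd_blank             # 3x the blank sd, as in the text
```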
Triplicate analyses of an international reference standard gave SAFe D2: 27.2 +/- 1.7 pmol/kg. This standard run was also linked to our own long-term quality control standards, which are run on every analytical day to maintain long-term consistency.
For the most part, the reported numbers are simply as calculated from the isotope dilution equation on the day of the analysis. For some analytical days, however, quality control samples indicated offsets in the blank used to correct the samples. For the upper 5 depths of Station 29, all depths of Station 40, and the deepest 2 depths of Station 42, the quality control samples indicated our blank was overcorrecting by 3.4 pM, and we applied a -3.4 pM correction to our Pb concentrations for that day. For the deepest 11 depths of Station 34, the quality control samples indicated our blank was overcorrecting by 10.2 pM (due to contamination of the low trace metal seawater stock), and we applied a -10.2 pM correction to our Pb concentrations for that day. With these corrections, the overall internal comparability of the Pb collection should be better than 4%.
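The isotope dilution calculation referred to here has, in generic form, the standard single-spike structure. The notation below is ours, shown for a 204Pb spike and a measured 208Pb/204Pb ratio in the spiked mixture; it is a sketch of the standard relation, not the authors' exact formulation:

```latex
% Moles of natural Pb in the sample, n_s, from moles of spike added, n_{sp},
% the measured 208/204 ratio R_m of the spiked sample, and the isotopic
% abundances a^{208}, a^{204} of the sample (s) and the spike (sp):
n_{s} \;=\; n_{sp}\,
\frac{R_{m}\, a^{204}_{sp} \;-\; a^{208}_{sp}}
     {a^{208}_{s} \;-\; R_{m}\, a^{204}_{s}}
```

This follows from writing the total 208Pb and 204Pb in the mixture as sums of the sample and spike contributions and solving the measured ratio for n_s.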
The errors associated with these Pb concentration measurements average 3.2% of the concentration (0.1-4.4 pmol/kg). Although there was a formal crossover station (Station 1) that overlaps with USGT10-01 (GA-03), sample quality on the first station of GEOVIDE appears problematic, making the comparison unhelpful. However, GEOVIDE Station 11 (40.33 degrees North, 12.22 degrees West) is not too far from USGT10-01 (38.325 degrees North, 9.66 degrees West) and makes for a reasonable comparison. It should also be noted that the MIT lab has intercalibrated Pb with other labs on the 2008 IC1 cruise, the 2011 USGT11 (GA-03) cruise, and the EPZT (GP-16) cruise, and maintains in-lab quality control standards for long-term data quality evaluation.
Ten percent of the samples were analyzed by Rick Kayser and the remaining ninety percent by Cheryl Zurbrick; there was no significant difference between them for the lowest-concentration large-volume seawater sample.
Evaluation Design
Impact Evaluation: matched comparison group Difference-in-Differences (DID) and analysis of time-series high-resolution aerial photography.
Performance Evaluation: analysis of time-series land administration, cadastral, bank and building permit data covering the period before and after the start of LARP-related activities.
The Millennium Challenge Corporation (MCC) established a partnership with Michigan State University (MSU) to design and conduct the evaluation of the Land Administration Reform Project (LARP). A matched comparison group difference-in-differences evaluation strategy was designed and baseline data were collected in March-June 2013 (https://data.mcc.gov/evaluations/index.php/catalog/85), aimed at testing whether the following expected outcomes were realized and attributable to LARP: (1) reduction in the financial and time burden of conducting land transactions with the LAA and increased efficiency in rendering land administration services to the public by the LAA; (2) reduction in the time for land conflict resolution and in land-related conflicts within the intervention areas amongst the 55,000 lease holders; (3) increased number of land parcels used as collateral for mortgages, and increased property investment, subleasing, rentals and other economic activities; (4) increased frequency of formal land transactions, increased land values, and increased base case mortgage lending volume; (5) increased household income of primary and secondary beneficiaries; (6) increased understanding by Basotho of their rights and knowledge about services rendered by the LAA; and (7) increased willingness of other landowners outside the regularization impact areas to request formal land title.
After reviewing the initial evaluation design and baseline data, complementary approaches were employed for the impact and performance evaluation of LARP-related activities. Firstly, with the administration of the follow-up household survey, a DID approach with propensity score matching (PSM) is used to assess the impact of the program at the household, individual and plot levels. This approach is complemented by time-series high-resolution aerial photography (with a geographic discontinuity design) to assess the impact of LARP-related activities on new construction and/or expansion of existing built-up area at the plot level. Secondly, administrative data from different sources were compiled for the performance evaluation: spatial and textual land administration data (including mortgage and transaction registration) from the Lesotho Land Administration Authority; data on the incidence of issuance of building permits from Maseru City Council; and gender-disaggregated loans by commercial banks and microfinance institutions from the Lesotho Central Bank. These data are used, among others, to assess the impact of legal, regulatory and institutional reforms on: (i) gender equality with regard to access to formalized residential land and land rights; (ii) the duration to register mortgages and land transactions; and (iii) the formalization of new construction or expansion of existing structures authorized by building permits.
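The core DID logic behind the household-level impact estimate can be sketched in a few lines (toy numbers for illustration only, not actual survey results; in practice the treated and comparison groups are first matched via PSM):

```python
# Difference-in-differences: the treatment effect is the change over time in
# the treated group minus the change in the matched comparison group, which
# nets out any common secular trend.
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical group means of some outcome (e.g. investment on a plot).
effect = did_estimate(treated_pre=10.0, treated_post=15.0,
                      control_pre=8.0, control_post=10.0)
# A 5-unit rise among the treated, minus a 2-unit trend in the comparison
# group, attributes a 3-unit effect to the program.
```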
The follow-up survey was conducted by the Lesotho Bureau of Statistics, and the extraction of vector data from the 2009 and 2016 high-resolution imagery was done by the World Bank research team.
The survey covered selected wards in Maseru city: MMC1, MMC2, MMC3 and MMC27
Households, properties/parcels, individuals
Sample survey data [ssd]
Computer Assisted Personal Interview [capi]
Complementing the baseline questionnaire, the endline questionnaire consists of over 20 sections with modules on:
1. Household identification
2. Identifying household members at baseline/new members
3. Household characteristics (demographic information for each member of the household)
4. Employment and sources of any other cash transfers
5. Identification and list of all the parcels
6. Information on parcel acquisition, documents, land value
7. Land conflicts
8. Rights to the land and perceptions of risk
9. Parcels rented out, rented in
10. Characteristics of parcels
11. Investments on land
12. Perceptions about lease, renting land, the land law, women's rights and the LAA
13. Ownership of assets
14. Expenditures
15. Credit in the last 12 months
16. Consumption
17. Sale of household goods in the last 12 months
18. Woman module - land ownership by women
19. Woman module - knowledge, perceptions, and opinion about land issues
20. Woman module - decision making (woman respondent)
As the survey was conducted through CAPI, the survey routing and many of the survey logic checks were automated and completed during fieldwork. This minimized the extent of data cleaning required after fieldwork.
The data cleaning process was done in multiple stages. The first step was to ensure proper quality control during the fieldwork to ensure the accuracy of the final dataset. Errors caught at the fieldwork stage were corrected based on re-visits to the household on the instruction of the supervisor. The data that had gone through this first stage of cleaning were then sent from the field to the head office of BOS, where a second stage of data cleaning was undertaken. During the second stage the data were examined for out-of-range values and outliers, as well as for missing information on required variables and sections. Any problems found were reported back to the supervisors for correction. This was an ongoing process until all data were delivered to the head office.
After all the data were received by the head office, there was an overall review of the data to identify outliers and other errors on the complete dataset. Problems that were identified in the process were reported to the supervisors for further corrections. The questionnaires were also checked for completeness and where necessary the relevant households were re-visited and a report sent back to the head office with the corrections.
The final stage of the cleaning process was to ensure that the household- and individual-level datasets were correctly merged across all sections of the household questionnaire. Special care was taken to see that the households included in the data matched the selected sample, and any discrepancies were properly assessed and documented.
The Consumer Price surveys primarily provide the following:
- Data on the CPI in Palestine covering the West Bank, Gaza Strip and Jerusalem J1 for major and sub-groups of expenditure.
- Statistics needed by decision-makers, planners and those interested in the national economy.
- A contribution to the preparation of quarterly and annual national accounts data.
Consumer prices and indices are used for a wide range of purposes, the most important of which are as follows:
- Adjustment of wages, government subsidies and social security benefits to compensate in part or in full for changes in living costs.
- Providing an index to measure the price inflation of the entire household sector, which is used to remove the impact of inflation from the components of the final consumption expenditure of households in the national accounts and from income and national aggregates.
- Measuring inflation rates and economic recession, for which price index numbers are widely used.
- Serving the public as a guide for the family with regard to its budget and its constituent items.
- Monitoring changes in the prices of goods traded in the market and the consequent position of price trends, market conditions and living costs. However, the price index does not reflect other factors affecting the cost of living, e.g. the quality and quantity of purchased goods; it is therefore only one of many indicators used to assess living costs.
- Identifying the purchasing power of money directly, since the purchasing power of money is inversely proportional to the price index.
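A CPI of this kind is typically compiled as a base-weighted (Laspeyres-type) index; a minimal sketch with made-up prices and base-period quantities (the actual PCBS weighting scheme is not described here):

```python
# Laspeyres price index: current-period prices valued at base-period
# quantities, relative to base-period expenditure, times 100.
def laspeyres_index(p0, p1, q0):
    base_cost = sum(p * q for p, q in zip(p0, q0))
    current_cost = sum(p * q for p, q in zip(p1, q0))
    return 100.0 * current_cost / base_cost

# Hypothetical basket of three goods.
p0 = [2.0, 5.0, 10.0]   # base-period prices
p1 = [2.2, 5.0, 11.0]   # current-period prices
q0 = [10, 4, 1]         # base-period quantities
index = laspeyres_index(p0, p1, q0)
# Base cost 50.0, current cost 53.0 -> index 106.0 (6% price increase).
```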
Palestine: West Bank, Gaza Strip, Jerusalem
The target population for the CPI survey is the shops and retail markets such as grocery stores, supermarkets, clothing shops, restaurants, public service institutions, private schools and doctors.
Sample survey data [ssd]
A non-probability purposive sample of the sources from which the prices of different goods and services are collected was updated based on the 2017 establishment census, in a manner that achieves full coverage of all goods and services that fall within the Palestinian consumer system. These sources were selected based on the availability of the goods within them. The sample of sources was selected from the main cities of Palestine: Jenin, Tulkarm, Nablus, Qalqiliya, Ramallah, Al-Bireh, Jericho, Jerusalem, Bethlehem, Hebron, Gaza, Jabalia, Deir Al-Balah, Nusseirat, Khan Yunis and Rafah, and was considered representative of the variation that can occur in the prices collected from the various sources. The number of goods and services included in the CPI is approximately 730 commodities, whose prices were collected from 3,200 sources. The Classification of Individual Consumption According to Purpose (COICOP) is used for consumer data, as recommended by the United Nations System of National Accounts (SNA 2008).
Not applicable
Computer Assisted Personal Interview [capi]
A tablet-supported electronic form was designed for the price surveys to be used by the field teams in collecting data from the different governorates, with the exception of Jerusalem J1. The electronic form is supported with GIS and GPS mapping techniques that allow the field workers to locate the outlets exactly on the map and the administrative staff to manage the fieldwork remotely. The electronic questionnaire is divided into a number of screens:
- First screen: shows the metadata for the data source: governorate name, governorate code, source code, source name, full source address, and phone number.
- Second screen: shows the source interview result, which is either completed, temporarily paused or permanently closed. It also shows the change activity as incomplete or rejected, with an explanation of the reason for rejection.
- Third screen: shows the item code, item name, item unit, item price, product availability, and reason for unavailability.
- Fourth screen: checks the price data of the related source and verifies their validity through the auditing rules designed specifically for the price programs.
- Fifth screen: saves and sends data through a VPN connection and Wi-Fi technology.
In case of the Jerusalem J1 Governorate, a paper form has been designed to collect the price data so that the form in the top part contains the metadata of the data source and in the lower section contains the price data for the source collected. After that, the data are entered into the price program database.
The price survey forms were already encoded by the project management according to the specific international statistical classification of each survey. After the researcher collected the price data and sent them electronically, the data were reviewed and audited by the project management. Achievement reports were reviewed on a daily and weekly basis, and the detailed price reports at data source level were checked and reviewed daily by the project management. If there were any queries, the researcher was consulted to verify the data and to contact the source's owner to correct or confirm the information.
At the end of the data collection process in all governorates, the data were edited using the following process:
- Logical revision of prices by comparing the prices of goods and services with those from different sources and other governorates; whenever a mistake was detected, it was returned to the field for correction.
- Mathematical revision of the average prices for items in each governorate and of the general average across all governorates.
- Field revision of prices through selecting a sample of the prices collected for the items.
Not applicable
The findings of the survey may be affected by sampling errors due to the use of a sample rather than a total enumeration of the units of the target population, which increases the chance of variance between the actual values and those we expect to obtain from the data. The computation of variances for the most important key goods showed that the variation differs by good, reflecting the specialty of each survey. For the CPI, the variation between goods was very low, except in some cases such as bananas, tomatoes, and cucumbers, which had a high coefficient of variation during 2019 due to the high oscillation in their prices. The variance of the key goods in the computed and disseminated CPI survey was carried out at the Palestine level for reasons related to sample design and variance calculation of different indicators, since a lack of weights made it difficult to disseminate results by governorate.
Non-sampling errors are possible at all stages of data collection and data entry, and include:
- Non-response errors: the selected sources demonstrated significant cooperation with interviewers, so no cases of non-response were reported during 2019.
- Response errors (respondent), interviewing errors (interviewer), and data entry errors: to avoid these types of errors and reduce their effect to a minimum, project managers adopted a number of procedures, including making more than one visit to every source to explain the objectives of the survey and emphasize the confidentiality of the data. These visits to data sources contributed to strengthening relations and cooperation and to the verification of data accuracy.
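The coefficient of variation referred to above is simply the standard deviation of an item's collected price quotes relative to their mean; a sketch with invented quotes (not actual PCBS data):

```python
import statistics

def coefficient_of_variation(prices):
    """CV as a percentage: relative dispersion of price quotes for one item."""
    return 100.0 * statistics.stdev(prices) / statistics.mean(prices)

# Hypothetical quotes from several sources for two items.
stable_item = [4.0, 4.1, 3.9, 4.0]      # e.g. a packaged good
volatile_item = [2.0, 3.5, 1.5, 4.0]    # e.g. seasonal produce such as tomatoes
```

Seasonal produce with oscillating prices, as noted for bananas, tomatoes and cucumbers, yields a much higher CV than a price-stable packaged good.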
Interviewer errors: a number of procedures were taken to ensure data accuracy throughout the process of field data compilation:
- Interviewers were selected based on educational qualification, competence, and assessment.
- Interviewers were trained theoretically and practically on the questionnaire, meetings were held to remind interviewers of instructions, and explanatory notes were supplied with the surveys.
A number of procedures were also taken to verify data quality, consistency, and accuracy throughout processing and data entry (noting that data collected through paper questionnaires did not exceed 5%):
- Data entry staff were selected from among specialists in computer programming and were fully trained on the entry programs.
- Data verification was carried out on 10% of the entered questionnaires to ensure that data entry staff had entered the data correctly and in accordance with the provisions of the questionnaire; the verification was consistent with the original data to a degree of 100%.
- The files of the entered data were received, examined, and reviewed by project managers before findings were extracted.
- Project managers carried out many checks on data logic and coherence, such as comparing the data of the current month with that of the previous month, and comparing data across sources and governorates.
- Data collected on tablet devices were checked for consistency and accuracy by applying rules at the item level.
Other technical procedures to improve data quality: Seasonal adjustment processes
License: https://spdx.org/licenses/CC0-1.0.html
Effective management of non-indigenous species requires knowledge of their dispersal factors and founder events. We aim to identify the main environmental drivers favouring dispersal events along the invasion gradient and to characterize the spatial patterns of genetic diversity in feral populations of the non-native pink salmon within its epicentre of invasion in Norway. We first conducted species distribution modelling (SDM) using four modelling techniques with varying levels of complexity, encompassing both regression-based and tree-based machine-learning algorithms, using climatic data from the present to 2050. We then used the triple-enzyme restriction-site associated DNA sequencing (3RADseq) approach to genotype over 30,000 high-quality single-nucleotide polymorphisms to elucidate patterns of genetic diversity and gene flow within the pink salmon putative invasion hotspot. We discovered that temperature- and precipitation-related variables drove pink salmon distributional shifts across its non-native range, and that climate-induced favourable areas will remain stable for the next 30 years. In addition, all SDMs identified north-eastern Norway as the epicentre of the pink salmon invasion, and genomic data revealed minimal variation in genetic diversity across the sampled populations at a genome-wide level in this region. However, upon utilizing a specific group of 'diagnostic' SNPs, we observed a significant degree of genetic differentiation, ranging from moderate to substantial, and detected four hierarchical genetic clusters concordant with geography. Our findings suggest that fluctuations in climate extremes associated with ongoing climate change will likely maintain environmental favourability for the pink salmon outside its 'native'/introduced ranges. Locally invaded rivers are themselves a potential source population of invaders in the ongoing secondary spread of pink salmon in Northern Norway.
Our study shows that SDMs and genomic data can reveal species distribution determinants and provide indicators to aid in post-control measures and potential inferences of their success. Methods: 3RAD library preparation and sequencing: We prepared RADseq libraries using the Adapterama III library preparation protocol of Bayona-Vásquez et al. (2019; their Supplemental File SI). For each sample, ~40-100 ng of genomic DNA was digested for 1 h at 37 °C in a solution of 1.5 µl of 10x CutSmart® buffer (NEB®), 0.25 µl of Read 1 enzyme (MspI) at 20 U/µl, 0.25 µl of Read 2 enzyme (BamHI-HF) at 20 U/µl, 0.25 µl of Read 1 adapter dimer-cutting enzyme (ClaI) at 20 U/µl, 1 µl of i5Tru adapter at 2.5 µM, 1 µl of i7Tru adapter at 2.5 µM and 0.75 µl of dH2O. After digestion/ligation, samples were pooled and cleaned with Sera-Mag SpeedBeads (Fisher Scientific™) in a 1.2:1 (SpeedBeads:DNA) ratio, and cleaned DNA was eluted in 60 µl of TLE. An enrichment PCR of each sample was carried out with 10 µl of 5x KAPA Long Range Buffer (Kapa Biosystems, Inc.), 0.25 µl of KAPA LongRange DNA Polymerase at 5 U/µl, 1.5 µl of dNTP mix (10 mM each dNTP), 3.5 µl of MgCl2 at 25 mM, 2.5 µl of iTru5 primer at 5 µM, 2.5 µl of iTru7 primer at 5 µM and 5 µl of pooled DNA. The i5 and i7 adapters were ligated to each sample using a unique combination (2 i5 × 1 i7 indexes). The temperature conditions for PCR enrichment were 94 °C for 2 min of initial denaturation, followed by 10 cycles of 94 °C for 20 sec, 57 °C for 15 sec and 72 °C for 30 sec, and a final extension at 72 °C for 5 min. The enriched samples were each cleaned and quantified with a Quantus™ Fluorometer.
Cleaned, indexed and quantified library pools were pooled to equimolar concentrations and sent to the Norwegian Sequencing Centre (NSC) for quality control, a final size selection using a one-sided bead clean-up (0.7:1 ratio) to capture 550 bp ± 10% fragments, and paired-end (PE) 150 bp sequencing on one lane each of the Illumina HiSeq 4000 platform. Data filtering: We filtered genotype data and characterized singleton SNP loci and multi-site variants (MSVs) using filtering procedures and custom scripts available in STACKS Workflow v.2 (https://github.com/enormandeau/stacks_workflow). First, we filtered the ‘raw’ VCF file, keeping only SNPs that (i) showed a minimum depth of four (-m 4), (ii) were called in at least 80% of the samples in each site (-p 80) and (iii) for which at least two samples carried the rare allele, i.e., Minor Allele Sample (MAS; -S 2), using the python script 05_filter_vcf_fast.py. Second, we excluded samples with more than 20% missing genotypes from the data set. Third, we calculated pairwise relatedness between samples with the Yang et al. (2010) algorithm and individual-level heterozygosity in vcftools v.0.1.17 (Danecek et al., 2011). Additionally, we calculated pairwise kinship coefficients among individuals using the KING-robust method (Manichaikul et al., 2010) with the R package SNPRelate v.1.28.0 (Zheng et al., 2012). We then estimated genotyping error rates between technical replicates using the software tiger v1.0 (Bresadola et al., 2020). Finally, from each pair of closely related individuals we removed the one exhibiting the higher level of missing data, along with samples that showed extremely low heterozygosity (< -0.2) based on graphical inspection of individual-level heterozygosity per sampling population. Fourth, we conducted a secondary dataset filtering step using 05_filter_vcf_fast.py with the cut-off parameters -m 4, -p 80 and -S 3.
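The first-pass SNP filters described above (minimum depth, per-site call rate, and minor allele sample count) can be sketched in Python. This is a simplified illustration under an assumed 0/1/2 genotype encoding with -1 for missing calls, not the actual 05_filter_vcf_fast.py implementation:

```python
# Illustrative sketch of the three first-pass SNP filters: minimum depth
# (-m 4), per-site call rate (-p 80) and minor allele sample count (MAS, -S 2).
# Genotypes are coded as alt-allele dosage 0/1/2, with -1 for missing.

def passes_filters(genotypes, depths, min_depth=4, min_call_rate=0.80, min_mas=2):
    """Return True if a single SNP locus survives the depth/call-rate/MAS filters."""
    # (i) Keep only genotype calls whose read depth meets the minimum depth.
    called = [g for g, d in zip(genotypes, depths) if g >= 0 and d >= min_depth]
    # (ii) The SNP must be called in at least `min_call_rate` of the samples.
    if len(called) / len(genotypes) < min_call_rate:
        return False
    # (iii) At least `min_mas` samples must carry the rarer allele.
    alt_carriers = sum(1 for g in called if g > 0)
    ref_carriers = sum(1 for g in called if g < 2)
    return min(alt_carriers, ref_carriers) >= min_mas

# Toy example: 10 samples at one locus (one missing call, one low-depth call).
genos = [0, 0, 1, 2, 0, 1, 0, 0, 0, -1]
depths = [9, 12, 7, 5, 20, 8, 6, 10, 11, 3]
print(passes_filters(genos, depths))  # → True
```

A monomorphic locus, by contrast, fails filter (iii) because no sample carries the minor allele.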
Fifth, we calculated a suite of four summary statistics to discriminate high-confidence SNPs (singleton SNPs) from SNPs exhibiting a duplication pattern (duplicated SNPs; MSVs): (i) median allele ratio in heterozygotes (MedRatio), (ii) proportion of heterozygotes (PropHet), (iii) proportion of rare homozygotes (PropHomRare) and (iv) inbreeding coefficient (FIS). We calculated each parameter from the filtered VCF file using the python script 08_extract_snp_duplication_info.py. The four parameters calculated for each locus were plotted against each other to visualize their distribution across all loci using the R script 09_classify_snps.R. Based on the methodology of McKinney et al. (2017), and by plotting different combinations of each parameter, we graphically fixed cut-offs for each parameter. Sixth, we used the python script 10_split_vcf_in_categories.py to classify SNPs and generate two separate datasets: the “SNP dataset,” based on singleton SNPs only, and the “MSV dataset,” based on duplicated SNPs only, which we excluded from further analyses. Seventh, we post-filtered the SNP dataset by keeping all unlinked SNPs within each 3RAD locus using the 11_extract_unlinked_snps.py script with a minimum difference of 0.5 (-diff_threshold 0.5) and a maximum distance of 1,000 bp (-max_distance 1,000). Then, for the SNP dataset, we filtered out SNPs located in unplaced scaffolds, i.e., contigs that were not part of the 26 chromosomes of the pink salmon genome.
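As a rough illustration of the four duplicate-detection statistics, the following Python sketch computes them for a single locus from assumed genotype dosages and per-heterozygote read ratios; it is not the actual 08_extract_snp_duplication_info.py script, and the inputs are invented toy values:

```python
# Illustrative per-locus computation of the four statistics used to separate
# singleton SNPs from duplicated SNPs (MSVs), following McKinney et al. (2017).
from statistics import median

def locus_stats(genotypes, het_allele_ratios):
    """genotypes: 0/1/2 alt-allele dosages for one locus; het_allele_ratios:
    fraction of reads supporting one allele in each heterozygote (expected
    ~0.5 for true singleton SNPs, skewed for duplicated loci)."""
    n = len(genotypes)
    n_het = sum(1 for g in genotypes if g == 1)
    n_alt_hom = sum(1 for g in genotypes if g == 2)
    n_ref_hom = n - n_het - n_alt_hom
    # (i) MedRatio: median read ratio in heterozygotes.
    med_ratio = median(het_allele_ratios) if het_allele_ratios else float("nan")
    # (ii) PropHet and (iii) PropHomRare (rarer homozygote class / all samples).
    prop_het = n_het / n
    prop_hom_rare = min(n_ref_hom, n_alt_hom) / n
    # (iv) FIS = 1 - Ho/He; heterozygote excess (typical of MSVs) gives FIS < 0.
    p = (2 * n_alt_hom + n_het) / (2 * n)
    exp_het = 2 * p * (1 - p)
    fis = 1 - prop_het / exp_het if exp_het > 0 else float("nan")
    return med_ratio, prop_het, prop_hom_rare, fis

stats = locus_stats([0, 1, 1, 1, 2, 0, 1, 0], [0.5, 0.48, 0.52, 0.5])
print(stats)  # MedRatio ~0.5 and FIS slightly negative for this toy locus
```

Plotting these four values against each other across all loci, as the text describes, is what lets cut-offs be fixed graphically.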
Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Hello!
Thank you for taking the time to review my work.
The PBIX file features three reports derived after cleaning the data. Cleaning included deleting unnecessary columns, removing erroneous values, changing data types, and more.
The following columns were removed because they held redundant data or did not suit the purpose of the report:
- "Transaction.SubCategoryName" - Like "Transaction.CategoryName", it serves no purpose for the report, as the answers are too similar and too broad.
- "Transaction.Subscription" - There were next to no transactions with a subscription model, rendering this field empty.
- "Transaction.RecurrenceFrequency" - Had a single value when not blank.
- "Transaction.CategoryName" - Serves no purpose for the analysis.
- "Transaction.MerchantId" - We already have a much more readable merchant ID; we don't need a long code as well.
- "Transaction.Currency" - All transactions are processed in EUR, making the field pointless at the moment.
Some of the dates in the "Transaction.UpdatedAtDate" column were set to 1970, most likely Unix-epoch zero values (1970-01-01) introduced when missing or zero timestamps were converted to dates at some point in the life of the data set.
Both date columns were originally stored as text and had to be cleaned and converted to date types.
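The date clean-up can be illustrated with a small Python sketch; the actual transformations were done in Power BI, and the column content and date format here are assumptions for illustration only:

```python
# Hypothetical sketch of the date clean-up described above: parse text dates
# and treat 1970 epoch-zero artifacts (and unparseable strings) as missing.
from datetime import date, datetime

def clean_date(text):
    """Parse a text date, returning None for unparseable values and for
    1970 Unix-epoch artifacts that should be treated as missing."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None
    # Dates in 1970 are almost certainly epoch-zero artifacts, not real data.
    return None if parsed.year == 1970 else parsed

raw = ["2023-05-01", "1970-01-01", "not a date", "2024-02-14"]
cleaned = [clean_date(t) for t in raw]
print(cleaned)  # the 1970 value and the bad string become None
```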
The reports I decided to use are meant to highlight the differences between the individual store chains.
While you are here, feel free to check out my R Lang project.
Thanks!
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This Kaggle dataset comes from an output dataset that powers my March Madness Data Analysis dashboard in Domo. - Click here to view the dashboard: Dashboard Link - Click here to see the dashboard featured in a Domo blog post: Hoops, Data, and Madness: Unveiling the Ultimate NCAA Dashboard
This dataset offers one of the most robust resources you will find for discovering key insights through data science and data analytics using historical NCAA Division 1 men's basketball data. The data, sourced from KenPom, goes back to 2002 and is updated with the latest 2025 data. The dataset is meticulously structured to provide every piece of information I could pull from the site as an open-source tool for March Madness analysis.
Key features of the dataset include: - Historical Data: all historical KenPom data from 2002 to 2025 from the Efficiency, Four Factors (Offense & Defense), Point Distribution, Height/Experience, and Misc. Team Stats endpoints on KenPom's website. Please note that the Height/Experience data only goes back to 2007, while every other source contains data from 2002 onward. - Data Granularity: an individual line item for every NCAA Division 1 men's basketball team in every season, containing every KenPom metric available. This allows the dataset to serve as a single source of truth for your March Madness analysis and provides the granularity necessary for any type of analysis you can think of. - 2025 Tournament Insights: all seed and region information for the 2025 NCAA March Madness tournament. Please note that I will continually add the seed and region information for previous tournaments as I continue to work on this dataset.
These datasets were created by downloading the raw CSV files for each season from the various sections of KenPom's website (Efficiency, Offense, Defense, Point Distribution, Summary, Miscellaneous Team Stats, and Height). All of the raw files were uploaded to Domo and imported into a dataflow using Domo's Magic ETL. In these dataflows, the column headers for each previous season are standardized to the current 2025 naming structure so all of the historical data can be viewed under the same field names. The cleaned datasets are then appended together, and some additional clean-up takes place before creating the intermediate (INT) datasets that are uploaded to this Kaggle dataset. Once all of the INT datasets were created, I joined the tables together on team name and season so all of these different metrics can be viewed in one single view. I then joined an NCAAM Conference & ESPN Team Name Mapping table to add the conference (both its full name and its acronym) as well as the team name that ESPN currently uses. Please note that this reference table is an aggregated view of all of the conferences a team has been part of since 2002 and the different team names KenPom has used historically, so this mapping table is necessary to map all of the teams properly and to differentiate historical conferences from current conferences. I then joined a reference table of all current NCAAM coaches and their active coaching tenures, because a coach's active tenure typically correlates with a team's success in the March Madness tournament. I also joined a reference table of historical post-season tournament teams (March Madness, NIT, CBI, and CIT) and another reference table flagging the teams ranked in the top 12 of the AP Top 25 during week 6 of the respective NCAA season.
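The join logic described above can be sketched in plain Python; the table contents, field names, and values below are illustrative assumptions, since the actual joins run inside Domo's Magic ETL:

```python
# Hypothetical sketch of joining INT tables on team name and season to build
# one consolidated row per team-season, as described in the text.

def left_join(left_rows, right_rows, keys=("Team", "Season")):
    """Left-join two lists of dicts on the (team name, season) key."""
    index = {tuple(r[k] for k in keys): r for r in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(tuple(row[k] for k in keys), {})
        joined.append({**row, **match})
    return joined

# Toy INT tables and a mapping table (invented example values).
efficiency = [{"Team": "Gonzaga", "Season": 2025, "AdjEM": 28.1}]
four_factors = [{"Team": "Gonzaga", "Season": 2025, "eFG_Pct": 56.2}]
conference_map = [{"Team": "Gonzaga", "Season": 2025,
                   "Conference": "West Coast Conference", "Acronym": "WCC"}]

# Chain the joins so every metric lands in one consolidated row per team-season.
combined = left_join(left_join(efficiency, four_factors), conference_map)
print(combined[0]["AdjEM"], combined[0]["Acronym"])
```

Each subsequent reference table (coaches, post-season appearances, AP rankings) would be attached with the same kind of left join on the same key.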
After some additional data clean-up, all of the cleaned data is exported into the "DEV _ March Madness" file, which contains the consolidated view of all of this data.
This dataset gives users the flexibility to export data for further analysis in platforms such as Domo, Power BI, Tableau, Excel, and more. It is designed for users who wish to conduct their own analysis, develop predictive models, or simply gain a deeper understanding of the intricacies behind the excitement that Division 1 men's college basketball delivers every March. Whether you are using this dataset for academic research, personal curiosity, or professional work, I hope it serves as a foundational tool for exploring college basketball's most riveting and anticipated event of the season.