In this project, I analysed the employees of an organization located in two distinct countries using Excel. This project covers:
1) How to approach a data analysis project
2) How to systematically clean data
3) Doing EDA with Excel formulas & tables
4) How to use Power Query to combine two datasets
5) Statistical analysis of data
6) Using formulas like COUNTIFS, SUMIFS, XLOOKUP
7) Making an information finder with your data
8) Male vs. female analysis with Pivot Tables
9) Calculating bonuses based on business rules
10) Visual analytics of data with 4 topics
11) Analysing the salary spread (histograms & box plots)
12) Relationship between salary & rating
13) Staff growth over time: trend analysis
14) Regional scorecard to compare NZ with India
The project uses various Excel features, including:
1) Using Tables
2) Working with Power Query
3) Formulas
4) Pivot Tables
5) Conditional formatting
6) Charts
7) Data Validation
8) Keyboard shortcuts & tricks
9) Dashboard design
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.
- The curated NHANES datasets comprise 20 .csv files, two for each module: one uncleaned version and one cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
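The idea of merging modules can be illustrated with a minimal Python sketch. It is not the authors' R starter code. It assumes each module is loaded as a list of row dictionaries sharing a participant identifier; the key name "SEQN" follows the respondent sequence number used in the original NHANES files, and the curated files may name it differently. The other column names and the values are hypothetical.

```python
from typing import Dict, List


def merge_modules(left: List[dict], right: List[dict], key: str = "SEQN") -> List[dict]:
    """Left-join two lists of row dicts on a shared participant identifier.

    Rows in `left` without a match in `right` are kept unchanged.
    """
    right_by_id: Dict[object, dict] = {row[key]: row for row in right}
    merged = []
    for row in left:
        combined = dict(row)  # copy so the input rows are not mutated
        combined.update(right_by_id.get(row[key], {}))
        merged.append(combined)
    return merged


# Hypothetical toy rows standing in for two cleaned modules.
demographics = [{"SEQN": 1, "age": 34}, {"SEQN": 2, "age": 51}]
chemicals = [{"SEQN": 1, "lead_ug_dl": 1.2}]

merged = merge_modules(demographics, chemicals)
for row in merged:
    print(row)
```

A left join keeps every participant from the first module, which matters here because not every participant has measurements in every module.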
The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.
The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.
As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.
National
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.
As is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).
Sample survey data [ssd]
The samples for the Slovenia ES 2009, 2013, and 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for Slovenia 2009 ES and for Slovenia 2013 ES, and in the Sampling Note for 2019 Slovenia ES.
Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information on the industries and regions chosen are included in the attached Excel file (Sampling Report.xls) for Slovenia 2009 ES. For Slovenia 2013 and 2019 ES, specific information on the industries and regions chosen is described in the "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.
For the Slovenia 2009 ES, industry stratification was designed as follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries, sample sizes were inflated by about 17% to account for potential non-response cases when requesting sensitive financial data and also because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals), sample sizes were inflated by about 12% to account for undersampling of firms in service industries.
For Slovenia 2013 ES, industry stratification was designed in the way that follows: the universe was stratified into one manufacturing industry, and two service industries (retail, and other services).
Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).
For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.
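The size strata can be expressed as a small classification function. This is an illustrative Python sketch, not part of the survey's tooling. The function name is hypothetical, and returning None for establishments with fewer than 5 workers is an assumption, since the text defines no stratum below 5 employees.

```python
from typing import Optional


def size_stratum(permanent_full_time_workers: int) -> Optional[str]:
    """Map a count of permanent full-time workers to an ES size stratum."""
    if permanent_full_time_workers < 5:
        return None  # assumption: below the smallest stratum named in the text
    if permanent_full_time_workers <= 19:
        return "small"   # 5 to 19 employees
    if permanent_full_time_workers <= 99:
        return "medium"  # 20 to 99 employees
    return "large"       # more than 99 employees


for workers in (4, 5, 19, 20, 99, 100):
    print(workers, size_stratum(workers))
```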
For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.
For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.
Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).
Computer Assisted Personal Interview [capi]
Questionnaires have common questions (core module) and, respectively, additional manufacturing- and services-specific questions. The eligible manufacturing industries have been surveyed using the Manufacturing questionnaire (includes the core module, plus manufacturing specific questions). Retail firms have been interviewed using the Services questionnaire (includes the core module plus retail specific questions) and the residual eligible services have been covered using the Services questionnaire (includes the core module). Each variation of the questionnaire is identified by the index variable, a0.
Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.
Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.
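When analyzing such data, the refusal code has to be separated from genuine numeric answers before computing statistics. The Python sketch below illustrates this under stated assumptions: the function names and sample values are hypothetical, and only the code -8 comes from the text above. Other special codes may exist in the actual datasets and would need the same treatment.

```python
from typing import List, Tuple

REFUSAL_CODE = -8  # refusal to respond, as described in the text


def split_refusals(responses: List[float]) -> Tuple[List[float], int]:
    """Return the valid answers and the number of refusals."""
    valid = [r for r in responses if r != REFUSAL_CODE]
    return valid, len(responses) - len(valid)


def mean_excluding_refusals(responses: List[float]) -> float:
    """Average the valid answers only; raises ValueError if none are valid."""
    valid, _ = split_refusals(responses)
    if not valid:
        raise ValueError("no valid responses")
    return sum(valid) / len(valid)


# Hypothetical answers to one sensitive question.
answers = [10, -8, 30, -8]
valid_answers, refusal_count = split_refusals(answers)
print(valid_answers, refusal_count, mean_excluding_refusals(answers))
```

Leaving -8 in place would silently pull averages downward, which is why the sketch filters it out first.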
For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.
For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates in Slovenia may be selection bias and not frame inaccuracy.
For 2013, the number of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The number of rejections per contact was 44%.
Finally, for 2019, the number of interviews per contacted establishments was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
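The three waves report this figure in different forms: contacts per interview for 2009, and interviews per contact for 2013 and 2019. The Python sketch below puts them on a common scale by taking reciprocals. It assumes the underlying definitions of "contacted establishment" and "realized interview" are the same across waves, which the text does not confirm; the variable names are illustrative.

```python
def reciprocal(ratio: float) -> float:
    """Convert contacts-per-interview to interviews-per-contact, or back."""
    return 1.0 / ratio


# Interviews per contacted establishment, by wave.
interviews_per_contact = {
    2009: reciprocal(6.18),  # reported as 6.18 contacts per interview
    2013: 0.25,              # reported as 25%
    2019: 0.097,             # reported as 9.7%
}

for year, share in interviews_per_contact.items():
    print(f"{year}: {share * 100:.1f}% of contacts interviewed, "
          f"{reciprocal(share):.1f} contacts per interview")
```

On this scale the 2009 figure corresponds to about 16.2% of contacts yielding an interview, and the 2019 figure to about 10.3 contacts per interview.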
Description:
This dataset comprises a comprehensive set of files designed for the analysis and 2D correlation of spectral data, specifically focusing on ATR and NIR spectra. It includes MATLAB scripts and supporting functions necessary to replicate the analysis, as well as the raw datasets used in the study. Below is a detailed description of the included files:
Data Analysis:
File Name: Data_Analysis.mlx
Description: This MATLAB Live Script file contains the main script used for the classification analysis of the spectral data. It includes steps for preprocessing, analysis, and visualization of the ATR and NIR spectra.
2D Correlation Data Analysis:
File Name: Data_Analysis_2Dcorr.mlx
Description: This MATLAB Live Script file is similar to the primary analysis script but is specifically tailored for performing 2D correlation analysis on the spectral data. It includes detailed steps and code for executing the 2D correlation.
Functions:
Folder Name: Functions
Description: This folder contains all the necessary MATLAB function files required to replicate the analyses presented in the scripts. These functions handle various preprocessing steps, calculations, and visualizations.
Datasets:
File Names: ATR_dataset.xlsx, NIR_dataset.xlsx, Reference_data.csv
Description: These Excel files contain the raw spectral data for ATR and NIR analyses, as well as reference datasets. Each file includes multiple sheets with detailed measurements and metadata.
Usage Notes:
Software Requirements:
MATLAB is required to run the .mlx files and utilize the functions.
PLS_Toolbox: Necessary for certain preprocessing and analysis steps.
MIDAS 2010: Available at MIDAS 2010, required for the 2D correlation analysis.
Replication: Users can replicate the analyses by running the Data_Analysis.mlx and Data_Analysis_2Dcorr.mlx scripts in MATLAB, ensuring that the Functions folder is in the MATLAB path.
Data Handling: The datasets are provided in .xlsx format, which can be easily imported into MATLAB or other data analysis software.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The shared archives combined in Supplementary Datasets represent the actual databases used in the investigation described in two papers:
Meteorological conditions affecting black vulture (Coragyps atratus) soaring behavior in the southeast of Brazil: Implications for bird strike abatement (in submission)
Remote sensing applications for abating the aircraft-bird strike risks in the southeast of Brazil (Human-Wildlife Interactions Journal, in print)
The papers were based on my Master’s thesis, defended in 2016 at the Institute of Biology of the University of Campinas (UNICAMP) in partial fulfilment of the requirements for the degree of Master in Ecology. Our investigation was devoted to reducing the risk of aircraft collisions with Black vultures. It had two parts, one covered in each paper. In the first, we studied the relationship between the soaring activity of Black vultures and meteorological characteristics. In the second, we explored the dependence of the soaring activity of vultures on surface and anthropogenic characteristics. The study was carried out in the surroundings of two airports in the southeast of Brazil, taken as case studies. We developed methodological approaches that combine GIS and remote sensing technologies for data processing, which served as the main research instrument. Using them, we joined in georeferenced databases (shapefiles) the bird observation data and three types of environmental factors: (i) meteorological characteristics collected together with the bird observations; (ii) surface parameters (relief and surface temperature) obtained from ASTER imagery products; (iii) parameters of surface cover and anthropogenic pressure obtained from high-resolution satellite images. Based on analyses of the georeferenced databases, we studied the relationship between the soaring activity of vultures and environmental factors; revealed the behavioral patterns of vultures in soaring flight; detected the landscape types that are highly attractive to this species and over which birds concentrate; constructed maps giving a numerical estimate of the hazard of bird strike events in the airport vicinities; and formulated practical recommendations for decreasing the risk of collisions with vultures and other bird species.
This archive contains all materials prepared and used for the study, including the GIS database for the two papers, remote sensing data, and Microsoft Excel datasets. You can find the description of the supplementary files in the Description of Supplementary Dataset.docx. The links between the supplementary files and the text of the papers are given in the Attribution to the text of papers.docx. The supplementary files are in the folders Datasets, GIS_others, GIS_Raster, GIS_Shape.
For any questions, please write to me at this email: natalieenov@gmail.com
Natalia Novoselova
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Firm-level data from 2009 to 2018 for 34 large gold mines in developing countries. The data are used to compute the deterministic, dynamic environmental and technical efficiencies of large gold mines in developing countries.
Steps to reproduce
1. Run the R command to generate dynamic technical and dynamic inefficiencies for every two consecutive periods (i.e., periods t and t+1).
2. Combine the per-period inefficiency result files generated in R into a panel (see the Excel files in the results folder).
3. Import the Excel folder into Stata and generate the final results indicated in the paper.
This reference contains tabular datasets resulting from the eDNA pilot study on National Wildlife Refuges. ZIP file contains all datasets as received from the authors: a folder for each participating refuge containing two Excel workbooks, one for the MiFish marker results and one for the COI marker results. Each workbook has several sheets including one for the raw compiled data, one for each site, and filtered combined data. CSV of filtered data for all participating refuges combined. This dataset was compiled by extracting the filtered datasheet for each refuge from the Excel workbook and combining them into a CSV using an R script. CSV of the total OTU, OTU species, unique families, and number of fish, mammal, amphibian, mollusk, and bird species for each participating refuge. This CSV was compiled by Rachel Maxey (I&M Data Manager) by extracting the data from the refuge workbooks and combining manually into a CSV. CSV of the full Site data download from Survey 123. Data dictionaries and metadata for site information and eDNA results tables.
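The combining step described above was done with an R script that is not reproduced here. As a rough illustration of the same idea, the following Python sketch stacks per-refuge filtered tables into one CSV and adds a column recording the source refuge. The refuge names, column names, and rows are hypothetical, and reading the Excel sheets themselves is left out.

```python
import csv
import io
from typing import Dict, List, TextIO


def combine_filtered(tables: Dict[str, List[dict]], out: TextIO) -> int:
    """Write all per-refuge rows to one CSV with a leading 'refuge' column.

    Returns the number of data rows written.
    """
    # Collect the union of column names, preserving first-seen order.
    fieldnames = ["refuge"]
    for rows in tables.values():
        for row in rows:
            for column in row:
                if column not in fieldnames:
                    fieldnames.append(column)

    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    written = 0
    for refuge, rows in tables.items():
        for row in rows:
            writer.writerow({"refuge": refuge, **row})
            written += 1
    return written


# Hypothetical filtered rows for two refuges.
filtered = {
    "Refuge A": [{"otu": "OTU1", "species": "Salmo trutta"}],
    "Refuge B": [{"otu": "OTU7", "species": "Esox lucius"},
                 {"otu": "OTU9", "species": "Perca flavescens"}],
}

buffer = io.StringIO()
row_count = combine_filtered(filtered, buffer)
lines = buffer.getvalue().splitlines()
print(row_count, lines[0])
```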
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The Albero study analyzes the personal transitions of a cohort of high school students at the end of their studies. The data consist of (a) the longitudinal social network of the students, before (n = 69) and after (n = 57) finishing their studies; and (b) the longitudinal study of the personal networks of each of the participants in the research. The two observations of the complete social network are presented in two matrices in Excel format. For each respondent, two square matrices of 45 alters of their personal networks are provided, also in Excel format. For each respondent, both psychological sense of community and frequency of commuting are provided in a SAV file (SPSS). The database allows the combined analysis of social networks and personal networks of the same set of individuals.
INTRODUCTION
Ecological transitions are key moments in the life of an individual that occur as a result of a change of role or context. This is the case, for example, of the completion of high school studies, when young people start their university studies or try to enter the labor market. These transitions are turning points that carry a risk or an opportunity (Seidman & French, 2004). That is why they have received special attention in research and psychological practice, both from a developmental point of view and in the situational analysis of stress or in the implementation of preventive strategies.
The data we present in this article describe the ecological transition of a group of young people from Alcala de Guadaira, a town located about 16 kilometers from Seville. Specifically, in the “Albero” study we monitored the transition of a cohort of secondary school students at the end of the last pre-university academic year. It is a turning point in which most of them began a metropolitan lifestyle, with more displacements to the capital and a slight decrease in identification with the place of residence (Maya-Jariego, Holgado & Lubbers, 2018).
Normative transitions, such as the completion of studies, affect a group of individuals simultaneously, so they can be analyzed both individually and collectively. From an individual point of view, each student stops attending the institute, which is replaced by new interaction contexts. Consequently, the structure and composition of their personal networks are transformed. From a collective point of view, the network of friendships of the cohort of high school students enters into a gradual process of disintegration and fragmentation into subgroups (Maya-Jariego, Lubbers & Molina, 2019).
These two levels, individual and collective, were evaluated in the “Albero” study. One of the peculiarities of this database is that we combine the analysis of a complete social network with a survey of personal networks in the same set of individuals, with a longitudinal design before and after finishing high school. This allows combining the study of the multiple contexts in which each individual participates, assessed through the analysis of a sample of personal networks (Maya-Jariego, 2018), with the in-depth analysis of a specific context (the relationships within a graduating class of students in the institute), through the analysis of the complete network of interactions. This potentially allows us to examine the covariation of the social network with the individual differences in the structure of personal networks.
PARTICIPANTS
The social network and personal networks of the students of the last two years of high school of an institute of Alcala de Guadaira (Seville) were analyzed. The longitudinal follow-up covered approximately a year and a half. The first wave was composed of 31 men (44.9%) and 38 women (55.1%) who live in Alcala de Guadaira, and who mostly expect to live in Alcala (36.2%) or in Seville (37.7%) in the future. In the second wave, information was obtained from 27 men (47.4%) and 30 women (52.6%).
DATA STRUCTURE AND ARCHIVE FORMAT
The data is organized in two longitudinal observations, with information on the complete social network of the cohort of students of the last year, the personal networks of each individual and complementary information on the sense of community and frequency of metropolitan movements, among other variables.
Social network
The file “Red_Social_t1.xlsx” is a valued matrix of 69 actors that gathers the relations of knowledge and friendship between the cohort of students of the last year of high school in the first observation. The file “Red_Social_t2.xlsx” is a valued matrix of 57 actors obtained 17 months after the first observation.
In order to generate each complete social network, the list of 77 students enrolled in the last year of high school was passed to the respondents, asking that in each case they indicate the type of relationship, according to the following values: 1, “his/her name sounds familiar"; 2, "I know him/her"; 3, "we talk from time to time"; 4, "we have good relationship"; and 5, "we are friends." The two resulting complete networks are represented in Figure 2. In the second observation, it is a comparatively less dense network, reflecting the gradual disintegration process that the student group has initiated.
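To make the notion of a "less dense" network concrete, the following Python sketch computes the density of a valued matrix after dichotomizing it at a chosen relationship level. It is an illustration only, not the procedure used in the study. It assumes that 0 marks the absence of any relationship, that ties are directed, and that the diagonal is ignored; the toy matrix is hypothetical.

```python
from typing import List


def density(matrix: List[List[int]], threshold: int = 1) -> float:
    """Share of ordered pairs (i, j), i != j, with a tie value >= threshold."""
    n = len(matrix)
    if n < 2:
        return 0.0
    ties = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and matrix[i][j] >= threshold
    )
    return ties / (n * (n - 1))


# Hypothetical 3-actor valued matrix using the 1-5 relationship scale.
toy = [
    [0, 5, 3],
    [5, 0, 1],
    [2, 0, 0],
]

print(density(toy))               # any relationship (value >= 1)
print(density(toy, threshold=5))  # friendship ties only (value 5)
```

Raising the threshold restricts the count to stronger ties, so the same matrix yields a lower density for friendship than for mere acquaintance.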
Personal networks
Also in this case the information is organized in two observations. The compressed file “Redes_Personales_t1.csv” includes 69 folders, corresponding to personal networks. Each folder includes a valued matrix of 45 alters in CSV format. Likewise, in each case a graphic representation of the network obtained with Visone (Brandes and Wagner, 2004) is included. Relationship values range from 0 (do not know each other) to 2 (know each other very well).
Second, the compressed file “Redes_Personales_t2.csv” includes 57 folders, with the information equivalent to each respondent referred to the second observation, that is, 17 months after the first interview. The structure of the data is the same as in the first observation.
Sense of community and metropolitan displacements
The SPSS file “Albero.sav” collects the survey data, together with some information-summary of the network data related to each respondent. The 69 rows correspond to the 69 individuals interviewed, and the 118 columns to the variables related to each of them in T1 and T2, according to the following list:
• Socio-economic data.
• Data on habitual residence.
• Information on intercity journeys.
• Identity and sense of community.
• Personal network indicators.
• Social network indicators.
DATA ACCESS
Social networks and personal networks are available in CSV format. This allows its use directly with UCINET, Visone, Pajek or Gephi, among others, and they can be exported as Excel or text format files, to be used with other programs.
The visual representation of the personal networks of the respondents in both waves is available in the following album of the Graphic Gallery of Personal Networks on Flickr: <https://www.flickr.com/photos/25906481@N07/albums/72157667029974755>.
In previous work we analyzed the effects of personal networks on the longitudinal evolution of the socio-centric network. That work also includes additional details about the instruments applied. If you use the data, please cite the following reference:
The English version of this article can be downloaded from: https://tinyurl.com/yy9s2byl
CONCLUSION
The database of the “Albero” study allows us to explore the co-evolution of social networks and personal networks. In this way, we can examine the mutual dependence of individual trajectories and the structure of the relationships of the cohort of students as a whole. The complete social network corresponds to the same context of interaction: the secondary school. However, personal networks collect information from the different contexts in which the individual participates. The structural properties of personal networks may partly explain individual differences in the position of each student in the entire social network. In turn, the properties of the entire social network partly determine the structure of opportunities in which individual trajectories are displayed.
The longitudinal design, and the combination of individuals' personal networks with a common complete social network, give this database unique characteristics. It may be of interest both for multi-level analysis and for the study of individual differences.
ACKNOWLEDGEMENTS
The fieldwork for this study was supported by the Complementary Actions of the Ministry of Education and Science (SEJ2005-25683), and was part of the project “Dynamics of actors and networks across levels: individuals,
https://creativecommons.org/publicdomain/zero/1.0/
2008 population and demographic census data for Israel, at the level of settlements and below.
Data are provided at the sub-settlement level (i.e., neighborhoods). Variable names (in Hebrew and English) and a data dictionary are provided in XLS files. 2008 statistical area names are provided (along with top roads/neighborhoods per settlement). The Excel data need cleaning and merging from multiple sub-pages.
Data from Israel Central Bureau of Statistics (CBS): http://www.cbs.gov.il/census/census/pnimi_page.html?id_topic=12
Photo by Me (Dan Ofer).
This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst the data were ill-suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data, and the data are only made available to accredited researchers through DataFirst's Secure Data Service.
The dataset was separated into the following data files:
Applications, individuals
Administrative records [adm]
Other [oth]
The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services. The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.
Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions on the itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow in its industry and to suggest itemsets to customers, so that we can increase customer engagement, improve customer experience and identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.
Association rules are most often used when you want to find associations between different objects in a set, or frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow the retailer to identify relationships between the items.
Assume there are 100 customers: 10 of them bought a Computer Mouse, 9 bought a Mat for Mouse and 8 bought both of them. For the rule "bought Computer Mouse => bought Mat for Mouse":
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support/P(Computer Mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(Mat for Mouse) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
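The same arithmetic can be checked with a few lines of Python. This is only a sketch using the standard definitions, where confidence divides by the frequency of the antecedent (the mouse) and lift divides confidence by the frequency of the consequent (the mat). The function name `rule_metrics` is ours, for illustration.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for the rule antecedent => consequent."""
    support = n_both / n_total                    # P(A and B)
    confidence = n_both / n_antecedent            # P(B | A) = support / P(A)
    lift = confidence / (n_consequent / n_total)  # confidence / P(B)
    return support, confidence, lift


# 100 customers, 10 bought the mouse, 9 bought the mat, 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.08 confidence=0.80 lift=8.89
```

A lift well above 1 suggests that the two items are bought together far more often than they would be if purchases were independent.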
Number of Attributes: 7
First, we need to load the required libraries. I briefly describe each library.
Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.
After that, we clean the data frame by removing missing values.
To apply association rule mining, we need to convert the data frame into transaction data so that all items that are bought together in one invoice will be in ...
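The original walkthrough uses R. As a hedged Python sketch of the same idea, the snippet below drops rows with a missing item and then groups items by invoice so that each transaction is one list of items. The column names `invoice` and `item` and the sample rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented sample; column names are hypothetical placeholders.
df = pd.DataFrame({
    "invoice": [1, 1, 2, 2, 2, 3],
    "item": ["mouse", "mat", "mouse", "keyboard", None, "mat"],
})

# Remove rows where the item is missing.
df = df.dropna(subset=["item"])

# One list of items per invoice: the transaction format that
# association rule mining expects as input.
transactions = df.groupby("invoice")["item"].apply(list).tolist()
print(transactions)
# prints: [['mouse', 'mat'], ['mouse', 'keyboard'], ['mat']]
```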
State estimates for these years are no longer available due to methodological concerns with combining 2019 and 2020 data. We apologize for any inconvenience or confusion this may cause. Because of the COVID-19 pandemic, most respondents answered the survey via the web in Quarter 4 of 2020, even though all responses in Quarter 1 were from in-person interviews. It is known that people may respond to the survey differently while taking it online, thus introducing what is called a mode effect. When the state estimates were released, it was assumed that the mode effect was similar for different groups of people. However, later analyses have shown that this assumption should not be made. Because of these analyses, along with concerns about the rapid societal changes in 2020, it was determined that averages across the two years could be misleading. For more detail on this decision, see the 2019-2020 state data page.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This document explains how the data were generated and how to interpret them.
LICENSE: CC0
But if you want to combine data with other datasets, feel free to use them as if they were published under CC0 license.
Data were published in February 2017. At that time, Zenodo only provided CC BY, CC BY-SA, CC BY-NC, CC BY-ND and CC BY-NC-ND. No CC0 option was available.
HOW DATA WERE COLLECTED
The 21 recorded sessions took place between February 2013 and December 2016.
Data were collected using Turning Technologies' remote controls (called clickers) and TurningPoint software.
The 4 versions of the quiz used during these 4 years are provided in the 'quizzes' folder for information purposes (in PDF and PowerPoint formats).
Turning Technologies records data in a closed format (.tpzx) that can be exported and converted into the 3 formats provided here: Excel, CSV and SQLite (these 3 files contain the same data).
The Excel file was directly exported from TurningPoint and is provided for Excel users who can't read CSV correctly.
CSV was converted from Excel and is provided for non-Excel users.
Finally, SQLite is provided in order to apply different sorting and filters to the data. It can be read using SQLite manager for Firefox (https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/).
CODEBOOK
Here is the name, the meaning and the possible values of the columns (name - meaning [possible values]). If students didn't answer the question, the value is '-'.
* Session - session number (chronological) [1 to 21]
* AcademicYear - academic year [12-13, 13-14, 14-15, 15-16, 16-17]
* Year - calendar year [2013, 2014, 2015, 2016]
* Month - month (number) [1 to 12]
* Day - day (number) [1 to 31]
* Section - section abbreviation [CH, ESC, GM, IF, SIE, SV]
* Level - students' level [BA2, BA3, MA]
* Language - course's language [FR or EN]
* DeviceID - clicker's ID [(unique ID within a session)]
* Q1 - answers to question 1 [A, B, C, D, E]
* Q2 - answers to question 2 [A, B, C, D]
* Q3 - answers to question 3 [A or B]
* Q4 - answers to question 4 [A or B]
* Q5 - answers to question 5 [A or B]
* Q6 - answers to question 6 [A or B]
* Q7 - answers to question 7 [A or B]
* Q8 - answers to question 8 [A or B]
* Q9 - answers to question 9 [A or B]
* Q8-9 - answers to the question 8-9 (merge) [A or B]
* Q10 - answers to question 10 [1, 2]
* Q11 - answers to question 11 [A or B]
* Q12 - answers to question 12 [A, B]
Section abbreviation meaning
* CH: chemistry
* ESC: school of criminal justice (Unil)
* GM: mechanical engineering
* IF: financial engineering
* SIE: environmental engineering
* SV: life sciences
Level meaning
* BA2: 2nd year of Bachelor
* BA3: 3rd year of Bachelor
* MA: Master level
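A minimal sketch of how the CSV version might be loaded with pandas, treating '-' as a missing answer as the codebook describes. The rows below are invented and use only a few of the codebook's column names. The actual file name and delimiter are not stated here, so an in-memory sample stands in for the real file.

```python
import io
import pandas as pd

# Invented excerpt using column names from the codebook; values are made up.
sample_csv = io.StringIO(
    "Session,Section,Level,Language,DeviceID,Q3,Q4\n"
    "1,CH,BA2,FR,A1,A,-\n"
    "1,CH,BA2,FR,A2,B,A\n"
    "2,SV,MA,EN,B1,-,B\n"
)

# '-' means the student did not answer, so read it as a missing value.
# Reading everything as text keeps answer letters and IDs unmodified.
answers = pd.read_csv(sample_csv, na_values=["-"], dtype=str)

# Share of students who answered each question.
response_rate = answers[["Q3", "Q4"]].notna().mean()
print(response_rate)
```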
Question types
For some questions, multiple answers were allowed: Q1, Q2, Q10 & Q12.
Half of the questions have only one correct answer, true or false: Q3, Q5, Q6, Q7, Q8, Q9 & Q8-9.
Finally, for 2 questions only one answer was accepted, but there is not only one correct answer: Q4 & Q11.
INFORMATION ABOUT THE SESSIONS
Unless otherwise stated below, all sessions were conducted like the original one: Q1 to Q12 (no Q8-9).
The original French version of the quiz has been translated into English for a few sessions with Master students.
For sessions 14 and 20, Q5 was removed and Q8 & Q9 were merged into Q8-9.
Session 18 was a short one with only 7 questions: Q1, Q2, Q3, Q4, Q6, Q7 & Q9.
CONTACT INFORMATION If you have any questions about these data, contact formations.bib@epfl.ch.
https://creativecommons.org/publicdomain/zero/1.0/
If this data set is useful, an upvote is appreciated. These data address student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
This dataset contains images of five different vehicle classes: Bus, Car, Motorcycle, Light Truck, and Heavy Truck. The images are split into training and testing sets, making it suitable for supervised learning tasks such as image classification and weight estimation.
In addition to the image files, the dataset includes two Excel sheets that provide approximate weight annotations for the different vehicle classes, enabling combined classification-regression tasks.
| Class name | Total number |
|---|---|
| Bus | 1096 |
| Car | 1428 |
| Motorcycle | 542 |
| Heavy truck | 1982 |
| Light truck | 553 |
The dataset was manually created by combining images from several public datasets:
https://www.kaggle.com/datasets/kshitij192/cars-image-dataset
https://www.kaggle.com/datasets/krishrana/vehicle-dataset
https://www.kaggle.com/datasets/kaggleashwin/vehicle-type-recognition
Additional images were manually collected from the internet and organized into the five categories to ensure better class balance and diversity.
The dataset is shared for research and commercial use, with the goal of supporting projects in vehicle classification, weight estimation, and intelligent transportation systems.
Typically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Image from stocksnap.io.
Analyses for this dataset could include time series, clustering, classification and more.
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.
What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.
Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, making it possible to uncover hidden patterns, trends, and themes within the realm of cinema.
Note: The data is collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
http://opendatacommons.org/licenses/dbcl/1.0/
Explore the archive of relevant economic information: relevant news on all indicators with explanations, data on past publications on the economy of the United States, Britain, Japan and other developed countries, volatility assessments and much more. For building forecast models, deep learning is well suited, with a model trained on EU and Forex data. The economic calendar is an indispensable assistant for the trader.
ON THIS TOPIC Telegram : @Economic Calendar Investing Forex https://t.me/economic_calendar_forex_invest This channel will wake you up 5 minutes before important events of high volatility, as well as inform you of current data for monitoring from the investing economic calendar
The data set is provided as CSV and Excel spreadsheets (two files: 2011-2013 and 2014-2019), which can be found at boot time. You can see the source of the data on the site https://www.investing.com/economic-calendar/
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Source : UCI Machine Learning Repository – Bank Marketing (#222)
A Portuguese retail bank’s phone-based marketing campaigns (May 2008 → Nov 2010).
The task is to predict whether a client will subscribe to a term deposit (target `y`).
| File | Rows | Columns | Notes |
|---|---|---|---|
| bank_marketing.xlsx | 45,211 | 17 | Classic “bank-full” version (all examples, 16 predictors + target) |
Need the enriched “bank-additional” version with 20 predictors? Grab it from the UCI link.
| Column | Type | Description |
|---|---|---|
| age | int | Age of the client |
| job | cat | Job type (admin., blue-collar, …) |
| marital | cat | Marital status (married / single / divorced) |
| education | cat | Education level (primary / secondary / tertiary / unknown) |
| default | bin | Has credit in default? |
| balance | int | Average yearly balance (EUR) |
| housing | bin | Has housing loan? |
| loan | bin | Has personal loan? |
| contact | cat | Contact channel (cellular / telephone / unknown) |
| day | int | Day of month of last contact |
| month | cat | Month of last contact (jan-dec) |
| duration | int | Call duration (secs)* |
| campaign | int | Contacts made in this campaign (incl. last) |
| pdays | int | Days since last contact (-1 ⇒ never) |
| previous | int | Previous contacts before this campaign |
| poutcome | cat | Outcome of previous campaign (failure / success / nonexistent) |
| y | bin | Target – subscribed to term deposit? (yes/no) |
*⚠️ duration is only known after the call ends; include it only for benchmarking, not for live prediction.
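A small sketch of how the `duration` warning and the `pdays` sentinel might be handled before modelling. The two rows are invented, and the derived column `previously_contacted` is our own illustrative name, not part of the dataset.

```python
import pandas as pd

# Two invented rows using column names from the table above.
df = pd.DataFrame({
    "age": [34, 51],
    "duration": [180, 0],
    "pdays": [-1, 92],
    "y": ["no", "yes"],
})

# Drop `duration` for a realistic model: it is only known after the call ends.
X = df.drop(columns=["duration", "y"])

# pdays == -1 means "never contacted"; expose that as an explicit flag
# (hypothetical feature name) instead of leaving -1 as a numeric value.
X["previously_contacted"] = (X["pdays"] != -1).astype(int)

# Encode the target as 1 for "yes" and 0 for "no".
y = (df["y"] == "yes").astype(int)
print(X)
print(y.tolist())
```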
```python
import pandas as pd

# Load the Excel file from the Kaggle input directory
df = pd.read_excel('/kaggle/input/bank-marketing/bank_marketing.xlsx')
print(df.shape)  # (45211, 17)
df.head()
```
Prefer pip? Fetch directly from ucimlrepo:
```python
# Notebook cell: the leading "!" runs pip as a shell command
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

bm = fetch_ucirepo(id=222)  # Bank Marketing dataset
X, y = bm.data.features, bm.data.targets
```
## 5 · Use-Cases & Ideas
| 🛠️ ML Task | Why it’s interesting |
|--------------------------|----------------------------------------------------------------------------------------------------------------|
| Binary classification | Classic imbalanced dataset – try **SMOTE**, cost-sensitive learning, threshold tuning |
| Feature engineering | Combine `pdays`, `campaign`, `previous` into a **contact-intensity score** |
| Model interpretability | Use **SHAP** / **LIME** to explain “yes” predictions |
| Time-aware validation | Data are date-ordered → split train/test chronologically to avoid leakage |
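As one possible reading of the time-aware validation idea, the sketch below splits a frame by row position instead of shuffling it. It assumes the rows are already in chronological order, as the table states. The ten rows and the 80/20 ratio are invented for illustration.

```python
import pandas as pd

# Ten invented rows standing in for the date-ordered dataset.
df = pd.DataFrame({
    "campaign": [1, 2, 1, 3, 1, 2, 4, 1, 2, 1],
    "y": ["no", "no", "yes", "no", "no", "yes", "no", "no", "no", "yes"],
})

# Chronological split: the first 80% of rows train the model,
# the most recent 20% are held out. No shuffling, so no future rows
# leak into the training set.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
print(len(train), len(test))
# prints: 8 2
```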
---
## 6 · Credits & Citations
> **Creators :** **Sérgio Moro, Paulo Rita, Paulo Cortez**
> **Original paper :**
> Moro S., Cortez P., Rita P. (2014).
> *A data-driven approach to predict the success of bank telemarketing campaigns.*
> *Decision Support Systems.* [[PDF]](https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e)
If you use this dataset, please cite:
Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset].
UCI Machine Learning Repository. https://doi.org/10.24432/C5K306
---
## 7 · License
This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)**.
You are free to share & adapt, **provided you credit the original creators**.