20 datasets found

Employee Analysis In Excel
kaggle.com
zip
Updated Mar 20, 2024
Cite
Afolabi Raymond (2024). Employee Analysis In Excel [Dataset]. https://www.kaggle.com/datasets/afolabiraymond/employee-analysis-in-excel
Available download formats: zip (190258 bytes)
Dataset updated
Mar 20, 2024
Authors
Afolabi Raymond
Description
In this project, I analysed the employees of an organization located in two distinct countries using Excel. This project covers:

1) How to approach a data analysis project
2) How to systematically clean data
3) Doing EDA with Excel formulas & tables
4) How to use Power Query to combine two datasets
5) Statistical Analysis of data
6) Using formulas like COUNTIFS, SUMIFS, XLOOKUP
7) Making an information finder with your data
8) Male vs. Female Analysis with Pivot tables
9) Calculating Bonuses based on business rules
10) Visual analytics of data with 4 topics
11) Analysing the salary spread (Histograms & Box plots)
12) Relationship between Salary & Rating
13) Staff growth over time - trend analysis
14) Regional Scorecard to compare NZ with India

Including various Excel features such as:
1) Using Tables
2) Working with Power Query
3) Formulas
4) Pivot Tables
5) Conditional formatting
6) Charts
7) Data Validation
8) Keyboard Shortcuts & tricks
9) Dashboard Design
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
Figshare (http://figshare.com/)
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.
- The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter code: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
Enterprise Survey 2009-2019, Panel Data - Slovenia
microdata.worldbank.org
catalog.ihsn.org
Updated Aug 6, 2020
Cite
World Bank Group (WBG) (2020). Enterprise Survey 2009-2019, Panel Data - Slovenia [Dataset]. https://microdata.worldbank.org/index.php/catalog/3762
Dataset updated
Aug 6, 2020
Dataset provided by
European Investment Bank (http://eib.org/)
European Bank for Reconstruction and Development (http://ebrd.com/)
World Bank Group (http://www.worldbank.org/)
Time period covered
2008 - 2019
Area covered
Slovenia
Description
Abstract

The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.

The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.

As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.

Geographic coverage

National

Analysis unit

The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.

Universe

As is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).

Kind of data

Sample survey data [ssd]

Sampling procedure

The samples for the Slovenia ES 2009, 2013, and 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for Slovenia 2009 ES and for Slovenia 2013 ES, and in the Sampling Note for 2019 Slovenia ES.

Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information of the industries and regions chosen are included in the attached Excel file (Sampling Report.xls.) for Slovenia 2009 ES. For Slovenia 2013 and 2019 ES, specific information of the industries and regions chosen is described in the "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.

For the Slovenia 2009 ES, industry stratification was designed in the way that follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries sample sizes were inflated by about 17% to account for potential non-response cases when requesting sensitive financial data and also because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals) sample sizes were inflated by about 12% to account for under sampling in firms in service industries.

For Slovenia 2013 ES, industry stratification was designed in the way that follows: the universe was stratified into one manufacturing industry, and two service industries (retail, and other services).

Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).

For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.
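As an illustration, the size strata described above could be assigned as in the following minimal sketch. The function name is hypothetical, and returning None for establishments with fewer than 5 permanent full-time workers is an assumption made here for illustration; the text only lists the three strata.

```python
from typing import Optional


def size_stratum(permanent_full_time_workers: int) -> Optional[str]:
    """Map a count of permanent full-time workers to an ES size stratum.

    Cutoffs follow the text: small (5 to 19), medium (20 to 99),
    large (100 or more). Counts below 5 fall outside the listed strata,
    so this sketch returns None for them (an assumption).
    """
    if permanent_full_time_workers < 5:
        return None
    if permanent_full_time_workers <= 19:
        return "small"
    if permanent_full_time_workers <= 99:
        return "medium"
    return "large"


if __name__ == "__main__":
    for count in (4, 5, 19, 20, 99, 100):
        print(count, size_stratum(count))
```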

For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.

For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.

Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).

Mode of data collection

Computer Assisted Personal Interview [capi]

Research instrument

Questionnaires have common questions (core module) and, respectively, additional manufacturing- and services-specific questions. The eligible manufacturing industries have been surveyed using the Manufacturing questionnaire (includes the core module, plus manufacturing-specific questions). Retail firms have been interviewed using the Services questionnaire (includes the core module plus retail-specific questions) and the residual eligible services have been covered using the Services questionnaire (includes the core module). Each variation of the questionnaire is identified by the index variable, a0.

Response rate

Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.

Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.
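When working with the data, refusals coded as -8 need to be treated as missing rather than as numeric answers. The following is a minimal sketch of that idea; the function name and the response values are made up for illustration, and only the -8 code is taken from the text (the sketch does not assume anything about other special codes).

```python
REFUSAL_CODE = -8  # refusal-to-respond code mentioned in the text


def mean_excluding_refusals(values):
    """Average the answered values, skipping entries coded as refusals.

    Returns None when no respondent answered the item.
    """
    answered = [v for v in values if v != REFUSAL_CODE]
    if not answered:
        return None
    return sum(answered) / len(answered)


if __name__ == "__main__":
    # Hypothetical answers to one sensitive item; two respondents refused.
    responses = [10, -8, 30, 20, -8]
    print(mean_excluding_refusals(responses))  # 20.0
```

Averaging the raw column without this step would pull the estimate toward the negative code.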

For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.

For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates in Slovenia may be selection bias and not frame inaccuracy.

For 2013, the number of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The number of rejections per contact was 44%.

Finally, for 2019, the number of interviews per contacted establishments was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
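The three waves report their yield on different scales: contacts per interview for 2009, and interviews per contact for 2013 and 2019. The sketch below converts between the two so the figures can be compared. It assumes both measures are defined over the same base of contacted establishments, which the text suggests but does not state explicitly; the function names are hypothetical.

```python
def interviews_per_contact(contacts_per_interview: float) -> float:
    """Convert contacts per realized interview into interviews per contact."""
    return 1.0 / contacts_per_interview


def contacts_per_interview(interview_share: float) -> float:
    """Convert interviews per contact (as a fraction) into contacts per interview."""
    return 1.0 / interview_share


if __name__ == "__main__":
    # 2009: 6.18 contacted establishments per realized interview.
    print(round(100 * interviews_per_contact(6.18), 1))  # 16.2 (percent)
    # 2013: 25% realized interviews per contacted establishment.
    print(contacts_per_interview(0.25))  # 4.0
    # 2019: 9.7% interviews per contacted establishment.
    print(round(contacts_per_interview(0.097), 1))  # 10.3
```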
Data from: DATASET FOR: A multimodal spectroscopic approach combining...
producciocientifica.uv.es
zenodo.org
Updated 2024
Cite
Perez Guaita, David; Perez Guaita, David (2024). DATASET FOR: A multimodal spectroscopic approach combining mid-infrared and near-infrared for discriminating Gram-positive and Gram-negative bacteria [Dataset]. https://producciocientifica.uv.es/documentos/67321becaea56d4af0482a0e
Dataset updated
2024
Authors
Perez Guaita, David; Perez Guaita, David
Description
Description:

This dataset comprises a comprehensive set of files designed for the analysis and 2D correlation of spectral data, specifically focusing on ATR and NIR spectra. It includes MATLAB scripts and supporting functions necessary to replicate the analysis, as well as the raw datasets used in the study. Below is a detailed description of the included files:

Data Analysis:

File Name: Data_Analysis.mlx

Description: This MATLAB Live Script file contains the main script used for the classification analysis of the spectral data. It includes steps for preprocessing, analysis, and visualization of the ATR and NIR spectra.

2D Correlation Data Analysis:

File Name: Data_Analysis_2Dcorr.mlx

Description: This MATLAB Live Script file is similar to the primary analysis script but is specifically tailored for performing 2D correlation analysis on the spectral data. It includes detailed steps and code for executing the 2D correlation.

Functions:

Folder Name: Functions

Description: This folder contains all the necessary MATLAB function files required to replicate the analyses presented in the scripts. These functions handle various preprocessing steps, calculations, and visualizations.

Datasets:

File Names: ATR_dataset.xlsx, NIR_dataset.xlsx, Reference_data.csv

Description: These Excel files contain the raw spectral data for ATR and NIR analyses, as well as reference datasets. Each file includes multiple sheets with detailed measurements and metadata.

Usage Notes:

Software Requirements:

MATLAB is required to run the .mlx files and utilize the functions.

PLS_Toolbox: Necessary for certain preprocessing and analysis steps.

MIDAS 2010: Available at MIDAS 2010, required for the 2D correlation analysis.

Replication: Users can replicate the analyses by running the Data_Analysis.mlx and Data_Analysis_2Dcorr.mlx scripts in MATLAB, ensuring that the Functions folder is in the MATLAB path.

Data Handling: The datasets are provided in .xlsx format, which can be easily imported into MATLAB or other data analysis software.
Supplementary Datasets
data.mendeley.com
Updated Mar 17, 2020
Cite
Natalia Novoselova (2020). Supplementary Datasets [Dataset]. http://doi.org/10.17632/8s3fps4vvb.2
Unique identifier
https://doi.org/10.17632/8s3fps4vvb.2
Dataset updated
Mar 17, 2020
Authors
Natalia Novoselova
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The shared archives combined in Supplementary Datasets represent the actual databases used in the investigation described in two papers:

Meteorological conditions affecting black vulture (Coragyps atratus) soaring behavior in the southeast of Brazil: Implications for bird strike abatement (in submission)

Remote sensing applications for abating the aircraft-bird strike risks in the southeast of Brazil (Human-Wildlife Interactions Journal, in print)

The papers were based on my Master's thesis, defended in 2016 at the Institute of Biology of the University of Campinas (UNICAMP) in partial fulfilment of the requirements for the degree of Master in Ecology. Our investigation was devoted to reducing the risk of aircraft collisions with black vultures. It had two parts, addressed in these two papers. In the first, we studied the relationship between the soaring activity of black vultures and meteorological characteristics. In the second, we explored the dependence of the vultures' soaring activity on surface and anthropogenic characteristics. The study was carried out in the surroundings of two airports in the southeast of Brazil, taken as case studies. We developed methodological approaches combining GIS and remote sensing technologies for data processing, which served as the main research instrument. With them, we joined in georeferenced databases (shapefiles) the bird observation data and three types of environmental factors: (i) meteorological characteristics collected together with the bird observations; (ii) surface parameters (relief and surface temperature) obtained from ASTER imagery products; (iii) parameters of land cover and anthropogenic pressure obtained from high-resolution satellite images. Based on analyses of the georeferenced databases, the relationship between the soaring activity of vultures and environmental factors was studied; the behavioral patterns of vultures in soaring flight were revealed; the landscape types that are highly attractive to this species and produce increased concentrations of birds over them were detected; maps giving a numerical estimate of the hazard of bird strike events in the airport vicinities were constructed; and practical recommendations for decreasing the risk of collisions with vultures and other bird species were formulated.

This archive contains all materials developed and used for the study, including the GIS database for the two papers, remote sensing data, and Microsoft Excel datasets. The supplementary files are described in Description of Supplementary Dataset.docx. The links between the supplementary files and the text of the papers are given in Attribution to the text of papers.docx. The supplementary files are in the folders Datasets, GIS_others, GIS_Raster, GIS_Shape.

For any questions, please write to me at this email: natalieenov@gmail.com

Natalia Novoselova
Data from: Dynamic Technical and Environmental Efficiency Performance of...
dataverse.fgcu.edu
data.mendeley.com
zip
Updated Aug 2, 2024
Cite
Isaiah Magambo; Isaiah Magambo (2024). Dynamic Technical and Environmental Efficiency Performance of Large Gold Mines in Developing Countries [Dataset]. http://doi.org/10.17632/pp3g267hny.1
Available download formats: zip (322671)
Unique identifier
https://doi.org/10.17632/pp3g267hny.1
Dataset updated
Aug 2, 2024
Dataset provided by
FGCU Data Repository
Authors
Isaiah Magambo; Isaiah Magambo
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Firm-level data from 2009 to 2018 for 34 large gold mines in developing countries. The data are used to compute the deterministic, dynamic environmental and technical efficiencies of large gold mines in developing countries.

Steps to reproduce
1. Run the R command to generate dynamic technical and dynamic inefficiencies for every two subsequent periods (i.e., period t and t+1).
2. Combine the results files of inefficiencies per period generated in R into a panel (see the Excel files in the results folder).
3. Import the Excel folder into Stata and generate the final results indicated in the paper.
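The second step, stacking per-period result files into one panel, can be sketched as follows. This is only an illustration of the general idea and not the authors' procedure: the column names (mine_id, inefficiency), the period labels, and the values are all made up, and the inputs are in-memory CSV text rather than the actual result files.

```python
import csv
import io


def stack_into_panel(period_tables):
    """Stack per-period tables into one long panel.

    period_tables maps a period label to CSV text; every table is assumed
    to share the same columns. Each output row gains a "period" column.
    """
    panel = []
    for period, text in sorted(period_tables.items()):
        for row in csv.DictReader(io.StringIO(text)):
            row["period"] = period
            panel.append(row)
    return panel


if __name__ == "__main__":
    # Hypothetical results for two consecutive period pairs.
    tables = {
        "2009-2010": "mine_id,inefficiency\nA,0.10\nB,0.25\n",
        "2010-2011": "mine_id,inefficiency\nA,0.12\nB,0.20\n",
    }
    for row in stack_into_panel(tables):
        print(row["mine_id"], row["period"], row["inefficiency"])
```

The long format (one row per mine and period) is what panel commands in statistical packages typically expect.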
NHANES 1988-2018
kaggle.com
zip
Updated Jul 31, 2025
Cite
nguyenvy (2025). NHANES 1988-2018 [Dataset]. https://www.kaggle.com/datasets/nguyenvy/nhanes-19882018
Available download formats: zip (917955003 bytes)
Dataset updated
Jul 31, 2025
Authors
nguyenvy
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey 1. demographics (281 variables), 2. dietary consumption (324 variables), 3. physiological functions (1,040 variables), 4. occupation (61 variables), 5. questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), 6. medications (29 variables), 7. mortality information linked from the National Death Index (15 variables), 8. survey weights (857 variables), 9. environmental exposure biomarker measurements (598 variables), and 10. chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.
- The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, only the cleaned NHANES modules and the data dictionaries can be downloaded, as a .zip file that includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter code: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
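The difference between summary statistics computed with and without survey weights can be illustrated with a simple weighted mean. This is a minimal sketch with made-up values and weights; it is not taken from the starter code, and it covers only the point estimate. Design-based standard errors also require the strata and cluster information of the sampling design, which this sketch ignores.

```python
def unweighted_mean(values):
    """Plain average that treats every participant equally."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Average in which each participant counts in proportion to a weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


if __name__ == "__main__":
    # Hypothetical measurements and survey weights for three participants.
    values = [1.0, 2.0, 3.0]
    weights = [1.0, 1.0, 2.0]
    print(unweighted_mean(values))         # 2.0
    print(weighted_mean(values, weights))  # 2.25
```

The third participant carries twice the weight of the others, so the weighted estimate moves toward that participant's value.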
Environmental DNA (eDNA) Metabarcoding Pilot Study on National Wildlife...
catalog.data.gov
Updated Nov 25, 2025
Cite
U.S. Fish and Wildlife Service (2025). Environmental DNA (eDNA) Metabarcoding Pilot Study on National Wildlife Refuges - Tabular Data [Dataset]. https://catalog.data.gov/dataset/environmental-dna-edna-metabarcoding-pilot-study-on-national-wildlife-refuges-tabular-data
Dataset updated
Nov 25, 2025
Dataset provided by
U.S. Fish and Wildlife Service (http://www.fws.gov/)
Description
This reference contains tabular datasets resulting from the eDNA pilot study on National Wildlife Refuges:
- ZIP file containing all datasets as received from the authors: a folder for each participating refuge containing two Excel workbooks, one for the MiFish marker results and one for the COI marker results. Each workbook has several sheets, including one for the raw compiled data, one for each site, and filtered combined data.
- CSV of filtered data for all participating refuges combined. This dataset was compiled by extracting the filtered datasheet for each refuge from the Excel workbook and combining them into a CSV using an R script.
- CSV of the total OTU, OTU species, unique families, and number of fish, mammal, amphibian, mollusk, and bird species for each participating refuge. This CSV was compiled by Rachel Maxey (I&M Data Manager) by extracting the data from the refuge workbooks and combining them manually into a CSV.
- CSV of the full Site data download from Survey 123.
- Data dictionaries and metadata for site information and eDNA results tables.
Albero study: a longitudinal database of the social network and personal...
zenodo.org
data.niaid.nih.gov
bin, csv
Updated Mar 26, 2021
Cite
Isidro Maya Jariego; Isidro Maya Jariego; Daniel Holgado Ramos; Daniel Holgado Ramos; Deniza Alieva; Deniza Alieva (2021). Albero study: a longitudinal database of the social network and personal networks of a cohort of students at the end of high school [Dataset]. http://doi.org/10.5281/zenodo.3532048
Available download formats: bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.3532048
Dataset updated
Mar 26, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Isidro Maya Jariego; Isidro Maya Jariego; Daniel Holgado Ramos; Daniel Holgado Ramos; Deniza Alieva; Deniza Alieva
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
ABSTRACT

The Albero study analyzes the personal transitions of a cohort of high school students at the end of their studies. The data consist of (a) the longitudinal social network of the students, before (n = 69) and after (n = 57) finishing their studies; and (b) the longitudinal study of the personal networks of each of the participants in the research. The two observations of the complete social network are presented in two matrices in Excel format. For each respondent, two square matrices of 45 alters of their personal networks are provided, also in Excel format. For each respondent, both psychological sense of community and frequency of commuting is provided in a SAV file (SPSS). The database allows the combined analysis of social networks and personal networks of the same set of individuals.

INTRODUCTION

Ecological transitions are key moments in the life of an individual that occur as a result of a change of role or context. This is the case, for example, of the completion of high school studies, when young people start their university studies or try to enter the labor market. These transitions are turning points that carry a risk or an opportunity (Seidman & French, 2004). That is why they have received special attention in research and psychological practice, both from a developmental point of view and in the situational analysis of stress or in the implementation of preventive strategies.

The data we present in this article describe the ecological transition of a group of young people from Alcala de Guadaira, a town located about 16 kilometers from Seville. Specifically, in the “Albero” study we monitored the transition of a cohort of secondary school students at the end of the last pre-university academic year. It is a turning point in which most of them began a metropolitan lifestyle, with more displacements to the capital and a slight decrease in identification with the place of residence (Maya-Jariego, Holgado & Lubbers, 2018).

Normative transitions, such as the completion of studies, affect a group of individuals simultaneously, so they can be analyzed both individually and collectively. From an individual point of view, each student stops attending the institute, which is replaced by new interaction contexts. Consequently, the structure and composition of their personal networks are transformed. From a collective point of view, the network of friendships of the cohort of high school students enters into a gradual process of disintegration and fragmentation into subgroups (Maya-Jariego, Lubbers & Molina, 2019).

These two levels, individual and collective, were evaluated in the “Albero” study. One of the peculiarities of this database is that we combine the analysis of a complete social network with a survey of personal networks in the same set of individuals, with a longitudinal design before and after finishing high school. This allows combining the study of the multiple contexts in which each individual participates, assessed through the analysis of a sample of personal networks (Maya-Jariego, 2018), with the in-depth analysis of a specific context (the relationships within one graduating class of students at the institute), through the analysis of the complete network of interactions. This potentially allows us to examine the covariation of the social network with the individual differences in the structure of personal networks.

PARTICIPANTS

The social network and personal networks of the students of the last two years of high school of an institute of Alcala de Guadaira (Seville) were analyzed. The longitudinal follow-up covered approximately a year and a half. The first wave was composed of 31 men (44.9%) and 38 women (55.1%) who live in Alcala de Guadaira, and who mostly expect to live in Alcala (36.2%) or in Seville (37.7%) in the future. In the second wave, information was obtained from 27 men (47.4%) and 30 women (52.6%).

DATA STRUCTURE AND ARCHIVE FORMATS

The data is organized in two longitudinal observations, with information on the complete social network of the cohort of students of the last year, the personal networks of each individual and complementary information on the sense of community and frequency of metropolitan movements, among other variables.

Social network

The file “Red_Social_t1.xlsx” is a valued matrix of 69 actors that gathers the relations of knowledge and friendship between the cohort of students of the last year of high school in the first observation. The file “Red_Social_t2.xlsx” is a valued matrix of 57 actors obtained 17 months after the first observation.

In order to generate each complete social network, the list of 77 students enrolled in the last year of high school was given to the respondents, who were asked to indicate the type of relationship in each case, according to the following values: 1, “his/her name sounds familiar”; 2, “I know him/her”; 3, “we talk from time to time”; 4, “we have a good relationship”; and 5, “we are friends.” The two resulting complete networks are represented in Figure 2. The network in the second observation is comparatively less dense, reflecting the gradual disintegration process that the student group has initiated.
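
As a rough illustration of the density comparison described above, the following sketch computes the share of ordered pairs whose reported relationship reaches a chosen level. The toy matrix, the function name `density`, and the cut-off of 3 are assumptions made for the example; they are not part of the dataset documentation.

```python
def density(matrix, threshold=1):
    """Share of ordered pairs (i, j), i != j, whose tie value is >= threshold."""
    n = len(matrix)
    if n < 2:
        return 0.0
    ties = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and matrix[i][j] >= threshold
    )
    return ties / (n * (n - 1))


# Hypothetical 4-actor valued matrix on the 1-5 scale (0 = no relation reported).
toy = [
    [0, 5, 3, 0],
    [4, 0, 2, 1],
    [3, 2, 0, 0],
    [0, 1, 0, 0],
]

print(density(toy))               # any reported relation
print(density(toy, threshold=3))  # "we talk from time to time" or stronger
```

Applying the same function to the matrices of both observations, with the same threshold, would give one simple way to compare how dense the two networks are.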

Personal networks

Also in this case the information is organized in two observations. The compressed file “Redes_Personales_t1.csv” includes 69 folders, corresponding to personal networks. Each folder includes a valued matrix of 45 alters in CSV format. Likewise, in each case a graphic representation of the network obtained with Visone (Brandes and Wagner, 2004) is included. Relationship values range from 0 (do not know each other) to 2 (know each other very well).

Second, the compressed file “Redes_Personales_t2.csv” includes 57 folders, with the equivalent information for each respondent at the second observation, that is, 17 months after the first interview. The structure of the data is the same as in the first observation.

Sense of community and metropolitan displacements

The SPSS file “Albero.sav” collects the survey data, together with some summary information on the network data related to each respondent. The 69 rows correspond to the 69 individuals interviewed, and the 118 columns to the variables related to each of them in T1 and T2, according to the following list:

• Socio-economic data.

• Data on habitual residence.

• Information on intercity journeys.

• Identity and sense of community.

• Personal network indicators.

• Social network indicators.

DATA ACCESS

Social networks and personal networks are available in CSV format. This allows them to be used directly with UCINET, Visone, Pajek or Gephi, among others, and they can be exported as Excel or text files for use with other programs.

The visual representation of the personal networks of the respondents in both waves is available in the following album of the Graphic Gallery of Personal Networks on Flickr: <https://www.flickr.com/photos/25906481@N07/albums/72157667029974755>.

In previous work we analyzed the effects of personal networks on the longitudinal evolution of the socio-centric network. That work also includes additional details about the instruments applied. If you use the data, please cite the following reference:

Maya-Jariego, I., Holgado, D. & Lubbers, M. J. (2018). Efectos de la estructura de las redes personales en la red sociocéntrica de una cohorte de estudiantes en transición de la enseñanza secundaria a la universidad. Universitas Psychologica, 17(1), 86-98. https://doi.org/10.11144/Javeriana.upsy17-1.eerp

The English version of this article can be downloaded from: https://tinyurl.com/yy9s2byl

CONCLUSION

The database of the “Albero” study allows us to explore the co-evolution of social networks and personal networks. In this way, we can examine the mutual dependence of individual trajectories and the structure of the relationships of the cohort of students as a whole. The complete social network corresponds to the same context of interaction: the secondary school. However, personal networks collect information from the different contexts in which the individual participates. The structural properties of personal networks may partly explain individual differences in the position of each student in the entire social network. In turn, the properties of the entire social network partly determine the structure of opportunities in which individual trajectories are displayed.

The longitudinal design and the combination of the personal networks of individuals with a common complete social network give this database unique characteristics. It may be of interest both for multi-level analysis and for the study of individual differences.

ACKNOWLEDGEMENTS

The fieldwork for this study was supported by the Complementary Actions of the Ministry of Education and Science (SEJ2005-25683), and was part of the project “Dynamics of actors and networks across levels: individuals,
Israel Census
kaggle.com
zip
Updated Jul 31, 2018
Cite
Dan Ofer (2018). Israel Census [Dataset]. https://www.kaggle.com/danofer/israel-census
Explore at:
Available download formats: zip (4275033 bytes)
Dataset updated
Jul 31, 2018
Authors
Dan Ofer
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
Israel
Description
Context

2008 population and demographic census data for Israel, at the level of settlements and lower.

Content

Data are provided at the sub-settlement level (i.e., neighborhoods). Variable names (in Hebrew and English) and a data dictionary are provided in XLS files. 2008 statistical area names are provided (along with top roads/neighborhoods per settlement). The Excel data need cleaning and merging from multiple sub-pages.

Ideas:

Combine with voting datasets

Correlate population or economic growth over time with demographics

Geospatial analysis

Merge and clean the data from the sub tables.

Acknowledgements

Data from Israel Central Bureau of Statistics (CBS): http://www.cbs.gov.il/census/census/pnimi_page.html?id_topic=12

Photo by Me (Dan Ofer).
University of Cape Town Student Admissions Data 2006-2014 - South Africa
datafirst.uct.ac.za
Updated Jul 28, 2020
Cite
UCT Student Administration (2020). University of Cape Town Student Admissions Data 2006-2014 - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/556
Explore at:
Dataset updated
Jul 28, 2020
Dataset authored and provided by
UCT Student Administration
Time period covered
2006 - 2014
Area covered
South Africa
Description
Abstract

This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). This database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst, the data were ill suited to research purposes. This dataset represents an attempt at cleaning and organizing these data into a more tractable format. To ensure data confidentiality, direct identifiers have been removed from the data, and the data are only made available to accredited researchers through DataFirst's Secure Data Service.

The dataset was separated into the following data files:

Application level information: the "finest" unit of analysis. Individuals may have multiple applications. Uniquely identified by an application ID variable. There are a total of 1,714,669 applications on record.

Individual level information: individuals may have multiple applications. Each individual is uniquely identified by an individual ID variable. Each individual is associated with information on "key subjects" from a separate data file also contained in the database. These key subjects are all separate variables in the individual level data file. There are a total of 285,005 individuals on record.

Secondary Education Information: individuals can also be associated with row entries for each subject. This data file does not have a unique identifier. Instead, each row entry represents a specific secondary school subject for a specific individual. These subjects are quite specific, and the data allow the user to distinguish between, for example, higher grade accounting and standard grade accounting. The data also allow the user to identify the educational authority issuing the qualification, e.g. Cambridge International Examinations (CIE) versus National Senior Certificate (NSC).

Tertiary Education Information: the smallest of the four data files. There are multiple entries for each individual in this dataset. Each row entry contains information on the year, institution and transcript information and can be associated with individuals.

Analysis unit

Applications, individuals

Kind of data

Administrative records [adm]

Mode of data collection

Other [oth]

Cleaning operations

The data files were made available to DataFirst as a group of Excel spreadsheet documents from an SQL database managed by the University of Cape Town's Information and Communication Technology Services. The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.
Market Basket Analysis
kaggle.com
zip
Updated Dec 9, 2021
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Explore at:
Available download formats: zip (23875170 bytes)
Dataset updated
Dec 9, 2021
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to offer customers itemset suggestions, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

Introduction

Association rules are most often used to find associations between different objects in a set, that is, to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, which allows the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":

- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mat) = 0.80/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
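
The three measures can be checked with a few lines of Python. This is only a sketch using the standard definitions for a rule "antecedent => consequent" (confidence is P(both) divided by P(antecedent), lift is confidence divided by P(consequent)); the function name `rule_metrics` is made up for the illustration.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for the rule antecedent => consequent."""
    support = n_both / n_total
    confidence = n_both / n_antecedent            # P(consequent | antecedent)
    lift = confidence / (n_consequent / n_total)  # confidence / P(consequent)
    return support, confidence, lift


# Counts from the example: 100 customers, 10 bought a mouse, 9 a mat, 8 both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift well above 1, as here, indicates that the two items are bought together far more often than would be expected if the purchases were independent.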

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of the rules

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of Rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

Libraries in R

First, we need to load the required libraries. Each one is briefly described below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

Next, we clean the data frame by removing missing values.

To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
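
The conversion into per-invoice transactions can be sketched in plain Python. The original workflow uses R, so this is only an analogue of the idea, not the author's code. The sample rows and the function name `to_transactions` are hypothetical; only the column names BillNo and Itemname come from the dataset description.

```python
from collections import defaultdict


def to_transactions(rows):
    """Group item names by bill number, skipping rows with missing values."""
    baskets = defaultdict(set)
    for row in rows:
        bill, item = row.get("BillNo"), row.get("Itemname")
        if bill is None or not item:
            continue  # drop incomplete rows, mirroring the missing-value cleanup
        baskets[bill].add(item)
    # One sorted item list per invoice.
    return {bill: sorted(items) for bill, items in baskets.items()}


# Hypothetical rows; real data would come from Assignment-1_Data.xlsx.
rows = [
    {"BillNo": 100001, "Itemname": "MUG"},
    {"BillNo": 100001, "Itemname": "LANTERN"},
    {"BillNo": 100002, "Itemname": "MUG"},
    {"BillNo": 100002, "Itemname": None},
]
transactions = to_transactions(rows)
print(transactions)
```

Each resulting item list plays the role of one basket that an association rule algorithm such as Apriori would take as input.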
2019-2020 National Survey on Drug Use and Health: Comparison of Population...
catalog.data.gov
data.virginia.gov
Updated Sep 7, 2025
Cite
Substance Abuse and Mental Health Services Administration (2025). 2019-2020 National Survey on Drug Use and Health: Comparison of Population Percentages from the United States, Census Regions, States, and the District of Columbia (Documentation for CSV and Excel Files) [Dataset]. https://catalog.data.gov/dataset/2019-2020-national-survey-on-drug-use-and-health-comparison-of-population-percentages-from
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
Substance Abuse and Mental Health Services Administration (https://www.samhsa.gov/)
Area covered
Washington, United States
Description
State estimates for these years are no longer available due to methodological concerns with combining 2019 and 2020 data. We apologize for any inconvenience or confusion this may cause.

Because of the COVID-19 pandemic, most respondents answered the survey via the web in Quarter 4 of 2020, even though all responses in Quarter 1 were from in-person interviews. It is known that people may respond to the survey differently while taking it online, thus introducing what is called a mode effect.

When the state estimates were released, it was assumed that the mode effect was similar for different groups of people. However, later analyses have shown that this assumption should not be made. Because of these analyses, along with concerns about the rapid societal changes in 2020, it was determined that averages across the two years could be misleading.

For more detail on this decision, see the 2019-2020 state data page.
What students answer when discussing about citation practices
data-staging.niaid.nih.gov
data.niaid.nih.gov
Updated Sep 21, 2021
Cite
Salamin, Caroline; Cobolet, Noémi; Grolimund, Raphaël; Bouton, Pascale (2021). What students answer when discussing about citation practices [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_290155
Explore at:
Dataset updated
Sep 21, 2021
Dataset provided by
Bibliothèque de l'EPFL
Authors
Salamin, Caroline; Cobolet, Noémi; Grolimund, Raphaël; Bouton, Pascale
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This document explains how the data were generated and how to interpret them.

LICENSE: The data are listed under CC BY 4.0, but if you want to combine them with other datasets, feel free to use them as if they were published under a CC0 license.
Data were published in February 2017. At that time, Zenodo only provided CC BY, CC BY-SA, CC BY-NC, CC BY-ND and CC BY-NC-ND. No CC0 option was available.

HOW DATA WERE COLLECTED
The 21 recorded sessions took place between February 2013 and December 2016.
Data were collected using Turning Technologies' remote controls (called clickers) and TurningPoint software.

The 4 versions of the quiz used during these 4 years are provided in the 'quizzes' folder for information purposes (in PDF and PowerPoint formats).

Turning Technologies records data in a closed format (.tpzx) that can be exported and converted into the 3 formats provided here (these 3 files contain the same data):

Excel (.xlsx)

Comma-separated values (.csv)

SQLite (.sqlite)

The first one was directly exported from TurningPoint and is provided for Excel users who can't read CSV correctly.
CSV was converted from Excel and is provided for non-Excel users.
Finally, SQLite is provided in order to apply different sorting and filters to the data. It can be read using SQLite manager for Firefox (https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/).
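
As an illustration of the kind of sorting and filtering the SQLite file is meant for, here is a sketch using Python's built-in `sqlite3` module. The table name `answers` and the in-memory sample rows are assumptions made for the example; the column names follow the codebook, and the actual table name in the published file should be checked first (for example by querying `sqlite_master`).

```python
import sqlite3

# In-memory stand-in for the .sqlite file; the table name "answers" is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE answers (Session INTEGER, Level TEXT, Language TEXT, Q3 TEXT)"
)
conn.executemany(
    "INSERT INTO answers VALUES (?, ?, ?, ?)",
    [(1, "BA2", "FR", "A"), (1, "BA2", "FR", "-"), (14, "MA", "EN", "B")],
)

# Keep only rows where Q3 was answered ('-' means no answer), latest session first.
rows = conn.execute(
    "SELECT Session, Level, Q3 FROM answers WHERE Q3 != '-' ORDER BY Session DESC"
).fetchall()
print(rows)
conn.close()
```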

CODEBOOK
Here are the name, meaning and possible values of each column (name - meaning [possible values]). If a student didn't answer a question, the value is '-'.

Session - session number (chronological) [1 to 21]
AcademicYear - academic year [12-13, 13-14, 14-15, 15-16, 16-17]
Year - calendar year [2013, 2014, 2015, 2016]
Month - month (number) [1 to 12]
Day - day (number) [1 to 31]
Section - section abbreviation [CH, ESC, GM, IF, SIE, SV]
Level - students' level [BA2, BA3, MA]
Language - course's language [FR or EN]
DeviceID - clicker's ID [(unique ID within a session)]
Q1 - answers to question 1 [A, B, C, D, E]
Q2 - answers to question 2 [A, B, C, D]
Q3 - answers to question 3 [A or B]
Q4 - answers to question 4 [A or B]
Q5 - answers to question 5 [A or B]
Q6 - answers to question 6 [A or B]
Q7 - answers to question 7 [A or B]
Q8 - answers to question 8 [A or B]
Q9 - answers to question 9 [A or B]
Q8-9 - answers to the question 8-9 (merge) [A or B]
Q10 - answers to question 10 [1, 2]
Q11 - answers to question 11 [A or B]
Q12 - answers to question 12 [A, B]

Section abbreviation meaning
* CH: chemistry
* ESC: school of criminal justice (Unil)
* GM: mechanical engineering
* IF: financial engineering
* SIE: environmental engineering
* SV: life sciences

Level meaning
* BA2: 2nd year of Bachelor
* BA3: 3rd year of Bachelor
* MA: Master level

Question types
For some questions, multiple answers were allowed: Q1, Q2, Q10 & Q12.
Half of the questions have only one correct answer, true or false: Q3, Q5, Q6, Q7, Q8, Q9 & Q8-9.
Finally, for 2 questions only one answer was accepted, but there is not only one correct answer: Q4 & Q11.
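
For readers working from the CSV file, the following sketch tallies the answers to one single-answer question while skipping unanswered entries. The sample rows are hypothetical and contain only a few of the codebook columns; the only convention taken from the documentation is that '-' marks a missing answer.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; the real CSV contains all the codebook columns.
sample = io.StringIO(
    "Session,Level,Q3,Q4\n"
    "1,BA2,A,B\n"
    "1,BA2,-,A\n"
    "2,MA,B,-\n"
)


def tally(rows, question):
    """Count the answers given to one question, ignoring '-' (no answer)."""
    return Counter(row[question] for row in rows if row[question] != "-")


rows = list(csv.DictReader(sample))
counts = tally(rows, "Q3")
print(dict(counts))
```

Questions that allowed multiple answers (Q1, Q2, Q10 and Q12) would need an extra splitting step, which depends on how combined answers are encoded in the file.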

INFORMATION ABOUT THE SESSIONS
Unless otherwise stated below, all sessions were conducted like the original one: Q1 to Q12 (no Q8-9). The original French version of the quiz was translated into English for a few sessions with Master students. For sessions 14 and 20, Q5 was removed and Q8 & Q9 were merged into Q8-9.
Session 18 was a short one with only 7 questions: Q1, Q2, Q3, Q4, Q6, Q7 & Q9.

CONTACT INFORMATION
If you have any questions about these data, contact formations.bib@epfl.ch.
Student Performance Data Set
kaggle.com
zip
Updated Mar 27, 2020
Cite
Data-Science Sean (2020). Student Performance Data Set [Dataset]. https://www.kaggle.com/datasets/larsen0966/student-performance-data-set
Explore at:
Available download formats: zip (12353 bytes)
Dataset updated
Mar 27, 2020
Authors
Data-Science Sean
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
If this data set is useful, an upvote is appreciated. These data concern student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).
Vehicle Weight Estimation Dataset (5 Classes)
kaggle.com
zip
Updated Jul 17, 2025
Cite
Zainab (2025). Vehicle Weight Estimation Dataset (5 Classes) [Dataset]. https://www.kaggle.com/datasets/zainab288/vehicle-weight-estimation-dataset-5-classes
Explore at:
Available download formats: zip (453778558 bytes)
Dataset updated
Jul 17, 2025
Authors
Zainab
Description
This dataset contains images of five different vehicle classes: Bus, Car, Motorcycle, Light Truck, and Heavy Truck. The images are split into training and testing sets, making it suitable for supervised learning tasks such as image classification and weight estimation.

In addition to the image files, the dataset includes two Excel sheets that provide approximate weight annotations for the different vehicle classes, enabling combined classification-regression tasks.

Class name: total number
Bus: 1096
Car: 1428
Motorcycle: 542
Heavy truck: 1982
Light truck: 553

The dataset was manually created by combining images from several public datasets:

https://www.kaggle.com/datasets/kshitij192/cars-image-dataset
https://www.kaggle.com/datasets/krishrana/vehicle-dataset
https://www.kaggle.com/datasets/kaggleashwin/vehicle-type-recognition

Additional images were manually collected from the internet and organized into the five categories to ensure better class balance and diversity.

The dataset is shared for research and commercial use, with the goal of supporting projects in vehicle classification, weight estimation, and intelligent transportation systems.
E-Commerce Data
kaggle.com
zip
Updated Aug 17, 2017
Cite
Carrie (2017). E-Commerce Data [Dataset]. https://www.kaggle.com/datasets/carrie1/ecommerce-data
Explore at:
Available download formats: zip (7548686 bytes)
Dataset updated
Aug 17, 2017
Authors
Carrie
Description
Context

Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset, which contains actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found under the title "Online Retail".

Content

"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."

Acknowledgements

Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Image from stocksnap.io.

Inspiration

Analyses for this dataset could include time series, clustering, classification and more.
IMDb Top 4070: Explore the Cinema Data
kaggle.com
zip
Updated Aug 13, 2023
Cite
K.T.S. Prabhu (2023). IMDb Top 4070: Explore the Cinema Data [Dataset]. https://www.kaggle.com/datasets/ktsprabhu/imdb-top-4070-explore-the-cinema-data/discussion
Explore at:
Available download formats: zip (1449581 bytes)
Dataset updated
Aug 13, 2023
Authors
K.T.S. Prabhu
Description
Description: Dive into the world of exceptional cinema with our meticulously curated dataset, "IMDb's Gems Unveiled." This dataset is a result of an extensive data collection effort based on two critical criteria: IMDb ratings exceeding 7 and a substantial number of votes, surpassing 10,000. The outcome? A treasure trove of 4070 movies meticulously selected from IMDb's vast repository.

What sets this dataset apart is its richness and diversity. With more than 20 data points meticulously gathered for each movie, this collection offers a comprehensive insight into each cinematic masterpiece. Our data collection process leveraged the power of Selenium and Pandas modules, ensuring accuracy and reliability.

Cleaning this vast dataset was a meticulous task, combining both Excel and Python for optimum precision. Analysis is powered by Pandas, Matplotlib, and NLTK, enabling us to uncover hidden patterns, trends, and themes within the realm of cinema.

Note: The data were collected as of April 2023. Future versions of this analysis will include a movie recommendation system. Please do connect for any queries. All Love, No Hate.
Economic calendar Invest Forex https://t.me/econos
kaggle.com
zip
Updated Apr 26, 2021
Cite
Sergey (2021). Economic calendar Invest Forex https://t.me/econos [Dataset]. https://www.kaggle.com/devorvant/economic-calendar
Explore at:
Available download formats: zip (7999685 bytes)
Dataset updated
Apr 26, 2021
Authors
Sergey
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description
Introduction

Explore the archive of relevant economic information: news on all indicators with explanations, data on past publications on the economies of the United States, Britain, Japan and other developed countries, volatility assessments and much more. For building forecast models, deep learning works best, with a model trained on economic calendar and Forex data. The economic calendar is an indispensable assistant for the trader.

On this topic, see the Telegram channel @Economic Calendar Investing Forex (https://t.me/economic_calendar_forex_invest). The channel alerts you 5 minutes before important high-volatility events and reports current data for monitoring from the Investing.com economic calendar.

Data set

The data set is provided as CSV and Excel spreadsheet files (two files: 2011-2013 and 2014-2019), available for download. The source of the data is https://www.investing.com/economic-calendar/

Column 1 - Event date

Column 2 - Event time (New York time)

Column 3 - Country of the event

Column 4 - The degree of volatility (possible fluctuations in currency, indices, etc.) caused by this event

Column 5 - Description of the event

Column 6 - Evaluation of the event based on the actual data: better than the forecast, worse, or in line with it

Column 7 - Data format (%, K ×10^3, M ×10^6, T ×10^9)

Column 8 - Actual event data

Column 9 - Event forecast data

Column 10 - Previous data on this event (with comments if there were any interim changes).
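
The data-format column suggests that numeric values carry a percent sign or a scale suffix. Below is a minimal sketch of how such values might be normalised, assuming they are stored as strings like "250K" or "1.5%"; the exact string layout in the files is an assumption, and the function name `parse_value` is hypothetical. The multipliers are taken as documented in the column list.

```python
# Multipliers as documented in the column list (K x10^3, M x10^6, T x10^9).
MULTIPLIERS = {"K": 10**3, "M": 10**6, "T": 10**9}


def parse_value(text):
    """Convert strings such as '250K' or '1.5%' to a float.

    Scale suffixes are expanded; a percent sign is dropped and the number
    is returned unchanged (1.5% -> 1.5). Empty strings give None.
    """
    text = text.strip().replace(",", "")
    if not text:
        return None
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    if suffix == "%":
        return float(text[:-1])
    return float(text)


for raw in ["250K", "1.5M", "-0.3%", "52.1", ""]:
    print(repr(raw), "->", parse_value(raw))
```

Applying the same function to the actual, forecast and previous columns would put them on a common scale before comparing them.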

Inspiration

Use the historical economic calendar (EC) in conjunction with Forex data (exchange rates, indices, metals, oil, stocks) to forecast subsequent Forex data and so minimize investment risks (combining fundamental and technical market analysis).

Use historical EC events to forecast subsequent ones (for example, calculating the probability of a Fed rate increase).

Investigate the impact of combinations of EC events on the degree of market volatility over different time periods.

Trace the main trends in the economies of the leading countries (for example, a decrease in the demand for unemployment benefits).

Use the EC together with the news archive for the same time interval for a more accurate forecast.

Bank Telemarketing

kaggle.com

zip

Updated Jun 1, 2025

Cite

Younus_Mohamed (2025). Bank Telemarketing [Dataset]. https://www.kaggle.com/datasets/younusmohamed/bank-telemarketing

Explore at:

Available download formats: zip (3248401 bytes)

Dataset updated

Jun 1, 2025

Authors

Younus_Mohamed

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

📞 Bank Marketing (Term Deposit Subscription) Dataset

Source : UCI Machine Learning Repository – Bank Marketing (#222)

A Portuguese retail bank’s phone-based marketing campaigns (May 2008 → Nov 2010).
The task is to predict whether a client will subscribe to a term deposit (target y).

1 · Background

Each row records the outcome of the last phone call (plus client history).
Multiple calls to the same client may appear across campaigns.
The original authors showed that data-driven targeting boosts campaign ROI – see the reference paper below.

2 · Files in this Kaggle release

| File | Rows | Columns | Notes |
|---|---|---|---|
| `bank_marketing.xlsx` | 45 211 | 17 | Classic “bank-full” version (all examples, 16 predictors + target) |

Need the enriched “bank-additional” version with 20 predictors? Grab it from the UCI link.

3 · Data Dictionary (16 predictors + target)

| Column | Type | Description |
|---|---|---|
| `age` | int | Age of the client |
| `job` | cat | Job type (admin., blue-collar, …) |
| `marital` | cat | Marital status (married / single / divorced) |
| `education` | cat | Education level (primary / secondary / tertiary / unknown) |
| `default` | bin | Has credit in default? |
| `balance` | int | Average yearly balance (EUR) |
| `housing` | bin | Has housing loan? |
| `loan` | bin | Has personal loan? |
| `contact` | cat | Contact channel (cellular / telephone / unknown) |
| `day` | int | Day of month of last contact |
| `month` | cat | Month of last contact (`jan`-`dec`) |
| `duration` | int | Call duration (secs)* |
| `campaign` | int | Contacts made in this campaign (incl. last) |
| `pdays` | int | Days since last contact (-1 ⇒ never) |
| `previous` | int | Previous contacts before this campaign |
| `poutcome` | cat | Outcome of previous campaign (failure / success / nonexistent) |
| `y` | bin | Target – subscribed to term deposit? (`yes`/`no`) |

*⚠️ duration is only known after the call ends; include it only for benchmarking, not for live prediction.
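
The caveat about `duration` can be applied when building the feature matrix. The following is a minimal sketch, assuming the column names from the data dictionary above and the `yes`/`no` encoding of `y`; the helper `split_features` and its `live` flag are hypothetical names introduced only for this illustration:

```python
import pandas as pd


def split_features(df: pd.DataFrame, live: bool = True):
    """Separate predictors from the target; drop `duration` when live is True."""
    # `duration` is only known after the call ends, so it is excluded
    # for live prediction and kept only for benchmarking runs.
    drop_cols = ["y"] + (["duration"] if live else [])
    X = df.drop(columns=drop_cols)
    y = (df["y"] == "yes").astype(int)  # 1 = subscribed, 0 = did not
    return X, y
```

Calling `split_features(df, live=False)` keeps `duration` in `X`, which is appropriate only when reproducing benchmark results.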

4 · Quick Start in Python

```python
import pandas as pd

df = pd.read_excel('/kaggle/input/bank-marketing/bank_marketing.xlsx')
print(df.shape)     # (45211, 17)
df.head()
```

Prefer pip? Fetch directly from ucimlrepo:
```python
# Notebook cell: the leading "!" runs pip as a shell command
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
bm = fetch_ucirepo(id=222)
X, y = bm.data.features, bm.data.targets
```

## 5 · Use-Cases & Ideas 

| 🛠️ ML Task       | Why it’s interesting                                              |
|--------------------------|----------------------------------------------------------------------------------------------------------------|
| Binary classification  | Classic imbalanced dataset – try **SMOTE**, cost-sensitive learning, threshold tuning             |
| Feature engineering   | Combine `pdays`, `campaign`, `previous` into a **contact-intensity score**                   |
| Model interpretability  | Use **SHAP** / **LIME** to explain “yes” predictions                              |
| Time-aware validation  | Data are date-ordered → split train/test chronologically to avoid leakage                   |
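
Two of the ideas in the table can be sketched briefly. This is an illustration only: the score formula and the 80/20 cut are assumptions chosen for the example, not something the dataset authors prescribe, and both helper names are hypothetical. The split relies on the rows being in date order, as the table states.

```python
import pandas as pd


def add_contact_intensity(df: pd.DataFrame) -> pd.DataFrame:
    """Add an illustrative contact-intensity score (assumed formula)."""
    out = df.copy()
    # pdays == -1 means the client was never contacted before,
    # so the recency term is set to 0 for those rows.
    recency = (1.0 / (1.0 + out["pdays"])).where(out["pdays"] >= 0, 0.0)
    out["contact_intensity"] = out["campaign"] + out["previous"] + recency
    return out


def chronological_split(df: pd.DataFrame, train_frac: float = 0.8):
    """Split without shuffling: earliest rows for training, latest for testing."""
    cut = int(len(df) * train_frac)
    return df.iloc[:cut], df.iloc[cut:]
```

Because no shuffling takes place, every test row comes after every training row, which is the point of the time-aware validation idea.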

---

## 6 · Credits & Citations 

> **Creators :** **Sérgio Moro, Paulo Rita, Paulo Cortez** 
> **Original paper :** 
> Moro S., Cortez P., Rita P. (2014). 
> *A data-driven approach to predict the success of bank telemarketing campaigns.* 
> *Decision Support Systems.* [[PDF]](https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e)

If you use this dataset, please cite:

Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset].
UCI Machine Learning Repository. https://doi.org/10.24432/C5K306


---

## 7 · License 

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)**. 
You are free to share & adapt, **provided you credit the original creators**.


