25 datasets found
  1. Superstore Sales Analysis

    • kaggle.com
    zip
    Updated Oct 21, 2023
    Cite
    Ali Reda Elblgihy (2023). Superstore Sales Analysis [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/superstore-sales-analysis/versions/1
    Explore at:
    zip (3009057 bytes). Available download formats
    Dataset updated
    Oct 21, 2023
    Authors
    Ali Reda Elblgihy
    Description

    Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:

    1- Data Import and Transformation:

    • Gather and import relevant sales data from various sources into Excel.
    • Utilize Power Query to clean, transform, and structure the data for analysis.
    • Merge and link different data sheets to create a cohesive dataset, ensuring that all data fields are connected logically.

    2- Data Quality Assessment:

    • Perform data quality checks to identify and address issues like missing values, duplicates, outliers, and data inconsistencies.
    • Standardize data formats and ensure that all data is in a consistent, usable state.

    3- Calculating COGS:

    • Determine the Cost of Goods Sold (COGS) for each product sold by considering factors like purchase price, shipping costs, and any additional expenses.
    • Apply appropriate formulas and calculations to determine COGS accurately.

    4- Discount Analysis:

    • Analyze the discount values offered on products to understand their impact on sales and profitability.
    • Calculate the average discount percentage, identify trends, and visualize the data using charts or graphs.

    5- Sales Metrics:

    • Calculate and analyze various sales metrics, such as total revenue, profit margins, and sales growth.
    • Utilize Excel functions to compute these metrics and create visuals for better insights.

    6- Visualization:

    • Create visualizations, such as charts, graphs, and pivot tables, to present the data in an understandable and actionable format.
    • Visual representations can help identify trends, outliers, and patterns in the data.

    7- Report Generation:

    • Compile the findings and insights into a well-structured report or dashboard, making it easy for stakeholders to understand and make informed decisions.

    Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
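
    For readers reproducing this workflow outside Excel, here is a minimal pandas sketch of the COGS, discount, and headline-metric calculations described above. The file name and column names (quantity, unit_cost, shipping_cost, sales, discount) are assumptions for illustration, not part of the dataset documentation.

    import pandas as pd

    # hypothetical export of the order sheet
    orders = pd.read_csv("superstore_orders.csv")

    # COGS per line item: purchase cost plus shipping and other allocated expenses
    orders["cogs"] = orders["quantity"] * orders["unit_cost"] + orders["shipping_cost"]

    # discount value and profit per line item
    orders["discount_value"] = orders["sales"] * orders["discount"]
    orders["profit"] = orders["sales"] - orders["discount_value"] - orders["cogs"]

    # headline metrics: total revenue, average discount, overall profit margin
    summary = {
        "total_revenue": orders["sales"].sum(),
        "avg_discount_pct": orders["discount"].mean() * 100,
        "profit_margin_pct": orders["profit"].sum() / orders["sales"].sum() * 100,
    }
    print(summary)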

  2. Pakistan House Price dataset

    • kaggle.com
    zip
    Updated May 6, 2023
    Cite
    Jillani SofTech (2023). Pakistan House Price dataset [Dataset]. https://www.kaggle.com/datasets/jillanisofttech/pakistan-house-price-dataset/versions/1
    Explore at:
    zip (8379623 bytes). Available download formats
    Dataset updated
    May 6, 2023
    Authors
    Jillani SofTech
    Area covered
    Pakistan
    Description

    Dataset Description: The dataset contains information about properties. Each property has a unique property ID and is associated with a location ID based on the subcategory of the city. The dataset includes the following attributes:

    • Property ID: Unique identifier for each property.
    • Location ID: Unique identifier for each location within a city.
    • Page URL: The URL of the webpage where the property was published.
    • Property Type: Categorization of the property into six types: House, FarmHouse, Upper Portion, Lower Portion, Flat, or Room.
    • Price: The price of the property, which is the dependent feature in this dataset.
    • City: The city where the property is located. The dataset includes five cities: Lahore, Karachi, Faisalabad, Rawalpindi, and Islamabad.
    • Province: The state or province where the city is located.
    • Location: Different types of locations within each city.
    • Latitude and Longitude: Geographic coordinates of the cities.

    Steps Involved in the Analysis:

    Statistical Analysis:

    • Data Types: Determine the data types of the attributes.
    • Level of Measurement: Identify the level of measurement for each attribute.
    • Summary Statistics: Calculate mean, standard deviation, minimum, and maximum values for numerical attributes.

    Data Cleaning:

    • Filling Null Values: Handle missing values in the dataset.
    • Duplicate Values: Remove duplicate records, if any.
    • Correcting Data Types: Ensure the correct data types for each attribute.
    • Outliers Detection: Identify and handle outliers in the data.

    Exploratory Data Analysis (EDA):

    • Visualization: Use libraries such as Seaborn, Matplotlib, and Plotly to visualize the data and gain insights.

    Model Building:

    • Libraries: Utilize libraries like Sklearn and pickle.
    • List of Models: Build models using Linear Regression, Decision Tree, Random Forest, K-Nearest Neighbors (KNN), XGBoost, Gradient Boost, and AdaBoost.
    • Model Saving: Save the selected model into a pickle file for future use (a minimal sketch follows below).
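
    A minimal sketch of the model-building and model-saving step referenced above (not the author's notebook): train one of the listed regressors on the cleaned data and save it with pickle. The file name, feature list, and target column are assumptions.

    import pickle
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # hypothetical cleaned export of the dataset
    df = pd.read_csv("pakistan_house_prices_clean.csv")
    X = pd.get_dummies(df[["property_type", "city", "location", "latitude", "longitude"]], drop_first=True)
    y = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    print("held-out R2:", r2_score(y_test, model.predict(X_test)))

    # save the selected model into a pickle file for future use
    with open("house_price_model.pkl", "wb") as f:
        pickle.dump(model, f)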

  3. Data from: PCP-SAFT Parameters of Pure Substances Using Large Experimental...

    • acs.figshare.com
    zip
    Updated Sep 6, 2023
    Cite
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross (2023). PCP-SAFT Parameters of Pure Substances Using Large Experimental Databases [Dataset]. http://doi.org/10.1021/acs.iecr.3c02255.s001
    Explore at:
    zip. Available download formats
    Dataset updated
    Sep 6, 2023
    Dataset provided by
    ACS Publications
    Authors
    Timm Esper; Gernot Bauer; Philipp Rehner; Joachim Gross
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This work reports pure component parameters for the PCP-SAFT equation of state for 1842 substances using a total of approximately 551 172 experimental data points for vapor pressure and liquid density. We utilize data from commercial and public databases in combination with an automated workflow to assign chemical identifiers to all substances, remove duplicate data sets, and filter unsuited data. The use of raw experimental data, as opposed to pseudoexperimental data from empirical correlations, requires means to identify and remove outliers, especially for vapor pressure data. We apply robust regression using a Huber loss function. For identifying and removing outliers, the empirical Wagner equation for vapor pressure is adjusted to experimental data, because the Wagner equation is mathematically rather flexible and is thus not subject to a systematic model bias. For adjusting model parameters of the PCP-SAFT model, nonpolar, dipolar and associating substances are distinguished. The resulting substance-specific parameters of the PCP-SAFT equation of state yield a mean absolute relative deviation of 2.73% for vapor pressure and 0.52% for liquid densities (2.56% and 0.47% for nonpolar substances, 2.67% and 0.61% for dipolar substances, and 3.24% and 0.54% for associating substances) when evaluated against outlier-removed data. All parameters are provided as JSON and CSV files.
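
    As an illustration of the robust-regression step described above (a sketch under stated assumptions, not the authors' workflow), the snippet below fits one common 3-6 form of the Wagner vapor-pressure equation with a Huber loss via SciPy and flags points with unusually large residuals. The toy data, starting coefficients, and MAD-based outlier threshold are assumptions.

    import numpy as np
    from scipy.optimize import least_squares

    def wagner_ln_pr(params, Tr):
        # Wagner 3-6 form: ln(p/pc) = (a*tau + b*tau**1.5 + c*tau**3 + d*tau**6) / Tr
        a, b, c, d = params
        tau = 1.0 - Tr
        return (a * tau + b * tau**1.5 + c * tau**3 + d * tau**6) / Tr

    def residuals(params, Tr, ln_pr):
        return wagner_ln_pr(params, Tr) - ln_pr

    # toy vapor-pressure data for one substance, with one gross outlier
    rng = np.random.default_rng(0)
    Tr = np.linspace(0.5, 0.99, 40)
    ln_pr = wagner_ln_pr([-7.0, 1.5, -2.5, -3.0], Tr) + 0.01 * rng.normal(size=40)
    ln_pr[5] += 1.0

    fit = least_squares(residuals, x0=[-7.0, 1.0, -2.0, -3.0], args=(Tr, ln_pr),
                        loss="huber", f_scale=0.05)           # robust (Huber) loss
    resid = residuals(fit.x, Tr, ln_pr)
    mad = np.median(np.abs(resid - np.median(resid)))
    outliers = np.abs(resid) > 3 * mad / 0.6745                # simple MAD-based flag
    print("fitted Wagner coefficients:", fit.x)
    print("flagged outlier indices:", np.where(outliers)[0])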

  4. Data from: Fast robust SUR with economical and actuarial applications

    • search.datacite.org
    • wiley.figshare.com
    Updated Jul 14, 2016
    Cite
    Mia Hubert; Tim Verdonck (2016). Data from: Fast robust SUR with economical and actuarial applications [Dataset]. http://doi.org/10.6084/m9.figshare.3408073
    Explore at:
    Dataset updated
    Jul 14, 2016
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Wiley
    Authors
    Mia Hubert; Tim Verdonck
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The seemingly unrelated regression (SUR) model is a generalization of a linear regression model consisting of more than one equation, where the error terms of these equations are contemporaneously correlated. The standard Feasible Generalized Least Squares (FGLS) estimator is efficient as it takes into account the covariance structure of the errors, but it is also very sensitive to outliers. The robust SUR estimator of Bilodeau and Duchesne (Canadian Journal of Statistics, 28:277-288, 2000) can accommodate outliers, but it is hard to compute. First we propose a fast algorithm, FastSUR, for its computation and show its good performance in a simulation study. We then provide diagnostics for outlier detection and illustrate them on a real data set from economics. Next we apply our FastSUR algorithm in the framework of stochastic loss reserving for general insurance. We focus on the General Multivariate Chain Ladder (GMCL) model that employs SUR to estimate its parameters. Consequently, this multivariate stochastic reserving method takes into account the contemporaneous correlations among run-off triangles and allows structural connections between these triangles. We plug our FastSUR algorithm into the GMCL model to obtain a robust version.

  5. Intermediate data for TE calculation

    • zenodo.org
    bin, csv
    Updated May 9, 2025
    Cite
    Yue Liu; Yue Liu (2025). Intermediate data for TE calculation [Dataset]. http://doi.org/10.5281/zenodo.10373032
    Explore at:
    csv, bin. Available download formats
    Dataset updated
    May 9, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yue Liu; Yue Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes intermediate data from RiboBase that generates translation efficiency (TE). The code to generate the files can be found at https://github.com/CenikLab/TE_model.

    We uploaded demo HeLa .ribo files, but due to the large storage requirements of the full dataset, I recommend contacting Dr. Can Cenik directly to request access to the complete version of RiboBase if you need the original data.

    The detailed explanation for each file:

    human_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in human.

    human_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in human.

    human_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in human.

    human_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in human.

    human_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in human.

    human_TE_rho.rda: TE proportional similarity data as genes by genes matrix in human.

    mouse_flatten_ribo_clr.rda: ribosome profiling clr normalized data with GEO GSM ids in columns and genes in rows in mouse.

    mouse_flatten_rna_clr.rda: matched RNA-seq clr normalized data with GEO GSM ids in columns and genes in rows in mouse.

    mouse_flatten_te_clr.rda: TE clr data with GEO GSM ids in columns and genes in rows in mouse.

    mouse_TE_cellline_all_plain.csv: TE clr data with genes in rows and cell lines in columns in mouse.

    mouse_RNA_rho_new.rda: matched RNA-seq proportional similarity data as genes by genes matrix in mouse.

    mouse_TE_rho.rda: TE proportional similarity data as genes by genes matrix in mouse.

    All the data passed quality control. There are 1054 human samples and 835 mouse samples:
    * coverage > 0.1 X
    * CDS percentage > 70%
    * R2 between RNA and RIBO >= 0.188 (remove outliers)

    All ribosome profiling data here are non-deduplicated, winsorized data paired with deduplicated, non-winsorized RNA-seq data (although the file names say "flatten", this refers only to the naming format).

    #### Code

    To read the .rda data in R, use load("rdaname.rda").

    To calculate proportional similarity from clr data:

    library(propr)
    # compute the rho proportionality matrix from the clr-transformed matrix
    human_TE_homo_rho <- propr:::lr2rho(as.matrix(clr_data))
    # label the rows and columns of the rho matrix with the row names of clr_data
    rownames(human_TE_homo_rho) <- colnames(human_TE_homo_rho) <- rownames(clr_data)

  6. Employees Report

    • kaggle.com
    zip
    Updated Feb 3, 2024
    Cite
    Ali Reda Elblgihy (2024). Employees Report [Dataset]. https://www.kaggle.com/datasets/aliredaelblgihy/employees-report/code
    Explore at:
    zip (306949 bytes). Available download formats
    Dataset updated
    Feb 3, 2024
    Authors
    Ali Reda Elblgihy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Analyzing an Employees Report with the provided data on employee numbers, monthly salary, performance appraisal rates, departmental distribution, geographical distribution, trends in employee numbers over the years, employee satisfaction, and gender distribution can yield valuable insights for informed decision-making and strategic planning. Here's a breakdown of the analysis:

    Employee Numbers:

    • The total number of employees in the company is a key metric for assessing the organization's size and growth potential.
    • Analyze historical data on employee numbers over the years to identify trends. Is the workforce expanding, contracting, or remaining stable?

    Monthly Salary:

    • Examine the distribution of monthly salaries to understand compensation structures within the organization.
    • Calculate key statistics such as median and quartiles to assess salary equity.
    • Identify any outliers in salary data, which may require further investigation.

    Performance Appraisal:

    • Calculate the average performance appraisal rate to gauge overall employee performance.
    • Break down performance ratings by department to identify areas of excellence and potential improvement.

    Departmental Distribution:

    • Determine the number of employees in each department to assess departmental size and potential resource allocation.
    • Analyze turnover rates by department to identify areas with high attrition.

    Geographical Distribution:

    • Examine the geographical origin of employees, including their area and country of residence.
    • Identify locations with a high concentration of employees, which may have implications for office space, remote work policies, or recruitment strategies.

    Trends in Employee Numbers Over the Years:

    • Visualize the trend in employee numbers over the years using line charts or graphs.
    • Look for patterns, such as seasonal fluctuations or long-term growth trends.

    Employee Satisfaction:

    • Analyze employee satisfaction survey results to assess the overall satisfaction level of employees.
    • Identify areas where employees are particularly satisfied or dissatisfied and consider action plans.

    Gender Distribution:

    • Calculate the percentage of male and female employees to understand gender diversity.
    • Assess whether there are any significant gender imbalances in specific departments or roles.

    To facilitate data cleaning and filtering for enhanced decision-making:

    Data Cleaning: Ensure data integrity by addressing missing values, outliers, and inconsistencies in the dataset. This will result in more accurate and reliable analyses.

    Filtering Options: Provide filters and interactive dashboards in the report to allow users to explore data based on various criteria such as department, performance rating, salary range, location, and gender. This empowers stakeholders to tailor the analysis to their specific needs.

    Decision-Making Insights: Summarize key findings and insights from the analysis to assist decision-makers in identifying areas for improvement, resource allocation, and strategic planning.
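
    A minimal pandas sketch (assumed file and column names) of a few of the checks above: department-level salary statistics, the gender split, and a simple IQR-based salary outlier flag.

    import pandas as pd

    # hypothetical export of the employee report data
    df = pd.read_csv("employees.csv")

    # headcount and average/median monthly salary per department
    dept_summary = df.groupby("department")["monthly_salary"].agg(["count", "mean", "median"])

    # gender distribution as percentages
    gender_pct = df["gender"].value_counts(normalize=True) * 100

    # flag salaries outside 1.5 * IQR for further investigation
    q1, q3 = df["monthly_salary"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["monthly_salary"] < q1 - 1.5 * iqr) | (df["monthly_salary"] > q3 + 1.5 * iqr)]

    print(dept_summary)
    print(gender_pct)
    print(outliers[["employee_id", "department", "monthly_salary"]])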

  7. Machine learning pipeline to train toxicity prediction model of...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Jan Ewald; Jan Ewald (2020). Machine learning pipeline to train toxicity prediction model of FunTox-Networks [Dataset]. http://doi.org/10.5281/zenodo.3529162
    Explore at:
    zip. Available download formats
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Ewald; Jan Ewald
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Machine Learning pipeline used to provide toxicity prediction in FunTox-Networks

    01_DATA # preprocessing and filtering of raw activity data from ChEMBL
    - Chembl_v25 # latest activity assay data set from ChEMBL (retrieved Nov 2019)
    - filt_stats.R # Filtering and preparation of raw data
    - Filtered # output data sets from filt_stats.R
    - toxicity_direction.csv # table of toxicity measurements and their proportionality to toxicity

    02_MolDesc # Calculation of molecular descriptors for all compounds within the filtered ChEMBL data set
    - datastore # files with all compounds and their calculated molecular descriptors based on SMILES
    - scripts
    - calc_molDesc.py # calculates for all compounds based on their smiles the molecular descriptors
    - chemopy-1.1 # Python package used for descriptor calculation, as described in: https://doi.org/10.1093/bioinformatics/btt105

    03_Averages # Calculation of moving averages for levels and organisms as required for calculation of Z-scores
    - datastore # output files with statistics calculated by make_Z.R
    - scripts
    - make_Z.R # script to calculate the statistics needed to compute the Z-scores used by the regression models

    04_ZScores # Calculation of Z-scores and preparation of table to fit regression models
    - datastore # Z-normalized activity data and molecular descriptors in the form as used for fitting regression models
    - scripts
    - calc_Ztable.py # based on activity data, molecular descriptors and Z-statistics, the learning data is calculated

    05_Regression # Performing regression: preparation of the data by removing outliers based on a linear regression model, learning of random forest regression models, and validation of the learning process by cross-validation and tuning of hyperparameters (a generic sketch of this step follows the listing below).

    - datastore # storage of all random forest regression models and average level of Z output value per level and organism (zexp_*.tsv)
    - scripts
    - data_preperation.R # set up of regression data set, removal of outliers and optional removal of fields and descriptors
    - Rforest_CV.R # analysis of machine learning by cross validation, importance of regression variables and tuning of hyperparameters (number of trees, split of variables)
    - Rforest.R # based on analysis of Rforest_CV.R learning of final models

    rregrs_output
    # early analysis of regression model performance with the package RRegrs as described in: https://doi.org/10.1186/s13321-015-0094-2
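
    Below is a generic Python sketch of the cross-validated random-forest regression referenced under 05_Regression; the pipeline itself uses the R scripts listed above (Rforest_CV.R, Rforest.R), and the file name, column names, and parameter grid here are assumptions.

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # hypothetical learning table from the 04_ZScores step
    data = pd.read_csv("z_scored_learning_data.csv")
    X = data.drop(columns=["z_activity"])     # molecular descriptors
    y = data["z_activity"]                    # Z-normalized toxicity endpoint

    # tune the number of trees and the number of variables tried per split
    param_grid = {
        "n_estimators": [100, 300, 500],
        "max_features": ["sqrt", 0.3, 0.5],
    }
    search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                          cv=5, scoring="r2", n_jobs=-1)
    search.fit(X, y)
    print("best parameters:", search.best_params_)
    print("cross-validated R2:", search.best_score_)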

  8. housing

    • kaggle.com
    zip
    Updated Sep 22, 2023
    Cite
    HappyRautela (2023). housing [Dataset]. https://www.kaggle.com/datasets/happyrautela/housing
    Explore at:
    zip (809785 bytes). Available download formats
    Dataset updated
    Sep 22, 2023
    Authors
    HappyRautela
    Description

    The exercise after this contains questions that are based on the housing dataset.

    1. How many houses have a waterfront? a. 21000 b. 21450 c. 163 d. 173

    2. How many houses have 2 floors? a. 2692 b. 8241 c. 10680 d. 161

    3. How many houses built before 1960 have a waterfront? a. 80 b. 7309 c. 90 d. 92

    4. What is the price of the most expensive house having more than 4 bathrooms? a. 7700000 b. 187000 c. 290000 d. 399000

    5. For instance, if the ‘price’ column consists of outliers, how can you make the data clean and remove the redundancies? a. Calculate the IQR range and drop the values outside the range. b. Calculate the p-value and remove the values less than 0.05. c. Calculate the correlation coefficient of the price column and remove the values less than the correlation coefficient. d. Calculate the Z-score of the price column and remove the values less than the z-score.

    6. What are the various parameters that can be used to determine the dependent variables in the housing data to determine the price of the house? a. Correlation coefficients b. Z-score c. IQR Range d. Range of the Features

    7. If we get the r2 score as 0.38, what inferences can we make about the model and its efficiency? a. The model is 38% accurate, and shows poor efficiency. b. The model is showing 0.38% discrepancies in the outcomes. c. Low difference between observed and fitted values. d. High difference between observed and fitted values.

    8. If the metrics show that the p-value for the grade column is 0.092, what all inferences can we make about the grade column? a. Significant in presence of other variables. b. Highly significant in presence of other variables c. insignificance in presence of other variables d. None of the above

    9. If the Variance Inflation Factor value for a feature is considerably higher than the other features, what can we say about that column/feature? a. High multicollinearity b. Low multicollinearity c. Both A and B d. None of the above

  9. Data from: Drivers of contemporary and future changes in Arctic seasonal...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +3more
    zip
    Updated Dec 30, 2023
    Cite
    Yijing Liu; Peiyan Wang; Bo Elberling; Andreas Westergaard-Nielsen (2023). Drivers of contemporary and future changes in Arctic seasonal transition dates for a tundra site in coastal Greenland [Dataset]. http://doi.org/10.5061/dryad.jsxksn0hp
    Explore at:
    zip. Available download formats
    Dataset updated
    Dec 30, 2023
    Dataset provided by
    University of Copenhagen
    Institute of Geographic Sciences and Natural Resources Research
    Authors
    Yijing Liu; Peiyan Wang; Bo Elberling; Andreas Westergaard-Nielsen
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Arctic, Greenland
    Description

    Climate change has had a significant impact on the seasonal transition dates of Arctic tundra ecosystems, causing diverse variations between distinct land surface classes. However, the combined effect of multiple controls as well as their individual effects on these dates remains unclear at various scales and across diverse land surface classes. Here we quantified spatiotemporal variations of three seasonal transition dates (start of spring, maximum Normalized Difference Vegetation Index (NDVImax) day, end of fall) for five dominant land surface classes in the ice-free Greenland and analyzed their drivers for current and future climate scenarios, respectively.

    Methods: To quantify the seasonal transition dates, we used NDVI derived from Sentinel-2 MultiSpectral Instrument (Level-1C) images during 2016–2020 based on Google Earth Engine (https://developers.google.com/earth-engine/datasets/catalog/COPERNICUS_S2). We performed an atmospheric correction (Yin et al., 2019) on the images before calculating NDVI. The months from May to October were set as the study period each year. The quality control process includes 3 steps: (i) the cloud was masked according to the QA60 band; (ii) images were removed if the number of pixels with NDVI values outside the range of -1–1 exceeds 30% of the total pixels while extracting the median value of each date; (iii) NDVI outliers resulting from cloud mask errors (Coluzzi et al., 2018) and sporadic snow were deleted pixel by pixel. NDVI outliers mentioned here appear as a sudden drop to almost zero in the growing season and do not form a sequence in this study (Komisarenko et al., 2022). To identify outliers, we iterated through every two consecutive NDVI values in the time series and calculated the difference between the second and first values for each pixel every year. We defined anomalous NDVI differences as points outside the percentile threshold [10, 90]; if the NDVI difference is positive, the first NDVI value used to calculate the difference is the outlier, otherwise the second one is. Finally, 215 images were used to reflect seasonal transition dates in all 5 study periods of 2016–2020 after the quality control. Each image was resampled to 32 m spatial resolution to match the resolution of the ArcticDEM data and SnowModel outputs.

    To detect seasonal transition dates, we used a double sigmoid model to fit the NDVI changes on time series; points where the curvature changes most rapidly on the fitted curve appear at the beginning, middle, and end of each season (Klosterman et al., 2014). The applicability of this phenology method in the Arctic has been demonstrated (Ma et al., 2022; Westergaard-Nielsen et al., 2013; Westergaard-Nielsen et al., 2017). We focused on 3 seasonal transition dates, i.e., SOS, NDVImax day, and EOF. The NDVI values for some pixels are still below zero in spring and summer due to topographical shadow. We therefore set a quality control rule before calculating seasonal transition dates for each pixel: if the number of days with positive NDVI values from June to September is less than 60% of the total number of observed days, the pixel is not considered for subsequent calculations.

    As verification of the fitted dates, the seasonal transition dates in dry heaths and corresponding time-lapse photos acquired from the snow fence area are shown in Fig. 2. Snow cover extent is greatly reduced and vegetation is exposed with lower NDVI values on the SOS. All visible vegetation is green on the NDVImax day. On EOF, snow cover is patchy and NDVI decreases to a value close to zero.
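
    A small NumPy sketch of the consecutive-difference outlier rule described above, applied to a 1-D NDVI time series for one pixel-year; it illustrates the published description and is not the authors' code.

    import numpy as np

    def flag_ndvi_outliers(ndvi, pct=(10, 90)):
        # differences between consecutive NDVI values (second minus first)
        ndvi = np.asarray(ndvi, dtype=float)
        diffs = np.diff(ndvi)
        lo, hi = np.percentile(diffs, pct)
        outlier = np.zeros(ndvi.shape, dtype=bool)
        for i, d in enumerate(diffs):
            if d < lo or d > hi:
                # positive anomalous jump: the first value of the pair is the outlier;
                # negative anomalous drop: the second value is the outlier
                outlier[i if d > 0 else i + 1] = True
        return outlier

    # example: a sudden drop to near zero in the growing season is flagged
    series = np.array([0.55, 0.58, 0.60, 0.05, 0.62, 0.63, 0.61])
    print(flag_ndvi_outliers(series))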

  10. Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of...

    • zenodo.org
    zip
    Updated May 7, 2025
    Cite
    o; o (2025). Supplementary Data for Mahalanobis-Based Ratio Analysis and Clustering of U.S. Tech Firms [Dataset]. http://doi.org/10.5281/zenodo.15337959
    Explore at:
    zip. Available download formats
    Dataset updated
    May 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    o; o
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    May 4, 2025
    Description

    Note: All supplementary files are provided as a single compressed archive named dataset.zip. Users should extract this file to access the individual Excel and Python files listed below.

    This supplementary dataset supports the manuscript titled “Mahalanobis-Based Multivariate Financial Statement Analysis: Outlier Detection and Typological Clustering in U.S. Tech Firms.” It contains both data files and Python scripts used in the financial ratio analysis, Mahalanobis distance computation, and hierarchical clustering stages of the study. The files are organized as follows:

    • ESM_1.xlsx – Raw financial ratios of 18 U.S. technology firms (2020–2024)

    • ESM_2.py – Python script to calculate Z-scores from raw financial ratios

    • ESM_3.xlsx – Dataset containing Z-scores for the selected financial ratios

    • ESM_4.py – Python script for generating the correlation heatmap of the Z-scores

    • ESM_5.xlsx – Mahalanobis distance values for each firm

    • ESM_6.py – Python script to compute Mahalanobis distances

    • ESM_7.py – Python script to visualize Mahalanobis distances

    • ESM_8.xlsx – Mean Z-scores per firm (used for cluster analysis)

    • ESM_9.py – Python script to compute mean Z-scores

    • ESM_10.xlsx – Re-standardized Z-scores based on firm-level means

    • ESM_11.py – Python script to re-standardize mean Z-scores

    • ESM_12.py – Python script to generate the hierarchical clustering dendrogram

    All files are provided to ensure transparency and reproducibility of the computational procedures in the manuscript. Each script is commented and formatted for clarity. The dataset is intended for educational and academic reuse under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
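
    For orientation, a minimal NumPy/pandas sketch of the Mahalanobis distance computation described above (not the ESM_6.py script); the column layout of ESM_1.xlsx (firms in rows, financial ratios in columns) is an assumption.

    import numpy as np
    import pandas as pd

    ratios = pd.read_excel("ESM_1.xlsx", index_col=0)        # firms x financial ratios
    z = (ratios - ratios.mean()) / ratios.std(ddof=0)        # Z-score each ratio

    cov_inv = np.linalg.inv(np.cov(z.values, rowvar=False))  # inverse covariance of the Z-scores
    centered = z.values - z.values.mean(axis=0)              # deviations from the centroid
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    distances = pd.Series(np.sqrt(d2), index=ratios.index, name="mahalanobis_distance")
    print(distances.sort_values(ascending=False))            # largest distances = candidate outliers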

  11. Salaries case study

    • kaggle.com
    zip
    Updated Oct 2, 2024
    Cite
    Shobhit Chauhan (2024). Salaries case study [Dataset]. https://www.kaggle.com/datasets/satyam0123/salaries-case-study
    Explore at:
    zip (13105509 bytes). Available download formats
    Dataset updated
    Oct 2, 2024
    Authors
    Shobhit Chauhan
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:

    Case Study: Employee Salary Analysis

    In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.

    Step 1: Data Collection and Preparation

    • Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
    • Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.

    Step 2: Data Exploration and Descriptive Statistics

    • Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
    • Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.

    Step 3: Analysis Using NumPy

    • Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
    • Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination.

    Step 4: Grouping and Aggregation

    • Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.

    Step 5: Salary Forecasting (Optional)

    • Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.

    Step 6: Insights and Recommendations

    • Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
    • Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation.
    • Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.

    Tools Used:

    • Pandas: For data manipulation, grouping, and descriptive analysis.
    • NumPy: For numerical operations such as percentiles and correlations.
    • Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
    • Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.

    This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
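
    A compact, runnable consolidation of the Pandas/NumPy calls referenced in the steps above; the file name and column names are assumptions.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("salaries.csv").dropna(subset=["salary"]).drop_duplicates()   # Step 1

    print(df["salary"].describe())                                     # Step 2: descriptive statistics
    print(np.percentile(df["salary"], [25, 50, 75]))                   # Step 3: salary quartiles
    print(np.corrcoef(df["years_of_experience"], df["salary"])[0, 1])  # Step 3: experience vs. salary
    print(df.groupby("department")["salary"].mean())                   # Step 4: average salary by department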

  12. Dataset for the paper "Observation of Acceleration and Deceleration Periods...

    • zenodo.org
    Updated Mar 26, 2025
    Cite
    Yide Qian; Yide Qian (2025). Dataset for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023 " [Dataset]. http://doi.org/10.5281/zenodo.15022854
    Explore at:
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yide Qian; Yide Qian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Pine Island Glacier
    Description

    Dataset and codes for "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023 "

    • Description of the data and file structure

    The MATLAB codes and related datasets are used for generating the figures for the paper "Observation of Acceleration and Deceleration Periods at Pine Island Ice Shelf from 1997–2023".

    Files and variables

    File 1: Data_and_Code.zip

    Directory: Main_function

    Description: Includes MATLAB scripts and functions. Each script includes descriptions that guide the user on how to use it and how to find the dataset used for processing.

    MATLAB Main Scripts: Include all the steps to process the data, output figures, and output videos.

    Script_1_Ice_velocity_process_flow.m

    Script_2_strain_rate_process_flow.m

    Script_3_DROT_grounding_line_extraction.m

    Script_4_Read_ICESat2_h5_files.m

    Script_5_Extraction_results.m

    MATLAB functions: files containing the MATLAB functions that support the main scripts:

    1_Ice_velocity_code: Includes MATLAB functions related to ice velocity post-processing, including outlier removal, filtering, correction for atmospheric and tidal effects, inverse-weighted averaging, and error estimation.

    2_strain_rate: Includes MATLAB functions related to strain rate calculation.

    3_DROT_extract_grounding_line_code: Includes MATLAB functions related to converting range offset results output from GAMMA to differential vertical displacement and using the result to extract the grounding line.

    4_Extract_data_from_2D_result: Includes MATLAB functions used to extract profiles from 2D data.

    5_NeRD_Damage_detection: Modified code from Izeboud et al. 2023. When applying this code, please also cite Izeboud et al. 2023 (https://www.sciencedirect.com/science/article/pii/S0034425722004655).

    6_Figure_plotting_code: Includes MATLAB functions related to the figures in the paper and supporting information.

    Directory: data_and_result

    Description: Includes directories that store the results output from MATLAB. Users only need to modify the paths in the MATLAB scripts to their own paths.

    1_origin: Sample data ("PS-20180323-20180329", "PS-20180329-20180404", "PS-20180404-20180410") output from the GAMMA software in GeoTIFF format that can be used to calculate DROT and velocity. Includes displacement, theta, phi, and ccp.

    2_maskccpN: Remove outliers by ccp < 0.05 and change displacement to velocity (m/day).

    3_rockpoint: Extract velocities at non-moving region

    4_constant_detrend: removed orbit error

    5_Tidal_correction: remove atmospheric and tidal induced error

    6_rockpoint: Extract non-aggregated velocities at non-moving region

    6_vx_vy_v: transform velocities from va/vr to vx/vy

    7_rockpoint: Extract aggregated velocities at non-moving region

    7_vx_vy_v_aggregate_and_error_estimate: inverse weighted average of three ice velocity maps and calculate the error maps

    8_strain_rate: calculated strain rate from aggregate ice velocity

    9_compare: store the results before and after tidal correction and aggregation.

    10_Block_result: time series results extracted from 2D data.

    11_MALAB_output_png_result: Stores .png files and time series results

    12_DROT: Differential Range Offset Tracking results

    13_ICESat_2: ICESat_2 .h5 files and .mat files can put here (in this file only include the samples from tracks 0965 and 1094)

    14_MODIS_images: you can store MODIS images here

    shp: grounding line, rock region, ice front, and other shape files.

    File 2 : PIG_front_1947_2023.zip

    Includes ice front position shapefiles from 1947 to 2023, which are used for plotting Figure 1 in the paper.

    File 3 : PIG_DROT_GL_2016_2021.zip

    Includes grounding line position shapefiles from 2016 to 2021, which are used for plotting Figure 1 in the paper.

    Data was derived from the following sources:
    The links can be found in the MATLAB scripts or in the paper's "Open Research" section.

  13. Malaria Indicator Survey 2017 - Tanzania

    • microdata.worldbank.org
    • datacatalog.ihsn.org
    • +1more
    Updated Jul 10, 2019
    Cite
    Office of the Chief Government Statistician (OCGS) (2019). Malaria Indicator Survey 2017 - Tanzania [Dataset]. https://microdata.worldbank.org/index.php/catalog/3376
    Explore at:
    Dataset updated
    Jul 10, 2019
    Dataset provided by
    Office of the Chief Government Statistician (OCGS)
    National Bureau of Statistics (NBS)
    Time period covered
    2017
    Area covered
    Tanzania
    Description

    Abstract

    The 2017 Tanzania Malaria Indicator Survey (2017 TMIS) was the second stand-alone malaria indicator survey conducted in the country, following the one implemented in 2011-2012 (2011-12 THMIS). The survey involved a nationally representative sample of 9,724 households from 442 sample clusters.

    The primary objective of the 2017 TMIS is to provide up-to-date estimates of basic demographic and health indicators related to malaria. Specifically, the survey collected information on vector control interventions such as mosquito nets, intermittent preventive treatment of malaria in pregnant women, and care seeking and treatment of fever in children. Young children were also tested for anaemia and for malaria infection.

    Overall, the key aims of the 2017 TMIS are to:
    • Measure the level of ownership and use of mosquito nets
    • Assess coverage of intermittent preventive treatment for pregnant women
    • Identify health care seeking behaviours and treatment practices, including the use of specific antimalarial medications to treat malaria among children under age 5
    • Identify diagnostic trends prior to administration of antimalarial medications for treatment of fever and other malaria-like symptoms
    • Measure the prevalence of malaria and anaemia among children age 6-59 months
    • Assess malaria knowledge, attitudes, and practices among women age 15-49
    • Assess housing conditions
    • Assess the cost of malaria-related services

    The information collected through the 2017 TMIS is intended to assist policymakers and program managers in evaluating and designing programs and strategies for improving the health of the country’s population.

    Geographic coverage

    National coverage

    Analysis unit

    • Household
    • Woman age 15 to 49
    • Child age 0 to 5

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sampling frame used for the 2017 TMIS was the 2012 Tanzania Population and Housing Census (PHC). The sampling frame was a complete list of enumeration areas (EAs) covering the whole country provided by the National Bureau of Statistics (NBS) of Tanzania, the implementing agency for the 2017 TMIS. This frame was created for the 2012 PHC, and the EAs served as counting units for the census. In rural areas, an EA is a natural village, a segment of a large village, or a group of small villages; in urban areas, an EA is a street or a city block. Each EA includes identification information, administrative information, and, as a measure of size, the number of residential households residing in the EA. Each EA is also classified into one of two types of residence, urban or rural. For each EA, there are cartographical materials that delineate its geographical locations, boundaries, main access, and landmarks inside or outside the EA, helping to identify the different areas.

    Note: See Appendix A of the final report for additional details on the sampling procedure.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Research instrument

    Three questionnaires—the Household Questionnaire, the Woman’s Questionnaire, and the Biomarker Questionnaire—were used for the 2017 TMIS. Core questionnaires available from the Roll Back Malaria Monitoring & Evaluation Reference Group (RBM-MERG) were adapted to reflect the population and health issues relevant to Tanzania.

    The questionnaires were initially prepared in English, later translated to Kiswahili, and then programmed onto tablet computers, enabling use of computer-assisted personal interviewing (CAPI) for the survey.

    Cleaning operations

    Data for the 2017 TMIS were collected through questionnaires programmed onto the CAPI application. The CAPI application was programmed by ICF in collaboration with NBS and OCGS and loaded with the Household and Woman’s Questionnaires. The Biomarker Questionnaire measurements were entered on a hard copy and later transferred to the CAPI application. Using a secure internet file streaming system (IFSS), the field supervisors transferred data to a server located at NBS headquarters in Dar es Salaam on a daily basis. To facilitate communication and monitoring, each field worker was assigned a unique identification number.

    At NBS headquarters, data received from the field teams’ CAPI applications were registered and checked for inconsistencies and outliers. Data editing and cleaning included an extensive range of structural and internal consistency checks. Any anomalies were communicated to the teams so that, together with the data processing teams, they could resolve data discrepancies. The corrected results were maintained in master Census and Survey Processing System (CSPro) data files at NBS and were used in producing tables for analysis and report writing. ICF provided technical assistance in processing the data using CSPro for data editing, cleaning, weighting, and tabulation.

    Response rate

    Of the 9,724 households selected for the sample, 9,390 were occupied at the time of fieldwork. Among the occupied households, 9,330 were successfully interviewed, yielding a total household response rate of 99%. In the interviewed households, 10,136 eligible women were identified for individual interviews and 10,018 were successfully interviewed, yielding a response rate of 99%.

    Sampling error estimates

    The estimates from a sample survey are affected by two types of errors: non-sampling errors and sampling errors. Non-sampling errors are the results of mistakes made in implementing data collection and data processing, such as failure to locate and interview the correct household, misunderstanding of the questions on the part of either the interviewer or the respondent, and data entry errors. Although numerous efforts were made during the implementation of the 2017 Tanzania Malaria Indicator Survey (2017 TMIS) to minimise this type of error, non-sampling errors are impossible to avoid and difficult to evaluate statistically.

    Sampling errors, on the other hand, can be evaluated statistically. The sample of respondents selected in the 2017 TMIS is only one of many samples that could have been selected from the same population, using the same design and expected size. Each of these samples would yield results that differ somewhat from the results of the actual sample selected. Sampling errors are a measure of the variability between all possible samples. Although the degree of variability is not known exactly, it can be estimated from the survey results.

    A sampling error is usually measured in terms of the standard error for a particular statistic (mean, percentage, etc.), which is the square root of the variance. The standard error can be used to calculate confidence intervals within which the true value for the population can reasonably be assumed to fall. For example, for any given statistic calculated from a sample survey, the value of that statistic will fall within a range of plus or minus two times the standard error of that statistic in 95% of all possible samples of identical size and design.

    If the sample of respondents had been selected as a simple random sample, it would have been possible to use straightforward formulas for calculating sampling errors. However, the 2017 TMIS sample is the result of a multi-stage stratified design, and, consequently, it was necessary to use more complex formulas. The computer software used to calculate sampling errors for the 2017 TMIS is an SAS program. This program uses the Taylor linearization method of variance estimation for survey estimates that are means, proportions, or ratios.

    Note: Detailed description of sampling error estimates is presented in APPENDIX B of the final report.

    Data appraisal

    Data quality tables are produced to review the quality of the data:
    - Household age distribution
    - Age distribution of eligible and interviewed women
    - Completeness of reporting
    - Births by calendar years

    Note: The tables are presented in APPENDIX C of the final report.

  14. Data from: High-Level Ab Initio Predictions of Thermochemical Properties of...

    • acs.figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Hannu T. Vuori; J. Mikko Rautiainen; Erkki T. Kolehmainen; Heikki M. Tuononen (2023). High-Level Ab Initio Predictions of Thermochemical Properties of Organosilicon Species: Critical Evaluation of Experimental Data and a Reliable Benchmark Database for Extending Group Additivity Approaches [Dataset]. http://doi.org/10.1021/acs.jpca.1c09980.s002
    Explore at:
    txt. Available download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    ACS Publications
    Authors
    Hannu T. Vuori; J. Mikko Rautiainen; Erkki T. Kolehmainen; Heikki M. Tuononen
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    A high-level composite quantum chemical method, W1X-1, is used herein to calculate the gas-phase standard enthalpy of formation, entropy, and heat capacity of 159 organosilicon compounds. The results set a new benchmark in the field that allows, for the first time, an in-depth assessment of existing experimental data on standard enthalpies of formation, enabling the identification of important trends and possible outliers. The calculated thermochemical data are used to determine Benson group additivity contributions for 60 Benson groups and group pairs involving silicon. These values allow fast and accurate estimation of thermochemical parameters of organosilicon compounds of varying complexity, and the data acquired are used to assess the reliability of experimental work of Voronkov et al. that has been repeatedly criticized by Becerra and Walsh. Recent results from other computational investigations in the field are also carefully discussed through the prism of reported advancements.

  15. Agriculture Survey 2023 - Cambodia

    • catalog.ihsn.org
    • microdata.fao.org
    • +3more
    Updated May 1, 2025
    Cite
    Ministry of Agriculture, Forestry and Fishery (MAFF) (2025). Agriculture Survey 2023 - Cambodia [Dataset]. https://catalog.ihsn.org/catalog/12858
    Explore at:
    Dataset updated
    May 1, 2025
    Dataset provided by
    Ministry of Agriculture, Forestry and Fishery (MAFF)
    National Institute of Statistics of Cambodia
    Time period covered
    2023
    Area covered
    Cambodia
    Description

    Abstract

    CAS 2023 was a comprehensive statistical undertaking for the collection and compilation of information on crop cultivation, livestock and poultry raising, aquaculture and capture fishing, agricultural economy, adaptation strategies of the holding to shocks, and the Food Insecurity Experience Scale. The National Institute of Statistics (NIS) of the Ministry of Planning (MOP), and the Ministry of Agriculture, Forestry and Fisheries (MAFF), were the responsible government ministries authorized to undertake the CAS 2023. While NIS had the census and survey mandate, the MAFF was the primary user of the data produced from the survey. Technical support was also provided by the Food and Agriculture Organization of the United Nations (FAO).

    The main objective of the CAS was to provide data on the agricultural situation in the Kingdom of Cambodia, to be utilized by planners and policy-makers. Specifically, the survey data are useful in:

    1. Providing an updated sampling frame in the conduct of agricultural surveys;
    2. Providing data at the country and regional level, with some items available at the province level;
    3. Providing data on the current structure of the country's agricultural holdings, including cropping, raising livestock and poultry, and aquaculture and capture fishing activities.

    The data collected and generated from this survey effort will help reflect progress towards the 2030 Sustainable Development goals for the agricultural sector, focusing on:

    - Goal 1: End poverty in all forms everywhere.
    - Goal 2: End hunger, achieve food security and improved nutrition and promote sustainable agriculture.
    - Goal 5: Achieve gender equality and empower all women and girls.

    Geographic coverage

    The CAS 2023 provides national coverage.

    The national territory is divided in four Regions or Zones (Coastal Region, Plains Region, Plateau and Mountain Region, and Tonle Sap Region) and 25 Provinces (Banteay Meanchey, Battambang, Kampong Cham, Kampong Chhnang, Kampong Speu, Kampong Thom, Kampot, Kandal, Kep, Koh Kong, Kratie, Mondul Kiri, Otdar Meanchey, Pailin, Phnom Penh, Preah Sihanouk, Preah Vihear, Prey Veng, Pursat, Ratanak Kiri, Siem Reap, Stung Treng, Svay Rieng, Takeo, and Tboung Khmum).

    Analysis unit

    Household agricultural holdings

    Universe

    Agricultural households, i.e. holdings in the household sector that are involved in agricultural activities, including the growing of crops, raising of livestock or poultry, and aquaculture or capture fishing activities. A minimum threshold was not considered to determine a household's engagement in the above mentioned activities.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sampling approach for the CAS 2023 relied fully upon the sampling procedure of CAS 2022 and CAS 2021 before it, utilizing a panel approach. The CAS 2021 had used statistical methods to select a representative sample of enumeration areas (EAs) throughout Cambodia from the 2019 General Population Census of Cambodia Sampling Frame. Households within these EAs were then screened for any agricultural activity. Using this basic information, the agricultural households were stratified and sampled for additional data collection.

    For the CAS 2023, the 2019 General Population Census Sampling Frame was utilized, similarly to previous survey rounds. This frame consisted of around 14,500 villages and 38,000 Enumeration Areas (EAs). For each village, the following information was available: province, district, commune, type (rural/urban), number of EAs and number of households. The target population comprised the households that were engaged in agriculture, fishery and/or aquaculture. Given their low number of rural villages, the following districts were excluded from the frame:
    - Province Preah Sihanouk, District Krong Preah Sihanouk
    - Province Siem Reap, District Krong Siem Reab
    - Province Phnom Penh, District Chamkar Mon
    - Province Phnom Penh, District Doun Penh
    - Province Phnom Penh, District Prampir Meakkakra
    - Province Phnom Penh, District Tuol Kouk
    - Province Phnom Penh, District Ruessei Kaev
    - Province Phnom Penh, District Chhbar Ampov

    Since the number of rural households per EA was not known from the 2019 census, to calculate the number of rural households in each province, the sum of the households in the villages that were classified as rural was computed. The listing operation in each sampled EA was conducted for the CAS 2021 to identify the target population, i.e., the households engaged in agricultural activities.

    For this survey, there was no minimum threshold set to determine a household's engagement in agricultural activities. This differs from the procedures used during the 2013 Agriculture Census (and that would be used in the 2023 Agriculture Census later), in which households were eligible for the survey if they grew crops on at least 0.03 hectares and/or had a minimum of 2 large livestock and/or 3 small livestock and/or 25 poultry. The procedure used in the CAS, which had no minimum land area or livestock or poultry inventory, allowed for smaller household agricultural holdings to have the potential to be selected for the survey. However, based on the sampling procedure indicated below, household agricultural holdings with larger land areas or more livestock or poultry were identified and associated with different sampling strata to ensure the selection of some of them.

    The CAS 2023 used a two-stage stratified sampling procedure, with EAs as primary units and households engaged in agriculture as secondary units. Overall, 1,381 EAs and 12 agricultural households per each EA were selected, for a total planned sample size of 16,572 households. The 1,381 EAs were allocated to the provinces (statistical domains) proportionally to the number of rural households. To select the EAs within each province, the villages were ordered by district, commune, and then by type of village (Rural-Urban). Systematic sampling was then performed, with probability proportional to size (number of households). After two years of attrition, the total effective sample size of the survey was 15,323 agricultural households.

    Mode of data collection

    Computer Assisted Personal Interview [capi]

    Cleaning operations

    Once the enumerators collected the survey data for an agricultural household, they submitted the completed questionnaire via Survey Solutions to their data supervisors who, in turn, carried out scrutiny checks. If there were errors or suspicious data detected, the data supervisor would return the record to the enumerator to address the issues with the respondent if needed, and the corrected record would be re-submitted to the data Supervisor. Once the records were validated by the data supervisors, they would approve them for final review by headquarters staff.

    At the survey headquarters, the completed questionnaires were received after approval by the data supervisors. If any issues or suspicious data were discovered during the headquarters review, records could be returned to the enumerator for verification or correction. Documentation on how to review questionnaire data for suspicious items or outliers was provided to both data supervisors and headquarters staff. The data review and the calculation of the survey estimates were undertaken in RStudio. Validation of the data began at the questionnaire-design stage, as Survey Solutions allows consistency checks to be built into the data collection tool. As completed records came in during data collection, additional consistency checks were run, evaluating the ranges of certain items and verifying any outlier records with the enumerator and/or respondent. Finally, once the data were cleaned, missing values resulting from item non-response were imputed.
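
    The range checks and item non-response imputation mentioned above could look roughly like the following sketch; the variable names, the threshold and the median-based imputation are assumptions for illustration, not the actual CAS cleaning rules.

      # Toy household data with an implausible value and a missing item.
      hh <- data.frame(province  = rep(c("A", "B"), each = 5),
                       rice_area = c(0.5, 1.2, NA, 80, 0.9, 0.7, NA, 1.1, 1.4, 0.6))

      # Range/outlier check: flag implausibly large parcels for verification.
      hh$flag_area <- !is.na(hh$rice_area) & hh$rice_area > 20

      # Item non-response: impute with the province median (one simple option).
      med <- ave(hh$rice_area, hh$province,
                 FUN = function(x) median(x, na.rm = TRUE))
      hh$rice_area_imp <- ifelse(is.na(hh$rice_area), med, hh$rice_area)
      hh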

    STATISTICAL DISCLOSURE CONTROL (SDC)

    Microdata are disseminated as Public Use Files under the terms and conditions indicated at the NIS Microdata Catalog (https://microdata.nis.gov.kh/), as indicated in the section about 'access conditions' below.

    In addition, anonymization methods have been applied to the microdata files before their dissemination, to protect the confidentiality of the statistical units (e.g. individuals) from which the data were collected. These methods include: i) removal of some variables contained in the survey (e.g. name, address, etc.), ii) grouping values of some variables into categories (e.g. age categories), iii) limiting geographical information to the province level, iv) removal of some records or specific data points, v) censoring the highest values in continuous variables (top-coding) by groups, replacing them with less extreme values from other respondents, or vi) rounding numerical values.
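
    Two of the listed methods, grouping a variable into categories and top-coding extreme values, can be illustrated with a short R sketch; the data, the break points and the 95th-percentile cap are invented and do not reflect the NIS anonymization parameters.

      micro <- data.frame(age     = c(17, 23, 41, 67, 88, 34),
                          land_ha = c(0.2, 1.5, 3.0, 12.0, 45.0, 0.8))

      # (ii) Group values into categories, e.g. age groups.
      micro$age_grp <- cut(micro$age, breaks = c(0, 14, 24, 44, 64, Inf),
                           labels = c("0-14", "15-24", "25-44", "45-64", "65+"))

      # (v) Top-code a continuous variable at an upper percentile.
      cap <- quantile(micro$land_ha, 0.95)
      micro$land_ha_tc <- pmin(micro$land_ha, cap)

      micro[, c("age_grp", "land_ha_tc")]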

    Users must therefore be aware that data protection with SDC methods perturbs the microdata. This implies some information loss and bias, which affect the resulting estimates and their precision. In general, the smaller the subpopulation, the larger the potential impact of the anonymization process.

  16. Demographic and Health Survey 2016 - Timor-Leste

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1more
    Updated Apr 16, 2018
    Cite
    General Directorate of Statistics (GDS) (2018). Demographic and Health Survey 2016 - Timor-Leste [Dataset]. https://microdata.worldbank.org/index.php/catalog/2992
    Explore at:
    Dataset updated
    Apr 16, 2018
    Dataset authored and provided by
    General Directorate of Statistics (GDS)
    Time period covered
    2016
    Area covered
    Timor-Leste
    Description

    Abstract

    The 2016 Timor-Leste Demographic and Health Survey (TLDHS) was implemented by the General Directorate of Statistics (GDS) of the Ministry of Finance in collaboration with the Ministry of Health (MOH). Data collection took place from 16 September to 22 December, 2016.

    The primary objective of the 2016 TLDHS project is to provide up-to-date estimates of basic demographic and health indicators. The TLDHS provides a comprehensive overview of population, maternal, and child health issues in Timor-Leste. More specifically, the 2016 TLDHS:
    • Collected data at the national level, which allows the calculation of key demographic indicators, particularly fertility and child, adult, and maternal mortality rates
    • Provided data to explore the direct and indirect factors that determine the levels and trends of fertility and child mortality
    • Measured the levels of contraceptive knowledge and practice
    • Obtained data on key aspects of maternal and child health, including immunization coverage, prevalence and treatment of diarrhea and other diseases among children under age 5, and maternity care, including antenatal visits and assistance at delivery
    • Obtained data on child feeding practices, including breastfeeding, and collected anthropometric measures to assess nutritional status in children, women, and men
    • Tested for anemia in children, women, and men
    • Collected data on the knowledge and attitudes of women and men about sexually transmitted diseases and HIV/AIDS, potential exposure to the risk of HIV infection (risk behaviors and condom use), and coverage of HIV testing and counseling
    • Measured key education indicators, including school attendance ratios, level of educational attainment, and literacy levels
    • Collected information on the extent of disability
    • Collected information on non-communicable diseases
    • Collected information on early childhood development
    • Collected information on domestic violence

    The information collected through the 2016 TLDHS is intended to assist policy makers and program managers in evaluating and designing programs and strategies for improving the health of the country's population.

    Geographic coverage

    National

    Analysis unit

    • Household
    • Individual
    • Children age 0-5
    • Woman age 15-49
    • Man age 15-59

    Universe

    The survey covered all de jure household members (usual residents), women age 15-49 years and men age 15-59 years resident in the household.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sampling frame used for the 2016 TLDHS is the 2015 Timor-Leste Population and Housing Census (TLPHC 2015), provided by the General Directorate of Statistics. The frame is a complete list of 2,320 non-empty Enumeration Areas (EAs) created for the 2015 population census. An EA is a geographic area made up of a convenient number of dwelling units that served as counting units for the census, with an average size of 89 households per EA. The sampling frame contains information on the administrative unit, the type of residence, the number of residential households, and the male and female population of each EA. Of the 2,320 EAs, 413 are urban and 1,907 are rural.

    There are five geographic regions in Timor-Leste, subdivided into 12 municipalities and the special administrative region (SAR) of Oecussi. The 2016 TLDHS sample was designed to produce reliable estimates of indicators for the country as a whole, for urban and rural areas, and for each of the 13 municipalities. A representative probability sample of approximately 12,000 households was drawn; the sample was stratified and selected in two stages. In the first stage, 455 EAs were selected with probability proportional to EA size from the 2015 TLPHC: 129 EAs in urban areas and 326 EAs in rural areas. In the second stage, 26 households were randomly selected within each of the 455 EAs; the sampling frame for this household selection was the 2015 TLPHC household listing available from the census database.
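
    Under a two-stage design of this kind, the design weight of a household is the inverse of the product of the two selection probabilities. The R sketch below shows the general idea with invented numbers; it is not taken from the TLDHS documentation, which should be consulted for the actual weighting procedure.

      n_ea_stratum <- 129     # EAs selected in the stratum (urban, from the text)
      M_h          <- 95      # census households in one selected EA (invented)
      M_stratum    <- 36000   # census households in the stratum (invented)
      m            <- 26      # households selected per EA (from the text)

      p1 <- n_ea_stratum * M_h / M_stratum   # first stage: PPS selection of the EA
      p2 <- m / M_h                          # second stage: households from the census listing
      design_weight <- 1 / (p1 * p2)
      design_weight                          # = M_stratum / (n_ea_stratum * m) here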

    For further details on sample design, see Appendix A of the final report.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Four questionnaires were used for the 2016 TLDHS: the Household Questionnaire, the Woman’s Questionnaire, the Man’s Questionnaire, and the Biomarker Questionnaire. These questionnaires, based on The DHS Program’s standard Demographic and Health Survey questionnaires, were adapted to reflect the population and health issues relevant to Timor-Leste.

    Cleaning operations

    The data processing operation included registering and checking for inconsistencies, incompleteness, and outliers. Data editing and cleaning included structure and consistency checks to ensure completeness of work in the field. The central office also conducted secondary editing, which required resolution of computer-identified inconsistencies and coding of open-ended questions. The data were processed by two staff who took part in the main fieldwork training. Data editing was accomplished with CSPro software. Secondary editing and data processing were initiated in October 2016 and completed in February 2017.

    Response rate

    A total of 11,829 households were selected for the sample, of which 11,660 were occupied. Of the occupied households, 11,502 were successfully interviewed, which yielded a response rate of 99 percent.

    In the interviewed households, 12,998 eligible women were identified for individual interviews. Interviews were completed with 12,607 women, yielding a response rate of 97 percent. In the subsample of households selected for the men’s interviews, 4,878 eligible men were identified and 4,622 were successfully interviewed, yielding a response rate of 95 percent. Response rates were higher in rural than in urban areas, with the difference being more pronounced among men (97 percent versus 90 percent, respectively) than among women (98 percent versus 94 percent, respectively). The lower response rates for men were likely due to their more frequent and longer absences from the household.
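
    The quoted rates follow directly from these counts, as a quick check in R shows:

      round(11502 / 11660 * 100)   # household response rate  -> 99
      round(12607 / 12998 * 100)   # women's response rate    -> 97
      round(4622  / 4878  * 100)   # men's response rate      -> 95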

    Sampling error estimates

    The estimates from a sample survey are affected by two types of errors: non-sampling errors and sampling errors. Non-sampling errors are the results of mistakes made in implementing data collection and data processing, such as failure to locate and interview the correct household, misunderstanding of the questions on the part of either the interviewer or the respondent, and data entry errors. Although numerous efforts were made during the implementation of the TLDHS 2016 to minimize this type of error, non-sampling errors are impossible to avoid and difficult to evaluate statistically.

    Sampling errors, on the other hand, can be evaluated statistically. The sample of respondents selected in the TLDHS 2016 is only one of many samples that could have been selected from the same population, using the same design and expected size. Each of these samples would yield results that differ somewhat from the results of the actual sample selected. Sampling errors are a measure of the variability between all possible samples. Although the degree of variability is not known exactly, it can be estimated from the survey results.

    A sampling error is usually measured in terms of the standard error for a particular statistic (mean, percentage, etc.), which is the square root of the variance. The standard error can be used to calculate confidence intervals within which the true value for the population can reasonably be assumed to fall. For example, for any given statistic calculated from a sample survey, the value of that statistic will fall within a range of plus or minus two times the standard error of that statistic in 95 percent of all possible samples of identical size and design.
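
    As a small numerical illustration of this rule (with made-up values, not TLDHS estimates), for a proportion of 0.45 with a standard error of 0.012:

      p  <- 0.45
      se <- 0.012
      c(lower = p - 2 * se, upper = p + 2 * se)   # approximate 95% confidence interval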

    If the sample of respondents had been selected as a simple random sample, it would have been possible to use straightforward formulas for calculating sampling errors. However, the TLDHS 2016 sample is the result of a multi-stage stratified design, and, consequently, it was necessary to use more complex formulae. The computer software used to calculate sampling errors for the TLDHS 2016 is a SAS program. This program used the Taylor linearization method of variance estimation for survey estimates that are means, proportions or ratios. The Jackknife repeated replication method is used for variance estimation of more complex statistics such as fertility and mortality rates.
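
    For readers working in R rather than SAS, the same Taylor linearization approach is available in the 'survey' package. The sketch below assumes a women's data file with cluster, stratum and weight variables named cluster, strata and wt, and an indicator variable modern_fp; these names are placeholders, not the actual TLDHS variable names.

      library(survey)

      des <- svydesign(ids = ~cluster, strata = ~strata, weights = ~wt,
                       data = women, nest = TRUE)
      est <- svymean(~modern_fp, des)   # estimate with a linearized standard error
      est
      confint(est)                      # approximate 95% confidence interval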

    A more detailed description of estimates of sampling errors are presented in Appendix B of the survey final report.

    Data appraisal

    Data Quality Tables:
    • Household age distribution
    • Age distribution of eligible and interviewed women
    • Age distribution of eligible and interviewed men
    • Completeness of reporting
    • Births by calendar years
    • Reporting of age at death in days
    • Reporting of age at death in months
    • Height and weight data completeness and quality for children
    • Completeness of information on siblings
    • Sibship size and sex ratio of siblings
    • Pregnancy-related mortality trends

    See details of the data quality tables in Appendix C of the survey final report.

  17. Marketing Insights for E-Commerce Company

    • kaggle.com
    zip
    Updated Oct 27, 2023
    Cite
    Rishi Kumar (2023). Marketing Insights for E-Commerce Company [Dataset]. https://www.kaggle.com/datasets/rishikumarrajvansh/marketing-insights-for-e-commerce-company
    Explore at:
    zip(628618 bytes)Available download formats
    Dataset updated
    Oct 27, 2023
    Authors
    Rishi Kumar
    Description

    Inputs related to the analysis, for additional reference:

    1. Why do we need customer segmentation? Every customer is unique and can be targeted in different ways, which is where segmentation comes in. Segmentation helps in understanding customer profiles and in defining cross-sell, upsell, activation, and acquisition strategies.

    2. What is RFM segmentation? RFM is an acronym for recency, frequency, and monetary segmentation. Recency refers to a customer's last order: the number of days since the last purchase (for a website or an app, this can be interpreted as the last visit day or last login time). Frequency is the number of purchases in a given period, such as 3 months, 6 months, or 1 year, and indicates how often customers use the product; the larger the value, the more engaged the customer. Alternatively, it can be defined as the average duration between two transactions. Monetary is the total amount of money a customer spent in the given period, so big spenders can be distinguished from other customers, for example as MVPs or VIPs. (A minimal computation is sketched after this list.)

    3. What is LTV and how is it defined? In the current retail landscape, almost every retailer promotes subscriptions, which helps in understanding customer lifetime. Retailers can manage customers better when they know which ones have a high lifetime value. Customer lifetime value (LTV) can be defined as the monetary value of a customer relationship, based on the present value of the projected future cash flows from that relationship. LTV is an important concept because it encourages firms to shift their focus from quarterly profits to the long-term health of their customer relationships, and it is an important metric because it represents an upper limit on spending to acquire new customers. For this reason it is a key element in calculating the payback of advertising spend in marketing-mix modelling.

    4. Why do we need to predict customer lifetime value? LTV is an important building block in campaign design and marketing-mix management. Although targeting models can help identify the right customers to target, LTV analysis quantifies the expected outcome of targeting in terms of revenues and profits. Other major metrics and decision thresholds can also be derived from it: for example, LTV is naturally an upper limit on the spending to acquire a customer, and the sum of the LTVs of all of a brand's customers, known as customer equity, is a major metric for business valuations. As with many other problems in marketing analytics and algorithmic marketing, LTV modelling can be approached from descriptive, predictive, and prescriptive perspectives.

    5. How does Next Purchase Day help retailers? The objective is to estimate when a customer will purchase again, so that strategies and marketing campaigns can be designed for each group accordingly:
    a. Group 1: customers who will purchase in more than 60 days
    b. Group 2: customers who will purchase in 30-60 days
    c. Group 3: customers who will purchase in 0-30 days

    6. What is cohort analysis and how is it helpful? A cohort is a group of users who share a common characteristic, identified in this report by an Analytics dimension. For example, all users with the same acquisition date belong to the same cohort. The cohort analysis report lets you isolate and analyze cohort behaviour. In e-commerce, cohort analysis means monitoring customers' behaviour based on common traits they share (the first product they bought, when they became customers, etc.) in order to find patterns and tailor marketing activities for the group.
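
    Below is a minimal R sketch of the RFM computation described in point 2, run on a toy transaction table; the column names follow the Online_Sales.csv layout where possible, while the amount column and reference date are illustrative.

      tx <- data.frame(
        CustomerID       = c(1, 1, 2, 3, 3, 3),
        Transaction_Date = as.Date(c("2019-01-10", "2019-11-02", "2019-06-15",
                                     "2019-03-01", "2019-07-20", "2019-12-28")),
        amount           = c(120, 80, 45, 200, 60, 310)
      )
      as_of <- as.Date("2019-12-31")

      rfm <- aggregate(cbind(monetary = amount) ~ CustomerID, data = tx, FUN = sum)
      rfm$frequency <- as.vector(table(tx$CustomerID))                 # purchases in the period
      rfm$recency   <- as.numeric(as_of) -
        as.numeric(tapply(as.numeric(tx$Transaction_Date), tx$CustomerID, max))  # days since last order
      rfm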

    Transaction data has been provided for the period 1st Jan 2019 to 31st Dec 2019. The below data sets have been provided.

    Online_Sales.csv: actual orders data (point-of-sale data) at transaction level, with the variables below.
    • CustomerID: Customer unique ID
    • Transaction_ID: Transaction unique ID
    • Transaction_Date: Date of transaction
    • Product_SKU: SKU ID – unique ID for product
    • Product_Description: Product description
    • Product_Cateogry: Product category
    • Quantity: Number of items ordered
    • Avg_Price: Price per one quantity
    • Delivery_Charges: Charges for delivery
    • Coupon_Status: Any discount coupon applied

    Customers_Data.csv: customer demographics.
    • CustomerID: Customer unique ID
    • Gender: Gender of customer
    • Location: Location of customer
    • Tenure_Months: Tenure in months

    Discount_Coupon.csv: discount coupons have been given for different categories in different months.
    • Month: Discount coupon applied in that month
    • Product_Category: Product categor...

  18. Differences in the Dietary Inflammatory Index (DII) calculated according to...

    • figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Xenia Pawlow; Raffael Ott; Christiane Winkler; Anette-G. Ziegler; Sandra Hummel (2023). Differences in the Dietary Inflammatory Index (DII) calculated according to Shivappa et al. [17] or the Scaling-Formula With Outlier Detection (SFOD) method based on similar food consumption data between subject pairs. [Dataset]. http://doi.org/10.1371/journal.pone.0259629.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Xenia Pawlow; Raffael Ott; Christiane Winkler; Anette-G. Ziegler; Sandra Hummel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Differences in the Dietary Inflammatory Index (DII) calculated according to Shivappa et al. [17] or the Scaling-Formula With Outlier Detection (SFOD) method based on similar food consumption data between subject pairs.

  19. Results of exploratory factor analysis.

    • figshare.com
    xls
    Updated Nov 6, 2025
    + more versions
    Cite
    Haoxuan Feng; Xuan Xiao; Yue Cheng; Rongbing Mu; Li Xiong (2025). Results of exploratory factor analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0334642.t006
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Haoxuan Feng; Xuan Xiao; Yue Cheng; Rongbing Mu; Li Xiong
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As urban rail transit expands, systematic evidence remains limited on how the built environment influences cultural perception among passengers. This study identifies the main determinants of cultural perception, tests whether perception of nearby public cultural facilities mediates these effects, and examines heterogeneity by station type. Using metro stations in central Shanghai as a case, we compute the Shannon diversity index of nearby public cultural facilities within the 500 m station area and apply Anselin Local Moran’s I to classify 90 stations into four types: High-High cluster, High-Low outlier, Low-High outlier, and Low-Low cluster. Questionnaire data from 12 representative stations (n = 414) are analyzed with structural equation modeling, and differences across station types are assessed with a one-way analysis of variance. Results indicate that interior spatial design satisfaction has the strongest positive association with cultural perception, followed by entrance and exit design satisfaction. Perception of nearby public cultural facilities is positively associated with cultural perception and partially mediates the association between interior spatial design satisfaction and cultural perception. Station types differ significantly in interior spatial design satisfaction, entrance and exit design satisfaction, perception of nearby public cultural facilities, and cultural perception, with High-High cluster highest, Low-Low cluster lowest, and High-Low outlier and Low-High outlier in between. This study incorporates the subjective perception of nearby public cultural facilities into the framework for cultural perception in metro stations, clarifies direct and mediated pathways, and provides type specific implications for factor prioritization and station stratification in upgrades and retrofits across different network contexts.
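
    The Shannon diversity index used to characterize the facility mix around each station is straightforward to compute; the R sketch below uses invented facility counts for a single 500 m station buffer.

      # Counts of public cultural facility types within one station buffer (invented).
      facility_counts <- c(library = 3, museum = 1, theatre = 2, gallery = 4)

      shannon <- function(n) {
        p <- n[n > 0] / sum(n)   # proportions of each facility type
        -sum(p * log(p))         # Shannon diversity index H
      }
      shannon(facility_counts)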

  20. Gender, Age, and Emotion Detection from Voice

    • kaggle.com
    zip
    Updated May 29, 2021
    Cite
    Rohit Zaman (2021). Gender, Age, and Emotion Detection from Voice [Dataset]. https://www.kaggle.com/rohitzaman/gender-age-and-emotion-detection-from-voice
    Explore at:
    zip(967820 bytes)Available download formats
    Dataset updated
    May 29, 2021
    Authors
    Rohit Zaman
    Description

    Context

    Our target was to predict gender, age, and emotion from audio. We found labeled audio datasets from Mozilla and RAVDESS. Using the R programming language, 20 statistical features were extracted, and the labels were then added to form these datasets. Audio files were collected from "Mozilla Common Voice" and the "Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)".

    Content

    The datasets contain 20 feature columns and 1 column denoting the label. The 20 statistical features were extracted through frequency spectrum analysis using the R programming language. They are:
    1) meanfreq - The mean frequency (in kHz), a pitch measure that assesses the center of the distribution of power across frequencies.
    2) sd - The standard deviation of frequency, a measure of the spectrum's dispersion relative to its mean, calculated as the square root of the variance.
    3) median - The median frequency (in kHz), the middle value of the sorted frequency distribution.
    4) Q25 - The first quartile (in kHz), referred to as Q1, the median of the lower half of the distribution; about 25 percent of values lie below Q1 and about 75 percent above it.
    5) Q75 - The third quartile (in kHz), referred to as Q3, the point between the median and the highest values of the distribution.
    6) IQR - The interquartile range (in kHz), a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles (upper and lower quartiles).
    7) skew - The skewness, the degree of distortion from a normal distribution; it measures the lack of symmetry in the data distribution.
    8) kurt - The kurtosis, a measure of how much the tails of the distribution differ from the tails of a normal distribution; it reflects the presence of outliers in the data distribution.
    9) sp.ent - The spectral entropy, a measure of signal irregularity computed from the normalized spectral power.
    10) sfm - The spectral flatness or tonality coefficient, also known as Wiener entropy, a digital-signal-processing measure of how tone-like (as opposed to noise-like) an audio spectrum is; it is usually expressed in decibels.
    11) mode - The mode frequency, the most frequently observed value in the data set.
    12) centroid - The spectral centroid, which indicates where the center of mass of the spectrum is located.
    13) meanfun - The average fundamental frequency measured across the acoustic signal.
    14) minfun - The minimum fundamental frequency measured across the acoustic signal.
    15) maxfun - The maximum fundamental frequency measured across the acoustic signal.
    16) meandom - The average dominant frequency measured across the acoustic signal.
    17) mindom - The minimum dominant frequency measured across the acoustic signal.
    18) maxdom - The maximum dominant frequency measured across the acoustic signal.
    19) dfrange - The range of the dominant frequency measured across the acoustic signal.
    20) modindx - The modulation index, the degree of frequency modulation expressed as the ratio of the frequency deviation to the frequency of the modulating signal for a pure tone modulation.
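
    As a conceptual illustration of the first two features, the base-R sketch below computes the mean frequency and its standard deviation from the power spectrum of a synthetic signal; the published dataset was presumably built with dedicated audio packages, so this is only meant to show the idea behind the definitions.

      sr   <- 16000                                   # sample rate in Hz (assumed)
      t    <- seq(0, 1, by = 1 / sr)
      wave <- sin(2 * pi * 220 * t) + 0.5 * sin(2 * pi * 440 * t)  # synthetic tone

      spec  <- abs(fft(wave))^2                       # power spectrum
      n     <- length(wave)
      freqs <- (0:(n - 1)) * sr / n                   # frequency of each bin (Hz)

      half  <- 2:floor(n / 2)                         # keep positive frequencies only
      p     <- spec[half] / sum(spec[half])           # normalized power distribution
      f_khz <- freqs[half] / 1000

      meanfreq <- sum(p * f_khz)                      # 1) mean frequency (kHz)
      sdfreq   <- sqrt(sum(p * (f_khz - meanfreq)^2)) # 2) sd of frequency (kHz)
      c(meanfreq = meanfreq, sd = sdfreq)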

    Acknowledgements

    Gender and age audio data source: https://commonvoice.mozilla.org/en
    Emotion audio data source: https://smartlaboratory.org/ravdess/
