In this dataset I simply show how to handle missing values in your data with the help of Python libraries such as NumPy and pandas. You can also see the use of NaN and None values, and the detecting, dropping, and filling of null values.
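A minimal sketch of those operations with pandas and NumPy (the small frame and its column names are made up for illustration):

import numpy as np
import pandas as pd

# Small illustrative frame containing np.nan and None entries
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Lahore", None, "Karachi", "Multan"],
})

# Detecting null values
print(df.isnull())          # element-wise boolean mask
print(df.isnull().sum())    # missing-value count per column

# Dropping rows that contain any null value
dropped = df.dropna()

# Filling null values: a constant for text, the column mean for numbers
filled = df.fillna({"city": "Unknown", "age": df["age"].mean()})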
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries (a minimal sketch follows this list).
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
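As a small illustration of the visualization point above (a sketch assuming seaborn and matplotlib are available; the toy frame is made up), a heatmap of the null mask is one common way to see missing-data patterns:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Toy frame with scattered missing values
df = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan, 5],
    "b": [np.nan, 2, 3, 4, 5],
    "c": [1, 2, np.nan, 4, np.nan],
})

# Heatmap of the boolean null mask: highlighted cells mark missing entries
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing-value pattern")
plt.show()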
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
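The script itself is the resource; the following is only a rough sketch of the kinds of steps it describes, with hypothetical file and column names (alum_dosage.csv, alum_dose, date) rather than the actual ones:

import pandas as pd

# Hypothetical inputs for illustration only
dosage = pd.read_csv("alum_dosage.csv", parse_dates=["date"])
weather = pd.read_csv("weather.csv", parse_dates=["date"])

# Handle missing values: interpolate short gaps in the dose record
dosage["alum_dose"] = dosage["alum_dose"].interpolate(limit=3)

# Remove outliers with a simple z-score rule (threshold chosen arbitrarily here)
z = (dosage["alum_dose"] - dosage["alum_dose"].mean()) / dosage["alum_dose"].std()
dosage = dosage[z.abs() < 3]

# Merge historical water quality/weather data and write the training table
merged = dosage.merge(weather, on="date", how="inner")
merged.to_csv("alum_training_data.csv", index=False)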
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.
It includes 500 rows of simulated employee data with intentional errors such as:
Missing values in Age and Salary
Typos in email addresses (@gamil.com)
Inconsistent city name casing (e.g., lahore, Karachi)
Extra spaces in department names (e.g., " HR ")
✅ Skills You Can Practice:
Detecting and handling missing data
String cleaning and formatting
Removing duplicates
Validating email formats
Standardizing categorical data
You can use this dataset to build your own data cleaning notebook, or use it in interviews, assessments, and tutorials.
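For example, a minimal cleaning sketch covering the skills above (assuming the file is saved as employee_data.csv and the columns are named Age, Salary, Email, City, and Department):

import pandas as pd

df = pd.read_csv("employee_data.csv")  # hypothetical file name

# Detecting and handling missing data
print(df[["Age", "Salary"]].isnull().sum())
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# String cleaning and formatting
df["Department"] = df["Department"].str.strip()      # " HR " -> "HR"
df["City"] = df["City"].str.strip().str.title()      # "lahore" -> "Lahore"

# Fixing the known typo and validating email formats
df["Email"] = df["Email"].str.replace("@gamil.com", "@gmail.com", regex=False)
df = df[df["Email"].str.match(r"^[\w.+-]+@[\w-]+\.[\w.]+$", na=False)]

# Removing duplicates and standardizing categorical labels
df = df.drop_duplicates()
df["Department"] = df["Department"].str.upper()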
Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). This dataset, covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, with the aim of making it a robust and reliable resource for understanding climatic dynamics in Northern Cameroon.
Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.
Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process was designed to ensure the accuracy and consistency of the dataset.
Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.
Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. As such, to ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.
Authors Contributions:
Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne. Documentation: Jérémy Lavarenne. Funding: This project was funded by the DESIRA INNOVACC project.
Changelog:
v1.0.2: corrected swapped column names in the coordinates dataset
v1.0.1: dataset specification file updated with complementary information regarding station locations
v1.0.0: initial submission
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model comparison through tenfold cross-validation on training data.
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid:
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The Python scripts run with Python 3.7 and with the packages listed in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways:
from helper_functions import *   # provides true_intervals()
import numpy as np
import pandas as pd

# Load the cleansed frequency time series (keep the placeholder path as-is)
cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                            index_col=0, header=None, squeeze=True,
                            parse_dates=[0])

# Locate contiguous intervals without NaNs and extract the longest one
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folders "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
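For instance, a short EDA sketch along the lines of use cases 1-3 (assuming the file is saved as customer_transactions.csv and the column names match the list above):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("customer_transactions.csv", parse_dates=["Purchase Date"])  # hypothetical file name

# Summary statistics and a missing-value check
print(df.describe(include="all"))
print(df.isnull().sum())

# Impact of discounts on the final amount paid
sns.boxplot(data=df, x="Discount Availed", y="Net Amount")
plt.title("Net amount with and without a discount")
plt.show()

# Total spend by product category
df.groupby("Product Category")["Net Amount"].sum().sort_values().plot(kind="barh")
plt.tight_layout()
plt.show()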
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected data from 36 published articles on PubMed [13–48] to train and validate our machine learning models. Some articles comprised more than one type of cartilage injury model or treatment condition. In total, 15 clinical trial conditions and 29 animal model conditions (1 goat, 6 pigs, 2 dogs, 9 rabbits, 9 rats, and 2 mice) on osteochondral injury or osteoarthritis were included, where MSCs were transplanted to repair the cartilage tissue. We documented each case with a specific treatment condition as an entry by considering the cell- and treatment target-related factors as input properties, including species, body weight, tissue source, cell number, cell concentration, defect area, defect depth, and type of cartilage damage. The therapeutic outcomes were considered as output properties, which were evaluated using integrated clinical and histological cartilage repair scores, including the International Cartilage Repair Society (ICRS) scoring system, the O’Driscoll score, the Pineda score, the Mankin score, the Osteoarthritis Research Society International (OARSI) scoring system, the International Knee Documentation Committee (IKDC) score, the visual analog score (VAS) for pain, the Knee injury and Osteoarthritis Outcome Score (KOOS), the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and the Lysholm score. In this study, these scores were linearly normalized to a number between 0 and 1, with 0 representing the worst damage or pain, and 1 representing completely healthy tissue. The entries were combined to form a database.
We have provided the details for the imputation algorithm in the subsection Handling missing data under Methods, and a flowchart in Fig 2. The data imputation algorithm for the vector x was added to the manuscript for illustration. The pseudo-code for the uncertainty calculation is shown in S1 Algorithm: an ensemble model to measure the ANN's prediction uncertainty. The original database gathered from the literature, and a ‘complete’ database with missing information filled in by our neural network, are also included, along with a sample neural network architecture file in Python.
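The imputation and uncertainty procedures themselves are defined in the manuscript's Methods and S1 Algorithm; purely as a generic illustration of the ensemble idea (not the authors' code, and using synthetic stand-in data), prediction uncertainty can be estimated from the spread of several independently trained networks:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 8))                           # stand-in for the 8 input properties
y = X @ rng.uniform(size=8) + rng.normal(0, 0.05, 200)   # stand-in normalized outcome

# Train an ensemble of small ANNs that differ only in their random initialization
ensemble = [
    MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=seed).fit(X, y)
    for seed in range(10)
]

# Ensemble mean as the prediction, ensemble spread as the uncertainty estimate
preds = np.stack([model.predict(X) for model in ensemble])
mean_prediction = preds.mean(axis=0)
uncertainty = preds.std(axis=0)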
Here we provide a Python notebook comprising a neural network that delivers the performance and results described in the manuscript. Documentation in the form of comments and installation guide is included in the Python notebook. This Python notebook along with the methods described in the manuscript provides sufficient details for other interested readers to either extend this script or write their own scripts and reproduce the results in the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Default and optimal tuned hyperparameters of Random Forest model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains groundwater level trends and time series data across a discretized grid of California's Central Valley, modeled with well data using a hierarchical Gaussian process and neural network regression methodology. The spatial grid consists of 400 cells spanning latitudes 34.91 to 40.895 degrees and 220 cells spanning longitudes -122.6 to -118.658 degrees. The temporal axis spans March 2015 to August 2020, discretized at biweekly intervals, with a total of 132 cells. The spatiotemporal grid details are present in the relevant files.
The first dataset is contained in the following Python pickle file.
1. 'CV_water_level_trends_Mar2015_Aug2020.pkl': This file contains a nested Python dictionary with the following pairs:
1.1. 'longitude': Numpy array of shape 400 x 220
1.2. 'longitude': Numpy array of shape 400 x 220
1.3. 'mean': Python dictionary with mean long-term and seasonal water level trends
1.4. 'P10': Python dictionary with P10 long-term and seasonal water level trends
1.5. 'P90': Python dictionary with P90 long-term and seasonal water level trends
Each of the dictionaries in 1.3, 1.4, and 1.5 contains the following keys and values:
'initial_water_level_ft': Mean/P10/P90 of March 2015 water levels in feet, stored as a Numpy array of shape 400 x 220
'water_level_decline_rate_ft/biweek': Mean/P10/P90 of March 2015 - Aug 2020 water level decline rates in ft/biweek, stored as a Numpy array of shape 400 x 220
'water_level_amplitude_ft': Mean/P10/P90 of seasonal water level oscillation amplitude, stored as a Numpy array of shape 400 x 220
'water_level_phase_deg': Mean/P10/P90 of time to peak seasonal signal in degrees, stored as a Numpy array of shape 400 x 220
The second dataset is contained in the following Python pickle file.
2. 'CV_water_level_time_series_Mar2015_Aug2020.pkl': This file contains a Python dictionary with the following pairs:
2.1. 'longitude': Numpy array of shape 400 x 220
2.2. 'longitude': Numpy array of shape 400 x 220
2.3. 'time_axis': Python list of length 132 containing strings for the biweekly periods from March 2015 - August 2020
2.4. 'water_level_well_ft': Processed water level observations in feet from 1744 wells, irregularly sampled across time, stored as a Numpy array of shape 400 x 220 x 132, with missing values as NaNs
2.5. 'water_level_modeled_mean_ft': Modeled mean water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
2.6. 'water_level_modeled_P10_ft': Modeled P10 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
2.7. 'water_level_modeled_P90_ft': Modeled P90 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
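A minimal sketch of reading the first pickle file and pulling out one of the trend maps (the key names follow the listing above):

import pickle
import numpy as np

with open("CV_water_level_trends_Mar2015_Aug2020.pkl", "rb") as f:
    trends = pickle.load(f)

# 400 x 220 grids: coordinates and the mean long-term decline rate
lon = trends["longitude"]
decline = trends["mean"]["water_level_decline_rate_ft/biweek"]

print(lon.shape, decline.shape)   # expected: (400, 220) (400, 220)
print(np.nanmean(decline))        # grid-average decline rate in ft/biweek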
CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:
The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.
This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.
Data Description
The CADDI dataset is designed to support research in in-class activity recognition using IMU data from low-cost sensors. It provides multimodal data capturing 19 different activities performed by 12 participants in a classroom environment, utilizing both IMU sensors from a Samsung Galaxy Watch 5 and synchronized stereo camera images. This dataset enables the development and validation of activity recognition models using sensor fusion techniques.

Data Generation Procedures
The data collection process involved recording both continuous and instantaneous activities that typically occur in a classroom setting. The activities were captured using a custom setup, which included:
- A Samsung Galaxy Watch 5 to collect accelerometer, gyroscope, and rotation vector data at 100 Hz.
- A ZED stereo camera capturing 1080p images at 25-30 fps.
- A synchronized computer acting as a data hub, receiving IMU data and storing images in real time.
- A D-Link DSR-1000AC router for wireless communication between the smartwatch and the computer.
Participants were instructed to arrange their workspace as they would in a real classroom, including a laptop, notebook, pens, and a backpack. Data collection was performed under realistic conditions, ensuring that activities were captured naturally.

Temporal and Spatial Scope
The dataset contains a total of 472.03 minutes of recorded data. The IMU sensors operate at 100 Hz, while the stereo camera captures images at 25-30 Hz. Data was collected from 12 participants, each performing all 19 activities multiple times. The geographical scope of data collection was Alicante, Spain, under controlled indoor conditions.

Dataset Components
The dataset is organized into JSON and PNG files, structured hierarchically.
IMU data: stored in JSON files, containing:
- Samsung Linear Acceleration Sensor (X, Y, Z values, 100 Hz)
- LSM6DSO Gyroscope (X, Y, Z values, 100 Hz)
- Samsung Rotation Vector (X, Y, Z, W quaternion values, 100 Hz)
- Samsung HR Sensor (heart rate, 1 Hz)
- OPT3007 Light Sensor (ambient light levels, 5 Hz)
Stereo camera images: high-resolution 1920×1080 PNG files from the left and right cameras.
Synchronization: each IMU data record and image is timestamped for precise alignment.

Data Structure
The dataset is divided into continuous and instantaneous activities. Continuous activities (e.g., typing, writing, drawing) were recorded for 210 seconds, with the central 200 seconds retained. Instantaneous activities (e.g., raising a hand, drinking) were repeated 20 times per participant, with data captured only during execution. The dataset is structured as:
/continuous/subject_id/activity_name/
  /camera_a/ → Left camera images
  /camera_b/ → Right camera images
  /sensors/ → JSON files with IMU data
/instantaneous/subject_id/activity_name/repetition_id/
  /camera_a/
  /camera_b/
  /sensors/

Data Quality & Missing Data
The smartwatch buffers 100 readings per second before sending them, ensuring minimal data loss. Synchronization latency between the smartwatch and the computer is negligible. Not all IMU samples have corresponding images due to different recording rates. Outliers and anomalies were handled by discarding incomplete sequences at the start and end of continuous activities.

Error Ranges & Limitations
Sensor data may contain noise due to minor hand movements. The heart rate sensor operates at 1 Hz, limiting its temporal resolution. Camera exposure settings were automatically adjusted, which may introduce slight variations in lighting.

File Formats & Software Compatibility
IMU data is stored in JSON format, readable with Python's json library. Images are in PNG format, compatible with all standard image processing tools. Recommended libraries for data analysis:
- Python: numpy, pandas, scikit-learn, tensorflow, pytorch
- Visualization: matplotlib, seaborn
- Deep Learning: Keras, PyTorch

Potential Applications
Development of activity recognition models in educational settings. Study of student engagement based on movement patterns. Investigation of sensor fusion techniques combining visual and IMU data. This dataset represents a unique contribution to activity recognition research, providing rich multimodal data for developing robust models in real-world educational environments.

Citation
If you find this project helpful for your research, please cite our work using the following bibtex entry:
@misc{marquezcarpintero2025caddiinclassactivitydetection,
  title={CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors},
  author={Luis Marquez-Carpintero and Sergio Suescun-Ferrandiz and Monica Pina-Navarro and Miguel Cazorla and Francisco Gomez-Donoso},
  year={2025},
  eprint={2503.02853},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.02853},
}
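As a small, hedged sketch of working with the files (the directory layout follows the listing above, but the subject/activity names and any JSON field names are assumptions, not taken from the real files):

import json
from pathlib import Path

root = Path("continuous/subject_01/typing")   # hypothetical subject and activity names

# Load one IMU JSON file from the sensors folder
sensor_file = next((root / "sensors").glob("*.json"))
with open(sensor_file) as f:
    imu = json.load(f)
print(type(imu))

# List the synchronized stereo frames from the left camera
left_frames = sorted((root / "camera_a").glob("*.png"))
print(len(left_frames), "left-camera frames")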
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reproductive health and Family planning service characteristics of respondents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
As it contains long off-periods of zeros, the CSV file compresses well.
To extract it, use: xz -d DARCK.csv.xz.
The compression leads to a 97% smaller file size (from 4 GB to 90.9 MB).
To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:
import pandas as pd
df = pd.read_csv("DARCK.csv", parse_dates=["time"])
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.
The dataset is provided as a single comma-separated value (CSV) file (DARCK.csv).
Column Name | Data Type | Unit | Description
time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS
main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel.
[appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list.
Aggregate Columns | | |
aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger.
aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2.
aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap.
Analysis Columns | | |
inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for.
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
Aggregate (main) postprocessing: The aggregate power data required several cleaning steps to ensure accuracy.
Shelly (shellies) postprocessing: The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the observed change is too small (less than a few watts), the reading is pushed once a minute together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
Sub-second readings were resampled onto a regular 1-second time index using .resample('1s').last().ffill(). NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption. During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
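A hedged sketch of this resampling/fill logic and of recomputing the inaccuracy column described above (the raw file shellies_fridge_raw.csv and its power column are placeholders, not actual files from the dataset):

import pandas as pd

# Hypothetical raw export of one Shelly device: irregular, sub-second readings
raw = pd.read_csv("shellies_fridge_raw.csv", parse_dates=["time"]).set_index("time")["power"]

# Keep the last reading per second, place it on a regular 1 s grid,
# forward-fill between updates, and treat pre-installation gaps as 0 W
regular = raw.resample("1s").last().ffill().fillna(0.0)

# Recompute the inaccuracy column on the merged dataset: absolute error between
# the sum of the individual sub-meters (plus the 30 W self-consumption offset)
# and the mains reading; aggr_* columns are excluded to avoid double counting
df = pd.read_csv("DARCK.csv", parse_dates=["time"]).set_index("time")
submeters = [c for c in df.columns if not c.startswith("aggr_") and c not in ("main", "inaccuracy")]
inaccuracy = (df[submeters].sum(axis=1) + 30 - df["main"]).abs()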
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.

Methods
1. Data Collection
Source: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.
Format: Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
2. Preprocessing
Data cleaning: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).
Reshaping: The data was reshaped into matrices for CpG counts and O/E ratios using pandas' melt() and pivot() functions.
3. Distance Calculation
Euclidean distance: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
4. Identification of Closest and Distant Relatives
The virus with the smallest total distance was identified as the closest relative. The virus with the largest total distance was identified as the most distant relative.
5. Heatmap Generation
Tools: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.
Parameters: Heatmaps were annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios. Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.

Results
Closest relative: The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance. Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.
Most distant relative: The most distant relative was identified based on the largest Euclidean distance. Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.

Tools and Libraries
Programming language: Python 3.13
Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling missing/infinite values), scipy.spatial.distance (Euclidean distance calculation), seaborn (heatmap generation), matplotlib (additional visualization enhancements).
File formats: input CSV files containing CpG counts and O/E ratios; output PNG images of heatmaps.

Files Included
CSV file: contains the raw data of CpG counts and O/E ratios for all viruses.
Heatmap images: heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
Python script: full Python code used for data processing, distance calculation, and heatmap generation.

Usage Notes
Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.

Acknowledgments
Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.

License
This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given.
DOI: 10.6084/m9.figshare.28736501
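A condensed, hedged sketch of the distance and heatmap steps (the CSV layout, file name, and index label used here are illustrative; the actual script accompanies the dataset):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean

# Illustrative wide matrix: rows = viruses, columns = intergenic regions (CpG counts)
counts = pd.read_csv("cpg_counts.csv", index_col="virus")

# Clean as described: cap infinities, replace NaN with column means
counts = counts.replace([np.inf, -np.inf], 1e9)
counts = counts.fillna(counts.mean())

# Euclidean distance of every virus to Wuhan-Hu-1
ref = counts.loc["Wuhan-Hu-1"]
dist = counts.drop(index="Wuhan-Hu-1").apply(lambda row: euclidean(row, ref), axis=1)
closest, farthest = dist.idxmin(), dist.idxmax()

# Annotated heatmap comparing the reference with its closest and most distant relatives
sns.heatmap(counts.loc[["Wuhan-Hu-1", closest, farthest]], annot=True, cmap="coolwarm")
plt.title("CpG counts: Wuhan-Hu-1 vs. closest and most distant relatives")
plt.show()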
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Socio-demographic and economic characteristics of respondents.
This is the HadISDH land 4.3.1.2020f version of the Met Office Hadley Centre Integrated Surface Dataset of Humidity (HadISDH). HadISDH-land is a near-global gridded monthly mean land surface humidity climate monitoring product. It is created from in situ observations of air temperature and dew point temperature from weather stations. The observations have been quality controlled and homogenised. Uncertainty estimates for observation issues and gridbox sampling are provided (see data quality statement section below). The data are provided by the Met Office Hadley Centre and this version spans 1/1/1973 to 31/12/2020. The data are monthly gridded (5 degree by 5 degree) fields. Products are available for temperature and six humidity variables: specific humidity (q), relative humidity (RH), dew point temperature (Td), wet bulb temperature (Tw), vapour pressure (e), and dew point depression (DPD).

This version extends the 4.2.0.2019f version to the end of 2020 and constitutes a minor update to HadISDH due to changing some of the code base from IDL and Python 2.7 to Python 3, detecting and fixing a bug in the process, and retrieving the missing April 2015 station data. These have led to small changes in regional and global average values and coverage. All other processing steps for HadISDH remain identical. Users are advised to read the update document in the Docs section for full details. As in previous years, the annual scrape of NOAA's Integrated Surface Dataset for HadISD.3.1.2.202101p, which is the basis of HadISDH.land, has pulled through some historical changes to stations. This, and the additional year of data, results in small changes to station selection. The homogeneity adjustments differ slightly due to sensitivity to the addition and loss of stations, historical changes to stations previously included, and the additional 12 months of data.

To keep informed about updates, news and announcements, follow the HadOBS team on Twitter @metofficeHadOBS. For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISDH blog: http://hadisdh.blogspot.co.uk/

References: When using the dataset in a paper please cite the following papers (see Docs for links to the publications) and this dataset (using the "citable as" reference):
Willett, K. M., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Parker, D. E., Jones, P. D., and Williams Jr., C. N.: HadISDH land surface multi-variable humidity and temperature record for climate monitoring, Clim. Past, 10, 1983-2006, doi:10.5194/cp-10-1983-2014, 2014.
Dunn, R. J. H., et al., 2016: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491.
Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1
We strongly recommend that you read these papers before making use of the data. More detail on the dataset can be found in an earlier publication: Willett, K. M., Williams Jr., C. N., Dunn, R. J. H., Thorne, P. W., Bell, S., de Podesta, M., Jones, P. D., and Parker, D. E., 2013: HadISDH: An updated land surface specific humidity product for climate monitoring. Climate of the Past, 9, 657-677, doi:10.5194/cp-9-657-2013.
This dataset is a harmonisation of the Domestic Electrical Load Survey (DELS) 1994-2014 dataset. The DELS 1994-2014 questionnaires were changed in 2000; consequently, survey questions vary between 1994-1999 and 2000-2014. This makes data processing complex, as survey responses first need to be associated with their year of collection and corresponding questionnaire before they can be correctly interpreted. This Key Variables dataset is a user-friendly version of the original dataset. It contains household responses to the most important survey questions, as well as geographic and linking information that allows the households to be matched to their respective electricity metering data. This dataset and similar custom datasets can be produced from the DELS 1994-2014 dataset with the Python package delprocess. The data processing section includes a description of how this dataset was created. The development of the tools to create this dataset was funded by the South African National Energy Development Initiative (SANEDI).
The study had national coverage.
Households
The dataset covers South African households in the DELS 1994-2014 dataset. These are electrified households that received electricity either directly from Eskom or from their local municipality.
Administrative records
The dataset includes all households for which survey responses have been captured in the DELS1994-2014 dataset.
Face-to-face [f2f]
This dataset has been constructed from the DELS 1994-2014 dataset using the data processing functions in the delprocess python package (www.github.com/wiebket/delprocess: release v1.0). The delprocess python package takes the complexities of the original DELS 1994-2014 dataset into account and makes use of 'spec files' to specify the processing steps that must be performed. To retrieve data for all survey years, two separate spec files are required to process survey response from 1994-1999 and 2000-2014. The spec files used to produce this dataset are included in the program files and can be used as templates for new custom datasets. Full instructions on how to use them to process the data are in the README file contained in the delprocess package.
SPEC FILES specify the processing steps that must be performed. In particular, the DELSKV 1994-2014 dataset has been produced by specifying the following processing steps:
TRANSFORMATIONS
* monthly_income from 1994-1999 is the variable returned by the 'income' search term
* monthly_income from 2000-2014 is calculated as the sum of the variables returned by the 'earn per month', 'money from small business' and 'external' search terms
* Appliance numbers from 1994-1999 are the count of appliances (no data was collected on broken appliances)
* Appliance numbers from 2000-2014 are the count of appliances minus the count of broken appliances (except for TV, which included no information on broken appliances)
* A new total_adults variable was created by summing the number of all occupants (male and female) over 16 years old
* A new total_children variable was created by summing the number of all occupants (male and female) under 16 years old
* A new total_pensioners variable was created by summing the number of pensioners (male and female) over 16 years old
* A new total_unemployed variable was created by summing the number of unemployed occupants (male and female) over 16 years old
* A new total_part_time variable was created by summing the number of part-time employed occupants (male and female) over 16 years old
* roof_material and wall_material values for 1994-1999 were augmented by 1
* water_access was transformed for 1994-1999 to be 4 minus the 'watersource' value
REPLACEMENTS
* Appliance usage values have been replaced with: 0=never, 1=monthly, 2=weekly, 3=daily
* water_access values have been replaced with: 1=nearby river/dam/borehole, 2=block/street taps, 3=tap in yard, 4=tap inside house
* roof_material and wall_material values have been replaced with: 1=IBR/Corr.Iron/Zinc, 2=Thatch/Grass, 3=Wood/Masonite board, 4=Brick, 5=Block, 6=Plaster, 7=Concrete, 8=Tiles, 9=Plastic, 10=Asbestos, 11=Daub/Mud/Clay
OTHER NOTES Appliance usage information was only collected after 2000. No binning was done to segment survey responses for this dataset.
MISSING VALUES Missing values have not been replaced and are represented as blanks except for imputed columns (total_adults, total_children, ...) and appliances after 2000, where missing values have been replaced with a 0.
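As a hedged illustration of the occupant transformations listed above (the source column names such as males_over_16 are hypothetical stand-ins; the real names come from the DELS questionnaires and the delprocess spec files):

import pandas as pd

df = pd.read_csv("dels_key_variables.csv")  # hypothetical extract of survey responses

# Derived occupant counts, mirroring the TRANSFORMATIONS above
df["total_adults"] = df["males_over_16"] + df["females_over_16"]
df["total_children"] = df["males_under_16"] + df["females_under_16"]
df["total_pensioners"] = df["male_pensioners"] + df["female_pensioners"]

# Imputed columns use 0 rather than blanks for missing values, as documented
imputed = ["total_adults", "total_children", "total_pensioners"]
df[imputed] = df[imputed].fillna(0)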