In this dataset I simply show how to handle missing values in your data with the help of Python libraries such as NumPy and pandas. You can also see the use of NaN and None values, and how to detect, drop, and fill null values.
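As a quick illustration of what the dataset is meant to practice, here is a minimal pandas sketch (the DataFrame and its column names are invented for illustration):

import numpy as np
import pandas as pd

# Small example frame; both None and np.nan show up as missing (NaN) values
df = pd.DataFrame({
    "age": [25, np.nan, 31, None],
    "city": ["Lahore", "Karachi", None, "Multan"],
})

# Detecting null values
print(df.isnull())          # boolean mask of missing cells
print(df.isnull().sum())    # number of missing values per column

# Dropping rows that contain any null value
dropped = df.dropna()

# Filling null values (numeric column with its mean, text column with a label)
filled = df.fillna({"age": df["age"].mean(), "city": "unknown"})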
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
• This dataset is designed for learning how to identify missing data in Python.
• It focuses on techniques to detect null, NaN, and incomplete values.
• It includes examples of visualizing missing data patterns using Python libraries (a short sketch follows this list).
• Useful for beginners practicing data preprocessing and data cleaning.
• Helps users understand missing data handling methods for machine learning workflows.
• Supports practical exploration of datasets before model training.
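A minimal sketch of one way to visualize a missing-data pattern with pandas and seaborn (the file name and the use of seaborn are assumptions for illustration):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data_with_missing_values.csv")  # hypothetical file name

# Each cell marks whether a value is missing; vertical stripes reveal columns
# that are systematically incomplete.
sns.heatmap(df.isna(), cbar=False)
plt.title("Missing-value pattern")
plt.tight_layout()
plt.show()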
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset is designed specifically for beginners and intermediate learners to practice data cleaning techniques using Python and Pandas.
It includes 500 rows of simulated employee data with intentional errors such as:
Missing values in Age and Salary
Typos in email addresses (e.g., @gamil.com instead of @gmail.com)
Inconsistent city name casing (e.g., lahore, Karachi)
Extra spaces in department names (e.g., " HR ")
✅ Skills You Can Practice:
Detecting and handling missing data
String cleaning and formatting
Removing duplicates
Validating email formats
Standardizing categorical data
You can use this dataset to build your own data cleaning notebook (a starter sketch is given below), or use it in interviews, assessments, and tutorials.
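A hedged starter sketch of that kind of cleaning with pandas (the file name and column names, such as "Age", "City", "Department", and "Email", are assumptions about the layout):

import pandas as pd

df = pd.read_csv("employee_data.csv")  # hypothetical file name

# Handle missing Age/Salary values (here: fill with the column median)
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# Standardize city casing and strip extra spaces from department names
df["City"] = df["City"].str.strip().str.title()
df["Department"] = df["Department"].str.strip()

# Fix the known email typo, then flag rows that still fail a basic format check
df["Email"] = df["Email"].str.replace("@gamil.com", "@gmail.com", regex=False)
valid_email = df["Email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Keep valid rows and drop exact duplicates
df = df[valid_email].drop_duplicates()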
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1. Python code. The notebook highlights core components of the code applied in the study.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Prediction models are used to predict an outcome based on input variables. Missing data in input variables often occur at model development and at prediction time. The missForestPredict R package proposes an adaptation of the missForest imputation algorithm that is fast, user-friendly and tailored for prediction settings. The algorithm iteratively imputes variables using random forests until a convergence criterion, unified for continuous and categorical variables, is met. The imputation models are saved for each variable and iteration and can be applied later to new observations at prediction time. The missForestPredict package offers extended error monitoring, control over variables used in the imputation and custom initialization. This allows users to tailor the imputation to their specific needs. The missForestPredict algorithm is compared to mean/mode imputation, linear regression imputation, mice, k-nearest neighbours, bagging, miceRanger and IterativeImputer on eight simulated datasets with simulated missingness (48 scenarios) and eight large public datasets using different prediction models. missForestPredict provides competitive results in prediction settings within short computation times.
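missForestPredict itself is an R package, but the underlying idea of iteratively imputing each variable with random forests can be sketched in Python using scikit-learn's IterativeImputer (one of the methods it is compared against). This is an illustrative analogue under arbitrary settings, not the package's own implementation:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # introduce roughly 20% missingness

# Iteratively model each variable from the others with a random forest,
# looping until the imputations stabilize or max_iter is reached.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)

# The fitted imputer can be reused on new observations at prediction time.
X_new = rng.normal(size=(5, 4))
X_new[0, 2] = np.nan
X_new_imputed = imputer.transform(X_new)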
This resource contains a Python script used to clean and preprocess the alum dosage dataset from a small Oklahoma water treatment plant. The script handles missing values, removes outliers, merges historical water quality and weather data, and prepares the dataset for AI model training.
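A hedged sketch of the kind of preprocessing described; the file names, column names, and the IQR outlier rule are assumptions for illustration, not the actual script:

import pandas as pd

# Load the raw plant records and weather data (hypothetical file/column names)
dosage = pd.read_csv("alum_dosage.csv", parse_dates=["timestamp"]).sort_values("timestamp")
weather = pd.read_csv("weather.csv", parse_dates=["timestamp"]).sort_values("timestamp")

# Handle missing values: interpolate short gaps, then drop rows still incomplete
dosage["alum_dose"] = dosage["alum_dose"].interpolate(limit=3)
dosage = dosage.dropna(subset=["alum_dose", "turbidity"])

# Remove outliers with a simple IQR rule on the dose column
q1, q3 = dosage["alum_dose"].quantile([0.25, 0.75])
iqr = q3 - q1
dosage = dosage[dosage["alum_dose"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Merge historical water quality and weather data on the nearest timestamp
merged = pd.merge_asof(dosage, weather, on="timestamp", direction="nearest")
merged.to_csv("alum_dosage_clean.csv", index=False)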
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains information on the Surface Soil Moisture (SM) content derived from satellite observations in the microwave domain.
A description of this dataset, including the methodology and validation results, is available at:
Preimesberger, W., Stradiotti, P., and Dorigo, W.: ESA CCI Soil Moisture GAPFILLED: an independent global gap-free satellite climate data record with uncertainty estimates, Earth Syst. Sci. Data, 17, 4305–4329, https://doi.org/10.5194/essd-17-4305-2025, 2025.
ESA CCI Soil Moisture is a multi-satellite climate data record that consists of harmonized, daily observations coming from 19 satellites (as of v09.1) operating in the microwave domain. The wealth of satellite information, particularly over the last decade, facilitates the creation of a data record with the highest possible data consistency and coverage.
However, data gaps are still found in the record. This is particularly notable in earlier periods when a limited number of satellites were in operation, but can also arise from various retrieval issues, such as frozen soils, dense vegetation, and radio frequency interference (RFI). These data gaps present a challenge for many users, as they have the potential to obscure relevant events within a study area or are incompatible with (machine learning) software that often relies on gap-free inputs.
Since the requirement for a gap-free ESA CCI SM product was identified, various studies have demonstrated the suitability of different statistical methods to achieve this goal. A fundamental feature of such gap-filling methods is that they rely only on the original observational record, without the need for ancillary variables or model-based information. Due to this intrinsic challenge, no global, long-term, univariate gap-filled product had been available until now. In this version of the record, data gaps due to missing satellite overpasses and invalid measurements are filled using the Discrete Cosine Transform (DCT) Penalized Least Squares (PLS) algorithm (Garcia, 2010). A linear interpolation is applied over periods of (potentially) frozen soils with little to no variability in (frozen) soil moisture content. Uncertainty estimates are based on models calibrated in experiments to fill satellite-like gaps introduced to GLDAS Noah reanalysis soil moisture (Rodell et al., 2004), and consider the gap size and local vegetation conditions as parameters that affect the gap-filling performance.
You can use command line tools such as wget or curl to download (and extract) data for multiple years. The following script will download and extract the complete data set to the local directory ~/Downloads on Linux or macOS systems.
#!/bin/bash
# Set download directory
DOWNLOAD_DIR=~/Downloads
base_url="https://researchdata.tuwien.at/records/3fcxr-cde10/files"
# Loop through years 1991 to 2023 and download & extract data
for year in {1991..2023}; do
echo "Downloading $year.zip..."
wget -q -P "$DOWNLOAD_DIR" "$base_url/$year.zip"
unzip -o "$DOWNLOAD_DIR/$year.zip" -d "$DOWNLOAD_DIR"
rm "$DOWNLOAD_DIR/$year.zip"
done
The dataset provides global daily estimates for the 1991-2023 period at 0.25° (~25 km) horizontal grid resolution. Daily images are grouped by year (YYYY), each subdirectory containing one netCDF image file for a specific day (DD), month (MM) in a 2-dimensional (longitude, latitude) grid system (CRS: WGS84). The file name has the following convention:
ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-YYYYMMDD000000-fv09.1r1.nc
Each netCDF file contains 3 coordinate variables (WGS84 longitude, latitude and time stamp), as well as the following data variables:
Additional information for each variable is given in the netCDF attributes.
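A minimal sketch of reading one of these files in Python with xarray (assuming xarray and netCDF4 are installed; the date in the file name and the variable/coordinate names are only examples):

import xarray as xr

# Open a single daily image following the naming convention above
path = "2020/ESACCI-SOILMOISTURE-L3S-SSMV-COMBINED_GAPFILLED-20200101000000-fv09.1r1.nc"
ds = xr.open_dataset(path)

print(ds.data_vars)   # list the data variables and their netCDF attributes

# Pick the soil moisture variable and read the value nearest a point of interest
# (variable and coordinate names assumed here as "sm", "lat", "lon")
sm = ds["sm"]
print(sm.sel(lat=48.2, lon=16.4, method="nearest").values)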
Changes in v9.1r1 (previous version was v09.1):
These data can be read by any software that supports Climate and Forecast (CF) conformant metadata standards for netCDF files, such as:
The following records are all part of the ESA CCI Soil Moisture science data records community:

ESA CCI SM MODELFREE Surface Soil Moisture Record: https://doi.org/10.48436/svr1r-27j77
This data package includes data and scripts from the manuscript "Denoising autoencoder for reconstructing sensor observation data and predicting evapotranspiration: noisy and missing values repair and uncertainty quantification".

The study addressed common challenges faced in environmental sensing and modeling, including uncertain input data, missing sensor observations, and high-dimensional datasets with interrelated but redundant variables. Point-scaled meteorological and soil sensor observations were perturbed with noise and missing values, and denoising autoencoder (DAE) neural networks were developed to reconstruct the perturbed data and further predict evapotranspiration. This study concluded that (1) the reconstruction quality of each variable depends on its cross-correlation and alignment to the underlying data structure, (2) uncertainties from the models were overall stronger than those from the data corruption, and (3) there was a tradeoff between reducing bias and reducing variance when evaluating the uncertainty of the machine learning models.

This package includes:
(1) Four IPython scripts (.ipynb): "DAE_train.ipynb" trains and evaluates DAE neural networks, "DAE_predict.ipynb" makes predictions from the trained DAE models, "ET_train.ipynb" trains and evaluates ET prediction neural networks, and "ET_predict.ipynb" makes predictions from trained ET models.
(2) One Python file (.py): "methods.py" includes all user-defined functions and Python code used in the IPython scripts.
(3) A "sub_models" folder that includes five trained DAE neural networks (in PyTorch format, .pt), which can be used to ingest input data before it is fed to the downstream ET models in "ET_train.ipynb" or "ET_predict.ipynb".
(4) Two data files (.csv). Daily meteorological, vegetation, and soil data is in "df_data.csv", and "df_meta.csv" contains the location and time information of "df_data.csv". Each row (index) in "df_meta.csv" corresponds to the same row in "df_data.csv". These data files are formatted to follow the data structure requirements and can be used directly in the IPython scripts; they have been shuffled chronologically to train machine learning models. The meteorological and soil data was collected using point sensors between 2019-2023 at
(4.a) three shrub-dominated field sites in East River, Colorado (named "ph1", "ph2" and "sg5" in "df_meta.csv", where "ph1" and "ph2" were located at PumpHouse Hillslopes, and "sg5" was at Snodgrass Mountain meadow) and
(4.b) one outdoor, mesoscale, herbaceous-dominated experiment in Berkeley, California (named "tb" in "df_meta.csv", short for the Smartsoils Testbed at Lawrence Berkeley National Lab).

See "df_data_dd.csv" and "df_meta_dd.csv" for variable descriptions and the Methods section for additional data processing steps. See "flmd.csv" and "README.txt" for brief file descriptions. All IPython scripts and Python files are written in and require the Python language.
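For readers unfamiliar with the approach, a minimal denoising-autoencoder sketch in PyTorch is given below; it is a generic illustration with arbitrary dimensions and corruption, not the trained models or training setup shipped in this package:

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Reconstructs clean sensor vectors from corrupted (noisy/missing) inputs."""
    def __init__(self, n_features=16, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy training loop: corrupt inputs with noise and random "missing" zeros,
# then train the network to recover the original values.
torch.manual_seed(0)
clean = torch.randn(256, 16)
noisy = clean + 0.1 * torch.randn_like(clean)
noisy[torch.rand_like(noisy) < 0.2] = 0.0  # crude stand-in for missing values

model = DenoisingAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)
    loss.backward()
    optimizer.step()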
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data and the code. For a detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The Python scripts run with Python 3.7 and the packages listed in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways:
from helper_functions import *
import numpy as np
import pandas as pd

# Load the cleansed frequency time series; corrupted values are stored as NaN
cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                            index_col=0, header=None, squeeze=True,
                            parse_dates=[0])

# Find the contiguous NaN-free intervals and keep the longest one
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folders "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received permission from TransnetBW to publish the pre-processed version. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.
Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because of a lack of practical treatments for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, to a small range of data loss, or to be species-specific, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for 4 focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, and is constrained by neighboring elevations within a specified radial distance. A 480 meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight or viewshed concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), using a USGS National Elevation Dataset as input.

Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have a minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here as fix attempts by hour. This table can be linked with the site location shapefile using the site field.

Part 3, Probability Raster (raster dataset): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have a minor influence on the model's ability to predict FSR in new study areas in the southwestern US. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home-ranges to the actual observed FSR of GPS-downloaded deployed collars on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home-ranges and observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with an r-squared of 0.68.

Part 4, GPS Test Collar Sites (shapefile): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found terrain exposure and tall overstory vegetation are the primary environmental features that affect GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and showed little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have a minor influence on the model's ability to predict FSR in new study areas in the southwestern US.

Part 5, Cougar Home Ranges (shapefile): Cougar home-ranges were calculated to compare the mean probability of a GPS fix acquisition across the home-range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home-range have an effect on observed FSR. We estimated home-ranges using the Local Convex Hull (LoCoH) method with the 90th isopleth. Only data obtained from GPS download of retrieved units were used. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates may play a role in observed FSRs.

Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour of day, suggesting circadian rhythms with bouts of rest during daylight hours may change the orientation of the GPS receiver, affecting the ability to acquire fixes. Raw data of overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Data only include direct GPS download datasets. Satellite-delivered data were omitted from the analysis for animals where the collar was lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30 meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large mammal habitat use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborated models.
Data usage
The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494
Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.
Sample code
As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.
Units
All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

Missing data
The string "NAN" indicates missing data.

File formats
All time series data files are in CSV (comma separated values) format. Images are provided in tar.bz2 archives.
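A minimal pandas sketch for loading one of the CSV files with "NAN" treated as missing data (the assumption that the first column is the UTC time stamp is illustrative):

import pandas as pd

# Parse the first column as the UTC time stamp and treat "NAN" as missing
weather = pd.read_csv("Folsom_weather.csv",
                      na_values=["NAN"],
                      parse_dates=[0],
                      index_col=0)
print(weather.isna().mean())  # fraction of missing values per column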
Files
Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.
Folsom_weather.csv (Primary): One-minute weather data.
Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.
Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node's coordinates listed in Table I in the paper.
Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.
Folsom_satellite.csv (Secondary): 10 pixel by 10 pixel GOES-15 images centered in the target location.
Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).
Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecasting issuing times.
Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecasting issuing times.
NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location, prepared for day-ahead forecasting.
Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.
Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.
Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis (a short sketch follows below).
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
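For use case 2, a minimal preprocessing sketch with pandas could look like the following (column names follow the list above; the file name is an assumption):

import pandas as pd

df = pd.read_csv("customer_transactions.csv", parse_dates=["Purchase Date"])

# Handle missing values: assume no discount means a discount amount of 0
df["Discount Amount (INR)"] = df["Discount Amount (INR)"].fillna(0)

# Encode categorical variables for downstream machine learning models
df = pd.get_dummies(df, columns=["Gender", "Age Group", "Product Category",
                                 "Purchase Method", "Location"], drop_first=True)

# Normalize numerical values to the 0-1 range
for col in ["Discount Amount (INR)", "Gross Amount", "Net Amount"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())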
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.
The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:
After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
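As a rough illustration of the intended workflow, here is a hedged sketch of loading the token lists and fitting an LDA model with gensim; the number of topics and other settings are arbitrary assumptions, not those used in the included notebook:

import pickle

from gensim import corpora
from gensim.models import LdaModel

# Load the pre-tokenized documents (a list of token lists)
with open("scied_words_bigrams_V5.pkl", "rb") as f:
    documents = pickle.load(f)

# Build the dictionary and bag-of-words corpus, then fit LDA
dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=5, random_state=0)

# Inspect a few of the learned topics
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)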
In addition to this file, we have also included the following files:
This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Default and optimally tuned hyperparameters of the Random Forest model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reproductive health and Family planning service characteristics of respondents.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot-encoding of categorical values
Imputation of missing values using knn-imputer with k=1
Standard scaling of ordinal attributes
Note: we assume the scenario where the test set is available before training (every attribute besides the target, "income"); therefore, we combine the train and test sets before preprocessing. A sketch of these steps follows.
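A hedged scikit-learn sketch of the listed steps (raw file names and which columns are scaled are assumptions; the actual notebook may differ, e.g., by scaling only the ordinal attributes):

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("adult_train_raw.csv")   # hypothetical raw file names
test = pd.read_csv("adult_test_raw.csv")

# Combine train and test before preprocessing, setting the target aside
combined = pd.concat([train, test], keys=["train", "test"])
target = combined.pop("income")

# One-hot-encoding of categorical values
categorical = combined.select_dtypes(include="object").columns
combined = pd.get_dummies(combined, columns=list(categorical))

# Imputation of missing values using a KNN imputer with k=1, then scaling
combined = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(combined),
                        columns=combined.columns, index=combined.index)
combined = pd.DataFrame(StandardScaler().fit_transform(combined),
                        columns=combined.columns, index=combined.index)

# Split back into the preprocessed train and test sets
adult_train = combined.loc["train"].assign(income=target.loc["train"].values)
adult_test = combined.loc["test"].assign(income=target.loc["test"].values)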
Description
The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). This dataset, covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, making it a hopefully robust and reliable resource for understanding climatic dynamics in Northern Cameroon.

Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.

Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset.

Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.

Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. As such, to ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.

Authors Contributions:
Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne.
Documentation: Jérémy Lavarenne.
Funding: This project was funded by the DESIRA INNOVACC project.

Changelog:
v1.0.2: corrected swapped column names in the coordinates dataset
v1.0.1: dataset specification file updated with complementary information regarding station locations
v1.0.0: initial submission
The BL_full_analysis_script.py file implements, in a sequential and reproducible manner, all the quantitative apparatus described in the article. Its logic can be summarized in the following linked modules:
Data loading and cleaning. Reads the Dataset Reservas_V.1.xlsx file, identifies the date column, cleans missing values, and calculates monthly logarithmic returns for each asset (gold, SDR, IMF position, currencies, and Bitcoin). The resulting set constitutes the return matrix on which subsequent calculations are based.

Construction of Black-Litterman parameters. Calculates the covariance matrix Σ and the market weight vector w_m; with τ = 0.05, derives the implied equilibrium returns π = τΣw_m. Automatically detects the BTC column to set the view restriction.

black_litterman function. Implements the standard Bayesian solution: for each pair (Q, Ω) it constructs the adjusted return vector μ_BL and solves for the optimal allocation w* under an aversion coefficient λ = 3, normalizing the portfolio to full investment (a generic sketch of this step appears after this list).

Sensitivity loop and export. Runs through a grid of 30 scenarios (Q from −5% to 15%, Ω from 1% to 5%), stores the optimal weights and their risk metrics in a DataFrame, and finally saves the result in BL_metrics_full_dataset_v1.xlsx.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected data from 36 published articles on PubMed [13–48] to train and validate our machine learning models. Some articles comprised more than one type of cartilage injury model or treatment condition. In total, 15 clinical trial conditions and 29 animal model conditions (1 goat, 6 pigs, 2 dogs, 9 rabbits, 9 rats, and 2 mice) on osteochondral injury or osteoarthritis were included, where MSCs were transplanted to repair the cartilage tissue. We documented each case with a specific treatment condition as an entry by considering the cell- and treatment-target-related factors as input properties, including species, body weight, tissue source, cell number, cell concentration, defect area, defect depth, and type of cartilage damage. The therapeutic outcomes were considered as output properties, which were evaluated using integrated clinical and histological cartilage repair scores, including the international cartilage repair society (ICRS) scoring system, the O'Driscoll score, the Pineda score, the Mankin score, the osteoarthritis research society international (OARSI) scoring system, the international knee documentation committee (IKDC) score, the visual analog score (VAS) for pain, the knee injury and osteoarthritis outcome score (KOOS), the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and the Lysholm score. In this study, these scores were linearly normalized to a number between 0 and 1, with 0 representing the worst damage or pain, and 1 representing completely healthy tissue. The list of entries was combined to form a database.
We have provided the details of the imputation algorithm in the subsection Handling missing data under Methods, and a flowchart in Fig 2. An illustration of the data imputation algorithm for the vector x was added in the manuscript. The pseudo-code for the uncertainty calculation is shown in S1 Algorithm: An ensemble model to measure the ANN's prediction uncertainty. The original database gathered from the literature, and a 'complete' database with missing information filled in by our neural network, are also included, along with a sample neural network architecture file in Python.
Here we provide a Python notebook comprising a neural network that delivers the performance and results described in the manuscript. Documentation in the form of comments and installation guide is included in the Python notebook. This Python notebook along with the methods described in the manuscript provides sufficient details for other interested readers to either extend this script or write their own scripts and reproduce the results in the paper.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Description: Geographical Distribution and Climate Data of Cycas taiwaniana (Taiwanese Cycad)

This dataset contains the geographical distribution and climate data for Cycas taiwaniana, focusing on its presence across regions in Fujian, Guangdong, and Hainan provinces of China. The dataset includes geographical coordinates (longitude and latitude), monthly climate data (minimum and maximum temperature, and precipitation) across different months, as well as bioclimatic variables based on the WorldClim dataset.

**Temporal and Spatial Information** The data covers long-term climate information, with monthly data for each location recorded over a 12-month period (January to December). The dataset includes spatial data in terms of longitude and latitude, corresponding to various locations where Cycas taiwaniana populations are present. The spatial resolution is specific to each point location, and the temporal resolution reflects the monthly climate data for each year.

**Data Structure and Units** The dataset consists of 36 records, each representing a unique location with corresponding climate and geographical data. The table includes the following columns:
1. No.: Unique identifier for each data record
2. Longitude: Geographic longitude in decimal degrees
3. Latitude: Geographic latitude in decimal degrees
4. tmin1 to tmin12: Minimum temperature (°C) for each month (January to December)
5. tmax1 to tmax12: Maximum temperature (°C) for each month (January to December)
6. prec1 to prec12: Precipitation (mm) for each month (January to December)
7. bio1 to bio19: Bioclimatic variables (e.g., annual mean temperature, temperature seasonality, precipitation, etc.) derived from WorldClim data (unit varies depending on the variable)

The units for each measurement are as follows:
- Temperature: Degrees Celsius (°C)
- Precipitation: Millimeters (mm)
- Bioclimatic variables: Varies depending on the specific variable (e.g., °C, mm)

**Data Gaps and Missing Values** The dataset contains some missing values, particularly in the precipitation columns for certain months and locations. These missing values may result from gaps in climate station data or limitations in data collection for specific regions. Missing values are indicated as "NA" (Not Available) in the dataset. In cases where data gaps exist, estimations were not made, and the absence of the data is acknowledged in the record.

**File Format and Software Compatibility** The dataset is provided in CSV format for ease of use and compatibility with various data analysis tools. It can be opened and processed using software such as Microsoft Excel, R, or Python (with Pandas). Users can download the dataset and work with it in software such as R (https://cran.r-project.org/) or Python (https://www.python.org/). The dataset is compatible with any software that supports CSV files.

This dataset provides valuable information for research related to the geographical distribution and climate preferences of Cycas taiwaniana and can be used to inform conservation strategies, ecological studies, and climate change modeling.