In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed. The marks.csv dataset contains columns such as Mark and City. The Student_id column is common between the two datasets. Follow these steps to complete this exercise.
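A minimal sketch of the merge, assuming both CSV files sit in the working directory and share Student_id as the join key:

```python
# Hedged sketch: file locations and an inner join are assumptions; adjust the
# `how` argument if students missing from either file should be kept.
import pandas as pd

student = pd.read_csv("student.csv")   # Age, Gender, Grade, Employed, Student_id
marks = pd.read_csv("marks.csv")       # Mark, City, Student_id

# Join the two datasets on the common Student_id column
merged = pd.merge(student, marks, on="Student_id")
print(merged.head())
```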
Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography-mass spectrometry (LC-MS) involves the removal of unwanted features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces a package for the Python programming language for preprocessing LC-MS data for quality control procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or made fit for purpose according to the specific metabolomics application. It allows performing quality control procedures to ensure accuracy and reliability in LC-MS measurements, and it allows preprocessing metabolomics data to obtain cleaned matrices for subsequent statistical analysis. The capabilities of the package are showcased with pipelines for an LC-MS system suitability check, system conditioning, signal drift evaluation, and data curation. These applications were implemented to preprocess data corresponding to a new suite of candidate plasma reference materials developed by the National Institute of Standards and Technology (NIST; hypertriglyceridemic, diabetic, and African-American plasma pools) to be used in untargeted metabolomics studies, in addition to NIST SRM 1950 – Metabolites in Frozen Human Plasma. The package offers a rapid and reproducible workflow that can be used in an automated or semi-automated fashion, and it is an open and free tool available to all users.
https://creativecommons.org/publicdomain/zero/1.0/
Public "Titanic" dataset for data exploration, preprocessing and benchmarking basic classification/regression models.
GitHub: https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv
Playground for visualizations, preprocessing, feature engineering, model pipelining, and more.
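As a hedged starting point (the raw-file URL below is derived from the GitHub link above, and the feature choices are illustrative only), the dataset can be loaded and benchmarked with a simple baseline:

```python
# Load the seaborn copy of the Titanic data and fit a quick baseline classifier.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
df = pd.read_csv(url)

# Minimal preprocessing: a few illustrative features, one-hot encoding, simple imputation
X = pd.get_dummies(df[["pclass", "sex", "age", "fare"]], drop_first=True)
X["age"] = X["age"].fillna(X["age"].median())
y = df["survived"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"Baseline accuracy: {clf.score(X_te, y_te):.3f}")
```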
Information about data sources is available. Some downloading scripts are included in the provided code. However, users should make sure to comply with the data providers' terms and conditions. Given the changing download options of the different institutions, the above links may not work permanently and data may have to be retrieved by the user of this dataset. No quality control is applied in the provided preprocessing software; quality control is up to the user of the datasets. Some datasets are quality controlled by the owner.
Acknowledgements: We thank all the data providers for making the data publicly available or providing them upon request. Full acknowledgements can be found in Gerber et al., submitted.
References: Amory, C. (2020). "Drifting-snow statistics from multiple-year autonomous measurements in Adélie Land, East Antarctica". The Cryosphere, 1713–1725. doi: 10.5194/tc-14-1713-2020. Gerber, F., Sharma, V. and Lehning, M.: CRYOWRF - a validation and the effect of blowing snow on the Antarctic SMB, JGR - Atmospheres, submitted.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e. advanced search) according to the following criteria: (1) Keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) Keywords (all words): masque (French), mask (English), máscara (Spanish); (3) Periods: March 2020, May 2020, July 2020; (4) Countries: UK (English), Spain (Spanish), France (French). A corpus per country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) are built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: - Python preprocessing and BioTex code [Execution_BioTex.tgz]; - terms extracted with different ranking measures (i.e. C-Value, F-TFIDF-C_M) and methods (i.e. extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes three high-density sleep EEG recordings of healthy participants, downsampled to 250 Hz and stored in FIF format:
Additionally, the dataset includes three text files for each recording:
The corresponding package can be found on GitHub.
For citation, please use: https://doi.org/10.1101/2023.12.17.572046
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
There are many sources providing atmospheric weather station data for the Antarctic continent. However, variable naming, timestamps and data types vary considerably between the different sources. The published Python code is intended to make processing of different AWS sources from Antarctica easier. For all datasets that are taken into account, variables are renamed in a consistent way. Data from different sources can then be handled in one consistent Python dictionary. The following data sources are taken into account:
* AAD: Australian Antarctic Division (https://data.aad.gov.au/aws)
* ACECRC: Antarctic Climate and Ecosystems Cooperative Research Centre by the Australian Antarctic Division
* AMRC: Antarctic Meteorological Research Center (ftp://amrc.ssec.wisc.edu/pub/aws/q1h/)
* BAS: British Antarctic Survey (ftp://ftp.bas.ac.uk/src/ANTARCTIC_METEOROLOGICAL_DATA/AWS/; https://legacy.bas.ac.uk/met/READER/ANTARCTIC_METEOROLOGICAL_DATA/)
* CLIMANTARTIDE: Antarctic Meteo-Climatological Observatory by the Italian National Programme of Antarctic Research (https://www.climantartide.it/dataaccess/index.php?lang=en)
* IMAU: Institute for Marine and Atmospheric research Utrecht (Lazzara et al., 2012), https://www.projects.science.uu.nl/iceclimate/aws/antarctica.ph
* JMA: Japan Meteorological Agency (https://www.data.jma.go.jp/antarctic/datareport/index-e.html)
* NOAA: National Oceanic and Atmospheric Administration (https://gml.noaa.gov/aftp/data/meteorology/in-situ/spo/)
* Other/AWS_PE: Princess Elisabeth (PE), KU Leuven, Prof. N. van Lipzig, personal communication
* Other/DDU_transect: Stations D-17 and D-47 (in transect between Dumont d'Urville and Dome C, Amory, 2020)
* PANGAEA: World Data Center (e.g. König-Langlo, 2012)
Important notes:
* Information about data sources is available. Some downloading scripts are included in the provided code. However, users should make sure to comply with the data providers' terms and conditions.
* Given the changing download options of the different institutions, the above links may not work permanently and data may have to be retrieved by the user of this dataset.
* No quality control is applied in the provided preprocessing software; quality control is up to the user of the datasets. Some datasets are quality controlled by the owner.
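A schematic illustration of the harmonization idea (this is not the published code itself; the provider column names below are invented for the example): per-source rename maps bring each provider's variable names onto one consistent vocabulary, so all stations can be handled in a single Python dictionary.

```python
import pandas as pd

# Hypothetical original variable names for two providers
RENAME_MAPS = {
    "AMRC": {"Temperature (C)": "air_temperature", "Wind Speed (m/s)": "wind_speed"},
    "BAS": {"T_air": "air_temperature", "ff": "wind_speed"},
}

def load_station(source, path):
    """Read one AWS file and rename its variables consistently."""
    df = pd.read_csv(path, parse_dates=True, index_col=0)
    return df.rename(columns=RENAME_MAPS[source])

# Data from different sources end up in one consistent dictionary
aws = {
    "station_a": load_station("AMRC", "station_a.csv"),
    "station_b": load_station("BAS", "station_b.csv"),
}
```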
We thank all the data providers for making the data publicly available or providing them upon request. Full acknowledgements can be found in Gerber et al., submitted.
Amory, C. (2020). "Drifting-snow statistics from multiple-year autonomous measurements in Adélie Land, East Antarctica". The Cryosphere, 1713–1725. doi: 10.5194/tc-14-1713-2020
Gerber, F., Sharma, V. and Lehning, M.: CRYOWRF - a validation and the effect of blowing snow on the Antarctic SMB, JGR - Atmospheres, submitted.
König-Langlo, G. (2012). "Continuous meteorological observations at Neumayer station (2011-01)". Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, PANGAEA, doi: 10.1594/PANGAEA.775173
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
1. vege_original: Images of vegetables captured manually in the data acquisition stage
2. vege_cropped_renamed: Images in (1) cropped to remove background areas, with image labels renamed
3. non-vege images: Images of non-vegetable foods for the CNN network to recognize other-than-vegetable foods
4. food_image_dataset: Complete set of vege (2) and non-vege (3) images for architecture building
5. food_image_dataset_split: Image dataset (4) split into train and test sets
6. process: Images created when cropping (pre-processing step) to create dataset (2)
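One possible way to load the split image dataset (an assumption, not part of the provided code, and assuming the usual class-subfolder layout inside the train and test folders):

```python
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "food_image_dataset_split/train", image_size=(224, 224), batch_size=32)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "food_image_dataset_split/test", image_size=(224, 224), batch_size=32)
print(train_ds.class_names)  # e.g. vegetable vs. non-vegetable classes
```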
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has evolved.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
Overnight high-density EEG recording of a healthy young participant, downsampled to 250 Hz, in FIF format. In addition, there are two text files: bad_channels.txt contains the indexes of noisy channels; annotations.txt contains the onset and duration of noisy temporal intervals. The package can be found on GitHub.
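A minimal sketch of loading the recording with MNE-Python and attaching the provided noise metadata; the file name and the exact text-file layouts (plain whitespace-separated values) are assumptions:

```python
import mne
import numpy as np

raw = mne.io.read_raw_fif("recording_raw.fif", preload=False)  # hypothetical file name

# Mark noisy channels listed by index in bad_channels.txt
bad_idx = np.loadtxt("bad_channels.txt", dtype=int).ravel()
raw.info["bads"] = [raw.ch_names[i] for i in bad_idx]

# Attach noisy temporal intervals (onset, duration in seconds) from annotations.txt
onset_dur = np.loadtxt("annotations.txt", ndmin=2)
raw.set_annotations(mne.Annotations(onset=onset_dur[:, 0],
                                    duration=onset_dur[:, 1],
                                    description=["bad_noise"] * len(onset_dur)))
```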
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot-encoding of categorical values
Imputation of missing values using knn-imputer with k=1
Standard scaling of ordinal attributes
Note: we assume the scenario in which the test set (every attribute except the target, "income") is available before training; therefore, we combine the train and test sets before preprocessing.
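A sketch of these preprocessing steps with scikit-learn; the raw file names, the presence of an "income" column, and the exact handling of missing categories are assumptions rather than details taken from the notebook:

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("adult_train_raw.csv")   # hypothetical raw exports of the Adult data
test = pd.read_csv("adult_test_raw.csv")

# Combine train and test before preprocessing (test attributes, except the
# target "income", are assumed available before training)
full = pd.concat([train, test], keys=["train", "test"])
income = full.pop("income")

categorical = full.select_dtypes("object").columns
numeric = full.select_dtypes("number").columns

encoded = pd.get_dummies(full, columns=categorical)                   # one-hot encoding
imputed = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(encoded),
                       columns=encoded.columns, index=encoded.index)  # KNN imputation, k=1
imputed[numeric] = StandardScaler().fit_transform(imputed[numeric])   # standard scaling

adult_train = imputed.loc["train"].assign(income=income.loc["train"].values)
adult_test = imputed.loc["test"].assign(income=income.loc["test"].values)
```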
Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
- data/sample.csv: Sample dataset file.
- data/train.csv: Training dataset.
- data/test.csv: Testing dataset.
- scripts/preprocess.py: Script for preprocessing the dataset.
- scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:
import pandas as pd
df = pd.read_csv('data/sample.csv')
Run preprocessing:
python scripts/preprocess.py
See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
https://spdx.org/licenses/MIT.html
Pre-processing Python script to slice through a large genomics dataset and separate/filter it according to certain criteria.
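A generic sketch of how such slicing/filtering might look with chunked reads in pandas; the file name, column names, and filter criteria below are all hypothetical:

```python
import pandas as pd

chunks = pd.read_csv("variants.tsv", sep="\t", chunksize=100_000)
filtered = pd.concat(
    chunk[(chunk["chrom"] == "chr1") & (chunk["qual"] >= 30)]  # example criteria
    for chunk in chunks
)
filtered.to_csv("variants_chr1_filtered.tsv", sep="\t", index=False)
```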
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analyzing customers' characteristics and providing early warning of customer churn based on machine learning algorithms can help enterprises deliver targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization and other preprocessing operations were performed in Python on a dataset of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and AdaBoost, two classic ensemble learning models, were introduced, and an AdaBoost dual-ensemble learning model with RF as the base learner was put forward. These four models, together with four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbors (KNN), and Support Vector Machine (SVM)), were used to analyze the customer churn data. The results show that the first four models perform better in terms of recall, precision, F1 score and other indicators, and that the RF-AdaBoost dual-ensemble model performs best. The recall rates of the BPNN, RF, AdaBoost and RF-AdaBoost dual-ensemble models on positive samples are 79%, 90%, 89% and 93%, respectively; the precision rates are 97%, 99%, 98% and 99%; and the F1 scores are 87%, 95%, 94% and 96%. The three indicators of the RF-AdaBoost dual-ensemble model are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
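A hedged sketch of the RF-AdaBoost dual-ensemble idea with scikit-learn (version 1.2 or later for the `estimator` argument); the hyperparameters and the synthetic stand-in data are illustrative, not those used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stand-in imbalanced data; the study used ~900,000 preprocessed telecom customer records
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# AdaBoost with a Random Forest as the base learner (the "dual-ensemble" structure)
rf_adaboost = AdaBoostClassifier(
    estimator=RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0),
    n_estimators=10,
    random_state=0,
)
rf_adaboost.fit(X_tr, y_tr)
print(classification_report(y_te, rf_adaboost.predict(X_te)))
```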
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution, that handles image extraction, cell segmentation, and data preprocessing and incorporates machine-learning-enabled, multi-scaled, spatial analysis, operated through a user-friendly and interactive interface.
The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and a tonsillitis sample that were acquired with the Akoya PhenoCycler-Fusion platform. The dataset can be used to test the workflow and establish it on a user's system, or to familiarize oneself with the pipeline.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
| Language   | Number of Samples |
|------------|-------------------|
| Java       | 153,119 |
| Ruby       | 233,710 |
| Go         | 137,998 |
| JavaScript | 373,598 |
| Python     | 472,469 |
| PHP        | 294,394 |
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is a beginner-friendly dataset that is easy to understand and a good place to start anyone's journey. Students of machine learning and data analytics can use it to get familiar with the basic Python libraries.
To keep it simple, only four basic columns are included, giving quick information about top-rated movies from around the world. This dataset can help build concepts of preprocessing and handling data before applying any mathematical models.
The visualizations also give a comprehensive overview of popular movies.
Let's get started. Happy coding!
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the complete Python codebase for the study:
An Empirical Study of ChatGPT-4o Use in Engineering Education: Prompting and Performance
The code includes data preprocessing, metric calculations, and machine learning models used in the study.
The repository also includes the original dataset used in the paper:
An Empirical Research Study of ChatGPT-4o in Engineering Education.xlsx
This dataset contains anonymized logs of AI interactions, written assignments, grades, and computed metrics used for analysis.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Climate forecasts, both experimental and operational, are often made by calibrating Global Climate Model (GCM) outputs with observed climate variables using statistical and machine learning models. Often, machine learning techniques are applied to gridded data independently at each gridpoint. However, the implementation of these gridpoint-wise operations is a significant barrier to entry to climate data science. Unfortunately, there is a significant disconnect between the Python data science ecosystem and the gridded earth data ecosystem. Traditional Python data science tools are not designed to be used with gridded datasets, like those commonly used in climate forecasting. Heavy data preprocessing is needed: gridded data must be aggregated, reshaped, or reduced in dimensionality in order to fit the strict formatting requirements of Python's data science tools. Efficiently implementing this gridpoint-wise workflow is a time-consuming logistical burden which presents a high barrier to entry to earth data science. A set of high-performance, easy-to-use Python climate forecasting tools is needed to bridge the gap between Python's data science ecosystem and its gridded earth data ecosystem. XCast, an Xarray-based climate forecasting Python library developed by the authors, bridges this gap. XCast wraps underlying two-dimensional data science methods, like those of Scikit-Learn, with data structures that allow them to be applied to each gridpoint independently. XCast uses high-performance computing libraries to efficiently parallelize the gridpoint-wise application of data science utilities and make Python's traditional data science toolkits compatible with multidimensional gridded data. XCast also implements a diverse set of climate forecasting tools including traditional statistical methods, state-of-the-art machine learning approaches, preprocessing functionality (regridding, rescaling, smoothing), and postprocessing modules (cross validation, forecast verification, visualization). These tools are useful for producing and analyzing both experimental and operational climate forecasts. In this study, we describe the development of XCast, and present in-depth technical details on how XCast brings highly parallelized gridpoint-wise versions of traditional Python data science tools into Python's gridded earth data ecosystem. We also demonstrate a case study where XCast was used to generate experimental real-time deterministic and probabilistic forecasts for South Asian Summer Monsoon Rainfall in 2022 using different machine learning-based multi-model ensembles.
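A minimal illustration of the gridpoint-wise pattern the abstract describes (this is plain xarray + scikit-learn, not XCast's own API): an independent model is fitted at every (lat, lon) cell of a gridded dataset.

```python
import numpy as np
import xarray as xr
from sklearn.linear_model import LinearRegression

# Synthetic gridded predictor/predictand with dimensions (time, lat, lon)
rng = np.random.default_rng(0)
X = xr.DataArray(rng.normal(size=(40, 10, 20)), dims=("time", "lat", "lon"))
y = xr.DataArray(2.0 * X.values + rng.normal(size=(40, 10, 20)), dims=("time", "lat", "lon"))

def fit_predict_1d(x_series, y_series):
    """Fit a simple regression at one gridpoint and return in-sample predictions."""
    model = LinearRegression().fit(x_series.reshape(-1, 1), y_series)
    return model.predict(x_series.reshape(-1, 1))

# apply_ufunc vectorizes the per-gridpoint fit over every lat/lon cell
pred = xr.apply_ufunc(
    fit_predict_1d, X, y,
    input_core_dims=[["time"], ["time"]],
    output_core_dims=[["time"]],
    vectorize=True,
)
print(pred.dims, pred.shape)
```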
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension. All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling. The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability. All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
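A condensed sketch of the pipeline described above; the file name "wdbc.csv" and the "id"/"diagnosis" column names are assumptions based on the common Kaggle release, and hyperparameters are scikit-learn defaults rather than those of the accompanying script:

```python
import pandas as pd
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load, drop the identifier (and any empty trailing column), encode labels, z-score the features
df = pd.read_csv("wdbc.csv").dropna(axis=1, how="all").drop(columns=["id"])
y = df["diagnosis"].map({"M": 1, "B": 0})
X = StandardScaler().fit_transform(df.drop(columns=["diagnosis"]))

# Stratified 80/20 split preserves the malignant/benign class balance
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Perceptron": Perceptron(random_state=42),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f}  f1={f1_score(y_te, pred):.3f}")
```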