Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by JABERI Mohamed Habib
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about books. It has 1 row and is filtered where the book is Natural language processing : Python and NLTK : learning path : learn to build expert NLP and machine learning projects using NLTK and other Python libraries. It features 7 columns including author, publication date, language, and book publisher.
In this exercise, we'll merge the details of students from two datasets, namely student.csv and marks.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed; the marks.csv dataset contains columns such as Mark and City. The Student_id column is common to both datasets. Follow these steps to complete this exercise.
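A minimal sketch of the merge step with pandas, assuming both CSV files are in the working directory and share the Student_id key as described:

    import pandas as pd

    # Load the two datasets (paths assumed)
    student = pd.read_csv('student.csv')
    marks = pd.read_csv('marks.csv')

    # Join on the shared Student_id column
    merged = pd.merge(student, marks, on='Student_id')
    print(merged.head())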
Information about data sources is available. Some downloading scripts are included in the provided code; however, users should make sure to comply with the data providers' terms and conditions. Given the changing download options of the different institutions, the above links may not work permanently, and data may have to be retrieved by the user of this dataset. No quality control is applied in the provided preprocessing software; quality control is up to the user of the datasets. Some datasets are quality controlled by the owner.
Acknowledgements: We thank all the data providers for making the data publicly available or providing them upon request. Full acknowledgements can be found in Gerber et al., submitted.
References: Amory, C. (2020). "Drifting-snow statistics from multiple-year autonomous measurements in Adélie Land, East Antarctica". The Cryosphere, 14, 1713–1725. doi: 10.5194/tc-14-1713-2020. Gerber, F., Sharma, V. and Lehning, M.: CRYOWRF - a validation and the effect of blowing snow on the Antarctic SMB, JGR - Atmospheres, submitted.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.
The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.
The preprocessing steps include:
One-hot encoding of categorical values
Imputation of missing values using a KNN imputer with k=1
Standard scaling of ordinal attributes
Note: we assume a scenario in which the test set is available before training (every attribute besides the target, "income"); we therefore combine the train and test sets before preprocessing.
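A minimal sketch of these steps with scikit-learn, assuming hypothetical raw file names and an illustrative subset of ordinal columns (the actual notebook may differ):

    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import StandardScaler

    # Combine train and test before preprocessing, as noted above (paths assumed)
    train = pd.read_csv('adult_train_raw.csv')
    test = pd.read_csv('adult_test_raw.csv')
    combined = pd.concat([train, test], keys=['train', 'test'])

    # One-hot encode categorical columns (target "income" excluded)
    features = pd.get_dummies(combined.drop(columns=['income']))

    # Impute missing values with a KNN imputer, k=1
    imputed = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(features),
                           index=features.index, columns=features.columns)

    # Standard-scale the ordinal attributes (column names are an assumed subset)
    ordinal_cols = ['age', 'education-num']
    imputed[ordinal_cols] = StandardScaler().fit_transform(imputed[ordinal_cols])

    # Split back into train and test
    adult_train = imputed.loc['train']
    adult_test = imputed.loc['test']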
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
| Language   | Number of Samples |
| ---------- | ----------------- |
| Java       | 153,119           |
| Ruby       | 233,710           |
| Go         | 137,998           |
| JavaScript | 373,598           |
| Python     | 472,469           |
| PHP        | 294,394           |
Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
data/sample.csv: Sample dataset file.
data/train.csv: Training dataset.
data/test.csv: Testing dataset.
scripts/preprocess.py: Script for preprocessing the dataset.
scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:

    import pandas as pd
    df = pd.read_csv('data/sample.csv')
Run preprocessing:

    python scripts/preprocess.py

… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain a set of news articles in English, French, and Spanish extracted from Medisys (i.e., advanced search) according to the following criteria: (1) Keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) Keywords (all words): masque (French), mask (English), máscara (Spanish); (3) Periods: March 2020, May 2020, July 2020; (4) Countries: UK (English), Spain (Spanish), France (French). A corpus per country was manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th, and 20th of each month) were built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz], and terms extracted with different ranking measures (i.e., C-Value, F-TFIDF-C_M) and methods (i.e., extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
This dataset can be used to run the course on image processing with Python available here: https://github.com/guiwitz/neubias_academy_biapy It combines microscopy images from different publicly available sources. All files are either in the Public Domain (PD) or released with a CC-BY license. The list of the original location of the data as well as their licenses can be found in the LICENSE file.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects. It has 2 rows and is filtered where the book is Mastering natural language processing with Python. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our dataset comprises 1,000 synthetic tweets generated with the Python programming language and stored in a CSV file. The random module was used to generate random IDs and text, the faker module was used to generate random user names and dates, and the textblob module was used to assign a sentiment to each tweet.
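A minimal sketch of how such a CSV could be generated with the modules named above (the file name, column layout, and placeholder text generation are assumptions, not the authors' exact script; sentiment is drawn at random here as a stand-in):

    import csv
    import random
    from faker import Faker

    fake = Faker()
    sentiments = ['positive', 'negative', 'neutral']

    with open('tweets.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'user', 'date', 'text', 'sentiment'])
        for _ in range(1000):
            writer.writerow([
                random.randint(10**17, 10**18 - 1),        # random tweet ID
                fake.user_name(),                          # random user name
                fake.date_between('-1y', 'today'),         # random date
                ' '.join(random.choices(                   # placeholder random text
                    ['covid', 'python', 'data', 'news', 'today'], k=8)),
                random.choice(sentiments),                 # assigned sentiment
            ])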
This systematic approach ensures that the dataset is well-balanced and represents different types of tweets, user behavior, and sentiment. It is essential to have a balanced dataset to ensure that the analysis and visualization of the dataset are accurate and reliable. By generating tweets with a range of sentiments, we have created a diverse dataset that can be used to analyze and visualize sentiment trends and patterns.
In addition to generating the tweets, we have also prepared a visual representation of the data sets. This visualization provides an overview of the key features of the dataset, such as the frequency distribution of the different sentiment categories, the distribution of tweets over time, and the user names associated with the tweets. This visualization will aid in the initial exploration of the dataset and enable us to identify any patterns or trends that may be present.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
input_prompts.csv provides the inputs for the ChatGPT API (countries and their respective prompts).
topic_consolidations.csv contains the 4,018 unique topics listed across all ChatGPT responses to prompts in our study and their corresponding cluster labels after applying K-means++ clustering (n = 50) via natural language processing with Bidirectional Encoder Representations from Transformers (BERT). ChatGPT response topics come from both versions (3.5 and 4) over 10 iterations each (per country).
ChatGPT_prompt_automation.ipynb is the Jupyter notebook of Python code used to run the API to prompt ChatGPT and gather responses.
topic_consolidation_BERT.ipynb is the Jupyter notebook of Python code used to process the 4,018 unique topics gathered through BERT NLP. This code was adapted from Vimal Pillar on Kaggle (https://www.kaggle.com/code/vimalpillai/text-clustering-with-sentence-bert).
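A minimal sketch of the topic-clustering step, assuming a sentence-transformers BERT encoder (the model name is an assumption; the notebook's exact setup may differ):

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    # Placeholder list; the study clusters 4,018 unique topics
    # (n_clusters=50 requires at least 50 topics)
    topics = ['healthcare access', 'air pollution']

    # Embed each topic with a BERT-based sentence encoder (model name assumed)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(topics)

    # K-means++ clustering with n = 50, as described above
    kmeans = KMeans(n_clusters=50, init='k-means++', n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)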
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Public "Titanic" dataset for data exploration, preprocessing and benchmarking basic classification/regression models.
GitHub: https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv
Playground for visualizations, preprocessing, feature engineering, model pipelining, and more.
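A minimal loading sketch with pandas, assuming the raw-file URL derived from the GitHub link above:

    import pandas as pd

    # Raw-file URL corresponding to the GitHub blob link above (assumed)
    url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
    titanic = pd.read_csv(url)
    print(titanic.head())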
Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise from software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing, and incorporates machine-learning-enabled, multi-scale spatial analysis, operated through a user-friendly and interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and a tonsillitis sample that were acquired with the Akoya PhenoCycler-Fusion platform. The dataset can be used to test the workflow and establish it on a user's system, or to familiarize oneself with the pipeline.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Python scripts to strip speaker names, host contributions, and non-dialogue content from interview transcripts and convert them from PDF to .txt files.
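A minimal sketch of such a conversion, assuming the pypdf library and a "Speaker:" label format; the original scripts are not shown here and may differ:

    import re
    from pypdf import PdfReader

    def pdf_to_clean_txt(pdf_path, txt_path):
        # Extract raw text from every page of the PDF
        reader = PdfReader(pdf_path)
        text = '\n'.join(page.extract_text() or '' for page in reader.pages)
        # Drop leading speaker labels such as "INTERVIEWER:" (format assumed)
        text = re.sub(r'^[A-Z][\w .-]*:\s*', '', text, flags=re.MULTILINE)
        with open(txt_path, 'w', encoding='utf-8') as f:
            f.write(text)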
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The Virginia Tech Natural Motion Dataset contains 40 hours of unscripted human motion (full body kinematics) collected in the open world using an XSens MVN Link system. In total, there are data from 17 participants (13 participants on a college campus and 4 at a home improvement store). Participants did a wide variety of activities, including: walking from one place to another; operating machinery; talking with others; manipulating objects; working at a desk; driving; eating; pushing/pulling carts and dollies; physical exercises such as jumping jacks, jogging, and pushups; sweeping; vacuuming; and emptying a dishwasher. The code for analyzing the data is freely available with this dataset and also at: https://github.com/ARLab-VT/VT-Natural-Motion-Processing. The portion of the dataset involving workers was funded by Lowe's, Inc.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has evolved.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Augsburg data set and Berlin data set for multimodal classification. These are public data sets; the download addresses are provided in the related research articles, and you can download them by following the link addresses. Each data set is preprocessed into training data, test data, and ground-truth label data, and the label data can be divided into training labels and test labels. The preprocessing is implemented as a Python preprocessor script.
Augsburg data set: contains HS data, SAR data, and DSM data, divided into training set, test set, and ground-truth label data.
Berlin data set: contains HS data and SAR data, divided into training set, test set, and ground-truth label data.
Python script that was used to postprocess and plot the raw data of the simulation of the Brusselator system and visualise the spiral wave formation. Created on 3-Sep-2022.