99 datasets found
  1. ML Preprocessing Dataset for Python

    • kaggle.com
    Updated Sep 26, 2024
    Cite
    JABERI Mohamed Habib (2024). ML Preprocessing Dataset for Python [Dataset]. https://www.kaggle.com/datasets/jaberimohamedhabib/ml-preprocessing-dataset-for-python/suggestions
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    JABERI Mohamed Habib
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by JABERI Mohamed Habib

    Released under Apache 2.0

    Contents

  2. Dataset of books called Natural language processing : Python and NLTK :...

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books called Natural language processing : Python and NLTK : learning path : learn to build expert NLP and machine learning projects using NLTK and other Python libraries [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Natural+language+processing+%3A+Python+and+NLTK+%3A+learning+path+%3A+learn+to+build+expert+NLP+and+machine+learning+projects+using+NLTK+and+other+Python+libraries
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Natural language processing : Python and NLTK : learning path : learn to build expert NLP and machine learning projects using NLTK and other Python libraries. It features 7 columns including author, publication date, language, and book publisher.

  3. Data Pre-Processing : Data Integration

    • kaggle.com
    Updated Aug 2, 2022
    Cite
    Mr.Machine (2022). Data Pre-Processing : Data Integration [Dataset]. https://www.kaggle.com/datasets/ilayaraja07/data-preprocessing-data-integration
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 2, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mr.Machine
    Description

    In this exercise, we'll merge the details of students from two datasets, student.csv and marks.csv. The student dataset contains columns such as Age, Gender, Grade, and Employed; the marks.csv dataset contains columns such as Mark and City. The Student_id column is common to both datasets. Follow these steps to complete the exercise.
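
    A minimal pandas sketch of this merge (file and column names come from the description above; exact headers may differ):

    import pandas as pd

    students = pd.read_csv("student.csv")   # Age, Gender, Grade, Employed, Student_id
    marks = pd.read_csv("marks.csv")        # Mark, City, Student_id

    # Join on the shared Student_id column; an inner join keeps only
    # students present in both files.
    merged = students.merge(marks, on="Student_id", how="inner")
    print(merged.head())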

  4. Preprocessing Antarctic Weather Station (AWS) data in python

    • b2find.eudat.eu
    Updated Dec 27, 2023
    Cite
    (2023). Preprocessing Antarctic Weather Station (AWS) data in python - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/d93b6b2b-b08f-55a1-9fb0-68c2971701ae
    Explore at:
    Dataset updated
    Dec 27, 2023
    Area covered
    Antarctica
    Description

    Information about data sources is available, and some downloading scripts are included in the provided code; however, users should make sure to comply with the data providers' terms and conditions. Because the download options of the different institutions change, the links above may not work permanently, and data may have to be retrieved by the user of this dataset. No quality control is applied in the provided preprocessing software; quality control is up to the user of the datasets, although some datasets are quality-controlled by their owners.

    Acknowledgements: We thank all the data providers for making the data publicly available or providing them upon request. Full acknowledgements can be found in Gerber et al., submitted.

    References: Amory, C. (2020). "Drifting-snow statistics from multiple-year autonomous measurements in Adélie Land, East Antarctica". The Cryosphere, 1713–1725. doi: 10.5194/tc-14-1713-2020. Gerber, F., Sharma, V. and Lehning, M.: CRYOWRF - a validation and the effect of blowing snow on the Antarctic SMB, JGR - Atmospheres, submitted.

  5. VegeNet - Image datasets and Codes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 27, 2022
    Cite
    Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jo Yen Tan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).

    Image datasets:

    1. vege_original : Images of vegetables captured manually in data acquisition stage
    2. vege_cropped_renamed : Images in (1) cropped to remove background areas and image labels renamed
    3. non-vege images : Images of non-vegetable foods for CNN network to recognize other-than-vegetable foods
    4. food_image_dataset : Complete set of vege (2) and non-vege (3) images for architecture building.
    5. food_image_dataset_split : Image dataset (4) split into train and test sets (see the split sketch after this list)
    6. process : Images created when cropping (pre-processing step) to create dataset (2).
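
    A minimal sketch of the train/test split in step 5, assuming a class-per-folder layout and .jpg files (illustrative only, not the authors' code):

    import random
    import shutil
    from pathlib import Path

    src = Path("food_image_dataset")        # assumed folder name, per the list above
    dst = Path("food_image_dataset_split")
    random.seed(0)

    for cls_dir in src.iterdir():
        if not cls_dir.is_dir():
            continue
        images = sorted(cls_dir.glob("*.jpg"))
        random.shuffle(images)
        cut = int(0.8 * len(images))        # assumed 80/20 split ratio
        for split, subset in (("train", images[:cut]), ("test", images[cut:])):
            out = dst / split / cls_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for img in subset:
                shutil.copy(img, out / img.name)
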
  6. Adult dataset preprocessed

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2024
    Cite
    Pustozerova, Anastasia (2024). Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
    Explore at:
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Schuster, Verena
    Pustozerova, Anastasia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the UCI repository.

    The file "adult_preprocessing.ipynb" is a Python notebook with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

    The preprocessing steps include:

    One-hot-encoding of categorical values

    Imputation of missing values using knn-imputer with k=1

    Standard scaling of ordinal attributes

    Note: we assume a scenario in which the test set is available before training (every attribute except the target, "income"), so we combine the train and test sets before preprocessing; a sketch of these steps follows.
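
    A minimal sketch of the described steps (not the notebook itself); loading the original Adult data via OpenML is an assumption:

    import pandas as pd
    from sklearn.datasets import fetch_openml
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import StandardScaler

    # Assumption: obtain the original Adult data from OpenML; the notebook in
    # this dataset may load it differently. The OpenML target column is "class"
    # (called "income" in the description above).
    adult = fetch_openml("adult", version=2, as_frame=True).frame
    y = adult.pop("class")

    X = pd.get_dummies(adult, dtype=float)               # one-hot-encode categoricals
    X = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(X),
                     columns=X.columns)                  # impute missing values, k = 1
    X = pd.DataFrame(StandardScaler().fit_transform(X),
                     columns=X.columns)                  # standard scaling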

  7. CommitBench

    • zenodo.org
    csv, json
    Updated Feb 14, 2024
    Cite
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo (2024). CommitBench [Dataset]. http://doi.org/10.5281/zenodo.10497442
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maximilian Schall; Tamara Czinczoll; Gerard de Melo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Dec 15, 2023
    Description

    Data Statement for CommitBench

    - Dataset Title: CommitBench
    - Dataset Curator: Maximilian Schall, Tamara Czinczoll, Gerard de Melo
    - Dataset Version: 1.0, 15.12.2023
    - Data Statement Author: Maximilian Schall, Tamara Czinczoll
    - Data Statement Version: 1.0, 16.01.2023

    EXECUTIVE SUMMARY

    We provide CommitBench as an open-source, reproducible, privacy- and license-aware benchmark for commit message generation. The dataset is gathered from GitHub repositories with licenses that permit redistribution. We provide six programming languages: Java, Python, Go, JavaScript, PHP and Ruby. The commit messages in natural language are restricted to English, as it is the working language in many software development projects. The dataset has 1,664,590 examples that were generated by using extensive quality-focused filtering techniques (e.g. excluding bot commits). Additionally, we provide a version with longer sequences for benchmarking models with more extended sequence input, as well as a version with

    CURATION RATIONALE

    We created this dataset due to quality and legal issues with previous commit message generation datasets. Given a git diff displaying code changes between two file versions, the task is to predict the accompanying commit message describing these changes in natural language. We base our GitHub repository selection on that of a previous dataset, CodeSearchNet, but apply a large number of filtering techniques to improve the data quality and eliminate noise. Due to the original repository selection, we are also restricted to the aforementioned programming languages. It was important to us, however, to provide several programming languages to accommodate any changes in the task due to the degree of hardware-relatedness of a language. The dataset is provided as a large CSV file containing all samples, with the following fields: Diff, Commit Message, Hash, Project, Split.
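
    A short pandas sketch for working with the released CSV (the file name here is hypothetical; the field names follow the description above):

    import pandas as pd

    df = pd.read_csv("commitbench.csv")
    train = df[df["Split"] == "train"]           # select the training split
    pairs = train[["Diff", "Commit Message"]]    # model input and target text
    print(len(pairs), "training examples")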

    DOCUMENTATION FOR SOURCE DATASETS

    Repository selection based on CodeSearchNet, which can be found under https://github.com/github/CodeSearchNet

    LANGUAGE VARIETIES

    Since GitHub hosts software projects from all over the world, there is no single uniform variety of English used across all commit messages. This means that phrasing can be regional or subject to influences from the programmer's native language. It also means that different spelling conventions may co-exist and that different terms may be used for the same concept. Any model trained on this data should take these factors into account. For the number of samples for the different programming languages, see the table below:

    Language      Number of Samples
    Java          153,119
    Ruby          233,710
    Go            137,998
    JavaScript    373,598
    Python        472,469
    PHP           294,394

    SPEAKER DEMOGRAPHIC

    Due to the extremely diverse (geographically, but also socio-economically) backgrounds of the software development community, there is no single demographic the data comes from. Of course, this does not entail that there are no biases when it comes to the data origin. Globally, the average software developer tends to be male and has obtained higher education. Due to the anonymous nature of GitHub profiles, gender distribution information cannot be extracted.

    ANNOTATOR DEMOGRAPHIC

    Due to the automated generation of the dataset, no annotators were used.

    SPEECH SITUATION AND CHARACTERISTICS

    The public nature and often business-related creation of the data by the original GitHub users fosters a more neutral, information-focused and formal language. As it is not uncommon for developers to find the writing of commit messages tedious, there can also be commit messages representing the frustration or boredom of the commit author. While our filtering is supposed to catch these types of messages, there can be some instances still in the dataset.

    PREPROCESSING AND DATA FORMATTING

    See paper for all preprocessing steps. We do not provide the un-processed raw data due to privacy concerns, but it can be obtained via CodeSearchNet or requested from the authors.

    CAPTURE QUALITY

    While our dataset is completely reproducible at the time of writing, there are external dependencies that could restrict this. If GitHub shuts down, or if someone with a software project in the dataset deletes their repository, some instances may become non-reproducible.

    LIMITATIONS

    While our filters are meant to ensure a high quality for each data sample in the dataset, we cannot ensure that only low-quality examples were removed. Similarly, we cannot guarantee that our extensive filtering methods catch all low-quality examples; some might remain in the dataset. Another limitation of our dataset is the low number of programming languages (there are many more), as well as our focus on English commit messages. There might be some people who only write commit messages in their respective languages, e.g., because the organization they work at has established this or because they do not speak English (confidently enough). Perhaps some languages' syntax better aligns with that of programming languages. These effects cannot be investigated with CommitBench.

    Although we anonymize the data as far as possible, the required information for reproducibility, including the organization, project name, and project hash, makes it possible to refer back to the original authoring user account, since this information is freely available in the original repository on GitHub.

    METADATA

    License: Dataset under the CC BY-NC 4.0 license

    DISCLOSURES AND ETHICAL REVIEW

    While we put substantial effort into removing privacy-sensitive information, our solutions cannot find 100% of such cases. This means that researchers and anyone using the data need to incorporate their own safeguards to effectively reduce the amount of personal information that can be exposed.

    ABOUT THIS DOCUMENT

    A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

    This data statement was written based on the template for the Data Statements Version 2 schema. The template was prepared by Angelina McMillan-Major, Emily M. Bender, and Batya Friedman and can be found at https://techpolicylab.uw.edu/data-statements/. It was updated from the community Version 1 Markdown template by Leon Derczynski.

  8. warvan-ml-dataset

    • huggingface.co
    Cite
    warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset
    Explore at:
    Authors
    warvan
    Description

    Dataset Name

    This dataset contains structured data for machine learning and analysis purposes.

    Contents

    data/sample.csv: Sample dataset file.
    data/train.csv: Training dataset.
    data/test.csv: Testing dataset.
    scripts/preprocess.py: Script for preprocessing the dataset.
    scripts/analyze.py: Script for data analysis.

    Usage

    Load the dataset using Pandas:

    import pandas as pd
    df = pd.read_csv('data/sample.csv')

    Run preprocessing:

    python scripts/preprocess.py

    … See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.

  9. Data from: COVID-19 and media dataset: Mining textual data according periods...

    • dataverse.cirad.fr
    application/x-gzip +1
    Updated Dec 21, 2020
    Cite
    Mathieu Roche (2020). COVID-19 and media dataset: Mining textual data according periods and countries (UK, Spain, France) [Dataset]. http://doi.org/10.18167/DVN1/ZUA8MF
    Explore at:
    Available download formats: application/x-gzip (511157), application/x-gzip (97349), text/x-perl-script (4982), application/x-gzip (93110), application/x-gzip (23765310), application/x-gzip (107669)
    Dataset updated
    Dec 21, 2020
    Authors
    Mathieu Roche
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Spain, United Kingdom, France
    Dataset funded by
    ANR (#DigitAg)
    Horizon 2020 - European Commission - (MOOD project)
    Description

    These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e. advanced search) according to the following criteria: (1) keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) keywords (all words): masque (French), mask (English), máscara (Spanish); (3) periods: March 2020, May 2020, July 2020; (4) countries: UK (English), Spain (Spanish), France (French). A corpus per country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) are built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz], and terms extracted with different ranking measures (i.e. C-Value, F-TFIDF-C_M) and methods (i.e. extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].

  10. Dataset for interactive course on BioImage Analysis with Python (BIAPy)

    • explore.openaire.eu
    Updated May 5, 2020
    Cite
    Guillaume Witz (2020). Dataset for interactive course on BioImage Analysis with Python (BIAPy) [Dataset]. http://doi.org/10.5281/zenodo.3786306
    Explore at:
    Dataset updated
    May 5, 2020
    Authors
    Guillaume Witz
    Description

    This dataset can be used to run the course on image processing with Python available here: https://github.com/guiwitz/neubias_academy_biapy. It combines microscopy images from different publicly available sources. All files are either in the Public Domain (PD) or released with a CC-BY license. The list of the original locations of the data as well as their licenses can be found in the LICENSE file.

  11. Dataset of book subjects that contain Mastering natural language processing...

    • workwithdata.com
    Updated Nov 7, 2024
    Cite
    Work With Data (2024). Dataset of book subjects that contain Mastering natural language processing with Python [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Mastering+natural+language+processing+with+Python&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 2 rows and is filtered where the book is Mastering natural language processing with Python. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  12. Dataset for twitter Sentiment Analysis using Roberta and Vader

    • data.mendeley.com
    Updated May 14, 2023
    Cite
    Jannatul Ferdoshi (2023). Dataset for twitter Sentiment Analysis using Roberta and Vader [Dataset]. http://doi.org/10.17632/2sjt22sb55.1
    Explore at:
    Dataset updated
    May 14, 2023
    Authors
    Jannatul Ferdoshi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our dataset comprises 1,000 tweets generated with the Python programming language and stored in a CSV file. The random module was used to generate random IDs and text, the faker module to generate random user names and dates, and the textblob module to assign a sentiment to each tweet.
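
    An illustrative reconstruction of this kind of generation script, under stated assumptions (this is not the authors' code; here textblob scores the generated text rather than sampling a label):

    import csv
    import random
    from faker import Faker
    from textblob import TextBlob

    fake = Faker()
    random.seed(42)

    with open("tweets.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "user", "date", "text", "sentiment"])
        for _ in range(1000):
            text = fake.sentence()
            polarity = TextBlob(text).sentiment.polarity   # in [-1, 1]
            label = ("positive" if polarity > 0
                     else "negative" if polarity < 0 else "neutral")
            writer.writerow([random.randint(10**9, 10**10 - 1),
                             fake.user_name(), fake.date(), text, label])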

    This systematic approach ensures that the dataset is well-balanced and represents different types of tweets, user behavior, and sentiment. It is essential to have a balanced dataset to ensure that the analysis and visualization of the dataset are accurate and reliable. By generating tweets with a range of sentiments, we have created a diverse dataset that can be used to analyze and visualize sentiment trends and patterns.

    In addition to generating the tweets, we have also prepared a visual representation of the data sets. This visualization provides an overview of the key features of the dataset, such as the frequency distribution of the different sentiment categories, the distribution of tweets over time, and the user names associated with the tweets. This visualization will aid in the initial exploration of the dataset and enable us to identify any patterns or trends that may be present.

  13. ChatGPT API and BERT NLP

    • figshare.com
    application/csv
    Updated Mar 13, 2024
    Cite
    Carmen Atkins (2024). ChatGPT API and BERT NLP [Dataset]. http://doi.org/10.6084/m9.figshare.25403407.v2
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Carmen Atkins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    input_prompts.csv provides the inputs for the ChatGPT API (countries and their respective prompts).

    topic_consolidations.csv contains the 4,018 unique topics listed across all ChatGPT responses to prompts in our study and their corresponding cluster labels after applying K-means++ clustering (n = 50) via natural language processing with Bidirectional Encoder Representations from Transformers (BERT). ChatGPT response topics come from both versions (3.5 and 4) over 10 iterations each (per country).

    ChatGPT_prompt_automation.ipynb is the Jupyter notebook of Python code used to run the API to prompt ChatGPT and gather responses.

    topic_consolidation_BERT.ipynb is the Jupyter notebook of Python code used to process the 4,018 unique topics gathered through BERT NLP. This code was adapted from Vimal Pillar on Kaggle (https://www.kaggle.com/code/vimalpillai/text-clustering-with-sentence-bert).
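
    A sketch of the clustering step under stated assumptions (the notebooks above are authoritative; the encoder model and the column name "topic" are guesses):

    import pandas as pd
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    topics = pd.read_csv("topic_consolidations.csv")["topic"].tolist()
    model = SentenceTransformer("all-MiniLM-L6-v2")   # an assumed BERT-based encoder
    embeddings = model.encode(topics)

    # K-means++ initialization with n = 50 clusters, as in the description.
    labels = KMeans(n_clusters=50, init="k-means++", n_init=10,
                    random_state=0).fit_predict(embeddings)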

  14. Titanic data for Data Preprocessing

    • kaggle.com
    Updated Oct 28, 2021
    Cite
    Akshay Sehgal (2021). Titanic data for Data Preprocessing [Dataset]. https://www.kaggle.com/akshaysehgal/titanic-data-for-data-preprocessing/code
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Oct 28, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Akshay Sehgal
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Public "Titanic" dataset for data exploration, preprocessing and benchmarking basic classification/regression models.

    Columns

    • 'survived'
    • 'pclass'
    • 'sex'
    • 'age'
    • 'sibsp'
    • 'parch'
    • 'fare'
    • 'embarked'
    • 'class'
    • 'who'
    • 'adult_male'
    • 'deck'
    • 'embark_town'
    • 'alive'
    • 'alone'

    Acknowledgements

    Github: https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv
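
    A quick-start sketch (assumptions: the raw CSV URL below mirrors the linked file, and a few numeric columns suffice for a baseline):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv"
    df = pd.read_csv(url)

    X = df[["pclass", "age", "sibsp", "parch", "fare"]]
    X = X.fillna(X.median())                 # simple imputation for the baseline
    y = df["survived"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"test accuracy: {clf.score(X_test, y_test):.3f}")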

    Inspiration

    Playground for visualizations, preprocessing, feature engineering, model pipelining, and more.

  15. Demo dataset for: SPACEc, a streamlined, interactive Python workflow for...

    • datadryad.org
    • zenodo.org
    zip
    Updated Jul 8, 2024
    Cite
    Yuqi Tan; Tim Kempchen (2024). Demo dataset for: SPACEc, a streamlined, interactive Python workflow for multiplexed image processing and analysis [Dataset]. http://doi.org/10.5061/dryad.brv15dvj1
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Dryad
    Authors
    Yuqi Tan; Tim Kempchen
    Time period covered
    Jun 28, 2024
    Description

    Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing, and incorporates machine-learning-enabled, multi-scale spatial analysis operated through a user-friendly, interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenocyclerFusion platform. The dataset can be used to test the workflow and establish it on a user's system, or to familiarize oneself with the pipeline.

  16. Python scripts for Song Exploder transcript pre-processing

    • figshare.com
    Updated Jul 30, 2025
    Cite
    Robin Dresel (2025). Python scripts for Song Exploder transcript pre-processing [Dataset]. http://doi.org/10.6084/m9.figshare.29484959.v1
    Explore at:
    Available download formats: text/x-script.python
    Dataset updated
    Jul 30, 2025
    Dataset provided by
    figshare
    Authors
    Robin Dresel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Python scripts to strip speaker names, host contributions, and non-dialogue content from interview transcripts, and to convert them from PDF to .txt files.
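
    A hedged sketch of this kind of pipeline (not the published scripts; the folder name and the "NAME:" speaker-tag format are assumptions):

    import re
    from pathlib import Path
    from pypdf import PdfReader

    SPEAKER = re.compile(r"^[A-Z][A-Z .'-]+:\s*")    # assumed speaker-tag format

    for pdf in Path("transcripts").glob("*.pdf"):
        text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
        # Drop blank lines and parenthetical stage directions; strip speaker tags.
        kept = [SPEAKER.sub("", line) for line in text.splitlines()
                if line.strip() and not line.lstrip().startswith("(")]
        Path(pdf.stem + ".txt").write_text("\n".join(kept))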

  17. Virginia Tech Natural Motion Dataset

    • data.lib.vt.edu
    xlsx
    Updated Jun 3, 2021
    Cite
    Jack Geissinger; Alan Asbeck; Mohammad Mehdi Alemi; S. Emily Chang (2021). Virginia Tech Natural Motion Dataset [Dataset]. http://doi.org/10.7294/2v3w-sb92
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Jun 3, 2021
    Dataset provided by
    University Libraries, Virginia Tech
    Authors
    Jack Geissinger; Alan Asbeck; Mohammad Mehdi Alemi; S. Emily Chang
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Area covered
    Virginia
    Description

    The Virginia Tech Natural Motion Dataset contains 40 hours of unscripted human motion (full body kinematics) collected in the open world using an XSens MVN Link system. In total, there are data from 17 participants (13 participants on a college campus and 4 at a home improvement store). Participants did a wide variety of activities, including: walking from one place to another; operating machinery; talking with others; manipulating objects; working at a desk; driving; eating; pushing/pulling carts and dollies; physical exercises such as jumping jacks, jogging, and pushups; sweeping; vacuuming; and emptying a dishwasher. The code for analyzing the data is freely available with this dataset and also at: https://github.com/ARLab-VT/VT-Natural-Motion-Processing. The portion of the dataset involving workers was funded by Lowe's, Inc.

  18. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Oct 29, 2024
    + more versions
    Cite
    Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14009758
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 10/29/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, and analysis, and for figure production, for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has evolved.

    Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    Code information:

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the
    `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and
    supporting analyses. This script generates the key figures and summary statistics
    used in the study that then get saved in the manuscript_figures folder. Note that all
    maps were produced using Python code found in the "supporting_code"" folder.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  19. Augsburg data set and Berlin data set for multimodal classification

    • figshare.com
    zip
    Updated Dec 31, 2024
    Cite
    huiqing wang (2024). Augsburg data set and Berlin data set for multimodal classification [Dataset]. http://doi.org/10.6084/m9.figshare.28112405.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 31, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    huiqing wang
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Augsburg data set and Berlin data set for multimodal classification. This is a public data set; the download address is provided in the related research articles, and the data can be downloaded by following the link there. Each data set is preprocessed into training data, test data, and ground-truth label data, with the label data further divided into training labels and test labels. The preprocessing is implemented in Python.

    Augsburg data set: contains HS data, SAR data, and DSM data, divided into training set, test set, and ground-truth label data.

    Berlin data set: contains HS data and SAR data, divided into training set, test set, and ground-truth label data.

  20. Python post-processing and plotting script for Zacros KMC simulations

    • b2find.eudat.eu
    Updated Sep 3, 2022
    Cite
    (2022). Python post-processing and plotting script for Zacros KMC simulations - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/777e276c-8fec-5ff5-9e88-6aa7eafaa6d6
    Explore at:
    Dataset updated
    Sep 3, 2022
    Description

    Python script that was used to postprocess and plot the raw data of the simulation of the Brusselator system and visualise the spiral wave formation. Created on 3-Sep-2022.
