100+ datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +1
    Updated Apr 24, 2025
    Cite: Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Available download formats: text/x-python, csv, bin
    Dataset updated: Apr 24, 2025
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Juliane Köhler
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as a CSV file.
    • ger_validation.csv – The German validation set as a CSV file.
    • en_test.csv – The English test set as a CSV file.
    • en_train.csv – The English training set as a CSV file.
    • en_validation.csv – The English validation set as a CSV file.
    • splitting.py – The Python code for splitting a dataset into train, test and validation sets (a sketch of such a split follows this list).
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The Python code for translating the cleaned dataset.
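    splitting.py itself is not reproduced in this listing; the following is only a minimal sketch of a comparable train/validation/test split with pandas and scikit-learn, using illustrative output file names:

```python
# Minimal sketch of a train/validation/test split of the kind splitting.py is
# described as performing; output file names here are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Cleaned_Dataset.csv")

# Hold out 20% for testing, then split the rest 75/25 into train/validation,
# i.e. a 60/20/20 split overall.
train_val, test = train_test_split(df, test_size=0.20, random_state=42)
train, validation = train_test_split(train_val, test_size=0.25, random_state=42)

train.to_csv("train.csv", index=False)
validation.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)
```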
  2. food data cleaning

    • kaggle.com
    zip
    Updated Apr 13, 2024
    Cite: AbdElRahman16 (2024). food data cleaning [Dataset]. https://www.kaggle.com/datasets/abdelrahman16/food-n
    Available download formats: zip (0 bytes)
    Dataset updated: Apr 13, 2024
    Authors: AbdElRahman16
    Description

    This dataset was created by AbdElRahman16.

  3. Clean Data.csv

    • figshare.com
    txt
    Updated Dec 3, 2023
    Cite: Zaid Hattab (2023). Clean Data.csv [Dataset]. http://doi.org/10.6084/m9.figshare.24718401.v1
    Available download formats: txt
    Dataset updated: Dec 3, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: Zaid Hattab
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    A subset of the Oregon Health Insurance Experiment (OHIE) contains 12,229 individuals who satisfied the inclusion criteria and who responded to the in-person survey by October 2010. It has been used to explore the heterogeneity of the effects of the lottery and the insurance on a number of outcomes.

  4. Tamazight-Clean-CSV

    • huggingface.co
    Updated Jul 5, 2025
    Cite: Ai (2025). Tamazight-Clean-CSV [Dataset]. https://huggingface.co/datasets/Datasmartly/Tamazight-Clean-CSV
    Dataset updated: Jul 5, 2025
    Authors: Ai
    Description

    The Datasmartly/Tamazight-Clean-CSV dataset is hosted on Hugging Face and contributed by the HF Datasets community; it can be loaded as shown below.
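    A minimal loading sketch with the Hugging Face datasets library:

```python
# Minimal sketch: load the dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("Datasmartly/Tamazight-Clean-CSV")
print(ds)  # prints the available splits and their column names
```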

  5. Data cleaning using unstructured data

    • zenodo.org
    zip
    Updated Jul 30, 2024
    Cite: Rihem Nasfi; Antoon Bronselaer (2024). Data cleaning using unstructured data [Dataset]. http://doi.org/10.5281/zenodo.13135983
    Available download formats: zip
    Dataset updated: Jul 30, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Rihem Nasfi; Antoon Bronselaer
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    In this project, we work on repairing three datasets:

    • Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register, and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.
    • Trials population: This dataset delineates the demographic origins of participants in clinical trials primarily conducted across European countries. It includes structured attributes indicating whether the trial pertains to a specific gender, age group or healthy volunteers; each of these categories is labeled '1' or '0', denoting whether or not it is included in the trial. Note that the population category should remain consistent across all countries conducting the same clinical trial, identified by its eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as the attribute inclusion.
    • Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of 'Alnatura' (access date: 24 November 2020), from 'Open Food Facts' (a free database of food products from around the world), and from the websites 'Migipedia', 'Piccantino', and 'Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code; samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by '2' if present, '1' if there are traces of it, and '0' if it is absent from a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

    N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets (a loading sketch follows this list):

    • "{dataset_name}_train.csv": samples used for the ML-model training. (e.g "allergens_train.csv")
    • "{dataset_name}_test.csv": samples used to test the the ML-model performance. (e.g "allergens_test.csv")
    • "{dataset_name}_golden_standard.csv": samples represent the ground truth of the test samples. (e.g "allergens_golden_standard.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used for the ML-model training. (e.g "allergens_parker_train.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine used to test the the ML-model performance. (e.g "allergens_parker_test.csv")
  6. 11 Benchmark Clean-Clean ER datasets in CSV format

    • zenodo.org
    application/gzip
    Updated Feb 26, 2025
    Cite: Anonymous (2025). 11 Benchmark Clean-Clean ER datasets in CSV format [Dataset]. http://doi.org/10.5281/zenodo.14923071
    Available download formats: application/gzip
    Dataset updated: Feb 26, 2025
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Anonymous
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    Contains:

    • D1: Contains restaurant descriptions, first introduced in OAEI 2010.
    • D2: Includes duplicate products from Abt.com and Buy.com.
    • D3: Matches product descriptions from Amazon and Google Base.
    • D4: Compares bibliographic data from DBLP and ACM.
    • D5, D6, D7: Contain descriptions of television shows and movies from TheTVDB, IMDb, and TMDb.
    • D8: Matches product descriptions from Walmart and Amazon.
    • D9: Involves bibliographic data from DBLP and Google Scholar.
    • D10: Links movie descriptions from IMDb and DBpedia.
    • D11: A large-scale dataset with millions of heterogeneous entities from two DBpedia versions spanning a 3-year gap.
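    The CSV schema of the individual datasets is not documented in this listing; purely as an illustration of the clean-clean ER task these benchmarks target, here is a minimal pairwise matcher over two hypothetical entity tables with "id" and "description" columns (real use would add blocking to avoid the quadratic comparison):

```python
# Illustrative clean-clean ER sketch: token-Jaccard similarity between two
# entity collections. File and column names are hypothetical; the benchmark
# CSVs' actual schema is not described in this listing.
import pandas as pd

def tokens(text: str) -> set:
    return set(str(text).lower().split())

a = pd.read_csv("collection_A.csv")  # hypothetical file name
b = pd.read_csv("collection_B.csv")  # hypothetical file name

matches = []
for _, ra in a.iterrows():
    ta = tokens(ra["description"])
    for _, rb in b.iterrows():
        tb = tokens(rb["description"])
        union = ta | tb
        score = len(ta & tb) / len(union) if union else 0.0
        if score >= 0.5:  # similarity threshold, tuned per dataset
            matches.append((ra["id"], rb["id"], score))
```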
  7. Spotify.csv - File modified for data cleaning

    • kaggle.com
    Updated Jul 14, 2020
    Cite: Vitoria Rodrigues Silva (2020). Spotify.csv - File modified for data cleaning [Dataset]. https://www.kaggle.com/datasets/vitoriarodrigues/spotifycsv-file-modified-for-data-cleaning/suggestions?status=pending&yourSuggestions=true
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Jul 14, 2020
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: Vitoria Rodrigues Silva
    Description

    This dataset was created by Vitoria Rodrigues Silva.

  8. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite: Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Jul 13, 2023
    Dataset provided by: Borealis
    Authors: Rong Luo
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information was derived automatically)

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  9. _labels1.csv. This data set represents the label of the corresponding...

    • figshare.com
    txt
    Updated Oct 9, 2023
    Cite: naillah gul (2023). _labels1.csv. This data set represents the label of the corresponding samples in data.csv file [Dataset]. http://doi.org/10.6084/m9.figshare.24270088.v1
    Available download formats: txt
    Dataset updated: Oct 9, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: naillah gul
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    The datasets contain pixel-level hyperspectral data for six snow and glacier classes, extracted from a hyperspectral image. The dataset "data.csv" has 5417 samples × 142 bands, belonging to the classes Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The dataset "_labels1.csv" has the corresponding labels for the "data.csv" file. The dataset "RGB.csv" covers the same 5417 samples but holds only three band values, while "data.csv" has 142 band values. A sketch for joining data and labels follows.
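```python
# Minimal sketch, assuming data.csv holds one 142-band sample per row with a
# header row, and _labels1.csv holds the corresponding class label per row,
# in the same order (both are assumptions about the file layout).
import pandas as pd

X = pd.read_csv("data.csv")
y = pd.read_csv("_labels1.csv")

df = X.assign(label=y.iloc[:, 0].values)  # attach the label column
print(df["label"].value_counts())         # distribution over the six classes
```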

  10. ToS;DR policies dataset (clean)

    • zenodo.org
    csv
    Updated May 5, 2025
    Cite: Mahmoud Istaiti (2025). ToS;DR policies dataset (clean) [Dataset]. http://doi.org/10.5281/zenodo.15013541
    Available download formats: csv
    Dataset updated: May 5, 2025
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Mahmoud Istaiti
    License: GNU GPL 3.0, https://www.gnu.org/licenses/gpl-3.0-standalone.html

    Description

    Overview

    This dataset contains two CSV files derived from Terms of Service; Didn't Read (ToS;DR) data. These files contain analyzed and categorized terms of service snippets from various online services after the cleaning process. The privacy dataset is a subset of the full dataset, focusing exclusively on privacy-related terms.

    File Descriptions

    1. clean_tosdr_all_data.csv

    • This file contains a comprehensive collection of terms of service data.
    • Each row represents a statement (or "point") extracted from a service's terms.
    • Key columns:
      • point_quote_text: Extracted text from the terms of service.
      • case_id: Unique identifier for the case.
      • case_title: Brief description of the case.
      • topic_id: Unique identifier for the topic.
      • topic_title: Broad category the case falls under (e.g., Transparency, Copyright License).

    2. clean_tosdr_privacy_data.csv

    • This file is a subset of clean_tosdr_all_data.csv containing only privacy-related entries.
    • Includes cases related to tracking, data collection, account deletion policies, and more.
    • Has the same structure as clean_tosdr_all_data.csv but filtered to include only privacy-related topics.

    Usage

    • Use clean_tosdr_all_data.csv for a broad analysis of various terms of service aspects.
    • Use clean_tosdr_privacy_data.csv for focused studies on privacy-related clauses.
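    A minimal loading sketch using the file and key column names listed above:

```python
# Minimal sketch: load the full export and summarize points by topic,
# using the key columns documented above.
import pandas as pd

df = pd.read_csv("clean_tosdr_all_data.csv")

print(df["topic_title"].value_counts())  # points per broad topic

# Inspect a few quoted terms-of-service snippets with their case titles.
print(df[["case_title", "point_quote_text"]].head(3).to_string(index=False))
```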
  11. Mali Clean Water Access | Historical Data | N/A-N/A

    • macrotrends.net
    csv
    Updated Jun 30, 2025
    Cite: MACROTRENDS (2025). Mali Clean Water Access | Historical Data | N/A-N/A [Dataset]. https://www.macrotrends.net/datasets/global-metrics/countries/mli/mali/clean-water-access-statistics
    Available download formats: csv
    Dataset updated: Jun 30, 2025
    Dataset authored and provided by: MACROTRENDS
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
    Area covered: Mali
    Description

    Historical dataset showing Mali clean water access by year from N/A to N/A.

  12. CORD-19-CSV

    • kaggle.com
    Updated Apr 12, 2020
    Cite: Huáscar Méndez (2020). CORD-19-CSV [Dataset]. https://www.kaggle.com/huascarmendez1/cord19csv/code
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated: Apr 12, 2020
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: Huáscar Méndez
    Description

    Context

    This dataset is an extract from the COVID-19 Open Research Dataset Challenge (CORD-19).

    This preprocessing is necessary because the original input data is stored in JSON files whose structure is too complex to analyze directly.

    The preprocessing further consisted of filtering the documents that specifically talk about the COVID-19 disease and its other names, among other general data review and cleaning activities.

    Content

    As a result, this dataset contains a set of files in CSV format, grouped by original source (Biorxiv, Comm_use, Custom_license, Noncomm_use). Each of those files contains a subset of data columns, specifically paper_id, doc_title, doc_text, and source; a flattening sketch follows.
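    A sketch of the kind of JSON-to-CSV flattening described above; it assumes the CORD-19 full-text JSON layout (paper_id, metadata.title, body_text as a list of paragraph objects), and the directory name is illustrative:

```python
# Minimal sketch of flattening CORD-19-style full-text JSON into the four
# columns named above. The directory name and JSON layout are assumptions.
import json
import pathlib
import pandas as pd

rows = []
for path in pathlib.Path("biorxiv").glob("*.json"):  # hypothetical directory
    doc = json.loads(path.read_text())
    rows.append({
        "paper_id": doc["paper_id"],
        "doc_title": doc["metadata"]["title"],
        "doc_text": " ".join(p["text"] for p in doc["body_text"]),
        "source": "biorxiv",
    })

pd.DataFrame(rows).to_csv("biorxiv.csv", index=False)
```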

  13. Data from: Survey Data on Customer Two-Stage Decision-Making Process in...

    • dataverse.tdl.org
    pdf, tsv
    Updated Mar 23, 2023
    Cite: Yinshuang Xiao; Yaxin Cui; Nikita Raut; Jonathan Haris Januar; Johan Koskinen; Noshir Contractor; Wei Chen; Zhenghui Sha (2023). Survey Data on Customer Two-Stage Decision-Making Process in Household Vacuum Cleaner Market [Dataset]. http://doi.org/10.18738/T8/SPJSLI
    Available download formats: tsv (2745951), tsv (267973), pdf (919108), tsv (3612), tsv (24430)
    Dataset updated: Mar 23, 2023
    Dataset provided by: Texas Data Repository
    Authors: Yinshuang Xiao; Yaxin Cui; Nikita Raut; Jonathan Haris Januar; Johan Koskinen; Noshir Contractor; Wei Chen; Zhenghui Sha
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information was derived automatically)

    Description

    The dataset contains several components (a loading sketch follows the list):

    1. The survey instrument used to collect the data, in .pdf file format (pdf file for the questionnaire).
    2. The raw survey data in .csv file format (csv file for survey data): The survey data contains 251 variables with responses from 1002 respondents. In the survey design, all questions are mandatory, and therefore no missing values exist, except for instances where respondents chose to respond with “prefer not to say” (in some sensitive demographic questions) or “I don’t know” (in some questions related to their social networks). Non-applicable responses are coded as "NULL" and blank values.
    3. The codebook for the raw survey data in .xlsx file format (xlsx file for survey data): The codebook explains each of the 251 variables included in the survey data file. It lists how each survey question and response option is numerically coded in the raw data and can be used as a guide for navigating the survey dataset.
    4. The product feature list data in .csv file format (csv file for product data): The product feature list data contains the product features of 624 vacuum cleaner products, and each product has 32 variables/features. Missing values, where no online information is available, are coded as “NA,” while non-applicable values, such as runtime for corded vacuum cleaners or navigation path for non-robotic vacuum cleaners, are coded as blank values.
    5. The codebook for the product feature list data in .xlsx file format (xlsx file for product data): The accompanying codebook provides a detailed description of each feature and its data type, as well as the number of missing values for each product feature in the last column.
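    A minimal loading sketch that maps the missing-value codes described above to proper NA values; the file names are hypothetical:

```python
# Minimal sketch; file names are hypothetical. The survey data codes
# non-applicable responses as "NULL" or blanks, and the product data codes
# missing values as "NA" and non-applicable values as blanks (pandas treats
# blank cells as NaN by default).
import pandas as pd

survey = pd.read_csv("survey_data.csv", na_values=["NULL"])    # hypothetical name
products = pd.read_csv("product_data.csv", na_values=["NA"])   # hypothetical name

print(survey.shape)    # 1002 respondents x 251 variables, per the description
print(products.shape)  # 624 products x 32 features, per the description
```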

  14. Los Angeles Cleaning Services

    • flashmapy.com
    csv
    Updated Aug 22, 2024
    Cite: (2024). Los Angeles Cleaning Services [Dataset]. https://www.flashmapy.com/lists/cleaning-service-los-angeles
    Available download formats: csv
    Dataset updated: Aug 22, 2024
    Area covered: Los Angeles
    Variables measured: X, City, Email, Phone, State, Images, Country, Reviews, Youtube, Category, and 7 more
    Description

    A downloadable CSV file containing 335 Cleaning Services in Los Angeles with details like contact information, price range, reviews, and opening hours.

  15. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite: Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Available download formats: application/x-sqlite3
    Dataset updated: Jul 31, 2017
    Dataset provided by: figshare
    Authors: Florian Breit
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    A collection of datasets and Python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

    1. Datasets

    The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

    1.1 CSV format

    The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

    The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure; see the section below):

    • isogramy (int): The order of isogramy, e.g. "2" is a second order isogram
    • length (int): The length of the word in letters
    • word (text): The actual word/isogram in ASCII
    • source_pos (text): The Part of Speech tag from the original corpus
    • count (int): Token count (total number of occurrences)
    • vol_count (int): Volume count (number of different sources which contain the word)
    • count_per_million (int): Token count per million words
    • vol_count_as_percent (int): Volume count as percentage of the total number of volumes
    • is_palindrome (bool): Whether the word is a palindrome (1) or not (0)
    • is_tautonym (bool): Whether the word is a tautonym (1) or not (0)
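    Given that layout, a minimal loading sketch (the file name follows the naming used in section 2.4 below; the column names are those documented above):

```python
# Minimal sketch: load one of the headerless, tab-separated ".csv" files,
# naming the columns as documented above.
import pandas as pd

columns = ["isogramy", "length", "word", "source_pos", "count",
           "vol_count", "count_per_million", "vol_count_as_percent",
           "is_palindrome", "is_tautonym"]

df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=columns)
print(df[df["is_palindrome"] == 1].head())  # a few palindromic isograms
```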

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label

    Data type

    Description

    !total_1grams

    int

    The total number of words in the corpus

    !total_volumes

    int

    The total number of volumes (individual sources) in the corpus

    !total_isograms

    int

    The total number of isograms found in the corpus (before compacting)

    !total_palindromes

    int

    How many of the isograms found are palindromes

    !total_tautonyms

    int

    How many of the isograms found are tautonyms

    The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

    1.2 SQLite database format

    On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

    • Compacted versions of each dataset, where identical headwords are combined into a single entry.
    • A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
    • An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

    The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

    2. Scripts

    There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

    2.1 Source data

    The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

    2.2 Data preparation

    Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

    python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
    python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

    Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

    2.3 Isogram extraction

    After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

    python isograms.py --batch --infile=INFILE --outfile=OUTFILE

    Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

    2.4 Creating a SQLite3 database

    The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

    1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
    2. Copy the "create-database.sql" script into the same directory as the two data files.
    3. On the command line, go to the directory where the files and the SQL script are.
    4. Type: sqlite3 isograms.db < create-database.sql
    5. This will create a database called "isograms.db".

    See section 1 for a basic description of the output data and how to work with the database.

    2.5 Statistical processing

    The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
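    And a minimal sketch for querying the SQLite3 version from Python; the table name "ngrams" is an assumption, since the database's table names are not listed above, but the column layout matches the CSV description:

```python
# Minimal sketch: query the SQLite3 database version of the data. The table
# name "ngrams" is an assumption; the columns match the CSV layout above.
import sqlite3

con = sqlite3.connect("isograms.db")
rows = con.execute(
    'SELECT word, length, "count" FROM ngrams '
    'WHERE is_tautonym = 1 ORDER BY "count" DESC LIMIT 10'
).fetchall()
for word, length, count in rows:
    print(word, length, count)
con.close()
```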

  16. Data for PhD Chapter 3 and manuscript: Cleaner shrimp are true cleaners of...

    • researchdata.edu.au
    Updated Jul 5, 2018
    Cite: Vaughan David; David Brendan Vaughan (2018). Data for PhD Chapter 3 and manuscript: Cleaner shrimp are true cleaners of injured fish [Dataset]. http://doi.org/10.4225/28/5B2C885B32331
    Dataset updated: Jul 5, 2018
    Dataset provided by: James Cook University
    Authors: Vaughan David; David Brendan Vaughan
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    Datasets (all) for this work, provided in .csv format for direct import into R. The data collection consists of the following datasets:

    All.data.csv

    This dataset contains the data used for the first behavioural model in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. This dataset informed the initial exploratory mixed effects random intercept model using all cleaning contact locations (fish sides, oral, and ventral) recorded on the fish per day testing the response variable ‘cleaning time’ as a function of the fixed effects ‘day’, ‘cleaning contact locations’, and interaction ‘day x cleaning contact locations’, and ‘fish’ and ‘shrimp’ as random effects.

    All.dataR.14.csv

    This dataset contains the data used for the second to fifth behavioural models in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. This is a subset of All.data.csv which excludes oral and ventral cleaning contact locations (scenarios 5 and 6). The analysis for All.data.csv was first repeated on this subset, and then two alternative approaches were used to model temporal change in cleaning times. In the first, day was treated as a numeric variable, included in the model as either a quadratic or a linear function to test for curvature, testing the response variable ‘cleaning time’ as a function of the fixed effects ‘cleaning contact locations’, ‘day’, ‘day²’, and the interactions ‘cleaning contact locations with day’ and ‘cleaning contact locations with day²’, with ‘fish’ and ‘shrimp’ as random effects. This analysis was carried out twice, once including all of the data, and once excluding day 0, to determine whether any temporal changes in behaviour extended beyond the initial establishment period of injury. In the second approach, based on the results of the first, the data were re-analysed with day treated as a category having two binary classes, ‘day0’ and ‘>day0’.

    Jolts.data1.csv

    This dataset was used for the analysis of jolting in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. The number of ‘jolts’ were analysed using a random-intercept mixed effects model with ‘fish’ and ‘shrimp’ as random effects, and ‘treatment’ (two levels: Injured_with_shrimp; Uninjured_with_shrimp), and ‘day’ as fixed effects.

    Red.csv

    This dataset was used for the analysis of injury redness (rubor) in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. The analysis examined spectral differences between groups with and without shrimp over the subsequent period to examine whether the presence of shrimp affected the spectral properties of the injury site as the injury healed. For this analysis, ‘day’ (either 4 or 6), ‘shrimp presence’ and the ‘shrimp x day’ interaction were all included as potential explanatory variables.

    Yellow.csv

    As for Red.csv.

    UV1.csv

    This dataset was used for the Nonspecific tissue damage analysis in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. Nonspecific tissue damage area was investigated between two levels of four treatment groups (With shrimp and Without shrimp; Injured fish and Uninjured fish) over time to determine their effects on tissue damage. Mixed effects random-intercept models were employed, with the ‘fish’ as the random effect to allow for photographic sampling on both sides of the same fish. The response variable ‘tissue damage area’ was tested as a function of the fixed effects ‘treatment’, ‘side’, ‘day’ (as a factor). Two levels of fish sides were included in the analyses representing injured and uninjured sides.
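    The models above were fitted for the manuscript, and the data are provided for direct import into R. Purely as a rough illustration, a random-intercept model of the same general shape can be sketched in Python with statsmodels; the column names are assumptions, and only the ‘fish’ random intercept is included here (the original models also include ‘shrimp’):

```python
# Rough illustration of a random-intercept mixed-effects model like those
# described above, using statsmodels. Column names are assumptions, and only
# a 'fish' random intercept is modelled for simplicity.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("All.data.csv")

model = smf.mixedlm(
    "cleaning_time ~ day * contact_location",  # fixed effects with interaction
    data=df,
    groups=df["fish"],                         # random intercept per fish
)
result = model.fit()
print(result.summary())
```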

  17. Dallas Cleaning Services

    • flashmapy.com
    csv
    Updated Aug 22, 2024
    Cite: (2024). Dallas Cleaning Services [Dataset]. https://www.flashmapy.com/lists/cleaning-service-dallas
    Available download formats: csv
    Dataset updated: Aug 22, 2024
    Area covered: Dallas
    Variables measured: X, City, Email, Phone, State, Images, Country, Reviews, Youtube, Category, and 7 more
    Description

    A downloadable CSV file containing 302 Cleaning Services in Dallas with details like contact information, price range, reviews, and opening hours.

  18. OERRH Survey Data 2013-2015 csv

    • figshare.com
    txt
    Updated May 30, 2023
    Cite: OERHub OpenUniversity (2023). OERRH Survey Data 2013-2015 csv [Dataset]. http://doi.org/10.6084/m9.figshare.1528263.v2
    Available download formats: txt
    Dataset updated: May 30, 2023
    Dataset provided by: figshare
    Authors: OERHub OpenUniversity
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)

    Description

    CSV file containing responses to surveys conducted by the Hewlett-funded OER Research Hub Project during 2013-2015, exploring key questions around OER use and attitudes. An Excel file is available at http://figshare.com/articles/OERRH_Survey_Data_2013_2015/1546584. You can read about the creation of the file at http://oerhub.net/blogs/cleaning-our-way-to-a-monster-dataset/ and the background of the project at http://oerhub.net/what-we-do/background-to-oer-hub/

  19. Philadelphia Cleaning Services

    • flashmapy.com
    csv
    Updated Aug 22, 2024
    Cite: (2024). Philadelphia Cleaning Services [Dataset]. https://www.flashmapy.com/lists/cleaning-service-philadelphia
    Available download formats: csv
    Dataset updated: Aug 22, 2024
    Area covered: Philadelphia
    Variables measured: X, City, Email, Phone, State, Images, Country, Reviews, Youtube, Category, and 7 more
    Description

    A downloadable CSV file containing 64 Cleaning Services in Philadelphia with details like contact information, price range, reviews, and opening hours.

  20. Seville Cleaning Services

    • flashmapy.com
    csv
    Updated Aug 22, 2024
    Cite: (2024). Seville Cleaning Services [Dataset]. https://www.flashmapy.com/lists/cleaning-service-seville
    Available download formats: csv
    Dataset updated: Aug 22, 2024
    Area covered: Seville
    Variables measured: X, City, Email, Phone, State, Images, Country, Reviews, Youtube, Category, and 7 more
    Description

    A downloadable CSV file containing 179 Cleaning Services in Seville with details like contact information, price range, reviews, and opening hours.
