Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by AbdElRahman16
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A subset of the Oregon Health Insurance Experiment (OHIE) contains 12,229 individuals who satisfied the inclusion criteria and responded to the in-person survey by October 2010. It has been used to explore the heterogeneity of the effects of the lottery and of insurance coverage on a number of outcomes.
Datasmartly/Tamazight-Clean-CSV dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this project, we work on repairing three datasets:

Different countries, identified by country_protocol_code, conduct the same clinical trials, which are identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as inclusion.

Each product is identified by a code. Samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by '2' if present, '1' if there are traces of it, and '0' if it is absent from a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets.
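For illustration, a minimal pandas sketch of decoding this scheme follows; the filename and the gluten column are hypothetical, and only the 2/1/0 coding comes from the description above.

```python
import pandas as pd

# Hypothetical filename and allergen column; only the 2/1/0 coding
# scheme comes from the dataset description above.
df = pd.read_csv("allergens.csv")

code_meaning = {2: "present", 1: "traces", 0: "absent"}
df["gluten_status"] = df["gluten"].map(code_meaning)
print(df[["code", "gluten", "gluten_status"]].head())
```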
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains:
This dataset was created by Vitoria Rodrigues Silva
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets contain pixel-level hyperspectral data of six snow and glacier classes, extracted from a hyperspectral image. The file "data.csv" has 5417 samples × 142 bands belonging to the classes: Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The file "_labels1.csv" has the corresponding labels for "data.csv". The file "RGB.csv" has the same 5417 samples with only three band values, while "data.csv" has 142 band values.
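A minimal loading sketch follows; the filenames are taken from the description, while the header-less layout (one band value per column) is an assumption.

```python
import pandas as pd

# Filenames come from the description; header=None is an assumption
# (one band value per column, no header row).
data = pd.read_csv("data.csv", header=None)        # expect 5417 x 142
labels = pd.read_csv("_labels1.csv", header=None)  # one class label per row
rgb = pd.read_csv("RGB.csv", header=None)          # expect 5417 x 3

print(data.shape, labels.shape, rgb.shape)
```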
GNU General Public License 3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
This dataset contains two CSV files derived from Terms of Service; Didn't Read (ToS;DR) data. These files contain analyzed and categorized terms of service snippets from various online services after the cleaning process. The privacy dataset is a subset of the full dataset, focusing exclusively on privacy-related terms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Historical dataset showing Mali clean water access by year from N/A to N/A.
This dataset is an extract from COVID-19 Open Research Dataset Challenge (CORD-19).
This preprocessing is necessary since the original input data is stored in JSON files, whose structure is too complex to analyse directly.
The preprocessing further consisted of filtering for documents that specifically discuss the COVID-19 disease and its other names, among other general data review and cleaning activities.
As a result, this dataset contains a set of files in CSV format, grouped by original source (Biorxiv, Comm_use, Custom_licence, Nomcomm_use). Each of those files contains a subset of data columns, specifically paper_id, doc_title, doc_text, and source.
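A hedged sketch of loading and filtering these files follows; the glob pattern is an assumption, while the column names come from the description above.

```python
import glob
import pandas as pd

# The glob pattern is an assumption; the column names
# (paper_id, doc_title, doc_text, source) come from the description above.
frames = [pd.read_csv(path) for path in glob.glob("*.csv")]
papers = pd.concat(frames, ignore_index=True)

# Keep papers whose text mentions the disease name.
covid = papers[papers["doc_text"].str.contains("covid-19", case=False, na=False)]
print(covid.groupby("source").size())
```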
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains several components, including:
1. The survey instrument used to collect the data, in .pdf file format (pdf file for the questionnaire).
2. The raw survey data, in .csv file format (csv file for survey data): The survey data contains 251 variables with responses from 1002 respondents. In the survey design, all questions are mandatory, and therefore no missing values exist, except for instances where respondents chose to respond with "prefer not to say" (in some sensitive demographic questions) or "I don't know" (in some questions related to their social networks). Non-applicable responses are coded as "NULL" or as blank values.
3. The codebook for the raw survey data, in .xlsx file format (xlsx file for survey data): The codebook explains each of the 251 variables included in the survey data file. It lists how each survey question and response option is numerically coded in the raw data and can be used as a guide for navigating the survey dataset.
4. The product feature list data, in .csv file format (csv file for product data): The product feature list data contains the product features of 624 vacuum cleaner products, and each product has 32 variables/features. Missing values, where no online information is available, are coded as "NA", while non-applicable values, such as runtime for corded vacuum cleaners or navigation path for non-robotic vacuum cleaners, are coded as blank values.
5. The codebook for the product feature list data, in .xlsx file format (xlsx file for product data): The accompanying codebook provides a detailed description of each feature and its data type, as well as the number of missing values for each product feature in the last column.
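A minimal pandas sketch of reading both files with the documented missing-value codes follows; the filenames are hypothetical.

```python
import pandas as pd

# Hypothetical filenames; the "NULL"/"NA"/blank coding comes from the
# component descriptions above. Blank cells become NaN by default.
survey = pd.read_csv("survey_data.csv", na_values=["NULL"])
products = pd.read_csv("product_features.csv", na_values=["NA"])

print(survey.shape)    # expect (1002, 251)
print(products.shape)  # expect (624, 32)
```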
A downloadable CSV file containing 335 Cleaning Services in Los Angeles with details like contact information, price range, reviews, and opening hours.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
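A minimal pandas sketch for loading one of these files follows, assuming the documented tab-separated, header-less layout; the filename follows the naming used in section 2.4 below.

```python
import pandas as pd

# Column names and order as documented above; the files are tab-separated
# with no header row.
cols = [
    "isogramy", "length", "word", "source_pos", "count",
    "vol_count", "count_per_million", "vol_count_as_percent",
    "is_palindrome", "is_tautonym",
]
df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=cols)
print(df.head())
```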
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db < create-database.sql
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
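As a usage illustration, a hedged Python sketch for querying the database follows; the table name "ngrams" is an assumption, since the description only states that table layouts mirror the CSV columns documented above.

```python
import sqlite3

# The table name "ngrams" is an assumption; column names follow the
# documented CSV layout (word, length, count, is_palindrome, ...).
con = sqlite3.connect("isograms.db")
rows = con.execute(
    "SELECT word, length, count FROM ngrams "
    "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10"
).fetchall()
for word, length, count in rows:
    print(word, length, count)
con.close()
```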
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets (all) for this work, provided in .csv format for direct import into R. The data collection consists of the following datasets:
All.data.csv
This dataset contains the data used for the first behavioural model in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. This dataset informed the initial exploratory mixed effects random intercept model using all cleaning contact locations (fish sides, oral, and ventral) recorded on the fish per day testing the response variable ‘cleaning time’ as a function of the fixed effects ‘day’, ‘cleaning contact locations’, and interaction ‘day x cleaning contact locations’, and ‘fish’ and ‘shrimp’ as random effects.
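As a rough illustration of this model structure, a hedged statsmodels sketch follows; all column names (cleaning_time, day, location, fish, shrimp) are assumptions about the CSV layout, and the crossed random intercepts for fish and shrimp are expressed as variance components within a single dummy group.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Column names are assumptions about the CSV layout; only the model
# structure comes from the description above.
df = pd.read_csv("All.data.csv")
df["all"] = 1  # single dummy group so fish and shrimp act as crossed effects

model = smf.mixedlm(
    "cleaning_time ~ day * C(location)",
    data=df,
    groups="all",
    re_formula="0",  # no random intercept for the dummy group itself
    vc_formula={"fish": "0 + C(fish)", "shrimp": "0 + C(shrimp)"},
)
print(model.fit().summary())
```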
All.dataR.14.csv
This dataset contains the data used for the second to fifth behavioural models in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. This is a subset of All.data.csv which excludes oral and ventral cleaning contact locations (scenarios 5 and 6). The analysis for All.data.csv was first repeated on this subset, and then two alternative approaches were used to model temporal change in cleaning times. In the first, day was treated as a numeric variable, included in the model as either a quadratic or a linear function to test for curvature, testing the response variable 'cleaning time' as a function of the fixed effects 'cleaning contact locations', 'day', 'day²', and the interactions 'cleaning contact locations with day' and 'cleaning contact locations with day²', with 'fish' and 'shrimp' as random effects. This analysis was carried out twice, once including all of the data, and once excluding day 0, to determine whether any temporal changes in behaviour extended beyond the initial establishment period of injury. In the second approach, based on the results of the first, the data were re-analysed with day treated as a category having two binary classes, 'day0' and '>day0'.
Jolts.data1.csv
This dataset was used for the analysis of jolting in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. The number of ‘jolts’ were analysed using a random-intercept mixed effects model with ‘fish’ and ‘shrimp’ as random effects, and ‘treatment’ (two levels: Injured_with_shrimp; Uninjured_with_shrimp), and ‘day’ as fixed effects.
Red.csv
This dataset was used for the analysis of injury redness (rubor) in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. The analysis examined spectral differences between groups with and without shrimp over the subsequent period to examine whether the presence of shrimp affected the spectral properties of the injury site as the injury healed. For this analysis, ‘day’ (either 4 or 6), ‘shrimp presence’ and the ‘shrimp x day’ interaction were all included as potential explanatory variables.
Yellow.csv
As for Red.csv.
UV1.csv
This dataset was used for the nonspecific tissue damage analysis in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. Nonspecific tissue damage area was investigated across four treatment groups formed by two factors (With shrimp / Without shrimp; Injured fish / Uninjured fish) over time to determine their effects on tissue damage. Mixed effects random-intercept models were employed, with 'fish' as the random effect to allow for photographic sampling on both sides of the same fish. The response variable 'tissue damage area' was tested as a function of the fixed effects 'treatment', 'side', and 'day' (as a factor). Two levels of fish sides were included in the analyses, representing injured and uninjured sides.
A downloadable CSV file containing 302 Cleaning Services in Dallas with details like contact information, price range, reviews, and opening hours.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CSV file containing responses to surveys conducted by the Hewlett-funded OER Research Hub Project during 2013-2015, exploring key questions around OER use and attitudes. An Excel file is available at http://figshare.com/articles/OERRH_Survey_Data_2013_2015/1546584. You can read about the creation of the file at http://oerhub.net/blogs/cleaning-our-way-to-a-monster-dataset/ and the background of the project at http://oerhub.net/what-we-do/background-to-oer-hub/
A downloadable CSV file containing 64 Cleaning Services in Philadelphia with details like contact information, price range, reviews, and opening hours.
A downloadable CSV file containing 179 Cleaning Services in Seville with details like contact information, price range, reviews, and opening hours.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically