Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by AbdElRahman16
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A subset of the Oregon Health Insurance Experiment (OHIE) contains 12,229 individuals who satisfied the inclusion criteria and responded to the in-person survey by October 2010. It has been used to explore the heterogeneity of the effects of the lottery and of insurance coverage on a number of outcomes.
Datasmartly/Tamazight-Clean-CSV dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this project, we work on repairing three datasets:

Different countries, identified by country_protocol_code, conduct the same clinical trials, which are identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises other unstructured attributes that categorize the inclusion criteria for trial participants, such as inclusion.

Each product is identified by a code. Samples with the same code represent the same product but are extracted from a different source. The allergens are indicated by '2' if present, '1' if there are traces of it, and '0' if it is absent from a product. The dataset also includes information on ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.

N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets.
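For illustration, a minimal pandas sketch of decoding this scheme follows; the filename and the gluten column are hypothetical, and only the 2/1/0 coding comes from the description above.

```python
import pandas as pd

# Hypothetical filename and allergen column; only the 2/1/0 coding
# scheme comes from the dataset description above.
df = pd.read_csv("allergens.csv")

code_meaning = {2: "present", 1: "traces", 0: "absent"}
df["gluten_status"] = df["gluten"].map(code_meaning)
print(df[["code", "gluten", "gluten_status"]].head())
```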
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Contains:
This dataset was created by Vitoria Rodrigues Silva
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets contain pixel-level hyperspectral data of six snow and glacier classes, extracted from a hyperspectral image. The file "data.csv" has 5417 samples × 142 bands belonging to the classes: Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The file "_labels1.csv" has the corresponding labels for "data.csv". The file "RGB.csv" has the same 5417 samples with only three band values, while "data.csv" has 142 band values.
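A minimal loading sketch follows; the filenames are taken from the description, while the header-less layout (one band value per column) is an assumption.

```python
import pandas as pd

# Filenames come from the description; header=None is an assumption
# (one band value per column, no header row).
data = pd.read_csv("data.csv", header=None)        # expect 5417 x 142
labels = pd.read_csv("_labels1.csv", header=None)  # one class label per row
rgb = pd.read_csv("RGB.csv", header=None)          # expect 5417 x 3

print(data.shape, labels.shape, rgb.shape)
```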
GNU General Public License 3.0: https://www.gnu.org/licenses/gpl-3.0-standalone.html
This dataset contains two CSV files derived from Terms of Service; Didn't Read (ToS;DR) data. These files contain analyzed and categorized terms of service snippets from various online services after the cleaning process. The privacy dataset is a subset of the full dataset, focusing exclusively on privacy-related terms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Historical dataset showing Mali clean water access by year from N/A to N/A.
This dataset is an extract from COVID-19 Open Research Dataset Challenge (CORD-19).
This preprocessing is necessary since the original input data is stored in JSON files, whose structure is too complex to analyse directly.
The preprocessing further consisted of filtering for documents that specifically discuss the COVID-19 disease and its other names, among other general data review and cleaning activities.
As a result, this dataset contains a set of files in CSV format, grouped by original source (Biorxiv, Comm_use, Custom_licence, Nomcomm_use). Each of those files contains a subset of data columns, specifically paper_id, doc_title, doc_text, and source.
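A hedged sketch of loading and filtering these files follows; the glob pattern is an assumption, while the column names come from the description above.

```python
import glob
import pandas as pd

# The glob pattern is an assumption; the column names
# (paper_id, doc_title, doc_text, source) come from the description above.
frames = [pd.read_csv(path) for path in glob.glob("*.csv")]
papers = pd.concat(frames, ignore_index=True)

# Keep papers whose text mentions the disease name.
covid = papers[papers["doc_text"].str.contains("covid-19", case=False, na=False)]
print(covid.groupby("source").size())
```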
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The dataset contains several components, including:
1. The survey instrument used to collect the data, in .pdf file format (pdf file for the questionnaire).
2. The raw survey data, in .csv file format (csv file for survey data): The survey data contains 251 variables with responses from 1002 respondents. In the survey design, all questions are mandatory, and therefore no missing values exist, except for instances where respondents chose to respond with "prefer not to say" (in some sensitive demographic questions) or "I don't know" (in some questions related to their social networks). Non-applicable responses are coded as "NULL" or as blank values.
3. The codebook for the raw survey data, in .xlsx file format (xlsx file for survey data): The codebook explains each of the 251 variables included in the survey data file. It lists how each survey question and response option is numerically coded in the raw data and can be used as a guide for navigating the survey dataset.
4. The product feature list data, in .csv file format (csv file for product data): The product feature list data contains the product features of 624 vacuum cleaner products, and each product has 32 variables/features. Missing values, where no online information is available, are coded as "NA", while non-applicable values, such as runtime for corded vacuum cleaners or navigation path for non-robotic vacuum cleaners, are coded as blank values.
5. The codebook for the product feature list data, in .xlsx file format (xlsx file for product data): The accompanying codebook provides a detailed description of each feature and its data type, as well as the number of missing values for each product feature in the last column.
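A minimal pandas sketch of reading both files with the documented missing-value codes follows; the filenames are hypothetical.

```python
import pandas as pd

# Hypothetical filenames; the "NULL"/"NA"/blank coding comes from the
# component descriptions above. Blank cells become NaN by default.
survey = pd.read_csv("survey_data.csv", na_values=["NULL"])
products = pd.read_csv("product_features.csv", na_values=["NA"])

print(survey.shape)    # expect (1002, 251)
print(products.shape)  # expect (624, 32)
```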
A downloadable CSV file containing 335 Cleaning Services in Los Angeles with details like contact information, price range, reviews, and opening hours.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" file contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name. The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):
Label Data type Description
isogramy int The order of isogramy, e.g. "2" is a second order isogram
length int The length of the word in letters
word text The actual word/isogram in ASCII
source_pos text The Part of Speech tag from the original corpus
count int Token count (total number of occurrences)
vol_count int Volume count (number of different sources which contain the word)
count_per_million int Token count per million words
vol_count_as_percent int Volume count as percentage of the total number of volumes
is_palindrome bool Whether the word is a palindrome (1) or not (0)
is_tautonym bool Whether the word is a tautonym (1) or not (0)
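A minimal pandas sketch for loading one of these files follows, assuming the documented tab-separated, header-less layout; the filename follows the naming used in section 2.4 below.

```python
import pandas as pd

# Column names and order as documented above; the files are tab-separated
# with no header row.
cols = [
    "isogramy", "length", "word", "source_pos", "count",
    "vol_count", "count_per_million", "vol_count_as_percent",
    "is_palindrome", "is_tautonym",
]
df = pd.read_csv("ngrams-isograms.csv", sep="\t", header=None, names=cols)
print(df.head())
```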
The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:
Label Data type Description
!total_1grams int The total number of words in the corpus
!total_volumes int The total number of volumes (individual sources) in the corpus
!total_isograms int The total number of isograms found in the corpus (before compacting)
!total_palindromes int How many of the isograms found are palindromes
!total_tautonyms int How many of the isograms found are tautonyms
The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:
• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.
The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:
1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db < create-database.sql
5. This will create a database called "isograms.db".
See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
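As a usage illustration, a hedged Python sketch for querying the database follows; the table name "ngrams" is an assumption, since the description only states that table layouts mirror the CSV columns documented above.

```python
import sqlite3

# The table name "ngrams" is an assumption; column names follow the
# documented CSV layout (word, length, count, is_palindrome, ...).
con = sqlite3.connect("isograms.db")
rows = con.execute(
    "SELECT word, length, count FROM ngrams "
    "WHERE is_palindrome = 1 ORDER BY count DESC LIMIT 10"
).fetchall()
for word, length, count in rows:
    print(word, length, count)
con.close()
```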
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets (all) for this work, provided in .csv format for direct import into R. The data collection consists of the following datasets:
All.data.csv
This dataset contains the data used for the first behavioural model in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. This dataset informed the initial exploratory mixed effects random intercept model using all cleaning contact locations (fish sides, oral, and ventral) recorded on the fish per day testing the response variable ‘cleaning time’ as a function of the fixed effects ‘day’, ‘cleaning contact locations’, and interaction ‘day x cleaning contact locations’, and ‘fish’ and ‘shrimp’ as random effects.
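As a rough illustration of this model structure, a hedged statsmodels sketch follows; all column names (cleaning_time, day, location, fish, shrimp) are assumptions about the CSV layout, and the crossed random intercepts for fish and shrimp are expressed as variance components within a single dummy group.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Column names are assumptions about the CSV layout; only the model
# structure comes from the description above.
df = pd.read_csv("All.data.csv")
df["all"] = 1  # single dummy group so fish and shrimp act as crossed effects

model = smf.mixedlm(
    "cleaning_time ~ day * C(location)",
    data=df,
    groups="all",
    re_formula="0",  # no random intercept for the dummy group itself
    vc_formula={"fish": "0 + C(fish)", "shrimp": "0 + C(shrimp)"},
)
print(model.fit().summary())
```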
All.dataR.14.csv
This dataset contains the data used for the second to fifth behavioural models in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. This is a subset of All.data.csv which excludes oral and ventral cleaning contact locations (scenarios 5 and 6). The analysis for All.data.csv was first repeated on this subset, and then two alternative approaches were used to model temporal change in cleaning times. In the first, day was treated as a numeric variable, included in the model as either a quadratic or a linear function to test for curvature, testing the response variable 'cleaning time' as a function of the fixed effects 'cleaning contact locations', 'day', 'day²', and the interactions 'cleaning contact locations with day' and 'cleaning contact locations with day²', with 'fish' and 'shrimp' as random effects. This analysis was carried out twice, once including all of the data, and once excluding day 0, to determine whether any temporal changes in behaviour extended beyond the initial establishment period of injury. In the second approach, based on the results of the first, the data were re-analysed with day treated as a category having two binary classes, 'day0' and '>day0'.
Jolts.data1.csv
This dataset was used for the analysis of jolting in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. The number of ‘jolts’ were analysed using a random-intercept mixed effects model with ‘fish’ and ‘shrimp’ as random effects, and ‘treatment’ (two levels: Injured_with_shrimp; Uninjured_with_shrimp), and ‘day’ as fixed effects.
Red.csv
This dataset was used for the analysis of injury redness (rubor) in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. The analysis examined spectral differences between groups with and without shrimp over the subsequent period to examine whether the presence of shrimp affected the spectral properties of the injury site as the injury healed. For this analysis, ‘day’ (either 4 or 6), ‘shrimp presence’ and the ‘shrimp x day’ interaction were all included as potential explanatory variables.
Yellow.csv
As for Red.csv.
UV1.csv
This dataset was used for the nonspecific tissue damage analysis in PhD chapter 3, and the associated manuscript accepted in Marine Biology entitled: Cleaner shrimp are true cleaners of injured fish [authors: David B Vaughan, Alexandra S Grutter, Hugh W Ferguson, Rhondda Jones, Kate S Hutson]. Nonspecific tissue damage area was investigated across four treatment groups formed by two factors (With shrimp / Without shrimp; Injured fish / Uninjured fish) over time to determine their effects on tissue damage. Mixed effects random-intercept models were employed, with 'fish' as the random effect to allow for photographic sampling on both sides of the same fish. The response variable 'tissue damage area' was tested as a function of the fixed effects 'treatment', 'side', and 'day' (as a factor). Two levels of fish sides were included in the analyses, representing injured and uninjured sides.
A downloadable CSV file containing 302 Cleaning Services in Dallas with details like contact information, price range, reviews, and opening hours.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CSV file containing responses to surveys conducted by the Hewlett-funded OER Research Hub Project during 2013-2015, exploring key questions around OER use and attitudes. An Excel file is available at http://figshare.com/articles/OERRH_Survey_Data_2013_2015/1546584. You can read about the creation of the file at http://oerhub.net/blogs/cleaning-our-way-to-a-monster-dataset/ and the background of the project at http://oerhub.net/what-we-do/background-to-oer-hub/
A downloadable CSV file containing 64 Cleaning Services in Philadelphia with details like contact information, price range, reviews, and opening hours.
A downloadable CSV file containing 179 Cleaning Services in Seville with details like contact information, price range, reviews, and opening hours.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically