9 datasets found

R code
figshare.com
txt
Updated Jun 5, 2017
Cite
Christine Dodge (2017). R code [Dataset]. http://doi.org/10.6084/m9.figshare.5021297.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.5021297.v1
Dataset updated
Jun 5, 2017
Dataset provided by
Figshare (http://figshare.com/)
Authors
Christine Dodge
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
R code used for each data set to perform negative binomial regression, calculate the overdispersion statistic, generate summary statistics, and remove outliers.
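For readers unfamiliar with the steps named in this description, a minimal R sketch of an overdispersion check followed by a negative binomial fit is given here. It is not the code from the dataset: the simulated data, the variable names `x`, `y`, and `dat`, and the choice of the Pearson statistic as the overdispersion measure are all assumptions made for illustration.

```r
set.seed(42)

# Hypothetical simulated counts: negative binomial with size = 1,
# so the data are overdispersed relative to a Poisson model.
x <- rnorm(200)
y <- rnbinom(200, mu = exp(0.5 + 0.3 * x), size = 1)
dat <- data.frame(x = x, y = y)

# Poisson regression as the baseline model
pois_fit <- glm(y ~ x, family = poisson, data = dat)

# Pearson overdispersion statistic: sum of squared Pearson residuals
# divided by the residual degrees of freedom. Values well above 1
# suggest overdispersion.
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) /
  df.residual(pois_fit)
print(dispersion)

# Negative binomial regression, only if the MASS package is installed
if (requireNamespace("MASS", quietly = TRUE)) {
  nb_fit <- MASS::glm.nb(y ~ x, data = dat)
  print(summary(nb_fit)$coefficients)
}
```

A dispersion value close to 1 would suggest that a Poisson model is adequate; a clearly larger value is one common motivation for switching to a negative binomial model.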
MeSH 2023 Update - Delete Report - 4at4-q6rg - Archive Repository
healthdata.gov
application/rdfxml +5
Updated Jul 16, 2025
Cite
(2025). MeSH 2023 Update - Delete Report - 4at4-q6rg - Archive Repository [Dataset]. https://healthdata.gov/dataset/MeSH-2023-Update-Delete-Report-4at4-q6rg-Archive-R/bjnp-cusd
Available download formats: csv, application/rdfxml, json, tsv, application/rssxml, xml
Dataset updated
Jul 16, 2025
Description
This dataset tracks the updates made to the dataset "MeSH 2023 Update - Delete Report" and serves as a repository for previous versions of the data and metadata.
Data from: Valid Inference Corrected for Outlier Removal
tandf.figshare.com
pdf
Updated Jun 4, 2023
Cite
Shuxiao Chen; Jacob Bien (2023). Valid Inference Corrected for Outlier Removal [Dataset]. http://doi.org/10.6084/m9.figshare.9762731.v4
Available download formats: pdf
Unique identifier
https://doi.org/10.6084/m9.figshare.9762731.v4
Dataset updated
Jun 4, 2023
Dataset provided by
Taylor & Francis
Authors
Shuxiao Chen; Jacob Bien
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ordinary least squares (OLS) estimation of a linear regression model is well-known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and p-values on the remaining data as if this were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, outference, implements these new procedures with an interface that matches the functions commonly used for inference with lm in R. Supplementary materials for this article are available online.
Data Mining Project - Boston
kaggle.com
Updated Nov 25, 2019
Cite
SophieLiu (2019). Data Mining Project - Boston [Dataset]. https://www.kaggle.com/sliu65/data-mining-project-boston/code
Dataset updated
Nov 25, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
SophieLiu
Area covered
Boston
Description
Context

To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file contains both Lyft and Uber rides, but it is still a cleaned version of the dataset we downloaded from Kaggle.

Use of Data Files

You can easily subset the data into the car types that you will be modeling by first loading the csv into R. Here is the code to do this:

This loads the file into R:

uber_df <- read.csv('uber.csv')

The next line of code subsets the data into a specific car type. The example below keeps only the Uber 'Black' car type.

df_black <- subset(uber_df, uber_df$name == 'Black')

Next, write this data frame to a csv file on your computer so that you can load it back into R later.

write.csv(df_black, "nameofthefileyouwanttosaveas.csv")

The file will appear in your working directory. If you are not familiar with your working directory, run this code:

getwd()

The output will be the file path to your working directory. You will find the file you just created in that folder.
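The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the small data frame written to a temporary folder stands in for the real 'uber.csv', the `price` column and the output file name `uber_black.csv` are hypothetical, and only the `name` column and the 'Black' value come from the description.

```r
# Hypothetical stand-in for 'uber.csv' so the sketch runs anywhere.
# Only the 'name' column comes from the description; 'price' is made up.
tmp_dir <- tempdir()
uber_path <- file.path(tmp_dir, "uber.csv")
write.csv(
  data.frame(name = c("Black", "UberX", "Black", "UberPool"),
             price = c(27.5, 9.0, 31.0, 6.5)),
  uber_path,
  row.names = FALSE
)

# Load the file into R
uber_df <- read.csv(uber_path)

# Keep only the Uber 'Black' car type
df_black <- subset(uber_df, name == "Black")

# Save the subset; row.names = FALSE avoids writing an extra index column
black_path <- file.path(tmp_dir, "uber_black.csv")
write.csv(df_black, black_path, row.names = FALSE)

# Read the saved file back to confirm it contains only the subset
df_check <- read.csv(black_path)
print(df_check)
```

With the real data, replace `uber_path` with 'uber.csv' in your working directory and choose your own output file name.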

Inspiration

Your data will be in front of the world's largest data science community. What questions do you want to see answered?
Data for PhD thesis Chapter 5: Cleaner shrimp remove parasite eggs on fish...
researchdata.edu.au
Updated Jul 5, 2018
Cite
Vaughan David; David Brendan Vaughan (2018). Data for PhD thesis Chapter 5: Cleaner shrimp remove parasite eggs on fish cages [Dataset]. http://doi.org/10.4225/28/5B344DB8591A2
Unique identifier
https://doi.org/10.4225/28/5B344DB8591A2
Dataset updated
Jul 5, 2018
Dataset provided by
James Cook University
Authors
Vaughan David; David Brendan Vaughan
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Datasets (all) in .csv format for direct import into R. The data collection consists of the following datasets:
CH4.data.csv
This is the dataset used for the biocontrol analyses (all mixed effects random intercept models) using Lysmata vittata to reduce the reinfection pressure of Neobenedenia girellae on Epinephelus lanceolatus.
CH4WQ.csv
This is all the water quality data recorded and used in the water quality analysis (linear regression).
Dataset - High-resolution mapping of wood burning appliance hotspots using...
zenodo.org
tar
Updated Feb 13, 2025
Cite
Calum Kennedy; Laura Horsfall (2025). Dataset - High-resolution mapping of wood burning appliance hotspots using Energy Performance Certificates: A case study of England and Wales [Dataset]. http://doi.org/10.5281/zenodo.14640852
Available download formats: tar
Unique identifier
https://doi.org/10.5281/zenodo.14640852
Dataset updated
Feb 13, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Calum Kennedy; Laura Horsfall
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains open data and code to replicate the analysis in the manuscript "High-resolution mapping of wood burning appliance hotspots using Energy Performance Certificates: A case study of England and Wales".

To recreate the analysis on your local device, please carry out the following steps:

1. Clone the GitHub repository (available at: https://github.com/UCL-Wellcome-Trust-Air-Pollution/EPC_mapping_project_code) to your local device, or download the codebase from the 'Code.tar' folder and unzip in your project directory. Please ensure you use the directory with the R Project in it as your root directory.

2. Download the 'Data.tar' file and unzip the file in the R Project directory. The data should be in a folder called 'Data' in the root directory. All non-EPC data is provided under the UK Open Government License version 3.0. EPC data is provided under licence from DLUHC: https://epc.opendatacommunities.org/docs/copyright.

3. Download the main EPC data to your local device and unzip (see below for detailed instructions on how to do this). For Windows users, the 'Scripts' folder of the repository contains a .bat file which can be used to unzip the data. Note that this file requires the user to have installed 7Zip and added 7Zip to the system path. Otherwise, the .tar file can be unzipped manually.

4. Run the 'run.R' file in the 'Scripts' folder of the directory. You may need to change the 'path_data_epc_folders' variable to the path to the unzipped EPC data folders on your local device (see step 3). The full pipeline should now run.

Once you have run the pipeline for the first time, you should see a file called 'data_epc_raw.parquet' in the 'Data/raw/epc_data' folder. Once you have verified this is the case, you can safely delete the original unzipped EPC data folder, since the file is very large (>40Gb). If you run the pipeline again, you will be prompted that the raw EPC data .parquet file already exists, and you have the option to skip the merging of raw data files.
too many files, delete me
figshare.com
txt
Updated Jan 20, 2016
Cite
Remi Daigle (2016). too many files, delete me [Dataset]. http://doi.org/10.6084/m9.figshare.1564748.v10
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.1564748.v10
Dataset updated
Jan 20, 2016
Dataset provided by
Figshare (http://figshare.com/)
Authors
Remi Daigle
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Model data from the case study on MPA design for Atlantic Cod generated by: http://dx.doi.org/10.6084/m9.figshare.1556143. Please unzip the shapefiles folder before trying to reproduce the analysis.
Data from: Error and anomaly detection for intra-participant time-series...
tandf.figshare.com
xlsx
Updated Jun 1, 2023
Cite
David R. Mullineaux; Gareth Irwin (2023). Error and anomaly detection for intra-participant time-series data [Dataset]. http://doi.org/10.6084/m9.figshare.5189002
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.5189002
Dataset updated
Jun 1, 2023
Dataset provided by
Taylor & Francis
Authors
David R. Mullineaux; Gareth Irwin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Identification of errors or anomalous values, collectively considered outliers, assists in exploring data or through removing outliers improves statistical analysis. In biomechanics, outlier detection methods have explored the ‘shape’ of the entire cycles, although exploring fewer points using a ‘moving-window’ may be advantageous. Hence, the aim was to develop a moving-window method for detecting trials with outliers in intra-participant time-series data. Outliers were detected through two stages for the strides (mean 38 cycles) from treadmill running. Cycles were removed in stage 1 for one-dimensional (spatial) outliers at each time point using the median absolute deviation, and in stage 2 for two-dimensional (spatial–temporal) outliers using a moving window standard deviation. Significance levels of the t-statistic were used for scaling. Fewer cycles were removed with smaller scaling and smaller window size, requiring more stringent scaling at stage 1 (mean 3.5 cycles removed for 0.0001 scaling) than at stage 2 (mean 2.6 cycles removed for 0.01 scaling with a window size of 1). Settings in the supplied Matlab code should be customised to each data set, and outliers assessed to justify whether to retain or remove those cycles. The method is effective in identifying trials with outliers in intra-participant time series data.
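As a rough illustration of the stage 1 idea described above (flagging cycles with one-dimensional outliers at each time point using the median absolute deviation), the following R sketch may help. It is not the authors' supplied Matlab code: the function name `flag_mad_outliers`, the fixed cutoff `k = 3`, and the synthetic data are all assumptions, and the t-statistic-based scaling mentioned in the abstract is not reproduced.

```r
# Flag cycles (rows) that have at least one time point (column) lying
# more than k scaled MADs from that time point's median.
# k = 3 is an illustrative assumption, not the scaling used in the paper.
flag_mad_outliers <- function(cycles, k = 3) {
  med <- apply(cycles, 2, median)
  spread <- apply(cycles, 2, mad)  # mad() applies the 1.4826 constant by default
  dev <- abs(sweep(cycles, 2, med, "-"))
  exceeds <- sweep(dev, 2, k * spread, ">")
  apply(exceeds, 1, any)
}

# Synthetic data: 38 cycles by 20 time points, each cycle a sine wave
# shifted by a small deterministic offset.
base <- sin(seq(0, 2 * pi, length.out = 20))
cycles <- t(sapply(1:38, function(i) base + 0.01 * ((i %% 5) - 2)))

# Inject one anomalous value into cycle 5 at time point 10
cycles[5, 10] <- cycles[5, 10] + 1

flagged <- flag_mad_outliers(cycles)
print(which(flagged))
```

In practice the cutoff would need to be tuned to each data set, and flagged cycles inspected before deciding whether to retain or remove them, as the abstract recommends.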
Percentage (%) and number (n) of missing values in the explanatory variables...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated May 29, 2024
Cite
Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner (2024). Percentage (%) and number (n) of missing values in the explanatory variables and outcome by measurement occasion and sex. [Dataset]. http://doi.org/10.1371/journal.pone.0295726.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0295726.t004
Dataset updated
May 29, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Lara Lusa; Cécile Proust-Lima; Carsten O. Schmidt; Katherine J. Lee; Saskia le Cessie; Mark Baillie; Frank Lawrence; Marianne Huebner
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
PA: physical activity. Here we show only the first interview data for variables used as time-fixed in the model (height, education and smoking—following the change suggested by IDA) and remove the observations missing by design.