Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data release for the paper "Waveform systematics in identifying gravitationally lensed gravitational waves: Posterior overlap method", which is available at https://arxiv.org/abs/2306.12908.
These results are derived from the gravitational-wave parameter-estimation results by the LIGO-Virgo-KAGRA Collaboration, released with the GWTC-1, GWTC-2, GWTC-2.1, and GWTC-3 catalogs under the following links:
https://dcc.ligo.org/P1800370-v5/public
https://dcc.ligo.org/P2000223-v7/public
https://doi.org/10.5281/zenodo.6513631
https://doi.org/10.5281/zenodo.5546663
For the lensed-unlensed hypothesis test posterior overlap Bayes factors, we provide the following files for event pairs from within each observing run:
blu_all_pairs_O1.txt
blu_all_pairs_O2.txt
blu_all_pairs_O3.txt
In each file, the column "event_pair" contains the names of the two events from the pair sorted chronologically, the column "data_releases" contains the names of the data releases from which the posterior samples of each event were taken, the column "waveform" contains the name of the waveform model used in the parameter estimation for both sets of posteriors, and the column "log10blu" contains the log10 of the Bayes factors.
Differences between runs for the same event pair (restricted to O1-O1, O2-O2, and O3-O3 pairs where at least one run gave log10blu>0) are given in the file "blu_differences_pairs_with_log10blu_pos.txt". The column "event_pair" contains the event pairs, the columns "waveform_{1,2}" contain the names of the waveform models used in the parameter estimation for both sets of posteriors, the columns "data_releases_{1,2}" contain the data releases from which the posterior samples of each event were taken, the columns "log10blu_{1,2}" contain the log10 Bayes factors, and the column "difference" contains the difference between "log10blu_1" and "log10blu_2".
We also provide the following files corresponding to the appendix of the paper, analyzing overlaps between posterior samples for individual events:
overlap_different_runs.txt
overlap_same_run.txt
rescaled_difference_single_event.txt
The file "overlap_different_runs.txt" contains Bayes factors for a single event, but comparing the posteriors from different runs. The file "overlap_same_run.txt" contains Bayes factors for the overlap of a single run on a single event with itself. The file "rescaled_difference_single_event.txt" contains the difference between the results contained in the file overlap_different_runs.txt and the results in overlap_same_run.txt, taking the ones that produce the biggest difference, as per equation (A.1) in the paper.
In these files, the column "event_name" is the name of the event, the column "data_release" or "data_releases" contains the name(s) of the data release(s) from which the posterior samples of each run were taken, the column "waveform" or "waveform_pair" contains the name(s) of the waveform model(s) used, and the column "log10blu" is the log10 Bayes factor obtained. In the file "rescaled_difference_single_event.txt", the columns "max_run_waveform" and "max_run_data_release" identify an entry from the "overlap_same_run.txt" file from which we use the "log10blu" to compute the value listed in the "difference" column using equation (A.1).
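As a minimal usage sketch (assuming the files are whitespace-delimited text tables with a header row matching the columns described above; adjust the separator if the actual files differ), event pairs favouring the lensed hypothesis can be selected with pandas in Python:

import pandas as pd

# Read one of the per-run Bayes factor tables; we assume whitespace-delimited
# columns named event_pair, data_releases, waveform, log10blu as described above.
pairs = pd.read_csv("blu_all_pairs_O3.txt", sep=r"\s+")

# Keep pairs with positive log10 Bayes factor and sort by decreasing support.
favoured = pairs[pairs["log10blu"] > 0].sort_values("log10blu", ascending=False)
print(favoured[["event_pair", "waveform", "log10blu"]])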
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open access and open data are becoming more prominent on the global research agenda. Funders are increasingly requiring grantees to deposit their raw research data in appropriate public archives or stores in order to facilitate the validation of results and further work by other researchers.
While the rise of open access has fundamentally changed the academic publishing landscape, the policies around data are reigniting the conversation about what universities can and should be doing to protect the assets generated at their institutions. The main difference between an open access policy and an open data policy is that, for data, there is no established precedent for how academia disseminates research outputs that do not take the form of a traditional ‘paper’ publication.
As governments and funders of research see the benefit of open content, the creation of recommendations, mandates and enforcement of mandates are coming thick and fast.
The minute weather dataset comes from the same source as the daily weather dataset used in the decision-tree-based classifier notebook. The main difference between the two datasets is that the minute weather dataset contains raw sensor measurements captured at one-minute intervals, whereas the daily weather dataset contained processed and well-curated data. The data are in the file minute_weather.csv, which is a comma-separated file. As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data were collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions were captured.
Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:
rowID: unique number for each row (Unit: NA)
hpwren_timestamp: timestamp of measurement (Unit: year-month-day hour:minute:second)
air_pressure: air pressure measured at the timestamp (Unit: hectopascals)
air_temp: air temperature measured at the timestamp (Unit: degrees Fahrenheit)
avg_wind_direction: wind direction averaged over the minute before the timestamp (Unit: degrees, with 0 meaning the wind comes from the north, increasing clockwise)
avg_wind_speed: wind speed averaged over the minute before the timestamp (Unit: meters per second)
max_wind_direction: highest wind direction in the minute before the timestamp (Unit: degrees, with 0 being north, increasing clockwise)
max_wind_speed: highest wind speed in the minute before the timestamp (Unit: meters per second)
min_wind_direction: smallest wind direction in the minute before the timestamp (Unit: degrees, with 0 being north, increasing clockwise)
min_wind_speed: smallest wind speed in the minute before the timestamp (Unit: meters per second)
rain_accumulation: amount of accumulated rain measured at the timestamp (Unit: millimeters)
rain_duration: length of time rain has fallen, as measured at the timestamp (Unit: seconds)
relative_humidity: relative humidity measured at the timestamp (Unit: percent)
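As a minimal usage sketch in Python (column names follow the variable list above):

import pandas as pd

# Load the one-minute weather measurements; hpwren_timestamp holds the measurement time.
weather = pd.read_csv("minute_weather.csv", parse_dates=["hpwren_timestamp"])

# Example: hourly means of air temperature and relative humidity, a typical
# first step before feeding the raw minute data into a clustering workflow.
hourly = (weather.set_index("hpwren_timestamp")[["air_temp", "relative_humidity"]]
          .resample("1h").mean())
print(hourly.head())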
🎬 Description
This dataset combines information about movies from various IMDb and Kinopoisk top lists. It was created for a comparative analysis of ratings, genres, countries of production, and other characteristics of popular films from two of the world's largest movie databases.
The dataset can be useful for researchers, analysts, and movie enthusiasts who want to explore the differences between Russian-speaking and international audiences’ preferences, as well as to identify patterns between ratings, budgets, and genres.
📂 Dataset Structure
Title – movie title (string)
kinopoiskId – unique movie identifier on Kinopoisk (integer or string)
imdbId – unique movie identifier on IMDb (string)
Year – year of release (integer)
Rating Kinopoisk – movie rating according to Kinopoisk (float from 0 to 10)
Rating Imdb – movie rating according to IMDb (float from 0 to 10)
Age Limit – age restriction (e.g., "6+", "12+", "18+")
Genres – movie genres (string or list of genres separated by commas)
Country – country or countries of production (string)
Director – name of the director (string)
Budget – movie budget in USD (integer)
Fees – box office revenue in USD (integer)
Description Kinopoisk – short movie description from Kinopoisk (in Russian)
Description Imdb – short movie description from IMDb (in English)
📊 Possible Analysis Directions
Comparing movie ratings between Kinopoisk and IMDb;
Analyzing the most popular genres and their evolution over time;
Studying the relationship between ratings, budgets, and box office revenue;
Comparing audience preferences across different countries.
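As a minimal sketch of the first analysis direction in Python (the file name is a placeholder; the column names follow the structure listed above):

import pandas as pd

# Load the dataset; "kinopoisk_imdb_movies.csv" is a hypothetical file name.
movies = pd.read_csv("kinopoisk_imdb_movies.csv")

# Example: mean rating difference between Kinopoisk and IMDb by release year.
movies["rating_diff"] = movies["Rating Kinopoisk"] - movies["Rating Imdb"]
print(movies.groupby("Year")["rating_diff"].mean().sort_index())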
https://spdx.org/licenses/etalab-2.0.html
In the frame of the QUAE project, an identification procedure was developed to sort singular behaviours in river temperature time series. This procedure was conceived as a tool to identify particular behaviours in time series despite non-continuous measurements and regardless of the type of measurement (temperature, streamflow, ...). Three types of singularities are identified: extreme values (in some cases similar to outliers), roughened data (such as the difference between water temperature and air temperature) and buffered data (such as signals caused by groundwater inflows).
This archive contains the logistic mapping output data at the conceptual well locations. Data are provided in spreadsheets containing the estimated probabilities of nitrate concentrations greater than 2 milligrams per liter at hypothetical 150-foot and 300-foot-deep wells for each of the five-year categories from 2000 to 2019, and vulnerability differences between five-year categories when one or both of the predicted probabilities was equal to or greater than 50 percent.
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
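The paper's algorithm itself is not reproduced in this description; purely as a loose, generic illustration of the idea of centralizing only a small sample from each location, the following Python sketch uses a simple Mahalanobis-distance detector and made-up site data (none of this is the paper's method):

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for data partitions held at four different locations.
sites = [rng.normal(0, 1, size=(100_000, 3)) for _ in range(4)]

# Each site sends only a small random sample to the coordinator.
samples = [s[rng.choice(len(s), size=500, replace=False)] for s in sites]
pooled = np.vstack(samples)

# The coordinator builds a simple global reference (mean and covariance)
# from the pooled sample and broadcasts it back to the sites.
mean = pooled.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(pooled, rowvar=False))

def mahalanobis(x):
    # Distance of each row of x from the global reference.
    d = x - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

# Each site flags its own outliers locally against the shared reference,
# without ever shipping the full data to one place. Threshold is illustrative.
threshold = 4.0
local_outliers = [np.where(mahalanobis(s) > threshold)[0] for s in sites]
print([len(o) for o in local_outliers])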
The ability to grow safe, fresh food to supplement packaged foods of astronauts in space has been an important goal for NASA. Food crops grown in space experience different environmental conditions than plants grown on Earth (e.g., reduced gravity, elevated radiation levels). To study the effects of space conditions, red romaine lettuce, Lactuca sativa cv ‘Outredgeous,’ plants were grown in Veggie plant growth chambers on the International Space Station (ISS) and compared with ground-grown plants. Multiple plantings were grown on ISS and harvested using either a single, final harvest, or sequential harvests in which several mature leaves were removed from the plants at weekly intervals. Ground controls were grown simultaneously with a 24–72 h delay using ISS environmental data. Food safety of the plants was determined by heterotrophic plate counts for bacteria and fungi, as well as isolate identification using samples taken from the leaves and roots. Molecular characterization was conducted using Next Generation Sequencing (NGS) to provide taxonomic composition and phylogenetic structure of the community. Leaves were also analyzed for elemental composition, as well as levels of phenolics, anthocyanins, and Oxygen Radical Absorbance Capacity (ORAC). Comparison of flight and ground tissues showed some differences in total counts for bacteria and yeast/molds (2.14 – 4.86 log10 CFU/g), while screening for select human pathogens yielded negative results. Bacterial and fungal isolate identification and community characterization indicated variation in the diversity of genera between leaf and root tissue, with diversity being higher in root tissue, and included differences in the dominant genera. The only difference between ground and flight experiments was seen in the third experiment, VEG-03A, with significant differences in the genera from leaf tissue. Flight and ground tissue showed differences in Fe, K, Na, P, S, and Zn content and total phenolic levels, but no differences in anthocyanin and ORAC levels. This study indicated that leafy vegetable crops can produce safe, edible, fresh food to supplement the astronauts’ diet, and provides baseline data for continual operation of the Veggie plant growth units on ISS.
This dataset contains all data and code necessary to reproduce the analysis presented in the manuscript: Winzeler, H.E., Owens, P.R., Read, Q.D., Libohova, Z., Ashworth, A., Sauer, T. 2022. Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization. Land 11:2018. DOI: 10.3390/land11112018.
There are several steps to this analysis; the relevant scripts for each are listed below. The first step is to use the raw digital elevation model (DEM) to produce different versions of the topographic wetness index (TWI) for the study region (Calculating TWI). Then, these TWI output files are processed, along with soil moisture (volumetric water content, VWC) time series data from a number of sensors located within the study region, to create analysis-ready data objects (Processing TWI and VWC). Next, models are fit relating TWI to soil moisture (Model fitting) and results are plotted (Visualizing main results). A number of additional analyses were also done (Additional analyses).
Input data
The DEM of the study region is archived in this dataset as SourceDem.zip. This contains the DEM of the study region (DEM1.sgrd) and associated auxiliary files, all called DEM1.* with different extensions. In addition, the DEM is provided as a .tif file called USGS_one_meter_x39y400_AR_R6_WashingtonCO_2015.tif. The remaining data and code files are archived in the repository created with a GitHub release on 2022-10-11, twi-moisture-0.1.zip. The data are found in a subfolder called data.
2017_LoggerData_HEW.csv through 2021_HEW.csv: soil moisture (VWC) logger data for each year 2017-2021 (5 files total).
2882174.csv: weather data from a nearby station.
DryPeriods2017-2021.csv: starting and ending days for dry periods 2017-2021.
LoggerLocations.csv: geographic locations and metadata for each VWC logger.
Logger_Locations_TWI_2017-2021.xlsx: 546 topographic wetness indexes calculated at each VWC logger location. Note: this is intermediate input created in the first step of the pipeline.
Code pipeline
To reproduce the analysis in the manuscript, run these scripts in the following order. The scripts are all found in the root directory of the repository. See the manuscript for more details on the methods.
Calculating TWI
TerrainAnalysis.R: taking the DEM file as input, calculates 546 different topographic wetness indexes using a variety of different algorithms. Each algorithm is run multiple times with different input parameters, as described in more detail in the manuscript. After performing this step, it is necessary to use the SAGA-GIS GUI to extract the TWI values for each of the sensor locations. The output generated in this way is included in this repository as Logger_Locations_TWI_2017-2021.xlsx; therefore it is not necessary to rerun this step of the analysis, but the code is provided for completeness.
Processing TWI and VWC
read_process_data.R: takes raw TWI and moisture data files and processes them into analysis-ready format, saving the results as CSV.
qc_avg_moisture.R: does additional quality control on the moisture data and averages it across different time periods.
Model fitting
Models were fit regressing soil moisture (average VWC for a certain time period) against a TWI index, with and without soil depth as a covariate. In each case, for both the model without depth and the model with depth, prediction performance was calculated with and without spatially-blocked cross-validation.
Where cross-validation wasn't used, we simply used the predictions from the model fit to all the data.
fit_combos.R: models were fit to each combination of soil moisture averaged over 57 months (all months from April 2017 to December 2021) and 546 TWI indexes. In addition, models were fit to soil moisture averaged over years, and to the grand mean across the full study period.
fit_dryperiods.R: models were fit to soil moisture averaged over previously identified dry periods within the study period (each 1 or 2 weeks in length), again for each of the 546 indexes.
fit_summer.R: models were fit to the soil moisture average for the months of June-September for each of the five years, again for each of the 546 indexes.
Visualizing main results
Preliminary visualization of results was done in a series of RMarkdown notebooks. All the notebooks follow the same general format, plotting model performance (observed-predicted correlation) across different combinations of time period and characteristics of the TWI indexes being compared. The indexes are grouped by SWI versus TWI, DEM filter used, flow algorithm, and any other parameters that varied. The notebooks show the model performance metrics with and without the soil depth covariate, and with and without spatially-blocked cross-validation. Crossing those two factors, there are four values of model performance for each combination of time period and TWI index presented.
performance_plots_bymonth.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by month across the five years of data to show within-year trends.
performance_plots_byyear.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by year to show trends across multiple years.
performance_plots_dry_periods.Rmd: prediction performance was presented for the models fit to the previously identified dry periods.
performance_plots_summer.Rmd: prediction performance was presented for the models fit to the June-September moisture averages.
Additional analyses
Some additional analyses were done that may not be published in the final manuscript but are included here for completeness.
2019dryperiod.Rmd: analysis, done separately for each day, of a specific dry period in 2019.
alldryperiodsbyday.Rmd: analysis, done separately for each day, of the same dry periods discussed above.
best_indices.R: after fitting models, this script was used to quickly identify some of the best-performing indexes for closer scrutiny.
wateryearfigs.R: exploratory figures showing the median and quantile interval of VWC for sensors in low and high TWI locations for each water year.
Resources in this dataset:
Resource Title: Digital elevation model of study region. File Name: SourceDEM.zip. Resource Description: .zip archive containing digital elevation model files for the study region. See dataset description for more details.
Resource Title: twi-moisture-0.1: archived git repository containing all other necessary data and code. File Name: twi-moisture-0.1.zip. Resource Description: .zip archive containing all data and code, other than the digital elevation model archived as a separate file. This file was generated by a GitHub release made on 2022-10-11 of the git repository hosted at https://github.com/qdread/twi-moisture (private repository). See the dataset description and the README file contained within this archive for more details.
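The modelling itself is implemented in the R scripts listed above. Purely as an illustrative sketch of the core evaluation step (fitting soil moisture against a TWI index plus soil depth and scoring by the observed-predicted correlation under spatially-blocked cross-validation), here is a minimal Python analogue using made-up column names and synthetic values, not the repository's code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

# Hypothetical analysis-ready table: one row per sensor, with a TWI value,
# sensor depth, average VWC for one time period, and a spatial block label.
df = pd.DataFrame({
    "twi":   np.random.rand(60) * 10,
    "depth": np.random.choice([10, 30, 50], size=60),
    "vwc":   np.random.rand(60) * 0.4,
    "block": np.repeat(np.arange(6), 10),   # spatial blocks for cross-validation
})

X, y, groups = df[["twi", "depth"]], df["vwc"], df["block"]

# Spatially-blocked cross-validation: each block is held out in turn.
pred = cross_val_predict(LinearRegression(), X, y, groups=groups,
                         cv=LeaveOneGroupOut())

# Model performance as the observed-predicted correlation.
print("obs-pred r:", np.corrcoef(y, pred)[0, 1])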
https://spdx.org/licenses/CC0-1.0.html
Background
The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results
The method we define as SeSAMe 2, which consists of the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
Methods
Study Participants and Samples
The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a census-based cohort of elderly residents of the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 as part of a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.
All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).
Blood Collection and Processing
Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point) due to discontinuation of the equipment, but using the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing (WGS) data are also available for the samples described above.
Characterization of DNA Methylation using the EPIC array
Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).
Processing and Analysis of DNA Methylation Data
The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in the two analyses were combined and removed from the data.
Normalization Methods Evaluated
The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested; for both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first scenario, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This was followed by further removal of probes that did not pass the previous QC and had not already been removed by pOOBAH; therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|Δβ|) between replicated samples.
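The original comparison was carried out with the R/Bioconductor packages listed above. As a schematic Python sketch of the replicate-agreement metric (per-probe absolute beta-value difference between technical replicate pairs), using random placeholder values in place of real normalized beta matrices:

import numpy as np
import pandas as pd

# Two beta-value matrices (probes x samples), one column per member of each of
# the 16 replicate pairs; random values stand in for normalized betas here.
probes = [f"cg{i:08d}" for i in range(1000)]
rep1 = pd.DataFrame(np.random.rand(1000, 16), index=probes)
rep2 = pd.DataFrame(np.random.rand(1000, 16), index=probes)

# Absolute beta-value difference per probe, averaged over the replicate pairs.
# Lower values indicate better agreement; this quantity is compared across
# normalization methods.
abs_diff = (rep1 - rep2).abs().mean(axis=1)
print(abs_diff.describe())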
A performance comparison between data structures.
Background
Post-exercise muscle soreness is a dull, aching sensation that follows unaccustomed muscular exertion. Primarily on the basis of previous laboratory-based research on eccentric exercise, soreness is usually said to follow an inverted U-shaped curve over time, peaking 24-48 hours after exercise. As such, it is often described as "delayed-onset" muscle soreness. In a study of long-distance runners, however, soreness seemed to peak immediately and then reduce gradually over time. This study is a secondary analysis of clinical trial data that aims to determine whether the time course of soreness following a natural exercise, long-distance running, is different from that following a laboratory-based exercise, bench-stepping.
Methods
This is a reanalysis of data from three previous clinical trials. The trials included 400 runners taking part in long-distance races and 82 untrained volunteers performing a bench-stepping test. Subjects completed a Likert scale of muscle soreness every morning and evening for the five days following their exercise.
Results
The interaction between trial and time is highly significant, suggesting a different time course of soreness following running and bench-stepping. 45% of subjects in the bench-stepping trial experienced peak soreness at the third or fourth follow-up (approximately 36-48 hours after exercise), compared to only 14% of those in the running trial. The difference between groups is robust to multivariate analysis incorporating possible confounding variables.
Conclusion
Soreness in runners following long-distance running follows a different time course to that in untrained individuals undertaking bench-stepping. Research on exercise taking place in the laboratory context does not necessarily generalize to exercise undertaken by trained athletes when engaged in their chosen sport.
Journal policies: a file giving the data archiving policies from the journals covered in the study.
Data request protocol: the sequence of emails used to request data from authors.
Vines_et_al_Rcode_4th_Jan: the R code used in the statistical analyses.
Vinesetal_data_4th Jan: the data used in the statistical analyses.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Priest Map Series {title at top of page}
Data developers: Burhans, Molly A., Cheney, David M., Emege, Thomas, Gerlt, R. “Priest Map Series {title at top of page}”. Scale not given. Version 1.0. MO and CT, USA: GoodLands Inc., Catholic Hierarchy, Environmental Systems Research Institute, Inc., 2019.
Web map developer: Molly Burhans, October 2019
Web app developer: Molly Burhans, October 2019
GoodLands’ polygon data layers, version 2.0, for global ecclesiastical boundaries of the Roman Catholic Church:
Although care has been taken to ensure the accuracy, completeness and reliability of the information provided, because this is the first developed dataset of global ecclesiastical boundaries curated from many sources, it may have a higher margin of error than established geopolitical administrative boundary maps. Boundaries need to be verified with appropriate ecclesiastical leadership. The current information is subject to change without notice. No parties involved with the creation of this data are liable for indirect, special or incidental damage resulting from, arising out of or in connection with the use of the information.
We referenced 1960 sources to build our global datasets of ecclesiastical jurisdictions. Often, they were isolated images of dioceses, historical documents and information about parishes that were cross-checked. These sources can be viewed here: https://docs.google.com/spreadsheets/d/11ANlH1S_aYJOyz4TtG0HHgz0OLxnOvXLHMt4FVOS85Q/edit#gid=0
To learn more or contact us, please visit: https://good-lands.org/
The Catholic Leadership global maps information is derived from the Annuario Pontificio, which is curated and published annually by the Vatican Statistics Office and digitized by David Cheney at Catholic-Hierarchy.org; updates are supplemented with diocesan and news announcements. GoodLands maps this information into global ecclesiastical boundaries.
Admin 3 Ecclesiastical Territories:
Burhans, Molly A., Cheney, David M., Gerlt, R. “Admin 3 Ecclesiastical Territories For Web”. Scale not given. Version 1.2. MO and CT, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2019.
Derived from Global Diocesan Boundaries:
Burhans, M., Bell, J., Burhans, D., Carmichael, R., Cheney, D., Deaton, M., Emge, T., Gerlt, B., Grayson, J., Herries, J., Keegan, H., Skinner, A., Smith, M., Sousa, C., Trubetskoy, S. “Diocesean Boundaries of the Catholic Church” [Feature Layer]. Scale not given. Version 1.2. Redlands, CA, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2016.
Using: ArcGIS 10.4. Version 10.0. Redlands, CA: Environmental Systems Research Institute, Inc., 2016.
Boundary Provenance, Statistics and Leadership Data:
Cheney, D.M. “Catholic Hierarchy of the World” [Database]. Date updated: August 2019. Catholic Hierarchy. Using: Paradox. Retrieved from original source.
Annuario Pontificio per l’Anno .. Città del Vaticano: Tipografia Poliglotta Vaticana, multiple years.
The data for these maps was extracted from the gold standard of Church data, the Annuario Pontificio, published yearly by the Vatican. The collection and data development methods of the Vatican Statistics Office are unknown. GoodLands is not responsible for errors within this data. We encourage people to document and report errant information to us at data@good-lands.org or directly to the Vatican. Additional information about regular changes in bishops and sees comes from a variety of public diocesan and news announcements.
Cyclistic, a bike-sharing company, wants to analyze their user data to find the main differences in behavior between their two types of users: Casual Riders, who pay for each ride, and Annual Members, who pay a yearly subscription to the service.
Key objectives: 1. Identify The Business Task: Cyclistic wants to analyze the data to find the key differences between Casual Riders and Annual Members. The goal of this project is to reach out to casual riders and incentivize them to pay for the annual subscription.
Key objectives: 1. Download Data And Store It Appropriately: I downloaded the data as .csv files, which were saved in their own folder to keep everything organized. I then uploaded those files into BigQuery for cleaning and analysis. For this project I downloaded all of 2022 and up to May of 2023, as this is the most recent data I have access to.
Identify How It's Organized
Sort and Filter The Data and Determine The Credibility of The Data
Key objectives: 1. Clean The Data and Prepare The Data For Analysis: I used some simple SQL code to determine that no members were missing, that no information was repeated, and that there were no misspellings in the data.
-- No misspellings in either "member" or "casual"; this ensures that no results will have missing information.
SELECT
DISTINCT member_casual
FROM
table
-- This shows how many casual riders and members used the service; the counts should add up to the number of rows in the dataset.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM table
GROUP BY member_type

-- Shows that every ride has a distinct ID.
SELECT DISTINCT ride_id FROM table

-- Shows that there are no typos in the types of bikes, so no data will be missing from results.
SELECT DISTINCT rideable_type FROM table
Key objectives: 1. Aggregate Your Data So It's Useful and Accessible: I had to write some SQL code to combine all the data from the different files I had uploaded to BigQuery.
SELECT rideable_type, started_at, ended_at, member_casual FROM table_1 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_2 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_3 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_4 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_5 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_6 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_7 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_8 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_9 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_10 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_11 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_12 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_13 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_14 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_15 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_16 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_17
-- This shows how many casual riders and annual members used bikes.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM aggregate_data_table
GROUP BY member_type
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How should one run in order to catch a projectile, such as a baseball, that is flying through the air for a long period of time? The question of the best solution to the ball-catching problem has been the subject of intense scientific debate for almost 50 years. It turns out that this debate is not focused on the ball-catching problem alone, but revolves around the research question of what constitutes the ingredients of intelligent decision making. Over time, two opposing views have emerged: the generalist view, which regards intelligence as the ability to solve any task without knowing goal and environment in advance, based on optimal decision making using predictive models; and the specialist view, which argues that intelligent decision making does not have to be based on predictive models and does not even have to be optimal, advocating simple and efficient rules of thumb (heuristics) as superior for enabling accurate decisions. We study two types of approaches to the ball-catching problem, one for each view, and investigate their properties using both a theoretical analysis and a broad set of simulation experiments. Our study shows that neither of the two types of approaches can be regarded as superior in solving all relevant variants of the ball-catching problem: each approach is optimal under a different realistic environmental condition. Therefore, predictive models neither guarantee nor prevent success a priori, and we further show that the key difference between the generalist and the specialist approach to ball catching is the type of input representation used to control the agent. From this finding, we conclude that the right solution to a decision-making or control problem is orthogonal to the generalist and specialist approach, and thus requires a reconciliation of the two views in favor of a representation-centric view.
Pulse wave velocity (PWV) has been recommended as an arterial damage assessment tool and a surrogate of arterial stiffness. However, current technology does not allow PWV to be measured both continuously and in real time. We reported previously that peripherally measured ejection time (ET) overestimates ET measured centrally; this difference in ET is associated with the inherent vascular properties of the vessel. In the current study we examined ETs derived from plethysmography simultaneously at different peripheral locations and examined the influence of the underlying arterial properties on ET prolongation by changing the subject’s position. We calculated the ET difference between two peripheral locations (ΔET) and its corresponding PWV for the same heartbeat. The ΔET increased with a corresponding decrease in PWV. The difference between ΔET in the supine and standing positions (which we call the ET index) was higher in young subjects with low mean arterial pressure and low PWV. These results suggest that the difference in ET between two peripheral locations in the supine vs. standing positions represents the underlying vascular properties. We propose ΔET in the supine position as a potential novel real-time, continuous and non-invasive parameter of vascular properties, and the ET index as a potential non-invasive parameter of vascular reactivity.
The NAMMA Lightning ZEUS dataset consists of World-ZEUS Long Range Lightning Monitoring Network data obtained from radio atmospheric signals recorded at thirteen ground stations spread across the European and African continents and Brazil, from August 1, 2006 to October 1, 2006. Lightning activity occurring over a large part of the globe is continuously monitored at varying spatial accuracy (e.g. 10-20 km within and >50 km outside the network periphery) and high temporal (1 msec) resolution. Timing is determined from the arrival time difference between the time series recorded by pairs of receivers. These data files were generated during support of the NASA African Monsoon Multidisciplinary Analyses (NAMMA) campaign, a field research investigation sponsored by the Science Mission Directorate of the National Aeronautics and Space Administration (NASA). This mission was based in the Cape Verde Islands, 350 miles off the coast of Senegal in west Africa. Commencing in August 2006, NASA scientists employed surface observation networks and aircraft to characterize the evolution and structure of African Easterly Waves (AEWs) and Mesoscale Convective Systems over continental western Africa, and their associated impacts on regional water and energy budgets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation
Song’s Vegetation Continuous Fields (VCF) product, based on AVHRR satellite data, is the longest time series of its type, but lacks updates past 2016 due to the extensive degradation of the sensor. We used machine learning to extend this time series using data from the Copernicus Land Cover dataset, which provides per-pixel proportions of different land cover classes between 2015 and 2019. In addition, we included MODIS VCF data.
Content
This repository contains the infrastructure used to model Song-like VCF data past 2016. This infrastructure includes a yaml file that configures the modelling framework (e.g. variables, directories, hyper-parameter tuning) and interacts with a standardized folder structure.
Modelling approach
Song's VCF dataset includes data on generic categories, namely “tree cover”, “non-tree vegetation”, and “non vegetated”. Given the Copernicus dataset has a higher thematic detail, we first aggregated these data into comparable classes. We created a “Non-tree vegetation” layer (i.e. total per-pixel proportion of crops, grasses, shrubs, and mosses), and a “Non Vegetated” layer (i.e. total per-pixel proportion of bare land, permanent water, urban, and snow). Independent data on “Tree cover” was already present.
We then constructed a Random Forest Regression (RFReg) model to predict Song-like VCF layers between 2016 and 2019. The predictions were informed by variables on topography, climate, and fires (which limit the density of vegetation), and by variables on differences between the Copernicus VCF and MODIS-based VCF data. Because MODIS data is available past 2016, its inclusion informs our models on how MODIS data, and their differences compared to Copernicus data, relate to the values reported in Song's data.
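The actual model configuration lives in the repository's yaml file. Purely as an illustrative sketch of the kind of model involved (a random forest regression on per-pixel predictors), with synthetic stand-in data and made-up predictor names, not the repository's code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sample table: one row per sampled pixel, with
# illustrative predictors (topography, climate, fires, Copernicus/MODIS VCF
# differences) and the Song-like "tree cover" fraction as the target.
rng = np.random.default_rng(1)
n = 5000
samples = pd.DataFrame({
    "elevation": rng.uniform(0, 3000, n),
    "mean_temp": rng.uniform(-5, 30, n),
    "burned_frac": rng.uniform(0, 1, n),
    "copernicus_tree": rng.uniform(0, 100, n),
    "modis_tree_diff": rng.normal(0, 10, n),
})
samples["tree_cover"] = (0.8 * samples["copernicus_tree"]
                         + 0.5 * samples["modis_tree_diff"]
                         + rng.normal(0, 5, n)).clip(0, 100)

X = samples.drop(columns="tree_cover")
y = samples["tree_cover"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print("held-out R2:", rf.score(X_test, y_test))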
Sampling scheme
For each VCF category, we collected samples on a country-by-country basis. Within each country, we estimated the difference in percent cover between Song's and the Copernicus VCF data, and sampled across a gradient of differences, from -100% (no cover in AVHRR and full cover in Copernicus) to +100% (full cover in AVHRR and no cover in Copernicus). We iterated through this range in intervals of 10% and sampled across a gradient of “tree cover”, “non-tree vegetation”, and “non vegetated”, in intervals of 10% from 0% to 100%. We collected at least one sample per 50 km2 in 2016, the last year in which all VCF-related variables (Song's, Copernicus, MODIS) are available simultaneously. The number of samples attributed to each range of differences is proportional to the area covered by that range within the country of reference. The sampling approach was repeated for each VCF class, and the outputs were later combined into a single set of samples excluding duplicates, resulting in 238,052 samples.
Validation
The model outputs were validated using leave-one-out cross-validation. For each VCF class, the validation framework iterates through each country where samples were collected, excluding it for validation and using the remaining samples to train an RFReg model. This resulted in R2 values of 0.91, 0.87 and 0.91 for “tree cover”, “non-tree vegetation”, and “non vegetated”, respectively, with corresponding RMSE values of 2.31%, 3.05%, and 2.25%.
The model was also applied to data from 2015, which was used neither to train nor to validate our models. A comparison of the 2015 Song data against our predictions, covering 8,764,232 pixels, yielded R2 values of 0.94, 0.91, and 0.97, with RMSE values of 6.65%, 8.92%, and 5.96%. Additionally, we compared changes between 2015 and 2016, resulting in RMSE values of 2.83%, 3.69%, and 2.57%.
Post-processing
When observing annual VCF time series based on Song's data, we noted that our predictions were most plausible for “tree cover” and “non-tree vegetation”. In turn, our “non vegetated” predictions appear underestimated (see "temporal_trend_check.png"), showing large year-to-year decreases in cover (-3.05% between 2016 and 2017, compared to -0.14% for “tree cover” and -0.26% for “non-tree vegetation”). To address this issue, we recommend deriving the “non vegetated” cover by computing the difference between 100% and the sum of “tree cover” and “non-tree vegetation”.
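Concretely, the recommended derivation is a simple per-pixel identity; as a small sketch with illustrative values:

import numpy as np

# Illustrative per-pixel cover fractions (percent) predicted by the models.
tree_cover = np.array([55.0, 10.0, 0.0])
non_tree_vegetation = np.array([30.0, 70.0, 5.0])

# Recommended derivation of the "non vegetated" fraction, clipped to [0, 100]
# in case the two predicted fractions slightly overshoot 100%.
non_vegetated = np.clip(100.0 - (tree_cover + non_tree_vegetation), 0.0, 100.0)
print(non_vegetated)  # [15. 20. 95.]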
By City of Chicago
This public health dataset contains a comprehensive selection of indicators related to natality, mortality, infectious disease, lead poisoning, and economic status for Chicago community areas. It is an invaluable resource for those interested in understanding the current state of public health within each area, in order to identify deficiencies or areas needing improvement.
The data include 27 indicators, such as birth and death rates, percentages of prenatal care beginning in the first trimester, preterm birth rates, breast cancer incidence per hundred thousand female population, all-sites cancer rates per hundred thousand population, and more. Each indicator is reported by geographical region, so analyses can be made regarding trends at a local level. The dataset also allows various stakeholders to measure performance along these indicators or to compare different community areas side by side.
This dataset provides a valuable tool for those striving toward better public health outcomes for the citizens of Chicago's communities, allowing greater insight into trends specific to geographic regions that could lead to further research and implementation practices based on empirical evidence gathered from this comprehensive yet digestible selection of indicators.
In order to use this dataset effectively to assess the public health of a given area or areas in the city:
- Understand which data are available: the list of data included in this dataset can be found above. It is important to know all of the indicators, as well as their definitions, so that accurate conclusions can be drawn when using the data for research or analysis.
- Identify areas of interest: once you are familiar with the available data, identify which community areas you would like to study more closely or compare with one another.
- Choose your variables: once you have identified your areas, decide which variables are most relevant to your studies and frame specific questions about these variables based on what you are trying to learn from this data set.
- Analyze the data: once your variables have been selected and clarified, dive into analyzing the corresponding values across different community areas using statistical tests such as t-tests or correlations. This will help answer questions like “Are there significant differences between two outputs?”, allowing you to compare how different Chicago community areas stack up against each other with regard to the public health statistics tracked by this dataset.
- Creating interactive maps that show data on public health indicators by Chicago community area to allow users to explore the data more easily.
- Designing a machine learning model to predict future variations in public health indicators by Chicago community area such as birth rate, preterm births, and childhood lead poisoning levels.
- Developing an app that enables users to search for public health information in their own community areas and compare with other areas within the city or across different cities in the US
If you use this dataset in your research, please credit the original authors.
See the dataset description for more information.
File: public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv

| Column name | Description |
|:---|:---|
| Community Area | Unique identifier for each community area in Chicago. (Integer) |
| Community Area Name | Name of the community area in Chicago. (String) |
| Birth Rate | Number of live births per 1,000 population. (Float) |
| General Fertility Rate | Number of live births per 1,000 women aged 15-44. (Float) |
...
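As a minimal usage sketch in Python (column names follow the data dictionary above):

import pandas as pd

# Load the indicator table.
csv = "public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv"
health = pd.read_csv(csv)

# Example: the five community areas with the highest birth rate, and the
# city-wide correlation between birth rate and general fertility rate.
top5 = health.nlargest(5, "Birth Rate")[["Community Area Name", "Birth Rate"]]
print(top5)
print(health["Birth Rate"].corr(health["General Fertility Rate"]))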