License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
This Excel file performs a statistical test of whether two ROC curves differ from each other based on the Area Under the Curve (AUC). You'll need the coefficient from the table presented in the following article to enter the correct value for the comparison: Hanley JA, McNeil BJ (1983) A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148:839-843.
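For readers who want to see the arithmetic the spreadsheet automates, here is a minimal Python sketch of the z-test described by Hanley and McNeil. The function names and example numbers are illustrative only; the correlation coefficient r must still be taken from the table in the 1983 article.

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of a single AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return math.sqrt(var)

def compare_correlated_aucs(auc1, auc2, se1, se2, r):
    """Two-sided z-test for two AUCs derived from the same cases.

    r is the correlation coefficient read from the table in
    Hanley & McNeil (1983)."""
    z = (auc1 - auc2) / math.sqrt(se1**2 + se2**2 - 2 * r * se1 * se2)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the standard normal
    return z, p

# Example with made-up numbers: 50 diseased and 50 healthy cases
se1 = hanley_mcneil_se(0.85, 50, 50)
se2 = hanley_mcneil_se(0.78, 50, 50)
# r = 0.45 is invented here; in practice read it from the Hanley & McNeil table
z, p = compare_correlated_aucs(0.85, 0.78, se1, se2, r=0.45)
print(f"z = {z:.3f}, p = {p:.4f}")
```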
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/ (license information derived automatically)
The receiver operating characteristic (ROC) curve is typically employed to evaluate the discriminatory capability of a continuous or ordinal biomarker when two groups are to be distinguished, commonly the 'healthy' and the 'diseased'. There are cases in which the disease status has three categories. Such cases employ the ROC surface, which is a natural generalization of the ROC curve to three classes. In this paper, we explore new methodologies for comparing two continuous biomarkers that refer to a trichotomous disease status, when both markers are applied to the same patients. Comparisons based on the volume under the surface have been proposed, but that measure is often not clinically relevant. Here, we focus on comparing two correlated ROC surfaces at given pairs of true classification rates, which are more relevant to patients and physicians. We propose delta-based parametric techniques, power transformations to normality, and bootstrap-based smooth nonparametric techniques to investigate the performance of an appropriate test. We evaluate our approaches through an extensive simulation study and apply them to a real data set from prostate cancer screening.
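The paper's delta-based, transformation and bootstrap procedures are not reproduced here, but a minimal sketch of the empirical volume under the ROC surface (VUS) for three ordered classes may help illustrate what the ROC surface generalizes. The function and simulated data below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def empirical_vus(x1, x2, x3):
    """Empirical volume under the ROC surface for three ordered classes.

    x1, x2, x3 are biomarker values for the three disease categories,
    ordered so that higher values are expected as severity increases.
    VUS estimates P(X1 < X2 < X3) by counting correctly ordered triples.
    """
    x1, x2, x3 = map(np.asarray, (x1, x2, x3))
    lt_12 = x1[:, None] < x2[None, :]                     # shape (n1, n2)
    lt_23 = x2[:, None] < x3[None, :]                     # shape (n2, n3)
    ordered = lt_12.astype(float) @ lt_23.astype(float)   # counts ordered triples
    return ordered.sum() / (len(x1) * len(x2) * len(x3))

rng = np.random.default_rng(0)
healthy  = rng.normal(0.0, 1.0, 80)
moderate = rng.normal(1.0, 1.0, 60)
diseased = rng.normal(2.0, 1.0, 70)
print(empirical_vus(healthy, moderate, diseased))  # well above the chance level of 1/6
```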
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
In our everyday lives, we are required to make decisions based upon our statistical intuitions. Often, these involve the comparison of two groups, such as luxury versus family cars and their suitability. Research has shown that the mean difference affects judgements where two sets of data are compared, but the variability of the data has only a minor influence, if any at all. However, prior research has tended to present raw data as simple lists of values. Here, we investigated whether displaying data visually, in the form of parallel dot plots, would lead viewers to incorporate variability information. In Experiment 1, we asked a large sample of people to compare two fictional groups (children who drank ‘Brain Juice’ versus water) in a one-shot design, where only a single comparison was made. Our results confirmed that only the mean difference between the groups predicted subsequent judgements of how much they differed, in line with previous work using lists of numbers. In Experiment 2, we asked each participant to make multiple comparisons, with both the mean difference and the pooled standard deviation varying across data sets they were shown. Here, we found that both sources of information were correctly incorporated when making responses. Taken together, we suggest that increasing the salience of variability information, through manipulating this factor across items seen, encourages viewers to consider this in their judgements. Such findings may have useful applications for best practices when teaching difficult concepts like sampling variation.
By City of Chicago [source]
This public health dataset contains a comprehensive selection of indicators related to natality, mortality, infectious disease, lead poisoning, and economic status from Chicago community areas. It is an invaluable resource for those interested in understanding the current state of public health within each area in order to identify deficiencies or areas needing improvement.
The data includes 27 indicators such as birth and death rates, the percentage of prenatal care beginning in the first trimester, preterm birth rates, breast cancer incidence per hundred thousand female population, all-site cancer rates per hundred thousand population, and more. For each indicator, the geographical region is detailed so that analyses can be made regarding trends at a local level. Furthermore, this dataset allows various stakeholders to measure performance along these indicators or even compare different community areas side by side.
This dataset provides a valuable tool for those striving toward better public health outcomes for the citizens of Chicago's communities, allowing greater insight into trends specific to geographic regions that could lead to further research and implementation practices based on empirical evidence gathered from this comprehensive yet digestible selection of indicators.
In order to use this dataset effectively to assess the public health of a given area or areas in the city:
- Understand which data is available: The list of data included in this dataset can be found above. It is important to know all that are included as well as their definitions so that accurate conclusions can be made when utilizing the data for research or analysis.
- Identify areas of interest: Once you are familiar with what type of data is present, it can help to identify which community areas you would like to study more closely or compare with one another.
- Choose your variables: Once you have identified your areas, it will be helpful to decide which variables are most relevant for your studies and research specific questions regarding these variables based on what you are trying to learn from this data set.
- Analyze the Data: Once your variables have been selected and clarified, dive right into analyzing the corresponding values across different community areas using statistical tests such as t-tests or correlations (see the sketch below). This will help answer questions like "Are there significant differences between two outputs?", allowing you to compare how different Chicago Community Areas stack up against each other with regard to the public health statistics tracked by this dataset!
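As a starting point, the hedged sketch below loads the CSV named in the file listing further down and runs a correlation and a Welch t-test with pandas and SciPy. The column names are taken from that listing and the group split is an arbitrary placeholder; adjust both to your own question.

```python
import pandas as pd
from scipy import stats

# Filename and column names are taken from the file listing below; adjust if they differ.
df = pd.read_csv(
    "public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv"
).dropna(subset=["Birth Rate", "General Fertility Rate"])

# Correlation between two indicators across all community areas
r, p = stats.pearsonr(df["Birth Rate"], df["General Fertility Rate"])
print(f"Pearson r = {r:.2f} (p = {p:.3f})")

# Welch t-test comparing Birth Rate between two groups of community areas
# (here the first 38 areas vs. the rest, purely as a placeholder split)
group_a = df["Birth Rate"].iloc[:38]
group_b = df["Birth Rate"].iloc[38:]
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch t = {t:.2f} (p = {p:.3f})")
```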
- Creating interactive maps that show data on public health indicators by Chicago community area to allow users to explore the data more easily.
- Designing a machine learning model to predict future variations in public health indicators by Chicago community area such as birth rate, preterm births, and childhood lead poisoning levels.
- Developing an app that enables users to search for public health information in their own community areas and compare with other areas within the city or across different cities in the US
If you use this dataset in your research, please credit the original authors.
See the dataset description for more information.
File: public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv

| Column name | Description |
|:------------|:------------|
| Community Area | Unique identifier for each community area in Chicago. (Integer) |
| Community Area Name | Name of the community area in Chicago. (String) |
| Birth Rate | Number of live births per 1,000 population. (Float) |
| General Fertility Rate | Number of live births per 1,000 women aged 15-44. (Float) |
...
By Throwback Thursday [source]
The dataset contains multiple columns that provide specific information for each year recorded. The column labeled Year indicates the specific year in which the data was recorded. The Pieces of Mail Handled column shows the total number of mail items that were processed or handled in a given year.
Another important metric is represented in the Number of Post Offices column, revealing the total count of post offices that were operational during a specific year. This information helps understand how postal services and infrastructure have evolved over time.
Examining financial aspects, there are two columns: Income and Expenses. The former represents the total revenue generated by the US Mail service in a particular year, while the latter showcases the expenses incurred by this service during that same period.
The dataset titled Week 22 - US Mail - 1790 to 2017.csv serves as an invaluable resource for researchers, historians, and analysts interested in studying trends and patterns within the US Mail system throughout its extensive history. By utilizing this dataset's wide range of valuable metrics, users can gain insights into how mail volume has changed over time alongside fluctuations in post office numbers and financial performance
Familiarize yourself with the columns:
- Year: This column represents the specific year in which data was recorded. It is represented by numeric values.
- Pieces of Mail Handled: This column indicates the number of mail items processed or handled in a given year. It is also represented by numeric values.
- Number of Post Offices: Here, you will find information on the total count of post offices in operation during a specific year. Like other columns, it consists of numeric values.
- Income: The Income column displays the total revenue generated by the US Mail service in a particular year. Numeric values are used to represent this data.
- Expenses: This column shows the total expenses incurred by the US Mail service for a particular year. Similar to other columns, it uses numeric values.
Understand data relationships: By exploring and analyzing different combinations of columns, you can uncover interesting patterns and relationships within mail statistics over time. For example:
Relationship between Year and Pieces of Mail Handled/Number of Post Offices/Income/Expenses: Analyzing these variables over years will allow you to observe trends such as increasing mail volume alongside changes in post office numbers or income and expenses patterns.
Relationship between Pieces of Mail Handled and Number of Post Offices: By comparing these two variables across different years, you can assess whether there is any correlation between mail volume growth and changes in post office counts.
Visualization:
To gain better insight into this vast amount of data, consider making use of graphs or plots beyond purely numerical analysis. You can use tools like Matplotlib, Seaborn, or Plotly to create various types of visualizations (a short sketch follows the list below):
- Time-series line plots: Visualize the change in Pieces of Mail Handled, Number of Post Offices, Income, and Expenses over time.
- Scatter plots: Identify potential correlations between different variables such as Year and Pieces of Mail Handled/Number of Post Offices/Income/Expenses.
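A minimal matplotlib sketch of the two plot types above, assuming the CSV filename given in the description and the column names listed earlier (adjust if the actual header differs):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Week 22 - US Mail - 1790 to 2017.csv")

fig, axes = plt.subplots(2, 1, figsize=(8, 8))

# Time-series line plot of mail volume
axes[0].plot(df["Year"], df["Pieces of Mail Handled"])
axes[0].set_xlabel("Year")
axes[0].set_ylabel("Pieces of Mail Handled")
axes[0].set_title("Mail volume over time")

# Scatter plot relating post office counts to mail volume
axes[1].scatter(df["Number of Post Offices"], df["Pieces of Mail Handled"], s=10)
axes[1].set_xlabel("Number of Post Offices")
axes[1].set_ylabel("Pieces of Mail Handled")
axes[1].set_title("Mail volume vs. post office count")

fig.tight_layout()
plt.show()
```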
Drawing conclusions:
This dataset presents an extraordinary opportunity to learn about the history and evolution of the US Mail service. By examining various factors together or individually throughout time, you can draw conclusions about:
- Trend Analysis: The dataset can be used to analyze the trends and patterns in mail volume, post office numbers, income, and expenses over time. This can help identify any significant changes or fluctuations in these variables and understand the factors that may have influenced them.
- Benchmarking: By comparing the performance of different years or periods, this dataset can be used for benchmarking purposes. For example, it can help assess how efficiently post offices have been handling mail items by comparing the number of pieces of mail handled with the corresponding expenses incurred.
- Forecasting: Based on historical data on mail volume and revenue generation, this dataset can be used for forecasting future trends. This could be valuable for planning purposes, such as determining resource allocation or projecting financial o...
License: https://creativecommons.org/publicdomain/zero/1.0/
By data.gov.ie [source]
This dataset contains data from the East Atlantic SWAN Wave Model, which is a powerful model developed to predict wave parameters in Irish waters. The output features of the model include Significant Wave Height (m), Mean Wave Direction (degreesTrue) and Mean Wave Period (seconds). These predictions are generated with NCEP GFS wind forcing and FNMOC Wave Watch 3 data as boundaries for the wave generation.
The accuracy of this model matters for safety-critical applications as well as for research into changes in tides, currents, and sea levels. Users are provided with up-to-date predictions covering the previous 30 days and 6 days into the future, with download service options that allow selection by date/time, by a single parameter, and by output file type.
The data providers released this dataset under a Creative Commons Attribution 4.0 license on 2017-09-14. It can be used free of charge within certain restrictions set out by its respective author or publisher.
Introduction:
Step 1: Acquire the Dataset:
The first step is getting access to the dataset, which is free of cost. The original source of this data is http://wwave2.marinecstl.org/archive/index?cat=model_height&xsl=download-csv-1. You can also download it as a CSV file from Kaggle's website (https://www.kaggle.com/marinecstl/east-atlantic-swan-wave-model). The download contains seven columns of parameters; time, latitude, longitude, and significant wave height are the most important ones to be familiar with before using this dataset effectively in any project.
Step 2: Understand Data Columns & Parameters:
Now that you have downloaded the data, it's time to understand what each column represents and how the columns relate to each other when comparing datasets from two different locations, whether within one country or across countries. Time holds the daily timestamp of each observation, taken at the exact location specified by the latitude and longitude parameters (higher latitude values indicate locations closer to the North Pole; lower values indicate locations closer to the South Pole). Significant wave height represents the displacement of the ocean surface due to measurable short-period variations caused by tides or waves, by weather differences such as wind forcing, or by more extreme conditions such as oceanic storms.
Step 3: Understanding Data Limitations & Applying Exclusion Criteria:
Keep in mind that, because the model runs every day across various geographical regions, some inaccuracy in the predicted values is inevitable for any given timeslot. It is therefore essential to apply appropriate exclusion criteria during the analysis phase, taking into account limitations such as current weather conditions and water depth when working with readings for particular timestamps obtained from the CSV file or API services. Also remember that these predictions may not be used for safety-critical purposes.
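As a quick illustration of Steps 1-3, the sketch below loads the downloaded CSV with pandas and plots significant wave height at a single grid point. The filename matches the file listing further down, but the column names are assumptions based on the description above and may need renaming.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Column names follow the description above; rename them to match the actual header.
df = pd.read_csv("download-csv-1.csv", parse_dates=["time"])

# Pick a single grid point and plot its significant wave height over time
point = df[(df["latitude"] == df["latitude"].iloc[0])
           & (df["longitude"] == df["longitude"].iloc[0])]

plt.plot(point["time"], point["significant wave height"])
plt.xlabel("Time")
plt.ylabel("Significant Wave Height (m)")
plt.title("SWAN model output at one grid point")
plt.tight_layout()
plt.show()
```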
- Visualizing wave heights in the East Atlantic area over time to map oceanic currents.
- Finding areas of high-wave activity: using this data, researchers can identify unique areas that experience particularly severe waves, which could be essential to know for protecting maritime vessels and informing navigation strategies.
- Predicting future wave behavior: by analyzing current and past trends in SWAN Wave Model data, scientists can predict how significant wave heights will change over future timescales in the studied area
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: download-csv-1.csv | Column name | Descrip...
License: https://spdx.org/licenses/CC0-1.0.html
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.
A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we propose a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and a bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to evaluate their performance in comparison with the WiBB method in ranking predictor importance under various scenarios. We also applied WiBB to an empirical dataset for the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.
Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences in the correlation coefficients of consecutive predictors, where ∆r = 0.1, 0.2, 0.3, respectively. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedures with additional steps, in which we converted the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was collected as occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors for their geographical distributions.
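For readers without access to the original R scripts, here is a rough NumPy analogue of the simulation design described above (the authors used rmvnorm from the mvtnorm package). For simplicity the predictors are assumed mutually uncorrelated, which keeps the ∆r = 0.2 structure positive definite; the binary conversion via a median threshold is likewise an assumption, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(42)

# Correlations between the response y and predictors x1..x4
# (the ∆r = 0.2 structure from the paper: 0.6, 0.4, 0.2, 0.0)
r_yx = np.array([0.6, 0.4, 0.2, 0.0])

# 5x5 correlation matrix with y first, then x1..x4; predictors are assumed
# mutually uncorrelated in this sketch, purely for convenience.
corr = np.eye(5)
corr[0, 1:] = r_yx
corr[1:, 0] = r_yx
np.linalg.cholesky(corr)  # raises LinAlgError if the matrix is not positive definite

n = 1000
sample = rng.multivariate_normal(mean=np.zeros(5), cov=corr, size=n)
y, X = sample[:, 0], sample[:, 1:]

# GLM variant: convert the continuous response to binary occurrence data
# (median threshold used here only for illustration)
occurrence = (y > np.median(y)).astype(int)

# Realised correlations between y and the four predictors
print(np.corrcoef(sample, rowvar=False)[0, 1:].round(2))
```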
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Background: Individual participant data (IPD) meta-analyses that obtain "raw" data from studies rather than summary data typically adopt a "two-stage" approach to analysis, whereby IPD within trials generate summary measures, which are combined using standard meta-analytical methods. Recently, a range of "one-stage" approaches which combine all individual participant data in a single meta-analysis have been suggested as providing a more powerful and flexible approach. However, they are more complex to implement and require statistical support. This study uses a dataset to compare "two-stage" and "one-stage" models of varying complexity, to ascertain whether results obtained from the approaches differ in a clinically meaningful way.
Methods and Findings: We included data from 24 randomised controlled trials, evaluating antiplatelet agents, for the prevention of pre-eclampsia in pregnancy. We performed two-stage and one-stage IPD meta-analyses to estimate overall treatment effect and to explore potential treatment interactions whereby particular types of women and their babies might benefit differentially from receiving antiplatelets. Two-stage and one-stage approaches gave similar results, showing a benefit of using antiplatelets (relative risk 0.90, 95% CI 0.84 to 0.97). Neither approach suggested that any particular type of women benefited more or less from antiplatelets. There were no material differences in results between different types of one-stage model.
Conclusions: For these data, two-stage and one-stage approaches to analysis produce similar results. Although one-stage models offer a flexible environment for exploring model structure and are useful where across-study patterns relating to types of participant, intervention and outcome mask similar relationships within trials, the additional insights provided by their usage may not outweigh the costs of statistical support for routine application in syntheses of randomised controlled trials. Researchers considering undertaking an IPD meta-analysis should not necessarily be deterred by a perceived need for sophisticated statistical methods when combining information from large randomised trials.
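To make the "two-stage" idea concrete, here is a minimal sketch with placeholder trial counts (not the antiplatelet data): stage one reduces each trial to a log relative risk and its variance, stage two pools them with fixed-effect inverse-variance weights.

```python
import numpy as np

# Stage 1: per-trial 2x2 summaries (events / total in treated and control arms).
# These numbers are placeholders, not the pre-eclampsia trial data.
trials = [
    # (events_treated, n_treated, events_control, n_control)
    (30, 500, 40, 500),
    (12, 200, 15, 190),
    (55, 900, 70, 910),
]

log_rr, var_log_rr = [], []
for a, n1, c, n0 in trials:
    rr = (a / n1) / (c / n0)
    log_rr.append(np.log(rr))
    # approximate variance of the log relative risk
    var_log_rr.append(1 / a - 1 / n1 + 1 / c - 1 / n0)

log_rr, var_log_rr = np.array(log_rr), np.array(var_log_rr)

# Stage 2: fixed-effect inverse-variance pooling across trials
w = 1 / var_log_rr
pooled = np.sum(w * log_rr) / np.sum(w)
se = np.sqrt(1 / np.sum(w))
ci = np.exp([pooled - 1.96 * se, pooled + 1.96 * se])
print(f"Pooled RR = {np.exp(pooled):.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```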
Background: Microarray experiments offer a potent solution to the problem of making and comparing large numbers of gene expression measurements either in different cell types or in the same cell type under different conditions. Inferences about the biological relevance of observed changes in expression depend on the statistical significance of the changes. In lieu of many replicates with which to determine accurate intensity means and variances, reliable estimates of statistical significance remain problematic. Without such estimates, overly conservative choices for significance must be enforced.
Results: A simple statistical method for estimating variances from microarray control data which does not require multiple replicates is presented. Comparison of datasets from two commercial entities using this difference-averaging method demonstrates that the standard deviation of the signal scales at a level intermediate between the signal intensity and its square root. Application of the method to a dataset related to the β-catenin pathway yields a larger number of biologically reasonable genes whose expression is altered than the ratio method.
Conclusions: The difference-averaging method enables determination of variances as a function of signal intensities by averaging over the entire dataset. The method also provides a platform-independent view of important statistical properties of microarray data.
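The paper's exact procedure is not reproduced here, but one plausible reading of a difference-averaging estimate, sketched under the assumption that paired control intensities are available, is to bin control spots by mean intensity and estimate the signal SD within each bin from the paired differences:

```python
import numpy as np
import pandas as pd

def sd_vs_intensity(control_a, control_b, n_bins=20):
    """Estimate signal SD as a function of intensity from paired control spots.

    control_a and control_b are intensities of nominally identical control
    measurements (one pair per spot). Spots are binned by mean intensity and
    the SD within each bin is estimated from the paired differences
    (SD of a difference of two iid values = sqrt(2) * SD of one value).
    """
    a, b = np.asarray(control_a, float), np.asarray(control_b, float)
    mean_int = (a + b) / 2
    diff = a - b
    bins = pd.qcut(mean_int, n_bins)
    grouped = pd.DataFrame({"intensity": mean_int, "diff": diff}).groupby(bins, observed=True)
    return pd.DataFrame({
        "mean_intensity": grouped["intensity"].mean(),
        "sd_estimate": grouped["diff"].std() / np.sqrt(2),
    })

# Synthetic illustration: noise SD scaling between linear and square-root of the signal
rng = np.random.default_rng(1)
signal = rng.uniform(100, 10_000, 5000)
noise_sd = signal ** 0.75
a = signal + rng.normal(0, noise_sd)
b = signal + rng.normal(0, noise_sd)
print(sd_vs_intensity(a, b).head())
```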
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
In research evaluating statistical analysis methods, a common aim is to compare point estimates and confidence intervals (CIs) calculated from different analyses. This can be challenging when the outcomes (and their scale ranges) differ across datasets. We therefore developed a plot to facilitate pairwise comparisons of point estimates and confidence intervals from different statistical analyses both within and across datasets.
The plot was developed and refined over the course of an empirical study. To compare results from a variety of different studies, a system of centring and scaling is used. Firstly, the point estimates from reference analyses are centred to zero, followed by scaling confidence intervals to span a range of one. The point estimates and confidence intervals from matching comparator analyses are then adjusted by the same amounts. This enables the relative positions of the point estimates and CI widths to be quickly assessed while maintaining the relative magnitudes of the difference in point estimates and confidence interval widths between the two analyses. Banksia plots can be graphed in a matrix, showing all pairwise comparisons of multiple analyses. In this paper, we show how to create a banksia plot and present two examples: the first relates to an empirical evaluation assessing the difference between various statistical methods across 190 interrupted time series (ITS) data sets with widely varying characteristics, while the second example assesses data extraction accuracy comparing results obtained from analysing original study data (43 ITS studies) with those obtained by four researchers from datasets digitally extracted from graphs from the accompanying manuscripts.
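A minimal sketch of that centring and scaling step (not the authors' Stata or R code; the numbers are invented):

```python
def centre_and_scale(ref, comp):
    """Centre/scale one pair of results as described for the banksia plot.

    ref and comp are (point_estimate, ci_lower, ci_upper) tuples from the
    reference and comparator analyses of the same dataset.
    """
    est, lo, hi = ref
    shift = est                  # centre the reference estimate at zero
    scale = hi - lo              # scale the reference CI to span one
    transform = lambda v: (v - shift) / scale
    return tuple(map(transform, ref)), tuple(map(transform, comp))

# Example: reference analysis vs. an alternative analysis of the same series
ref_scaled, comp_scaled = centre_and_scale(ref=(1.8, 0.9, 2.7), comp=(2.1, 0.7, 3.5))
print(ref_scaled)   # (0.0, -0.5, 0.5): centred at zero, CI width 1
print(comp_scaled)  # comparator shown relative to the reference
```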
In the banksia plot of statistical method comparison, it was clear that there was no difference, on average, in point estimates and it was straightforward to ascertain which methods resulted in smaller, similar or larger confidence intervals than others. In the banksia plot comparing analyses from digitally extracted data to those from the original data it was clear that both the point estimates and confidence intervals were all very similar among data extractors and original data.
The banksia plot, a graphical representation of centred and scaled confidence intervals, provides a concise summary of comparisons between multiple point estimates and associated CIs in a single graph. Through this visualisation, patterns and trends in the point estimates and confidence intervals can be easily identified.
This collection of files allows the user to create the images used in the companion paper and amend this code to create their own banksia plots using either Stata version 17 or R version 4.3.1
Supporting data for 2 region and 51 region models assessed in the manuscript "Exploring the relevance of spatial scale to life cycle inventory results using environmentally-extended input-output models of the United States". Includes results of the correlation and relative errors analysis, results in kg/$ intensities for the 17 commodities from the 2 region models and the 51 region model, the 51-region model Make and Use tables, 10 NEI emissions and water withdrawal data aggregated by the 15 BEA sectors, interstate commodity flow data aggregated by BEA sectors between states, BEA national level Make and Use tables for 2012 at sector level, and state GDP data. This dataset is associated with the following publication: Yang, Y., W. Ingwersen, and D. Meyer. Exploring the relevance of spatial scale to life cycle inventory results using environmentally-extended input-output models of the United States. ENVIRONMENTAL MODELLING & SOFTWARE. Elsevier Science, New York, NY, 99: 52-57, (2018).
The Shuttle Radar Topography Mission (SRTM) was flown aboard the space shuttle Endeavour February 11-22, 2000. The National Aeronautics and Space Administration (NASA) and the National Geospatial-Intelligence Agency (NGA) participated in an international project to acquire radar data which were used to create the first near-global set of land elevations. The radars used during the SRTM mission were actually developed and flown on two Endeavour missions in 1994. The C-band Spaceborne Imaging Radar and the X-Band Synthetic Aperture Radar (X-SAR) hardware were used on board the space shuttle in April and October 1994 to gather data about Earth's environment. The technology was modified for the SRTM mission to collect interferometric radar, which compared two radar images or signals taken at slightly different angles. This mission used single-pass interferometry, which acquired two signals at the same time by using two different radar antennas. An antenna located on board the space shuttle collected one data set and the other data set was collected by an antenna located at the end of a 60-meter mast that extended from the shuttle. Differences between the two signals allowed for the calculation of surface elevation. Endeavour orbited Earth 16 times each day during the 11-day mission, completing 176 orbits. SRTM successfully collected radar data over 80% of the Earth's land surface between 60° north and 56° south latitude with data points posted every 1 arc-second (approximately 30 meters). Two resolutions of finished grade SRTM data are available through EarthExplorer from the collection held in the USGS EROS archive: 1 arc-second (approximately 30-meter) high resolution elevation data offer worldwide coverage of void filled data at a resolution of 1 arc-second (30 meters) and provide open distribution of this high-resolution global data set. Some tiles may still contain voids. The SRTM 1 Arc-Second Global (30 meters) data set will be released in phases starting September 24, 2014. Users should check the coverage map in EarthExplorer to verify if their area of interest is available. 3 arc-second (approximately 90-meter) medium resolution elevation data are available for global coverage. The 3 arc-second data were resampled using cubic convolution interpolation for regions between 60° north and 56° south latitude. [Summary provided by the USGS.]
These are geospatial data that characterize the distribution of polar bear denning habitat in the 1002 Area of the Arctic National Wildlife Refuge, Alaska. They were generated to compare the efficacy of two different techniques for identifying areas with suitable den habitat: (1) from a previously published study (Durner et al., 2006) that used manual interpretation of aerial photos and (2) from computer interrogation of interferometric synthetic aperture radar (IfSAR) digital terrain models. Two datasets are included in this data package, they are both vector geospatial datasets of putative denning habitat (one dataset each for the manual photo interpretation data and the computer interpreted IfSAR data). Additionally included are: vector data used for sampling and metadata describing the IfSAR-derived digital terrain model (DTM) tiles used to generate the shapefiles. The IfSAR DTM are available for purchase through Intermap Technologies, Inc. All vector data are provided in both ESRI shapefile and Keyhole Markup Language (KML) formats.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Abstract
Digital Object Identifiers (DOIs) are regarded as persistent; however, they are sometimes deleted. Deleted DOIs are an important issue not only for persistent access to scholarly content but also for bibliometrics, because they may cause problems in correctly identifying scholarly articles. However, little is known about how many DOIs are deleted and what causes the deletions. We identified deleted DOIs by comparing the datasets of all Crossref DOIs on two different dates, investigated the number of deleted DOIs in the scholarly content along with the corresponding document types, and analyzed the factors that cause deleted DOIs. Using the proposed method, 708,282 deleted DOIs were identified. The majority corresponded to individual scholarly articles such as journal articles, proceedings articles, and book chapters. There were cases of many DOIs assigned to the same content, e.g., retracted journal articles and abstracts of international conferences. We show the publishers and academic societies that are most common among deleted DOIs. In addition, the top cases of single scholarly content with a large number of deleted DOIs are revealed. The findings of this study are useful for citation analysis and altmetrics, as well as for avoiding deleted DOIs.
Data Records
The data format of the dataset is JSON Lines, where each line is a single record. We identified the deleted DOIs from the difference set between Crossref DOIs as of March 2017 and January 2021. Note that the file "00_Non-Crossref_DOIs.jsonl.gz" does not contain deleted DOIs, while the other files do. Please refer to the conference paper listed in the references for details. A record contains the following fields (a short parsing sketch follows the field list).
doi -- DOI name (String), e.g., "10.xxxx/xxxx"
whichRA -- Registration agency name or error message for the DOI name, according to the "whichRA?" lookup (String): "Airiti," "Crossref," "DOI does not exist," "DataCite," "KISTI," "Public," or "mEDRA."
redirects -- Redirected URIs for the DOI name obtained by curl command (Array of String), e.g., ["https://doi.org/10.1001/archinte.166.4.387","http://archinte.jamanetwork.com/article.aspx?doi=10.1001/archinte.166.4.387"]
redirect_to_other_doi -- The other DOI that the DOI link redirects to (Array of String), e.g., ["10.1001/archinte.166.4.387"]
timestamp -- Date the data was retrieved (Datetime), "2022-01-30T02:10:45Z"
label -- The group to which the DOI belongs (String): "Alias DOIs," "DOIs with Deleted Description in Metadata," "DOIs without Redirects," "Defunct DOIs," "Non-Crossref DOIs," "Non-existing DOIs," or "Other DOIs."
For the file "04_DOIs_with_Deleted_Description_on_Metadata.jsonl.gz," the following additional fields are available.
alias_doi -- Alias DOI name, the same as the value of "doi." (String), e.g., "10.1007/bf00400428."
primary_doi -- Primary DOI name for the alias DOI name, the same as the first value of "redirect_to_other_doi". (String), e.g., "10.1007/bf00400429"
container_title_of_alias_doi -- the container title for the alias DOI according to the Crossref REST API. e.g., "CrossRef Listing of Deleted DOIs."
title_of_alias_doi -- the title for the alias DOI according to the Crossref REST API. e.g., "CrossRef Listing of Deleted DOIs."
container_title_of_primary_doi -- the container title for the primary DOI according to the Crossref REST API. e.g., "CrossRef Listing of Deleted DOIs."
title_of_primary_doi -- the title for the primary DOI according to the Crossref REST API, e.g., "CrossRef Listing of Deleted DOIs."
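A short sketch of how the gzipped JSON Lines files can be read in Python, using the field names documented above:

```python
import gzip
import json
from collections import Counter

labels = Counter()
registration_agencies = Counter()

# Any of the gzipped JSON Lines files in the dataset can be read this way.
with gzip.open("04_DOIs_with_Deleted_Description_on_Metadata.jsonl.gz", "rt") as fh:
    for line in fh:
        record = json.loads(line)
        labels[record["label"]] += 1
        registration_agencies[record["whichRA"]] += 1

print(labels.most_common())
print(registration_agencies.most_common(5))
```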
References
Kikkawa, J., Takaku, M. & Yoshikane, F. "Analysis of the deletions of DOIs: What factors undermine their persistence and to what extent?", Proceedings of the 26th International Conference on Theory and Practice of Digital Libraries (TPDL 2022), (to appear), 2022.
FUNDING
JSPS KAKENHI Grant Number JP21K21303, JP22K18147, JP20K12543, and JP21K12592
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
As there was no large publicly available cross-domain dataset for comparative argument mining, we created one composed of sentences annotated with BETTER / WORSE markers (the first object is better / worse than the second object) or NONE (the sentence does not contain a comparison of the target objects). The BETTER sentences stand for a pro-argument in favor of the first compared object, while WORSE sentences represent a con-argument and favor the second object.
We aimed to minimize dataset-specific domain biases in order to capture the nature of comparison rather than the nature of the particular domains, and thus decided to control the specificity of domains through the selection of comparison targets. We hypothesized, and could confirm in preliminary experiments, that comparison targets usually have a common hypernym (i.e., are instances of the same class), which we used to select the pairs of compared objects.
The most specific domain we chose is computer science, with comparison targets such as programming languages, database products and technology standards such as Bluetooth or Ethernet. Many computer science concepts can be compared objectively (e.g., on transmission speed or suitability for certain applications). The objects for this domain were manually extracted from "List of"-articles on Wikipedia. In the annotation process, annotators were asked to only label sentences from this domain if they had some basic knowledge in computer science. The second, broader domain is brands. It contains objects of different types (e.g., cars, electronics, and food). As brands are present in everyday life, anyone should be able to label the majority of sentences containing well-known brands such as Coca-Cola or Mercedes. Again, targets for this domain were manually extracted from "List of"-articles on Wikipedia. The third domain is not restricted to any topic: random. For each of 24 randomly selected seed words, 10 similar words were collected based on the distributional similarity API of JoBimText (http://www.jobimtext.org). Seed words were created using randomlists.com: book, car, carpenter, cellphone, Christmas, coffee, cork, Florida, hamster, hiking, Hoover, Metallica, NBC, Netflix, ninja, pencil, salad, soccer, Starbucks, sword, Tolkien, wine, wood, XBox, Yale.
Especially for brands and computer science, the resulting object lists were large (4493 objects for brands and 1339 for computer science). In a manual inspection, low-frequency and ambiguous objects were removed from all object lists (e.g., RAID (a hardware concept) and Unity (a game engine) are also regularly used nouns). The remaining objects were combined into pairs. For each object type (the seed Wikipedia list page or the seed word), all possible combinations were created. These pairs were then used to find sentences containing both objects. These approaches to selecting pairs of compared objects tend to minimize the inclusion of domain-specific data, but do not fully solve the problem. We leave extending the dataset with more diverse object pairs, including abstract concepts, as future work. For the sentence mining, we used the publicly available index of dependency-parsed sentences from the Common Crawl corpus, containing over 14 billion English sentences filtered for duplicates. This index was queried for sentences containing both objects of each pair.
For 90% of the pairs, we also added comparative cue words (better, easier, faster, nicer, wiser, cooler, decent, safer, superior, solid, terrific, worse, harder, slower, poorly, uglier, poorer, lousy, nastier, inferior, mediocre) to the query in order to bias the selection towards comparisons, while still admitting comparisons that do not contain any of the anticipated cues. This was necessary because random sampling would have resulted in only a very tiny fraction of comparisons. Note that even sentences containing a cue word do not necessarily express a comparison between the desired targets (dog vs. cat: "He's the best pet that you can get, better than a dog or cat."). It is thus especially crucial to enable a classifier to learn not to rely on the existence of cue words alone (very likely in a random sample of sentences with very few comparisons). For our corpus, we keep pairs with at least 100 retrieved sentences.
From all sentences of those pairs, 2500 for each category were randomly sampled as candidates for a crowdsourced annotation that we conducted on figure-eight.com in several small batches. Each sentence was annotated by at least five trusted workers. We ranked annotations by confidence, which is the figure-eight internal measure combining annotator trust and voting, and discarded annotations with a confidence below 50%. Of all annotated items, 71% received unanimous votes, and for over 85% at least 4 out of 5 workers agreed, rendering the collection procedure, aimed at ease of annotation, successful.
The final dataset contains 7199 sentences with 271 distinct object pairs. The majority of sentences (over 72%) are non-comparative despite biasing the selection with cue words; in 70% of the comparative sentences, the favored target is named first.
You can browse through the data here: https://docs.google.com/spreadsheets/d/1U8i6EU9GUKmHdPnfwXEuBxi0h3aiRCLPRC-3c9ROiOE/edit?usp=sharing
A full description of the dataset is available in the workshop paper at the ACL 2019 conference. Please cite this paper if you use the data: Franzek, Mirco, Alexander Panchenko, and Chris Biemann. "Categorization of Comparative Sentences for Argument Mining." arXiv preprint arXiv:1809.06152 (2018).
@inproceedings{franzek2018categorization, title={Categorization of Comparative Sentences for Argument Mining}, author={Panchenko, Alexander and Bondarenko and Franzek, Mirco and Hagen, Matthias and Biemann, Chris}, booktitle={Proceedings of the 6th Workshop on Argument Mining at ACL'2019}, year={2019}, address={Florence, Italy}}
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
Dataset corresponding to the journal article "Mitigating the effect of errors in source parameters on seismic (waveform) inversion" by Blom, Hardalupas and Rawlinson, accepted for publication in Geophysical Journal International. In this paper, we demonstrate the effect of errors in source parameters on seismic tomography, with a particular focus on (full) waveform tomography. We study the effect both on forward modelling (i.e. comparing waveforms and measurements resulting from a perturbed vs. unperturbed source) and on seismic inversion (i.e. using a source which contains an (erroneous) perturbation to invert for Earth structure). These data were obtained using Salvus, a state-of-the-art (though proprietary) 3-D solver that can be used for wave propagation simulations (Afanasiev et al., GJI 2018).
This dataset contains:
The entire Salvus project. This project was prepared using Salvus version 0.11.x and 0.12.2 and should be fully compatible with the latter.
A number of Jupyter notebooks used to create all the figures, set up the project and do the data processing.
A number of Python scripts that are used in above notebooks.
two conda environment .yml files: one with the complete environment as used to produce this dataset, and one with the environment as supplied by Mondaic (the Salvus developers), on top of which I installed basemap and cartopy.
An overview of the inversion configurations used for each inversion experiment and the names of the corresponding figures: inversion_runs_overview.ods / .csv .
Datasets corresponding to the different figures.
One dataset for Figure 1, showing the effect of a source perturbation in a real-world setting, as previously used by Blom et al., Solid Earth 2020
One dataset for Figure 2, showing how different methodologies and assumptions can lead to significantly different source parameters, notably including systematic shifts. This dataset was kindly supplied by Tim Craig (Craig, 2019).
A number of datasets (stored as pickled Pandas dataframes) derived from the Salvus project. We have computed:
travel-time arrival predictions from every source to all stations (df_stations...pkl)
misfits for different metrics for both P-wave centered and S-wave centered windows for all components on all stations, comparing every time waveforms from a reference source against waveforms from a perturbed source (df_misfits_cc.28s.pkl)
addition of synthetic waveforms for different (perturbed) moment tensors. All waveforms are stored in HDF5 (.h5) files of the ASDF (adaptable seismic data format) type. (A minimal sketch for loading the pickled dataframes follows this list.)
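The pickled dataframes can be read directly with pandas; the filename below is taken from the list above, while the commented column name is only a hypothetical example, since the actual columns depend on how the dataframes were assembled:

```python
import pandas as pd

# Misfits comparing reference vs. perturbed-source waveforms (filename from the list above)
df_misfits = pd.read_pickle("df_misfits_cc.28s.pkl")

print(df_misfits.shape)
print(df_misfits.head())

# Inspect the columns before filtering; the name below is hypothetical:
# df_misfits[df_misfits["component"] == "Z"]
```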
How to use this dataset:
To set up the conda environment:
make sure you have anaconda/miniconda
make sure you have access to Salvus functionality. This is not absolutely necessary, but most of the functionality within this dataset relies on salvus. You can do the analyses and create the figures without, but you'll have to hack around in the scripts to build workarounds.
Set up Salvus / create a conda environment. This is best done following the instructions on the Mondaic website. Check the changelog for breaking changes, in that case download an older salvus version.
Additionally in your conda env, install basemap and cartopy:
conda-env create -n salvus_0_12 -f environment.yml
conda install -c conda-forge basemap
conda install -c conda-forge cartopy
Install LASIF (https://github.com/dirkphilip/LASIF_2.0) and test. The project uses some lasif functionality.
To recreate the figures: This is extremely straightforward. Every figure has a corresponding Jupyter notebook; it suffices to run the notebook in its entirety.
Figure 1: separate notebook, Fig1_event_98.py
Figure 2: separate notebook, Fig2_TimCraig_Andes_analysis.py
Figures 3-7: Figures_perturbation_study.py
Figures 8-10: Figures_toy_inversions.py
To recreate the dataframes in DATA: This can be done using the example notebooks Create_perturbed_thrust_data_by_MT_addition.py and Misfits_moment_tensor_components.M66_M12.py. The same can easily be extended to the position shift and other perturbations you might want to investigate.
To recreate the complete Salvus project: This can be done using:
the notebook Prepare_project_Phil_28s_absb_M66.py (setting up project and running simulations)
the notebooks Moment_tensor_perturbations.py and Moment_tensor_perturbation_for_NS_thrust.py
For the inversions: using the notebook Inversion_SS_dip.M66.28s.py as an example. See the overview table inversion_runs_overview.ods (or .csv) as to naming conventions.
References:
Michael Afanasiev, Christian Boehm, Martin van Driel, Lion Krischer, Max Rietmann, Dave A May, Matthew G Knepley, Andreas Fichtner, Modular and flexible spectral-element waveform modelling in two and three dimensions, Geophysical Journal International, Volume 216, Issue 3, March 2019, Pages 1675–1692, https://doi.org/10.1093/gji/ggy469
Nienke Blom, Alexey Gokhberg, and Andreas Fichtner, Seismic waveform tomography of the central and eastern Mediterranean upper mantle, Solid Earth, Volume 11, Issue 2, 2020, Pages 669–690, 2020, https://doi.org/10.5194/se-11-669-2020
Tim J. Craig, Accurate depth determination for moderate-magnitude earthquakes using global teleseismic data. Journal of Geophysical Research: Solid Earth, 124, 2019, Pages 1759– 1780. https://doi.org/10.1029/2018JB016902
Until recently, researchers who wanted to examine the determinants of state respect for most specific negative rights needed to rely on data from the CIRI or the Political Terror Scale (PTS). The new V-DEM dataset offers scholars a potential alternative to the individual human rights variables from CIRI. We analyze a set of key Cingranelli-Richards (CIRI) Human Rights Data Project and Varieties of Democracy (V-DEM) negative rights indicators, finding unusual and unexpectedly large patterns of disagreement between the two sets. First, we discuss the new V-DEM dataset by comparing it to the disaggregated CIRI indicators, discussing the history of each project, and describing its empirical domain. Second, we identify a set of disaggregated human rights measures that are similar across the two datasets and discuss each project's measurement approach. Third, we examine how these measures compare to each other empirically, showing that they diverge considerably across both time and space. These findings point to several important directions for future work, such as how conceptual approaches and measurement strategies affect rights scores. For the time being, our findings suggest that researchers should think carefully about using the measures as substitutes.
Background: There has been a relentless increase in emergency medical admissions in the UK over recent years. Many of these patients suffer from chronic conditions requiring continuing medical attention. We wished to determine whether conventional outpatient clinic follow-up after discharge has any impact on the rate of readmission to hospital.
Methods: Two consultant general physicians with the same patient case-mix but markedly different outpatient follow-up practice were chosen. Of 1203 patients discharged, one consultant saw twice as many patients in the follow-up clinic as the other (Dr A 9.8% v Dr B 19.6%). The readmission rate in the twelve months following discharge was compared in a retrospective analysis of hospital activity data. Due to the specialisation of the admitting system, patients mainly had cardiovascular or cerebrovascular disease or had taken an overdose. Few had respiratory or infectious diseases. Outpatient follow-up was focussed on patients with cardiac disease.
Results: Risk of readmission increased significantly with age and length of stay of the original episode and was lower for digestive system and musculo-skeletal disorders. 28.7% of patients discharged by Dr A and 31.5% of those discharged by Dr B were readmitted at least once. Relative readmission risk was not significantly different between the consultants and there was no difference in the length of stay of readmissions.
Conclusions: Increasing the proportion of patients with this age- and case-mix who are followed up in a hospital general medical outpatient clinic is unlikely to reduce the demand for acute hospital beds.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
A generalized dataset of existing land use in the District of Columbia as it existed at the time of the most recent extract of the common ownership lots. This dataset is different from the Comprehensive Plan - Future Land Use, which shows land use as envisioned in the latest version of DC's Comprehensive Plan; the primary land use categories used in this dataset are similar, but not identical. The Office of the Chief Technology Officer (OCTO) compared two datasets to create this generalized existing land use data. The data source identifying property use is the Property Use Code Lookup from the Office of Tax and Revenue (OTR). An index provided by the Office of Planning assigns each OTR property use code a "primary land use" designation. Through an automated process, the common ownership lots were then joined with this index to create the Existing Land Use. Only properties with an assigned use code from OTR are categorized; properties without a use code were left as NULL, and many of these tend to be public lands such as national parks. Refer to https://opendata.dc.gov/pages/public-lands. This dataset has no legal status and is intended primarily as a resource and informational tool. The Office of the Chief Technology Officer anticipates replicating this work annually.
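As an illustration of the join described above (not OCTO's actual workflow), here is a pandas sketch with hypothetical file and column names standing in for the OTR lookup, the Office of Planning index, and the common ownership lots:

```python
import pandas as pd

# Hypothetical file and column names, standing in for the common ownership lots
# table and the Office of Planning index that maps OTR use codes to land use.
lots = pd.read_csv("common_ownership_lots.csv")          # includes a USECODE column
use_index = pd.read_csv("property_use_code_index.csv")   # USECODE -> PRIMARY_LAND_USE

existing_land_use = lots.merge(use_index, on="USECODE", how="left")

# Properties without an assigned OTR use code stay NULL (e.g., many public lands)
print(existing_land_use["PRIMARY_LAND_USE"].isna().sum(), "lots without a use code")
```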
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information derived automatically)
In this upload we share processed crop type datasets from both France and Kenya. These datasets can be helpful for testing and comparing various domain adaptation methods. The datasets are processed, used, and described in this paper: https://doi.org/10.1016/j.rse.2021.112488 (arXiv version: https://arxiv.org/pdf/2109.01246.pdf).
In summary, each point in the uploaded datasets corresponds to a particular location. The label is the crop type grown at that location in 2017. The 70 processed features are based on Sentinel-2 satellite measurements at that location in 2017. The points in the France dataset come from 11 different departments (regions) in Occitanie, France, and the points in the Kenya dataset come from 3 different regions in Western Province, Kenya. Within each dataset there are notable shifts in the distribution of the labels and in the distribution of the features between regions. Therefore, these datasets can be helpful for testing and comparing methods that are designed to address such distributional shifts.
More details on the dataset and processing steps can be found in Kluger et al. (2021). Many of the processing steps were taken to deal with Sentinel-2 measurements that were corrupted by cloud cover. For users interested in the raw multi-spectral time series data and in dealing with cloud cover issues on their own (rather than using the 70 processed features provided here), the raw dataset from Kenya can be found in Yeh et al. (2021), and the raw dataset from France can be made available upon request from the authors of this Zenodo upload.
All of the data uploaded here can be found in "CropTypeDatasetProcessed.RData". We also post the dataframes and tables within that .RData file as separate .csv files for users who do not have R. The contents of each R object (or .csv file) is described in the file "Metadata.rtf".
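For users working in Python rather than R, the posted .csv tables can be inspected with pandas. The filename and the "region" column below are placeholders, since the exact names are documented in "Metadata.rtf" rather than here:

```python
import pandas as pd

# Placeholder filename: substitute the actual .csv name for the table of interest.
france = pd.read_csv("france_crop_points.csv")

print(france.shape)                        # rows = labelled points, columns = 70 features + label/region
print(france["region"].value_counts())     # hypothetical column: the 11 Occitanie departments
```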
Preferred Citation:
-Kluger, D.M., Wang, S., Lobell, D.B., 2021. Two shifts for crop mapping: Leveraging aggregate crop statistics to improve satellite-based maps in new regions. Remote Sens. Environ. 262, 112488. https://doi.org/10.1016/j.rse.2021.112488.
-URL to this Zenodo post https://zenodo.org/record/6376160