CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The receiver operating characteristic (ROC) curve is typically employed to evaluate the discriminatory capability of a continuous or ordinal biomarker when two groups are to be distinguished, commonly the 'healthy' and the 'diseased'. There are cases for which the disease status has three categories. Such cases employ the ROC surface, which is a natural generalization of the ROC curve to three classes. In this paper, we explore new methodologies for comparing two continuous biomarkers that refer to a trichotomous disease status, when both markers are applied to the same patients. Comparisons based on the volume under the surface have been proposed, but that measure is often not clinically relevant. Here, we focus on comparing two correlated ROC surfaces at given pairs of true classification rates, which are more relevant to patients and physicians. We propose delta-based parametric techniques, power transformations to normality, and bootstrap-based smooth nonparametric techniques, and investigate the performance of the corresponding tests. We evaluate our approaches through an extensive simulation study and apply them to a real data set from prostate cancer screening.
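For readers who want a concrete starting point, below is a minimal NumPy sketch (not the authors' delta-based, transformation, or smoothed nonparametric procedures) of the empirical volume under the ROC surface and a paired bootstrap for the VUS difference between two markers measured on the same patients; all names are illustrative and the group samples are assumed to be NumPy arrays.

import numpy as np

rng = np.random.default_rng(0)

def empirical_vus(healthy, intermediate, diseased):
    # Empirical volume under the ROC surface: P(X_healthy < X_intermediate < X_diseased).
    # O(n1*n2*n3) time and memory, so intended for small samples only.
    xh = np.asarray(healthy)[:, None, None]
    xi = np.asarray(intermediate)[None, :, None]
    xd = np.asarray(diseased)[None, None, :]
    return float(np.mean((xh < xi) & (xi < xd)))

def paired_bootstrap_vus_diff(marker_a, marker_b, n_boot=1000):
    # marker_a and marker_b are dicts keyed by 'healthy', 'intermediate', 'diseased'.
    # The two markers are measured on the same patients, so the same resampled
    # indices are reused for both markers within each disease group.
    groups = ("healthy", "intermediate", "diseased")
    diffs = np.empty(n_boot)
    for k in range(n_boot):
        idx = {g: rng.integers(0, len(marker_a[g]), len(marker_a[g])) for g in groups}
        va = empirical_vus(*(marker_a[g][idx[g]] for g in groups))
        vb = empirical_vus(*(marker_b[g][idx[g]] for g in groups))
        diffs[k] = va - vb
    return diffs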
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This Excel file performs a statistical test of whether two ROC curves differ from each other based on the area under the curve (AUC). You will need the coefficient from the table presented in the following article to enter the correct value for the comparison: Hanley JA, McNeil BJ (1983) A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148:839-843.
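For readers without the spreadsheet, a minimal Python sketch of the Hanley-McNeil (1983) z statistic for two correlated AUCs follows; the function names are illustrative, and r is the correlation coefficient read from the table in the article.

import math

def hanley_mcneil_z(auc1, se1, auc2, se2, r):
    # z statistic for comparing two correlated AUCs (Hanley & McNeil, 1983).
    # se1 and se2 are the standard errors of the two AUCs; r is the correlation
    # between the AUC estimates taken from the article's table.
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    return (auc1 - auc2) / se_diff

def two_sided_p(z):
    # Two-sided p-value under the standard normal reference distribution.
    return math.erfc(abs(z) / math.sqrt(2))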
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
The unpacked data.zip archive contains several .json files with data. Each file contains data used to test two pairwise comparison matrix attack algorithms, referred to as "row" and "hadamard", following the article https://doi.org/10.48550/arXiv.2211.01809. The name of each file encodes the matrix size n and the distance between the promoted and reference alternatives (e.g. 7x7_delta_3, where 7 is the matrix size and 3 is the delta).
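A minimal sketch for loading the unpacked files in Python, assuming only the naming pattern shown in the example (e.g. 7x7_delta_3.json) and nothing about the internal JSON structure:

import json
import re
from pathlib import Path

def load_attack_files(directory):
    # Parse n and delta from each filename and read the JSON payload.
    records = []
    for path in Path(directory).glob("*.json"):
        match = re.match(r"(\d+)x\1_delta_(\d+)$", path.stem)
        if match is None:
            continue
        n, delta = int(match.group(1)), int(match.group(2))
        with open(path) as fh:
            records.append({"n": n, "delta": delta, "data": json.load(fh)})
    return records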
The Excel file contains the model input-output data sets that were used to evaluate the two-layer soil moisture and flux dynamics model. The model is original and was developed by Dr. Hantush by integrating the well-known Richards equation over the root layer and the lower vadose zone. The input-output data are used for: 1) numerical scheme verification by comparison against the HYDRUS model as a benchmark; 2) model validation by comparison against real site data; and 3) estimation of model predictive uncertainty and sources of modeling errors. This dataset is associated with the following publication: He, J., M.M. Hantush, L. Kalin, and S. Isik. Two-Layer numerical model of soil moisture dynamics: Model assessment and Bayesian uncertainty estimation. JOURNAL OF HYDROLOGY. Elsevier Science Ltd, New York, NY, USA, 613 part A: 128327, (2022).
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Until recently, researchers who wanted to examine the determinants of state respect for most specific negative rights needed to rely on data from the CIRI or the Political Terror Scale (PTS). The new V-DEM dataset offers scholars a potential alternative to the individual human rights variables from CIRI. We analyze a set of key Cingranelli-Richards (CIRI) Human Rights Data Project and Varieties of Democracy (V-DEM) negative rights indicators, finding unusual and unexpectedly large patterns of disagreement between the two sets. First, we discuss the new V-DEM dataset by comparing it to the disaggregated CIRI indicators, discussing the history of each project, and describing its empirical domain. Second, we identify a set of disaggregated human rights measures that are similar across the two datasets and discuss each project's measurement approach. Third, we examine how these measures compare to each other empirically, showing that they diverge considerably across both time and space. These findings point to several important directions for future work, such as how conceptual approaches and measurement strategies affect rights scores. For the time being, our findings suggest that researchers should think carefully about using the measures as substitutes.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Statistical comparison of multiple time series in their underlying frequency patterns has many real applications. However, existing methods are only applicable to a small number of mutually independent time series, and empirical results for dependent time series are limited to comparing two time series. We propose scalable methods based on a new algorithm that enables us to compare the spectral density of a large number of time series. The new algorithm helps us efficiently obtain all pairwise feature differences in frequency patterns between M time series, which play an essential role in our methods. When all M time series are independent of each other, we derive the joint asymptotic distribution of their pairwise feature differences. The asymptotic dependence structure between the feature differences motivates our proposed test for multiple mutually independent time series. We then adapt this test to the case of multiple dependent time series by partially accounting for the underlying dependence structure. Additionally, we introduce a global test to further enhance the approach. To examine the finite-sample performance of our proposed methods, we conduct simulation studies. The new approaches demonstrate the ability to compare a large number of time series, whether independent or dependent, while exhibiting competitive power. Finally, we apply our methods to compare multiple mechanical vibrational time series.
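As a rough illustration of the kind of pairwise quantities involved (a sketch only, not the authors' algorithm or test statistics), the following estimates each series' spectral density with Welch's method and collects all pairwise differences of the log-spectra:

import numpy as np
from itertools import combinations
from scipy.signal import welch

def pairwise_spectral_differences(series, fs=1.0, nperseg=256):
    # series: array of shape (M, T) holding M time series of equal length T.
    spectra = []
    for x in series:
        freqs, pxx = welch(x, fs=fs, nperseg=nperseg)
        spectra.append(np.log(pxx))
    spectra = np.asarray(spectra)
    diffs = {(i, j): spectra[i] - spectra[j]
             for i, j in combinations(range(len(series)), 2)}
    return freqs, diffs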
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Three experimental data sets (WNRA0103, WNRA0305 and WNRA0506) involving three grapevine varieties and a range of deficit irrigation and pruning treatments are described. The purpose for obtaining the data sets was two-fold: (1) to meet the research goals of the Cooperative Research Centre for Viticulture (CRCV) during its tenure 1999-2006, and (2) to test the capacity of the VineLOGIC grapevine growth and development model to predict timing of bud burst, flowering, veraison and harvest, yield and yield components, berry attributes and components of water balance. A test script, included with the VineLOGIC source code publication (https://doi.org/10.25919/5eb3536b6a8a8), enables comparison between model predicted and measured values for key variables. Key references relating to the model and data sets are provided under Related Links. A description of selected terms and outcomes of regression analysis between values predicted by the model and observed values are provided under Supporting Files.
Version 3 included the following amendments: (1) to WNRA0103 – alignment of settings for irrigation simulation control and initial soil water contents for soil layers with those in WNRA0305 and WNRA0506, and addition of missing berry anthocyanin data for season 2002-03; (2) to WNRA0305 – minor corrections to values for berry and bunch number and weight, and correction of the target Brix value for harvest to 24.5 Brix; (3) minor corrections to some measured berry anthocyanin concentrations as mg/g fresh weight, minor amendments to treatment names for consistency across data sets, and to the name for irrigation type to improve clarity; and (4) update of regression analysis between VineLOGIC-predicted versus observed values for key variables. Version 4 (this version) includes a metadata-only amendment with two additions to Related Links: ‘VineLOGIC View’ and a recent publication.
Lineage: The data sets were obtained at a commercial wine company vineyard in the Mildura region of north-western Victoria, Australia. Vines were spaced 2.4 m within rows and 3 m between rows, trained to a two-wire vertical trellis and drip irrigated. The soil was a Nookamka sandy loam.
Data Set 1 (WNRA0103): An experiment comparing the effects on grapevine growth and development of three pruning treatments (spur, light mechanical hedging and minimal pruning), involving Shiraz on Schwarzmann rootstock, irrigated with industry standard drip irrigation, with data collected over three seasons: 2000-01, 2001-02 and 2002-03. The experiment was established and conducted by Dr Rachel Ashley with input from Peter Clingeleffer (CSIRO), Dr Bob Emmett (Department of Primary Industries, Victoria) and Dr Peter Dry (University of Adelaide). Seasons in the southern hemisphere span two calendar years, with budburst in the second half of the first calendar year and harvest in the first half of the second calendar year.
Data Set 2 (WNRA0305): An experiment comparing the effects of three irrigation treatments (industry standard drip, Regulated Deficit Irrigation (RDI) and Prolonged Deficit (PD) irrigation), involving Cabernet Sauvignon on own roots and pruned by light mechanical hedging, over three seasons: 2002-03, 2003-04 and 2004-05. The RDI treatment involved application of a water deficit in the post-fruit-set to pre-veraison period. The PD treatment was initially the same as RDI but with an extended period of extreme deficit (no irrigation) after the RDI stress period until veraison. The experiment was established and conducted by Dr Nicola Cooley with input from Peter Clingeleffer and Dr Rob Walker (CSIRO).
Data Set 3 (WNRA0506): Compared basic grapevine growth, development and berry maturation post fruit set at three Trial Sites over two seasons, 2004-05 and 2005-06. Trial Site one is the same site used to collect Data Set 1. Data were collected from all three pruning treatments in season 2004-05 but only from the spur and light mechanical hedging treatments in season 2005-06. Trial Site two involved comparison of two scions, Chardonnay and Shiraz, both on Schwarzmann rootstock, irrigated with industry standard drip irrigation and pruned using light mechanical hedging. Data were collected in season 2004-05. Trial Site three is the same site used to collect Data Set 2. Data were collected from all three irrigation treatments in season 2004-05 but only from the industry standard drip and PD treatments in 2005-06. Establishment and conduct of experiments at Trial Sites one, two and three was by Dr Anne Pellegrino and Deidre Blackmore with input from Peter Clingeleffer and Dr Rob Walker. The decision to develop Data Set 3 followed a mid-term CRCV review and analysis of available Australian data sets and relevant literature, which identified the need to obtain a data set covering all of the required variables necessary to run VineLOGIC and, in particular, to obtain data on berry development commencing as soon as possible after fruit set. Most prior data sets were from veraison onwards, which is later than desirable from a modelling perspective.
Data Set 1, 2 and 3 compilation for VineLOGIC was by Deidre Blackmore with input from Dr Doug Godwin. Review and testing of the Data Sets with VineLOGIC was conducted by David Benn with input from Dr Paul Petrie (South Australian Research and Development Institute), Dr Vinay Pagay (University of Adelaide) and Drs Everard Edwards and Rob Walker (CSIRO). A collaboration agreement with the University of Adelaide established in 2017 enabled further input to review of the Data Sets and their testing with VineLOGIC by Dr Sam Culley.
Data were expressed as Median (IQR) and Mean ± SD.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic depends on complex epidemiological models that must be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources, with corrections for systematic measurement errors and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the usual attributes of official data sources, such as daily mortality and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC). The data were collected by using text mining techniques and reviewing PDF reports, metadata, and reference data. The combined dataset includes complete spatial data such as country area, international country number, Alpha-2 code, Alpha-3 code, latitude, longitude, and some additional attributes such as population. The improved dataset benefits from major corrections to the referenced data sets and official reports, such as adjustments to reporting dates, which suffered from a one- to two-day lag, removal of negative values, detection of unreasonable changes in historical data in new reports, and corrections of systematic measurement errors, which have been increasing as the pandemic outbreak spreads and more countries contribute data to the official repositories. Additionally, the root mean square error of attributes in the paired comparison of datasets was used to identify the main data problems. The data for China is presented separately and in more detail, and has been extracted from the attached reports available on the main page of the CCDC website. This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline of confirmed cases, long-term predictions of deaths or hospital utilization, or the effects of quarantine, stay-at-home orders and other social distancing measures, as well as in analyses of the pandemic's turning point or its economic and social impact. It can help inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-opening schools, alleviating business and social distancing restrictions, designing economic programs or allowing sports events to resume.
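As an illustration of the paired comparison mentioned above, a minimal pandas sketch that aligns two official sources on date and country and computes the root mean square error of a shared attribute (the source frames and column names are hypothetical):

import numpy as np
import pandas as pd

def paired_rmse(df_a, df_b, attribute, keys=("date", "country")):
    # Align the two sources on the key columns and compare the shared attribute.
    merged = pd.merge(df_a, df_b, on=list(keys), suffixes=("_a", "_b"))
    diff = merged[f"{attribute}_a"] - merged[f"{attribute}_b"]
    return float(np.sqrt(np.mean(np.square(diff))))

# Example: paired_rmse(who_df, ecdc_df, "new_cases")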
The derived acceleration maps were elaborated starting from SAR data from the ERS satellites (processed period from 1992 to 2000) and from ENVISAT (processed period from 2003 to 2008). The two sets of data are separated by an interval of about four years. In order to better discriminate the areas subject to anomalous movements, for the determination of hydrogeological risk, it was decided to develop derived acceleration measures. From the average velocities obtained from the processing of ERS and ENVISAT data, the areas present in both data series and with the same geometry were identified: ERS ascending with ENVISAT ascending. Within these areas, comparisons were made between the velocities estimated in the time interval 1992-2000 with ERS data and those estimated in the time interval 2003-2008 with ENVISAT data; the result of these comparisons is an index of the variation of the velocities estimated in the two time intervals. Two derived acceleration measurements are provided, one relating to the ascending observation geometry and the other to the descending observation geometry. Each measurement contains the velocity variation indices for the points identified and measured by the ERS and ENVISAT processing within certain areas. The result of the comparison consists of two maps representing the average velocity difference measured in the time interval 1992-2000 with ERS data and 2003-2008 with ENVISAT data. The differences refer to areas of 100 m x 100 m, for a total coverage of almost 100,000 sq km in both geometries. The comparison covers over ten million DPs overall between ERS and ENVISAT.
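A minimal sketch of the comparison logic described above, assuming each dataset is a table of points with hypothetical easting, northing and velocity columns: mean velocities are gridded onto 100 m x 100 m cells and differenced for cells covered in both periods.

import pandas as pd

def velocity_change_index(ers, envisat, cell=100.0):
    # Grid each dataset onto 100 m x 100 m cells and average the velocities.
    def cell_means(df):
        grid = df.assign(cx=(df["easting"] // cell).astype(int),
                         cy=(df["northing"] // cell).astype(int))
        return grid.groupby(["cx", "cy"])["velocity"].mean()
    v_ers = cell_means(ers)          # 1992-2000 mean velocities
    v_envisat = cell_means(envisat)  # 2003-2008 mean velocities
    # Subtraction aligns on cell index; cells not covered by both become NaN.
    return (v_envisat - v_ers).dropna()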
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
The motivation behind collecting this dataset was personal, with the objective of answering a simple question: "Does exercise/working out improve a person's activeness?" For the scope of this project, a person's activeness was measured by their daily step count (the number of steps they take in a day). Mood was recorded as "Happy", "Neutral" or "Sad", which were given numeric values of 300, 200 and 100 respectively. Feeling of activeness was recorded as "Active" or "Inactive", which were given numeric values of 500 and 0 respectively. I had noticed for a while that during the months when I was exercising regularly I felt more active and would move around a lot more, whereas when I was not working out I would feel lethargic. I wanted to know for sure what the connection between exercise and activeness was. I started compiling the data on 6th October with the help of the Samsung Health application, which was recording my daily step count and the number of calories burned. The purpose of the project was to establish, through two sets of data (control and experimental), whether working out/exercise promotes an increase in the daily step count or not.
Columns: Date, Step Count, Calories Burned, Mood, Hours of Sleep, Feeling of Activeness or Inactiveness, Weight.
Special thanks to the Samsung Health application, which contributed to the dataset by providing the daily step count and the number of calories burned.
"Does exercise/working-out improve a person’s activeness?”
Attribution 2.0 (CC BY 2.0) (https://creativecommons.org/licenses/by/2.0/)
License information was derived automatically
Mapping analysis of all the sensilla following stimulation with 5% saponin in control and Gr28bMi flies (we followed Tanimura's nomenclature) (n ≥ 8). The error bars represent SEMs. The asterisks indicate significant differences from the control detected by a single-factor ANOVA with Scheffe's analysis to compare two sets of data (*P < 0.05, **P < 0.01). List of tagged entities: arthropod sensillum (uberon:UBERON:0002536), Gr28b (ncbigene:117496), Mapping analysis, electrophysiological method (bao:BAO_0000424)
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The data were collected in order to compare the quality of the signal acquired by two devices – BITalino (Da Silva, Guerreiro, Lourenço, Fred, & Martins, 2014) and BioNomadix (BIOPAC Systems Inc., Goleta, CA, USA).
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data for institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2018. For a description of the data collection, processing, and output methods, please see the "Methods" section below. Note that the RAMP data model changed in August 2018, and two sets of documentation are provided to describe data collection and processing before and after the change.
Methods
RAMP Data Documentation – January 1, 2017 through August 18, 2018
Data Collection
RAMP data were downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data from January 1, 2017 through August 18, 2018 were downloaded in one dataset per participating IR. The following fields were downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, data are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the data which records whether each URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
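A minimal sketch of this flagging step (the extension list is illustrative, not RAMP's actual rule set):

from urllib.parse import urlparse

NON_HTML_EXTENSIONS = {".pdf", ".csv", ".doc", ".docx", ".xls", ".xlsx", ".zip"}

def citable_content(url):
    # Flag a URL as citable content when its path ends in a non-HTML file extension.
    path = urlparse(url).path.lower()
    return "Yes" if any(path.endswith(ext) for ext in NON_HTML_EXTENSIONS) else "No"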
Processed data are then saved in a series of Elasticsearch indices. From January 1, 2017, through August 18, 2018, RAMP stored data in one index per participating IR.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).
For any specified date range, the steps to calculate CCD are as follows (a code sketch follows the list):
Filter data to only include rows where "citableContent" is set to "Yes."
Sum the value of the "clicks" field on these rows.
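A minimal pandas sketch of these two steps, applied to one of the published monthly CSV exports described under "Output to CSV" below:

import pandas as pd

ramp = pd.read_csv("2018-01_RAMP_all.csv")
ccd = ramp.loc[ramp["citableContent"] == "Yes", "clicks"].sum()
print(f"Citable content downloads: {ccd}")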
Output to CSV
Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above.
The data in these CSV files include the following fields:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
country: The country from which the corresponding search originated.
device: The device used for the search.
date: The date of the search.
citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
index: The Elasticsearch index corresponding to page click data for a single IR.
repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the index field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
Filenames for files containing these data follow the format 2018-01_RAMP_all.csv. Using this example, the file 2018-01_RAMP_all.csv contains all data for all RAMP participating IR for the month of January, 2018.
Data Collection from August 19, 2018 Onward
RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).
Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:
url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Following the data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.
The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:
country: The country from which the corresponding search originated.
device: The device used for the search.
impressions: The number of times the URL appears within the SERP.
clicks: The number of clicks on a URL which took users to a page outside of the SERP.
clickThrough: Calculated as the number of clicks divided by the number of impressions.
position: The position of the URL within the SERP.
date: The date of the search.
Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.
More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en
Data Processing
Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."
The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.
Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.
About Citable Content Downloads
Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.
This data is from Gaia Data Release 3 (DR3) and includes data on two star clusters: NGC 188 and M67. The data is used in my astronomy class, wherein students are tasked with determining which star cluster is older. (Update, 12-Sep-2023: I'm hoping to add a ML version of the data set that includes more field stars and divides the data into test and train sets. TBA.)
NGC 188 and M67 stars are provided as separate CSV files, with each row corresponding to a star. There are two versions for each star cluster:
For more on these quantities, please see https://gea.esac.esa.int/archive/documentation/GDR3/Gaia_archive/chap_datamodel/sec_dm_main_source_catalogue/ssec_dm_gaia_source.html
SELECT gaia_source.parallax,gaia_source.phot_g_mean_mag,gaia_source.bp_rp,gaia_source.pmra,gaia_source.pmdec
FROM gaiadr3.gaia_source
WHERE
gaia_source.l BETWEEN 215 AND 216 AND
gaia_source.b BETWEEN 31.5 AND 32.5 AND
gaia_source.phot_g_mean_mag < 18 AND
gaia_source.parallax_over_error > 4 AND
gaia_source.bp_rp IS NOT NULL
SELECT gaia_source.parallax,gaia_source.phot_g_mean_mag,gaia_source.bp_rp,gaia_source.pmra,gaia_source.pmdec
FROM gaiadr3.gaia_source
WHERE
gaia_source.l BETWEEN 122 AND 123.5 AND
gaia_source.b BETWEEN 21.5 AND 23 AND
gaia_source.phot_g_mean_mag < 18 AND
gaia_source.parallax_over_error > 4 AND
gaia_source.bp_rp IS NOT NULL
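A minimal sketch of the classroom comparison: plot a colour-magnitude diagram for each cluster from the exported CSVs. The filenames are hypothetical; judging from the galactic coordinates, the first query above selects the M67 field and the second the NGC 188 field.

import pandas as pd
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharey=True)
for ax, name in zip(axes, ["M67", "NGC188"]):
    stars = pd.read_csv(f"{name}.csv")
    ax.scatter(stars["bp_rp"], stars["phot_g_mean_mag"], s=2)
    ax.set_title(name)
    ax.set_xlabel("BP - RP colour")
axes[0].set_ylabel("G mean magnitude")
axes[0].invert_yaxis()  # brighter (smaller magnitude) stars at the top
plt.tight_layout()
plt.show()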
Please see Gaia Archive's how to cite page for information regarding the use of the data.
The classroom activity and my code are free to use under an MIT License.
This dataset contains the 30 questions that were posed to the chatbots (i) ChatGPT-3.5; (ii) ChatGPT-4; and (iii) Google Bard, in May 2023 for the study “Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard”. These 30 questions describe mathematics and logic problems that have a unique correct answer. The questions are fully described with plain text only, without the need for any images or special formatting. The questions are divided into two sets of 15 questions each (Set A and Set B). The questions of Set A are 15 “Original” problems that cannot be found online, at least in their exact wording, while Set B contains 15 “Published” problems that one can find online by searching on the internet, usually with their solution. Each question is posed three times to each chatbot.
This dataset contains the following: (i) the full set of the 30 questions, A01-A15 and B01-B15; (ii) the correct answer for each one of them; (iii) an explanation of the solution, for the problems where such an explanation is needed; and (iv) the 30 (questions) × 3 (chatbots) × 3 (answers) = 270 detailed answers of the chatbots. For the published problems of Set B, we also provide a reference to the source from which each problem was taken.
This data set contains data associated with MODIS fire maps generated using two different algorithms and compared against fire maps produced by ASTER. These data relate to a paper (Morisette et al., 2005) that describes the use of high spatial resolution ASTER data to evaluate the characteristics of two fire detection algorithms, both applied to MODIS-Terra data and both operationally producing publicly available fire locations. The two algorithms are NASA's operational Earth Observing System MODIS fire detection product and Brazil's National Institute for Space Research (INPE) algorithm. These data are the ASCII files used in the logistic regression and error matrices presented in the paper.
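For orientation, a minimal sketch of the kind of error (confusion) matrix reported in the paper, cross-tabulating MODIS detections against ASTER-derived fire presence for the same locations (the 0/1 input arrays are hypothetical):

import numpy as np

def error_matrix(aster_fire, modis_fire):
    # Rows: MODIS detection yes/no; columns: ASTER fire yes/no.
    aster = np.asarray(aster_fire, dtype=bool)
    modis = np.asarray(modis_fire, dtype=bool)
    return np.array([
        [np.sum(modis & aster), np.sum(modis & ~aster)],
        [np.sum(~modis & aster), np.sum(~modis & ~aster)],
    ])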
🎬 Description
This dataset combines information about movies from various IMDb and Kinopoisk top lists. It was created for a comparative analysis of ratings, genres, countries of production, and other characteristics of popular films from two of the world's largest movie databases.
The dataset can be useful for researchers, analysts, and movie enthusiasts who want to explore the differences between Russian-speaking and international audiences’ preferences, as well as to identify patterns between ratings, budgets, and genres.
📂 Dataset Structure
Title – movie title (string)
kinopoiskId – unique movie identifier on Kinopoisk (integer or string)
imdbId – unique movie identifier on IMDb (string)
Year – year of release (integer)
Rating Kinopoisk – movie rating according to Kinopoisk (float from 0 to 10)
Rating Imdb – movie rating according to IMDb (float from 0 to 10)
Age Limit – age restriction (e.g., "6+", "12+", "18+")
Genres – movie genres (string or list of genres separated by commas)
Country – country or countries of production (string)
Director – name of the director (string)
Budget – movie budget in USD (integer)
Fees – box office revenue in USD (integer)
Description Kinopoisk – short movie description from Kinopoisk (in Russian)
Description Imdb – short movie description from IMDb (in English)
📊 Possible Analysis Directions
Comparing movie ratings between Kinopoisk and IMDb (see the sketch after this list);
Analyzing the most popular genres and their evolution over time;
Studying the relationship between ratings, budgets, and box office revenue;
Comparing audience preferences across different countries.
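A minimal pandas sketch of the first analysis direction, comparing the two ratings film by film (the CSV filename is hypothetical; column names follow the structure listed above):

import pandas as pd

films = pd.read_csv("kinopoisk_imdb_top.csv")
films["Rating Difference"] = films["Rating Kinopoisk"] - films["Rating Imdb"]

print(films["Rating Difference"].describe())
print("Correlation:", films["Rating Kinopoisk"].corr(films["Rating Imdb"]))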
Data on the efficacy of 5 pulicides as tools for suppressing fleas on black-tailed prairie dogs in Buffalo Gap National Grassland, South Dakota, 2015-2017. Fleas were collected from live-trapped prairie dogs on non-treated (CONTROL) sites and nearby sites treated with pulicides for flea control. Data are from 3 prairie dog colonies (South Exclosure, Cutbank, and Big Foot). We tested the following pulicides: Alpine ALPINE dust (0.25% dinotefuran with 95% diatomaceous earth), Dusta-cide MALATHION dust (6% malathion), Sevin SEVIN dust (5% carbaryl), Tri-Die TRIDIE dust (1% pyrethrum with 40% amorphous silica and 10% piperonyl butoxide), and FIPRONIL grain (0.005% fipronil). Two sets of data are presented, each with flea counts (Fleas) from prairie dogs. Each line of data is from an individual prairie dog. The first set of data, Shortterm BACIs 2015-17, includes data from short-term before-after-control-impact (BACI) experiments comparing the abundance of fleas on prairie dogs at non-treated and treated sites in 3 time intervals: before pulicide treatments (Before), from 1 to 30 days after the treatments (After-1), and from 31 to 91 days after the treatments (After-2). The second set of data, Longterm BACI 2016-17, includes data from long-term BACI experiments comparing the abundance of fleas on prairie dogs at non-treated and treated sites in 3 time intervals: before pulicide treatments (June-July 2016), 11 months after the treatments (June 2017), and 12 months after the treatments (July 2017). Funding and logistical support were provided by the U.S. Geological Survey, U.S. Forest Service, National Park Service, U.S. Fish and Wildlife Service, Colorado State University, Prairie Wildlife Research, National Fish and Wildlife Foundation, and World Wildlife Fund.
These data include pitch angle diffusion coefficients for chorus waves which have been evaluated at the loss cone angle calculated in multiple ways. We have predominantly concentrated on the dawn side between 00-12 MLT (Magnetic Local Time), for 5<L*<5.5, as this is where we have Van Allen Radiation Belt Storm Probes (RBSP) measurements and where scattering of electrons due to chorus waves is known to occur. We have used 7 years of RBSP wave and cold plasma measurements, from November 2012 to October 2019, to calculate these diffusion coefficients. For the first two sets of data we provide chorus diffusion coefficients with fpe/fce multiplied by 2 and divided by 2, respectively. The next four data sets have been calculated from RBSP data using two different methods: first using average values, as has previously been done (e.g. Horne et al. [2013]) and as used above, and secondly by using co-located measurements of the wave spectra and fpe/fce to calculate pitch angle diffusion coefficients (Daa) and then averaging, where fpe is the plasma frequency and fce is the electron gyrofrequency, similar to that presented in Ross et al. [2021] for Electromagnetic Ion Cyclotron (EMIC) waves and Wong et al. [2022] for magnetosonic waves. Both methods use a modified version of the PADIE code (Glauert et al. [2005]) which allows an arbitrary wave power spectral density input rather than Gaussian inputs. The RBSP chorus diffusion coefficient matrices are computed by combining RBSP data with a profile of how chorus wave power changes with latitude, derived from the VLF database in Meredith et al. [2018]. The magnetic latitude profile enables us to map RBSP measurements to magnetic latitudes between 0<MLAT<60 and therefore include the effects of high latitude chorus in our results. The RBSP diffusion matrices also use a new chorus wave normal angle model derived from RBSP data, composed of different wave normal angle distributions for different spatial location and fpe/fce bins. Lastly, we include two data sets of RBSP chorus diffusion coefficients combined with diffusion coefficients due to collisions with atmospheric particles, to calculate the total diffusion of electrons near the loss cone between 00-12 MLT, for 5<L*<5.5. We have produced these different sets of chorus (and combined chorus and collision) diffusion coefficients to test our methods of calculating electron precipitation and to find which variables these calculations are sensitive to.
Funding was provided by NERC Highlight Topic Grant NE/P01738X/1 (Rad-Sat) and NERC National Capability grants NE/R016038/1 and NE/R016445/1