This dataset contains the ICD-10 code lists used to test the sensitivity and specificity of the Clinical Practice Research Datalink (CPRD) medical code lists for dementia subtypes. The provided code lists are used to define dementia subtypes in linked data from the Hospital Episode Statistics (HES) inpatient dataset and the Office for National Statistics (ONS) death registry, which are then used as the 'gold standard' for comparison against dementia subtypes defined using the CPRD medical code lists. The CPRD medical code lists used in this comparison are available here: Venexia Walker, Neil Davies, Patrick Kehoe, Richard Martin (2017): CPRD codes: neurodegenerative diseases and commonly prescribed drugs. https://doi.org/10.5523/bris.1plm8il42rmlo2a2fqwslwckm2 Complete download (zip, 3.9 KiB)
Official statistics are produced impartially and free from political influence.
The United Nations Energy Statistics Database (UNSTAT) is a comprehensive collection of international energy and demographic statistics prepared by the United Nations Statistics Division. The 2004 version represents the latest in the series of annual compilations, which commenced under the title World Energy Supplies in Selected Years, 1929-1950. Supplementary series of monthly and quarterly data on production of energy may be found in the Monthly Bulletin of Statistics. The database contains comprehensive energy statistics for more than 215 countries or areas covering production, trade and intermediate and final consumption (end-use) for primary and secondary conventional, non-conventional, and new and renewable sources of energy. Mid-year population estimates are included to enable the computation of per capita data. Annual questionnaires sent to national statistical offices serve as the primary source of information. Supplementary data are also compiled from national, regional and international statistical publications. The Statistics Division prepares estimates where official data are incomplete or inconsistent. The database is updated on a continuous basis as new information and revisions are received. This metadata file describes the population statistics for the stated time period. For more information about the country site codes, see the United Nations "Standard country or area codes for statistical use": https://unstats.un.org/unsd/methodology/m49/overview/
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the last decade, a plethora of algorithms have been developed for spatial ecology studies. In our case, we use some of these codes for underwater research work in applied ecology analysis of threatened endemic fishes and their natural habitat. For this, we developed codes in the RStudio® script environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008), maptools (Hijmans & Elith, 2017), ModelMetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbuettel & Balamuta, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).
It is important to run all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance-similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the codes used to run GLM and DOMAIN:
In the first instance, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend the use of 10,000 background points when using regression methods (e.g., Generalized Linear Model) or distance-based models (e.g., DOMAIN). However, we considered factors such as the extent of the area and the type of study species to be important for the correct selection of the number of points (pers. obs.). Then, we extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans and Elith, 2017).
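The idea behind this step can be sketched in base R. This is an illustration only: the actual Code2 script works with raster layers (in practice via functions such as dismo::randomPoints() and raster::extract()), whereas the toy grid and variable names here are our own assumptions.

```r
# Illustrative sketch (base R only): draw 10,000 background points over a toy
# predictor grid and "extract" the predictor value at each point.
set.seed(42)

# Toy predictor "raster": a 100 x 100 grid of one bioclimatic variable
grid <- matrix(rnorm(100 * 100), nrow = 100)

# Draw 10,000 random background points as row/column cell indices
n_bg <- 10000
bg <- data.frame(row = sample(1:100, n_bg, replace = TRUE),
                 col = sample(1:100, n_bg, replace = TRUE))

# Extract the predictor value at each background point
bg$value <- grid[cbind(bg$row, bg$col)]
```

With real data, the same pattern is repeated for every predictor layer in the stack, and the presence coordinates are extracted the same way.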
Subsequently, we subdivided both the presence and background point groups into 75% training data and 25% test data, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), which gave us the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 boosting iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal testing). The resulting plots show the partial dependence as a function of each predictor variable.
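A hedged sketch of this step, assuming the gbm package: the simulated data frame `dat`, and the reading of the "4-point validation interval" as `cv.folds = 4`, are our assumptions, not the authors' exact parameterization.

```r
# Sketch of the GBM relative-contribution step (cf. Code3/Code4), assuming
# the gbm package is installed. `dat` is simulated for illustration.
library(gbm)

set.seed(1)
dat <- data.frame(presence = rbinom(100, 1, 0.5),
                  bio1 = rnorm(100), elev = rnorm(100))

fit <- gbm(presence ~ ., data = dat,
           distribution = "gaussian",  # Gaussian distribution, as in the text
           n.trees = 5000,             # 5,000 boosting iterations
           cv.folds = 4)               # our reading of the 4-point validation
                                       # interval; an assumption

summary(fit)        # relative contribution (influence) of each predictor
plot(fit, i.var = 1) # partial dependence plot for the first predictor
```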
Subsequently, the correlation of the variables was assessed with Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity between variables (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).
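The multicollinearity screen can be sketched with base R's cor(); the variable names and the simulated collinear pair are invented for illustration.

```r
# Base-R sketch of the Pearson multicollinearity screen (cf. Code5): build the
# correlation matrix and flag variable pairs beyond the +/-0.70 threshold.
set.seed(1)
vars <- data.frame(bio1 = rnorm(50))
vars$bio2 <- vars$bio1 * 0.9 + rnorm(50, sd = 0.1)  # deliberately collinear
vars$elev <- rnorm(50)

r <- cor(vars, method = "pearson")

# Flag variable pairs with |r| > 0.70 (candidates for removal)
high <- which(abs(r) > 0.70 & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(r)[high[, 1]],
                      var2 = colnames(r)[high[, 2]],
                      r = r[high])
# corrplot::corrplot(r) would visualize the matrix, as in the original script
```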
Once the above codes were run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% testing) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree to capture linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots show the probability of occurrence against values of continuous variables or categories of discrete variables. The points of the presence and background training groups are also included.
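The per-variable screen can be sketched with base R's glm(): a binomial GLM of presence (1) versus background (0) with linear and quadratic terms. The simulated predictor is our own; the binomial family is an assumption consistent with the presence/background response described above.

```r
# Minimal base-R sketch of the per-variable GLM screen (cf. Code7): fit a
# degree-2 polynomial GLM and keep the variable if any term meets alpha <= 0.05.
set.seed(1)
x <- c(rnorm(100, mean = 1), rnorm(100, mean = -1))  # one simulated predictor
y <- rep(c(1, 0), each = 100)                        # 1 = presence, 0 = background

fit <- glm(y ~ poly(x, 2), family = binomial)        # linear + quadratic terms
pvals <- summary(fit)$coefficients[-1, "Pr(>|z|)"]   # p-values of the x terms
keep <- any(pvals <= 0.05)                           # variable passes the screen
```

Plotting predict(fit, type = "response") against x would give the ecological response curve described in the text.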
On the other hand, a global GLM was also run, and the generalized model was evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary threshold of 0.5 to obtain better modeling performance and to avoid a high percentage of bias from type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).
Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).
                       Validation set
Model                  True           False
Presence               A              B
Background             C              D
We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002); TSS also corrects for random performance while giving equal weight to the prediction of presences and backgrounds (Fielding and Bell, 1997; Allouche et al., 2006).
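Both metrics follow directly from the Table 1 layout; the counts below are illustrative, not from the study.

```r
# Base-R computation of Overall accuracy and TSS from the Table 1 cells:
# A = true presences, B = false presences (commission), C = true backgrounds,
# D = false backgrounds (omission). Counts are illustrative.
A <- 40; B <- 10; C <- 35; D <- 15

overall     <- (A + C) / (A + B + C + D)  # proportion of correctly predicted cases
sensitivity <- A / (A + D)                # presences correctly predicted
specificity <- C / (B + C)                # backgrounds correctly predicted
TSS         <- sensitivity + specificity - 1  # Allouche et al. (2006)
```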
The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test. We only included the presence training subset and the predictor variable stack in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.
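The Gower similarity underlying DOMAIN can be illustrated in a few lines of base R (the original script would use a package implementation such as dismo::domain(); the toy numbers here are ours): each candidate cell is scored by its similarity to the closest presence-training point, using range-standardized distances.

```r
# Toy base-R version of the DOMAIN/Gower metric (Carpenter et al., 1993):
# similarity of a candidate point is 1 minus its smallest mean
# range-standardized distance to any presence-training point.
gower_similarity <- function(x, train, ranges) {
  d <- apply(train, 1, function(t) mean(abs(x - t) / ranges))
  1 - min(d)   # DOMAIN keeps the most similar training point
}

train  <- rbind(c(10, 0.2), c(12, 0.4))  # two presence points, two predictors
ranges <- c(10, 1)                        # predictor ranges over the study area
s <- gower_similarity(c(11, 0.3), train, ranges)
```

Mapping `s` over every cell of the predictor stack yields the habitat suitability surface that is then thresholded and validated.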
Regarding the model evaluation and estimation, we selected the following estimators:
1) Partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).
2) The ROC/AUC curve for model validation, where an optimal performance threshold is estimated to give an expected confidence of 75% to 99% probability (DeLong et al., 1988).
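The AUC behind this validation step can be computed directly from the predicted scores via the rank (Mann-Whitney) statistic, without any plotting; in the original workflow the pROC package would provide this along with DeLong confidence intervals. The scores below are invented for illustration.

```r
# Base-R AUC via the Mann-Whitney rank statistic: the probability that a
# randomly chosen presence scores higher than a randomly chosen background.
auc <- function(scores_presence, scores_background) {
  r  <- rank(c(scores_presence, scores_background))
  n1 <- length(scores_presence)
  n2 <- length(scores_background)
  (sum(r[1:n1]) - n1 * (n1 + 1) / 2) / (n1 * n2)
}

a <- auc(c(0.9, 0.8, 0.7), c(0.4, 0.3, 0.2))  # perfectly separated scores
```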
Files included are original data inputs on stream fishes (fish_data_OEPA_2012.csv), water chemistry (OEPA_WATER_2012.csv), geographic data (NHD_Plus_StreamCat); modeling files for generating predictions from the original data, including the R code (MVP_R_Final.txt) and Stan code (MV_Probit_Stan_Final.txt); and the model output file containing predictions for all NHDPlus catchments in the East Fork Little Miami River watershed (MVP_EFLMR_cooc_Final). This dataset is associated with the following publication: Martin, R., E. Waits, and C. Nietch. Empirically-based modeling and mapping to consider the co-occurrence of ecological receptors and stressors. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 613-614: 1228-1239, (2018).
This dataset is a subset only of Code Enforcement information and is used by the Performance Measure "Code Cases per Code Officer". Some fields are calculated or derived. Full "Code Enforcement" dataset = https://citydata.mesaaz.gov/dataset/Code-Enforcement/hgf6-yenu
Demographic statistics broken down by zip code
Housing code enforcement activities, including inspections and violations.
https://www.ons.gov.uk/methodology/geography/licences
The Register of Geographic Codes (RGC) is a key product that contains the definitive list of UK statistical geographies. ONS maintains the definitive set of statistical geographies, coordinates the issue of new codes, and maintains the relationship between active and archived code ranges on behalf of the Government Statistical Service. The RGC should be used in conjunction with the Code History Database, available to download separately.(File Size - 20 MB)
Statistics on effort use in Cod Recovery Zone and Western waters are submitted to the European Commission on the 15th day of every month.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "instructional_code-search-net-java"
Dataset Summary
This is an instructional dataset for Java. The dataset contains two different kinds of tasks:
1. Given a piece of code, generate a description of what it does.
2. Given a description, generate a piece of code that fulfils the description.
Languages
The dataset is in English.
Data Splits
There are no splits.
Dataset Creation
May 2023
Curation Rationale
This dataset… See the full description on the dataset page: https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-java.
Open Government Licence 3.0 http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
The Woodland Carbon Code is a voluntary standard, initiated in July 2011, for woodland creation projects that make claims about the carbon they sequester (take out of the atmosphere).
Woodland Carbon Code statistics are used to monitor the uptake of this voluntary standard and have been published quarterly since January 2013.
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This repository contains code and manuscripts for a project in the lab relating to predicting the amount of diet induced obesity.
This PredictorsDietInducedObesity data is made available under the Open Data Commons Attribution License: http://opendatacommons.org/licenses/by/1.0. For more information see the LICENSE.txt file in this directory.
Raw Data
---------------
All raw data are in the data/raw folder. These data are obtained automatically from our internal LIMS system and can only be updated from that source. The script files use this raw data to do the analysis. Data which have been processed are saved in the data/processed folder.
Script files
--------------
This code base includes the raw data and reproducible R code for these analyses within the scripts folder. All analysis files are R Markdown (Rmd) files. These files can be run inside RStudio (https://www.rstudio.com/) and will generate the md and html files, which include the processed data. For more information on using R and these files see http://cran.us.r-project.org/.
Publications
------------------
All posters, manuscripts or external presentations are in the publications folder. This includes revisions and, where possible, reviewer comments and responses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are the data and code for the first column in the Spatial Demography journal's Software and Code series.
This dataset contains data from California resident tax returns filed with California adjusted gross income and self-assessed tax listed by zip code. This dataset contains data for taxable years 1992 to the most recent tax year available.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and Code to accompany the paper "Correlation Neglect in Student-to-School Matching." Abstract: We present results from three experiments containing incentivized school-choice scenarios. In these scenarios, we vary whether schools' assessments of students are based on a common priority (inducing correlation in admissions decisions) or are based on independent assessments (eliminating correlation in admissions decisions). The quality of students' application strategies declines in the presence of correlated admissions: application strategies become substantially more aggressive and fail to include attractive "safety" options. We provide a battery of tests suggesting that this phenomenon is at least partially driven by correlation neglect, and we discuss implications for the design and deployment of student-to-school matching mechanisms.
They enable further analysis and comparison of Regional Trade in goods data and contain information that includes:
The spreadsheets provide data on businesses using both the whole-number and proportion-number methodologies (see section 3.24, page 14, of the RTS methodology document).
The spreadsheets will cover:
The Exporters by proportional business count spreadsheet was previously produced by the Department for International Trade.
Tax code rates for Tax Years 2006 through 2013. Data is updated yearly. For more information on Tax codes visit the Cook County Clerk's website at: http://www.cookcountyclerk.com/tsd/Pages/default.aspx
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exports: Value: Canada: Wadding,Felt&Nonwoven,SpecYarn,Rope data was reported at 0.601 USD mn in Mar 2025. This records a decrease from the previous number of 0.703 USD mn for Feb 2025. Exports: Value: Canada: Wadding,Felt&Nonwoven,SpecYarn,Rope data is updated monthly, averaging 0.167 USD mn from Jan 2000 (Median) to Mar 2025, with 303 observations. The data reached an all-time high of 1.965 USD mn in Apr 2024 and a record low of 0.001 USD mn in Jan 2019. Exports: Value: Canada: Wadding,Felt&Nonwoven,SpecYarn,Rope data remains active status in CEIC and is reported by Korea Customs Service. The data is categorized under Global Database’s South Korea – Table KR.JA009: Trade Statistics: Export: Value: HS Code: 2 Digits: Top 20 Countries.