In order to analyse specific features of data papers, we established a representative sample of data journals, based on lists from the European FOSTER Plus project, the German wiki forschungsdaten.org hosted by the University of Konstanz, and two French research organizations. The complete list consists of 82 data journals, i.e. journals which publish data papers. They represent less than 0.5% of academic and scholarly journals. For each of these 82 data journals, we gathered information about the discipline, the global business model, the publisher, peer reviewing, etc. The analysis is partly based on data from ProQuest’s Ulrichsweb database, enriched and completed by information available on the journals’ home pages. Some of the data journals are presented as “pure” data journals stricto sensu, i.e. journals which publish exclusively or mainly data papers; we identified 28 journals in this category (34%). For each journal, we assessed, through direct search on the journals’ homepages (information about the journal, author’s guidelines, etc.), the use of identifiers and metadata, the mode of selection and the business model, and we assessed different parameters of the data papers themselves, such as length, structure and linking. The results of this analysis are compared with other research journals (“mixed” data journals) which publish data papers along with regular research articles, in order to identify possible differences between the two journal categories, at the level of the data papers as well as at the level of the regular research papers. Moreover, the results are discussed against concepts of knowledge organization.
In 2020, more than ** percent of hedge fund managers classified as alternative data market leaders used ***** or more alternative data sets globally, while only ***** percent of the rest of the market used at least ***** alternative data sets. This highlights the difference in alternative data experience between the two groups. Using *** or more alternative data sets was the most popular approach in both groups, adopted by ** percent of market leaders and ** percent of the rest of the market.
A performance comparison between data structures.
The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching *** zettabytes in 2024. Over the next five years, up to 2028, global data creation is projected to grow to more than *** zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected, driven by increased demand during the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often.
Storage capacity also growing
Only a small percentage of this newly created data is kept, though: just * percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of **** percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached *** zettabytes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open access and open data are becoming more prominent on the global research agenda. Funders are increasingly requiring grantees to deposit their raw research data in appropriate public archives or stores in order to facilitate the validation of results and further work by other researchers.
While the rise of open access has fundamentally changed the academic publishing landscape, the policies around data are reigniting the conversation around what universities can and should be doing to protect the assets generated at their institutions. The main difference between an open access policy and an open data policy is that there is no existing precedent or status quo for how academia handles the dissemination of research outputs that are not traditional ‘paper’ publications.
As governments and funders of research see the benefit of open content, the creation of recommendations, mandates and enforcement of mandates are coming thick and fast.
This dataset contains all data and code necessary to reproduce the analysis presented in the manuscript: Winzeler, H.E., Owens, P.R., Read, Q.D., Libohova, Z., Ashworth, A., Sauer, T. 2022. Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization. Land 11:2018. DOI: 10.3390/land11112018.
There are several steps to this analysis; the relevant scripts for each are listed below. The first step is to use the raw digital elevation model (DEM) data to produce different versions of the topographic wetness index (TWI) for the study region (Calculating TWI). Then, these TWI output files are processed, along with soil moisture (volumetric water content, or VWC) time series data from a number of sensors located within the study region, to create analysis-ready data objects (Processing TWI and VWC). Next, models are fit relating TWI to soil moisture (Model fitting) and results are plotted (Visualizing main results). A number of additional analyses were also done (Additional analyses).
Input data
The DEM of the study region is archived in this dataset as SourceDem.zip. This contains the DEM of the study region (DEM1.sgrd) and associated auxiliary files, all called DEM1.* with different extensions. In addition, the DEM is provided as a .tif file called USGS_one_meter_x39y400_AR_R6_WashingtonCO_2015.tif. The remaining data and code files are archived in the repository created with a GitHub release on 2022-10-11, twi-moisture-0.1.zip. The data are found in a subfolder called data:
- 2017_LoggerData_HEW.csv through 2021_HEW.csv: soil moisture (VWC) logger data for each year 2017-2021 (5 files total).
- 2882174.csv: weather data from a nearby station.
- DryPeriods2017-2021.csv: starting and ending days for dry periods 2017-2021.
- LoggerLocations.csv: geographic locations and metadata for each VWC logger.
- Logger_Locations_TWI_2017-2021.xlsx: 546 topographic wetness indexes calculated at each VWC logger location. Note: this is intermediate input created in the first step of the pipeline.
Code pipeline
To reproduce the analysis in the manuscript, run these scripts in the following order. The scripts are all found in the root directory of the repository. See the manuscript for more details on the methods.
Calculating TWI
- TerrainAnalysis.R: taking the DEM file as input, calculates 546 different topographic wetness indexes using a variety of algorithms. Each algorithm is run multiple times with different input parameters, as described in more detail in the manuscript. After performing this step, it is necessary to use the SAGA-GIS GUI to extract the TWI values for each of the sensor locations. The output generated in this way is included in this repository as Logger_Locations_TWI_2017-2021.xlsx, so it is not necessary to rerun this step of the analysis, but the code is provided for completeness.
Processing TWI and VWC
- read_process_data.R: takes raw TWI and moisture data files and processes them into analysis-ready format, saving the results as CSV.
- qc_avg_moisture.R: does additional quality control on the moisture data and averages it across different time periods.
Model fitting
Models were fit regressing soil moisture (average VWC for a certain time period) against a TWI index, with and without soil depth as a covariate. In each case, for both the model without depth and the model with depth, prediction performance was calculated with and without spatially-blocked cross-validation. Where cross-validation wasn't used, we simply used the predictions from the model fit to all the data.
- fit_combos.R: models were fit to each combination of soil moisture averaged over 57 months (all months from April 2017 to December 2021) and 546 TWI indexes. In addition, models were fit to soil moisture averaged over years, and to the grand mean across the full study period.
- fit_dryperiods.R: models were fit to soil moisture averaged over previously identified dry periods within the study period (each 1 or 2 weeks in length), again for each of the 546 indexes.
- fit_summer.R: models were fit to the soil moisture average for the months of June-September for each of the five years, again for each of the 546 indexes.
Visualizing main results
Preliminary visualization of results was done in a series of RMarkdown notebooks. All the notebooks follow the same general format, plotting model performance (observed-predicted correlation) across different combinations of time period and characteristics of the TWI indexes being compared. The indexes are grouped by SWI versus TWI, DEM filter used, flow algorithm, and any other parameters that varied. The notebooks show the model performance metrics with and without the soil depth covariate, and with and without spatially-blocked cross-validation. Crossing those two factors, there are four values for model performance for each combination of time period and TWI index presented.
- performance_plots_bymonth.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by month across the five years of data to show within-year trends.
- performance_plots_byyear.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by year to show trends across multiple years.
- performance_plots_dry_periods.Rmd: prediction performance for the models fit to the previously identified dry periods.
- performance_plots_summer.Rmd: prediction performance for the models fit to the June-September moisture averages.
Additional analyses
Some additional analyses were done that may not be published in the final manuscript but which are included here for completeness.
- 2019dryperiod.Rmd: analysis, done separately for each day, of a specific dry period in 2019.
- alldryperiodsbyday.Rmd: analysis, done separately for each day, of the same dry periods discussed above.
- best_indices.R: after fitting models, this script was used to quickly identify some of the best-performing indexes for closer scrutiny.
- wateryearfigs.R: exploratory figures showing the median and quantile interval of VWC for sensors in low and high TWI locations for each water year.
Resources in this dataset:
- Resource Title: Digital elevation model of study region. File Name: SourceDEM.zip. Resource Description: .zip archive containing digital elevation model files for the study region. See dataset description for more details.
- Resource Title: twi-moisture-0.1: archived git repository containing all other necessary data and code. File Name: twi-moisture-0.1.zip. Resource Description: .zip archive containing all data and code, other than the digital elevation model archived as a separate file. This file was generated by a GitHub release made on 2022-10-11 of the git repository hosted at https://github.com/qdread/twi-moisture (private repository). See dataset description and README file contained within this archive for more details.
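The model-fitting step above is implemented in the archived R scripts (fit_combos.R and related files). Purely as an illustration of the approach they describe — regressing period-averaged VWC on a single TWI index, with and without soil depth, and scoring prediction performance by the observed-predicted correlation under spatially blocked cross-validation — a minimal Python sketch with hypothetical column and file names might look like this:

```python
# Minimal sketch only; the archived R scripts are the authoritative implementation.
# Column names (vwc_mean, twi, depth, block) and the CSV file are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def blocked_cv_correlation(df, formula, block_col="block"):
    """Leave-one-spatial-block-out CV; returns the observed-predicted correlation."""
    preds = []
    for block in df[block_col].unique():
        train = df[df[block_col] != block]
        test = df[df[block_col] == block]
        fit = smf.ols(formula, data=train).fit()
        preds.append(pd.DataFrame({"obs": test["vwc_mean"], "pred": fit.predict(test)}))
    preds = pd.concat(preds)
    return preds["obs"].corr(preds["pred"])

df = pd.read_csv("analysis_ready_vwc_twi.csv")  # hypothetical analysis-ready table
print(blocked_cv_correlation(df, "vwc_mean ~ twi"))          # TWI only
print(blocked_cv_correlation(df, "vwc_mean ~ twi + depth"))  # TWI plus soil depth covariate
```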
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bed Surface Profiling (BSP) data
The Bed Surface Profiling (BSP) data were scanned by an underwater 3D laser scanner for different flow conditions (Case Nos. 1-12). The sampling rate was 10 Hz. Each line of data corresponds to xyz point cloud data, where the x-direction corresponds to the streamwise direction, the y-direction to the transverse direction, and the z-direction to the vertical direction.
Code
A Python script is included to process the BSP data and calculate the sediment pickup probability for each case. The script analyzes 12 case files (Case1.txt to Case12.txt) containing measurement data, processes them using dynamic denoising filters, and calculates a derived metric (p_value) for each case. The results are saved in an Excel file.
Dynamic Denoising Filter (dynamic_denoise_filter):
- For d = 0.00185: checks immediate neighbors (previous and next columns) to correct 0, 1, or -1 values.
- For d = 0.00380: uses a larger window (11 columns) with different criteria.
Case Processing (process_case):
- Extracts the y (2nd column) and z (3rd column) values.
- Computes dz as the finite difference of the z values, scaled by d.
- Builds a marker matrix m flagging regions where dz exceeds ±cr.
- Applies dynamic_denoise_filter to clean the marker matrix.
- Computes y_diff as half the absolute difference between adjacent y values, relative to the y range.
Parameter Settings:
- d values: 0.00185 for Cases 1-6, 0.00380 for Cases 7-12.
- Threshold (cr): set to 0 for thresholding dz.
- Working directory: D:/data_ana_ca.
Output:
- analysis_result_ca_cr.xlsx with columns Case and p_value.
Input:
- Case files (Case1.txt to Case12.txt) with comma-separated data.
The code handles large datasets efficiently using vectorized operations and sliding window techniques, ensuring scalability for similar analytical tasks.
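The released script is the authoritative implementation; the following is only a simplified Python sketch of the steps listed above, with the dynamic denoising filter omitted and the exact scaling and aggregation into p_value treated as assumptions:

```python
# Simplified sketch; not the released script. The denoising of the marker matrix is omitted,
# and the final aggregation into p_value is a placeholder assumption.
import numpy as np
import pandas as pd

D_VALUES = {**{i: 0.00185 for i in range(1, 7)}, **{i: 0.00380 for i in range(7, 13)}}
CR = 0.0  # threshold applied to dz

results = []
for case in range(1, 13):
    data = np.loadtxt(f"Case{case}.txt", delimiter=",")   # comma-separated xyz point cloud
    y, z = data[:, 1], data[:, 2]                         # y: 2nd column, z: 3rd column
    d = D_VALUES[case]
    dz = np.diff(z) / d                                   # finite difference of z, scaled by d
    m = np.where(dz > CR, 1, np.where(dz < -CR, -1, 0))   # marker matrix
    # ... dynamic_denoise_filter(m, d) would be applied here in the released code ...
    y_diff = 0.5 * np.abs(np.diff(y))                     # half the absolute adjacent y difference
    p_value = y_diff[m != 0].sum() / np.ptp(y)            # placeholder normalisation by y range
    results.append({"Case": case, "p_value": p_value})

pd.DataFrame(results).to_excel("analysis_result_ca_cr.xlsx", index=False)
```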
https://spdx.org/licenses/etalab-2.0.html
In the frame of the QUAE project, an identification procedure was developed to sort singular behaviours in river temperature time series. This procedure was conceived as a tool to identify particular behaviours in time series despite non-continuous measurements and regardless of the type of measurement (temperature, streamflow, etc.). Three types of singularities are identified: extreme values (in some cases similar to outliers), roughened data (such as the difference between water temperature and air temperature), and buffered data (such as signals caused by groundwater inflows).
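The dataset description does not spell out the detection algorithm itself; purely as an illustration of the first class of singularity (extreme values), a simple rolling-quantile flag on a temperature series could be sketched as follows (file, column names and thresholds are hypothetical):

```python
# Illustration only; not the QUAE identification procedure.
import pandas as pd

temp = pd.read_csv("river_temperature.csv", parse_dates=["date"],
                   index_col="date")["temperature"]   # hypothetical file and columns
roll = temp.rolling("30D", min_periods=10)            # tolerant of gaps in the record
iqr = roll.quantile(0.75) - roll.quantile(0.25)
extreme = (temp - roll.median()).abs() > 3 * iqr      # flag candidate extreme values
print(int(extreme.sum()), "points flagged")
```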
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
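The following is not the paper's algorithm, only a toy Python illustration of the general pattern it describes: each site centralizes a small random sample, a global reference is built from the pooled sample, and outliers are then scored locally without moving the full data.

```python
# Toy illustration of sample-based distributed outlier detection (not the published method).
import numpy as np

rng = np.random.default_rng(0)
sites = [rng.normal(0, 1, size=(100_000, 3)) for _ in range(4)]   # synthetic local datasets

# 1. Each site centralizes only a small random sample.
central = np.vstack([s[rng.choice(len(s), 500, replace=False)] for s in sites])

# 2. A simple global reference is built centrally (mean and covariance for a Mahalanobis score).
mu = central.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(central, rowvar=False))

# 3. Each site scores its own records locally and reports only the flagged ones.
def local_outliers(x, threshold=25.0):
    d2 = np.einsum("ij,jk,ik->i", x - mu, cov_inv, x - mu)  # squared Mahalanobis distance
    return x[d2 > threshold]

print([len(local_outliers(s)) for s in sites])
```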
State estimates for these years are no longer available due to methodological concerns with combining 2019 and 2020 data. We apologize for any inconvenience or confusion this may cause. Because of the COVID-19 pandemic, most respondents answered the survey via the web in Quarter 4 of 2020, even though all responses in Quarter 1 were from in-person interviews. It is known that people may respond to the survey differently while taking it online, thus introducing what is called a mode effect. When the state estimates were released, it was assumed that the mode effect was similar for different groups of people. However, later analyses have shown that this assumption should not be made. Because of these analyses, along with concerns about the rapid societal changes in 2020, it was determined that averages across the two years could be misleading. For more detail on this decision, see the 2019-2020 state data page.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Small business transactions and revenue data aggregated from several credit card processors, collected by Womply and compiled by Opportunity Insights. Transactions and revenue are reported based on the ZIP code where the business is located.
Data provided for CT (FIPS code 9), MA (25), NJ (34), NY (36), and RI (44).
Data notes from Opportunity Insights: Seasonally adjusted change since January 2020. Data is indexed in 2019 and 2020 as the change relative to the January index period. We then seasonally adjust by dividing year-over-year, which represents the change since January observed in 2020 relative to the change since January observed in 2019. We account for differences in the dates of federal holidays between 2019 and 2020 by shifting the 2019 reference data to align the holidays before performing the year-over-year division.
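A rough sketch of that adjustment is below (hypothetical column names, holiday alignment omitted; Opportunity Insights' production code is the reference):

```python
# Sketch only: index each year to its January reference period, then divide the 2020
# change-since-January by the 2019 change-since-January on the same day of year.
import pandas as pd

df = pd.read_csv("small_business_revenue.csv", parse_dates=["date"])   # hypothetical file
df["year"] = df["date"].dt.year
df["doy"] = df["date"].dt.dayofyear

jan_mean = (df[df["date"].dt.month == 1]
            .groupby("year")["revenue"].mean().rename("jan_mean"))
df = df.join(jan_mean, on="year")
df["change_since_jan"] = df["revenue"] / df["jan_mean"]

pivot = df.pivot_table(index="doy", columns="year", values="change_since_jan")
seasonally_adjusted = pivot[2020] / pivot[2019]   # year-over-year division
print(seasonally_adjusted.head())
```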
Small businesses are defined as those with annual revenue below the Small Business Administration’s thresholds. Thresholds vary by 6-digit NAICS code, with the maximum number of employees to be considered a small business ranging from 100 to 1,500 depending on the industry.
County-level and metro-level data and breakdowns by High/Middle/Low income ZIP codes have been temporarily removed since the August 21st 2020 update due to revisions in the structure of the raw data we receive. We hope to add them back to the OI Economic Tracker soon.
More detailed documentation on Opportunity Insights data can be found here: https://github.com/OpportunityInsights/EconomicTracker/blob/main/docs/oi_tracker_data_documentation.pdf
Background: Adolescent girls in Kenya are disproportionately affected by early and unintended pregnancies, unsafe abortion and HIV infection. The In Their Hands (ITH) programme in Kenya aims to increase adolescents' use of high-quality sexual and reproductive health (SRH) services through targeted interventions. The ITH programme aims to: 1) promote the use of contraception and testing for sexually transmitted infections (STIs), including HIV, or pregnancy, for sexually active adolescent girls; 2) provide information, products and services on the adolescent girl's terms; and 3) promote community support for girls and boys to access SRH services.
Objectives: The objectives of the evaluation are to assess: a) to what extent and how the new Adolescent Reproductive Health (ARH) partnership model and integrated system of delivery is working to meet its intended objectives and the needs of adolescents; b) adolescent user experiences across key quality dimensions and outcomes; c) how ITH programme has influenced adolescent voice, decision-making autonomy, power dynamics and provider accountability; d) how community support for adolescent reproductive and sexual health initiatives has changed as a result of this programme.
Methodology
The ITH programme is being implemented in two phases: formative planning and experimentation in the first year, from April 2017 to March 2018, and a national roll-out and implementation from April 2018 to March 2020. This second phase is informed by an Annual Programme Review and thorough benchmarking and assessment, which informed critical changes to performance and capacity so that ITH is fit for scale. It is expected that ITH will cover approximately 250,000 adolescent girls aged 15-19 in Kenya by April 2020. The programme is implemented by a consortium of Marie Stopes Kenya (MSK), Well Told Story, and Triggerise. ITH's key implementation strategies seek to increase adolescent motivation for service use; create a user-defined ecosystem and platform to provide girls with a network of accessible, subsidized and discreet SRH services; and launch and sustain a national discourse campaign around adolescent sexuality and rights. The 3-year study will employ a mixed-methods approach with multiple data sources, including secondary data and qualitative and quantitative primary data from various stakeholders, to explore their perceptions of and attitudes towards adolescent SRH services. Quantitative data analysis will be done using STATA to provide descriptive statistics and statistical associations/correlations on key variables. All qualitative data will be analyzed using NVIVO software.
Study Duration: 36 months - between 2018 and 2020.
Narok and Homabay counties
Households
All adolescent girls aged 15-19 years resident in the household.
The sampling of adolescents for the household survey was based on expected changes in adolescents' intention to use contraception in future. According to the Kenya Demographic and Health Survey 2014, 23.8% of adolescents and young women reported not intending to use contraception in future. This was used as a baseline proportion for the intervention, as it aimed to increase demand and reduce the proportion of sexually active adolescents who did not intend to use contraception in the future. Assuming that the project was to achieve an impact of at least 2.4 percentage points in the intervention counties (i.e. a reduction by 10%), a design effect of 1.5 and a non-response rate of 10%, a sample size of 1885, estimated using Cochran's sample size formula for categorical data, was adequate to detect this difference between baseline and end line time points. Based on data from the 2009 Kenya census, there were approximately 0.46 adolescent girls per household, which meant that the study was to include approximately 4876 households from the two counties at both baseline and end line surveys.
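For reference, a generic version of this calculation is sketched below; reproducing the exact reported figures (1,885 and 4,876) depends on the specific variant of the formula and rounding choices, which are not fully detailed in the description.

```python
# Generic Cochran-style sample size calculation using the stated inputs (illustrative only).
from math import ceil

z = 1.96            # 95% confidence
p = 0.238           # baseline proportion not intending to use contraception (KDHS 2014)
e = 0.024           # detectable change of 2.4 percentage points
deff = 1.5          # design effect
nonresponse = 0.10  # anticipated non-response

n0 = z**2 * p * (1 - p) / e**2             # Cochran's formula for a proportion
n = ceil(n0 * deff / (1 - nonresponse))    # inflate for design effect and non-response
households = ceil(n / 0.46)                # ~0.46 adolescent girls per household (2009 census)
print(n0, n, households)
```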
We collected data among a representative sample of adolescent girls living in both urban and rural ITH areas to understand adolescents' access to information, use of SRH services and SRH-related decision-making autonomy before the implementation of the intervention. Based on the number of ITH health facilities in the two study counties, Homa Bay and Narok, we purposively sampled 3 sub-counties in Homa Bay (West Kasipul, Ndhiwa and Kasipul) and 3 sub-counties in Narok (Narok Town, Narok South and Narok East). In each of the ITH intervention counties, there were sub-counties that had been prioritized for the project, and our data collection focused on these sub-counties selected for intervention. A stratified sampling procedure was used to select wards within the sub-counties and villages from the wards. Then households were selected from each village after all households in the villages were listed. The purposive selection of sub-counties closer to ITH intervention facilities meant that urban and semi-urban areas were oversampled due to the concentration of health facilities in urban areas.
Qualitative Sampling
Focus Group Discussion participants were recruited from the villages where the ITH adolescent household survey was conducted in both counties. A convenience sample of consenting adults living in the villages was invited to participate in the FGDs. The discussions were conducted in local languages. A facilitator and a note-taker were trained on how to use the focus group guide, how to facilitate the group to elicit the information sought, and how to take detailed notes. All focus group discussions took place in the local language and were tape-recorded, and the consent process included permission to tape-record the session. Participants were identified only by their first names and were asked not to share what was discussed outside of the focus group. Participants were read an informed consent form and asked to give written consent. In-depth interviews were conducted with a purposively selected sample of consenting adolescent girls who had participated in the adolescent survey. We conducted a total of 45 in-depth interviews with adolescent girls (20 in Homa Bay County and 25 in Narok County). In addition, 8 FGDs (4 per county) were conducted with mothers of adolescent girls who were usual residents of the villages identified for the interviews, and another 4 FGDs (2 per county) were conducted with CHVs.
N/A
Face-to-face [f2f] for quantitative data collection and Focus Group Discussions and In Depth Interviews for qualitative data collection
The questionnaire covered: socio-demographic and household information; SRH knowledge and sources of information; sexual activity and relationships; family planning knowledge, access, choice and use when needed; exposure to family planning messages; and voice and decision-making autonomy and quality of care for those who visited health facilities in the 12 months before the survey. The questionnaire was piloted before the data collection, and the questions were reviewed for appropriateness, comprehension and flow. The pilot was conducted among a sample of 42 adolescent girls aged 15-19 (two per field interviewer) from a community outside the study counties.
The questionnaire was originally developed in English and later translated into Kiswahili. The questionnaire was programmed using ODK-based Survey CTO platform for data collection and management and was administered through face-to-face interview.
The survey tools were programmed using the ODK-based SurveyCTO platform for data collection and management. During programming, consistency checks were in-built into the data capture software which ensured that there were no cases of missing or implausible information/values entered into the database by the field interviewers. For example, the application included controls for variables ranges, skip patterns, duplicated individuals, and intra- and inter-module consistency checks. This reduced or eliminated errors usually introduced at the data capture stage. Once programmed, the survey tools were tested by the programming team who in conjunction with the project team conducted further testing on the application's usability, in-built consistency checks (skips, variable ranges, duplicating individuals etc.), and inter-module consistency checks. Any issues raised were documented and tracked on the Issue Tracker and followed up to full and timely resolution. After internal testing was done, the tools were availed to the project and field teams to perform user acceptance testing (UAT) so as to verify and validate that the electronic platform worked exactly as expected, in terms of usability, questions design, checks and skips etc.
Data cleaning was performed to ensure that data were free of errors and that indicators generated from these data were accurate and consistent. This process began on the first day of data collection, as the first records were uploaded into the database. The data manager used data collected during pilot testing to begin writing scripts in Stata 14 to check the variables in the data in 'real time'. This ensured the resolution of any inconsistencies that could be addressed by the data collection teams during the fieldwork activities. The Stata 14 scripts that performed real-time checks and cleaned the data also wrote to a .rtf file that detailed every check performed against each variable, any inconsistencies encountered, and all steps that were taken to address these inconsistencies. The .rtf files also reported when a variable was
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides the key configuration files and scripts required to reproduce the numerical weather simulations presented in our study “High-resolution data assimilation for two maritime extreme weather events: a comparison between 3D-Var and EnKF” (https://doi.org/10.5194/nhess-2024-177). The experiments were conducted using the Weather Research and Forecasting (WRF) model v4.0. Included in this dataset are:
- Namelist files (namelist.wps, namelist.input) that define the model setup and allow replication of the simulation environment used in the paper.
- Python scripts used to generate the figures presented in the publication, enabling users to reproduce the main visual results directly.
Due to storage limitations, the full model outputs from the numerical simulations (several terabytes in size) cannot be hosted in this repository. Instead, we provide the essential configuration files and analysis scripts so that researchers can rerun the simulations on their own computational resources and replicate both the workflow and figures described in the paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation
Song’s Vegetation Continuous fields (VCF) product, based on AVHRR satellite data, is the longest time-series of its type, but lacks updates past 2016 due to the extensive degradation of the sensor. We used machine learning to extend this time-series using data from the Copernicus Land Cover dataset, which provides per-pixel proportions of different land cover classes between 2015 and 2019. In addition, we included MODIS VCF data.
Content
This repository contains the infrastructure used to model Song-like VCF data past 2016. This infrastructure contains a yaml file that configures the modelling framework (e.g. variables, directories, hyper-parameter tuning), and that interacts with a standardized folder structure.
Modelling approach
Song's VCF dataset includes data on generic categories, namely “tree cover”, “non-tree vegetation”, and “non vegetated”. Given the Copernicus dataset has a higher thematic detail, we first aggregated these data into comparable classes. We created a “Non-tree vegetation” layer (i.e. total per-pixel proportion of crops, grasses, shrubs, and mosses), and a “Non Vegetated” layer (i.e. total per-pixel proportion of bare land, permanent water, urban, and snow). Independent data on “Tree cover” was already present.
We then constructed a Random Forest Regression (RFReg) model to predict Song-like VCF layers between 2016 and 2019. The predictions were informed by variables on topography, climate, and fires (which limit the density of vegetation), and by variables on differences between the Copernicus VCF and MODIS-based VCF data. Because MODIS data is available past 2016, its inclusion informs our models on how MODIS data, and their differences compared to Copernicus data, relate to the values reported in Song's data.
Sampling scheme
For each VCF category, we collected samples on a country-by-country basis. Within each country, we estimated the difference in percent cover between Song's and the Copernicus VCF data, and sampled across a gradient of differences, from -100% (no cover in AVHRR and full cover in Copernicus) to +100% (full cover in AVHRR and no cover in Copernicus). We iterated through this range in intervals of 10% and sampled across a gradient of “tree cover”, “non-tree vegetation”, and “non vegetated”, in intervals of 10% from 0% to 100%. We collected at least one sample per 50 km2 in 2016, the last year in which all VCF-related variables (Song's, Copernicus, MODIS) are available simultaneously. The number of samples attributed to each range of differences is proportional to the area covered by this range within the country of reference. The sampling approach was repeated for each VCF class, and the outputs were later combined into a single set of samples that excludes duplicates, resulting in 238,052 samples.
Validation
The model outputs were validated using leave-one-out cross-validation. For each VCF class, the validation framework iterates through each country where samples were collected, excluding it for validation and using the remaining samples to train an RFReg model. This resulted in R2 values of 0.91, 0.87 and 0.91 for “tree cover”, “non-tree vegetation”, and “non vegetated”, respectively. The corresponding RMSE values were 2.31%, 3.05%, and 2.25%.
The model was also applied to data from 2015, which was used neither to train nor to validate our models. A comparison of the 2015 Song data against our predictions, covering 8,764,232 pixels, yielded R2 values of 0.94, 0.91, and 0.97. The RMSE values were 6.65%, 8.92%, and 5.96%. Additionally, we compared changes between 2015 and 2016, resulting in RMSE values of 2.83%, 3.69%, and 2.57%.
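A minimal sketch of the leave-one-country-out validation for one class is given below (the actual pipeline is driven by the yaml configuration described above; file and column names here are hypothetical):

```python
# Sketch only; not the production modelling framework.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

samples = pd.read_csv("vcf_samples.csv")   # hypothetical sample table with a "country" column
predictors = [c for c in samples.columns if c not in ("country", "tree_cover")]

obs, pred = [], []
for country in samples["country"].unique():
    train = samples[samples["country"] != country]
    test = samples[samples["country"] == country]
    model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
    model.fit(train[predictors], train["tree_cover"])
    obs.append(test["tree_cover"].to_numpy())
    pred.append(model.predict(test[predictors]))

obs, pred = np.concatenate(obs), np.concatenate(pred)
print("R2:", r2_score(obs, pred), "RMSE:", mean_squared_error(obs, pred) ** 0.5)
```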
Post-processing
When observing annual VCF time series based on Song's data, we noted that our predictions were most plausible for “tree cover” and “non-tree vegetation”. In turn, our “non vegetated” predictions are seemingly underestimated (see "temporal_trend_check.png"), reporting large year-to-year decreases in cover (-3.05% between 2016 and 2017, compared to -0.14% for "tree cover" and -0.26% for “non-tree vegetation”). To address this issue, we recommend deriving data on “non vegetated” cover by computing the difference between 100% and the sum of "tree cover” and “non-tree vegetation”.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
These are peer-reviewed supplementary materials for the article 'Healthcare cost comparison between first-line ibrutinib and acalabrutinib in chronic lymphocytic leukemia patients in the Veterans Affairs' published in the Journal of Comparative Effectiveness Research.
Supplemental Table 1: Administrative codes for comorbidities
Aim: Bruton’s tyrosine kinase inhibitors (BTKis), including ibrutinib and acalabrutinib, transformed the treatment landscape of chronic lymphocytic leukemia (CLL) and small lymphocytic lymphoma (SLL) by improving outcomes compared with chemoimmunotherapy. Real-world economic comparisons between BTKis are needed in diverse populations. This study aimed to compare healthcare costs in the Veterans Health Administration (VHA) among patients with CLL/SLL treated with, and remaining persistent on, first-line (1L) ibrutinib versus acalabrutinib monotherapy for 12 months. Materials & methods: This retrospective study used VHA electronic medical record data from January 2006 to July 2024. Eligible patients initiated 1L ibrutinib or acalabrutinib monotherapy on or after November 2019 and remained on continuous treatment for ≥12 months. All-cause and CLL/SLL-related costs were assessed over 12 months of follow-up. Generalized linear models were used to estimate adjusted costs and compare differences between treatment cohorts. Results: A total of 1059 patients were included (ibrutinib: n = 732; acalabrutinib: n = 327). During the 12-month follow-up of continuous 1L treatment, the annual adjusted all-cause total healthcare cost difference between ibrutinib and acalabrutinib was -$2422 (p = 0.46) (adjusted medical cost difference: $5259, p = 0.03; adjusted pharmacy cost difference: -$5886, p = 0.02). The annual adjusted CLL/SLL-related total healthcare cost difference between ibrutinib and acalabrutinib was -$3793 (p = 0.15) (adjusted medical cost difference: $2085, p = 0.05; adjusted pharmacy cost difference: -$5860, p = 0.02). Conclusion: Among VHA patients with CLL/SLL who initiated and remained on treatment with 1L BTKi monotherapy for 12 months, annual all-cause and CLL/SLL-related total healthcare costs were similar between ibrutinib and acalabrutinib. Pharmacy costs were lower for ibrutinib, while medical costs were lower for acalabrutinib, resulting in overall comparable total costs.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The international investment position (IIP) is a statistical statement that shows, at a point in time, the value and composition of: (a) financial assets of residents of an economy that are claims on non-residents, and gold bullion held as reserve assets; and (b) liabilities of residents of an economy to non-residents. The difference between an economy’s external financial assets and liabilities is the economy’s net IIP, which may be positive or negative. In other words, the net international investment position (NIIP) provides an aggregate view of the net financial position (assets minus liabilities) of a country vis-à-vis the rest of the world. It allows for a stock-flow analysis of the external position of the country. The MIP scoreboard indicator is the net international investment position expressed in percent of GDP. The indicator is based on Eurostat data from the balance of payments statistics. These data are reported quarterly by the EU Member States. Definitions are based on the Sixth Edition of the IMF's Balance of Payments and International Investment Position Manual (BPM6). The indicative threshold is -35%. The MIP scoreboard indicator is calculated as: [NIIPt / GDPt] * 100.
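A worked example of the indicator (figures are illustrative, not Eurostat data):

```python
# Illustrative only: a net IIP of -450 against a GDP of 1000 (same currency units).
niip = -450.0
gdp = 1000.0
indicator = niip / gdp * 100          # [NIIPt / GDPt] * 100
print(indicator, "breaches the -35% threshold:", indicator < -35)
```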
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The CAGES program (Comparative Assessment of Gulf Estuarine Systems) is designed to examine the differences between estuarine ecosystems and investigate why some are more productive than others. The program focuses on estuarine areas important to commercial fisheries and includes data on commercial finfish and invertebrate species, as well as other species commonly captured in trawl sampling. This Louisiana dataset is a subset of the CAGES Relational Database which is a compilation of fishery-independent data contributed by natural resource agencies of the Gulf States (Texas, Louisiana, Mississippi, Alabama, and Florida). This data set includes data from trawls collected by the Alabama Department of Conservation and Natural Resources (ADCNR) and Gulf Coast Research Laboratory (GCRL), as well as associated hydrographic data. Trawl data were used to calculate CPUE (catch per unit effort) at each station for the years 1981 through 2007.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
UK Power Networks maintains the 132kV voltage level network and below. An important part of the distribution network is distributing this electricity across our regions through circuits. Electricity enters our network through Super Grid Transformers at substations shared with National Grid, which we call Grid Supply Points. It is then sent across our 132 kV circuits towards our grid substations and primary substations. These circuits can be viewed on the single line diagrams in our Long-Term Development Statements (LTDS), and the underlying data is then found in the LTDS tables.
This dataset provides half-hourly current and power flow data across these named circuits from 2021 through to the previous month across our license areas. The data are aligned with the same naming convention as the LTDS for improved interoperability.
Care is taken to protect the private affairs of companies connected to the 132 kV network, resulting in the redaction of certain circuits. Where redacted, we provide monthly statistics to continue to add value where possible. Where monthly statistics exist but half-hourly is absent, this data has been redacted.
To find which circuit you are looking for, use the ‘ltds_line_name’ that can be cross-referenced in the 132kV Circuits Monthly Data, which describes by month what circuits were triaged, if they could be made public, and what the monthly statistics are of that site.
If you want to download all this data, it is perhaps more convenient from our public sharepoint: Sharepoint
This dataset is part of a larger endeavour to share more operational data on UK Power Networks assets. Please visit our Network Operational Data Dashboard for more operational datasets.
Methodological Approach
The dataset is not derived; it consists of the measurements from our network stored in our historian.
The current measurements are taken from current transformers attached to the cable at the circuit breaker, and power is derived by combining this with the data from voltage transformers physically attached to the busbar. The historian stores datasets based on a report-by-exception process, such that a certain deviation from the present value must be reached before a point measurement is logged to the historian. We extract the data following a 30-min time-weighted averaging method to get half-hourly values. Where there are no measurements logged in the period, the data provided is blank; due to the report-by-exception process, it may be appropriate to forward fill this data for shorter gaps.
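As an illustration of that extraction step only (not the production historian query), a 30-minute time-weighted average over report-by-exception samples can be approximated by forward-filling onto a fine grid and averaging:

```python
# Approximation of a 30-minute time-weighted average for report-by-exception data,
# where each logged value persists until the next exception report. Illustrative only.
import pandas as pd

raw = pd.read_csv("historian_export.csv", parse_dates=["timestamp"])   # hypothetical export
series = raw.set_index("timestamp")["current_amps"].sort_index()

fine = series.resample("1min").ffill()        # value holds between exception reports
half_hourly = fine.resample("30min").mean()   # time-weighted average at 1-minute resolution
print(half_hourly.head())
```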
We developed a data redaction process to protect the privacy of companies according to the Utilities Act 2000 section 105.1.b, which requires UK Power Networks not to disclose information relating to the affairs of a business. For this reason, where the demand of a private customer is derivable from our data and that data is not already public information (e.g., data provided via Elexon on the Balancing Mechanism), we redact the half-hourly time series and provide only the monthly averages. This redaction process considers the correlation of all the data, of only the corresponding periods where the customer is active, of the first-order difference of all the data, and of the first-order difference of only the corresponding periods where the customer is active. Should any of these four tests show a high linear correlation, the data is deemed redacted. This process is applied not only to the circuit of the customer, but also to the surrounding circuits that would reveal the signal of that customer.
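For illustration, the four-test screen could be expressed roughly as follows (the correlation threshold and inputs are hypothetical; this is not the production redaction code):

```python
# Rough sketch of the four correlation tests used to decide whether to redact a circuit.
import pandas as pd

def should_redact(circuit: pd.Series, customer: pd.Series, threshold: float = 0.9) -> bool:
    active = customer != 0                                      # periods where the customer is active
    tests = [
        circuit.corr(customer),                                 # all data, levels
        circuit[active].corr(customer[active]),                 # active periods, levels
        circuit.diff().corr(customer.diff()),                   # all data, first differences
        circuit[active].diff().corr(customer[active].diff()),   # active periods, first differences
    ]
    return any(abs(t) > threshold for t in pd.Series(tests).dropna())
```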
The directionality of the data is not consistent within this dataset. Where directionality was ascertainable, we arrange the power data in the direction of the LTDS "from node" to the LTDS "to node". Measurements of current do not indicate directionality and are instead positive regardless of direction. In some circumstances, the polarity can be negative, and depends on the data commissioner's decision on what the operators in the control room might find most helpful in ensuring reliable and secure network operation.
Quality Control Statement
The data is provided "as is".
In the design and delivery process adopted by the DSO, customer feedback and guidance is considered at each phase of the project. One of the earliest steers was that raw data was preferable. This means that we do not perform prior quality control screening on our raw network data. The result of this decision is that network rearrangements and other periods of non-intact running of the network are present throughout the dataset, which has the potential to misconstrue the true utilisation of the network, which is determined regulatorily by considering only intact running arrangements. Therefore, taking the maximum or minimum of these measurements is not a reliable method of correctly ascertaining the true utilisation. This does have the intended added benefit of giving a realistic view of how the network was operated. The critical feedback was that our customers want to understand what the impact on them would have been under real operational conditions. As such, this dataset offers unique insight into that.
Assurance Statement
Creating this dataset involved a substantial amount of manual data entry. At UK Power Networks, we have differing software to run the network operationally (ADMS) and to plan and study the network (PowerFactory). The measurement devices are intended primarily to inform the network operators of the real-time condition of the network, and importantly, the network drawings visible in the LTDS follow a planning approach, which differs from the operational one. To compile this dataset, we made the union between the two modes of operating manually. A team of data scientists, data engineers, and power system engineers manually identified the LTDS circuit from the single line diagram, identified the line name from LTDS Table 2a/b, then identified the same circuit in ADMS to identify the measurement data tags. This was then manually inputted to a spreadsheet. Any influential customers on that circuit were noted using ADMS and the single line diagrams. From there, a Python script is used to perform the triage and compilation of the datasets.
There is potential for human error during the manual data processing. These issues can include missing circuits, incorrectly labelled circuits, incorrectly identified measurement data tags, incorrectly interpreted directionality. Whilst care has been taken to minimise the risk of these issues, they may persist in the provided dataset. Any uncertain behaviour observed by using this data should be reported to allow us to correct as fast as possible.
Additional Information
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Code:
Packet_Features_Generator.py & Features.py
To run this code:
pkt_features.py [-h] -i TXTFILE [-x X] [-y Y] [-z Z] [-ml] [-s S] -j
-h, --help   show this help message and exit
-i TXTFILE   input text file
-x X         Add first X number of total packets as features.
-y Y         Add first Y number of negative packets as features.
-z Z         Add first Z number of positive packets as features.
-ml          Output to text file all websites in the format of websiteNumber1,feature1,feature2,...
-s S         Generate samples using size s.
-j
Purpose:
Turns a text file containing lists of incoming and outgoing network packet sizes into separate website objects with associated features.
Uses Features.py to calculate the features.
startMachineLearning.sh & machineLearning.py
To run this code:
bash startMachineLearning.sh
This code then runs machineLearning.py in a tmux session with the necessary file paths and flags.
Options (to be edited within this file):
--evaluate-only to test 5-fold cross-validation accuracy
--test-scaling-normalization to test 6 different combinations of scalers and normalizers
Note: once the best combination is determined, it should be added to the data_preprocessing function in machineLearning.py for future use
--grid-search to test the best grid search hyperparameters - note: the possible hyperparameters must be added to train_model under 'if not evaluateOnly:' - once best hyperparameters are determined, add them to train_model under 'if evaluateOnly:'
Purpose:
Using the .ml file generated by Packet_Features_Generator.py & Features.py, this program trains a RandomForest Classifier on the provided data and provides results using cross validation. These results include the best scaling and normalization options for each data set as well as the best grid search hyperparameters based on the provided ranges.
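For orientation only, a rough Python equivalent of that training step — 5-fold cross-validation of a random forest on the generated feature file — is sketched below; the released machineLearning.py is authoritative, and the exact .ml file format is assumed here.

```python
# Sketch only; assumes each line of the .ml file is label,feature1,feature2,...
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = np.loadtxt("websites.ml", delimiter=",")   # hypothetical path
X, y = data[:, 1:], data[:, 0]

model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=300, random_state=0))
scores = cross_val_score(model, X, y, cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```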
Data
Encrypted network traffic was collected on an isolated computer visiting different Wikipedia and New York Times articles, different Google search queries (collected in the form of their autocomplete results and their results page), and different actions taken on a Virtual Reality headset.
Data for this experiment was stored and analyzed in the form of a txt file for each experiment which contains:
The first number is a classification number to denote what website, query, or VR action is taking place.
The remaining numbers in each line denote:
The size of a packet,
and the direction it is traveling.
negative numbers denote incoming packets
positive numbers denote outgoing packets
Figure 4 Data
This data uses specific lines from the Virtual Reality.txt file.
The action 'LongText Search' refers to a user searching for "Saint Basils Cathedral" with text in the Wander app.
The action 'ShortText Search' refers to a user searching for "Mexico" with text in the Wander app.
The .xlsx and .csv files are identical.
Each file includes (from right to left):
The original packet data,
each line of data organized from smallest to largest packet size in order to calculate the mean and standard deviation of each packet capture,
and the final Cumulative Distribution Function (CDF) calculation that generated the Figure 4 graph.
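A minimal sketch of that calculation (the input format is assumed; the spreadsheet itself is the reference):

```python
# Sort one packet capture by size, compute mean and standard deviation, and build an
# empirical CDF like the one plotted in Figure 4. Illustrative only.
import numpy as np
import matplotlib.pyplot as plt

sizes = np.abs(np.loadtxt("capture_packet_sizes.txt", delimiter=","))  # hypothetical capture
sizes.sort()
mean, std = sizes.mean(), sizes.std()
cdf = np.arange(1, len(sizes) + 1) / len(sizes)   # empirical cumulative distribution

plt.step(sizes, cdf, where="post", label=f"mean={mean:.0f}, sd={std:.0f}")
plt.xlabel("Packet size (bytes)")
plt.ylabel("CDF")
plt.legend()
plt.savefig("figure4_sketch.png")
```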
Different countries have different health outcomes that are in part due to the way respective health systems perform. Regardless of the type of health system, individuals will have health and non-health expectations in terms of how the institution responds to their needs. In many countries, however, health systems do not perform effectively and this is in part due to lack of information on health system performance, and on the different service providers.
The aim of the WHO World Health Survey is to provide empirical data to the national health information systems so that there is a better monitoring of health of the people, responsiveness of health systems and measurement of health-related parameters.
The overall aim of the survey is to examine the way populations report their health, understand how people value health states, measure the performance of health systems in relation to responsiveness, and gather information on modes and extents of payment for health encounters, through a nationally representative population-based community survey. In addition, it addresses various areas such as health care expenditures, adult mortality, birth history, various risk factors, assessment of main chronic health conditions and the coverage of health interventions, in specific additional modules.
The objectives of the survey programme are to: 1. develop a means of providing valid, reliable and comparable information, at low cost, to supplement the information provided by routine health information systems. 2. build the evidence base necessary for policy-makers to monitor if health systems are achieving the desired goals, and to assess if additional investment in health is achieving the desired outcomes. 3. provide policy-makers with the evidence they need to adjust their policies, strategies and programmes as necessary.
The survey sampling frame must cover 100% of the country's eligible population, meaning that the entire national territory must be included. This does not mean that every province or territory need be represented in the survey sample but, rather, that all must have a chance (known probability) of being included in the survey sample.
There may be exceptional circumstances that preclude 100% national coverage. Certain areas in certain countries may be impossible to include due to reasons such as accessibility or conflict. All such exceptions must be discussed with WHO sampling experts. If any region must be excluded, it must constitute a coherent area, such as a particular province or region. For example if ¾ of region D in country X is not accessible due to war, the entire region D will be excluded from analysis.
Households and individuals
The WHS will include all male and female adults (18 years of age and older) who are not out of the country during the survey period. It should be noted that this includes the population who may be institutionalized for health reasons at the time of the survey: all persons who would have fit the definition of household member at the time of their institutionalisation are included in the eligible population.
If the randomly selected individual is institutionalized short-term (e.g. a 3-day stay at a hospital) the interviewer must return to the household when the individual will have come back to interview him/her. If the randomly selected individual is institutionalized long term (e.g. has been in a nursing home the last 8 years), the interviewer must travel to that institution to interview him/her.
The target population includes any adult, male or female age 18 or over living in private households. Populations in group quarters, on military reservations, or in other non-household living arrangements will not be eligible for the study. People who are in an institution due to a health condition (such as a hospital, hospice, nursing home, home for the aged, etc.) at the time of the visit to the household are interviewed either in the institution or upon their return to their household if this is within a period of two weeks from the first visit to the household.
Sample survey data [ssd]
SAMPLING GUIDELINES FOR WHS
Surveys in the WHS program must employ a probability sampling design. This means that every single individual in the sampling frame has a known and non-zero chance of being selected into the survey sample. While a Single Stage Random Sample is ideal if feasible, it is recognized that most sites will carry out Multi-stage Cluster Sampling.
The WHS sampling frame should cover 100% of the eligible population in the surveyed country. This means that every eligible person in the country has a chance of being included in the survey sample. It also means that particular ethnic groups or geographical areas may not be excluded from the sampling frame.
The sample size of the WHS in each country is 5000 persons (exceptions considered on a by-country basis). An adequate number of persons must be drawn from the sampling frame to account for an estimated amount of non-response (refusal to participate, empty houses etc.). The highest estimate of potential non-response and empty households should be used to ensure that the desired sample size is reached at the end of the survey period. This is very important because if, at the end of data collection, the required sample size of 5000 has not been reached additional persons must be selected randomly into the survey sample from the sampling frame. This is both costly and technically complicated (if this situation is to occur, consult WHO sampling experts for assistance), and best avoided by proper planning before data collection begins.
All steps of sampling, including justification for stratification, cluster sizes, probabilities of selection, weights at each stage of selection, and the computer program used for randomization must be communicated to WHO.
STRATIFICATION
Stratification is the process by which the population is divided into subgroups. Sampling will then be conducted separately in each subgroup. Strata or subgroups are chosen because evidence is available that they are related to the outcome (e.g. health, responsiveness, mortality, coverage etc.). The strata chosen will vary by country and reflect local conditions. Some examples of factors that can be stratified on are geography (e.g. North, Central, South), level of urbanization (e.g. urban, rural), socio-economic zones, provinces (especially if health administration is primarily under the jurisdiction of provincial authorities), or presence of health facility in area. Strata to be used must be identified by each country and the reasons for selection explicitly justified.
Stratification is strongly recommended at the first stage of sampling. Once the strata have been chosen and justified, all stages of selection will be conducted separately in each stratum. We recommend stratifying on 3-5 factors. It is optimum to have half as many strata (note the difference between stratifying variables, which may be such variables as gender, socio-economic status, province/region etc. and strata, which are the combination of variable categories, for example Male, High socio-economic status, Xingtao Province would be a stratum).
Strata should be as homogenous as possible within and as heterogeneous as possible between. This means that strata should be formulated in such a way that individuals belonging to a stratum should be as similar to each other with respect to key variables as possible and as different as possible from individuals belonging to a different stratum. This maximises the efficiency of stratification in reducing sampling variance.
MULTI-STAGE CLUSTER SELECTION
A cluster is a naturally occurring unit or grouping within the population (e.g. enumeration areas, cities, universities, provinces, hospitals etc.); it is a unit for which the administrative level has clear, nonoverlapping boundaries. Cluster sampling is useful because it avoids having to compile exhaustive lists of every single person in the population. Clusters should be as heterogeneous as possible within and as homogenous as possible between (note that this is the opposite criterion as that for strata). Clusters should be as small as possible (i.e. large administrative units such as Provinces or States are not good clusters) but not so small as to be homogenous.
In cluster sampling, a number of clusters are randomly selected from a list of clusters. Then, either all members of the chosen cluster or a random selection from among them are included in the sample. Multistage sampling is an extension of cluster sampling where a hierarchy of clusters are chosen going from larger to smaller.
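For illustration only (a generic pattern, not the WHS design in any particular country), a two-stage selection with approximate probability-proportional-to-size sampling of clusters and a fixed take of households per cluster could be sketched as:

```python
# Generic two-stage cluster sampling sketch with synthetic cluster sizes.
import numpy as np

rng = np.random.default_rng(1)
cluster_sizes = rng.integers(200, 2000, size=500)   # households per cluster (synthetic)

# Stage 1: approximate PPS selection of 50 clusters (without replacement).
probs = cluster_sizes / cluster_sizes.sum()
chosen = rng.choice(len(cluster_sizes), size=50, replace=False, p=probs)

# Stage 2: simple random sample of 20 households within each selected cluster.
sample = {c: rng.choice(cluster_sizes[c], size=20, replace=False) for c in chosen}

# Design weight = inverse of the (approximate) overall selection probability.
weights = {c: 1 / (50 * probs[c] * (20 / cluster_sizes[c])) for c in chosen}
print(list(weights.items())[:3])
```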
In order to carry out multi-stage sampling, one needs to know only the population sizes of the sampling units. For the smallest sampling unit above the elementary unit however, a complete list of all elementary units (households) is needed; in order to be able to randomly select among all households in the TSU, a list of all those households is required. This information may be available from the most recent population census. If the last census was >3 years ago or the information furnished by it was of poor quality or unreliable, the survey staff will have the task of enumerating all households in the smallest randomly selected sampling unit. It is very important to budget for this step if it is necessary and ensure that all households are properly enumerated in order that a representative sample is obtained.
It is always best to have as many clusters in the PSU as possible. The reason for this is that the fewer the number of respondents in each PSU, the lower will be the clustering effect which