Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data from the study published in the article: "Using Machine Learning for Web Page Classification in Search Engine Optimization"
Abstract of the article:
This paper presents a novel approach of using machine learning algorithms based on experts’ knowledge to classify web pages into three predefined classes according to the degree of content adjustment to the search engine optimization (SEO) recommendations. In this study, classifiers were built and trained to classify an unknown sample (web page) into one of the three predefined classes and to identify important factors that affect the degree of page adjustment. The data in the training set are manually labeled by domain experts. The experimental results show that machine learning can be used for predicting the degree of adjustment of web pages to the SEO recommendations—classifier accuracy ranges from 54.59% to 69.67%, which is higher than the baseline accuracy of classification of samples in the majority class (48.83%). Practical significance of the proposed approach is in providing the core for building software agents and expert systems to automatically detect web pages, or parts of web pages, that need improvement to comply with the SEO guidelines and, therefore, potentially gain higher rankings by search engines. Also, the results of this study contribute to the field of detecting optimal values of ranking factors that search engines use to rank web pages. Experiments in this paper suggest that important factors to be taken into consideration when preparing a web page are page title, meta description, H1 tag (heading), and body text—which is aligned with the findings of previous research. Another result of this research is a new data set of manually labeled web pages that can be used in further research.
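As a rough illustration of the workflow described in the abstract (not the authors' code), the sketch below trains a classifier on a few page-level SEO features and compares it against the majority-class baseline; the file name and column names are placeholders.

```python
# Minimal sketch (not the authors' code): train a classifier on page-level SEO
# features and compare it against the majority-class baseline described above.
# The CSV file and column names are hypothetical placeholders.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("seo_pages_labeled.csv")          # manually labeled web pages
X = data[["title_len", "meta_desc_len", "h1_count", "body_word_count"]]
y = data["adjustment_class"]                          # one of three predefined classes

baseline = DummyClassifier(strategy="most_frequent")  # majority-class baseline
clf = RandomForestClassifier(n_estimators=300, random_state=0)

print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
print("classifier accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```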
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The classification models built on class-imbalanced data sets tend to prioritize the accuracy of the majority class, and thus the minority class generally has a higher misclassification rate. Different techniques are available to address class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. However, to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision–recall curve (AUPR), (b) the internal balancing method of AutoML and the class-weight option of machine learning methods, and (c) data balancing using SMOTETomek; and we generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class and total samples) from three data sets that belong to the drug discovery and development field. We employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools. The important findings of our study are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR are affected by class-weighting and SMOTETomek; (ii) for the ML methods RF and SVM, significant percentage improvements of up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for the AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvements of up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the improvement increases as the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, the maximum improvement is achieved at a class ratio of 0.3; (v) for both ML and AutoML with balancing, no individual class-balancing technique outperforms all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance-handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external techniques nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
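For readers who want to try the compared families of techniques, the following sketch illustrates them on a toy imbalanced problem using scikit-learn and imbalanced-learn; it is not the study's pipeline, and GHOST and the AutoML tools are omitted.

```python
# Sketch of the three balancing ideas compared above, on a toy imbalanced
# problem (positive class ratio ~0.1). Not the study's exact pipeline.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# (a) threshold optimization on the precision-recall curve
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
prec, rec, thr = precision_recall_curve(y_te, rf.predict_proba(X_te)[:, 1])
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best_thr = thr[np.argmax(f1[:-1])]          # threshold that maximizes F1

# (b) internal balancing via class weights
rf_w = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)

# (c) data-level balancing with SMOTETomek
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X_tr, y_tr)
rf_s = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

for name, pred in [("threshold", (rf.predict_proba(X_te)[:, 1] >= best_thr).astype(int)),
                   ("class-weight", rf_w.predict(X_te)),
                   ("SMOTETomek", rf_s.predict(X_te))]:
    print(name, "F1:", round(f1_score(y_te, pred), 3))
```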
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A simple dataset for benchmarking CreateML object detection models. The images are sampled from COCO dataset with eyes and nose bounding boxes added. It’s not meant to be serious or useful in a real application. The purpose is to look at how long it takes to train CreateML models with varying dataset and batch sizes.
Training performance is affected by model configuration, dataset size, and batch configuration. Larger models and batches require more memory. I used a CreateML object detection project to compare the performance.
Hardware
- M1 Macbook Air: 8-core GPU, 4/4 CPU, 16 GB memory, 512 GB SSD
- M1 Max Macbook Pro: 24-core GPU, 2/8 CPU, 32 GB memory, 2 TB SSD
Small Dataset (Train: 144, Valid: 16, Test: 8)
Results

| batch | M1 ET (min) | M1 Max ET (min) | peak mem (GB) |
|-------|-------------|-----------------|---------------|
| 16    | 16          | 11              | 1.5           |
| 32    | 29          | 17              | 2.8           |
| 64    | 56          | 30              | 5.4           |
| 128   | 170         | 57              | 12            |
Larger Dataset (Train: 301, Valid: 29, Test: 18)
Results

| batch | M1 ET (min) | M1 Max ET (min) | peak mem (GB) |
|-------|-------------|-----------------|---------------|
| 16    | 21          | 10              | 1.5           |
| 32    | 42          | 17              | 3.5           |
| 64    | 85          | 30              | 8.4           |
| 128   | 281         | 54              | 16.5          |
CreateML Settings
For all tests, training was set to Full Network. I closed CreateML between each run to make sure memory issues didn't cause a slowdown. There is a bug in Monterey as of 11/2021 that leads to a memory leak. I kept an eye on the memory usage, and if it looked like there was a memory leak, I restarted macOS.
Observations
In general, the additional GPU cores and memory of the MBP reduce the training time. Having more memory lets you train with larger datasets. On the M1 Macbook Air, the practical limit is 12 GB before memory pressure impacts performance. On the M1 Max MBP, the practical limit is 26 GB before memory pressure impacts performance. To work around memory pressure, use smaller batch sizes.
On the larger dataset with batch size 128, the M1 Max is 5x faster than the Macbook Air. Keep in mind that a real dataset should have thousands of samples, like COCO or Pascal VOC. Ideally, you want a dataset with 100K images for experimentation and millions for the real training. The new M1 Max Macbook Pro is a cost-effective alternative to building a Windows/Linux workstation with an RTX 3090 24G. For most of 2021, the price of an RTX 3090 with 24G was around $3,000.00, which means an equivalent Windows workstation would cost about the same as the M1 Max Macbook Pro I used to run the benchmarks.
Full Network vs Transfer Learning
As of CreateML 3, training with the full network doesn't fully utilize the GPU. I don't know why it works that way. You have to select transfer learning to fully use the GPU. The table below shows the results of transfer learning with the larger dataset. In general, the training time is faster and the loss is better.
| batch | ET (min) | Train Acc | Val Acc | Test Acc | Top IU Train | Top IU Valid | Top IU Test | Peak mem (GB) | loss  |
|-------|----------|-----------|---------|----------|--------------|--------------|-------------|---------------|-------|
| 16    | 4        | 75        | 19      | 12       | 78           | 23           | 13          | 1.5           | 0.41  |
| 32    | 8        | 75        | 21      | 10       | 78           | 26           | 11          | 2.76          | 0.02  |
| 64    | 13       | 75        | 23      | 8        | 78           | 24           | 9           | 5.3           | 0.017 |
| 128   | 25       | 75        | 22      | 13       | 78           | 25           | 14          | 8.4           | 0.012 |
Github Project
The source code and full results are up on Github https://github.com/woolfel/createmlbench
A tracer breakthrough curve (BTC) for each sampling station is the ultimate goal of every quantitative hydrologic tracing study, and dataset size can critically affect the BTC. Groundwater-tracing data obtained using in situ automatic sampling or detection devices may result in very high-density data sets. Data-dense tracer BTCs obtained using in situ devices and stored in dataloggers can result in visually cluttered, overlapping data points. The relatively large amounts of data detected by the high-frequency settings available on in situ devices and stored in dataloggers ensure that important tracer BTC features, such as data peaks, are not missed. At the same time, such dense datasets can be difficult to interpret. Even more difficult is the application of such dense data sets in solute-transport models, which may not be able to adequately reproduce tracer BTC shapes due to the overwhelming mass of data. One solution to the difficulties associated with analyzing, interpreting, and modeling dense data sets is the selective removal of blocks of data from the total dataset. Although it is possible to skip blocks of tracer BTC data in a periodic sense (data decimation) so as to lessen the size and density of the dataset, skipping or deleting blocks of data may also result in missing the important features that the high-frequency detection settings were intended to capture. Rather than removing, reducing, or reformulating data overlap, signal filtering and smoothing may be utilized, but smoothing errors (e.g., averaging errors, outliers, and potential time shifts) need to be considered. Fitting appropriate probability distributions to tracer BTCs may be used to describe typical tracer BTC shapes, which usually include long tails. Recognizing appropriate probability distributions applicable to tracer BTCs can help in understanding some aspects of the tracer migration. This dataset is associated with the following publications: Field, M. Tracer-Test Results for the Central Chemical Superfund Site, Hagerstown, Md. May 2014 -- December 2015. U.S. Environmental Protection Agency, Washington, DC, USA, 2017. Field, M. On Tracer Breakthrough Curve Dataset Size, Shape, and Statistical Distribution. ADVANCES IN WATER RESOURCES. Elsevier Science Ltd, New York, NY, USA, 141: 1-19, (2020).
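As a toy illustration of the two data-reduction options discussed above (periodic decimation versus smoothing), the sketch below thins a synthetic, noisy BTC; the curve and parameters are made up and unrelated to the associated publications.

```python
# Toy comparison of periodic decimation vs. smoothing of a dense tracer BTC.
# The synthetic curve and all parameters below are illustrative only.
import numpy as np
from scipy.signal import savgol_filter

t = np.linspace(0.01, 200, 20_000)                        # logger time stamps (h)
btc = np.exp(-(np.log(t) - 3.5) ** 2 / 0.18)              # long-tailed pulse
btc += np.random.default_rng(0).normal(0, 0.01, t.size)   # sensor noise

decimated = btc[::50]                                     # data decimation: keep every 50th point
smoothed = savgol_filter(btc, window_length=101, polyorder=3)  # smoothing keeps all samples

# Decimation can clip the true peak; smoothing retains it but introduces averaging error.
print("peak height (raw, decimated, smoothed):",
      round(float(btc.max()), 3), round(float(decimated.max()), 3), round(float(smoothed.max()), 3))
```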
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Influence of the training set on the generalizability of a model. For each dataset, models were trained on 100 different training sets containing 4 images each and were validated on the remaining images. The difference between the maximum and minimum Dice scores obtained for each sample across all runs was calculated and averaged over all samples. The standard deviation of the differences is also shown to provide a reference regarding the degree of variation observed among samples.
https://spdx.org/licenses/CC0-1.0.html
Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.
Methods Overview
This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. It contains presence-absence breeding bird observations in 5 U.S. states (MA, MI, NY, PA, VT), sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation and mean summer temperature (from PRISM), and NLCD 2006 land-use data.
Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.
The methods for compilation are contained in the supplementary information of the manuscript but also here:
Bird data
For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).
Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.
The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
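For illustration only, the long-to-wide pivot described above looks roughly like this in pandas (the original workflow used R; the column names here are hypothetical):

```python
# Equivalent of the long-to-wide pivot described above, sketched in pandas for
# illustration (the original workflow used R). Column names are hypothetical.
import pandas as pd

obs = pd.DataFrame({
    "sampling_unit": ["MA_001", "MA_001", "MA_002"],
    "species":       ["Northern Flicker", "Myrtle Warbler", "Northern Flicker"],
    "present":       [1, 1, 1],
})

# one row per sampling unit, one 0/1 column per species
wide = (obs.pivot_table(index="sampling_unit", columns="species",
                        values="present", fill_value=0)
           .reset_index())

# state tables would then be concatenated, analogous to rbind.data.frame in R
final = pd.concat([wide], ignore_index=True)
print(final)
```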
Environmental data
Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package accompanies the paper "How Do Machine Learning Models Change?" In this study, we conducted a comprehensive analysis of over 200,000 commits and 1,200 releases across more than 50,000 models on the Hugging Face (HF) platform. Our goal was to understand how machine learning (ML) models evolve over time by classifying commit types based on an extended ML change taxonomy and analyzing patterns in commit and release activities using Bayesian networks.
Our research addresses three main aspects:
This replication package contains all the necessary code, datasets, and documentation to reproduce the results presented in the paper.
We collected data from the Hugging Face platform using the Hugging Face Hub API and the `HfApi` class. The data extraction was performed on November 6th, 2023. The collected data includes:
To enrich the commit data with detailed file change information, we integrated the PyDriller framework within the HFCommunity dataset.
Commit Diffs
We computed the differences between commits for key files, specifically JSON configuration files (e.g., `config.json`). For each commit that modifies these files, we compared the changes with the previous commit affecting the same file to identify added, deleted, and updated keys.
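A minimal sketch of this kind of key-level diff (illustrative; not the exact code of the replication package):

```python
# Sketch of the key-level diff described above for JSON configuration files
# (e.g., config.json) between two successive commits. Illustrative only.
import json

def diff_json_keys(old: dict, new: dict):
    added   = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    updated = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "deleted": deleted, "updated": updated}

previous = json.loads('{"hidden_size": 768, "num_layers": 12, "dropout": 0.1}')
current  = json.loads('{"hidden_size": 1024, "num_layers": 12, "vocab_size": 30522}')
print(diff_json_keys(previous, current))
# {'added': ['vocab_size'], 'deleted': ['dropout'], 'updated': ['hidden_size']}
```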
Commit Classification
We classified each commit according to Bhatia et al.'s ML change taxonomy using the Gemini 1.5 Flash Large Language Model (LLM). This classification, using LLMs to apply Bhatia et al.'s taxonomy on a large-scale ML repository, is one of the main contributions of our paper. We ensured the correctness of the classification by achieving a Cohen's kappa coefficient ≥ 0.9 through iterative validation. In addition, we performed classification based on Swanson's categories using a simpler neural network approach, following methods from prior work. This classification has less impact compared to the detailed classification using Bhatia et al.'s taxonomy.
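The agreement check mentioned above can be sketched as follows; the commit categories shown are placeholders, not the exact labels of Bhatia et al.'s taxonomy.

```python
# Minimal sketch of the validation step mentioned above: Cohen's kappa between
# LLM-assigned commit categories and a manually labeled sample. The category
# names below are placeholders.
from sklearn.metrics import cohen_kappa_score

manual = ["data", "model", "training", "model", "infra", "model"]
llm    = ["data", "model", "training", "model", "model", "model"]

kappa = cohen_kappa_score(manual, llm)
print(f"Cohen's kappa = {kappa:.2f}")   # iterate on the prompt until kappa >= 0.9
```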
Model Metadata
We extracted detailed metadata from the model files of selected releases, focusing on attributes such as the number of parameters, tensor shapes, etc. We also calculated the differences between the metadata of successive releases.
The replication package is organized as follows:
- `code/`: Contains the Jupyter notebooks with the data extraction, preprocessing, analysis, and model training scripts.
  - `HFTotalExtraction.ipynb`: Script for collecting data on the entire Hugging Face platform.
  - `HFReleasesExtraction.ipynb`: Script for collecting data on models that contain releases.
  - `HFTotalPreprocessing.ipynb`: Preprocesses the dataset obtained from `HFTotalExtraction.ipynb`.
  - `HFCommitsPreprocessing.ipynb`: Processes commit data, including:
  - `HFReleasesPreprocessing.ipynb`: Processes release data, including classification and preparation for analysis.
  - `RQ1_Analysis.ipynb`: Analysis for RQ1.
  - `RQ2_Analysis.ipynb`: Analysis for RQ2.
  - `RQ3_Analysis.ipynb`: Analysis for RQ3.
- `datasets/`: Contains the raw, processed, and manually curated datasets used for the analysis.
  - `HFCommits_50K_RANDOM.csv`: Contains the commits of 50,000 randomly sampled models from HF with the classification based on Bhatia et al.'s taxonomy.
  - `HFCommits_MultipleCommits.csv`: Contains the commits of 10,000 models with at least 10 commits, used for analyzing commit sequences.
  - `HFReleases.csv`: Contains over 1,200 releases from 127 models, classified using Bhatia et al.'s taxonomy.
  - `model_metadata_with_diff.csv`: Contains the metadata of releases from 27 models, including differences between successive releases.
  - `HF_Total_Raw.csv`: Contains a snapshot of the entire Hugging Face platform with over 380,000 models, as obtained from `HFTotalExtraction.ipynb`.
  - `HF_Total_Preprocessed.csv`: Contains the preprocessed version of the entire HF dataset, as obtained from `HFTotalPreprocessing.ipynb`. This dataset is needed for the commits preprocessing.
- `metadata/`: Contains the `tags_metadata.yaml` file used during preprocessing.
- `models/`: Contains the model trained to classify commit messages into corrective, perfective, and adaptive types based on Swanson's traditional software maintenance categories.
- `requirements.txt`: Lists the required Python packages to set up the environment and run the code.
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use venv\Scripts\activate
```

```bash
pip install -r requirements.txt
```
- LLM Usage: The classification of commits using the Gemini 1.5 Flash LLM requires access to the model. Ensure you have the necessary permissions and API keys to use the model.
- Computational Resources: Processing large datasets and running Bayesian network analyses may require significant computational resources. It is recommended to use a machine with ample memory and processing power.
- Reproducing Results: The auxiliary datasets included can be used to reproduce specific parts of the code without re-running the entire data collection and preprocessing pipeline.
Contact: If you have any questions or encounter issues, please contact the authors at joel.castano@upc.edu.
This README provides detailed instructions and information to reproduce and understand the analyses performed in the paper. If you find this package useful, please cite our work.
This geographic raster dataset presents statistical model predictions for the probability of exceeding 40 gallons per minute well yield from a 400 foot deep well, and is based on the available data. The model was developed as part of the New Hampshire Bedrock Aquifer Assessment. The New Hampshire Bedrock Aquifer Assessment was designed to provide information (including this raster) that can be used by communities, industry, professional consultants, and other interests to evaluate the ground-water development potential of the fractured-bedrock aquifer in the State. The assessment was done at statewide, regional, and well field scales to identify relations that potentially could increase the success in locating high-yield water supplies in the fractured-bedrock aquifer. Statewide, data were collected for well construction and yield information, bedrock lithology, surficial geology, lineaments, topography, and various derivatives of these basic data sets. Regionally, geologic, fracture, and lineament data were collected for the Pinardville and Windham quadrangles in New Hampshire. The regional scale of the study examined the degree to which predictive well-yield relations, developed as part of the statewide reconnaissance investigation, could be improved by use of quadrangle-scale geologic mapping. Beginning in 1984, water-well contractors in the State were required to report detailed information on newly constructed wells to the New Hampshire Department of Environmental Services (NHDES). The reports contain basic data on well construction, including six characteristics used in this study—well yield, well depth, well use, method of construction, date drilled, and depth to bedrock (or length of casing). The NHDES has determined accurate georeferenced locations for more than 20,000 wells reported since 1984. The availability of this large data set provided an opportunity for a statistical analysis of bedrock-well yields. Well yields in the database ranged from zero to greater than 500 gallons per minute (gal/min). Multivariate regression was used as the primary statistical method of analysis because it is the most efficient tool for predicting a single variable with many potentially independent variables. The dependent variable that was explored in this study was the natural logarithm (ln) of the reported well yield. One complication with using well yield as a dependent variable is that yield also is a function of demand. An innovative statistical technique that involves the use of instrumental variables was implemented to compensate for the effect of demand on well yield. Results of the multivariate-regression model show that a variety of factors are either positively or negatively related to well yields (see USGS Professional Paper 1660). Model results are presented statewide as the probability of exceeding 40 gallons per minute well yield from a 400 foot deep well. Probability values represented in this raster (ranging from near 0 to near 100) have all been multiplied by 100 in order to store the data more efficiently as a raster.
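As a conceptual sketch of the instrumental-variable idea described above (toy data, not the USGS model), two-stage least squares replaces the demand-related regressor with its fitted values from instruments before regressing ln(yield):

```python
# Conceptual two-stage least squares (2SLS) sketch for the instrumental-variable
# idea described above. The data, instruments, and coefficients are toy values,
# not the model reported in USGS Professional Paper 1660.
import numpy as np

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))                    # instruments (e.g., well-use indicators)
demand = z @ np.array([1.0, -0.5]) + rng.normal(size=n)
geology = rng.normal(size=n)
ln_yield = 0.8 * geology + 0.4 * demand + rng.normal(size=n)

# Stage 1: regress the endogenous variable (demand) on the instruments
Z = np.column_stack([np.ones(n), z])
demand_hat = Z @ np.linalg.lstsq(Z, demand, rcond=None)[0]

# Stage 2: regress ln(yield) on the exogenous variable and the fitted demand
X = np.column_stack([np.ones(n), geology, demand_hat])
beta = np.linalg.lstsq(X, ln_yield, rcond=None)[0]
print("estimated coefficients (intercept, geology, demand):", beta.round(2))
```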
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Login Data Set for Risk-Based Authentication
Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
This data set aims to foster research and development of Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
WARNING: The feature values are plausible, but still entirely artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.
Overview
The data set contains the following features related to each login attempt on the SSO:
| Feature | Data Type | Description | Range or Example |
|---------|-----------|-------------|------------------|
| IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255 |
| Country | String | Country derived from the IP address | US |
| Region | String | Region derived from the IP address | New York |
| City | String | City derived from the IP address | Rochester |
| ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000 |
| User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ... |
| OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10 |
| Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538 |
| Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown)¹ |
| User ID | Integer | Identification number related to the affected user account | [Random pseudonym] |
| Login Timestamp | Integer | Timestamp related to the login attempt | [64-bit timestamp] |
| Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000 |
| Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false) |
| Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false) |
| Is Account Takeover | Boolean | Login attempt was identified as account takeover by the incident response team of the online service | (true, false) |
Data Creation
As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities for the categorical data, both globally and per user, are identical to those in the original data set. All other data was randomly generated while maintaining the logical relations and temporal order between the features.
The timestamps, however, are not identical and contain randomness. The feature values related to the IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries were probably in other positions in the original data set.
The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.
The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS, the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.
The RTT was randomly drawn from the login success status and synthesized geolocation data. We did this to ensure that the RTTs are realistic ones.
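For illustration, the sketch below computes the global and per-user categorical feature probabilities that a Freeman et al. (2016)-style risk score combines; the CSV file name and the exact column headers are assumptions based on the feature table above.

```python
# Illustration of the per-user vs. global feature probabilities that a
# Freeman et al. (2016)-style RBA score combines. The CSV file name and the
# exact column headers are assumptions based on the feature table above.
import pandas as pd

logins = pd.read_csv("rba-dataset.csv")

feature = "Country"
global_p = logins[feature].value_counts(normalize=True)                    # p(value)
user_p = logins.groupby("User ID")[feature].value_counts(normalize=True)   # p(value | user)

user, value = logins.loc[0, "User ID"], logins.loc[0, feature]
try:
    p_user = user_p[(user, value)]
except KeyError:
    p_user = 1e-6           # unseen combinations need smoothing in a real system
print(f"user {user}, {feature}={value}: global p={global_p[value]:.4f}, per-user p={p_user:.4f}")
```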
Regarding the Data Values
Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effects on the risk scores generated by the Freeman et al. (2016) model.
You can recognize them by the following values:
- ASNs with values >= 500,000
- IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
Study Reproduction
Based on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.
The calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the Median RTTs per country. This is due to the fact that the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
See RESULTS.md for more details.
Ethics
By using the SSO service, the users agreed to the collection and evaluation of their data for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
Publication
You can find more details on our conducted study in the following journal article:
Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022) Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono. ACM Transactions on Privacy and Security
Bibtex
@article{Wiefling_Pump_2022,
  author    = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
  title     = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
  journal   = {{ACM} {Transactions} on {Privacy} and {Security}},
  doi       = {10.1145/3546069},
  publisher = {ACM},
  year      = {2022}
}
License
This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
¹ A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
By US Open Data Portal, data.gov [source]
This dataset offers a closer look at the mental health care received by U.S. households in the last four weeks during the Covid-19 pandemic. The sheer scale of this crisis is inspiring people of all ages, backgrounds, and geographies to come together to tackle the problem. The Household Pulse Survey from the U.S. Census Bureau was published in collaboration with other federal agencies to provide accurate and timely estimates of how Covid-19 is impacting employment status, consumer spending, food security, housing stability, education interruption, and physical and mental wellness among American households. To deliver meaningful results about wellbeing at various levels of society during this period, broken down by demographic characteristics such as age, gender, race/ethnicity, and educational attainment, each consulted household was randomly selected according to weighted criteria to maintain the accuracy of the findings. This dataset will help you explore what it's like on the ground right now for everyone affected by Covid-19. Will it inform your decisions or point you towards new opportunities?
This dataset contains information about the mental health care that U.S. households have received in the last 4 weeks, during the Covid-19 pandemic. The data is valuable for tracking and measuring mental health needs across the country and for drawing comparisons between regions based on the support available.
To use this dataset, it is important to understand each of its columns or variables in order to draw meaningful insights from the data. The 'Indicator' column indicates which type of indicator (percentage or absolute number) is being measured by the survey, while 'Group' and 'Subgroup' provide more specific details about who was surveyed for each indicator included in the dataset.
The columns 'Phase' and 'Time Period' indicate when each indicator was measured, whether during a certain phase of the survey or over a particular timespan, while 'Value', 'LowCI', and 'HighCI' give the estimate for each measurement together with its lower and upper confidence bounds (e.g., how many people reported they rarely felt lonely). The 'Suppression Flag' column identifies cases where a value has been suppressed because it falls below a reliability benchmark, which allows accurate estimates to be produced without manually sorting through suppressed values. Finally, 'Time Period Start Date' and 'Time Period End Date' indicate the exact dates covered by each measurement period, which is useful for time-series analyses over longer spans.
Overall, when using this dataset, keep in mind exactly which indicator type you are looking at (percentage points or absolute numbers), as well as its associated group/subgroup characteristics, so that you can accurately interpret trends in the results. A minimal loading sketch follows the example use cases listed below.
- Analyzing the effects of the Covid-19 pandemic on mental health care among different subgroups such as racial and ethnic minorities, gender and age categories.
- Identifying geographical disparities in mental health services by comparing state level data for the same time period.
- Comparing changes in mental health care indicators over time to understand how the pandemic has impacted people's access to care within a quarter or over longer periods
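A minimal loading sketch (file name and column spellings assumed) that drops suppressed estimates and compares an indicator across subgroups:

```python
# Sketch of a typical use of the columns described above: drop suppressed
# estimates and compare an indicator across subgroups. The file name and
# exact column spellings are assumptions.
import pandas as pd

df = pd.read_csv("mental_health_care_last_4_weeks.csv")

df = df[df["Suppression Flag"].isna()]                 # keep non-suppressed estimates
by_age = (df[(df["Group"] == "By Age") &
             (df["Indicator"].str.contains("Counseling"))]
          .groupby("Subgroup")[["Value", "LowCI", "HighCI"]]
          .mean()
          .sort_values("Value", ascending=False))
print(by_age.head())
```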
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors.
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: you must distribute your contributions under the same license as the original.
...
Few paired lake-watershed studies examine long term effects of climate on the ecosystem function of lakes in a hydrological context. We use thirty-two years of hydrological and biogeochemical data from a high-elevation site in the Sierra Nevada of California to characterize variation in snowmelt in relation to climate variability, and explore the impact on factors affecting phytoplankton biomass. The magnitude of accumulated winter snow, quantified through basin-wide estimates of snow water equivalent (SWE), was the most important climate factor controlling variation in the timing and rate of spring snowmelt. Variations in SWE and snowmelt led to significant differences in lake flushing rate, water temperature, and nitrate concentrations across years. On average in dry years, snowmelt started 25 days earlier and proceeded 7 mm/d slower, and the lake began the ice-free season with nitrate concentrations ~2 uM higher and water temperatures 9 C warmer than in wet years. Flushing rates in wet years were 2.5 times larger than dry years. Consequently, particulate organic matter concentrations, a proxy for phytoplankton biomass, were 5 – 6 uM higher in dry years. There was a temporal trend of increase in particulate organic matter across dry years that corresponded to lake warming independent of variation in SWE. These results suggest that phytoplankton biomass is increasing as a result of both interannual variability in precipitation and long term warming trends. Our study underscores the need to account for local-scale catchment variability that may affect the accumulation of winter snowpack when predicting climate responses in lakes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Research on metaheuristics has focused almost exclusively on (novel) algorithmic development and on competitive testing, both of which have been frequently argued to yield very little generalizable knowledge. One way to obtain problem- and implementation-independent insights about metaheuristic algorithms is meta-analysis, a systematic statistical examination that combines the results of several independent studies. Meta-analysis is widely used in several scientific domains, most notably the medical sciences (e.g., to establish the efficacy of a certain treatment), but has not yet been applied in operations research.
In order to demonstrate its potential in learning about algorithms, we carried out a meta-analysis of the adaptive layer in adaptive large neighborhood search (ALNS). Although ALNS has been widely used to solve a broad range of problems, it has not yet been established whether or not adaptiveness actually contributes to the performance of an ALNS algorithm. A total of 134 studies were identified through Google Scholar or personal e-mail correspondence with researchers in the domain, 63 of which fit a set of predefined eligibility criteria. After sending requests for data to the authors of the eligible studies, results for 25 different implementations of ALNS were collected and analyzed using a random-effects model.
The dataset contains a detailed comparison of ALNS with the non-adaptive variant per study and per instance, together with the meta-analysis summary results. The data makes it possible to replicate the analysis, to evaluate the algorithms using other metrics, to revisit the importance of the ALNS adaptive layer if results from more studies become available, or to simply consult the ready-to-use formulas in data_analyzed.xls to carry out a meta-analysis of any research question.
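For readers unfamiliar with random-effects meta-analysis, the following minimal DerSimonian-Laird sketch shows how a pooled effect and confidence interval of the kind reported below are computed; the numbers are toy values, not the collected ALNS results.

```python
# Minimal DerSimonian-Laird random-effects pooling, the kind of summary a
# meta-analysis reports. Toy per-study effects, not the actual ALNS data.
import numpy as np

y = np.array([0.05, 0.20, 0.10, 0.31, 0.02])        # per-study effects (% improvement)
v = np.array([0.004, 0.010, 0.006, 0.020, 0.003])   # within-study variances

w = 1 / v
mu_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - mu_fixed) ** 2)                 # heterogeneity statistic
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / C)             # between-study variance

w_star = 1 / (v + tau2)
mu = np.sum(w_star * y) / np.sum(w_star)
se = np.sqrt(1 / np.sum(w_star))
print(f"pooled effect = {mu:.3f}, 95% CI = [{mu - 1.96*se:.3f}, {mu + 1.96*se:.3f}]")
```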
On average, the addition of an adaptive layer in an ALNS algorithm improves the objective function value by 0.14% (95% confidence interval 0.06 to 0.21%). Although the adaptive layer can (and in a limited number of studies does) have an added value, it also adds considerable complexity and can therefore only be recommended in some very specific situations.
The findings of this meta-analysis underline the importance of evaluating the contribution of metaheuristic components, and of knowledge over competitive testing. Our goal is to promote meta-analysis as a methodology for obtaining knowledge and understanding of metaheuristic frameworks, and we hope to see an increase in its popularity in the domain of operations research.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains citation-based impact indicators (a.k.a. "measures") for ~209M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they better capture):

Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general):
- Citation Count: The total number of citations of the product; the most well-known influence indicator.
- PageRank score: An influence indicator based on PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different; the product receiving citations from more influential products will get a larger score).

Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently):
- RAM score: A popularity indicator based on the RAM [2] method. It is essentially a Citation Count where recent citations are considered more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative of their impact).
- AttRank score: A popularity indicator based on the AttRank [3] method. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products that received a lot of attention recently.

Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication):
- Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products and depends on the publication date of the product, i.e., only citations within 3 years after each product's publication are counted.

More details about the aforementioned impact indicators, the way they are calculated, and their interpretation can be found here and in the respective references (e.g., in [5]). From version 5.1 onward, the impact indicators are calculated at two levels:
- The PID level (assuming that each PID corresponds to a distinct research product).
- The OpenAIRE-id level (leveraging PID synonyms based on OpenAIRE's deduplication algorithm [4]; each distinct article has its own OpenAIRE id).

Previous versions of the dataset only provided the scores at the PID level. From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included). Also, from version 7 onward, for each product in our files we also offer an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the rest of the products in the database. The impact classes are: C1 (in top 0.01%), C2 (in top 0.1%), C3 (in top 1%), C4 (in top 10%), and C5 (in bottom 90%).
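As a toy illustration of why a PageRank-style influence score can differ from a raw Citation Count (hypothetical five-node citation graph, not our actual computation):

```python
# Toy citation graph: B and C each receive one citation, but B is cited by the
# more-cited product A, so its PageRank-style influence is higher. Edge (X, Y)
# means "X cites Y". Hypothetical example only.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("D", "A"), ("E", "A"), ("A", "B"), ("F", "C")])

citation_count = dict(G.in_degree())          # the Citation Count indicator
pagerank = nx.pagerank(G, alpha=0.85)         # the PageRank influence indicator

for node in sorted(G):
    print(node, "citations:", citation_count[node], "PageRank:", round(pagerank[node], 3))
```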
Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network having one node for each product with a distinct PID that we could find in our input data sources. However, from version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm. This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product). As a result, each node in the citation network we build is a deduplicated product having a distinct OpenAIRE id. We still report the scores at PID level (i.e., we assign a score to each of the versions/instances of the product), however these PID-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all version of the same deduplicated product will receive the same scores). We have removed a small number of instances (having a PID) that were assigned (by error) to multiple deduplicated records in the OpenAIRE Graph. For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score provided) where each line follows the format "identifier
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Relevant training splits of the VessMAP dataset. Each row shows the average Dice score obtained on the remaining 96 samples when training a neural network using the four images indicated in the training set column. The standard deviation obtained across five repetitions of the training runs is also shown.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Data from: Data-driven analysis of oscillations in Hall thruster simulations
- Authors: Davide Maddaloni, Adrián Domínguez Vázquez, Filippo Terragni, Mario Merino
- Contact email: dmaddalo@ing.uc3m.es
- Date: 2022-03-24
- Keywords: higher order dynamic mode decomposition, hall effect thruster, breathing mode, ion transit time, data-driven analysis
- Version: 1.0.4
- Digital Object Identifier (DOI): 10.5281/zenodo.6359505
- License: This dataset is made available under the Open Data Commons Attribution License
Abstract
This dataset contains the outputs of the HODMD algorithm and the original simulations used in the journal publication:
Davide Maddaloni, Adrián Domínguez Vázquez, Filippo Terragni, Mario Merino, "Data-driven analysis of oscillations in Hall thruster simulations", 2022 Plasma Sources Sci. Technol. 31:045026. Doi: 10.1088/1361-6595/ac6444.
Additionally, the raw simulation data is also employed in the following journal publication:
Borja Bayón-Buján and Mario Merino, "Data-driven sparse modeling of oscillations in plasma space propulsion", 2024 Mach. Learn.: Sci. Technol. 5:035057. Doi: 10.1088/2632-2153/ad6d29
Dataset description
The simulations from which the data stems were produced using the full 2D hybrid PIC/fluid code HYPHEN, while the HODMD results were produced using an adaptation of the original HODMD algorithm with an improved amplitude calculation routine.
Please refer to the related article for further details regarding any of the parameters and/or configurations.
Data files
The data files are in standard Matlab .mat format. A recent version of Matlab is recommended.
The HODMD outputs are collected within 18 different files, subdivided into three groups, each one referring to a different case. In the file names, "case1" refers to the nominal case, "case2" refers to the low-voltage case, and "case3" refers to the high-mass-flow-rate case. The variables are referred to as follows:
In particular, the axial electric field, ionization production term, and singly charged ion axial velocity are available only for the first case. These files have a cell structure: the first row contains the frequencies (in Hz), the second row contains the normalized modes (alongside their complex conjugates), the third row collects the growth rates (in 1/s), while the amplitudes (dimensionalized) are collected within the last row. Additionally, the time vector is simply given as "t", common to all cases and all variables.
The raw simulation data are collected in 15 additional variables, following the same nomenclature as above, with the suffix "_raw" added to differentiate them from the HODMD outputs.
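The files can also be read from Python; a minimal sketch using scipy is shown below, where the file and variable names are placeholders and the row layout follows the cell structure described above.

```python
# Minimal sketch of reading one HODMD output file from Python instead of Matlab.
# The file name ("case1_density.mat") and variable name ("density") are
# placeholders; the 4-row cell structure follows the description above.
from scipy.io import loadmat

mat = loadmat("case1_density.mat", squeeze_me=True)
var = mat["density"]                  # Matlab cell array -> numpy object array

frequencies  = var[0]                 # Hz
modes        = var[1]                 # normalized modes (+ complex conjugates)
growth_rates = var[2]                 # 1/s
amplitudes   = var[3]                 # dimensional amplitudes
t = mat.get("t")                      # common time vector, if stored in the same file
print(frequencies, growth_rates)
```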
Citation
Works using this dataset or any part of it in any form shall cite it as follows.
The preferred means of citation is to reference the publication associated with this dataset, as soon as it is available.
Optionally, the dataset may be cited directly by referencing the DOI: 10.5281/zenodo.6359505.
Acknowledgments
This work has been supported by the Madrid Government (Comunidad de Madrid) under the Multiannual Agreement with UC3M in the line of ‘Fostering Young Doctors Research’ (MARETERRA-CM-UC3M), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation). F. Terragni was also supported by the Fondo Europeo de Desarrollo Regional, Ministerio de Ciencia, Innovación y Universidades - Agencia Estatal de Investigación, under grants MTM2017-84446-C2-2-R and PID2020-112796RB-C22.
The re-expansion of large mammals in European human-dominated landscapes poses new challenges for wildlife conservation and management practices. Supplementary feeding of ungulates is a widespread practice with several motivations, including hunting, yet the known effects on target and non-target species have yet to be disentangled. According to optimal foraging theory, such concentrated food sources may attract herbivores and, in turn, carnivores. As such, feeding sites may skew the spatial distribution of wildlife and alter intra- and interspecific interactions, including predator-prey dynamics. Here, we investigated the use of ungulate-specific feeding sites by target and non-target species in a human-dominated and touristic area of the Alps, using systematic camera-trapping. We assessed potential temporal segregation between roe deer and red deer at feeding sites, and whether these concentrated artificial food sources influenced the occurrence and site-use intensity of ungulates and wolves at the broader scale. We found that feeding-site frequentation by roe deer was influenced by the presence of red deer, with higher crepuscular and diurnal activity at feeding stations strongly used by red deer, indicating potential temporal niche partitioning between the two ungulates. We also found that ungulates occurred with a higher probability at shorter distances from feeding sites, and used sites with high human outdoor activity less intensively than undisturbed ones. Wolves' site-use intensity was higher closer to feeding sites, indicating a potential effect of supplemental feeding sites on the space use of both prey and predators. Our results reveal side effects of artificial feeding sites, thus contributing to a more informed and evidence-based management, with high relevance especially in light of the considerable recovery of large mammals across anthropized regions of Europe and the popularity of artificial feeding of ungulates for hunting or recreational purposes. We thus advise limiting this practice in areas where large herbivores, predators, and humans closely coexist.
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level. The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives: 1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index 2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns 3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators 4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it 5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector 6. Provide the necessary income data to serve in calculating poverty indices and identifying the poor characteristics as well as drawing poverty maps 7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
Sample survey data [ssd]
The Household Expenditure and Income Survey sample for 2010 was designed to serve the basic objectives of the survey by providing a relatively large sample in each sub-district to enable drawing a poverty map of Jordan. The General Census of Population and Housing in 2004 provided a detailed framework of housing and households for the different administrative levels in the country. Jordan is administratively divided into 12 governorates; each governorate is composed of a number of districts, and each district (Liwa) includes one or more sub-districts (Qada). In each sub-district, there are a number of communities (cities and villages), and each community was divided into a number of blocks, where the number of houses in each block ranged between 60 and 100. Nomads and persons living in collective dwellings such as hotels, hospitals, and prisons were excluded from the survey framework.
A two-stage stratified cluster sampling technique was used. In the first stage, clusters were selected with probability proportional to size, where the number of households in each cluster served as the measure of size. In the second stage, a sample of 8 households was selected from each cluster using a systematic sampling technique, together with another 4 households selected as a backup for the basic sample. The backup households were used during the first visit to the block whenever a visit to an originally selected household was not possible for any reason. For the purposes of this survey, each sub-district was treated as a separate stratum to make it possible to produce results at the sub-district level; in this respect, the strata followed the frame provided by the 2004 General Census of Population and Housing. To determine the sample size, the coefficient of variation and the design effect of the expenditure variable from the 2008 Household Expenditure and Income Survey were calculated for each sub-district. These results were used to set the sample size in each sub-district so that the coefficient of variation of the expenditure variable would be below 10%, with a minimum number of clusters per sub-district (6 clusters), to ensure adequate representation of clusters across the different administrative areas and enable the drawing of an indicative poverty map.
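A minimal sketch of the two-stage selection described above, under stated assumptions: the block names and household counts are invented, the first stage is implemented here as systematic probability-proportional-to-size (PPS) selection, and the second stage draws 8 main plus 4 backup household serial numbers systematically. This is an illustration only, not the survey's actual selection program.

```python
import random

def pps_select_clusters(clusters, n_clusters, seed=1):
    """First stage (sketch): systematic PPS selection of clusters, using the
    number of households in each cluster as the measure of size."""
    rng = random.Random(seed)
    total = sum(c["households"] for c in clusters)
    step = total / n_clusters
    point = rng.uniform(0, step)
    selected, cum = [], 0
    for c in clusters:
        cum += c["households"]
        while point <= cum and len(selected) < n_clusters:
            selected.append(c)
            point += step
    return selected

def systematic_households(n_in_cluster, n_main=8, n_backup=4, seed=1):
    """Second stage (sketch): systematic sample of household serial numbers;
    the first n_main form the basic sample, the remainder are backups."""
    rng = random.Random(seed)
    n_total = n_main + n_backup
    step = n_in_cluster / n_total
    start = rng.uniform(0, step)
    serials = [int(start + i * step) + 1 for i in range(n_total)]
    return serials[:n_main], serials[n_main:]

# Invented stratum (one sub-district) with block-level clusters of 60-100 houses
stratum = [{"block": f"B{i:02d}", "households": size}
           for i, size in enumerate([60, 75, 80, 95, 100, 70, 65, 90], start=1)]

for cluster in pps_select_clusters(stratum, n_clusters=6):
    main, backup = systematic_households(cluster["households"])
    print(cluster["block"], "main:", main, "backup:", backup)
```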
It should be noted that, in addition to the standard non-response rate assumed, higher rates were expected in areas of major cities where poor households are concentrated. These were taken into consideration during the sample design phase, and a larger number of households was selected from those areas to ensure good coverage of all regions where poverty is widespread.
Face-to-face [f2f]
Raw Data:
- Organizing forms/questionnaires: a compatible archiving system was used to classify the forms according to the different rounds throughout the year. A registry was prepared to track each form through the stages of checking, coding and entry until it was returned to the archive.
- Data office checking: this phase was carried out concurrently with field data collection; questionnaires completed in the field were sent immediately for office checking.
- Data coding: a team was trained for the coding phase, which in this survey was limited to education specialization, profession and economic activity, using international classifications; for the remaining questions, codes were predefined during the design phase.
- Data entry/validation: a team of system analysts, programmers and data entry personnel worked on the data at this stage. The system analysts and programmers started by mapping the survey frame and questionnaire fields in order to build computerized data entry forms, and a set of validation rules was added to the entry forms to ensure the accuracy of the data entered (a hypothetical sketch of such rules is given after this list). A team was then trained to carry out data entry. Forms prepared for entry were supplied by the archive department to ensure that forms were correctly extracted and returned to the archive. A validation process was then run to ensure that the entered data were free of errors.
- Results tabulation and dissemination: after all data processing operations were completed, ORACLE was used to tabulate the final survey results. These results were cross-checked against equivalent outputs from SPSS to confirm that the tabulations were correct, and each table was checked for consistency of the figures presented, together with the required editing of table titles and report formatting.
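The survey's actual entry-form validation rules are not published with this description; the snippet below is only a hypothetical illustration of the kind of range and consistency checks such a form might apply, with invented field names (age, relationship_to_head, annual_income, annual_expenditure).

```python
def validate_record(rec):
    """Illustrative range/consistency checks of the kind applied at data entry.
    All field names are hypothetical, not the survey's actual variable names."""
    errors = []
    if not 0 <= rec.get("age", -1) <= 110:
        errors.append("age out of range")
    if rec.get("relationship_to_head") == "head" and rec.get("age", 0) < 12:
        errors.append("household head implausibly young")
    if rec.get("annual_income", 0) < 0 or rec.get("annual_expenditure", 0) < 0:
        errors.append("negative income/expenditure")
    return errors

print(validate_record({"age": 8, "relationship_to_head": "head",
                       "annual_income": 3500, "annual_expenditure": 4200}))
# -> ['household head implausibly young']
```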
Harmonized Data:
- The Statistical Package for the Social Sciences (SPSS) was used to clean and harmonize the datasets.
- The harmonization process started with cleaning all raw data files received from the Statistical Office.
- The cleaned data files were then merged to produce one individual-level data file containing all variables subject to harmonization.
- A country-specific program was written for each dataset to generate, compute, recode, rename, format and label the harmonized variables.
- A post-harmonization cleaning process was run on the data.
- Harmonized data were saved at both the household and the individual level, in SPSS format and converted to Stata format (an illustrative sketch of this workflow follows below).
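The harmonization itself was carried out with SPSS syntax, which is not reproduced here. Purely as an illustration, the sketch below mirrors the listed steps (clean, merge to the individual level, recode/rename, post-harmonization check, save household- and individual-level files in SPSS and Stata formats) in Python with pandas and pyreadstat; the tiny data frames and all variable names (hhid, indid, exp_total, locality_code) are placeholders, not the actual ERF file layout.

```python
import pandas as pd
import pyreadstat  # third-party: pip install pyreadstat

# Stand-ins for the cleaned raw files received from the Statistical Office.
# In practice these would be read with pyreadstat.read_sav("<file>.sav");
# here tiny invented frames are built inline so the sketch runs on its own.
hh = pd.DataFrame({"hhid": [1, 2], "exp_total": [5200.0, 3100.0], "locality_code": [1, 2]})
ind = pd.DataFrame({"hhid": [1, 1, 2], "indid": [1, 2, 1], "age": [34, 6, 58]})

# Merge to a single individual-level file carrying the household variables
merged = ind.merge(hh, on="hhid", how="left", validate="many_to_one")

# Generate / compute / recode / rename harmonized variables (names are placeholders)
merged = merged.rename(columns={"exp_total": "hh_expenditure"})
merged["urban"] = merged["locality_code"].map({1: 1, 2: 0})

# Post-harmonization check (example: expenditure must be non-negative)
assert (merged["hh_expenditure"].dropna() >= 0).all()

# Save household- and individual-level files in SPSS and Stata formats
hh_level = merged.drop_duplicates(subset="hhid")
pyreadstat.write_sav(merged, "individual_harmonized.sav")
pyreadstat.write_sav(hh_level, "household_harmonized.sav")
merged.to_stata("individual_harmonized.dta", write_index=False)
hh_level.to_stata("household_harmonized.dta", write_index=False)
```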
SUMMARY
This analysis, designed and executed by Ribble Rivers Trust, identifies areas across England with the greatest levels of cancer (in persons of all ages). Please read the information below to gain a full understanding of what the data show and how they should be interpreted.

ANALYSIS METHODOLOGY
The analysis was carried out using Quality and Outcomes Framework (QOF) data, derived from NHS Digital, relating to cancer (in persons of all ages). This information was recorded at the GP practice level. However, GP catchment areas are not mutually exclusive: they overlap, with some areas covered by 30+ GP practices. Therefore, to increase the clarity and usability of the data, the GP-level statistics were converted into statistics based on Middle Layer Super Output Area (MSOA) census boundaries.
The percentage of each MSOA’s population (all ages) with cancer was estimated. This was achieved by calculating a weighted average based on:
- the percentage of the MSOA’s area covered by each GP practice’s catchment area, and
- for the GPs covering part of that MSOA, the percentage of registered patients that have that illness.
The estimated percentage of each MSOA’s population with cancer was then combined with Office for National Statistics Mid-Year Population Estimates (2019) for MSOAs to estimate the number of people in each MSOA with cancer, within the relevant age range. Each MSOA was assigned a relative score between 1 and 0 (1 = worst, 0 = best) based on:
A) the PERCENTAGE of the population within that MSOA estimated to have cancer;
B) the NUMBER of people within that MSOA estimated to have cancer.
An average of scores A and B was taken and converted to a relative score between 1 and 0 (1 = worst, 0 = best). The closer the score is to 1, the greater both the number and the percentage of the population in the MSOA estimated to have cancer, compared with other MSOAs. In other words, these are areas where a large number of people are estimated to suffer from cancer and where those people make up a large percentage of the population, indicating a real issue with cancer within the population, and where the investment of resources to address that issue could have the greatest benefits. (An illustrative calculation is sketched at the end of this description.)

LIMITATIONS
1. GP data for the financial year 1st April 2018 – 31st March 2019 were used in preference to data for the financial year 1st April 2019 – 31st March 2020, as the onset of the COVID-19 pandemic during the latter year could have affected the reporting of medical statistics by GPs. However, for 53 GPs (out of 7670) that did not submit data in 2018/19, data from 2019/20 were used instead. Note also that some GPs (997 out of 7670) did not submit data in either year. This dataset should be viewed in conjunction with the ‘Health and wellbeing statistics (GP-level, England): Missing data and potential outliers’ dataset to determine areas where data from 2019/20 were used, where one or more GPs did not submit data in either year, or where there were large discrepancies between the 2018/19 and 2019/20 data (differences in statistics that were > mean +/- 1 St.Dev.), which suggests erroneous data in one of those years (it was not feasible for this study to investigate this further) and thus where data should be interpreted with caution. Note also that there are some rural areas (with little or no population) that do not officially fall into any GP catchment area (although this will not affect the results of this analysis if there are no people living in those areas).
2. Although all of the obesity/inactivity-related illnesses listed can be caused or exacerbated by inactivity and obesity, it was not possible to distinguish from the data the cause of the illnesses in patients: obesity and inactivity are highly unlikely to be the cause of all cases of each illness. By combining the data with data relating to levels of obesity and inactivity in adults and children (see the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset), we can identify where obesity/inactivity could be a contributing factor, and where interventions to reduce obesity and increase activity could be most beneficial for the health of the local population.
3. It was not feasible to incorporate the ultra-fine-scale geographic distribution of the populations registered with each GP practice or living within each MSOA. Populations might be concentrated in certain parts of a GP practice’s catchment area or MSOA and relatively sparse elsewhere. Therefore, the dataset should be used to identify general areas where there are high levels of cancer, rather than interpreting the boundaries between areas as ‘hard’ boundaries that mark definite divisions between areas with differing levels of cancer.

TO BE VIEWED IN COMBINATION WITH
This dataset should be viewed alongside the following datasets, which highlight areas of missing data and potential outliers in the data:
- Health and wellbeing statistics (GP-level, England): Missing data and potential outliers
- Levels of obesity, inactivity and associated illnesses (England): Missing data

DOWNLOADING THIS DATA
To access this data on your desktop GIS, download the ‘Levels of obesity, inactivity and associated illnesses: Summary (England)’ dataset.

DATA SOURCES
This dataset was produced using:
- Quality and Outcomes Framework data: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital.
- GP Catchment Outlines: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital. Data were cleaned by Ribble Rivers Trust before use.
- MSOA boundaries: © Office for National Statistics licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2021.
- Population data: Mid-2019 (June 30) Population Estimates for Middle Layer Super Output Areas in England and Wales. © Office for National Statistics licensed under the Open Government Licence v3.0. © Crown Copyright 2020.

COPYRIGHT NOTICE
The reproduction of this data must be accompanied by the following statement: © Ribble Rivers Trust 2021. Analysis carried out using data that is: Copyright © 2020, Health and Social Care Information Centre. The Health and Social Care Information Centre is a non-departmental body created by statute, also known as NHS Digital; © Office for National Statistics licensed under the Open Government Licence v3.0. Contains OS data © Crown copyright and database right 2021. © Crown Copyright 2020.

CaBA HEALTH & WELLBEING EVIDENCE BASE
This dataset forms part of the wider CaBA Health and Wellbeing Evidence Base.
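As an illustration of the weighted-average and scoring steps described under ANALYSIS METHODOLOGY, the sketch below reproduces the calculation on invented numbers. The MSOA codes, area fractions, prevalence figures and populations are made up, and min-max scaling is assumed for the "relative score between 1 and 0" (the exact rescaling method is not stated in this description).

```python
import pandas as pd

# Invented inputs: one row per GP-practice x MSOA overlap, giving the fraction
# of the MSOA's area inside that GP's catchment and the GP-level prevalence (%).
overlaps = pd.DataFrame({
    "msoa":       ["E0001", "E0001", "E0002", "E0002", "E0003"],
    "gp":         ["GP_A",  "GP_B",  "GP_B",  "GP_C",  "GP_C"],
    "area_frac":  [0.7,     0.3,     0.5,     0.5,     1.0],
    "gp_pct_ill": [2.8,     3.4,     3.4,     1.9,     1.9],
})
msoa_pop = pd.Series({"E0001": 8200, "E0002": 6100, "E0003": 9400})

# Area-weighted average of GP prevalence -> estimated % of each MSOA with the illness
overlaps["weighted"] = overlaps["area_frac"] * overlaps["gp_pct_ill"]
grouped = overlaps.groupby("msoa")
pct = grouped["weighted"].sum() / grouped["area_frac"].sum()

# Combine with population estimates -> estimated number of people with the illness
num = pct / 100 * msoa_pop

def relative_score(s):
    """Relative score between 0 and 1 (1 = worst), assuming min-max scaling."""
    return (s - s.min()) / (s.max() - s.min())

score_a = relative_score(pct)                       # A: based on percentage
score_b = relative_score(num)                       # B: based on number of people
combined = relative_score((score_a + score_b) / 2)  # rescaled average of A and B

print(pd.DataFrame({"pct": pct.round(2), "num": num.round(0), "score": combined.round(2)}))
```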
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Intensive longitudinal interventions (ILIs) have gained prominence as powerful tools for treating and preventing mental and behavioral disorders (Heron & Smyth, 2010). However, most studies analyzing ILI data use traditional methods such as ANOVA or linear mixed models, which overlook individual differences and the autocorrelation structure inherent in time-series data (Hedeker et al., 2008). Moreover, existing methods typically assess intervention effects based solely on changes in the mean level of key variables (e.g., anxiety). This study demonstrates how to model ILI data within the framework of dynamic structural equation modeling (DSEM) to evaluate intervention effects across three dimensions: mean, autoregression, and intra-individual variability (IIV), for two intervention designs: the non-randomized single-arm trial (NST) and the randomized controlled trial (RCT). We conducted two simulation studies to investigate sample size recommendations for DSEM in ILI studies, considering both statistical power and accuracy in parameter estimation (AIPE). Additionally, we compared the two designs on type I error rate in a separate simulation. Finally, we illustrated sample size planning using data from a pre-ILI study focused on reducing appearance anxiety.
Simulation Studies 1 and 2 investigated power and AIPE across varying sample sizes, as well as the required sample size, for both the NST and RCT designs. The effect sizes of the intervention effects on mean, autoregression, and IIV were fixed at a medium level. Two sample-size factors were manipulated: the number of participants (N = 30, 60, 100, 150, 200, 300, 400) and the number of time points (T = 10, 20, 40, 60, 80, 100). The data-generating models and fitted models were identical, with analyses conducted using Mplus 8.10 and Bayesian estimation. Model performance was assessed in terms of convergence rate, power and AIPE for the intervention effects, and bias in the standard errors of the intervention effects. Simulation Study 3 assessed the type I error rate for both designs when the change in the control group differed from zero, indicating a change (on average) due to time. Finally, the empirical study carried out sample size planning based on a pre-study aimed at reducing appearance anxiety using an ILI design.
The results are as follows. First, there were no convergence issues under any of the conditions. Second, power increased, and the width of the credible intervals decreased, as either N or T increased; however, a minimum of 60 participants was required to achieve adequate power (i.e., ). The relative bias in the intervention effects was generally small, with two exceptions: in the NST design, the intervention effects on autoregression and IIV were underestimated when the number of time points was low (i.e., T = 10 or 20), while in the RCT design, the intervention effect on the mean was underestimated when sample sizes at both levels were small (i.e., N = 30 or 60, T = 10). Bias in the standard errors was also minimal across conditions. Third, a credible-interval-width contour plot can be used to recommend sample sizes in DSEM. The sample size requirements based on power and AIPE differed between the NST and RCT designs, with the RCT requiring larger samples due to the addition of a control group.
Fourth, when a natural change (on average) occurred between the pre- and post-intervention phases, the NST design led to inflated type I error rates compared with the RCT design, particularly with larger sample sizes.
In conclusion, we first recommend using DSEM to analyze ILI data, as it better captures intervention effects on the mean, autoregression, and IIV. Second, practitioners should select either the NST or the RCT design based on theoretical and empirical considerations: while the RCT design controls for confounding factors such as time-related changes in the mean, it requires a larger sample size, and NST designs are usually conducted before large RCTs, with relatively small samples, especially when participants are rare. Finally, choosing the true parameters for the data-generating model is crucial in sample size planning with a Monte Carlo method. We suggest deriving these parameters from pre-studies, similar empirical studies, or meta-analyses when possible, as many parameters (i.e., those for the fixed and random effects) must be set in DSEM; if no prior information is available, we suggest following the procedures outlined in this study. This database includes the code for data generation and analysis in the simulation studies, as well as the data, code, and results for the empirical example.
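The study's simulations were run in Mplus 8.10 with Bayesian DSEM, which cannot be condensed into a short example. Purely as a sketch of the Monte Carlo bookkeeping for power and AIPE (the proportion of replications whose interval for the intervention effect excludes zero, and the average interval width), the code below substitutes a drastically simplified stand-in model: a pre–post shift in the person means of an AR(1) series, analyzed with a normal-theory confidence interval rather than a Bayesian credible interval. All parameter values are illustrative.

```python
import numpy as np

def simulate_ar1(n, mean, phi, rng, sd=1.0):
    """Generate an AR(1) series fluctuating around `mean` with autoregression `phi`."""
    x = np.empty(n)
    x[0] = mean + rng.normal(0, sd)
    for t in range(1, n):
        x[t] = mean + phi * (x[t - 1] - mean) + rng.normal(0, sd)
    return x

def power_and_aipe(n_persons=60, n_time=20, effect=0.4, n_reps=200, seed=0):
    """Monte Carlo bookkeeping for power and AIPE: power is the proportion of
    replications whose 95% interval excludes zero; AIPE is summarized by the
    average interval width. The stand-in 'analysis' is a normal-theory interval
    for the mean pre-post change in person means -- NOT the DSEM of the study."""
    rng = np.random.default_rng(seed)
    hits, widths = 0, []
    for _ in range(n_reps):
        diffs = []
        for _ in range(n_persons):
            shift = rng.normal(effect, 0.3)      # person-specific intervention effect
            pre = simulate_ar1(n_time, mean=0.0, phi=0.3, rng=rng)
            post = simulate_ar1(n_time, mean=shift, phi=0.3, rng=rng)
            diffs.append(post.mean() - pre.mean())
        diffs = np.asarray(diffs)
        se = diffs.std(ddof=1) / np.sqrt(n_persons)
        lo, hi = diffs.mean() - 1.96 * se, diffs.mean() + 1.96 * se
        hits += (lo > 0) or (hi < 0)
        widths.append(hi - lo)
    return hits / n_reps, float(np.mean(widths))

power, width = power_and_aipe(n_persons=60, n_time=20)
print(f"power ~= {power:.2f}, mean 95% interval width ~= {width:.2f}")
```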