This repository provides the raw data, analysis code, and results generated during a systematic evaluation of the impact of selected experimental protocol choices on the metagenomic sequencing analysis of microbiome samples. Briefly, a full factorial experimental design was implemented varying biological sample (n=5), operator (n=2), lot (n=2), extraction kit (n=2), 16S variable region (n=2), and reference database (n=3), and the main effects were calculated and compared between parameters (bias effects) and samples (real biological differences). A full description of the effort is provided in the associated publication.
According to our latest research, the global Variable Data Shrink Sleeve Printing market size reached USD 1.87 billion in 2024, demonstrating robust expansion driven by the increasing demand for personalized packaging solutions across various industries. The market is expected to grow at a CAGR of 7.1% from 2025 to 2033, projecting a market value of approximately USD 3.49 billion by 2033. This growth is primarily fueled by advancements in digital printing technologies, the rising trend of product customization, and stringent regulations regarding packaging authenticity and traceability.
The surge in demand for unique and personalized packaging is one of the key growth factors propelling the Variable Data Shrink Sleeve Printing market. As brands and manufacturers strive to differentiate their products on crowded shelves, the ability to incorporate variable data such as barcodes, QR codes, serialized numbers, and customized graphics has become crucial. This trend is particularly prominent in the food and beverage sector, where consumer engagement and anti-counterfeiting measures are vital. The flexibility offered by variable data printing enables brands to launch limited edition products, regional campaigns, and promotional activities, thus enhancing consumer interaction and brand loyalty.
Technological advancements in printing methods have significantly contributed to the market's upward trajectory. The integration of digital printing technology has revolutionized the shrink sleeve printing process, enabling high-speed, cost-effective, and high-quality production of short runs and complex designs. Flexographic and gravure printing also continue to evolve, offering improved color accuracy and substrate versatility. These innovations have made it easier for manufacturers to respond quickly to market trends and regulatory requirements, while reducing waste and operational costs. As a result, the adoption of variable data shrink sleeve printing is expanding across industries that require agility and precision in their packaging operations.
Another major growth driver is the increasing emphasis on regulatory compliance and product security. Governments and industry bodies worldwide are implementing stricter regulations to combat counterfeiting and ensure product authenticity, especially in sensitive sectors such as pharmaceuticals and personal care. Variable data printing allows for the integration of tamper-evident features and traceability elements directly onto shrink sleeves, providing a robust solution to meet these compliance standards. Moreover, the rise of e-commerce and global supply chains has further heightened the need for secure and trackable packaging, reinforcing the role of variable data shrink sleeve printing in modern packaging strategies.
Regionally, the Asia Pacific market stands out as a major contributor to global growth, supported by rapid industrialization, expanding retail sectors, and a burgeoning middle-class population. North America and Europe also exhibit strong demand, driven by advanced manufacturing infrastructure and a high focus on product innovation. Meanwhile, emerging markets in Latin America and the Middle East & Africa are witnessing increasing adoption, albeit at a relatively slower pace, as local brands recognize the value of sophisticated packaging in enhancing brand image and consumer trust.
The printing technology segment of the Variable Data Shrink Sleeve Printing market encompasses digital printing, flexographic printing, gravure printing, offset printing, and other emerging technologies. Digital printing has emerged as the fastest-growing sub-segment, owing to its unparalleled ability to deliver high-quality, customizable prints with minimal setup time. The technology’s capacity for on-demand printing and short production runs makes it ideal for brands seeking to implement targeted marketing campaigns or comply with regulatory requirements.
The QoG Institute is an independent research institute within the Department of Political Science at the University of Gothenburg. In total, 30 researchers conduct and promote research on the causes, consequences and nature of Good Governance and the Quality of Government - that is, trustworthy, reliable, impartial, uncorrupted and competent government institutions.
The main objective of our research is to address the theoretical and empirical problem of how political institutions of high quality can be created and maintained. A second objective is to study the effects of Quality of Government on a number of policy areas, such as health, the environment, social policy, and poverty.
The dataset was created as part of a research project titled “Quality of Government and the Conditions for Sustainable Social Policy”. The aim of the dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
The data comes in three versions: one cross-sectional dataset and two cross-sectional time-series datasets for a selection of countries. The two time-series datasets are called “long” (years 1946-2009) and “wide” (years 1970-2005).
The data contains six types of variables, each provided under its own heading in the codebook: social policy variables, tax system variables, social conditions, public opinion data, political indicators, and quality of government variables.
The QoG Social Policy Dataset can be downloaded from the Data Archive of the QoG Institute at http://qog.pol.gu.se/data/datadownloads/data-archive. Its variables are now included in the QoG Standard dataset.
Purpose:
The primary aim of QoG is to conduct and promote research on corruption. One aim of the QoG Institute is to make publicly available cross-national comparative data on QoG and its correlates. The aim of the QoG Social Policy Dataset is to promote cross-national comparative research on social policy output and its correlates, with a special focus on the connection between social policy and Quality of Government (QoG).
A cross-sectional version of the QoG Social Policy Dataset based on data from around 2002. If no data were available for a variable in 2002, data from the closest available year were used, though no further back in time than 1995.
Samanni, Marcus, Jan Teorell, Staffan Kumlin, Stefan Dahlberg, Bo Rothstein, Sören Holmberg & Richard Svensson. 2012. The QoG Social Policy Dataset, version 4Apr12. University of Gothenburg: The Quality of Government Institute. http://www.qog.pol.gu.se
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In clinical trials, not all randomised patients follow the course of treatment they are allocated to. The potential impact of such deviations is increasingly recognised, and it has been one of the reasons for a redefinition of the targets of estimation (“estimands”) in the ICH E9 draft Addendum. Among others, the effect of treatment assignment, regardless of adherence, appears to be an estimand of practical interest, in line with the intention-to-treat principle. This study aims at evaluating the performance of different estimation techniques in trials with incomplete post-discontinuation follow-up when a 'treatment-policy' strategy is implemented. In order to achieve that, we have (i) modelled a reasonable data-generating model and visualised it as a directed acyclic graph; (ii) investigated which set of variables allows identification and estimation of such an effect; (iii) simulated 10,000 trials in Major Depressive Disorder, with varying real treatment effects, proportions of patients discontinuing the treatment, and incomplete follow-up. Our results suggest that, at least in a 'missing at random' (MAR) setting, all studied estimation methods improve their performance when a variable representing compliance is used. This effect is more pronounced the higher the proportion of post-discontinuation follow-up is.
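A minimal R sketch of the kind of comparison described above, assuming a single simulated two-arm trial with treatment discontinuation and MAR drop-out; the variable names and the simple data-generating model are illustrative, not the authors' simulation code.

```r
set.seed(1)
n <- 500
arm      <- rbinom(n, 1, 0.5)                                # randomised assignment
adherent <- rbinom(n, 1, ifelse(arm == 1, 0.8, 0.9))         # compliance indicator
outcome  <- 20 - 3 * arm * adherent + rnorm(n, sd = 5)       # e.g. a symptom score
# MAR drop-out: probability of being observed depends on observed compliance only
observed <- as.logical(rbinom(n, 1, ifelse(adherent == 1, 0.9, 0.6)))
d <- data.frame(arm, adherent, outcome)[observed, ]

# Treatment-policy estimand: effect of assignment regardless of adherence
coef(lm(outcome ~ arm, data = d))["arm"]              # complete-case, unadjusted
coef(lm(outcome ~ arm + adherent, data = d))["arm"]   # adjusted for the compliance variable
```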
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In many research fields, measurement data containing a large proportion of zeros are often called semicontinuous data. For semicontinuous data, the most common approach is the two-part model, which fits a separate regression model to the zero-valued part and to the nonzero-valued part. Because each part of the two-part regression model often involves a large number of candidate variables, variable selection becomes an important problem in semicontinuous data analysis; however, there is little research literature on this topic. To bridge this gap, we propose a new type of variable selection method for the two-part regression model. In this paper, the Bernoulli-Normal two-part (BNT) regression model is presented, and a variable selection method based on the Lasso penalty function is proposed. Because the Lasso estimator does not have the oracle property, we then propose a variable selection method based on the adaptive Lasso penalty function. The simulation results show that both methods can select variables for the BNT regression model and are easy to implement, and that the adaptive Lasso method outperforms the Lasso method. We demonstrate the effectiveness of the proposed tools using dietary intake data to further analyze the important factors affecting the dietary intake of patients.
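A minimal sketch of fitting a Bernoulli-Normal two-part model with Lasso penalties using the glmnet package in R; this is an assumed illustration of the general approach on simulated data, not the authors' implementation (an adaptive Lasso could be obtained by supplying data-driven weights via glmnet's penalty.factor argument).

```r
library(glmnet)
set.seed(1)
n <- 300; p <- 20
x       <- matrix(rnorm(n * p), n, p)
nonzero <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))                 # Bernoulli part: is the value positive?
y       <- ifelse(nonzero == 1, exp(0.5 * x[, 3] + rnorm(n, sd = 0.3)), 0)

# Part 1: logistic Lasso for zero vs. non-zero
cv_bin  <- cv.glmnet(x, nonzero, family = "binomial", alpha = 1)
# Part 2: Gaussian Lasso on the positive values (log scale)
pos     <- y > 0
cv_norm <- cv.glmnet(x[pos, ], log(y[pos]), family = "gaussian", alpha = 1)

coef(cv_bin,  s = "lambda.min")   # variables selected in the binary part
coef(cv_norm, s = "lambda.min")   # variables selected in the continuous part
```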
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data used for the manuscript "Living in a variable visual environment: How stable versus fluctuating suspended sediments affect fish behavior". The sister code repository is located at DOI: 10.5281/zenodo.15103558
This repository contains TigerRAY moored deployment data for each day on which data was collected between January 10, 2024 and March 3, 2024. For sensors on and inside TigerRAY, there is one .mat file for each day on which TigerRAY operated and collected data. These files are labeled "DDMMMYYYY_TigerRAYdata.mat", corresponding to the date collected. Each .mat file contains a single variable, data, formatted into:
- time stamps and load cell readings from the heave plate
- time stamps and data from the two heave-plate-mounted pressure sensors
- a structure containing data from the heave-plate-mounted IMU
- data collected by the central data acquisition system in the nacelle
- timestamps and data from encoder 1
- timestamps and data from encoder 2
- a structure containing data from the nacelle-mounted IMU
- data from the satellite compass mounted to the mast of the nacelle
For SWIFT data, there is one data file that contains all reprocessed SWIFT data for the entire deployment. This file contains three structures, named SWIFT22_rp, SWIFT23_rp, and SWIFT24_rp. Reprocessing of the data was done to remove frequency components in the wave spectra with frequencies < 0.2 Hz; the remaining energy is distributed between 0.2 Hz and 1 Hz. New significant wave height, peak period, energy period, and peak direction were then calculated from these trimmed energy spectra. See the attached data guide for a complete summary of the data included in this submission, a description of the data products (TigerRAY and SWIFT data), and deployment setup information and figures.
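A minimal sketch of loading one daily TigerRAY file into R with the R.matlab package; the example file name below follows the naming convention above but is assumed, not taken from the submission.

```r
library(R.matlab)
# Example file name following the "DDMMMYYYY_TigerRAYdata.mat" convention (assumed)
day <- readMat("10JAN2024_TigerRAYdata.mat")
str(day$data, max.level = 1)   # the single 'data' variable holding the fields listed above
```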
The complexity and high dimensionality of neuroimaging data pose problems for decoding information with machine learning (ML) models because the number of features is often much larger than the number of observations. Feature selection is one of the crucial steps for determining meaningful target features in decoding; however, optimizing feature selection from such high-dimensional neuroimaging data has been challenging using conventional ML models. Here, we introduce an efficient and high-performance decoding package incorporating a forward variable selection (FVS) algorithm and hyperparameter optimization that automatically identifies the best feature pairs for both classification and regression models, where a total of 18 ML models are implemented by default. First, the FVS algorithm evaluates the goodness-of-fit across different models using a k-fold cross-validation step that identifies the best subset of features based on a predefined criterion for each model. Next, the hyperparameters of each ML model are optimized at each forward iteration. Final outputs highlight an optimized number of selected features (brain regions of interest) for each model together with its accuracy. Furthermore, the toolbox can be executed in a parallel environment for efficient computation on a typical personal computer. With the optimized forward variable selection decoder (oFVSD) pipeline, we verified the effectiveness of decoding sex classification and age range regression on 1,113 structural magnetic resonance imaging (MRI) datasets. Compared to ML models without the FVS algorithm and with the Boruta algorithm as a variable selection counterpart, we demonstrate that oFVSD significantly outperformed the counterpart models without FVS across all of the ML models (approximately a 0.20 increase in correlation coefficient, r, for regression models and an 8% increase for classification models on average) and the models using the Boruta variable selection algorithm (approximately a 0.07 improvement in regression and 4% in classification models). Furthermore, we confirmed that the use of parallel computation considerably reduced the computational burden for the high-dimensional MRI data. Altogether, the oFVSD toolbox efficiently and effectively improves the performance of both classification and regression ML models, providing a use case example on MRI datasets. With its flexibility, oFVSD has the potential for many other modalities in neuroimaging. This open-source and freely available Python package makes it a valuable toolbox for research communities seeking improved decoding accuracy.
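For illustration only, here is a minimal R sketch of the generic forward-variable-selection idea described above (greedily adding the feature that most improves k-fold cross-validated accuracy); the oFVSD toolbox itself is a Python package, and this sketch uses simulated data with a single logistic-regression model rather than its 18 ML models.

```r
library(caret)   # createFolds
set.seed(1)
n <- 200; p <- 10
x <- as.data.frame(matrix(rnorm(n * p), n, p))
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, "A", "B"))

cv_accuracy <- function(vars, k = 5) {
  folds <- createFolds(y, k = k)
  mean(sapply(folds, function(test_idx) {
    fit  <- glm(y[-test_idx] ~ ., data = x[-test_idx, vars, drop = FALSE], family = binomial)
    prob <- predict(fit, newdata = x[test_idx, vars, drop = FALSE], type = "response")
    mean((prob > 0.5) == (y[test_idx] == levels(y)[2]))
  }))
}

selected <- character(0); best <- 0
repeat {
  remaining <- setdiff(names(x), selected)
  if (length(remaining) == 0) break
  scores <- sapply(remaining, function(v) cv_accuracy(c(selected, v)))
  if (max(scores) <= best) break            # stop when no candidate improves the CV score
  best <- max(scores)
  selected <- c(selected, names(which.max(scores)))
}
selected   # greedily chosen feature subset
```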
This dataset is designed for beginners to practice regression problems, particularly in the context of predicting house prices. It contains 1000 rows, with each row representing a house and various attributes that influence its price. The dataset is well-suited for learning basic to intermediate-level regression modeling techniques.
Beginner Regression Projects: This dataset can be used to practice building regression models such as Linear Regression, Decision Trees, or Random Forests. The target variable (house price) is continuous, making this an ideal problem for supervised learning techniques.
Feature Engineering Practice: Learners can create new features by combining existing ones, such as the price per square foot or age of the house, providing an opportunity to experiment with feature transformations.
Exploratory Data Analysis (EDA): You can explore how different features (e.g., square footage, number of bedrooms) correlate with the target variable, making it a great dataset for learning about data visualization and summary statistics.
Model Evaluation: The dataset allows for various model evaluation techniques such as cross-validation, R-squared, and Mean Absolute Error (MAE). These metrics can be used to compare the effectiveness of different models.
The dataset is highly versatile for a range of machine learning tasks. You can apply simple linear models to predict house prices based on one or two features, or use more complex models like Random Forest or Gradient Boosting Machines to understand interactions between variables.
It can also be used for dimensionality reduction techniques like PCA or to practice handling categorical variables (e.g., neighborhood quality) through encoding techniques like one-hot encoding.
This dataset is ideal for anyone wanting to gain practical experience in building regression models while working with real-world features.
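A minimal R sketch of the kind of workflow this dataset supports; the file name and the column names (price, square_feet, bedrooms, neighborhood_quality) are hypothetical placeholders, so adapt them to the actual headers.

```r
# Hypothetical file and column names -- adjust to the real dataset
houses <- read.csv("house_prices.csv", stringsAsFactors = TRUE)

# Feature engineering and one-hot encoding of a categorical column
houses$price_per_sqft <- houses$price / houses$square_feet
head(model.matrix(~ neighborhood_quality - 1, data = houses))   # one-hot expansion

# 80/20 train/test split and a baseline linear regression
set.seed(1)
train_idx <- sample(nrow(houses), 0.8 * nrow(houses))
fit  <- lm(price ~ square_feet + bedrooms + neighborhood_quality, data = houses[train_idx, ])
pred <- predict(fit, newdata = houses[-train_idx, ])
c(MAE = mean(abs(pred - houses$price[-train_idx])),   # Mean Absolute Error on held-out data
  R2  = summary(fit)$r.squared)                       # in-sample R-squared
```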
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here you can find the model results of the report:
De Felice, M., Busch, S., Kanellopoulos, K., Kavvadias, K. and Hidalgo Gonzalez, I., Power system flexibility in a variable climate, EUR 30184 EN, Publications Office of the European Union, Luxembourg, 2020, ISBN 978-92-76-18183-5 (online), doi:10.2760/75312 (online), JRC120338.
This dataset contains the raw GDX files generated by the GAMS optimiser for the Dispa-SET model. Details on the output format and the names of the variables can be found in the Dispa-SET documentation. A markdown notebook in R (and the rendered PDF) contains an example of how to read the GDX files in R.
We also include in this dataset a data frame saved in the Apache Parquet format that can be read both in R and Python.
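As a minimal sketch, the bundled Parquet data frame can be read in R with the arrow package (the file name below is a placeholder; use the actual file shipped with the dataset), and in Python the same file can be read with pandas.read_parquet.

```r
library(arrow)
results <- read_parquet("dispaset_results.parquet")   # placeholder file name
str(results)                                          # inspect variables and types
```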
A description of the methodology and the data sources, with references, can be found in the report.
Linked resources
Input files: https://zenodo.org/record/3775569#.XqqY3JpS-fc
Source code for the figures: https://github.com/energy-modelling-toolkit/figures-JRC-report-power-system-and-climate-variability
Update
[29/06/2020] Uploaded a new version of the Parquet file with the correct data in the column climate_year
https://www.icpsr.umich.edu/web/ICPSR/studies/21742/terms
This data collection contains information compiled from the questions asked of a sample of persons and housing units enumerated in Census 2000. Population items include sex, age, race, Hispanic or Latino origin, type of living quarters (household/group quarters), urban/rural status, household relationship, marital status, grandparents as caregivers, language and ability to speak English, ancestry, place of birth, citizenship status and year of entry into the United States, migration, place of work, journey to work (commuting), school enrollment and educational attainment, veteran status, disability, employment status, occupation and industry, class of worker, income, and poverty status. Housing items include vacancy status, tenure (owner/renter), number of rooms, number of bedrooms, year moved into unit, household size, occupants per room, number of units in structure, year structure was built, heating fuel, telephone service, plumbing and kitchen facilities, vehicles available, value of home, and monthly rent. With subject content identical to that provided in Summary File 3, the information is presented in 813 tables that are tabulated for every geographic unit represented in the data. There is one variable per table cell, plus additional variables with geographic information. The data cover more than a dozen geographic levels of observation (known as "summary levels" in the Census Bureau's nomenclature) based on the 108th Congressional Districts, e.g., the 108th Congressional Districts, themselves, Census tracts within the 108th Congressional Districts, and county subdivisions within the 108th Congressional Districts. There are 77 data files for each state, the District of Columbia, and Puerto Rico. The collection is supplied in 54 ZIP archives. There is a separate ZIP file for each state, the District of Columbia, and Puerto Rico, and for the convenience of those who need all of the data, a separate ZIP archive with all 4,004 data files. The codebook and other documentation are located in the last ZIP archive.
The National Child Development Study (NCDS) is a continuing longitudinal study that seeks to follow the lives of all those living in Great Britain who were born in one particular week in 1958. The aim of the study is to improve understanding of the factors affecting human development over the whole lifespan.
The NCDS has its origins in the Perinatal Mortality Survey (PMS) (the original PMS study is held at the UK Data Archive under SN 2137). This study was sponsored by the National Birthday Trust Fund and designed to examine the social and obstetric factors associated with stillbirth and death in early infancy among the 17,000 children born in England, Scotland and Wales in that one week. Selected data from the PMS form NCDS sweep 0, held alongside NCDS sweeps 1-3, under SN 5565.
Survey and Biomeasures Data (GN 33004):
To date there have been ten attempts to trace all members of the birth cohort in order to monitor their physical, educational and social development. The first three sweeps were carried out by the National Children's Bureau, in 1965, when respondents were aged 7, in 1969, aged 11, and in 1974, aged 16 (these sweeps form NCDS1-3, held together with NCDS0 under SN 5565). The fourth sweep, also carried out by the National Children's Bureau, was conducted in 1981, when respondents were aged 23 (held under SN 5566). In 1985 the NCDS moved to the Social Statistics Research Unit (SSRU) - now known as the Centre for Longitudinal Studies (CLS). The fifth sweep was carried out in 1991, when respondents were aged 33 (held under SN 5567). For the sixth sweep, conducted in 1999-2000, when respondents were aged 42 (NCDS6, held under SN 5578), fieldwork was combined with the 1999-2000 wave of the 1970 Birth Cohort Study (BCS70), which was also conducted by CLS (and held under GN 33229). The seventh sweep was conducted in 2004-2005 when the respondents were aged 46 (held under SN 5579), the eighth sweep was conducted in 2008-2009 when respondents were aged 50 (held under SN 6137), the ninth sweep was conducted in 2013 when respondents were aged 55 (held under SN 7669), and the tenth sweep was conducted in 2020-24 when the respondents were aged 60-64 (held under SN 9412).
A Secure Access version of the NCDS is available under SN 9413, containing detailed sensitive variables not available under Safeguarded access (currently only sweep 10 data). Variables include uncommon health conditions (including age at diagnosis), full employment codes and income/finance details, and specific life circumstances (e.g. pregnancy details, year/age of emigration from GB).
Four separate datasets covering responses to NCDS over all sweeps are available. National Child Development Deaths Dataset: Special Licence Access (SN 7717) covers deaths; National Child Development Study Response and Outcomes Dataset (SN 5560) covers all other responses and outcomes; National Child Development Study: Partnership Histories (SN 6940) includes data on live-in relationships; and National Child Development Study: Activity Histories (SN 6942) covers work and non-work activities. Users are advised to order these studies alongside the other waves of NCDS.
From 2002-2004, a Biomedical Survey was completed and is available under Safeguarded Licence (SN 8731) and Special Licence (SL) (SN 5594). Proteomics analyses of blood samples are available under SL SN 9254.
Linked Geographical Data (GN 33497):
A number of geographical variables are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies.
Linked Administrative Data (GN 33396):
A number of linked administrative datasets are available, under more restrictive access conditions, which can be linked to the NCDS EUL and SL access studies. These include a Deaths dataset (SN 7717) available under SL and the Linked Health Administrative Datasets (SN 8697) available under Secure Access.
Multi-omics Data and Risk Scores Data (GN 33592)
Proteomics analyses were run on the blood samples collected from NCDS participants in 2002-2004 and are available under SL SN 9254. Metabolomics analyses were conducted on respondents of sweep 10 and are available under SL SN 9411. Polygenic indices are available under SL SN 9439. Derived summary scores have been created that combine the estimated effects of many different genes on a specific trait or characteristic, such as a person's risk of Alzheimer's disease, asthma, substance abuse, or mental health disorders, for example. These scores can be combined with existing survey data to offer a more nuanced understanding of how cohort members' outcomes may be shaped.
Additional Sub-Studies (GN 33562):
In addition to the main NCDS sweeps, further studies have also been conducted on a range of subjects such as parent migration, unemployment, behavioural studies and respondent essays. The full list of NCDS studies available from the UK Data Service can be found on the NCDS series access data webpage.
How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
For information on how to access biomedical data from NCDS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.
Further information about the full NCDS series can be found on the Centre for Longitudinal Studies website.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Dental care patients (n=95) participated in a quasi-experiment during 2012 in Sweden. The respondents were interviewed twice about dental visits they had made between 2002 and 2012. For verification purposes, the narratives were compared to the dental records. The qualitative data has been quantified and is stored as .csv supplemented with a codebook in plain text. In addition, all study material is freely available online at https://osf.io/thwcb. For anonymity reasons, a few adjustments were made to the shared data set: three continuous variables were categorised, one variable, sex, was removed, and all respondents were randomly assigned new ID-numbers (to avoid potential self-identification). The data can be reused to further analyse memory for repeated events. It can be used as experimental data (including both interviews) or as single interview data (including data from only the first interview).
The Digital Bathymetric Data Base Variable resolution (DBDBV) is a digital bathymetric data base that provides ocean depths at various gridded resolutions. DBDBV was developed by NAVOCEANO to support the generation of bathymetric chart products, and to provide bathymetric data to be integrated with other geophysical and environmental parameters for ocean modeling. Grid resolutions available include 0.5, 1, 2, and 5 minutes of latitude/longitude (1 minute of latitude = 1 nautical mile or 1.852 km). For data coverage for each resolution, see Related URLs below.
The data base for DBDBV consists of four file types. The depth information is expressed in meters, uncorrected at an assumed sound velocity of 1500 meters per second. The first file is a Master Index File that contains a pointer or byte address to each populated one-degree cell for each of the resolutions available. The second file is an Index File that provides a linkage to the detailed depth values, as well as a linkage to a description file associated with the depths. The third file, or Description File, provides details on the compression, scaling and storage of the depth information. The fourth file is the Data File that contains the depth values for a one-degree cell of a specific resolution. The geographic coverage of the 5 minute gridded bathymetry of the current version of DBDBV includes all ocean areas and adjacent seas from 78 degrees South latitude to 90 degrees North latitude.
The geographic coverage of the 2 minute gridded bathymetry of the current version of DBDBV includes:
a. Mediterranean Sea including the Adriatic Sea and the Black Sea, b. Red Sea, c. Persian Gulf, d. Gulf of Aden west of 50 degrees East longitude, e. Gulf of Oman north of 23 degrees North latitude and west of 70 degrees East longitude, and f. an area encompassing the Bay of Biscay, Gulf of Cadiz and the Atlantic approaches to the Straits of Gibraltar bounded by 10 degrees West longitude, 30 degrees North latitude and 48 degrees North latitude.
The geographic coverage of the 1 minute gridded bathymetry of the current version of DBDBV includes:
a. Mediterranean Sea including the Adriatic Sea and the Black Sea, b. an area of the Baltic Sea bounded by 15 degrees East longitude, 25 degrees East longitude, 54 degrees North latitude and 60 degrees North latitude, c. the Atlantic approaches to the Straits of Gibraltar bounded by 10 degrees West longitude, 35 degrees North latitude and 40 degrees North latitude, and d. an area of the eastern Pacific bounded by the west coasts of Mexico and the United States, 140 degrees West longitude, 29 degrees North latitude and 45 degrees North latitude.
The geographic extent of the 0.5 minute gridded bathymetry of the current version of DBDBV is selectively dispersed areas of the world.
Another source for this data is the National Oceanic and Atmospheric Administration's National Geophysical Data Center (NOAA/NGDC), as DBDBV is the bathymetry in ETOPO2, and is also contained in TerrainBase held at Bruce Gittings' Data Catalogue.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6HPRIG
This dataset contains the data and code necessary to replicate work in the following paper: Narayan, Sneha, Jake Orlowitz, Jonathan Morgan, Benjamin Mako Hill, and Aaron Shaw. 2017. “The Wikipedia Adventure: Field Evaluation of an Interactive Tutorial for New Users.” In Proceedings of the 20th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW '17). New York, New York: ACM Press. http://dx.doi.org/10.1145/2998181.2998307
The published paper contains two studies. Study 1 is a descriptive analysis of a survey of Wikipedia editors who played a gamified tutorial. Study 2 is a field experiment that evaluated the same tutorial. These data are the data used in the field experiment described in Study 2.
Description of Files
This dataset contains the following files beyond this README:
- twa.RData — An RData file that includes all variables used in Study 2.
- twa_analysis.R — A GNU R script that includes all the code used to generate the tables and plots related to Study 2 in the paper.
The RData file contains one variable (d), which is an R dataframe (i.e., table) that includes the following columns:
- userid (integer): The unique numerical ID representing each user in our sample. These are 8-digit integers and describe public accounts on Wikipedia.
- sample.date (date string): The day the user was recruited to the study. Dates are formatted in “YYYY-MM-DD” format. In the case of invitees, it is the date their invitation was sent. For users in the control group, this is the date that they would have been invited to the study.
- edits.all (integer): The total number of edits made by the user on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page and subpages are ignored.
- edits.ns0 (integer): The total number of edits made by the user to article pages on Wikipedia in the 180 days after they joined the study.
- edits.talk (integer): The total number of edits made by the user to talk pages on Wikipedia in the 180 days after they joined the study. Edits to a user's user page, user talk page and subpages are ignored.
- treat (logical): TRUE if the user was invited, FALSE if the user was in the control group.
- play (logical): TRUE if the user played the game, FALSE if the user did not. All users in control are listed as FALSE because any user who had not been invited to the game but played was removed.
- twa.level (integer): Takes a value of 0 if the user has not played the game. Ranges from 1 to 7 for those who did, indicating the highest level they reached in the game.
- quality.score (float): The average word persistence (over a 6-revision window) over all edits made by this userid. Our measure of word persistence (persistent word revision per word) is a measure of edit quality developed by Halfaker et al. that tracks how long words in an edit persist after subsequent revisions are made to the wiki-page. For more information on how word persistence is calculated, see the following paper: Halfaker, Aaron, Aniket Kittur, Robert Kraut, and John Riedl. 2009. “A Jury of Your Peers: Quality, Experience and Ownership in Wikipedia.” In Proceedings of the 5th International Symposium on Wikis and Open Collaboration (OpenSym '09), 1–10. New York, New York: ACM Press. doi:10.1145/1641309.1641332.
Or this page: https://meta.wikimedia.org/wiki/Research:Content_persistence
How we created twa.RData
The file twa.RData combines datasets drawn from three places:
- A dataset created by Wikimedia Foundation staff that tracked the details of the experiment and how far people got in the game. The variables userid, sample.date, treat, play, and twa.level were all generated in a dataset created by WMF staff when The Wikipedia Adventure was deployed. All users in the sample created their accounts within 2 days before the date they were entered into the study. None of them had received a Teahouse invitation, a Level 4 user warning, or been blocked from editing at the time that they entered the study. Additionally, all users made at least one edit after the day they were invited. Users were sorted randomly into treatment and control groups, based on which they either received or did not receive an invite to play The Wikipedia Adventure.
- Edit and text persistence data drawn from public XML dumps created on May 21st, 2015. We used publicly available XML dumps to generate the outcome variables, namely edits.all, edits.ns0, edits.talk and quality.score. We first extracted all edits made by users in our sample during the six-month period after they joined the study, excluding edits made to user pages or user talk pages. We parsed the XML dumps using the Python-based wikiq and MediaWikiUtilities software, online at: http://projects.mako.cc/source/?p=mediawiki_dump_tools and https://github.com/mediawiki-utilities/python-mediawiki-utilities We obtained the XML dumps from: https://dumps.wikimedia.org/enwiki/
- A list of edits made by users in our study that were subsequently deleted, created on...
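As a minimal sketch (separate from the bundled twa_analysis.R script), the replication data can be loaded in R and the documented columns of the data frame d summarised by treatment assignment:

```r
load("twa.RData")   # provides the data frame `d` described above

# Mean editing activity in the 180 days after joining, by invitation status
aggregate(cbind(edits.all, edits.ns0, edits.talk) ~ treat, data = d, FUN = mean)

# Simple difference-in-means check on overall edits
t.test(edits.all ~ treat, data = d)
```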
https://creativecommons.org/publicdomain/zero/1.0/
Source
- All data was collected from the NBA.com website and Basketball-Reference.com
- To view the raw data, and the steps I took to clean and format it, you can click the link below: https://docs.google.com/spreadsheets/d/1bJnc1n-pXVjtqKul1NnjOq0mYl9-7FZy_CbM2gTmTLA/edit?usp=sharing
Context
All data is from the 2022-2023, 82-game regular season
Inspiration
I gathered this data to perform an analysis with the goal of answering the questions:
- From where did the Boston Celtics shoot the highest field goal percent?
- When did the Boston Celtics shoot the highest field goal percent?
- Under what conditions did the Boston Celtics shoot the highest field goal percent?
THE CLEANED AND HARMONIZED VERSION OF THE SURVEY DATA PRODUCED AND PUBLISHED BY THE ECONOMIC RESEARCH FORUM REPRESENTS 100% OF THE ORIGINAL SURVEY DATA COLLECTED BY THE DEPARTMENT OF STATISTICS OF THE HASHEMITE KINGDOM OF JORDAN
The Department of Statistics (DOS) carried out four rounds of the 2012 Employment and Unemployment Survey (EUS) during 2012. The survey rounds covered a total sample of about fifty-three thousand households nationwide (53.4 thousand). The sampled households were selected using a stratified cluster sampling design.
It is worth mentioning that DOS employed new technology in data collection and processing: data were collected using an electronic questionnaire on a handheld device (PDA) instead of a hard copy.
The raw survey data provided by the Statistical Agency were cleaned and harmonized by the Economic Research Forum, in the context of a major project that started in 2009, during which extensive efforts have been exerted to acquire, clean, harmonize, preserve and disseminate micro data from existing labor force surveys in several Arab countries.
The survey covers a representative sample at the national level (Kingdom), for the governorates, and for the three regions (Central, North and South).
1- Household/family. 2- Individual/person.
The survey covered a national sample of households and all individuals permanently residing in surveyed households.
Sample survey data [ssd]
The sample of this survey is based on the frame provided by the data of the Population and Housing Census, 2004. The Kingdom was divided into strata, where each city with a population of 100,000 persons or more was considered as a large city. The total number of these cities is 6. Each governorate (except for the 6 large cities) was divided into rural and urban areas. The rest of the urban areas in each governorate were considered as an independent stratum. The same was applied to rural areas where they were considered as an independent stratum. The total number of strata was 30.
And because of the existence of significant variation in the social and economic characteristics in large cities in particular, and in urban areas in general, each stratum of the large cities and urban strata was divided into four sub-strata according to the socio-economic characteristics provided by the Population and Housing Census 2004, aiming to provide homogeneous strata.
The sample of this survey was designed using a stratified cluster sampling method. The sample is considered representative at the Kingdom, rural, urban, region and governorate levels; however, it does not represent non-Jordanians.
The frame excludes the population living in remote areas (most of whom are nomads). In addition, the frame does not include collective dwellings, such as hotels, hospitals, work camps, prisons and the like. However, it is worth noting that the collective households identified in the harmonized data, through a variable indicating the household type, are those reported without heads in the raw data, and in which the relationship of all household members to the head was reported as "other".
This sample is also not representative for the non-Jordanian population.
Face-to-face [f2f]
The questionnaire was designed electronically on the PDA and revised by the DOS technical staff. It is divided into a number of main topics, each containing a clear and consistent group of questions, and designed in a way that facilitates the electronic data entry and verification. The questionnaire includes the characteristics of household members in addition to the identification information, which reflects the administrative as well as the statistical divisions of the Kingdom.
A tabulation results plan has been set based on the previous Employment and Unemployment Surveys while the required programs were prepared and tested. When all prior data processing steps were completed, the actual survey results were tabulated using an ORACLE package. The tabulations were then thoroughly checked for consistency of data. The final report was then prepared, containing detailed tabulations as well as the methodology of the survey.
The results of the fieldwork indicated that the number of successfully completed interviews was 48,880 (a response rate of around 91%).
https://creativecommons.org/publicdomain/zero/1.0/
Main Objective: To find out the best model to predict the churners. Models used: Logistic Regression, Decision Tree and Random Forest, evaluated with ROC and AUC.
Steps Involved:
- Read the data. Column "Churn" is the dependent variable
- Data cleaning involved checking for missing data and converting categorical vectors to factor vectors.
- Run Logistic Regression
- Used the stepwise function, which starts with all the independent variables and removes the insignificant ones one after the other
- Variables that were removed: "AccountWeek", "DayCalls", "DataUsage", "MonthlyCharge"
- Checked for multicollinearity in the model using the VIF function and found none, as the values are all less than 5
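A minimal R sketch of the steps above; the data frame name churn_df is illustrative, while the column names follow the description.

```r
## Sketch only: churn_df is a placeholder for the loaded dataset
churn_df$Churn <- as.factor(churn_df$Churn)               # dependent variable as factor
full_model <- glm(Churn ~ ., data = churn_df, family = binomial)
step_model <- step(full_model, direction = "backward")    # drops insignificant variables one by one
car::vif(step_model)                                      # multicollinearity check (all values < 5 here)
```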
- Split the data into train and test (80/20) using the createDataPartition function
- Use the train data to create the model and the test data for prediction
- Created a new column called "class", which holds the predicted class, in the "test" data frame
- Use the confusion matrix to find out the sensitivity, specificity and accuracy of the model
- Accuracy = 85.89% (prediction of both churners and non churners), Sensitivity = 11.45% (prediction of churners out of the total number of churners), Specificity = 98.42% (prediction of non churners out of the total number of non churners)
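A minimal R sketch of the split-and-evaluate step, using caret's createDataPartition and confusionMatrix as referenced above; object names are illustrative, and churn_df comes from the earlier sketch.

```r
library(caret)
set.seed(1234)
in_train <- createDataPartition(churn_df$Churn, p = 0.8, list = FALSE)
train <- churn_df[in_train, ]
test  <- churn_df[-in_train, ]

fit  <- glm(Churn ~ ., data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
test$class <- factor(ifelse(prob > 0.5, 1, 0), levels = levels(test$Churn))
confusionMatrix(test$class, test$Churn, positive = "1")   # accuracy, sensitivity, specificity
```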
- Sensitivity or True Positive rate is important as we want to predict the total number of churners.
- In the dependent variable Churn, there are 2 levels of observations (1 = churners and 0 = non churners)
- A lower sensitivity rate (11.45% in this case) can sometimes be due to imbalanced data
- It is clearly visible that there are significantly more non churners (2280) in the dataset as opposed to churners (387)
- We will use the function “ovun.sample” from package “ROSE” to balance the data by oversampling, undersampling and both sampling
- Observations: 2280*2 = 4560 for oversampling, 387*2 = 774 for undersampling & (4560+774)/2 = 2667 for both sampling
- Oversampling data is stored in over_data, undersampling data in under_data, both sampling data in both_data
- We will use the confusion matrix to find out the sensitivity, accuracy and specificity for over_data, under_data, both_data. We will set the seed(1234) for all three
- Confusion matrix provides us the Accuracy (79.13%), Sensitivity (75%) and Specificity (79.82%) for oversampling in Logistic Regression
- Confusion matrix provides us the Accuracy (79.13%), Sensitivity (72.92%) and Specificity (80.18%) for under sampling in Logistic Regression
- Confusion matrix provides us the Accuracy (79.13%), Sensitivity (75%) and Specificity (79.82%) for both sampling in Logistic Regression
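A minimal R sketch of the balancing step with ROSE::ovun.sample and the re-fitted logistic regression, using the sample sizes reported above; object names follow the description, and train/test come from the earlier sketches.

```r
library(ROSE)
library(caret)
set.seed(1234)
over_data  <- ovun.sample(Churn ~ ., data = train, method = "over",  N = 4560)$data
under_data <- ovun.sample(Churn ~ ., data = train, method = "under", N = 774)$data
both_data  <- ovun.sample(Churn ~ ., data = train, method = "both",  N = 2667)$data

fit_over <- glm(Churn ~ ., data = over_data, family = binomial)
pred     <- factor(ifelse(predict(fit_over, newdata = test, type = "response") > 0.5, 1, 0),
                   levels = levels(test$Churn))
confusionMatrix(pred, test$Churn, positive = "1")   # repeat for under_data and both_data
```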
- WE USE DECISION TREE FOR PREDICTION FOR OVERSAMPLING, UNDERSAMPLING AND BOTH SAMPLING
- We will use the library(“rpart”) for this purpose. We will set the seed(1234) for all three.
- Confusion matrix provides us the Accuracy (87.84%), Sensitivity (82.29%) and Specificity (88.77%) for oversampling in Decision Tree
- Confusion matrix provides us the Accuracy (90.24%), Sensitivity (83.33%) and Specificity (91.40%) for under - sampling in Decision Tree
- Confusion matrix provides us the Accuracy (87.84%), Sensitivity (84.38%) and Specificity (88.42%) for both sampling in Decision Tree
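A minimal R sketch of the decision-tree step with rpart, using over_data and test from the sketches above; the same pattern applies to under_data and both_data.

```r
library(rpart)
library(caret)
set.seed(1234)
tree_over <- rpart(Churn ~ ., data = over_data, method = "class")
pred_tree <- predict(tree_over, newdata = test, type = "class")
confusionMatrix(pred_tree, test$Churn, positive = "1")
```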
- WE USE RANDOM FOREST FOR PREDICTION FOR OVERSAMPLING, UNDERSAMPLING AND BOTH SAMPLING
- We will use the library(“randomForest”) for this purpose. We will set the seed(1234) for all three.
- Confusion matrix provides us the Accuracy (93.09%), Sensitivity (66.67%) and Specificity (97.54%) for oversampling in Random Forest
- Confusion matrix provides us the Accuracy (88.74%), Sensitivity (81.25%) and Specificity (90%) for under sampling in Random Forest
- Confusion matrix provides us the Accuracy (92.79%),...
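A minimal R sketch of the random-forest step, again using the balanced data and test set from the sketches above.

```r
library(randomForest)
library(caret)
set.seed(1234)
rf_over <- randomForest(Churn ~ ., data = over_data)
confusionMatrix(predict(rf_over, newdata = test), test$Churn, positive = "1")
```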
Lentago09indivdata: Calculated from the individual response data. Measures of peak preference and selectivity for individual females. Width corresponds to tolerance. Average responsiveness corresponds to responsiveness. Coeff of variation squared refers to the strength of preference.
Ptelea09indivdata
LentPlast09responsedataxls: Raw response data collected in the lab. Female ID refers to the individual within treatment. Stimulus freq refers to the frequency of the stimulus. # responses refers to the number of times the female responded to a given stimulus frequency.
ptelea09responsedata: Raw response data collected in the lab. Female ID refers to the individual within treatment. Stimulus freq refers to the frequency of the stimulus. # responses refers to the number of times the female responded to a given stimulus frequency.
https://doi.org/10.5061/dryad.kwh70rzd7
This dataset contains the data, CAD, and simulation model generated from the experiments using the variable-stiffness morphing wheel. The presented dataset is necessary for generating figures and results for the paper titled 'Variable-stiffness morphing wheel inspired by the surface tension of a liquid drop'.
The measured data necessary to reproduce the figures in the paper titled 'Variable-stiffness morphing wheel inspired by the surface tension of a liquid drop'.
The shape and size information of the components and platform used in the paper