Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data release for the paper "Waveform systematics in identifying gravitationally lensed gravitational waves: Posterior overlap method", which is available at https://arxiv.org/abs/2306.12908.
These results are derived from the gravitational-wave parameter-estimation results by the LIGO-Virgo-KAGRA Collaboration, released with the GWTC-1, GWTC-2, GWTC-2.1, and GWTC-3 catalogs under the following links:
https://dcc.ligo.org/P1800370-v5/public
https://dcc.ligo.org/P2000223-v7/public
https://doi.org/10.5281/zenodo.6513631
https://doi.org/10.5281/zenodo.5546663
For the lensed-unlensed hypothesis test posterior overlap Bayes factors, we provide the following files for event pairs from within each observing run:
blu_all_pairs_O1.txt
blu_all_pairs_O2.txt
blu_all_pairs_O3.txt
In each file, the column "event_pair" contains the names of the two events from the pair sorted chronologically, the column "data_releases" contains the names of the data releases from which the posterior samples of each event were taken, the column "waveform" contains the name of the waveform model used in the parameter estimation for both sets of posteriors, and the column "log10blu" contains the log10 of the Bayes factors.
Differences between runs for the same event pair (restricted to O1-O1, O2-O2, and O3-O3 pairs where at least one run gave log10blu>0) are given in the file "blu_differences_pairs_with_log10blu_pos.txt". The column "event_pair" contains the event pairs, the columns "waveform_{1,2}" contain the names of the waveform models used in the parameter estimation for both sets of posteriors, the columns "data_releases_{1,2}" contain the data releases from which the posterior samples of each event were taken, the columns "log10blu_{1,2}" contain the log10 Bayes factors, and the column "difference" contains the difference between "log10blu_1" and "log10blu_2".
We also provide the following files corresponding to the appendix of the paper, analyzing overlaps between posterior samples for individual events:
overlap_different_runs.txt
overlap_same_run.txt
rescaled_difference_single_event.txt
The file "overlap_different_runs.txt" contains Bayes factors for a single event, but comparing the posteriors from different runs. The file "overlap_same_run.txt" contains Bayes factors for the overlap of a single run on a single event with itself. The file "rescaled_difference_single_event.txt" contains the difference between the results contained in the file overlap_different_runs.txt and the results in overlap_same_run.txt, taking the ones that produce the biggest difference, as per equation (A.1) in the paper.
In these files, the column "event_name" is the name of the event, the column "data_release" or "data_releases" contains the name(s) of the data release(s) from which the posterior samples of each run were taken, the column "waveform" or "waveform_pair" contains the name(s) of the waveform model(s) used, and the column "log10blu" is the log10 Bayes factor obtained. In the file "rescaled_difference_single_event.txt", the columns "max_run_waveform" and "max_run_data_release" identify an entry from the "overlap_same_run.txt" file from which we use the "log10blu" to compute the value listed in the "difference" column using equation (A.1).
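As a minimal usage sketch (assuming the files are whitespace-delimited text tables with a header row matching the columns described above; adjust the separator if the actual files differ), event pairs favouring the lensed hypothesis can be selected with pandas in Python:

import pandas as pd

# Read one of the per-run Bayes factor tables; we assume whitespace-delimited
# columns named event_pair, data_releases, waveform, log10blu as described above.
pairs = pd.read_csv("blu_all_pairs_O3.txt", sep=r"\s+")

# Keep pairs with positive log10 Bayes factor and sort by decreasing support.
favoured = pairs[pairs["log10blu"] > 0].sort_values("log10blu", ascending=False)
print(favoured[["event_pair", "waveform", "log10blu"]])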
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Open access and open data are becoming more prominent on the global research agenda. Funders are increasingly requiring grantees to deposit their raw research data in appropriate public archives or stores in order to facilitate the validation of results and further work by other researchers.
While the rise of open access has fundamentally changed the academic publishing landscape, the policies around data are reigniting the conversation about what universities can and should be doing to protect the assets generated at their institutions. The main difference between an open access policy and an open data policy is that, for data, there is no established precedent for how academia disseminates research outputs that do not take the form of a traditional ‘paper’ publication.
As governments and funders of research see the benefit of open content, the creation of recommendations, mandates and enforcement of mandates are coming thick and fast.
The minute weather dataset comes from the same source as the daily weather dataset used in the decision-tree-based classifier notebook. The main difference between the two datasets is that the minute weather dataset contains raw sensor measurements captured at one-minute intervals, whereas the daily weather dataset contained processed and well-curated data. The data are in the file minute_weather.csv, which is a comma-separated file. As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data were collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions were captured.
Each row in minute_weather.csv contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:
rowID: unique number for each row (Unit: NA)
hpwren_timestamp: timestamp of measurement (Unit: year-month-day hour:minute:second)
air_pressure: air pressure measured at the timestamp (Unit: hectopascals)
air_temp: air temperature measured at the timestamp (Unit: degrees Fahrenheit)
avg_wind_direction: wind direction averaged over the minute before the timestamp (Unit: degrees, with 0 meaning the wind comes from the north, increasing clockwise)
avg_wind_speed: wind speed averaged over the minute before the timestamp (Unit: meters per second)
max_wind_direction: highest wind direction in the minute before the timestamp (Unit: degrees, with 0 being north, increasing clockwise)
max_wind_speed: highest wind speed in the minute before the timestamp (Unit: meters per second)
min_wind_direction: smallest wind direction in the minute before the timestamp (Unit: degrees, with 0 being north, increasing clockwise)
min_wind_speed: smallest wind speed in the minute before the timestamp (Unit: meters per second)
rain_accumulation: amount of accumulated rain measured at the timestamp (Unit: millimeters)
rain_duration: length of time rain has fallen, as measured at the timestamp (Unit: seconds)
relative_humidity: relative humidity measured at the timestamp (Unit: percent)
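As a minimal usage sketch in Python (column names follow the variable list above):

import pandas as pd

# Load the one-minute weather measurements; hpwren_timestamp holds the measurement time.
weather = pd.read_csv("minute_weather.csv", parse_dates=["hpwren_timestamp"])

# Example: hourly means of air temperature and relative humidity, a typical
# first step before feeding the raw minute data into a clustering workflow.
hourly = (weather.set_index("hpwren_timestamp")[["air_temp", "relative_humidity"]]
          .resample("1h").mean())
print(hourly.head())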
🎬 Description
This dataset combines information about movies from various IMDb and Kinopoisk top lists. It was created for a comparative analysis of ratings, genres, countries of production, and other characteristics of popular films from two of the world's largest movie databases.
The dataset can be useful for researchers, analysts, and movie enthusiasts who want to explore the differences between Russian-speaking and international audiences’ preferences, as well as to identify patterns between ratings, budgets, and genres.
📂 Dataset Structure
Title – movie title (string)
kinopoiskId – unique movie identifier on Kinopoisk (integer or string)
imdbId – unique movie identifier on IMDb (string)
Year – year of release (integer)
Rating Kinopoisk – movie rating according to Kinopoisk (float from 0 to 10)
Rating Imdb – movie rating according to IMDb (float from 0 to 10)
Age Limit – age restriction (e.g., "6+", "12+", "18+")
Genres – movie genres (string or list of genres separated by commas)
Country – country or countries of production (string)
Director – name of the director (string)
Budget – movie budget in USD (integer)
Fees – box office revenue in USD (integer)
Description Kinopoisk – short movie description from Kinopoisk (in Russian)
Description Imdb – short movie description from IMDb (in English)
📊 Possible Analysis Directions
Comparing movie ratings between Kinopoisk and IMDb;
Analyzing the most popular genres and their evolution over time;
Studying the relationship between ratings, budgets, and box office revenue;
Comparing audience preferences across different countries.
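As a minimal sketch of the first analysis direction in Python (the file name is a placeholder; the column names follow the structure listed above):

import pandas as pd

# Load the dataset; "kinopoisk_imdb_movies.csv" is a hypothetical file name.
movies = pd.read_csv("kinopoisk_imdb_movies.csv")

# Example: mean rating difference between Kinopoisk and IMDb by release year.
movies["rating_diff"] = movies["Rating Kinopoisk"] - movies["Rating Imdb"]
print(movies.groupby("Year")["rating_diff"].mean().sort_index())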
https://spdx.org/licenses/etalab-2.0.html
In the frame of the QUAE project, an identification procedure was developed to sort singular behaviours in river temperature time series. This procedure was conceived as a tool to identify particular behaviours in time series despite non-continuous measurements and regardless of the type of measurement (temperature, streamflow, ...). Three types of singularities are identified: extreme values (in some cases similar to outliers), roughened data (such as the difference between water temperature and air temperature) and buffered data (such as signals caused by groundwater inflows).
This archive contains the logistic mapping output data at the conceptual well locations. Data are provided in spreadsheets containing the estimated probabilities of nitrate concentrations greater than 2 milligrams per liter at hypothetical 150-foot and 300-foot-deep wells for each of the five-year categories from 2000 to 2019, and vulnerability differences between five-year categories when one or both of the predicted probabilities was equal to or greater than 50 percent.
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amounts of flight operational data are downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task, not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
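The paper's algorithm itself is not reproduced in this description; purely as a loose, generic illustration of the idea of centralizing only a small sample from each location, the following Python sketch uses a simple Mahalanobis-distance detector and made-up site data (none of this is the paper's method):

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for data partitions held at four different locations.
sites = [rng.normal(0, 1, size=(100_000, 3)) for _ in range(4)]

# Each site sends only a small random sample to the coordinator.
samples = [s[rng.choice(len(s), size=500, replace=False)] for s in sites]
pooled = np.vstack(samples)

# The coordinator builds a simple global reference (mean and covariance)
# from the pooled sample and broadcasts it back to the sites.
mean = pooled.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(pooled, rowvar=False))

def mahalanobis(x):
    # Distance of each row of x from the global reference.
    d = x - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", d, cov_inv, d))

# Each site flags its own outliers locally against the shared reference,
# without ever shipping the full data to one place. Threshold is illustrative.
threshold = 4.0
local_outliers = [np.where(mahalanobis(s) > threshold)[0] for s in sites]
print([len(o) for o in local_outliers])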
The ability to grow safe, fresh food to supplement packaged foods of astronauts in space has been an important goal for NASA. Food crops grown in space experience different environmental conditions than plants grown on Earth (e.g., reduced gravity, elevated radiation levels). To study the effects of space conditions, red romaine lettuce, Lactuca sativa cv ‘Outredgeous,’ plants were grown in Veggie plant growth chambers on the International Space Station (ISS) and compared with ground-grown plants. Multiple plantings were grown on ISS and harvested using either a single, final harvest, or sequential harvests in which several mature leaves were removed from the plants at weekly intervals. Ground controls were grown simultaneously with a 24–72 h delay using ISS environmental data. Food safety of the plants was determined by heterotrophic plate counts for bacteria and fungi, as well as isolate identification using samples taken from the leaves and roots. Molecular characterization was conducted using Next Generation Sequencing (NGS) to provide taxonomic composition and phylogenetic structure of the community. Leaves were also analyzed for elemental composition, as well as levels of phenolics, anthocyanins, and Oxygen Radical Absorbance Capacity (ORAC). Comparison of flight and ground tissues showed some differences in total counts for bacteria and yeast/molds (2.14 – 4.86 log10 CFU/g), while screening for select human pathogens yielded negative results. Bacterial and fungal isolate identification and community characterization indicated variation in the diversity of genera between leaf and root tissue, with diversity being higher in root tissue, and included differences in the dominant genera. The only difference between ground and flight experiments was seen in the third experiment, VEG-03A, with significant differences in the genera from leaf tissue. Flight and ground tissue showed differences in Fe, K, Na, P, S, and Zn content and total phenolic levels, but no differences in anthocyanin and ORAC levels. This study indicated that leafy vegetable crops can produce safe, edible, fresh food to supplement the astronauts’ diet, and provides baseline data for continual operation of the Veggie plant growth units on ISS.
This dataset contains all data and code necessary to reproduce the analysis presented in the manuscript: Winzeler, H.E., Owens, P.R., Read, Q.D., Libohova, Z., Ashworth, A., Sauer, T. 2022. Topographic wetness index as a proxy for soil moisture in a hillslope catena: flow algorithms and map generalization. Land 11:2018. DOI: 10.3390/land11112018.
There are several steps to this analysis; the relevant scripts for each are listed below. The first step is to use the raw digital elevation model (DEM) to produce different versions of the topographic wetness index (TWI) for the study region (Calculating TWI). Then, these TWI output files are processed, along with soil moisture (volumetric water content, VWC) time series data from a number of sensors located within the study region, to create analysis-ready data objects (Processing TWI and VWC). Next, models are fit relating TWI to soil moisture (Model fitting) and results are plotted (Visualizing main results). A number of additional analyses were also done (Additional analyses).
Input data
The DEM of the study region is archived in this dataset as SourceDem.zip. This contains the DEM of the study region (DEM1.sgrd) and associated auxiliary files, all called DEM1.* with different extensions. In addition, the DEM is provided as a .tif file called USGS_one_meter_x39y400_AR_R6_WashingtonCO_2015.tif. The remaining data and code files are archived in the repository created with a GitHub release on 2022-10-11, twi-moisture-0.1.zip. The data are found in a subfolder called data.
2017_LoggerData_HEW.csv through 2021_HEW.csv: soil moisture (VWC) logger data for each year 2017-2021 (5 files total).
2882174.csv: weather data from a nearby station.
DryPeriods2017-2021.csv: starting and ending days for dry periods 2017-2021.
LoggerLocations.csv: geographic locations and metadata for each VWC logger.
Logger_Locations_TWI_2017-2021.xlsx: 546 topographic wetness indexes calculated at each VWC logger location. Note: this is intermediate input created in the first step of the pipeline.
Code pipeline
To reproduce the analysis in the manuscript, run these scripts in the following order. The scripts are all found in the root directory of the repository. See the manuscript for more details on the methods.
Calculating TWI
TerrainAnalysis.R: taking the DEM file as input, calculates 546 different topographic wetness indexes using a variety of different algorithms. Each algorithm is run multiple times with different input parameters, as described in more detail in the manuscript. After performing this step, it is necessary to use the SAGA-GIS GUI to extract the TWI values for each of the sensor locations. The output generated in this way is included in this repository as Logger_Locations_TWI_2017-2021.xlsx; therefore it is not necessary to rerun this step of the analysis, but the code is provided for completeness.
Processing TWI and VWC
read_process_data.R: takes raw TWI and moisture data files and processes them into analysis-ready format, saving the results as CSV.
qc_avg_moisture.R: does additional quality control on the moisture data and averages it across different time periods.
Model fitting
Models were fit regressing soil moisture (average VWC for a certain time period) against a TWI index, with and without soil depth as a covariate. In each case, for both the model without depth and the model with depth, prediction performance was calculated with and without spatially-blocked cross-validation.
Where cross-validation wasn't used, we simply used the predictions from the model fit to all the data.
fit_combos.R: models were fit to each combination of soil moisture averaged over 57 months (all months from April 2017 to December 2021) and 546 TWI indexes. In addition, models were fit to soil moisture averaged over years, and to the grand mean across the full study period.
fit_dryperiods.R: models were fit to soil moisture averaged over previously identified dry periods within the study period (each 1 or 2 weeks in length), again for each of the 546 indexes.
fit_summer.R: models were fit to the soil moisture average for the months of June-September for each of the five years, again for each of the 546 indexes.
Visualizing main results
Preliminary visualization of results was done in a series of RMarkdown notebooks. All the notebooks follow the same general format, plotting model performance (observed-predicted correlation) across different combinations of time period and characteristics of the TWI indexes being compared. The indexes are grouped by SWI versus TWI, DEM filter used, flow algorithm, and any other parameters that varied. The notebooks show the model performance metrics with and without the soil depth covariate, and with and without spatially-blocked cross-validation. Crossing those two factors, there are four values of model performance for each combination of time period and TWI index presented.
performance_plots_bymonth.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by month across the five years of data to show within-year trends.
performance_plots_byyear.Rmd: using the results from the models fit to each month of data separately, prediction performance was averaged by year to show trends across multiple years.
performance_plots_dry_periods.Rmd: prediction performance was presented for the models fit to the previously identified dry periods.
performance_plots_summer.Rmd: prediction performance was presented for the models fit to the June-September moisture averages.
Additional analyses
Some additional analyses were done that may not be published in the final manuscript but are included here for completeness.
2019dryperiod.Rmd: analysis, done separately for each day, of a specific dry period in 2019.
alldryperiodsbyday.Rmd: analysis, done separately for each day, of the same dry periods discussed above.
best_indices.R: after fitting models, this script was used to quickly identify some of the best-performing indexes for closer scrutiny.
wateryearfigs.R: exploratory figures showing the median and quantile interval of VWC for sensors in low and high TWI locations for each water year.
Resources in this dataset:
Resource Title: Digital elevation model of study region. File Name: SourceDEM.zip. Resource Description: .zip archive containing digital elevation model files for the study region. See dataset description for more details.
Resource Title: twi-moisture-0.1: archived git repository containing all other necessary data and code. File Name: twi-moisture-0.1.zip. Resource Description: .zip archive containing all data and code, other than the digital elevation model archived as a separate file. This file was generated by a GitHub release made on 2022-10-11 of the git repository hosted at https://github.com/qdread/twi-moisture (private repository). See the dataset description and the README file contained within this archive for more details.
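The modelling itself is implemented in the R scripts listed above. Purely as an illustrative sketch of the core evaluation step (fitting soil moisture against a TWI index plus soil depth and scoring by the observed-predicted correlation under spatially-blocked cross-validation), here is a minimal Python analogue using made-up column names and synthetic values, not the repository's code:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

# Hypothetical analysis-ready table: one row per sensor, with a TWI value,
# sensor depth, average VWC for one time period, and a spatial block label.
df = pd.DataFrame({
    "twi":   np.random.rand(60) * 10,
    "depth": np.random.choice([10, 30, 50], size=60),
    "vwc":   np.random.rand(60) * 0.4,
    "block": np.repeat(np.arange(6), 10),   # spatial blocks for cross-validation
})

X, y, groups = df[["twi", "depth"]], df["vwc"], df["block"]

# Spatially-blocked cross-validation: each block is held out in turn.
pred = cross_val_predict(LinearRegression(), X, y, groups=groups,
                         cv=LeaveOneGroupOut())

# Model performance as the observed-predicted correlation.
print("obs-pred r:", np.corrcoef(y, pred)[0, 1])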
https://spdx.org/licenses/CC0-1.0.html
Background
The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods
This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results
The method we define as SeSAMe 2, which consists of the regular SeSAMe pipeline with an additional round of QC (pOOBAH masking), was found to be the best-performing normalization method, while quantile-based methods were the worst performing. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).
Methods
Study Participants and Samples
The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a census-based cohort of elderly residents of the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points, for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 as part of a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two, and comprised 13 men and 11 women.
All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards (COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685).
Blood Collection and Processing
Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point) due to discontinuation of the equipment, but using the same commercial reagents. DNA was quantified using a NanoDrop spectrophotometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 of the 48 samples, for a total of 64 samples submitted for further analyses. Whole-genome sequencing (WGS) data are also available for the samples described above.
Characterization of DNA Methylation using the EPIC array
Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).
Processing and Analysis of DNA Methylation Data
The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes whose target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes for which more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having a low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05 and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the resulting probes flagged in the two analyses were combined and removed from the data.
Normalization Methods Evaluated
The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the Raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested; for both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first scenario, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were the ones that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This was followed by further removal of probes that did not pass the previous QC and had not already been removed by pOOBAH; therefore, SeSAMe 2 has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effects that the different normalization methods had on the absolute difference of beta values (|Δβ|) between replicated samples.
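The original comparison was carried out with the R/Bioconductor packages listed above. As a schematic Python sketch of the replicate-agreement metric (per-probe absolute beta-value difference between technical replicate pairs), using random placeholder values in place of real normalized beta matrices:

import numpy as np
import pandas as pd

# Two beta-value matrices (probes x samples), one column per member of each of
# the 16 replicate pairs; random values stand in for normalized betas here.
probes = [f"cg{i:08d}" for i in range(1000)]
rep1 = pd.DataFrame(np.random.rand(1000, 16), index=probes)
rep2 = pd.DataFrame(np.random.rand(1000, 16), index=probes)

# Absolute beta-value difference per probe, averaged over the replicate pairs.
# Lower values indicate better agreement; this quantity is compared across
# normalization methods.
abs_diff = (rep1 - rep2).abs().mean(axis=1)
print(abs_diff.describe())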
A performance comparison between data structures.
Background
Post-exercise muscle soreness is a dull, aching sensation that follows unaccustomed muscular exertion. Primarily on the basis of previous laboratory-based research on eccentric exercise, soreness is usually said to follow an inverted U-shaped curve over time, peaking 24-48 hours after exercise. As such, it is often described as "delayed-onset" muscle soreness. In a study of long-distance runners, however, soreness seemed to peak immediately and then reduce gradually over time. This study is a secondary analysis of clinical trial data that aims to determine whether the time course of soreness following a natural exercise, long-distance running, is different from that following a laboratory-based exercise, bench-stepping.
Methods
This is a reanalysis of data from three previous clinical trials. The trials included 400 runners taking part in long-distance races and 82 untrained volunteers performing a bench-stepping test. Subjects completed a Likert scale of muscle soreness every morning and evening for the five days following their exercise.
Results
The interaction between trial and time is highly significant, suggesting a different time course of soreness following running and bench-stepping. 45% of subjects in the bench-stepping trial experienced peak soreness at the third or fourth follow-up (approximately 36-48 hours after exercise), compared to only 14% of those in the running trial. The difference between groups is robust to multivariate analysis incorporating possible confounding variables.
Conclusion
Soreness in runners following long-distance running follows a different time course to that in untrained individuals undertaking bench-stepping. Research on exercise taking place in the laboratory context does not necessarily generalize to exercise undertaken by trained athletes when engaged in their chosen sport.
Journal policies: a file giving the data archiving policies from the journals covered in the study.
Data request protocol: the sequence of emails used to request data from authors.
Vines_et_al_Rcode_4th_Jan: the R code used in the statistical analyses.
Vinesetal_data_4th Jan: the data used in the statistical analyses.
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Priest Map Series {title at top of page}
Data developers: Burhans, Molly A., Cheney, David M., Emege, Thomas, Gerlt, R. “Priest Map Series {title at top of page}”. Scale not given. Version 1.0. MO and CT, USA: GoodLands Inc., Catholic Hierarchy, Environmental Systems Research Institute, Inc., 2019.
Web map developer: Molly Burhans, October 2019
Web app developer: Molly Burhans, October 2019
GoodLands’ polygon data layers, version 2.0, for global ecclesiastical boundaries of the Roman Catholic Church:
Although care has been taken to ensure the accuracy, completeness and reliability of the information provided, because this is the first developed dataset of global ecclesiastical boundaries curated from many sources, it may have a higher margin of error than established geopolitical administrative boundary maps. Boundaries need to be verified with appropriate ecclesiastical leadership. The current information is subject to change without notice. No parties involved with the creation of this data are liable for indirect, special or incidental damage resulting from, arising out of or in connection with the use of the information.
We referenced 1960 sources to build our global datasets of ecclesiastical jurisdictions. Often, they were isolated images of dioceses, historical documents and information about parishes that were cross-checked. These sources can be viewed here: https://docs.google.com/spreadsheets/d/11ANlH1S_aYJOyz4TtG0HHgz0OLxnOvXLHMt4FVOS85Q/edit#gid=0
To learn more or contact us, please visit: https://good-lands.org/
The Catholic Leadership global maps information is derived from the Annuario Pontificio, which is curated and published annually by the Vatican Statistics Office and digitized by David Cheney at Catholic-Hierarchy.org; updates are supplemented with diocesan and news announcements. GoodLands maps this information into global ecclesiastical boundaries.
Admin 3 Ecclesiastical Territories:
Burhans, Molly A., Cheney, David M., Gerlt, R. “Admin 3 Ecclesiastical Territories For Web”. Scale not given. Version 1.2. MO and CT, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2019.
Derived from Global Diocesan Boundaries:
Burhans, M., Bell, J., Burhans, D., Carmichael, R., Cheney, D., Deaton, M., Emge, T., Gerlt, B., Grayson, J., Herries, J., Keegan, H., Skinner, A., Smith, M., Sousa, C., Trubetskoy, S. “Diocesean Boundaries of the Catholic Church” [Feature Layer]. Scale not given. Version 1.2. Redlands, CA, USA: GoodLands Inc., Environmental Systems Research Institute, Inc., 2016.
Using: ArcGIS 10.4. Version 10.0. Redlands, CA: Environmental Systems Research Institute, Inc., 2016.
Boundary Provenance, Statistics and Leadership Data:
Cheney, D.M. “Catholic Hierarchy of the World” [Database]. Date updated: August 2019. Catholic Hierarchy. Using: Paradox. Retrieved from original source.
Annuario Pontificio per l’Anno .. Città del Vaticano: Tipografia Poliglotta Vaticana, multiple years.
The data for these maps was extracted from the gold standard of Church data, the Annuario Pontificio, published yearly by the Vatican. The collection and data development methods of the Vatican Statistics Office are unknown. GoodLands is not responsible for errors within this data. We encourage people to document and report errant information to us at data@good-lands.org or directly to the Vatican. Additional information about regular changes in bishops and sees comes from a variety of public diocesan and news announcements.
Cyclistic, a bike-sharing company, wants to analyze their user data to find the main differences in behavior between their two types of users: Casual Riders, who pay for each ride, and Annual Members, who pay a yearly subscription to the service.
Key objectives: 1. Identify The Business Task: Cyclistic wants to analyze the data to find the key differences between Casual Riders and Annual Members. The goal of this project is to reach out to casual riders and incentivize them to pay for the annual subscription.
Key objectives: 1. Download Data And Store It Appropriately: I downloaded the data as .csv files, which were saved in their own folder to keep everything organized. I then uploaded those files into BigQuery for cleaning and analysis. For this project I downloaded all of 2022 and up to May of 2023, as this is the most recent data I have access to.
Identify How It's Organized
Sort and Filter The Data and Determine The Credibility of The Data
Key objectives: 1. Clean The Data and Prepare The Data For Analysis: I used some simple SQL code to determine that no members were missing, that no information was repeated, and that there were no misspellings in the data.
-- No misspellings in either "member" or "casual"; this ensures that no results will have missing information.
SELECT
DISTINCT member_casual
FROM
table
-- This shows how many casual riders and members used the service; the counts should add up to the number of rows in the dataset.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM table
GROUP BY member_type

-- Shows that every ride has a distinct ID.
SELECT DISTINCT ride_id FROM table

-- Shows that there are no typos in the types of bikes, so no data will be missing from results.
SELECT DISTINCT rideable_type FROM table
Key objectives: 1. Aggregate Your Data So It's Useful and Accessible: I had to write some SQL code to combine all the data from the different files I had uploaded to BigQuery.
SELECT rideable_type, started_at, ended_at, member_casual FROM table_1 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_2 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_3 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_4 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_5 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_6 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_7 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_8 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_9 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_10 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_11 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_12 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_13 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_14 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_15 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_16 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_17
-- This shows how many casual riders and annual members used bikes.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM aggregate_data_table
GROUP BY member_type
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How should one run in order to catch a projectile, such as a baseball, that is flying through the air for a long period of time? The question of the best solution to the ball-catching problem has been the subject of intense scientific debate for almost 50 years. It turns out that this debate is not focused on the ball-catching problem alone, but revolves around the research question of what constitutes the ingredients of intelligent decision making. Over time, two opposing views have emerged: the generalist view, which regards intelligence as the ability to solve any task without knowing goal and environment in advance, based on optimal decision making using predictive models; and the specialist view, which argues that intelligent decision making does not have to be based on predictive models and does not even have to be optimal, advocating simple and efficient rules of thumb (heuristics) as superior for enabling accurate decisions. We study two types of approaches to the ball-catching problem, one for each view, and investigate their properties using both a theoretical analysis and a broad set of simulation experiments. Our study shows that neither of the two types of approaches can be regarded as superior in solving all relevant variants of the ball-catching problem: each approach is optimal under a different realistic environmental condition. Therefore, predictive models neither guarantee nor prevent success a priori, and we further show that the key difference between the generalist and the specialist approach to ball catching is the type of input representation used to control the agent. From this finding, we conclude that the right solution to a decision-making or control problem is orthogonal to the generalist and specialist approach, and thus requires a reconciliation of the two views in favor of a representation-centric view.
Pulse wave velocity (PWV) has been recommended as an arterial damage assessment tool and a surrogate of arterial stiffness. However, current technology does not allow PWV to be measured both continuously and in real time. We reported previously that peripherally measured ejection time (ET) overestimates ET measured centrally; this difference in ET is associated with the inherent vascular properties of the vessel. In the current study we examined ETs derived from plethysmography simultaneously at different peripheral locations and examined the influence of the underlying arterial properties on ET prolongation by changing the subject’s position. We calculated the ET difference between two peripheral locations (ΔET) and its corresponding PWV for the same heartbeat. The ΔET increased with a corresponding decrease in PWV. The difference between ΔET in the supine and standing positions (which we call the ET index) was higher in young subjects with low mean arterial pressure and low PWV. These results suggest that the difference in ET between two peripheral locations in the supine vs. standing positions represents the underlying vascular properties. We propose ΔET in the supine position as a potential novel real-time, continuous and non-invasive parameter of vascular properties, and the ET index as a potential non-invasive parameter of vascular reactivity.
The NAMMA Lightning ZEUS dataset consists of World-ZEUS Long Range Lightning Monitoring Network data obtained from radio atmospheric signals recorded at thirteen ground stations spread across the European and African continents and Brazil, from August 1, 2006 to October 1, 2006. Lightning activity occurring over a large part of the globe is continuously monitored at varying spatial accuracy (e.g. 10-20 km within and >50 km outside the network periphery) and high temporal (1 msec) resolution. Timing is determined from the arrival time difference between the time series recorded by pairs of receivers. These data files were generated during support of the NASA African Monsoon Multidisciplinary Analyses (NAMMA) campaign, a field research investigation sponsored by the Science Mission Directorate of the National Aeronautics and Space Administration (NASA). This mission was based in the Cape Verde Islands, 350 miles off the coast of Senegal in west Africa. Commencing in August 2006, NASA scientists employed surface observation networks and aircraft to characterize the evolution and structure of African Easterly Waves (AEWs) and Mesoscale Convective Systems over continental western Africa, and their associated impacts on regional water and energy budgets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Motivation
Song’s Vegetation Continuous Fields (VCF) product, based on AVHRR satellite data, is the longest time series of its type, but lacks updates past 2016 due to the extensive degradation of the sensor. We used machine learning to extend this time series using data from the Copernicus Land Cover dataset, which provides per-pixel proportions of different land cover classes between 2015 and 2019. In addition, we included MODIS VCF data.
Content
This repository contains the infrastructure used to model Song-like VCF data past 2016. This infrastructure includes a yaml file that configures the modelling framework (e.g. variables, directories, hyper-parameter tuning) and interacts with a standardized folder structure.
Modelling approach
Song's VCF dataset includes data on generic categories, namely “tree cover”, “non-tree vegetation”, and “non vegetated”. Given the Copernicus dataset has a higher thematic detail, we first aggregated these data into comparable classes. We created a “Non-tree vegetation” layer (i.e. total per-pixel proportion of crops, grasses, shrubs, and mosses), and a “Non Vegetated” layer (i.e. total per-pixel proportion of bare land, permanent water, urban, and snow). Independent data on “Tree cover” was already present.
We then constructed a Random Forest Regression (RFReg) model to predict Song-like VCF layers between 2016 and 2019. The predictions were informed by variables on topography, climate, and fires (which limit the density of vegetation), and by variables on differences between the Copernicus VCF and MODIS-based VCF data. Because MODIS data is available past 2016, its inclusion informs our models on how MODIS data, and their differences compared to Copernicus data, relate to the values reported in Song's data.
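The actual model configuration lives in the repository's yaml file. Purely as an illustrative sketch of the kind of model involved (a random forest regression on per-pixel predictors), with synthetic stand-in data and made-up predictor names, not the repository's code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sample table: one row per sampled pixel, with
# illustrative predictors (topography, climate, fires, Copernicus/MODIS VCF
# differences) and the Song-like "tree cover" fraction as the target.
rng = np.random.default_rng(1)
n = 5000
samples = pd.DataFrame({
    "elevation": rng.uniform(0, 3000, n),
    "mean_temp": rng.uniform(-5, 30, n),
    "burned_frac": rng.uniform(0, 1, n),
    "copernicus_tree": rng.uniform(0, 100, n),
    "modis_tree_diff": rng.normal(0, 10, n),
})
samples["tree_cover"] = (0.8 * samples["copernicus_tree"]
                         + 0.5 * samples["modis_tree_diff"]
                         + rng.normal(0, 5, n)).clip(0, 100)

X = samples.drop(columns="tree_cover")
y = samples["tree_cover"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print("held-out R2:", rf.score(X_test, y_test))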
Sampling scheme
For each VCF category, we collected samples on a country-by-country basis. Within each country, we estimated the difference in percent cover between Song's and the Copernicus VCF data, and sampled across a gradient of differences, from -100% (no cover in AVHRR and full cover in Copernicus) to +100% (full cover in AVHRR and no cover in Copernicus). We iterated through this range in intervals of 10% and sampled across a gradient of “tree cover”, “non-tree vegetation”, and “non vegetated”, in intervals of 10% from 0% to 100%. We collected at least one sample per 50 km2 in 2016, the last year in which all VCF-related variables (Song's, Copernicus, MODIS) are available simultaneously. The number of samples attributed to each range of differences is proportional to the area covered by that range within the country of reference. The sampling approach was repeated for each VCF class, and the outputs were later combined into a single set of samples excluding duplicates, resulting in 238,052 samples.
Validation
The model outputs were validated using leave-one-out cross-validation. For each VCF class, the validation framework iterates through each country where samples were collected, excluding it for validation and using the remaining samples to train an RFReg model. This resulted in R2 values of 0.91, 0.87 and 0.91 for “tree cover”, “non-tree vegetation”, and “non vegetated”, respectively, with corresponding RMSE values of 2.31%, 3.05%, and 2.25%.
The model was also applied to data from 2015, which was used neither to train nor to validate our models. A comparison of the 2015 Song data against our predictions, covering 8,764,232 pixels, yielded R2 values of 0.94, 0.91, and 0.97, with RMSE values of 6.65%, 8.92%, and 5.96%. Additionally, we compared changes between 2015 and 2016, resulting in RMSE values of 2.83%, 3.69%, and 2.57%.
Post-processing
When observing annual VCF time series based on Song's data, we noted that our predictions were most plausible for “tree cover” and “non-tree vegetation”. In turn, our “non vegetated” predictions appear underestimated (see "temporal_trend_check.png"), showing large year-to-year decreases in cover (-3.05% between 2016 and 2017, compared to -0.14% for “tree cover” and -0.26% for “non-tree vegetation”). To address this issue, we recommend deriving the “non vegetated” cover by computing the difference between 100% and the sum of “tree cover” and “non-tree vegetation”.
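Concretely, the recommended derivation is a simple per-pixel identity; as a small sketch with illustrative values:

import numpy as np

# Illustrative per-pixel cover fractions (percent) predicted by the models.
tree_cover = np.array([55.0, 10.0, 0.0])
non_tree_vegetation = np.array([30.0, 70.0, 5.0])

# Recommended derivation of the "non vegetated" fraction, clipped to [0, 100]
# in case the two predicted fractions slightly overshoot 100%.
non_vegetated = np.clip(100.0 - (tree_cover + non_tree_vegetation), 0.0, 100.0)
print(non_vegetated)  # [15. 20. 95.]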
By City of Chicago
This public health dataset contains a comprehensive selection of indicators related to natality, mortality, infectious disease, lead poisoning, and economic status for Chicago community areas. It is an invaluable resource for those interested in understanding the current state of public health within each area, in order to identify deficiencies or areas needing improvement.
The data include 27 indicators, such as birth and death rates, percentages of prenatal care beginning in the first trimester, preterm birth rates, breast cancer incidence per hundred thousand female population, all-sites cancer rates per hundred thousand population, and more. Each indicator is reported by geographical region, so analyses can be made regarding trends at a local level. The dataset also allows various stakeholders to measure performance along these indicators or to compare different community areas side by side.
This dataset provides a valuable tool for those striving toward better public health outcomes for the citizens of Chicago's communities, allowing greater insight into trends specific to geographic regions that could lead to further research and implementation practices based on empirical evidence gathered from this comprehensive yet digestible selection of indicators.
In order to use this dataset effectively to assess the public health of a given area or areas in the city:
- Understand which data are available: the list of data included in this dataset can be found above. It is important to know all of the indicators, as well as their definitions, so that accurate conclusions can be drawn when using the data for research or analysis.
- Identify areas of interest: once you are familiar with the available data, identify which community areas you would like to study more closely or compare with one another.
- Choose your variables: once you have identified your areas, decide which variables are most relevant to your studies and frame specific questions about these variables based on what you are trying to learn from this data set.
- Analyze the data: once your variables have been selected and clarified, dive into analyzing the corresponding values across different community areas using statistical tests such as t-tests or correlations. This will help answer questions like “Are there significant differences between two outputs?”, allowing you to compare how different Chicago community areas stack up against each other with regard to the public health statistics tracked by this dataset.
- Creating interactive maps that show data on public health indicators by Chicago community area to allow users to explore the data more easily.
- Designing a machine learning model to predict future variations in public health indicators by Chicago community area such as birth rate, preterm births, and childhood lead poisoning levels.
- Developing an app that enables users to search for public health information in their own community areas and compare with other areas within the city or across different cities in the US
If you use this dataset in your research, please credit the original authors.
See the dataset description for more information.
File: public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv

| Column name | Description |
|:---|:---|
| Community Area | Unique identifier for each community area in Chicago. (Integer) |
| Community Area Name | Name of the community area in Chicago. (String) |
| Birth Rate | Number of live births per 1,000 population. (Float) |
| General Fertility Rate | Number of live births per 1,000 women aged 15-44. (Float) |
...
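As a minimal usage sketch in Python (column names follow the data dictionary above):

import pandas as pd

# Load the indicator table.
csv = "public-health-statistics-selected-public-health-indicators-by-chicago-community-area-1.csv"
health = pd.read_csv(csv)

# Example: the five community areas with the highest birth rate, and the
# city-wide correlation between birth rate and general fertility rate.
top5 = health.nlargest(5, "Birth Rate")[["Community Area Name", "Birth Rate"]]
print(top5)
print(health["Birth Rate"].corr(health["General Fertility Rate"]))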