This dataset consists of 50 stories about bad (research) data management, based on true events. The stories were collected and adapted by the Thuringian Competence Network for Research Data Management and were used in 2020 as the opening of the RDM Days and of the Data Horror Week. The text of the stories is free to use under the CC0 licence. This does not include the illustrations, which were used on the webpage or in the card game to visualize the stories. Website with all illustrated stories: Link
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)
In all data sets, missing values are coded as "NA".
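Because the plain-text ".dat" files code missing values as "NA", they can also be read outside of R. A minimal sketch in Python/pandas, using a made-up three-row excerpt in place of the real files (which accompany the book chapter):

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the plain-text ".dat" layout described above;
# the real files are distributed with the book chapter materials.
dat = io.StringIO(
    "ID x y w\n"
    "1 0.52 1.10 0\n"
    "1 NA 0.87 0\n"
    "2 -0.33 NA 1\n"
)

# "NA" is the missing-value code in all data sets.
df = pd.read_csv(dat, sep=r"\s+", na_values="NA")

print(df.isna().sum().sum())    # number of missing cells
print(df.groupby("ID").size())  # observations per group
```

The `.rda` files are read natively in R with `load()`; the sketch above only covers the plain-text variant.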
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Source Huggingface Hub: link
About this dataset The HANS dataset is an NLI evaluation set that tests specific hypotheses about invalid heuristics that NLI models are likely to learn.
Columns

File: validation.csv

premise: The premise of the example. (string)
hypothesis: The hypothesis of the example. (string)
label: The label of the example. (string)
parse_premise: The parse of the premise. (string)
parse_hypothesis: The parse of the hypothesis. (string)
binary_parse_premise: The binary parse of the premise. (string)
binary_parse_hypothesis: The binary parse of the hypothesis. (string)
heuristic: The heuristic that the example is based on. (string)
subcase: The subcase of the heuristic that the example is based on. (string)
template: The template that the example is based on. (string)

File: train.csv

train.csv contains the same columns as validation.csv (premise, hypothesis, label, parse_premise, parse_hypothesis, binary_parse_premise, binary_parse_hypothesis, heuristic, subcase, template).
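Given these columns, HANS is typically analysed per heuristic: a model's accuracy on entailment versus non-entailment cases within each heuristic reveals whether it relies on that heuristic. A minimal pandas sketch using a tiny hand-made stand-in for validation.csv (the rows and subcase names are illustrative, not taken from the file):

```python
import io
import pandas as pd

# Tiny stand-in with a subset of the columns listed above;
# the real rows come from validation.csv in the dataset.
csv = io.StringIO(
    "premise,hypothesis,label,heuristic,subcase\n"
    "The doctor saw the lawyer.,The lawyer saw the doctor.,non-entailment,lexical_overlap,subject_object_swap\n"
    "The judge and the actor ran.,The judge ran.,entailment,lexical_overlap,conjunction\n"
)
df = pd.read_csv(csv)

# Count examples per (heuristic, label) cell; model accuracy would be
# aggregated over the same grouping.
by_heuristic = df.groupby(["heuristic", "label"]).size()
print(by_heuristic)
```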
CC0
Original Data Source: HANS (Invalid NLI Heuristics Benchmark)
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This dataset includes information about a sample of 8,887 Open Educational Resources (OERs) from the SkillsCommons website. It contains title, description, URL, type, availability date, issued date, subjects, and the availability of the following metadata: level, time_required to finish, and accessibility.
This dataset has been used to build a metadata scoring and quality prediction model for OERs.
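As one illustration of metadata scoring, a toy completeness measure over the three optional fields named above (level, time_required, accessibility). The scoring model in the underlying work is more elaborate; the field names and record layout here are assumptions for the sketch:

```python
# Illustrative completeness score: the fraction of the optional metadata
# fields present and non-empty for an OER record. (Hypothetical layout;
# not the paper's actual scoring model.)
OPTIONAL_FIELDS = ("level", "time_required", "accessibility")

def metadata_completeness(record: dict) -> float:
    present = sum(1 for f in OPTIONAL_FIELDS if record.get(f))
    return present / len(OPTIONAL_FIELDS)

oer = {
    "title": "Intro to Welding",
    "level": "Beginner",
    "time_required": None,      # missing
    "accessibility": "checked",
}
print(metadata_completeness(oer))  # 2 of 3 optional fields present
```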
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Login Data Set for Risk-Based Authentication
Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
This data set aims to foster research and development on Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
WARNING: The feature values are plausible, but still entirely artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.
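For intuition only, here is a toy sketch of the likelihood-ratio idea behind models like Freeman et al. (2016): a login attempt scores as riskier the more unusual its feature values are for that particular user, relative to their global frequency. This is not the authors' implementation; the smoothing and feature handling are deliberate simplifications:

```python
from collections import Counter

def risk_score(features, user_history, global_history, smoothing=1.0):
    """Toy likelihood-ratio score: product over features of
    p(value globally) / p(value for this user), with additive smoothing.
    Higher = more unusual for the user = riskier.
    (Illustrative only; the Freeman et al. model is more elaborate.)"""
    score = 1.0
    for name, value in features.items():
        g = Counter(global_history[name])
        u = Counter(user_history[name])
        p_global = (g[value] + smoothing) / (sum(g.values()) + smoothing * (len(g) + 1))
        p_user = (u[value] + smoothing) / (sum(u.values()) + smoothing * (len(u) + 1))
        score *= p_global / p_user
    return score

# Hypothetical histories: this user always logs in from Norway on desktop.
user = {"country": ["NO"] * 50, "device_type": ["desktop"] * 50}
world = {"country": ["NO"] * 70 + ["US"] * 30,
         "device_type": ["desktop"] * 60 + ["mobile"] * 40}

usual = risk_score({"country": "NO", "device_type": "desktop"}, user, world)
unusual = risk_score({"country": "US", "device_type": "mobile"}, user, world)
print(usual, unusual)  # an atypical login scores higher (riskier)
```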
Overview
The data set contains the following features related to each login attempt on the SSO:
Feature | Data Type | Description | Range or Example
IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255
Country | String | Country derived from the IP address | US
Region | String | Region derived from the IP address | New York
City | String | City derived from the IP address | Rochester
ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000
User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ...
OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10
Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538
Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [1]
User ID | Integer | Identification number related to the affected user account | [Random pseudonym]
Login Timestamp | Integer | Timestamp related to the login attempt | [64-bit timestamp]
Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000
Login Successful | Boolean | True: login was successful; False: login failed | (true, false)
Is Attack IP | Boolean | IP address was found in a known attacker data set | (true, false)
Is Account Takeover | Boolean | Login attempt was identified as an account takeover by the incident response team of the online service | (true, false)
Data Creation
As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All other data was randomly generated while maintaining the logical relations and temporal order between the features.
The timestamps, however, are not identical and contain randomness. The feature values related to the IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries are probably in other positions than in the original data set.
The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations of the original data set.
The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS, the browser information. From this information, we randomly generated the user agent string. Therefore, all logical relations regarding the user agent are identical to those in the real data set.
The RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.
Regarding the Data Values
Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effect on the risk scores generated by the Freeman et al. (2016) model.
You can recognize them by the following values:
ASNs with values >= 500,000
IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
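These placeholder values can be flagged with a few lines of Python, assuming the IP address and ASN are available per record:

```python
import ipaddress

# Flags the artificial placeholder entries described above:
# ASNs >= 500,000 and IPs inside 10.0.0.0/8 never occur in the real world.
PLACEHOLDER_BLOCK = ipaddress.ip_network("10.0.0.0/8")

def is_synthetic_marker(ip: str, asn: int) -> bool:
    return asn >= 500_000 or ipaddress.ip_address(ip) in PLACEHOLDER_BLOCK

print(is_synthetic_marker("10.12.0.7", 1234))      # True: placeholder IP range
print(is_synthetic_marker("193.212.1.1", 510000))  # True: placeholder ASN
print(is_synthetic_marker("193.212.1.1", 2119))    # False: plausible values
```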
Study Reproduction
Based on our evaluation, this data set can reproduce our study results regarding the behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (full string, OS name and version, browser name and version, device type) as features.
The calculated RTT significances for countries and regions inside Norway are not identical when using this data set, but show similar tendencies. The same is true for the median RTTs per country. This is because the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
See RESULTS.md for more details.
Ethics
By using the SSO service, the users consented to the collection and evaluation of their data for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
Publication
You can find more details on our conducted study in the following journal article:
Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022) Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono. ACM Transactions on Privacy and Security
Bibtex
@article{Wiefling_Pump_2022,
  author    = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
  title     = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
  journal   = {{ACM} {Transactions} on {Privacy} and {Security}},
  doi       = {10.1145/3546069},
  publisher = {ACM},
  year      = {2022}
}
License
This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
[1] A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
GNU General Public License, v2.0: https://www.gnu.org/licenses/gpl-2.0.html
The TMS-EEG signal analyser (TESA) is an open source extension for EEGLAB that includes functions necessary for cleaning and analysing TMS-EEG data. Both EEGLAB and TESA run in Matlab (r2015b or later). The attached files are example data files which can be used with TESA.
To download TESA, visit here:
http://nigelrogasch.github.io/TESA/
To read the TESA user manual, visit here:
https://www.gitbook.com/book/nigelrogasch/tesa-user-manual/details
File info:
example_data.set
WARNING: file size = 1.1 GB. A raw data set for trialling TESA. Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.
example_data_epoch_demean.set
File size = 340 MB. A partially processed data file of smaller size corresponding to step 8 of the analysis pipeline in the TESA user manual. Channel locations were loaded, unused electrodes removed, bad electrodes removed, and the data epoched (-1000 to 1000 ms) and demeaned (baseline corrected over -1000 to 1000 ms). Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.
example_data_epoch_demean_cut_int_ds.set
File size = 69 MB. A further processed data file, even smaller in size, corresponding to step 11 of the analysis pipeline in the TESA user manual. In addition to the above steps, data around the TMS pulse artifact was removed (-2 to 10 ms), replaced using linear interpolation, and downsampled to 1,000 Hz. Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.
Example data info:
Monophasic TMS pulses (current flow = posterior-anterior in brain) were given through a figure-of-eight coil (external diameter = 90 mm) connected to a Magstim 2002 unit (Magstim company, UK). 150 TMS pulses were delivered over the left superior parietal cortex (MNI coordinates: -20, -65, 65) at a rate of 0.2 Hz ± 25% jitter. TMS coil position was determined using frameless stereotaxic neuronavigation (Localite TMS Navigator, Localite, Germany) and intensity was set at resting motor threshold of the first dorsal interosseous muscle (68% maximum stimulator output). EEG was recorded from 62 TMS-specialised, c-ring slit electrodes (EASYCAP, Germany) using a TMS-compatible EEG amplifier (BrainAmp DC, BrainProducts GmbH, Germany). Data from all channels were referenced to the FCz electrode online with the AFz electrode serving as the common ground. EEG signals were digitised at 5 kHz (filtering: DC-1000 Hz) and EEG electrode impedance was kept below 5 kΩ.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.
The dataset was generated in two steps using the GPT-4o model.

- In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.
prompt = f"""
I am participating in an SVG code generation competition. The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
- Descriptions are generic and do not contain brand names, trademarks, or personal names.
- No descriptions include people, even in generic terms.
- Descriptions are concise: each is no more than 200 characters, with an average length of about 50 characters.
- Categories cover various domains, with some overlap between public and private test sets.
To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
Requirements:
- Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
- Ensure **diversity and creativity** across topics.
- **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
- Avoid duplication or overly similar phrasing.
Example topics: a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid, purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet, a snowy plain, black and white checkered pants, a starlit night over snow-covered peaks, khaki triangles and azure crescents, a maroon dodecahedron interwoven with teal threads.
Please return the 100 topics in csv format.
"""
- In the second step, SVG code is generated for each collected description using the following prompt:

prompt = f"""
Generate SVG code to visually represent the following text description, while respecting the given constraints.
Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. Focus on a clear and concise representation of the input description within the given limitations. Always give the complete SVG code with nothing omitted. Never use an ellipsis.
The code is scored based on similarity to the description, visual question answering and aesthetic components. Please generate a detailed svg code accordingly.
input description: {text}
"""
The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
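The competition-specific sanitization class is not reproduced here, but the element and attribute constraints stated in the generation prompt can be checked with a short allow-list validator. A sketch using only Python's standard library (the real sanitizer rewrites SVGs rather than just rejecting them):

```python
import xml.etree.ElementTree as ET

# Allow-lists taken from the generation prompt above.
ALLOWED_ELEMENTS = {"svg", "path", "circle", "rect", "ellipse", "line", "polyline",
                    "polygon", "g", "linearGradient", "radialGradient", "stop", "defs"}
ALLOWED_ATTRIBUTES = {"viewBox", "width", "height", "fill", "stroke", "stroke-width",
                      "d", "cx", "cy", "r", "x", "y", "rx", "ry", "x1", "y1",
                      "x2", "y2", "points", "transform", "opacity"}

def is_allowed(svg_code: str) -> bool:
    """Return True if the SVG parses and every element/attribute is allow-listed."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    for el in root.iter():
        tag = el.tag.split("}")[-1]  # strip any XML namespace prefix
        if tag not in ALLOWED_ELEMENTS:
            return False
        for attr in el.attrib:
            if attr.split("}")[-1] not in ALLOWED_ATTRIBUTES:
                return False
    return True

print(is_allowed('<svg viewBox="0 0 10 10"><circle cx="5" cy="5" r="4" fill="red"/></svg>'))  # True
print(is_allowed('<svg><script>alert(1)</script></svg>'))  # False: disallowed element
```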
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Bad Axe population distribution across 18 age groups. It lists the population in each age group along with each group's percentage of the total population of Bad Axe. The dataset can be utilized to understand the population distribution of Bad Axe by age. For example, using this dataset, we can identify the largest age group in Bad Axe.
Key observations
The largest age group in Bad Axe, MI was the 60 to 64 years age group, with a population of 278 (9.19%), according to the ACS 2018-2022 5-Year Estimates. At the same time, the smallest age group in Bad Axe, MI was the 75 to 79 years age group, with a population of 59 (1.95%). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Bad Axe Population by Age. You can refer to the same here
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Due to evolving data formats within the German Health Data Lab, there is a need to standardize this data into a common data model to facilitate collaborative health research and minimize the need for researchers to adapt to multiple data formats. For this purpose, we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

Methods
We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment that includes field coverage and concept mapping accuracy using example data.

Results
For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. The mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts. In the Condition domain, 99.8% of unique codes were mapped. The absence of real data limits a comprehensive assessment of quality.

Conclusion
The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.
DPH is updating and streamlining the COVID-19 cases, deaths, and testing data. As of 6/27/2022, the data will be published in four tables instead of twelve.

The COVID-19 Cases, Deaths, and Tests by Day dataset contains case and test data by date of sample submission. The death data are by date of death. This dataset is updated daily and contains information back to the beginning of the pandemic. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Cases-Deaths-and-Tests-by-Day/g9vi-2ahj.

The COVID-19 State Metrics dataset contains over 93 columns of data. This dataset is updated daily and currently contains information from June 21, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-State-Level-Data/qmgw-5kp6.

The COVID-19 County Metrics dataset contains 25 columns of data. This dataset is updated daily and currently contains information from June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-County-Level-Data/ujiq-dy22.

The COVID-19 Town Metrics dataset contains 16 columns of data. This dataset is updated daily and currently contains information from June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Town-Level-Data/icxw-cada. To protect confidentiality, if a town has fewer than 5 cases or positive NAAT tests over the past 7 days, those data will be suppressed.

COVID-19 test results are reported by date of specimen collection, including total, positive, negative, and indeterminate results for molecular and antigen tests. Molecular tests reported include polymerase chain reaction (PCR) and nucleic acid amplification (NAAT) tests. Test results may be reported several days after specimen collection. Data are incomplete for the most recent days. Data from previous dates are routinely updated. Records with a null date field summarize reported tests that were missing the date of collection.
Starting in July 2020, this dataset will be updated every weekday.
This database was prepared using a combination of materials that include aerial photographs, topographic maps (1:24,000 and 1:250,000), field notes, and a sample catalog. Our goal was to translate sample collection site locations at Yellowstone National Park and surrounding areas into a GIS database. This was achieved by transferring site locations from aerial photographs and topographic maps into layers in ArcMap. Each field site is located based on field notes describing where a sample was collected. Locations were marked on the photograph or topographic map by a pinhole or dot, respectively, with the corresponding station or site numbers. Station and site numbers were then referenced in the notes to determine the appropriate prefix for the station. Each point on the aerial photograph or topographic map was relocated on screen in ArcMap, on a digital topographic map or on an aerial photograph. Several samples are present in the field notes and in the catalog but do not correspond to an aerial photograph or could not be found on the topographic maps. These samples are marked with "No" under the LocationFound field and do not have a corresponding point in the SampleSites feature class. Each point represents a field station or collection site with information that was entered into an attributes table (explained in detail in the entity and attribute metadata sections). Tabular information on hand samples, thin sections, and mineral separates was entered by hand. The Samples table includes everything transferred from the paper records; it relates to the other tables using the SampleID field and to the SampleSites feature class using the SampleSite field.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include Stata syntax (dummy_dataset_create.do) that creates a panel dataset for negative binomial time series regression analyses, as described in our paper "Examining methodology to identify patterns of consulting in primary care for different groups of patients before a diagnosis of cancer: an exemplar applied to oesophagogastric cancer". We also include a sample dataset for clarity (dummy_dataset.dta), and a sample of that data in a spreadsheet (Appendix 2).
The variables contained therein are defined as follows:
case: binary variable for case or control status (takes a value of 0 for controls and 1 for cases).
patid: a unique patient identifier.
time_period: A count variable denoting the time period. In this example, 0 denotes 10 months before diagnosis with cancer, and 9 denotes the month of diagnosis with cancer.
ncons: number of consultations per month.
period0 to period9: 10 unique inflection point variables (one for each month before diagnosis). These are used to test which aggregation period includes the inflection point.
burden: binary variable denoting membership of one of two multimorbidity burden groups.
We also include two Stata do-files for analysing the consultation rate, stratified by burden group, using the maximum likelihood method (1_menbregpaper.do and 2_menbregpaper_bs.do).
Note: In this example, for demonstration purposes we create a dataset for 10 months leading up to diagnosis. In the paper, we analyse 24 months before diagnosis. Here, we study consultation rates over time, but the method could be used to study any countable event, such as number of prescriptions.
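The Stata do-files above define the actual dataset; for readers working outside Stata, here is a hedged Python sketch that builds a toy panel with the same variable names. The period0-period9 construction and the consultation-rate parameters are illustrative assumptions, not the paper's specification:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_patients, n_periods = 100, 10  # toy scale; the real panel is larger

rows = []
for patid in range(1, n_patients + 1):
    case = int(patid <= n_patients // 2)  # half cases, half controls
    burden = int(rng.integers(0, 2))      # multimorbidity burden group
    for t in range(n_periods):
        # Assumption for illustration: cases consult more often as
        # diagnosis (time_period = 9) approaches.
        mean = 1.0 + (0.3 * t if case else 0.0)
        rows.append({"patid": patid, "case": case, "burden": burden,
                     "time_period": t, "ncons": int(rng.poisson(mean))})

df = pd.DataFrame(rows)

# One indicator per month; how these flag candidate inflection points is
# an assumption here (see the do-files for the actual construction).
for t in range(n_periods):
    df[f"period{t}"] = (df["time_period"] >= t).astype(int)

print(df.shape)
```

Such a frame could then be fed to a negative binomial regression (e.g., statsmodels' NegativeBinomial) in place of Stata's menbreg, though the paper's mixed-effects specification is richer than this sketch.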
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
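The two-stage design described above can be sketched as follows. The distributed R script implements the actual design; the strata sizes and household IDs below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy sampling frame: strata (geo_1 x urban/rural) with counts of
# enumeration areas (EAs). Numbers are made up for the sketch.
strata = {"prov1_urban": 400, "prov1_rural": 600,
          "prov2_urban": 300, "prov2_rural": 700}
households_per_ea = 25
n_ea_total = 320  # 320 EAs x 25 households = 8,000 households

total_eas = sum(strata.values())
sample = {}
for stratum, n_eas in strata.items():
    # Stage 1: EAs allocated proportionally to stratum size.
    n_select = round(n_ea_total * n_eas / total_eas)
    chosen = rng.choice(n_eas, size=n_select, replace=False)
    # Stage 2: 25 households drawn within each selected EA
    # (household IDs 0-499 per EA are placeholders).
    sample[stratum] = {int(ea): rng.choice(500, size=households_per_ea,
                                           replace=False)
                       for ea in chosen}

n_households = sum(len(hh) for eas in sample.values() for hh in eas.values())
print(n_households)
```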
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
This file is an example data set from the Central Valley of California from a drought study corresponding to “recent non-drought conditions” (Scenario 1 in Petrie et al., in review). In 2014, following an 8-year period with 7 below-normal to critically-dry water years, the bioenergetic model TRUEMET was used to assess the impacts of drought on wintering waterfowl habitat and bioenergetics in the Central Valley of California. The goal of the study was to assess whether available foraging habitats could provide enough food to support waterfowl populations (ducks and geese) under a variety of climate and population level scenarios. This information could then be used by managers to adapt their waterfowl habitat management plans to drought conditions. The study area spanned the Central Valley and included the Sacramento Valley in the north, the San Joaquin Valley in the south, and Suisun Marsh and Sacramento-San Joaquin River Delta (Delta) east of San Francisco Bay. The data set consists of two foraging guilds (ducks and geese/swans) and five forage types: harvested corn, rice (flooded), rice (unflooded), wetland invertebrates and wetland moist soil seeds. For more background on the data set, see Petrie et al. in review.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
False positive detections, such as species misidentifications, occur in ecological data, although many models do not account for them. Consequently, these models are expected to generate biased inference. The main challenge in an analysis of data with false positives is to distinguish false positive and false negative processes while modeling realistic levels of heterogeneity in occupancy and detection probabilities without restrictive assumptions about parameter spaces. Building on previous attempts to account for false positive and false negative detections in occupancy models, we present hierarchical Bayesian models that utilize a subset of data with either confirmed detections of a species’ presence (CP model) or both confirmed presences and confirmed absences (CACP model). We demonstrate that our models overcome the challenges associated with false positive data by evaluating model performance in Monte Carlo simulations of a variety of scenarios. Our models also have the ability to improve inference by incorporating previous knowledge through informative priors. We describe an example application of the CP model to quantify the relationship between songbird occupancy and residential development, plus we provide instructions for ecologists to use the CACP and CP models in their own research. Monte Carlo simulation results indicated that, when data contained false positive detections, the CACP and CP models generated more accurate and precise posterior probability distributions than a model that assumed data did not have false positive errors. For the scenarios we expect to be most generally applicable, those with heterogeneity in occupancy and detection, the CACP and CP models generated essentially unbiased posterior occupancy probabilities. The CACP model with vague priors generated unbiased posterior distributions for covariate coefficients. 
The CP model generated unbiased posterior distributions for covariate coefficients with vague or informative priors, depending on the function relating covariates to occupancy probabilities. We conclude that the CACP and CP models generate accurate inference in situations with false positive data for which previous models were not suitable.
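The bias the CACP and CP models address can be seen in a small simulation. The sketch below (Python, with hypothetical parameter values not taken from the study) shows how even a modest false positive rate inflates a naive occupancy estimate that treats any detection as a true presence:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters for illustration only (not from the study).
n_sites, n_visits = 2000, 4
psi = 0.4    # true occupancy probability
p11 = 0.7    # detection probability at occupied sites
p10 = 0.05   # false positive probability at unoccupied sites

z = rng.random(n_sites) < psi                 # latent occupancy states
p_det = np.where(z, p11, p10)                 # per-site detection probability
y = rng.random((n_sites, n_visits)) < p_det[:, None]  # detection histories

# A naive analysis that ignores false positives treats any site with
# at least one detection as occupied.
naive_psi = y.any(axis=1).mean()
print(f"true psi = {psi}, naive estimate = {naive_psi:.3f}")
```

With these values the naive estimate lands near 0.5 rather than 0.4, because unoccupied sites accumulate false detections across repeat visits; the CACP and CP models are designed to separate the false positive and false negative processes rather than fold them together.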
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. This service is ideal for longer-term time series analysis, cloudless imagery and statistical accuracy.
GeoMAD has two main components: Geomedian and Median Absolute Deviations (MADs)
The geomedian component combines measurements collected over the specified timeframe to produce one representative, multispectral measurement for every pixel unit of the African continent. The end result is a comprehensive dataset that can be used to generate true-colour images for visual inspection of anthropogenic or natural landmarks. The full spectral dataset can be used to develop more complex algorithms.
For each pixel, invalid data are discarded and the remaining observations are summarised using the geomedian statistic. Because observations accumulate over many flyovers during the compositing period, even intermittently cloudy areas retain enough clear measurements to be represented.
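The geomedian here is the multi-dimensional geometric median of a pixel's spectral observations. The production service computes it with the optimised hdstats package; as an illustrative sketch only, it can be approximated with Weiszfeld's algorithm (all values below are hypothetical):

```python
import numpy as np

def geomedian(obs, eps=1e-7, max_iter=500):
    """Geometric (spatial) median of multispectral observations.

    obs: array of shape (n_obs, n_bands) -- clear observations of one pixel.
    Illustrative re-implementation via Weiszfeld's algorithm; the service
    itself uses the optimised hdstats package.
    """
    m = obs.mean(axis=0)                     # start from the arithmetic mean
    for _ in range(max_iter):
        d = np.linalg.norm(obs - m, axis=1)
        d = np.where(d < eps, eps, d)        # guard against division by zero
        w = 1.0 / d
        m_new = (w[:, None] * obs).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < eps:
            break
        m = m_new
    return m

# Hypothetical example: five observations of one pixel in three bands,
# the last one a residual cloudy outlier that escaped masking.
obs = np.array([[1200., 1500., 1300.],
                [1210., 1490., 1310.],
                [1190., 1510., 1290.],
                [1205., 1505., 1305.],
                [9000., 9500., 9200.]])      # outlier
print(geomedian(obs))                        # stays close to the cluster
```

Unlike a per-band mean, the result is robust to the residual outlier, and unlike a per-band median it keeps the bands consistent with one another, yielding a physically plausible spectrum for each pixel.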
Variations between the geomedian and the individual measurements are captured by the three Median Absolute Deviation (MAD) layers. These are higher-order statistical measurements calculating variation relative to the geomedian. The MAD layers can be used on their own or together with the geomedian to gain insights about the land surface and understand change over time.

Key Properties

Geographic Coverage: Continental Africa - approximately 37° North to 35° South
Temporal Coverage: 2017 – 2022*
Spatial Resolution: 10 x 10 meter
Update Frequency: Annual from 2017 - 2022
Product Type: Surface Reflectance (SR)
Product Level: Analysis Ready (ARD)
Number of Bands: 14 Bands
Parent Dataset: Sentinel-2 Level-2A Surface Reflectance
Source Data Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)
Service Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)

*Time is enabled on this service using UTC – Coordinated Universal Time. To ensure you are seeing the correct year for each annual slice of data, the time zone must be set specifically to UTC in the Map Viewer settings each time this layer is opened in a new map. More information on this setting can be found here: Set the map time zone.

Applications

This service is ideal for:
* Longer-term time series analysis
* Cloud-free imagery
* Statistical accuracy

Available Bands

Band ID | Description | Value range | Data type | No data value
B02 | Geomedian B02 (Blue) | 1 - 10000 | uint16 | 0
B03 | Geomedian B03 (Green) | 1 - 10000 | uint16 | 0
B04 | Geomedian B04 (Red) | 1 - 10000 | uint16 | 0
B05 | Geomedian B05 (Red edge 1) | 1 - 10000 | uint16 | 0
B06 | Geomedian B06 (Red edge 2) | 1 - 10000 | uint16 | 0
B07 | Geomedian B07 (Red edge 3) | 1 - 10000 | uint16 | 0
B08 | Geomedian B08 (Near infrared (NIR) 1) | 1 - 10000 | uint16 | 0
B8A | Geomedian B8A (NIR 2) | 1 - 10000 | uint16 | 0
B11 | Geomedian B11 (Short-wave infrared (SWIR) 1) | 1 - 10000 | uint16 | 0
B12 | Geomedian B12 (SWIR 2) | 1 - 10000 | uint16 | 0
SMAD | Spectral Median Absolute Deviation | 0 - 1 | float32 | NaN
EMAD | Euclidean Median Absolute Deviation | 0 - 31623 | float32 | NaN
BCMAD | Bray-Curtis Median Absolute Deviation | 0 - 1 | float32 | NaN
COUNT | Number of clear observations | 1 - 65535 | uint16 | 0

Bands can be subdivided as follows:

Geomedian — 10 bands: The geomedian is calculated using the spectral bands of data collected during the specified time period. Surface reflectance values have been scaled between 1 and 10000 to allow for more efficient data storage as unsigned 16-bit integers (uint16). Note that parent datasets often contain more bands, some of which are not used in GeoMAD. The geomedian band IDs correspond to bands in the parent Sentinel-2 Level-2A data. For example, the Annual GeoMAD band B02 contains the annual geomedian of the Sentinel-2 B02 band.

Median Absolute Deviations (MADs) — 3 bands: Deviations from the geomedian are quantified through median absolute deviation calculations. The GeoMAD service utilises three MADs, each stored in a separate band: Euclidean MAD (EMAD), spectral MAD (SMAD), and Bray-Curtis MAD (BCMAD). Each MAD is calculated using the same ten bands as in the geomedian. SMAD and BCMAD are normalised ratios, therefore they are unitless and their values always fall between 0 and 1. EMAD is a function of surface reflectance but is neither a ratio nor normalised, therefore its valid value range depends on the number of bands used in the geomedian calculation.

Count — 1 band: The number of clear satellite measurements of a pixel for that calendar year. This is around 60 annually, but doubles in areas of overlap between scenes. "Count" is not incorporated in either the geomedian or MADs calculations. It is intended for metadata analysis and data validation.

Processing

All clear observations for the given time period are collated from the parent dataset. Cloudy pixels are identified and excluded. The geomedian and MADs calculations are then performed by the hdstats package. Annual GeoMAD datasets for the period use hdstats version 0.2.

More details on this dataset can be found here.
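The three MAD layers can be sketched from their published definitions: EMAD is the median Euclidean distance to the geomedian (which is why its upper bound, 31623 ≈ √10 × 10000, depends on the ten scaled bands), SMAD is the median cosine (spectral) distance, and BCMAD is the median Bray-Curtis dissimilarity. The Python below is an illustrative re-implementation under those definitions, not the production hdstats code:

```python
import numpy as np

def mads(obs, gm):
    """Median Absolute Deviations of clear observations around a geomedian.

    obs: (n_obs, n_bands) scaled surface reflectance for one pixel.
    gm:  (n_bands,) geomedian of the same pixel.
    Illustrative sketch only; the service computes these with hdstats.
    """
    # EMAD: median Euclidean distance, on the reflectance scale.
    emad = np.median(np.linalg.norm(obs - gm, axis=1))
    # SMAD: median cosine distance, unitless in [0, 1].
    cos = (obs @ gm) / (np.linalg.norm(obs, axis=1) * np.linalg.norm(gm))
    smad = np.median(1.0 - cos)
    # BCMAD: median Bray-Curtis dissimilarity, unitless in [0, 1].
    bcmad = np.median(np.abs(obs - gm).sum(axis=1) / (obs + gm).sum(axis=1))
    return emad, smad, bcmad

# Hypothetical stable pixel: four similar observations in three bands.
obs = np.array([[1200., 1500., 1300.],
                [1210., 1490., 1310.],
                [1190., 1510., 1290.],
                [1205., 1505., 1305.]])
gm = np.median(obs, axis=0)   # stand-in for the true geomedian
emad, smad, bcmad = mads(obs, gm)
print(emad, smad, bcmad)      # all small for a temporally stable pixel
```

A stable pixel yields small values in all three layers, while land-cover change or noisy observations push them up, which is what makes the MADs useful for change detection alongside the geomedian itself.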
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains sample output data for TELL. The sample dataset includes four years of future data (2039, 2059, 2079, and 2099) that come from IM3's future WRF runs under the RCP 8.5 climate scenario with SSP5 population forcing. Note that the GCAM-USA output used in this simulation is sample data only; as such, the quantitative results from this set of sample output should not be considered valid.
Data files and results for massypup64 (http://www.lababi.bioprocess.org/index.php/14-sample-data-articles/78-massypup):
* Proteomics
* Metabolomics
* Data Mining
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Each R script replicates all of the example code from one chapter of the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you use any of these data sets for research purposes.