This dataset consists of 50 stories about bad (research) data management, based on true events. The stories were collected and adapted by the Thuringian Competence Network for Research Data Management and were used in 2020 as the opening of the RDM Days and of the Data Horror Week. The text of the stories is free to use under the CC0 licence. This does not include the illustrations, which were used on the webpage or in the card game to visualize the stories. Website with all illustrated stories: Link
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)
In all data sets, missing values are coded as "NA".
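Because the plain-text ".dat" files code missing values as "NA", they can also be read outside of R. A minimal sketch in Python/pandas, using a made-up three-row excerpt in place of the real files (which accompany the book chapter):

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking the plain-text ".dat" layout described above;
# the real files are distributed with the book chapter materials.
dat = io.StringIO(
    "ID x y w\n"
    "1 0.52 1.10 0\n"
    "1 NA 0.87 0\n"
    "2 -0.33 NA 1\n"
)

# "NA" is the missing-value code in all data sets.
df = pd.read_csv(dat, sep=r"\s+", na_values="NA")

print(df.isna().sum().sum())    # number of missing cells
print(df.groupby("ID").size())  # observations per group
```

The `.rda` files are read natively in R with `load()`; the sketch above only covers the plain-text variant.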
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Source Huggingface Hub: link
About this dataset The HANS dataset is an NLI evaluation set that tests specific hypotheses about invalid heuristics that NLI models are likely to learn.
Columns

File: validation.csv

premise: The premise of the example. (string)
hypothesis: The hypothesis of the example. (string)
label: The label of the example. (string)
parse_premise: The parse of the premise. (string)
parse_hypothesis: The parse of the hypothesis. (string)
binary_parse_premise: The binary parse of the premise. (string)
binary_parse_hypothesis: The binary parse of the hypothesis. (string)
heuristic: The heuristic that the example is based on. (string)
subcase: The subcase of the heuristic that the example is based on. (string)
template: The template that the example is based on. (string)

File: train.csv

train.csv contains the same columns as validation.csv (premise, hypothesis, label, parse_premise, parse_hypothesis, binary_parse_premise, binary_parse_hypothesis, heuristic, subcase, template).
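Given these columns, HANS is typically analysed per heuristic: a model's accuracy on entailment versus non-entailment cases within each heuristic reveals whether it relies on that heuristic. A minimal pandas sketch using a tiny hand-made stand-in for validation.csv (the rows and subcase names are illustrative, not taken from the file):

```python
import io
import pandas as pd

# Tiny stand-in with a subset of the columns listed above;
# the real rows come from validation.csv in the dataset.
csv = io.StringIO(
    "premise,hypothesis,label,heuristic,subcase\n"
    "The doctor saw the lawyer.,The lawyer saw the doctor.,non-entailment,lexical_overlap,subject_object_swap\n"
    "The judge and the actor ran.,The judge ran.,entailment,lexical_overlap,conjunction\n"
)
df = pd.read_csv(csv)

# Count examples per (heuristic, label) cell; model accuracy would be
# aggregated over the same grouping.
by_heuristic = df.groupby(["heuristic", "label"]).size()
print(by_heuristic)
```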
CC0
Original Data Source: HANS (Invalid NLI Heuristics Benchmark)
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
This dataset includes information about a sample of 8,887 Open Educational Resources (OERs) from the SkillsCommons website. It contains title, description, URL, type, availability date, issued date, subjects, and the availability of the following metadata: level, time_required to finish, and accessibility.
This dataset has been used to build a metadata scoring and quality prediction model for OERs.
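As one illustration of metadata scoring, a toy completeness measure over the three optional fields named above (level, time_required, accessibility). The scoring model in the underlying work is more elaborate; the field names and record layout here are assumptions for the sketch:

```python
# Illustrative completeness score: the fraction of the optional metadata
# fields present and non-empty for an OER record. (Hypothetical layout;
# not the paper's actual scoring model.)
OPTIONAL_FIELDS = ("level", "time_required", "accessibility")

def metadata_completeness(record: dict) -> float:
    present = sum(1 for f in OPTIONAL_FIELDS if record.get(f))
    return present / len(OPTIONAL_FIELDS)

oer = {
    "title": "Intro to Welding",
    "level": "Beginner",
    "time_required": None,      # missing
    "accessibility": "checked",
}
print(metadata_completeness(oer))  # 2 of 3 optional fields present
```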
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Login Data Set for Risk-Based Authentication
Synthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
This data set aims to foster research and development on Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
The users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
WARNING: The feature values are plausible, but still entirely artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.
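For intuition only, here is a toy sketch of the likelihood-ratio idea behind models like Freeman et al. (2016): a login attempt scores as riskier the more unusual its feature values are for that particular user, relative to their global frequency. This is not the authors' implementation; the smoothing and feature handling are deliberate simplifications:

```python
from collections import Counter

def risk_score(features, user_history, global_history, smoothing=1.0):
    """Toy likelihood-ratio score: product over features of
    p(value globally) / p(value for this user), with additive smoothing.
    Higher = more unusual for the user = riskier.
    (Illustrative only; the Freeman et al. model is more elaborate.)"""
    score = 1.0
    for name, value in features.items():
        g = Counter(global_history[name])
        u = Counter(user_history[name])
        p_global = (g[value] + smoothing) / (sum(g.values()) + smoothing * (len(g) + 1))
        p_user = (u[value] + smoothing) / (sum(u.values()) + smoothing * (len(u) + 1))
        score *= p_global / p_user
    return score

# Hypothetical histories: this user always logs in from Norway on desktop.
user = {"country": ["NO"] * 50, "device_type": ["desktop"] * 50}
world = {"country": ["NO"] * 70 + ["US"] * 30,
         "device_type": ["desktop"] * 60 + ["mobile"] * 40}

usual = risk_score({"country": "NO", "device_type": "desktop"}, user, world)
unusual = risk_score({"country": "US", "device_type": "mobile"}, user, world)
print(usual, unusual)  # an atypical login scores higher (riskier)
```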
Overview
The data set contains the following features related to each login attempt on the SSO:
Feature | Data Type | Description | Range or Example
IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255
Country | String | Country derived from the IP address | US
Region | String | Region derived from the IP address | New York
City | String | City derived from the IP address | Rochester
ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000
User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ...
OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10
Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538
Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [1]
User ID | Integer | Identification number related to the affected user account | [Random pseudonym]
Login Timestamp | Integer | Timestamp related to the login attempt | [64-bit timestamp]
Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000
Login Successful | Boolean | True: login was successful; False: login failed | (true, false)
Is Attack IP | Boolean | IP address was found in a known attacker data set | (true, false)
Is Account Takeover | Boolean | Login attempt was identified as an account takeover by the incident response team of the online service | (true, false)
Data Creation
As the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical for the categorical data. All other data was randomly generated while maintaining the logical relations and temporal order between the features.
The timestamps, however, are not identical and contain randomness. The feature values related to the IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries are probably in other positions than in the original data set.
The country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations of the original data set.
The device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS, the browser information. From this information, we randomly generated the user agent string. Therefore, all logical relations regarding the user agent are identical to those in the real data set.
The RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.
Regarding the Data Values
Due to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not have any effect on the risk scores generated by the Freeman et al. (2016) model.
You can recognize them by the following values:
ASNs with values >= 500,000
IP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
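These placeholder values can be flagged with a few lines of Python, assuming the IP address and ASN are available per record:

```python
import ipaddress

# Flags the artificial placeholder entries described above:
# ASNs >= 500,000 and IPs inside 10.0.0.0/8 never occur in the real world.
PLACEHOLDER_BLOCK = ipaddress.ip_network("10.0.0.0/8")

def is_synthetic_marker(ip: str, asn: int) -> bool:
    return asn >= 500_000 or ipaddress.ip_address(ip) in PLACEHOLDER_BLOCK

print(is_synthetic_marker("10.12.0.7", 1234))      # True: placeholder IP range
print(is_synthetic_marker("193.212.1.1", 510000))  # True: placeholder ASN
print(is_synthetic_marker("193.212.1.1", 2119))    # False: plausible values
```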
Study Reproduction
Based on our evaluation, this data set can reproduce our study results regarding the behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (full string, OS name and version, browser name and version, device type) as features.
The calculated RTT significances for countries and regions inside Norway are not identical when using this data set, but show similar tendencies. The same is true for the median RTTs per country. This is because the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
See RESULTS.md for more details.
Ethics
By using the SSO service, the users consented to the collection and evaluation of their data for research purposes. For study reproduction and to foster RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
The synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
Publication
You can find more details on our conducted study in the following journal article:
Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022) Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono. ACM Transactions on Privacy and Security
Bibtex
@article{Wiefling_Pump_2022,
  author    = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},
  title     = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},
  journal   = {{ACM} {Transactions} on {Privacy} and {Security}},
  doi       = {10.1145/3546069},
  publisher = {ACM},
  year      = {2022}
}
License
This data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
Stephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
[1] A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
GNU General Public License, v2.0: https://www.gnu.org/licenses/gpl-2.0.html
The TMS-EEG signal analyser (TESA) is an open source extension for EEGLAB that includes functions necessary for cleaning and analysing TMS-EEG data. Both EEGLAB and TESA run in Matlab (r2015b or later). The attached files are example data files which can be used with TESA.
To download TESA, visit here:
http://nigelrogasch.github.io/TESA/
To read the TESA user manual, visit here:
https://www.gitbook.com/book/nigelrogasch/tesa-user-manual/details
File info:
example_data.set
WARNING: file size = 1.1 GB. A raw data set for trialling TESA. Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.
example_data_epoch_demean.set
File size = 340 MB. A partially processed data file of smaller size corresponding to step 8 of the analysis pipeline in the TESA user manual. Channel locations were loaded, unused electrodes removed, bad electrodes removed, and the data epoched (-1000 to 1000 ms) and demeaned (baseline corrected over -1000 to 1000 ms). Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.
example_data_epoch_demean_cut_int_ds.set
File size = 69 MB. A further processed data file, even smaller in size, corresponding to step 11 of the analysis pipeline in the TESA user manual. In addition to the above steps, data around the TMS pulse artifact was removed (-2 to 10 ms), replaced using linear interpolation, and downsampled to 1,000 Hz. Load the data file into EEGLAB using the existing EEGLAB data set functions. Note that both the .fdt and .set files are required.
Example data info:
Monophasic TMS pulses (current flow = posterior-anterior in brain) were given through a figure-of-eight coil (external diameter = 90 mm) connected to a Magstim 2002 unit (Magstim company, UK). 150 TMS pulses were delivered over the left superior parietal cortex (MNI coordinates: -20, -65, 65) at a rate of 0.2 Hz ± 25% jitter. TMS coil position was determined using frameless stereotaxic neuronavigation (Localite TMS Navigator, Localite, Germany) and intensity was set at resting motor threshold of the first dorsal interosseous muscle (68% maximum stimulator output). EEG was recorded from 62 TMS-specialised, c-ring slit electrodes (EASYCAP, Germany) using a TMS-compatible EEG amplifier (BrainAmp DC, BrainProducts GmbH, Germany). Data from all channels were referenced to the FCz electrode online with the AFz electrode serving as the common ground. EEG signals were digitised at 5 kHz (filtering: DC-1000 Hz) and EEG electrode impedance was kept below 5 kΩ.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This training data was generated using GPT-4o as part of the 'Drawing with LLM' competition (https://www.kaggle.com/competitions/drawing-with-llms). It can be used to fine-tune small language models for the competition or serve as an augmentation dataset alongside other data sources.
The dataset was generated in two steps using the GPT-4o model.

- In the first step, topic descriptions relevant to the competition are generated using a specific prompt. By running this prompt multiple times, over 3,000 descriptions were collected.
prompt = f"""
I am participating in an SVG code generation competition. The competition involves generating SVG images based on short textual descriptions of everyday objects and scenes, spanning a wide range of categories. The key guidelines are as follows:
- Descriptions are generic and do not contain brand names, trademarks, or personal names.
- No descriptions include people, even in generic terms.
- Descriptions are concise: each is no more than 200 characters, with an average length of about 50 characters.
- Categories cover various domains, with some overlap between public and private test sets.
To train a small LLM model, I am preparing a synthetic dataset. Could you generate 100 unique topics aligned with the competition style?
Requirements:
- Each topic should range between **20 and 200 characters**, with an **average around 60 characters**.
- Ensure **diversity and creativity** across topics.
- **50% of the topics** should come from the categories of **landscapes**, **abstract art**, and **fashion**.
- Avoid duplication or overly similar phrasing.
Example topics: a purple forest at dusk, gray wool coat with a faux fur collar, a lighthouse overlooking the ocean, burgundy corduroy, pants with patch pockets and silver buttons, orange corduroy overalls, a purple silk scarf with tassel trim, a green lagoon under a cloudy sky, crimson rectangles forming a chaotic grid, purple pyramids spiraling around a bronze cone, magenta trapezoids layered on a translucent silver sheet, a snowy plain, black and white checkered pants, a starlit night over snow-covered peaks, khaki triangles and azure crescents, a maroon dodecahedron interwoven with teal threads.
Please return the 100 topics in csv format.
"""
- In the second step, SVG code is generated for each collected description using the following prompt:

prompt = f"""
Generate SVG code to visually represent the following text description, while respecting the given constraints.
Allowed Elements: `svg`, `path`, `circle`, `rect`, `ellipse`, `line`, `polyline`, `polygon`, `g`, `linearGradient`, `radialGradient`, `stop`, `defs`
Allowed Attributes: `viewBox`, `width`, `height`, `fill`, `stroke`, `stroke-width`, `d`, `cx`, `cy`, `r`, `x`, `y`, `rx`, `ry`, `x1`, `y1`, `x2`, `y2`, `points`, `transform`, `opacity`
Please ensure that the generated SVG code is well-formed, valid, and strictly adheres to these constraints. Focus on a clear and concise representation of the input description within the given limitations. Always give the complete SVG code with nothing omitted. Never use an ellipsis.
The code is scored based on similarity to the description, visual question answering and aesthetic components. Please generate a detailed svg code accordingly.
input description: {text}
"""
The raw SVG output is then cleaned and sanitized using a competition-specific sanitization class. After that, the cleaned SVG is scored using the SigLIP model to evaluate text-to-SVG similarity. Only SVGs with a score above 0.5 are included in the dataset. On average, out of three SVG generations, only one meets the quality threshold after the cleaning, sanitization, and scoring process.
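The competition-specific sanitization class is not reproduced here, but the element and attribute constraints stated in the generation prompt can be checked with a short allow-list validator. A sketch using only Python's standard library (the real sanitizer rewrites SVGs rather than just rejecting them):

```python
import xml.etree.ElementTree as ET

# Allow-lists taken from the generation prompt above.
ALLOWED_ELEMENTS = {"svg", "path", "circle", "rect", "ellipse", "line", "polyline",
                    "polygon", "g", "linearGradient", "radialGradient", "stop", "defs"}
ALLOWED_ATTRIBUTES = {"viewBox", "width", "height", "fill", "stroke", "stroke-width",
                      "d", "cx", "cy", "r", "x", "y", "rx", "ry", "x1", "y1",
                      "x2", "y2", "points", "transform", "opacity"}

def is_allowed(svg_code: str) -> bool:
    """Return True if the SVG parses and every element/attribute is allow-listed."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    for el in root.iter():
        tag = el.tag.split("}")[-1]  # strip any XML namespace prefix
        if tag not in ALLOWED_ELEMENTS:
            return False
        for attr in el.attrib:
            if attr.split("}")[-1] not in ALLOWED_ATTRIBUTES:
                return False
    return True

print(is_allowed('<svg viewBox="0 0 10 10"><circle cx="5" cy="5" r="4" fill="red"/></svg>'))  # True
print(is_allowed('<svg><script>alert(1)</script></svg>'))  # False: disallowed element
```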
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the Bad Axe population distribution across 18 age groups. It lists the population in each age group along with each group's percentage of the total population of Bad Axe. The dataset can be utilized to understand the population distribution of Bad Axe by age. For example, using this dataset, we can identify the largest age group in Bad Axe.
Key observations
The largest age group in Bad Axe, MI was the 60 to 64 years age group, with a population of 278 (9.19%), according to the ACS 2018-2022 5-Year Estimates. At the same time, the smallest age group in Bad Axe, MI was the 75 to 79 years age group, with a population of 59 (1.95%). Source: U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2018-2022 5-Year Estimates
Age groups:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Bad Axe Population by Age. You can refer to the same here
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
The German Health Data Lab is going to provide access to German statutory health insurance claims data ranging from 2009 to the present for research purposes. Due to evolving data formats within the German Health Data Lab, there is a need to standardize this data into a common data model to facilitate collaborative health research and minimize the need for researchers to adapt to multiple data formats. For this purpose, we chose to transform the data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model.

Methods
We developed an Extract, Transform, and Load (ETL) pipeline for two distinct German Health Data Lab data formats: Format 1 (2009-2016) and Format 3 (2019 onwards). Because Format 2 (2017-2018) has the same structure as Format 1, the ETL pipeline for Format 1 can be applied to Format 2 as well. Our ETL process, supported by Observational Health Data Sciences and Informatics (OHDSI) tools, includes specification development, SQL skeleton creation, and concept mapping. We detail the process characteristics and present a quality assessment that includes field coverage and concept mapping accuracy using example data.

Results
For Format 1, we achieved a field coverage of 92.7%. The Data Quality Dashboard showed 100.0% conformance and 80.6% completeness, although plausibility checks were disabled. The mapping coverage for the Condition domain was low at 18.3% due to invalid codes and missing mappings in the provided example data. For Format 3, the field coverage was 86.2%, with the Data Quality Dashboard reporting 99.3% conformance and 75.9% completeness. The Procedure domain had very low mapping coverage (2.2%) due to the use of mocked data and unmapped local concepts. In the Condition domain, 99.8% of unique codes were mapped. The absence of real data limits a comprehensive assessment of quality.

Conclusion
The ETL process effectively transforms the data with high field coverage and conformance. It simplifies data utilization for German Health Data Lab users and enhances the use of OHDSI analysis tools. This initiative represents a significant step towards facilitating cross-border research in Europe by providing publicly available, standardized ETL processes (https://github.com/FraunhoferMEVIS/ETLfromHDLtoOMOP) and evaluations of their performance.
DPH is updating and streamlining the COVID-19 cases, deaths, and testing data. As of 6/27/2022, the data will be published in four tables instead of twelve.

The COVID-19 Cases, Deaths, and Tests by Day dataset contains case and test data by date of sample submission. The death data are by date of death. This dataset is updated daily and contains information back to the beginning of the pandemic. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Cases-Deaths-and-Tests-by-Day/g9vi-2ahj.

The COVID-19 State Metrics dataset contains over 93 columns of data. This dataset is updated daily and currently contains information from June 21, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-State-Level-Data/qmgw-5kp6.

The COVID-19 County Metrics dataset contains 25 columns of data. This dataset is updated daily and currently contains information from June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-County-Level-Data/ujiq-dy22.

The COVID-19 Town Metrics dataset contains 16 columns of data. This dataset is updated daily and currently contains information from June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Town-Level-Data/icxw-cada. To protect confidentiality, if a town has fewer than 5 cases or positive NAAT tests over the past 7 days, those data will be suppressed.

COVID-19 test results are reported by date of specimen collection, including total, positive, negative, and indeterminate results for molecular and antigen tests. Molecular tests reported include polymerase chain reaction (PCR) and nucleic acid amplification (NAAT) tests. Test results may be reported several days after specimen collection. Data are incomplete for the most recent days. Data from previous dates are routinely updated. Records with a null date field summarize reported tests that were missing the date of collection.
Starting in July 2020, this dataset will be updated every weekday.
This database was prepared using a combination of materials that include aerial photographs, topographic maps (1:24,000 and 1:250,000), field notes, and a sample catalog. Our goal was to translate sample collection site locations at Yellowstone National Park and surrounding areas into a GIS database. This was achieved by transferring site locations from aerial photographs and topographic maps into layers in ArcMap. Each field site is located based on field notes describing where a sample was collected. Locations were marked on the photograph or topographic map by a pinhole or dot, respectively, with the corresponding station or site numbers. Station and site numbers were then referenced in the notes to determine the appropriate prefix for the station. Each point on the aerial photograph or topographic map was relocated on screen in ArcMap, on a digital topographic map or on an aerial photograph. Several samples are present in the field notes and in the catalog but do not correspond to an aerial photograph or could not be found on the topographic maps. These samples are marked with "No" under the LocationFound field and do not have a corresponding point in the SampleSites feature class. Each point represents a field station or collection site with information that was entered into an attributes table (explained in detail in the entity and attribute metadata sections). Tabular information on hand samples, thin sections, and mineral separates was entered by hand. The Samples table includes everything transferred from the paper records; it relates to the other tables using the SampleID field and to the SampleSites feature class using the SampleSite field.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include Stata syntax (dummy_dataset_create.do) that creates a panel dataset for negative binomial time series regression analyses, as described in our paper "Examining methodology to identify patterns of consulting in primary care for different groups of patients before a diagnosis of cancer: an exemplar applied to oesophagogastric cancer". We also include a sample dataset for clarity (dummy_dataset.dta), and a sample of that data in a spreadsheet (Appendix 2).
The variables contained therein are defined as follows:
case: binary variable for case or control status (takes a value of 0 for controls and 1 for cases).
patid: a unique patient identifier.
time_period: A count variable denoting the time period. In this example, 0 denotes 10 months before diagnosis with cancer, and 9 denotes the month of diagnosis with cancer.
ncons: number of consultations per month.
period0 to period9: 10 unique inflection point variables (one for each month before diagnosis). These are used to test which aggregation period includes the inflection point.
burden: binary variable denoting membership of one of two multimorbidity burden groups.
We also include two Stata do-files for analysing the consultation rate, stratified by burden group, using the maximum likelihood method (1_menbregpaper.do and 2_menbregpaper_bs.do).
Note: In this example, for demonstration purposes we create a dataset for 10 months leading up to diagnosis. In the paper, we analyse 24 months before diagnosis. Here, we study consultation rates over time, but the method could be used to study any countable event, such as number of prescriptions.
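The Stata do-files above define the actual dataset; for readers working outside Stata, here is a hedged Python sketch that builds a toy panel with the same variable names. The period0-period9 construction and the consultation-rate parameters are illustrative assumptions, not the paper's specification:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_patients, n_periods = 100, 10  # toy scale; the real panel is larger

rows = []
for patid in range(1, n_patients + 1):
    case = int(patid <= n_patients // 2)  # half cases, half controls
    burden = int(rng.integers(0, 2))      # multimorbidity burden group
    for t in range(n_periods):
        # Assumption for illustration: cases consult more often as
        # diagnosis (time_period = 9) approaches.
        mean = 1.0 + (0.3 * t if case else 0.0)
        rows.append({"patid": patid, "case": case, "burden": burden,
                     "time_period": t, "ncons": int(rng.poisson(mean))})

df = pd.DataFrame(rows)

# One indicator per month; how these flag candidate inflection points is
# an assumption here (see the do-files for the actual construction).
for t in range(n_periods):
    df[f"period{t}"] = (df["time_period"] >= t).astype(int)

print(df.shape)
```

Such a frame could then be fed to a negative binomial regression (e.g., statsmodels' NegativeBinomial) in place of Stata's menbreg, though the paper's mixed-effects specification is richer than this sketch.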
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level, the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, assets ownership). The data only includes ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purpose of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
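The two-stage design described above can be sketched as follows. The distributed R script implements the actual design; the strata sizes and household IDs below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy sampling frame: strata (geo_1 x urban/rural) with counts of
# enumeration areas (EAs). Numbers are made up for the sketch.
strata = {"prov1_urban": 400, "prov1_rural": 600,
          "prov2_urban": 300, "prov2_rural": 700}
households_per_ea = 25
n_ea_total = 320  # 320 EAs x 25 households = 8,000 households

total_eas = sum(strata.values())
sample = {}
for stratum, n_eas in strata.items():
    # Stage 1: EAs allocated proportionally to stratum size.
    n_select = round(n_ea_total * n_eas / total_eas)
    chosen = rng.choice(n_eas, size=n_select, replace=False)
    # Stage 2: 25 households drawn within each selected EA
    # (household IDs 0-499 per EA are placeholders).
    sample[stratum] = {int(ea): rng.choice(500, size=households_per_ea,
                                           replace=False)
                       for ea in chosen}

n_households = sum(len(hh) for eas in sample.values() for hh in eas.values())
print(n_households)
```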
The dataset is a synthetic dataset. Although the variables it contains are variables typically collected from sample surveys or population censuses, no questionnaire is available for this dataset. A "fake" questionnaire was however created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks, based on which synthetic observations were assessed and rejected/replaced when needed). Also, some post-processing was applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
This file is an example data set from the Central Valley of California from a drought study corresponding to “recent non-drought conditions” (Scenario 1 in Petrie et al., in review). In 2014, following an 8-year period with 7 below-normal to critically-dry water years, the bioenergetic model TRUEMET was used to assess the impacts of drought on wintering waterfowl habitat and bioenergetics in the Central Valley of California. The goal of the study was to assess whether available foraging habitats could provide enough food to support waterfowl populations (ducks and geese) under a variety of climate and population level scenarios. This information could then be used by managers to adapt their waterfowl habitat management plans to drought conditions. The study area spanned the Central Valley and included the Sacramento Valley in the north, the San Joaquin Valley in the south, and Suisun Marsh and Sacramento-San Joaquin River Delta (Delta) east of San Francisco Bay. The data set consists of two foraging guilds (ducks and geese/swans) and five forage types: harvested corn, rice (flooded), rice (unflooded), wetland invertebrates and wetland moist soil seeds. For more background on the data set, see Petrie et al. in review.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
False positive detections, such as species misidentifications, occur in ecological data, although many models do not account for them. Consequently, these models are expected to generate biased inference. The main challenge in an analysis of data with false positives is to distinguish false positive and false negative processes while modeling realistic levels of heterogeneity in occupancy and detection probabilities without restrictive assumptions about parameter spaces. Building on previous attempts to account for false positive and false negative detections in occupancy models, we present hierarchical Bayesian models that utilize a subset of data with either confirmed detections of a species’ presence (CP model) or both confirmed presences and confirmed absences (CACP model). We demonstrate that our models overcome the challenges associated with false positive data by evaluating model performance in Monte Carlo simulations of a variety of scenarios. Our models also have the ability to improve inference by incorporating previous knowledge through informative priors. We describe an example application of the CP model to quantify the relationship between songbird occupancy and residential development, plus we provide instructions for ecologists to use the CACP and CP models in their own research. Monte Carlo simulation results indicated that, when data contained false positive detections, the CACP and CP models generated more accurate and precise posterior probability distributions than a model that assumed data did not have false positive errors. For the scenarios we expect to be most generally applicable, those with heterogeneity in occupancy and detection, the CACP and CP models generated essentially unbiased posterior occupancy probabilities. The CACP model with vague priors generated unbiased posterior distributions for covariate coefficients. 
The CP model generated unbiased posterior distributions for covariate coefficients with vague or informative priors, depending on the function relating covariates to occupancy probabilities. We conclude that the CACP and CP models generate accurate inference in situations with false positive data for which previous models were not suitable.
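The bias the CACP and CP models address can be seen in a small simulation. The sketch below (Python, with hypothetical parameter values not taken from the study) shows how even a modest false positive rate inflates a naive occupancy estimate that treats any detection as a true presence:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters for illustration only (not from the study).
n_sites, n_visits = 2000, 4
psi = 0.4    # true occupancy probability
p11 = 0.7    # detection probability at occupied sites
p10 = 0.05   # false positive probability at unoccupied sites

z = rng.random(n_sites) < psi                 # latent occupancy states
p_det = np.where(z, p11, p10)                 # per-site detection probability
y = rng.random((n_sites, n_visits)) < p_det[:, None]  # detection histories

# A naive analysis that ignores false positives treats any site with
# at least one detection as occupied.
naive_psi = y.any(axis=1).mean()
print(f"true psi = {psi}, naive estimate = {naive_psi:.3f}")
```

With these values the naive estimate lands near 0.5 rather than 0.4, because unoccupied sites accumulate false detections across repeat visits; the CACP and CP models are designed to separate the false positive and false negative processes rather than fold them together.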
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GeoMAD is the Digital Earth Africa (DE Africa) surface reflectance geomedian and triple Median Absolute Deviation data service. It is a cloud-free composite of satellite data compiled over specific timeframes. This service is ideal for longer-term time series analysis, cloudless imagery and statistical accuracy.
GeoMAD has two main components: Geomedian and Median Absolute Deviations (MADs)
The geomedian component combines measurements collected over the specified timeframe to produce one representative, multispectral measurement for every pixel unit of the African continent. The end result is a comprehensive dataset that can be used to generate true-colour images for visual inspection of anthropogenic or natural landmarks. The full spectral dataset can be used to develop more complex algorithms.
For each pixel, invalid data are discarded and the remaining observations are summarised using the geomedian statistic. Because observations accumulate over many flyovers during the compositing period, even intermittently cloudy areas retain enough clear measurements to be represented.
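The geomedian here is the multi-dimensional geometric median of a pixel's spectral observations. The production service computes it with the optimised hdstats package; as an illustrative sketch only, it can be approximated with Weiszfeld's algorithm (all values below are hypothetical):

```python
import numpy as np

def geomedian(obs, eps=1e-7, max_iter=500):
    """Geometric (spatial) median of multispectral observations.

    obs: array of shape (n_obs, n_bands) -- clear observations of one pixel.
    Illustrative re-implementation via Weiszfeld's algorithm; the service
    itself uses the optimised hdstats package.
    """
    m = obs.mean(axis=0)                     # start from the arithmetic mean
    for _ in range(max_iter):
        d = np.linalg.norm(obs - m, axis=1)
        d = np.where(d < eps, eps, d)        # guard against division by zero
        w = 1.0 / d
        m_new = (w[:, None] * obs).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < eps:
            break
        m = m_new
    return m

# Hypothetical example: five observations of one pixel in three bands,
# the last one a residual cloudy outlier that escaped masking.
obs = np.array([[1200., 1500., 1300.],
                [1210., 1490., 1310.],
                [1190., 1510., 1290.],
                [1205., 1505., 1305.],
                [9000., 9500., 9200.]])      # outlier
print(geomedian(obs))                        # stays close to the cluster
```

Unlike a per-band mean, the result is robust to the residual outlier, and unlike a per-band median it keeps the bands consistent with one another, yielding a physically plausible spectrum for each pixel.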
Variations between the geomedian and the individual measurements are captured by the three Median Absolute Deviation (MAD) layers. These are higher-order statistical measurements calculating variation relative to the geomedian. The MAD layers can be used on their own or together with the geomedian to gain insights about the land surface and understand change over time.

Key Properties

Geographic Coverage: Continental Africa - approximately 37° North to 35° South
Temporal Coverage: 2017 – 2022*
Spatial Resolution: 10 x 10 meter
Update Frequency: Annual from 2017 - 2022
Product Type: Surface Reflectance (SR)
Product Level: Analysis Ready (ARD)
Number of Bands: 14 Bands
Parent Dataset: Sentinel-2 Level-2A Surface Reflectance
Source Data Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)
Service Coordinate System: WGS 84 / NSIDC EASE-Grid 2.0 Global (EPSG:6933)

*Time is enabled on this service using UTC – Coordinated Universal Time. To ensure you are seeing the correct year for each annual slice of data, the time zone must be set specifically to UTC in the Map Viewer settings each time this layer is opened in a new map. More information on this setting can be found here: Set the map time zone.

Applications

This service is ideal for:
* Longer-term time series analysis
* Cloud-free imagery
* Statistical accuracy

Available Bands

Band ID | Description | Value range | Data type | No data value
B02 | Geomedian B02 (Blue) | 1 - 10000 | uint16 | 0
B03 | Geomedian B03 (Green) | 1 - 10000 | uint16 | 0
B04 | Geomedian B04 (Red) | 1 - 10000 | uint16 | 0
B05 | Geomedian B05 (Red edge 1) | 1 - 10000 | uint16 | 0
B06 | Geomedian B06 (Red edge 2) | 1 - 10000 | uint16 | 0
B07 | Geomedian B07 (Red edge 3) | 1 - 10000 | uint16 | 0
B08 | Geomedian B08 (Near infrared (NIR) 1) | 1 - 10000 | uint16 | 0
B8A | Geomedian B8A (NIR 2) | 1 - 10000 | uint16 | 0
B11 | Geomedian B11 (Short-wave infrared (SWIR) 1) | 1 - 10000 | uint16 | 0
B12 | Geomedian B12 (SWIR 2) | 1 - 10000 | uint16 | 0
SMAD | Spectral Median Absolute Deviation | 0 - 1 | float32 | NaN
EMAD | Euclidean Median Absolute Deviation | 0 - 31623 | float32 | NaN
BCMAD | Bray-Curtis Median Absolute Deviation | 0 - 1 | float32 | NaN
COUNT | Number of clear observations | 1 - 65535 | uint16 | 0

Bands can be subdivided as follows:

Geomedian — 10 bands: The geomedian is calculated using the spectral bands of data collected during the specified time period. Surface reflectance values have been scaled between 1 and 10000 to allow for more efficient data storage as unsigned 16-bit integers (uint16). Note that parent datasets often contain more bands, some of which are not used in GeoMAD. The geomedian band IDs correspond to bands in the parent Sentinel-2 Level-2A data. For example, the Annual GeoMAD band B02 contains the annual geomedian of the Sentinel-2 B02 band.

Median Absolute Deviations (MADs) — 3 bands: Deviations from the geomedian are quantified through median absolute deviation calculations. The GeoMAD service utilises three MADs, each stored in a separate band: Euclidean MAD (EMAD), spectral MAD (SMAD), and Bray-Curtis MAD (BCMAD). Each MAD is calculated using the same ten bands as in the geomedian. SMAD and BCMAD are normalised ratios, therefore they are unitless and their values always fall between 0 and 1. EMAD is a function of surface reflectance but is neither a ratio nor normalised, therefore its valid value range depends on the number of bands used in the geomedian calculation.

Count — 1 band: The number of clear satellite measurements of a pixel for that calendar year. This is around 60 annually, but doubles in areas of overlap between scenes. "Count" is not incorporated in either the geomedian or MADs calculations. It is intended for metadata analysis and data validation.

Processing

All clear observations for the given time period are collated from the parent dataset. Cloudy pixels are identified and excluded. The geomedian and MADs calculations are then performed by the hdstats package. Annual GeoMAD datasets for the period use hdstats version 0.2.

More details on this dataset can be found here.
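The three MAD layers can be sketched from their published definitions: EMAD is the median Euclidean distance to the geomedian (which is why its upper bound, 31623 ≈ √10 × 10000, depends on the ten scaled bands), SMAD is the median cosine (spectral) distance, and BCMAD is the median Bray-Curtis dissimilarity. The Python below is an illustrative re-implementation under those definitions, not the production hdstats code:

```python
import numpy as np

def mads(obs, gm):
    """Median Absolute Deviations of clear observations around a geomedian.

    obs: (n_obs, n_bands) scaled surface reflectance for one pixel.
    gm:  (n_bands,) geomedian of the same pixel.
    Illustrative sketch only; the service computes these with hdstats.
    """
    # EMAD: median Euclidean distance, on the reflectance scale.
    emad = np.median(np.linalg.norm(obs - gm, axis=1))
    # SMAD: median cosine distance, unitless in [0, 1].
    cos = (obs @ gm) / (np.linalg.norm(obs, axis=1) * np.linalg.norm(gm))
    smad = np.median(1.0 - cos)
    # BCMAD: median Bray-Curtis dissimilarity, unitless in [0, 1].
    bcmad = np.median(np.abs(obs - gm).sum(axis=1) / (obs + gm).sum(axis=1))
    return emad, smad, bcmad

# Hypothetical stable pixel: four similar observations in three bands.
obs = np.array([[1200., 1500., 1300.],
                [1210., 1490., 1310.],
                [1190., 1510., 1290.],
                [1205., 1505., 1305.]])
gm = np.median(obs, axis=0)   # stand-in for the true geomedian
emad, smad, bcmad = mads(obs, gm)
print(emad, smad, bcmad)      # all small for a temporally stable pixel
```

A stable pixel yields small values in all three layers, while land-cover change or noisy observations push them up, which is what makes the MADs useful for change detection alongside the geomedian itself.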
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains sample output data for TELL. The sample dataset includes four years of future data (2039, 2059, 2079, and 2099) that come from IM3's future WRF runs under the RCP 8.5 climate scenario with SSP5 population forcing. Note that the GCAM-USA output used in this simulation is sample data only; as such, the quantitative results from this set of sample output should not be considered valid.
Data files and results for massypup64 (http://www.lababi.bioprocess.org/index.php/14-sample-data-articles/78-massypup):
* Proteomics
* Metabolomics
* Data Mining
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Each R script replicates all of the example code from one chapter of the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you use any of these data sets for research purposes.