We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set.
This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.
It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.
Format:
Abstract
The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.
Availability
Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.
Description
Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.
File format: R workspace file.
Metadata (including data dictionary)
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of 'true' critical window locations/magnitudes (i.e., the ground truth that we want to estimate).
This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
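As a quick orientation for the standardization described above, here is a minimal sketch (Python with a hypothetical exposure matrix; the released file is an R workspace whose z matrix is already standardized):

    # Illustrative sketch of the per-week standardization described above:
    # subtract the weekly median exposure and divide by the weekly IQR.
    # The array below is hypothetical; the released data already contain the
    # standardized values and do not include the medians or IQRs.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 200, 40                           # hypothetical individuals and weeks
    z_raw = rng.gamma(2.0, 5.0, (n, m))      # hypothetical raw weekly exposures

    weekly_median = np.median(z_raw, axis=0)
    q75, q25 = np.percentile(z_raw, [75, 25], axis=0)
    weekly_iqr = q75 - q25

    z = (z_raw - weekly_median) / weekly_iqr  # standardized exposures (cf. z above)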
The integration of FAIR (Findability, Accessibility, Interoperability, and Reusability) standards in both open science and open education holds transformative potential for advancing STEM education and broadening participation in scientific research. Despite the well-documented benefits of the open practices associated with FAIR standards, adoption remains limited. Science gateways are a particularly attractive area for FAIR implementation, as they form a distributed network of user-friendly web portals supporting access for domain researchers to high-performance computing, scientific data, software, and open educational resources. This paper introduces the Open Metadata Exchange (OME), a decentralized network of repositories enabling the sharing of content metadata and resources. The OME network provides a novel solution to FAIRification challenges associated with the distributed nature of science gateways by directly addressing resource persistence in the context of sustainability and decommissioning.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mouse ID, diet group, colon region, study cohort, analytical batch, and sample weight data
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a specialized subset of the OpenCitations Meta RDF data, focusing exclusively on data related to page numbers of bibliographic resources, known as manifestations (http://purl.org/spar/fabio/Manifestation). It contains all the bibliographic metadata and its provenance information, structured specifically around manifestations (page numbers), in JSON-LD format.
The inner folders are named after the supplier prefix of the contained entities; this prefix identifies the index the entities belong to (e.g., OpenCitations Meta corresponds to 06*0).
Below that, the folders have numeric names that refer to the range of contained entities. For example, the 10000 folder contains entities from 1 to 10000. Inside, you can find the zipped RDF data.
At the same level, additional folders containing the provenance are named with the same criteria: the 1000 folder, for instance, includes the provenance of the entities from 1 to 1000. The provenance is located inside a folder called prov, also in zipped JSON-LD format.
For example, data related to a given entity is located in the folder /re/06250/10000/1000/1000.zip, while information about its provenance is in /re/06250/10000/1000/prov/se.zip.
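As an illustration of this naming scheme, a short Python sketch (an assumption based on the description above, not an official OpenCitations utility) that assembles the data and provenance paths from an entity's sequential number:

    # Illustrative sketch, assuming the bucketing described above: folders group
    # entities in blocks of 10000 and 1000, and provenance sits in a "prov"
    # subfolder. The helper names are made up for this example.
    import math

    def bucket(n: int, size: int) -> int:
        # smallest multiple of `size` covering entity number n (e.g. 42, 1000 -> 1000)
        return math.ceil(n / size) * size

    def entity_paths(entity_type: str, supplier_prefix: str, n: int):
        b10k, b1k = bucket(n, 10000), bucket(n, 1000)
        data = f"/{entity_type}/{supplier_prefix}/{b10k}/{b1k}/{b1k}.zip"
        prov = f"/{entity_type}/{supplier_prefix}/{b10k}/{b1k}/prov/se.zip"
        return data, prov

    # entity number 42 of the resource embodiment ("re") entities under prefix 06250
    print(entity_paths("re", "06250", 42))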
Additional information about OpenCitations Meta is available on the official webpage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a specialized subset of the OpenCitations Meta RDF data, focusing exclusively on data related to agent roles of bibliographic resources (http://purl.org/spar/pro/RoleInTime). These agents can be authors, editors, or publishers. It contains all the metadata and its provenance information, structured specifically around agent roles, in JSON-LD format.
The inner folders are named after the supplier prefix of the contained entities; this prefix identifies the index the entities belong to (e.g., OpenCitations Meta corresponds to 06*0).
Below that, the folders have numeric names that refer to the range of contained entities. For example, the 10000 folder contains entities from 1 to 10000. Inside, you can find the zipped RDF data.
At the same level, additional folders containing the provenance are named with the same criteria: the 1000 folder, for instance, includes the provenance of the entities from 1 to 1000. The provenance is located inside a folder called prov, also in zipped JSON-LD format.
For example, data related to a given entity is located in the folder /ar/06250/10000/1000/1000.zip, while information about its provenance is in /ar/06250/10000/1000/prov/se.zip.
Additional information about OpenCitations Meta is available on the official webpage.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This is a dataset about all the notebooks in the Meta Kaggle Code dataset. The original dataset is owned by the Kaggle team; this one simply extracts metadata about Meta Kaggle Code. The dataset contains the following columns, whose descriptions are given below. If you have feedback, you can post in the Discussions or create a new topic. I hope you like the dataset and that you will use it for the Meta Kaggle Hackathon.
Cheers, Ayush
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Metadata of Machine Learning videos on YouTube.
This dataset contains metadata for 500 machine learning videos: simply the first 500 results when you search for "machine learning" on YouTube.
Data scraped from https://wiki.digitalmethods.net/Dmi/ToolDatabase. Cover photo: Photo by Rachit Tank on Unsplash.
Motivation: dataset by Gabriel Preda.
Using this dataset, you can analyse the popularity of machine learning videos and channels through their like and dislike counts.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is a specialized subset of the OpenCitations Meta RDF data, focusing exclusively on data related to identifiers (http://purl.org/spar/datacite/Identifier) of bibliographic resources. It contains all the metadata and its provenance information, structured specifically around identifiers, in JSON-LD format.
The inner folders are named after the supplier prefix of the contained entities; this prefix identifies the index the entities belong to (e.g., OpenCitations Meta corresponds to 06*0).
Below that, the folders have numeric names that refer to the range of contained entities. For example, the 10000 folder contains entities from 1 to 10000. Inside, you can find the zipped RDF data.
At the same level, additional folders containing the provenance are named with the same criteria: the 1000 folder, for instance, includes the provenance of the entities from 1 to 1000. The provenance is located inside a folder called prov, also in zipped JSON-LD format.
For example, data related to a given entity is located in the folder /id/06250/10000/1000/1000.zip, while information about its provenance is in /id/06250/10000/1000/prov/se.zip.
Additional information about OpenCitations Meta is available on the official webpage.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is related to the manuscript "An empirical meta-analysis of the life sciences linked open data on the web" published in Nature Scientific Data. If you use the dataset, please cite the manuscript as follows: Kamdar, M.R., Musen, M.A. An empirical meta-analysis of the life sciences linked open data on the web. Sci Data 8, 24 (2021). https://doi.org/10.1038/s41597-021-00797-y
We have extracted schemas from more than 80 publicly available biomedical linked data graphs in the Life Sciences Linked Open Data (LSLOD) cloud into an LSLOD schema graph and conducted an empirical meta-analysis to evaluate the extent of semantic heterogeneity across the LSLOD cloud. The dataset published here contains the following files:
- The set of Linked Data Graphs from the LSLOD cloud from which schemas are extracted.
- Refined sets of extracted classes, object properties, data properties, and datatypes, shared across the Linked Data Graphs on the LSLOD cloud. Where a schema element is reused from a Linked Open Vocabulary or an ontology, this is explicitly indicated.
- The LSLOD Schema Graph, which contains all the above extracted schema elements interlinked with each other based on the underlying content. Sample instances and sample assertions are also provided, along with broad-level characteristics of the modeled content. The LSLOD Schema Graph is saved as a JSON Pickle file. To read the JSON object in this Pickle file, use the Python command as follows: with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile: x = pickle.load(infile, encoding='iso-8859-1')
Check the referenced link for more details on this research, raw data files, and code references.
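A runnable version of that loading command (it only needs the standard-library pickle module; the file name is as given above):

    # Load the LSLOD Schema Graph from the JSON Pickle file described above.
    import pickle

    with open('LSLOD-Schema-Graph.json.pickle', 'rb') as infile:
        x = pickle.load(infile, encoding='iso-8859-1')

    print(type(x))  # inspect the top-level object that was stored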
https://www.caida.org/about/legal/aua/public_aua/
Metadata for all passive monthly traces, including the chicago and sanjose monitors. This includes the files used to generate the public trace stats.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Compared to the previous version, this release includes metadata related to citing and cited bibliographic resources added in the November 2024 version of Crossref, as well as the November 2024 dump of JaLC (Japan Link Center).
In this version, we have focused on correcting a specific type of error, namely the erroneous duplication of resources with the same identifier. We have successfully merged:
- 100% of duplicated identifiers (datacite:Identifier)
- 100% of duplicated responsible agents (foaf:Agent)
- 70% of duplicated bibliographic resources (fabio:Expression)
This dataset contains all the bibliographic metadata and its provenance information (in JSON-LD format) included in OpenCitations Meta. The data and the provenance are organized through a complex structure of folders and subfolders, allowing you to quickly find any entity from its URI. The first level consists of the following folders, provided compressed and separately:
[folder "ar"]: contains the data and provenance of the agent role entities (http://purl.org/spar/pro/RoleInTime);
[folder "br"]: contains the data and provenance of the entities of type bibliographic resource (http://purl.org/spar/fabio/Expression);
[folder "id"]: contains the data and provenance of the identifier entities (http://purl.org/spar/datacite/Identifier);
[folder "ra"]: contains the data and provenance of the responsible agent entities (http://xmlns.com/foaf/0.1/Agent);
[folder "re"]: contains the data and provenance of the resource embodiment entities (http://purl.org/spar/fabio/Manifestation).
The inner folders are named after the supplier prefix of the contained entities; this prefix identifies the index the entities belong to (e.g., OpenCitations Meta corresponds to 06*0).
Below that, the folders have numeric names that refer to the range of contained entities. For example, the 10000 folder contains entities from 1 to 10000. Inside, you can find the zipped RDF data.
At the same level, additional folders containing the provenance are named with the same criteria: the 1000 folder, for instance, includes the provenance of the entities from 1 to 1000. The provenance is located inside a folder called prov, also in zipped JSON-LD format.
For example, data related to a given entity is located in the folder /br/06250/10000/1000/1000.zip, while information about its provenance is in /br/06250/10000/1000/prov/1000.zip.
This version of the dataset contains:
- 121,302,680 bibliographic entities
- 368,061,399 authors, 2,718,222 editors, and 101,612,475 publishers (counted by their roles, without disambiguating individuals)
- 698,995 publication venues
The compressed archives total 47 GB, using the tar.gz compression algorithm, and expand to 145 GB when decompressed. The JSON-LD files inside the archives are further compressed using the zip algorithm. It is recommended to process these inner files as compressed, without extracting them, to manage the data more efficiently.
Additional information about OpenCitations Meta is available on the official webpage.
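As a sketch of that recommended workflow of processing the inner files without extracting them (an illustration only, not an official OpenCitations tool; the local path is hypothetical):

    # Read one zipped JSON-LD file directly from the dump layout described above.
    import json
    import zipfile

    path = "br/06250/10000/1000/1000.zip"     # hypothetical data file from the dump

    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():            # the archive holds JSON-LD documents
            with zf.open(name) as fh:
                doc = json.load(fh)
            print(name, type(doc))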
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
A special dataset that contains metadata for all the published datasets. Dataset profile fields conform to the Dublin Core standard.
Other
You can download metadata for individual datasets via the links provided in their descriptions.
Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary: https://ukpowernetworks.opendatasoft.com/pages/glossary/
Open Government Licence: http://reference.data.gov.uk/id/open-government-licence
A dataset of all the meta-data for all of the datasets available through the data.gov.uk service. This is provided as a zipped CSV or JSON file. It is published nightly.
Updates: 27 Sep 2017: we've moved all the previous dumps to an S3 bucket at https://dgu-ckan-metadata-dumps.s3-eu-west-1.amazonaws.com/ - This link is now listed here as a data file.
From 13/10/16 we added a .v2.jsonl dump, which is set to replace the .json dump (the latter will be discontinued after a 3-month transition). This is produced using 'ckanapi dump'. It provides an enhanced version of each dataset ('validated', or what you get from package_show in CKAN API v3 - the old JSON was the unvalidated version). It now includes full details of the organization the dataset is in, rather than just the owner_id, plus the results of the archival & QA checks for each dataset and resource, showing whether the link is broken, the detected format, and the openness stars. It also benefits from being in JSON Lines (http://jsonlines.org/) format, so you don't need to load the whole thing into memory to parse the JSON - just a line at a time.
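A minimal sketch of reading the JSON Lines dump one record at a time (the file name is hypothetical; the keys follow the usual CKAN package fields):

    # Stream the .v2.jsonl dump without loading it all into memory.
    import json

    with open("data.gov.uk.v2.jsonl", encoding="utf-8") as fh:
        for line in fh:                        # one dataset per line
            dataset = json.loads(line)
            print(dataset["name"], len(dataset.get("resources", [])))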
On 12/1/2015 the organisation of the CSV was changed:
Before this date, each dataset was one line, with resources added as numbered columns. Since a dataset may have up to 300 resources, the file ends up with 1025 columns, which is wider than many versions of Excel and LibreOffice will open, and the uncompressed size of 170 MB is more than most will deal with too. It is suggested you load it into a database, handle it with a Python or Ruby script, or use tools such as Refine or Google Fusion Tables.
After this date, the datasets are provided in one CSV and the resources in another. When you want to join them, you can do so using the (dataset) "Name" column. These are now manageable in spreadsheet software.
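A minimal sketch of such a join (file and column names are hypothetical; adjust them to the headers in the actual dump):

    # Join the resources CSV to the datasets CSV on the dataset "Name" column.
    import pandas as pd

    datasets = pd.read_csv("datasets.csv")
    resources = pd.read_csv("resources.csv")

    # "Dataset Name" is a placeholder for whichever resources column holds the
    # parent dataset's Name in the real dump.
    joined = resources.merge(datasets, left_on="Dataset Name", right_on="Name",
                             how="left")
    print(joined.head())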
You can also use the standard CKAN API if you want to search or get a small section of the data. Please respect the traffic limits in the API: http://data.gov.uk/terms-and-conditions
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The cante2midi dataset contains 20 tracks taken from the corpus and includes a large variety of styles and complexity with respect to melodic ornamentation. We provide note-level transcriptions of the singing voice melody in a MIDI-like format, where each note is defined by onset time, duration and a quantized MIDI pitch. In addition, we provide a number of low-level descriptors and the fundamental frequency corresponding to the predominant melody for each track. The meta-information includes editorial meta-data and the musicBrainz IDs.
Content:
README (5KB): Text file containing detailed descriptions of manual and automatic annotations.
meta-data (10KB): XML file containing meta-information: Source (anthology name, CD no. and track no.) and editorial meta-data (artist name, title, style and musicBrainzID).
manual transcriptions (82KB): MIDI (.mid) and text files (.notes) containing manual note-level transcriptions of the singing voice.
automatic transcriptions (75KB): Text files (.notes) and MIDI files (.mid) containing automatic note-level transcriptions of the singing voice.
Bark band energies (39.9MB): Text files (.csv) containing the frame-wise extracted bark band energies.
predominant melody (6.2MB): Text files (.csv) containing the frame-wise extracted predominant melody.
low-level descriptors (7.9MB): Text files (.csv) containing a set of frame-wise extracted low-level features.
MFCCs (17.8MB): Text files (.csv) containing the frame-wise extracted mel-frequency cepstral coefficients (MFCCs).
Magnitude spectrum (709.1MB, optional): Text files (.csv) containing the frame-wise extracted magnitudes of the discrete Fourier transform (DFT).
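As a quick way to inspect the note-level transcriptions listed above, a minimal sketch using the pretty_midi library (this is not part of the dataset's own tooling, and the file name is hypothetical):

    # Print onset time, duration and quantized MIDI pitch for each transcribed note.
    import pretty_midi

    pm = pretty_midi.PrettyMIDI("transcription.mid")   # one of the .mid transcriptions
    for note in pm.instruments[0].notes:               # singing-voice track
        onset = note.start
        duration = note.end - note.start
        print(f"onset={onset:.3f}s  duration={duration:.3f}s  pitch={note.pitch}")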
Publications
This work has been accepted for publication in the ACM Journal on Computing and Cultural Heritage and is currently available on arXiv.
N. Kroher, J. M. Díaz-Báñez, J. Mora and E. Gómez (2015): Corpus COFLA: A research corpus for the Computational study of Flamenco Music. arXiv:1510.04029 [cs.SD cs.IR].
https://doi.org/10.1145/2875428
Conditions of use
The provided datasets are offered free of charge for internal non-commercial use. We do not grant any rights for redistribution or modification. All data collections were gathered by the COFLA team.
© COFLA 2015. All rights reserved.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains all the bibliographic metadata and its provenance information (in JSON-LD format) included in OpenCitations Meta. The data and the provenance are organized through a complex structure of folders and subfolders, allowing you to quickly find any entity from its URI. The first level consists of the following folders, provided compressed and separately:
[folder "ar"]: contains the data and provenance of the agent role entities (http://purl.org/spar/pro/RoleInTime);
[folder "br"]: contains the data and provenance of the entities of type bibliographic resource (http://purl.org/spar/fabio/Expression);
[folder "id"]: contains the data and provenance of the identifier entities (http://purl.org/spar/datacite/Identifier);
[folder "ra"]: contains the data and provenance of the responsible agent entities (http://xmlns.com/foaf/0.1/Agent);
[folder "re"]: contains the data and provenance of the resource embodiment entities (http://purl.org/spar/fabio/Manifestation).
The inner folders are named after the supplier prefix of the contained entities; this prefix identifies the index the entities belong to (e.g., OpenCitations Meta corresponds to 06*0).
Below that, the folders have numeric names that refer to the range of contained entities. For example, the 10000 folder contains entities from 1 to 10000. Inside, you can find the zipped RDF data.
At the same level, additional folders containing the provenance are named with the same criteria: the 1000 folder, for instance, includes the provenance of the entities from 1 to 1000. The provenance is located inside a folder called prov, also in zipped JSON-LD format.
For example, data related to a given entity is located in the folder /br/06250/10000/1000/1000.zip, while information about its provenance is in /br/06250/10000/1000/prov/1000.zip.
This version of the dataset contains:
The compressed archives total 46.5 GB, using the 7-zip compression algorithm, and expand to 66 GB when decompressed. The JSON-LD files inside the archives are further compressed using the zip algorithm. It is recommended to process these inner files as compressed without extracting them, to manage data more efficiently.
Additional information about OpenCitations Meta is available on the official webpage: https://download.opencitations.net/#meta
This metadata set describes the CSW interface of the metadata catalogue of the spatial data infrastructure of the Federal Maritime and Hydrographic Agency (GDI-BSH).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The corpusCOFLA is a collection of more than 1500 flamenco recordings which are representative of what is considered classical flamenco. All contained tracks are taken from 12 commercially available flamenco anthologies in order to minimize a possible bias towards geographic location, singer or record label. We provide the editorial meta-information together with the musicBrainz IDs for all tracks as well as the anthologies as XML documents.
Content:
corpus meta data (619KB): XML file containing editorial meta-information for all tracks: source (anthology, CD number, track number), artist, title, style and musicBrainzID.
anthology meta data (3KB): XML file containing editorial meta-information for all anthologies comprising the corpus: name, record label, year edition, year re-edition, number of CDs
Version 1 (released Nov 23rd, 2017):
the anthology "Antología del Cante Flamenco. Flamencología." is no longer commercially available and has been removed from the corpus
in the corpus meta-data, a field "style_annotated" has been added, which contains unified style annotations
singer names have been assigned unique identifiers
Publications
This work has been accepted for publication in the ACM Journal on Computing and Cultural Heritage and is currently available on arXiv.
N. Kroher, J. M. Díaz-Báñez, J. Mora and E. Gómez (2015): Corpus COFLA: A research corpus for the Computational study of Flamenco Music. arXiv:1510.04029 [cs.SD cs.IR].
https://doi.org/10.1145/2875428
Conditions of use
The provided datasets are offered free of charge for internal non-commercial use. We do not grant any rights for redistribution or modification. All data collections were gathered by the COFLA team.
© COFLA 2015. All rights reserved.
This dataset contains supplementary information for a manuscript describing the ESS-DIVE (Environmental Systems Science Data Infrastructure for a Virtual Ecosystem) data repository's community data and metadata reporting formats. The purpose of creating the ESS-DIVE reporting formats was to provide guidelines for formatting some of the diverse data types that can be found in the ESS-DIVE repository. The 6 teams of community partners who developed the reporting formats included scientists and engineers from across the Department of Energy National Lab network. Additionally, during the development process, 247 individuals representing 128 institutions provided input on the formats. The primary files in this dataset are 10 data and metadata crosswalks for ESS-DIVE's reporting formats (all files ending in _crosswalk.csv). The crosswalks compare elements used in each of the reporting formats to other related standards and data resources (e.g., repositories, datasets, data systems). This dataset also contains additional files recommended by ESS-DIVE's file-level metadata reporting format. Each data file has an associated dictionary (files ending in _dd.csv) which provides a brief description of each standard or data resource consulted in the data reporting format development process. The flmd.csv file describes each file contained within the dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The cante100 dataset contains 100 tracks taken from the corpus. We defined 10 style families and included 10 tracks from each. Apart from the style family, we manually annotated the sections of the track in which the vocals are present. In addition, we provide a number of low-level descriptors and the fundamental frequency corresponding to the predominant melody for each track. The meta-information includes editorial meta-data and the musicBrainz ID.
Content:
README (5KB): Text file containing detailed descriptions of manual and automatic annotations.
meta-data (59KB): XML file containing meta-information: Source (anthology name, CD no. and track no.), editorial meta-data (artist name, title, style, musicBrainzID) and the manually annotated style family.
vocal sections (8.9MB): Text file (.csv) containing frame-wise vocal section annotations.
automatic transcriptions (375KB): Text files (.notes) and MIDI files (.mid) containing automatic note-level transcriptions of the singing voice.
Bark band energies (216.6MB): Text files (.csv) containing the frame-wise extracted bark band energies.
predominant melody (33.5MB): Text files (.csv) containing the frame-wise extracted predominant melody.
low-level descriptors (42.9MB): Text files (.csv) containing a set of frame-wise extracted low-level features.
MFCCs (97.1MB): Text files (.csv) containing the frame-wise extracted mel-frequency cepstral coefficients (MFCCs).
Magnitude spectrum (3.85GB): Text files (.csv) containing the frame-wise extracted magnitudes of the discrete Fourier transform (DFT).
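For orientation, a minimal sketch of loading one of the frame-wise predominant melody files listed above (the two-column layout of time stamps and f0 values is an assumption here, not a documented format; adjust to the actual files):

    # Load a frame-wise predominant-melody CSV and report how many frames are voiced.
    import numpy as np

    frames = np.loadtxt("track_pitch.csv", delimiter=",")   # hypothetical file name
    times, f0 = frames[:, 0], frames[:, 1]                  # assumed column layout
    voiced = f0 > 0                                          # frames with a pitch estimate
    print(f"{voiced.mean():.1%} of frames carry a predominant-melody estimate")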
Publications
This work has been accepted for publication in the ACM Journal on Computing and Cultural Heritage and is currently available on arXiv.
N. Kroher, J. M. Díaz-Báñez, J. Mora and E. Gómez (2015): Corpus COFLA: A research corpus for the Computational study of Flamenco Music. arXiv:1510.04029 [cs.SD cs.IR].
https://doi.org/10.1145/2875428
Conditions of use
The provided datasets are offered free of charge for internal non-commercial use. We do not grant any rights for redistribution or modification. All data collections were gathered by the COFLA team.
© COFLA 2015. All rights reserved.
The data are described in detail in the uploaded file "Science hub metadata.docx". This dataset is associated with the following publication: Zhang, Y., J. Bash, S. Roselle, A. Shatas, A. Repinsky, R. Mathur, C. Hogrefe, J. Piziali, T. Jacobs, and A. Gilliland. Unexpected air quality impacts from implementation of green infrastructure in urban environments: a Kansas City Case Study. ENVIRONMENTAL SCIENCE & TECHNOLOGY. American Chemical Society, Washington, DC, USA, 744(20): 140960, (2020).