We include a description of the data sets in the metadata as well as sample code and results from a simulated data set. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: the R code is available online at https://github.com/warrenjl/SpGPCW. Abstract: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women. Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained.
While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement. Description Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting the median exposure amount on the given week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way; the medians and IQRs themselves are not given. This further protects against identification of the spatial locations used in the analysis. File format: R workspace file. Metadata (including data dictionary):
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)
This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
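The weekly median/IQR standardization described above can be sketched in a few lines. This is a hypothetical illustration on toy numbers, not code from the SpGPCW repository:

```python
from statistics import median, quantiles

def standardize_week(exposures):
    # Center on the weekly median and scale by the weekly IQR,
    # mirroring the standardization described for the exposure matrix z.
    med = median(exposures)
    q1, _, q3 = quantiles(exposures, n=4)  # quartile cut points
    return [(x - med) / (q3 - q1) for x in exposures]

# Toy weekly exposures for five simulated individuals
week = [2.0, 4.0, 6.0, 8.0, 10.0]
z_week = standardize_week(week)
```

Note that `statistics.quantiles` uses the exclusive method by default, so the IQR here may differ slightly from R's `IQR()` under its default quantile type.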
The U.S. National Centers for Environmental Prediction (NCEP) routinely ran the Medium Range Forecast (MRF) model twice daily (00 and 12 UTC) during the ACE-1 period. This dataset is the derived zonal means of model output fields. The forecasts from 12 to 48 hours (in 12 hour steps) are available in native CRAY format.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source, object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and producing graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike much existing statistical software, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible, and it allows users to define their own (customized) functions specifying how they expect the program to behave while handling the data; these can also be stored in the simple object system. For all intents and purposes, this book serves as both textbook and manual for R statistics, particularly in academic research, data analytics, and computer programming, intended to help inform and guide the work of R users and statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each in R. It gives a hands-on, step-by-step practical guide to identifying and conducting the different parametric and non-parametric procedures. This includes a description of the conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book.
It is the first book to provide a comprehensive description and a step-by-step, hands-on practical guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples: from importing and storing datasets in R as objects, through coding and calling the methods or functions for manipulating those datasets or objects (factorization and vectorization), to reasoning about, interpreting, and storing the results for future use, and producing graphical visualizations and representations. Thus, it brings together statistics and computer programming for research.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The southern portion of this range was adopted from the United States Fish and Wildlife Service (USFWS): https://ecos.fws.gov/ecp/species/8193.
Experimental population: The FWS may designate a population of a listed species as experimental if it will be released into suitable natural habitat outside the species’ current range. An experimental population is a special designation for a group of plants or animals that will be reintroduced in an area that is geographically isolated from other populations of the species. With the experimental population designation, the specified population is treated as threatened under the ESA, regardless of the species’ designation elsewhere in its range.
CWHR species range datasets represent the maximum current geographic extent of each species within California. Ranges were originally delineated at a scale of 1:5,000,000 by species-level experts more than 30 years ago and have gradually been revised at a scale of 1:1,000,000. Species occurrence data are used in defining species ranges, but range polygons may extend beyond the limits of extant occurrence data for a particular species. When drawing range boundaries, CDFW seeks to err on the side of commission rather than omission. This means that CDFW may include areas within a range based on expert knowledge or other available information, despite an absence of confirmed occurrences, which may be due to a lack of survey effort. The degree to which a range polygon is extended beyond occurrence data will vary among species, depending upon each species’ vagility, dispersal patterns, and other ecological and life history factors. The boundary line of a range polygon is drawn with consideration of these factors and is aligned with standardized boundaries including watersheds (NHD), ecoregions (USDA), or other ecologically meaningful delineations such as elevation contour lines. While CWHR ranges are meant to represent the current range, once an area has been designated as part of a species’ range in CWHR, it will remain part of the range even if there have been no documented occurrences within recent decades. An area is not removed from the range polygon unless experts indicate that it has not been occupied for a number of years after repeated surveys or is deemed no longer suitable and unlikely to be recolonized. It is important to note that range polygons typically contain areas in which a species is not expected to be found due to the patchy configuration of suitable habitat within a species’ range. In this regard, range polygons are coarse generalizations of where a species may be found. 
This data is available for download from the CDFW website: https://www.wildlife.ca.gov/Data/CWHR.
The following data sources were collated for the purposes of range mapping and species habitat modeling by RADMAP. Each focal taxon’s location data was extracted (when applicable) from the following list of sources. BIOS datasets are bracketed with their “ds” numbers and can be located on CDFW’s BIOS viewer: https://wildlife.ca.gov/Data/BIOS.
California Natural Diversity Database,
Terrestrial Species Monitoring [ds2826],
North American Bat Monitoring Data Portal,
VertNet,
Breeding Bird Survey,
Wildlife Insights,
eBird,
iNaturalist,
other available CDFW or partner data.
This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.
Dataset description
Parquet file, indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.
File contains aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.
The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.
The data subset used in this work comprises the following:
From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).
The dataset contains a total of 35,694 rows, one for each month of data collection from the participants. From these we can generate 3-month-long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1 and SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.
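The sample construction described above can be sketched as follows. The index format follows the [participant]_[month] convention described earlier, but the grouping logic (stepping in 3-month increments and requiring PHQ-9 months at both endpoints) is an illustrative assumption, not the project's actual code:

```python
def three_month_samples(row_ids):
    # Group "<participant>_<month>" row IDs into 3-month samples:
    # each sample spans SM0..SM3, so its endpoints are 3 months apart.
    by_participant = {}
    for rid in row_ids:
        pid, month = rid.rsplit("_", 1)
        by_participant.setdefault(pid, []).append(int(month))
    samples = []
    for pid, months in by_participant.items():
        present = set(months)
        start = min(months)
        while start + 3 <= max(months):
            if start in present and start + 3 in present:  # PHQ-9 at SM0 and SM3
                samples.append((pid, start, start + 3))
            start += 3  # assumption: consecutive samples share a boundary month
    return samples

ids = ["34_0", "34_1", "34_2", "34_3", "34_4", "34_5", "34_6", "7_0", "7_3"]
samples = three_month_samples(ids)
```

Participant 34 (months 0 through 6) yields two samples, and participant 7 (months 0 and 3 only) yields one.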
https://artefacts.ceda.ac.uk/licences/specific_licences/ecmwf-era-products.pdf
This dataset contains ensemble means of ERA5 surface-level analysis parameter data (see the linked dataset for spreads). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF); see the linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10-member ensemble, which is run at a reduced resolution compared with the single high-resolution 'HRES' realisation (hourly output at 31 km grid spacing), and have been produced to provide an uncertainty estimate. This dataset contains a limited selection of the available variables, converted to netCDF from the original GRIB files held on the ECMWF system. The data have also been translated onto a regular latitude-longitude grid during extraction from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool linked to from this record.
Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus was calculated by dividing by 10 rather than 9 (N-1). See the linked datasets for ensemble member and ensemble mean data.
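The distinction between ensemble spread (population standard deviation, divide by N) and sample standard deviation (divide by N-1) can be made concrete with a small sketch (illustrative values, not ERA5 data):

```python
from math import sqrt

def ensemble_spread(members):
    # Population standard deviation: divide by N (all 10 members,
    # control included), not by N-1 as in the sample formula.
    n = len(members)
    mean = sum(members) / n
    return sqrt(sum((m - mean) ** 2 for m in members) / n)

members = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
spread = ensemble_spread(members)                            # divides by 10
sample_sd = sqrt(sum((m - 5.5) ** 2 for m in members) / 9)   # divides by N-1
# spread is always slightly smaller than sample_sd
```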
The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. It follows on from the ERA-15, ERA-40 and ERA-Interim reanalysis projects.
An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data are subsequently reviewed before being released by ECMWF as quality-assured data within 3 months. CEDA holds a 6-month rolling copy of the latest ERA5t data; see the related datasets linked to from this record. However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, and new runs to address this issue were performed, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere", but users of data from this period should read technical memo 859 for further details.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
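The long-to-wide collapse described above (keeping the first sample festival per unique film) can be sketched as follows; the field names are illustrative, not the dataset's actual column names:

```python
def long_to_wide(rows):
    # Collapse a long-format table (one row per film-festival
    # appearance, assumed ordered chronologically) to wide format:
    # one row per unique film, keeping the first festival seen.
    wide = {}
    for row in rows:
        wide.setdefault(row["film_id"], row)
    return list(wide.values())

long_rows = [
    {"film_id": 1, "fest": "Berlinale"},   # February
    {"film_id": 1, "fest": "Frameline"},   # June, same film again
    {"film_id": 2, "fest": "Frameline"},
]
wide_rows = long_to_wide(long_rows)
# film 1 keeps "Berlinale", its first sample festival
```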
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for webscraping. They were written using R version 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with the suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods: “cosine” and “osa”. Cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
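The “osa” method mentioned above refers to the optimal string alignment (restricted Damerau-Levenshtein) distance, as implemented in R string-distance packages such as stringdist. A minimal pure-Python sketch of the idea, not the script's actual code:

```python
def osa(a, b):
    # Optimal string alignment distance: minimum number of edits
    # (insert, delete, substitute, or transpose two adjacent
    # characters) needed to turn string a into string b.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

dist = osa("abcd", "abdc")  # one adjacent transposition
```

A transposed pair costs one edit here, which is what makes the method forgiving of typos in film titles.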
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and identifies them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.
The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format: all information for each festival is listed in one row.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.
We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.
Please note that the examples provided are fictional and for illustrative purposes; you can tailor the descriptions and examples to match the specifics of your dataset. The dataset is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at rrana157@gmail.com.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Monitoring change in genetic diversity in wildlife populations across multiple scales could facilitate prioritization of conservation efforts. We used microsatellite genotypes from 7,080 previously collected genetic samples from across the greater sage-grouse (Centrocercus urophasianus) range to develop a modelling framework for estimating genetic diversity within a recently developed hierarchically nested monitoring framework (clusters). The majority of these genetic samples (n=6,560) were used in previous research (Oyler-McCance et al. 2014; Cross et al. 2018; Row et al. 2018). Genetic diversity values associated with clusters across multiple scales could facilitate the identification of areas with low genetic diversity and inform the potential management or conservation priority and response. We also report the data used to define genetic diversity thresholds of conservation concern and a full reporting of the genetic diversity estimates associated with the evaluated clusters.
Note: Updates to this data product are discontinued. Dozens of definitions of rural areas are currently used by Federal and State agencies, researchers, and policymakers. The ERS Rural Definitions data product allows users to make comparisons among nine representative rural definitions. Methods of designating the urban periphery range from the use of municipal boundaries to definitions based on counties. Definitions based on municipal boundaries may classify as rural much of what would typically be considered suburban. Definitions that delineate the urban periphery based on counties may include extensive segments of a county that many would consider rural. We have selected a representative set of nine alternative rural definitions and compare social and economic indicators from the 2000 decennial census across the nine definitions. We chose socioeconomic indicators (population, education, poverty, etc.) that are commonly used to highlight differences between urban and rural areas.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.
This dataset is well-suited for a range of use cases, including NLP tasks such as text processing and model training, educational and vocabulary-building tools, and other language-related applications.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CWHR species range datasets represent the maximum current geographic extent of each species within California. Ranges were originally delineated at a scale of 1:5,000,000 by species-level experts more than 30 years ago and have gradually been revised at a scale of 1:1,000,000. Species occurrence data are used in defining species ranges, but range polygons may extend beyond the limits of extant occurrence data for a particular species. When drawing range boundaries, CDFW seeks to err on the side of commission rather than omission. This means that CDFW may include areas within a range based on expert knowledge or other available information, despite an absence of confirmed occurrences, which may be due to a lack of survey effort. The degree to which a range polygon is extended beyond occurrence data will vary among species, depending upon each species’ vagility, dispersal patterns, and other ecological and life history factors. The boundary line of a range polygon is drawn with consideration of these factors and is aligned with standardized boundaries including watersheds (NHD), ecoregions (USDA), or other ecologically meaningful delineations such as elevation contour lines. While CWHR ranges are meant to represent the current range, once an area has been designated as part of a species’ range in CWHR, it will remain part of the range even if there have been no documented occurrences within recent decades. An area is not removed from the range polygon unless experts indicate that it has not been occupied for a number of years after repeated surveys or is deemed no longer suitable and unlikely to be recolonized. It is important to note that range polygons typically contain areas in which a species is not expected to be found due to the patchy configuration of suitable habitat within a species’ range. In this regard, range polygons are coarse generalizations of where a species may be found. 
This data is available for download from the CDFW website: https://www.wildlife.ca.gov/Data/CWHR. The following data sources were collated for the purposes of range mapping and species habitat modeling by RADMAP. Each focal taxon’s location data was extracted (when applicable) from the following list of sources. BIOS datasets are bracketed with their “ds” numbers and can be located on CDFW’s BIOS viewer: https://wildlife.ca.gov/Data/BIOS. California Natural Diversity Database, Terrestrial Species Monitoring [ds2826], North American Bat Monitoring Data Portal, VertNet, Breeding Bird Survey, Wildlife Insights, eBird, iNaturalist, other available CDFW or partner data.
Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.
Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.
Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.
Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.
Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.
Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.
Column Definitions:
• Dataset Name: Name of the dataset.
• Created By: Creator(s) of the dataset.
• Last Updated in number of days: Time elapsed since the last update.
• Usability Score: Score indicating ease of use.
• Number of File: Quantity of files included.
• Type of file: Format of the files (e.g., CSV, JSON).
• Size: Size of the dataset.
• Total Votes: Number of votes received.
• Category: Categorization of the dataset's subject matter.
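As an illustration of the schema above, each record can be modeled as a small typed structure. This is a hypothetical sketch in Python: the field names mirror the column definitions, and the sample values are invented, not taken from the dataset.

```python
from dataclasses import dataclass

@dataclass
class KaggleDatasetRecord:
    # Field names mirror the column definitions above; values below are invented.
    dataset_name: str        # Dataset Name
    created_by: str          # Created By
    last_updated_days: int   # Last Updated in number of days
    usability_score: float   # Usability Score (higher = easier to use)
    number_of_files: int     # Number of File
    file_type: str           # Type of file (e.g., CSV, JSON)
    size: str                # Size of the dataset, as reported (e.g., "12 MB")
    total_votes: int         # Total Votes
    category: str            # Category

# Example record with made-up values:
record = KaggleDatasetRecord(
    dataset_name="Example Weather Data",
    created_by="Jane Doe",
    last_updated_days=30,
    usability_score=8.8,
    number_of_files=3,
    file_type="CSV",
    size="12 MB",
    total_votes=1520,
    category="Weather",
)
print(record.usability_score)  # 8.8
```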
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Understanding species abundances and distributions, especially at local to landscape scales, is critical for land managers and conservationists to prioritize management decisions and informs the effort and expense that may be required. The metrics of range size and local abundance reflect aspects of the biology and ecology of a given species and, together with its per capita (or per unit area) effects on other members of the community, comprise a well-accepted theoretical paradigm describing invasive species. Although these metrics are readily calculated from vegetation monitoring data, they (and effect in particular) have not generally been applied to native species. We describe how metrics defining invasions may be more broadly applied to both native and invasive species in vegetation management, supporting their relevance to local scales of species conservation and management. We then use a sample monitoring dataset to compare range size, local abundance, and effect, as well as summary calculations of landscape penetration (range size × local abundance) and impact (landscape penetration × effect), for native and invasive species in the mixed-grass plant community of western North Dakota, USA. This paper uses these summary statistics to quantify the impact for 13 of 56 commonly encountered species, with statistical support for effects of 6 of the 13 species. Our results agree with knowledge of invasion severity and natural history of native species in the region. We contend that when managers are using invasion metrics in monitoring, extending them to common native species is biologically and ecologically informative, with little additional investment.
Resources in this dataset:
• Resource Title: Supporting Data (xlsx). File Name: Espeland-Sylvain-BiodivConserv-2019-raw-data.xlsx. Description: Occurrence data per quadrangle, site, and transect. Species codes and habitat identifiers are defined in a separate sheet.
• Resource Title: Data Dictionary. File Name: Espeland-Sylvain-BiodivConserv-2019-data-dictionary.csv. Description: Details species and habitat codes for the abundance data collected.
• Resource Title: Supporting Data (csv). File Name: Espeland-Sylvain-BiodivConserv-2019-raw-data.csv. Description: Occurrence data per quadrangle, site, and transect.
• Resource Title: Supplementary Table S1.1. File Name: 10531_2019_1701_MOESM1_ESM.docx. Description: Scientific name, common name, life history group, family, status (N = native, I = introduced), percent of plots present, and average cover when present of 56 vascular plant species recorded in 1196 undisturbed plots in federally managed grasslands of western North Dakota. Life history groups: C3 = cool season perennial grass, C4 = warm season perennial grass, SE = sedge, SH = shrub, PF = perennial forb, BF = biennial forb, APF = annual, biennial, or perennial forb.
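The summary calculations described above (landscape penetration = range size × local abundance; impact = landscape penetration × effect) can be sketched as follows. This is a hypothetical Python example; the species values are invented, not taken from the dataset.

```python
def landscape_penetration(range_size: float, local_abundance: float) -> float:
    """Range size (e.g., proportion of plots occupied) times local abundance
    (e.g., average cover when present)."""
    return range_size * local_abundance

def impact(range_size: float, local_abundance: float, effect: float) -> float:
    """Landscape penetration times per-unit-area effect on other community members."""
    return landscape_penetration(range_size, local_abundance) * effect

# Invented values for a hypothetical species:
rs, la, ef = 0.40, 12.5, 0.8  # 40% of plots, 12.5% cover when present, effect of 0.8
print(landscape_penetration(rs, la))  # 5.0
print(impact(rs, la, ef))             # 4.0
```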
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The credit card dataset comprises various attributes that capture essential information about individual transactions. Each entry in the dataset is uniquely identified by an 'ID', which aids in precise record-keeping and analysis. The 'V1-V28' features encompass a wide range of transaction-related details, including time, location, type, and several other parameters. These attributes collectively provide a comprehensive snapshot of each transaction. 'Amount' denotes the monetary value involved in the transaction, indicating the specific charge or credit associated with the card. Lastly, the 'Class' attribute plays a pivotal role in fraud detection, categorizing transactions into distinct classes like 'legitimate' and 'fraudulent'. This classification is instrumental in identifying potentially suspicious activities, helping financial institutions safeguard against fraudulent transactions. Together, these attributes form a crucial dataset for studying and mitigating risks associated with credit card transactions.
• ID: This is likely a unique identifier for a specific credit card transaction. It helps in keeping track of individual transactions and distinguishing them from one another.
• V1-V28: These are possibly features or attributes associated with the credit card transaction. They might include information such as time, amount, location, type of transaction, and various other details that can be used for analysis and fraud detection.
• Amount: This refers to the monetary value involved in the credit card transaction. It indicates how much money was either charged or credited to the card during that particular transaction.
• Class: This is an important attribute indicating the category or type of the transaction. It typically classifies transactions into different groups, like 'fraudulent' or 'legitimate'. This classification is crucial for identifying potentially suspicious or fraudulent activities.
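To illustrate how the 'Class' column supports fraud screening, the following hypothetical Python sketch separates flagged transactions. The rows are invented, and the 0/1 encoding (1 = fraudulent) is an assumption based on the common convention for such datasets.

```python
# Each transaction is a dict keyed by the columns described above
# (ID, Amount, Class); the rows and the 0/1 encoding are illustrative.
transactions = [
    {"ID": 1, "Amount": 25.00, "Class": 0},   # 0 = legitimate (assumed encoding)
    {"ID": 2, "Amount": 999.99, "Class": 1},  # 1 = fraudulent (assumed encoding)
    {"ID": 3, "Amount": 12.50, "Class": 0},
]

fraudulent = [t for t in transactions if t["Class"] == 1]
legitimate = [t for t in transactions if t["Class"] == 0]

print([t["ID"] for t in fraudulent])  # [2]
```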
The pathway representation consists of segment and intersection elements. A segment is a linear graphic element that represents a continuous physical travel path terminated by a path end (dead end) or a physical intersection with other travel paths. Segments have one street name, one address range, and one set of segment characteristics. A segment may have zero or more alias street names. Segment types included are Freeways, Highways, Streets, Alleys (named only), Railroads, Walkways, and Bike lanes. SNDSEG_PV is a linear feature class representing the SND Segment feature, with attributes for the Street Name, Address Range, Alias Street Name, and Segment Characteristics objects. Part of the Address Range object and all of the Street Name objects are logically shared with the Discrete Address Point-Master Address File layer.
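The segment feature described above can be sketched as a simple data structure. This is a hypothetical Python model; the field and type names are illustrative, not the actual SNDSEG_PV schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class SegmentType(Enum):
    # The segment types listed in the description above.
    FREEWAY = "Freeway"
    HIGHWAY = "Highway"
    STREET = "Street"
    ALLEY = "Alley"          # named alleys only
    RAILROAD = "Railroad"
    WALKWAY = "Walkway"
    BIKE_LANE = "Bike lane"

@dataclass
class Segment:
    street_name: str                 # exactly one street name
    address_range: tuple             # one address range, e.g. (low, high)
    segment_type: SegmentType        # part of the segment characteristics
    alias_names: list = field(default_factory=list)  # zero or more aliases

# Invented example segment:
seg = Segment("Main St", (100, 199), SegmentType.STREET, alias_names=["State Route 1"])
print(seg.segment_type.value)  # Street
```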
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure for that week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way; the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.
This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.
It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.
Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)
Code Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript.
We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.
Description: “CWVS_LMC.txt” is delivered to the user as a .txt file containing R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the code in this file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities. “Results_Summary.txt” is also delivered as a .txt file containing R code. Once the “CWVS_LMC.txt” code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).
Required R packages:
• For running “CWVS_LMC.txt”:
• msm: sampling from the truncated normal distribution
• mnormt: sampling from the multivariate normal distribution
• BayesLogit: sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
• plotrix: plotting the posterior means and credible intervals
Instructions for Use and Reproducibility: What can be reproduced: the data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 of the presented simulation study. How to use the information:
• Load the “Simulated_Dataset.RData” workspace.
• Run the code contained in “CWVS_LMC.txt”.
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.
Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.
Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.
Description and Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure for that week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way; the medians and IQRs are not given. This further protects the identifiability of the spatial locations used in the analysis.
This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
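The weekly standardization described above (subtract the week-specific median, divide by the week-specific IQR) can be sketched as follows. This is a hypothetical Python example with invented exposure values, using one common quartile convention; the provided dataset already contains values standardized this way, and the actual analysis code is in R.

```python
from statistics import median

def standardize_week(exposures):
    """Standardize one week's exposures: subtract the median, divide by the IQR.

    Quartiles are computed here as medians of the lower and upper halves
    (one common convention; other quartile definitions exist).
    """
    s = sorted(exposures)
    n = len(s)
    med = median(s)
    q1 = median(s[: n // 2])        # lower-half median (first quartile)
    q3 = median(s[(n + 1) // 2 :])  # upper-half median (third quartile)
    iqr = q3 - q1
    return [(x - med) / iqr for x in exposures]

# Invented raw exposures for one pregnancy week across five individuals:
week = [4.0, 7.0, 9.0, 11.0, 14.0]
print(standardize_week(week))  # median maps to 0; spread is scaled by the IQR
```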