62 datasets found
  1. Meta data and supporting documentation

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    We include a description of the data sets in the metadata, as well as sample code and results from a simulated data set. This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: the R code is available online at https://github.com/warrenjl/SpGPCW.

    Format

    Abstract: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Description

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    File format: R workspace file.

    Metadata (including data dictionary)

    • y: Vector of binary responses (1: preterm birth, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
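
    The week-by-week standardization described for this dataset (subtract the weekly median, divide by the weekly IQR) can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name, the toy matrix, and the quartile convention (Python's "inclusive" method) are all assumptions made for illustration, since the source does not say how the quartiles were computed.

```python
from statistics import median, quantiles


def standardize_by_week(exposures):
    """Return a copy of `exposures` (rows: individuals, columns: weeks)
    with each column centred on its median and scaled by its IQR.

    Assumes every week has a non-zero IQR; a constant column would
    cause a division by zero.
    """
    n_weeks = len(exposures[0])
    result = [[0.0] * n_weeks for _ in exposures]
    for week in range(n_weeks):
        column = [row[week] for row in exposures]
        # Quartile convention is an assumption, not taken from the source.
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        med = median(column)
        iqr = q3 - q1
        for i, value in enumerate(column):
            result[i][week] = (value - med) / iqr
    return result


# Hypothetical toy matrix: 5 individuals (rows) x 2 weeks (columns).
z_raw = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
    [5.0, 50.0],
]
z = standardize_by_week(z_raw)
```

    Because each week is scaled by its own median and IQR, the two toy columns end up with identical standardized values even though their raw scales differ by a factor of ten.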

  2. NCEP Medium Range Forecast Model Zonal Means

    • data.ucar.edu
    • ckanprod.data-commons.k8s.ucar.edu
    Updated Oct 7, 2025
    Cite
    (2025). NCEP Medium Range Forecast Model Zonal Means [Dataset]. http://doi.org/10.26023/3A4F-AHJ7-MQ0W
    Dataset updated
    Oct 7, 2025
    Time period covered
    Oct 30, 1995 - Jan 6, 1996
    Description

    The U.S. National Centers for Environmental Prediction (NCEP) routinely ran the Medium Range Forecast (MRF) model twice daily (00 and 12 UTC) during the ACE-1 period. This dataset is the derived zonal means of model output fields. The forecasts from 12 to 48 hours (in 12 hour steps) are available in native CRAY format.

  3. Collection of example datasets used for the book - R Programming -...

    • figshare.com
    txt
    Updated Dec 4, 2023
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Available download formats: txt
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers, and explains how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language, with a development environment (IDE) called RStudio, for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions specifying how they expect the program to behave while handling the data, which can also be stored in the simple object system.

    For all intents and purposes, this book serves as both textbook and manual for R statistics, particularly in academic research, data analytics, and computer programming, and is intended to help inform and guide the work of R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book.

    It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples. It ranges from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. The result is a congruence of statistics and computer programming for research.

  4. California Condor Range - CWHR B109 [ds916]

    • data.ca.gov
    • data.cnra.ca.gov
    • +7 more
    Updated Nov 7, 2025
    Cite
    California Department of Fish and Wildlife (2025). California Condor Range - CWHR B109 [ds916] [Dataset]. https://data.ca.gov/dataset/california-condor-range-cwhr-b109-ds916
    Available download formats: ashx, arcgis geoservices rest api, csv, zip, geojson, kml, html
    Dataset updated
    Nov 7, 2025
    Dataset authored and provided by
    California Department of Fish and Wildlife (https://wildlife.ca.gov/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    California
    Description

    The southern portion of this range was adopted from the United States Fish and Wildlife Service (USFWS): https://ecos.fws.gov/ecp/species/8193.

    Experimental population: The FWS may designate a population of a listed species as experimental if it will be released into suitable natural habitat outside the species’ current range. An experimental population is a special designation for a group of plants or animals that will be reintroduced in an area that is geographically isolated from other populations of the species. With the experimental population designation, the specified population is treated as threatened under the ESA, regardless of the species’ designation elsewhere in its range.

    CWHR species range datasets represent the maximum current geographic extent of each species within California. Ranges were originally delineated at a scale of 1:5,000,000 by species-level experts more than 30 years ago and have gradually been revised at a scale of 1:1,000,000. Species occurrence data are used in defining species ranges, but range polygons may extend beyond the limits of extant occurrence data for a particular species. When drawing range boundaries, CDFW seeks to err on the side of commission rather than omission. This means that CDFW may include areas within a range based on expert knowledge or other available information, despite an absence of confirmed occurrences, which may be due to a lack of survey effort. The degree to which a range polygon is extended beyond occurrence data will vary among species, depending upon each species’ vagility, dispersal patterns, and other ecological and life history factors. The boundary line of a range polygon is drawn with consideration of these factors and is aligned with standardized boundaries including watersheds (NHD), ecoregions (USDA), or other ecologically meaningful delineations such as elevation contour lines. While CWHR ranges are meant to represent the current range, once an area has been designated as part of a species’ range in CWHR, it will remain part of the range even if there have been no documented occurrences within recent decades. An area is not removed from the range polygon unless experts indicate that it has not been occupied for a number of years after repeated surveys or is deemed no longer suitable and unlikely to be recolonized. It is important to note that range polygons typically contain areas in which a species is not expected to be found due to the patchy configuration of suitable habitat within a species’ range. In this regard, range polygons are coarse generalizations of where a species may be found. 
This data is available for download from the CDFW website: https://www.wildlife.ca.gov/Data/CWHR.

    The following data sources were collated for the purposes of range mapping and species habitat modeling by RADMAP. Each focal taxon’s location data was extracted (when applicable) from the following list of sources. BIOS datasets are bracketed with their “ds” numbers and can be located on CDFW’s BIOS viewer: https://wildlife.ca.gov/Data/BIOS.

    • California Natural Diversity Database,

    • Terrestrial Species Monitoring [ds2826],

    • North American Bat Monitoring Data Portal,

    • VertNet,

    • Breeding Bird Survey,

    • Wildlife Insights,

    • eBird,

    • iNaturalist,

    • other available CDFW or partner data.

  5. PSYCHE-D: predicting change in depression severity using person-generated...

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf
    Updated Jul 18, 2024
    Cite
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay (2024). PSYCHE-D: predicting change in depression severity using person-generated health data (DATASET) [Dataset]. http://doi.org/10.5281/zenodo.5085146
    Available download formats: pdf, bin
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariko Makhmutova; Raghu Kainkaryam; Marta Ferreira; Jae Min; Martin Jaggi; Ieuan Clay
    Description

    This dataset is made available under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). See LICENSE.pdf for details.

    Dataset description

    Parquet file, with:

    • 35694 rows
    • 154 columns

    The file is indexed on [participant]_[month], such that 34_12 means month 12 from participant 34. All participant IDs have been replaced with randomly generated integers and the conversion table deleted.
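
    As a rough illustration of the index format described above, the following Python sketch splits `[participant]_[month]` keys into their two parts and groups the months by participant. The function name and the sample keys are hypothetical; only the `34_12` example and the key layout come from the dataset description.

```python
def parse_index(key):
    """Split a '[participant]_[month]' key into (participant, month) integers."""
    participant, month = key.split("_")
    return int(participant), int(month)


# Hypothetical index values written in the documented format.
keys = ["34_12", "34_3", "7_1"]

# Collect the months available for each participant, in ascending order.
by_participant = {}
for key in keys:
    participant, month = parse_index(key)
    by_participant.setdefault(participant, []).append(month)
for months in by_participant.values():
    months.sort()
```

    Converting both parts to integers matters when sorting: as strings, month "12" would sort before month "3".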

    Column names and explanations are included as a separate tab-delimited file. Detailed descriptions of feature engineering are available from the linked publications.

    File contains aggregated, derived feature matrix describing person-generated health data (PGHD) captured as part of the DiSCover Project (https://clinicaltrials.gov/ct2/show/NCT03421223). This matrix focuses on individual changes in depression status over time, as measured by PHQ-9.

    The DiSCover Project is a 1-year long longitudinal study consisting of 10,036 individuals in the United States, who wore consumer-grade wearable devices throughout the study and completed monthly surveys about their mental health and/or lifestyle changes, between January 2018 and January 2020.

    The data subset used in this work comprises the following:

    • Wearable PGHD: step and sleep data from the participants’ consumer-grade wearable devices (Fitbit) worn throughout the study
    • Screener survey: prior to the study, participants self-reported socio-demographic information, as well as comorbidities
    • Lifestyle and medication changes (LMC) survey: every month, participants were requested to complete a brief survey reporting changes in their lifestyle and medication over the past month
    • Patient Health Questionnaire (PHQ-9) score: every 3 months, participants were requested to complete the PHQ-9, a 9-item questionnaire that has proven to be reliable and valid to measure depression severity

    From these input sources we define a range of input features, both static (defined once, remain constant for all samples from a given participant throughout the study, e.g. demographic features) and dynamic (varying with time for a given participant, e.g. behavioral features derived from consumer-grade wearables).

    The dataset contains a total of 35,694 rows, one for each month of data collection from each participant. We can generate 3-month long, non-overlapping, independent samples to capture changes in depression status over time with PGHD. We use the notation ‘SM0’ (sample month 0), ‘SM1’, ‘SM2’ and ‘SM3’ to refer to relative time points within each sample. Each 3-month sample consists of: PHQ-9 survey responses at SM0 and SM3, one set of screener survey responses, LMC survey responses at SM3 (as well as SM1, SM2, if available), and wearable PGHD for SM3 (and SM1, SM2, if available). The wearable PGHD includes data collected from 8 to 14 days prior to the PHQ-9 label generation date at SM3. Doing this generates a total of 10,866 samples from 4,036 unique participants.

  6. ECMWF ERA5: ensemble means of surface level analysis parameter data

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Jul 7, 2025
    + more versions
    Cite
    European Centre for Medium-Range Weather Forecasts (ECMWF) (2025). ECMWF ERA5: ensemble means of surface level analysis parameter data [Dataset]. https://catalogue.ceda.ac.uk/uuid/d8021685264e43c7a0868396a5f582d0
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    European Centre for Medium-Range Weather Forecasts (ECMWF)
    License

    https://artefacts.ceda.ac.uk/licences/specific_licences/ecmwf-era-products.pdf

    Area covered
    Earth
    Variables measured
    cloud_area_fraction, sea_ice_area_fraction, air_pressure_at_mean_sea_level, lwe_thickness_of_atmosphere_mass_content_of_water_vapor
    Description

    This dataset contains ERA5 surface level analysis parameter data ensemble means (see linked dataset for spreads). ERA5 is the 5th generation reanalysis project from the European Centre for Medium-Range Weather Forecasts (ECMWF) - see linked documentation for further details. The ensemble means and spreads are calculated from the ERA5 10 member ensemble, run at a reduced resolution compared with the single high resolution (hourly output at 31 km grid spacing) 'HRES' realisation, for which these data have been produced to provide an uncertainty estimate. This dataset contains a limited selection of all available variables, which have been converted to netCDF from the original GRIB files held on the ECMWF system. They have also been translated onto a regular latitude-longitude grid during the extraction process from the ECMWF holdings. For a fuller set of variables please see the Copernicus Data Store (CDS) data tool, linked to from this record.

    Note, ensemble standard deviation is often referred to as ensemble spread and is calculated as the standard deviation of the 10 members in the ensemble (i.e., including the control). It is not the sample standard deviation, and thus was calculated by dividing by 10 rather than 9 (N-1). See linked datasets for ensemble member and ensemble mean data.
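
    The distinction between dividing by N and by N-1 can be made concrete with a short Python sketch. The ten member values below are hypothetical mean sea level pressures invented for illustration; they are not ERA5 data, and the variable names are assumptions.

```python
from statistics import mean, pstdev, stdev

# Hypothetical values from a 10-member ensemble at one grid point (hPa).
members = [1013.2, 1012.8, 1013.5, 1012.9, 1013.1,
           1013.4, 1012.7, 1013.0, 1013.3, 1013.6]

ensemble_mean = mean(members)

# Ensemble spread as described: population standard deviation (divide by N = 10).
ensemble_spread = pstdev(members)

# For comparison only: sample standard deviation (divide by N - 1 = 9).
sample_sd = stdev(members)
```

    With the same data the population form is always the smaller of the two, by a factor of sqrt((N-1)/N), roughly 0.95 for N = 10.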

    The ERA5 global atmospheric reanalysis covers 1979 to 2 months behind the present month. This follows on from the ERA-15, ERA-40 and ERA-Interim re-analysis projects.

    An initial release of ERA5 data (ERA5t) is made roughly 5 days behind the present date. These data are subsequently reviewed ahead of being released by ECMWF as quality assured data within 3 months. CEDA holds a 6 month rolling copy of the latest ERA5t data. See related datasets linked to from this record. However, for the period 2000-2006 the initial ERA5 release was found to suffer from stratospheric temperature biases, and so new runs to address this issue were performed, resulting in the ERA5.1 release (see linked datasets). Note, though, that Simmons et al. 2020 (technical memo 859) report that "ERA5.1 is very close to ERA5 in the lower and middle troposphere", but users of data from this period should read technical memo 859 for further details.

  7. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Loist, Skadi; Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Film University Babelsberg KONRAD WOLF
    Authors
    Loist, Skadi; Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
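
    The relationship between the long and wide tables can be sketched in Python as follows. This is only an illustration of the rule described above (one row per unique film, keeping the first sample festival where it appeared); the sample rows and the column names `film_id`, `year` and `month` are hypothetical, and only `fest` is a variable name taken from the description.

```python
# Hypothetical long-format rows: one row per film per sample festival.
long_rows = [
    {"film_id": 1, "fest": "Frameline", "year": 2017, "month": 6},
    {"film_id": 1, "fest": "Berlinale", "year": 2017, "month": 2},
    {"film_id": 2, "fest": "Frameline", "year": 2017, "month": 6},
]


def to_wide(rows):
    """Keep one row per film: the earliest festival appearance."""
    first = {}
    for row in rows:
        film = row["film_id"]
        when = (row["year"], row["month"])
        best = first.get(film)
        if best is None or when < (best["year"], best["month"]):
            first[film] = row
    # Return rows ordered by film ID, one per unique film.
    return [first[film] for film in sorted(first)]


wide_rows = to_wide(long_rows)
```

    In this toy input, film 1 appears at both Berlinale (February) and Frameline (June), so the wide table keeps "Berlinale" as its sample festival, matching the example in the description.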

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, tv, dvd/blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
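
    To give a feel for the two fuzzy matching methods named above, here is a minimal Python sketch of an OSA (optimal string alignment) distance and a cosine similarity over character bigrams. It is not the authors' R script: the function names, the choice of bigrams (q = 2), and the example titles are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def osa_distance(a, b):
    """Optimal string alignment distance: edits are insertions, deletions,
    substitutions, and transpositions of two adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def cosine_similarity(a, b, q=2):
    """Cosine similarity between the q-gram count profiles of two strings."""
    pa = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    pb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(pa[g] * pb[g] for g in pa)
    norm = (sqrt(sum(v * v for v in pa.values()))
            * sqrt(sum(v * v for v in pb.values())))
    return dot / norm if norm else 0.0


# Hypothetical title pair with a typo (two adjacent letters swapped).
typo_distance = osa_distance("parasite", "paraiste")
same_title_score = cosine_similarity("parasite", "parasite")
```

    A swapped pair of adjacent letters counts as a single edit under OSA, which is why it suits titles with typos, while the cosine score rewards titles that share most of their character sequences.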

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check that everything works. Scraping for the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This

  8. Job Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2023
    Cite
    Ravender Singh Rana (2023). Job Dataset [Dataset]. https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset
    Available download formats: zip (479575920 bytes)
    Dataset updated
    Sep 17, 2023
    Authors
    Ravender Singh Rana
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Job Dataset

    This dataset provides a comprehensive collection of synthetic job postings to facilitate research and analysis in the field of job market trends, natural language processing (NLP), and machine learning. Created for educational and research purposes, this dataset offers a diverse set of job listings across various industries and job types.

    Descriptions for each of the columns in the dataset:

    1. Job Id: A unique identifier for each job posting.
    2. Experience: The required or preferred years of experience for the job.
    3. Qualifications: The educational qualifications needed for the job.
    4. Salary Range: The range of salaries or compensation offered for the position.
    5. Location: The city or area where the job is located.
    6. Country: The country where the job is located.
    7. Latitude: The latitude coordinate of the job location.
    8. Longitude: The longitude coordinate of the job location.
    9. Work Type: The type of employment (e.g., full-time, part-time, contract).
    10. Company Size: The approximate size or scale of the hiring company.
    11. Job Posting Date: The date when the job posting was made public.
    12. Preference: Special preferences or requirements for applicants (e.g., Only Male or Only Female, or Both).
    13. Contact Person: The name of the contact person or recruiter for the job.
    14. Contact: Contact information for job inquiries.
    15. Job Title: The job title or position being advertised.
    16. Role: The role or category of the job (e.g., software developer, marketing manager).
    17. Job Portal: The platform or website where the job was posted.
    18. Job Description: A detailed description of the job responsibilities and requirements.
    19. Benefits: Information about benefits offered with the job (e.g., health insurance, retirement plans).
    20. Skills: The skills or qualifications required for the job.
    21. Responsibilities: Specific responsibilities and duties associated with the job.
    22. Company Name: The name of the hiring company.
    23. Company Profile: A brief overview of the company's background and mission.

    Potential Use Cases:

    • Building predictive models to forecast job market trends.
    • Enhancing job recommendation systems for job seekers.
    • Developing NLP models for resume parsing and job matching.
    • Analyzing regional job market disparities and opportunities.
    • Exploring salary prediction models for various job roles.

    Acknowledgements:

    We would like to express our gratitude to the Python Faker library for its invaluable contribution to the dataset generation process. Additionally, we appreciate the guidance provided by ChatGPT in fine-tuning the dataset, ensuring its quality, and adhering to ethical standards.

    Note:

    Please note that the examples provided are fictional and for illustrative purposes. You can tailor the descriptions and examples to match the specifics of your dataset. It is not suitable for real-world applications and should only be used within the scope of research and experimentation. You can also reach me via email at: rrana157@gmail.com

  9. Genotypes and cluster definitions for a range-wide greater sage-grouse...

    • data.usgs.gov
    • s.cnmilf.com
    • +1more
    Updated Jul 29, 2022
    + more versions
    Cite
    Shawna Zimmerman; Cameron Aldridge; Michael O'Donnell; David Edmunds; Peter Coates; Brian Prochazka; Jennifer Fike; Todd Cross; Bradley Fedy; Sara Oyler-McCance (2022). Genotypes and cluster definitions for a range-wide greater sage-grouse dataset collected 2005-2017 [Dataset]. http://doi.org/10.5066/P98Q5F6R
    Explore at:
    Dataset updated
    Jul 29, 2022
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Authors
    Shawna Zimmerman; Cameron Aldridge; Michael O'Donnell; David Edmunds; Peter Coates; Brian Prochazka; Jennifer Fike; Todd Cross; Bradley Fedy; Sara Oyler-McCance
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Time period covered
    2005 - 2017
    Description

    Monitoring change in genetic diversity in wildlife populations across multiple scales could facilitate prioritization of conservation efforts. We used microsatellite genotypes from 7,080 previously collected genetic samples from across the greater sage-grouse (Centrocercus urophasianus) range to develop a modelling framework for estimating genetic diversity within a recently developed hierarchically nested monitoring framework (clusters). The majority of these genetic samples (n = 6,560) were used in previous research (Oyler-McCance et al. 2014; Cross et al. 2018; Row et al. 2018). Genetic diversity values associated with clusters across multiple scales could facilitate the identification of areas with low genetic diversity and inform the potential management or conservation priority and response. We also report the data used to define genetic diversity thresholds of conservation concern and a full reporting of the genetic diversity estimates associated with the evaluated clusters.

  10. Rural Definitions

    • catalog.data.gov
    • gimi9.com
    • +1more
    Updated Apr 21, 2025
    + more versions
    Cite
    Economic Research Service, Department of Agriculture (2025). Rural Definitions [Dataset]. https://catalog.data.gov/dataset/rural-definitions
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Economic Research Service: http://www.ers.usda.gov/
    Description

    Note: Updates to this data product are discontinued. Dozens of definitions are currently used by Federal and State agencies, researchers, and policymakers. The ERS Rural Definitions data product allows users to make comparisons among nine representative rural definitions. Methods of designating the urban periphery range from the use of municipal boundaries to definitions based on counties. Definitions based on municipal boundaries may classify as rural much of what would typically be considered suburban. Definitions that delineate the urban periphery based on counties may include extensive segments of a county that many would consider rural. We have selected a representative set of nine alternative rural definitions and compare social and economic indicators from the 2000 decennial census across the nine definitions. We chose socioeconomic indicators (population, education, poverty, etc.) that are commonly used to highlight differences between urban and rural areas.

  11. Southern Long-Toed Salamander Range - CWHR A003B [ds2844]

    • data.ca.gov
    • data.cnra.ca.gov
    • +5more
    Updated Oct 27, 2025
    Cite
    California Department of Fish and Wildlife (2025). Southern Long-Toed Salamander Range - CWHR A003B [ds2844] [Dataset]. https://data.ca.gov/dataset/southern-long-toed-salamander-range-cwhr-a003b-ds2844
    Explore at:
    Available download formats: geojson, zip, arcgis geoservices rest api, kml, csv, html, ashx
    Dataset updated
    Oct 27, 2025
    Dataset authored and provided by
    California Department of Fish and Wildlife: https://wildlife.ca.gov/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CWHR species range datasets represent the maximum current geographic extent of each species within California. Ranges were originally delineated at a scale of 1:5,000,000 by species-level experts more than 30 years ago and have gradually been revised at a scale of 1:1,000,000. Species occurrence data are used in defining species ranges, but range polygons may extend beyond the limits of extant occurrence data for a particular species. When drawing range boundaries, CDFW seeks to err on the side of commission rather than omission. This means that CDFW may include areas within a range based on expert knowledge or other available information, despite an absence of confirmed occurrences, which may be due to a lack of survey effort. The degree to which a range polygon is extended beyond occurrence data will vary among species, depending upon each species’ vagility, dispersal patterns, and other ecological and life history factors. The boundary line of a range polygon is drawn with consideration of these factors and is aligned with standardized boundaries including watersheds (NHD), ecoregions (USDA), or other ecologically meaningful delineations such as elevation contour lines. While CWHR ranges are meant to represent the current range, once an area has been designated as part of a species’ range in CWHR, it will remain part of the range even if there have been no documented occurrences within recent decades. An area is not removed from the range polygon unless experts indicate that it has not been occupied for a number of years after repeated surveys or is deemed no longer suitable and unlikely to be recolonized. It is important to note that range polygons typically contain areas in which a species is not expected to be found due to the patchy configuration of suitable habitat within a species’ range. In this regard, range polygons are coarse generalizations of where a species may be found. 
    This data is available for download from the CDFW website: https://www.wildlife.ca.gov/Data/CWHR.

    The following data sources were collated for the purposes of range mapping and species habitat modeling by RADMAP. Each focal taxon’s location data was extracted (when applicable) from the following list of sources. BIOS datasets are bracketed with their “ds” numbers and can be located on CDFW’s BIOS viewer: https://wildlife.ca.gov/Data/BIOS.

    • California Natural Diversity Database,

    • Terrestrial Species Monitoring [ds2826],

    • North American Bat Monitoring Data Portal,

    • VertNet,

    • Breeding Bird Survey,

    • Wildlife Insights,

    • eBird,

    • iNaturalist,

    • other available CDFW or partner data.

  12. Dictionary of English Words and Definitions

    • kaggle.com
    zip
    Updated Sep 22, 2024
    + more versions
    Cite
    AnthonyTherrien (2024). Dictionary of English Words and Definitions [Dataset]. https://www.kaggle.com/datasets/anthonytherrien/dictionary-of-english-words-and-definitions
    Explore at:
    Available download formats: zip (6401928 bytes)
    Dataset updated
    Sep 22, 2024
    Authors
    AnthonyTherrien
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Overview

    This dataset consists of 42,052 English words and their corresponding definitions. It is a comprehensive collection of words ranging from common terms to more obscure vocabulary. The dataset is ideal for Natural Language Processing (NLP) tasks, educational tools, and various language-related applications.

    Key Features:

    • Words: A diverse set of English words, including both rare and frequently used terms.
    • Definitions: Each word is accompanied by a detailed definition that explains its meaning and contextual usage.

    Total Number of Words: 42,052

    Applications

    This dataset is well-suited for a range of use cases, including:

    • Natural Language Processing (NLP): Enhance text understanding models by providing contextual meaning and word associations.
    • Vocabulary Building: Create educational tools or games that help users expand their vocabulary.
    • Lexical Studies: Perform academic research on word usage, trends, and lexical semantics.
    • Dictionary and Thesaurus Development: Serve as a resource for building dictionary or thesaurus applications, where users can search for words and definitions.

    Data Structure

    • Word: The column containing the English word.
    • Definition: The column providing a comprehensive definition of the word.

    Potential Use Cases

    • Language Learning: This dataset can be used to develop applications or tools aimed at enhancing vocabulary acquisition for language learners.
    • NLP Model Training: Useful for tasks such as word embeddings, definition generation, and contextual learning.
    • Research: Analyze word patterns, rare vocabulary, and trends in the English language.


  13. Blunt-Nosed Leopard Lizard Range - CWHR R019 [ds1726]

    • data-cdfw.opendata.arcgis.com
    • data.cnra.ca.gov
    • +4more
    Updated Oct 22, 2025
    Cite
    California Department of Fish and Wildlife (2025). Blunt-Nosed Leopard Lizard Range - CWHR R019 [ds1726] [Dataset]. https://data-cdfw.opendata.arcgis.com/datasets/CDFW::blunt-nosed-leopard-lizard-range-cwhr-r019-ds1726
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset authored and provided by
    California Department of Fish and Wildlife: https://wildlife.ca.gov/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CWHR species range datasets represent the maximum current geographic extent of each species within California. Ranges were originally delineated at a scale of 1:5,000,000 by species-level experts more than 30 years ago and have gradually been revised at a scale of 1:1,000,000. Species occurrence data are used in defining species ranges, but range polygons may extend beyond the limits of extant occurrence data for a particular species. When drawing range boundaries, CDFW seeks to err on the side of commission rather than omission. This means that CDFW may include areas within a range based on expert knowledge or other available information, despite an absence of confirmed occurrences, which may be due to a lack of survey effort. The degree to which a range polygon is extended beyond occurrence data will vary among species, depending upon each species’ vagility, dispersal patterns, and other ecological and life history factors. The boundary line of a range polygon is drawn with consideration of these factors and is aligned with standardized boundaries including watersheds (NHD), ecoregions (USDA), or other ecologically meaningful delineations such as elevation contour lines. While CWHR ranges are meant to represent the current range, once an area has been designated as part of a species’ range in CWHR, it will remain part of the range even if there have been no documented occurrences within recent decades. An area is not removed from the range polygon unless experts indicate that it has not been occupied for a number of years after repeated surveys or is deemed no longer suitable and unlikely to be recolonized. It is important to note that range polygons typically contain areas in which a species is not expected to be found due to the patchy configuration of suitable habitat within a species’ range. In this regard, range polygons are coarse generalizations of where a species may be found. 
    This data is available for download from the CDFW website: https://www.wildlife.ca.gov/Data/CWHR.

    The following data sources were collated for the purposes of range mapping and species habitat modeling by RADMAP. Each focal taxon’s location data was extracted (when applicable) from the following list of sources. BIOS datasets are bracketed with their “ds” numbers and can be located on CDFW’s BIOS viewer: https://wildlife.ca.gov/Data/BIOS.

    • California Natural Diversity Database,

    • Terrestrial Species Monitoring [ds2826],

    • North American Bat Monitoring Data Portal,

    • VertNet,

    • Breeding Bird Survey,

    • Wildlife Insights,

    • eBird,

    • iNaturalist,

    • other available CDFW or partner data.

  14. Lesser Sandhill Crane Range - CWHR B150B [ds3227]

    • data.ca.gov
    • data.cnra.ca.gov
    • +4more
    Updated Oct 27, 2025
    + more versions
    Cite
    California Department of Fish and Wildlife (2025). Lesser Sandhill Crane Range - CWHR B150B [ds3227] [Dataset]. https://data.ca.gov/dataset/lesser-sandhill-crane-range-cwhr-b150b-ds3227
    Explore at:
    Available download formats: zip, geojson, arcgis geoservices rest api, kml, html, csv, ashx
    Dataset updated
    Oct 27, 2025
    Dataset authored and provided by
    California Department of Fish and Wildlife: https://wildlife.ca.gov/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CWHR species range datasets represent the maximum current geographic extent of each species within California. Ranges were originally delineated at a scale of 1:5,000,000 by species-level experts more than 30 years ago and have gradually been revised at a scale of 1:1,000,000. Species occurrence data are used in defining species ranges, but range polygons may extend beyond the limits of extant occurrence data for a particular species. When drawing range boundaries, CDFW seeks to err on the side of commission rather than omission. This means that CDFW may include areas within a range based on expert knowledge or other available information, despite an absence of confirmed occurrences, which may be due to a lack of survey effort. The degree to which a range polygon is extended beyond occurrence data will vary among species, depending upon each species’ vagility, dispersal patterns, and other ecological and life history factors. The boundary line of a range polygon is drawn with consideration of these factors and is aligned with standardized boundaries including watersheds (NHD), ecoregions (USDA), or other ecologically meaningful delineations such as elevation contour lines. While CWHR ranges are meant to represent the current range, once an area has been designated as part of a species’ range in CWHR, it will remain part of the range even if there have been no documented occurrences within recent decades. An area is not removed from the range polygon unless experts indicate that it has not been occupied for a number of years after repeated surveys or is deemed no longer suitable and unlikely to be recolonized. It is important to note that range polygons typically contain areas in which a species is not expected to be found due to the patchy configuration of suitable habitat within a species’ range. In this regard, range polygons are coarse generalizations of where a species may be found. 
    This data is available for download from the CDFW website: https://www.wildlife.ca.gov/Data/CWHR.

    The following data sources were collated for the purposes of range mapping and species habitat modeling by RADMAP. Each focal taxon’s location data was extracted (when applicable) from the following list of sources. BIOS datasets are bracketed with their “ds” numbers and can be located on CDFW’s BIOS viewer: https://wildlife.ca.gov/Data/BIOS.

    • California Natural Diversity Database,

    • Terrestrial Species Monitoring [ds2826],

    • North American Bat Monitoring Data Portal,

    • VertNet,

    • Breeding Bird Survey,

    • Wildlife Insights,

    • eBird,

    • iNaturalist,

    • other available CDFW or partner data.

  15. Top 2500 Kaggle Datasets

    • kaggle.com
    Updated Feb 16, 2024
    Cite
    Saket Kumar (2024). Top 2500 Kaggle Datasets [Dataset]. http://doi.org/10.34740/kaggle/dsv/7637365
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Saket Kumar
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset compiles the top 2500 datasets from Kaggle, encompassing a diverse range of topics and contributors. It provides insights into dataset creation, usability, popularity, and more, offering valuable information for researchers, analysts, and data enthusiasts.

    Research Analysis: Researchers can utilize this dataset to analyze trends in dataset creation, popularity, and usability scores across various categories.

    Contributor Insights: Kaggle contributors can explore the dataset to gain insights into factors influencing the success and engagement of their datasets, aiding in optimizing future submissions.

    Machine Learning Training: Data scientists and machine learning enthusiasts can use this dataset to train models for predicting dataset popularity or usability based on features such as creator, category, and file types.

    Market Analysis: Analysts can leverage the dataset to conduct market analysis, identifying emerging trends and popular topics within the data science community on Kaggle.

    Educational Purposes: Educators and students can use this dataset to teach and learn about data analysis, visualization, and interpretation within the context of real-world datasets and community-driven platforms like Kaggle.

    Column Definitions:

    • Dataset Name: Name of the dataset.
    • Created By: Creator(s) of the dataset.
    • Last Updated in number of days: Time elapsed since last update.
    • Usability Score: Score indicating the ease of use.
    • Number of File: Quantity of files included.
    • Type of file: Format of files (e.g., CSV, JSON).
    • Size: Size of the dataset.
    • Total Votes: Number of votes received.
    • Category: Categorization of the dataset's subject matter.

  16. Data from: Range size, local abundance and effect inform species...

    • agdatacommons.nal.usda.gov
    • catalog.data.gov
    txt
    Updated Nov 21, 2025
    Cite
    Erin K. Espeland; Zachary A. Sylvain (2025). Data from: Range size, local abundance and effect inform species descriptions at scales relevant for local conservation practice [Dataset]. http://doi.org/10.15482/USDA.ADC/1503833
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Erin K. Espeland; Zachary A. Sylvain
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Understanding species abundances and distributions, especially at local to landscape scales, is critical for land managers and conservationists to prioritize management decisions and informs the effort and expense that may be required. The metrics of range size and local abundance reflect aspects of the biology and ecology of a given species, and together with its per capita (or per unit area) effects on other members of the community comprise a well-accepted theoretical paradigm describing invasive species. Although these metrics are readily calculated from vegetation monitoring data, they have not generally (and effect in particular) been applied to native species. We describe how metrics defining invasions may be more broadly applied to both native and invasive species in vegetation management, supporting their relevance to local scales of species conservation and management. We then use a sample monitoring dataset to compare range size, local abundance and effect as well as summary calculations of landscape penetration (range size × local abundance) and impact (landscape penetration × effect) for native and invasive species in the mixed-grass plant community of western North Dakota, USA. This paper uses these summary statistics to quantify the impact for 13 of 56 commonly encountered species, with statistical support for effects of 6 of the 13 species. Our results agree with knowledge of invasion severity and natural history of native species in the region. We contend that when managers are using invasion metrics in monitoring, extending them to common native species is biologically and ecologically informative, with little additional investment.

    Resources in this dataset:

    • Resource Title: Supporting Data (xlsx)
      File Name: Espeland-Sylvain-BiodivConserv-2019-raw-data.xlsx
      Resource Description: Occurrence data per quadrangle, site, and transect. Species Codes and habitat identifiers are defined in a separate sheet.
    • Resource Title: Data Dictionary
      File Name: Espeland-Sylvain-BiodivConserv-2019-data-dictionary.csv
      Resource Description: Details Species and Habitat codes for abundance data collected.
    • Resource Title: Supporting Data (csv)
      File Name: Espeland-Sylvain-BiodivConserv-2019-raw-data.csv
      Resource Description: Occurrence data per quadrangle, site, and transect.
    • Resource Title: Supplementary Table S1.1
      File Name: 10531_2019_1701_MOESM1_ESM.docx
      Resource Description: Scientific name, common name, life history group, family, status (N = native, I = introduced), percent of plots present, and average cover when present of 56 vascular plant species recorded in 1196 undisturbed plots in federally-managed grasslands of western North Dakota. Life history groups: C3 = cool season perennial grass, C4 = warm season perennial grass, SE = sedge, SH = shrub, PF = perennial forb, BF = biennial forb, APF = annual, biennial, or perennial forb.
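    The two summary calculations named in the description, landscape penetration (range size × local abundance) and impact (landscape penetration × effect), can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the class name, field names, units, and example values are hypothetical and are not taken from the dataset, and the description does not say how the authors estimate each input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeciesMetrics:
    """Per-species inputs. All field names and units are hypothetical."""
    name: str
    range_size: float       # assumed: share of plots where the species occurs
    local_abundance: float  # assumed: mean cover where the species is present
    effect: float           # assumed: per-unit effect on the rest of the community


def landscape_penetration(m: SpeciesMetrics) -> float:
    # landscape penetration = range size x local abundance
    return m.range_size * m.local_abundance


def impact(m: SpeciesMetrics) -> float:
    # impact = landscape penetration x effect
    return landscape_penetration(m) * m.effect


if __name__ == "__main__":
    # Made-up values, for illustration only.
    example = SpeciesMetrics("hypothetical species A", 0.5, 0.25, 2.0)
    print(landscape_penetration(example))
    print(impact(example))
```

    Keeping the two products as separate functions mirrors the description, where impact is defined in terms of landscape penetration rather than directly from the three inputs.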

  17. High-fidelity Fraudulent Activity Dataset 2023

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Shahzad Aslam (2023). High-fidelity Fraudulent Activity Dataset 2023 [Dataset]. https://www.kaggle.com/datasets/zeesolver/credit-card
    Explore at:
    Available download formats: zip (149953614 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Shahzad Aslam
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description
    Context

    The credit card dataset comprises various attributes that capture essential information about individual transactions. Each entry in the dataset is uniquely identified by an 'ID', which aids in precise record-keeping and analysis. The 'V1-V28' features encompass a wide range of transaction-related details, including time, location, type, and several other parameters. These attributes collectively provide a comprehensive snapshot of each transaction. 'Amount' denotes the monetary value involved in the transaction, indicating the specific charge or credit associated with the card. Lastly, the 'Class' attribute plays a pivotal role in fraud detection, categorizing transactions into distinct classes like 'legitimate' and 'fraudulent'. This classification is instrumental in identifying potentially suspicious activities, helping financial institutions safeguard against fraudulent transactions. Together, these attributes form a crucial dataset for studying and mitigating risks associated with credit card transactions.

    Column Details

    ID:

    This is likely a unique identifier for a specific credit card transaction. It helps in keeping track of individual transactions and distinguishing them from one another.

    V1-V28:

    These are possibly features or attributes associated with the credit card transaction. They might include information such as time, amount, location, type of transaction, and various other details that can be used for analysis and fraud detection.

    Amount:

    This refers to the monetary value involved in the credit card transaction. It indicates how much money was either charged or credited to the card during that particular transaction.

    Class:

    This is an important attribute indicating the category or type of the transaction. It typically classifies transactions into different groups, like 'fraudulent' or 'legitimate'. This classification is crucial for identifying potentially suspicious or fraudulent activities.
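    The column layout described above (ID, V1 through V28, Amount, Class) can be written down as a small header check in Python. This is a minimal sketch: the function names are hypothetical, and the exact spelling and capitalisation of the column names in the downloaded file is an assumption based only on this description, so the check may need adjusting against the real header.

```python
from typing import List


def expected_columns() -> List[str]:
    """Column names implied by the description: ID, V1..V28, Amount, Class."""
    return ["ID"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def missing_columns(header: List[str]) -> List[str]:
    """Return the expected columns that are absent from a CSV header row."""
    present = set(header)
    return [name for name in expected_columns() if name not in present]


if __name__ == "__main__":
    # A deliberately incomplete header, for illustration only.
    print(missing_columns(["ID", "V1", "V2", "Amount"]))
```

    Running such a check before any modelling makes a renamed or missing column visible immediately rather than as a later lookup error.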

  18. Black Rail Range - CWHR B143 [ds595]

    • data-cdfw.opendata.arcgis.com
    • data.cnra.ca.gov
    • +6more
    Updated Oct 22, 2025
    + more versions
    Cite
    California Department of Fish and Wildlife (2025). Black Rail Range - CWHR B143 [ds595] [Dataset]. https://data-cdfw.opendata.arcgis.com/datasets/CDFW::black-rail-range-cwhr-b143-ds595
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset authored and provided by
    California Department of Fish and Wildlife: https://wildlife.ca.gov/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CWHR species range datasets represent the maximum current geographic extent of each species within California. Ranges were originally delineated at a scale of 1:5,000,000 by species-level experts more than 30 years ago and have gradually been revised at a scale of 1:1,000,000. Species occurrence data are used in defining species ranges, but range polygons may extend beyond the limits of extant occurrence data for a particular species. When drawing range boundaries, CDFW seeks to err on the side of commission rather than omission. This means that CDFW may include areas within a range based on expert knowledge or other available information, despite an absence of confirmed occurrences, which may be due to a lack of survey effort. The degree to which a range polygon is extended beyond occurrence data will vary among species, depending upon each species’ vagility, dispersal patterns, and other ecological and life history factors. The boundary line of a range polygon is drawn with consideration of these factors and is aligned with standardized boundaries including watersheds (NHD), ecoregions (USDA), or other ecologically meaningful delineations such as elevation contour lines. While CWHR ranges are meant to represent the current range, once an area has been designated as part of a species’ range in CWHR, it will remain part of the range even if there have been no documented occurrences within recent decades. An area is not removed from the range polygon unless experts indicate that it has not been occupied for a number of years after repeated surveys or is deemed no longer suitable and unlikely to be recolonized. It is important to note that range polygons typically contain areas in which a species is not expected to be found due to the patchy configuration of suitable habitat within a species’ range. In this regard, range polygons are coarse generalizations of where a species may be found. 
    This data is available for download from the CDFW website: https://www.wildlife.ca.gov/Data/CWHR.

    The following data sources were collated for the purposes of range mapping and species habitat modeling by RADMAP. Each focal taxon’s location data was extracted (when applicable) from the following list of sources. BIOS datasets are bracketed with their “ds” numbers and can be located on CDFW’s BIOS viewer: https://wildlife.ca.gov/Data/BIOS.

    • California Natural Diversity Database,

    • Terrestrial Species Monitoring [ds2826],

    • North American Bat Monitoring Data Portal,

    • VertNet,

    • Breeding Bird Survey,

    • Wildlife Insights,

    • eBird,

    • iNaturalist,

    • other available CDFW or partner data.

  19. Street Network Database SND

    • data.seattle.gov
    • s.cnmilf.com
    • +2more
    csv, xlsx, xml
    Updated Feb 3, 2025
    + more versions
    Cite
    (2025). Street Network Database SND [Dataset]. https://data.seattle.gov/dataset/Street-Network-Database-SND/cmy3-6chy
    Explore at:
    Available download formats: xml, csv, xlsx
    Dataset updated
    Feb 3, 2025
    Description

    The pathway representation consists of segments and intersection elements. A segment is a linear graphic element that represents a continuous physical travel path terminated by a path end (dead end) or a physical intersection with other travel paths. Segments have one street name, one address range and one set of segment characteristics. A segment may have zero or multiple alias street names. Segment types included are Freeways, Highways, Streets, Alleys (named only), Railroads, Walkways, and Bike lanes. SNDSEG_PV is a linear feature class representing the SND Segment Feature, with attributes for Street name, Address Range, Alias Street name and segment Characteristics objects. Part of the Address Range and all of Street name objects are logically shared with the Discrete Address Point-Master Address File layer.

    Appropriate uses include:
    • Cartography: Used to depict the City's transportation network location and connections, typically on smaller scaled maps or images where a single line representation is appropriate. Used to depict specific classifications of roadway use, also typically at smaller scales. Used to label transportation network feature names, typically on larger scaled maps. Used to label address ranges with associated transportation network features, typically on larger scaled maps.
    • Geocode reference: Used as a source for derived reference data for address validation and theoretical address location.
    • Address Range data repository: This data store is the City's address range repository, defining address ranges in association with transportation network features.
    • Polygon boundary reference: Used to define various area boundaries in other feature classes where coincident with the transportation network. Does not contain polygon features.
    • Address based extracts: Used to create flat-file extracts, typically indexed by address, with reference to business data typically associated with transportation network features.
    • Thematic linear location reference: By providing unique, stable identifiers for each linear feature, thematic data is associated with specific transportation network features via these identifiers.
    • Thematic intersection location reference: By providing unique, stable identifiers for each intersection feature, thematic data is associated with specific transportation network features via these identifiers.
    • Network route tracing: Used as a source for derived reference data used to determine point-to-point travel paths or to determine optimal stop allocation along a travel path.
    • Topological connections with segments: Used to provide a specific definition of location for each transportation network feature. Also provides a specific definition of connection between each transportation network feature (defines where the streets are and the relationship between them, i.e., 4th Ave is west of 5th Ave and 4th Ave intersects Cherry St).
    • Event location reference: Used as a source for derived reference data used to locate events and for linear referencing.
    Data source is TRANSPO.SNDSEG_PV. Updated weekly.
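
The segment and intersection model described above lends itself to a simple graph representation. The following is a minimal Python sketch of the "network route tracing" use, assuming hypothetical field names (`segment_id`, `from_node`, `to_node`, and so on) that are not taken from the actual SNDSEG_PV schema, and assuming every segment can be traveled in both directions. It finds a path with the fewest segments between two intersections using breadth-first search.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    # All field names are illustrative assumptions, not SNDSEG_PV attributes.
    segment_id: int      # unique, stable identifier for the linear feature
    street_name: str     # each segment has exactly one street name
    addr_low: int        # low end of the segment's single address range
    addr_high: int       # high end of the segment's single address range
    from_node: int       # intersection (or dead end) where the segment starts
    to_node: int         # intersection (or dead end) where the segment ends
    aliases: tuple = ()  # zero or more alias street names


def build_adjacency(segments):
    """Map each intersection id to its (neighbor id, segment id) pairs."""
    adjacency = {}
    for seg in segments:
        # Assumption: segments are two-way, so add both directions.
        adjacency.setdefault(seg.from_node, []).append((seg.to_node, seg.segment_id))
        adjacency.setdefault(seg.to_node, []).append((seg.from_node, seg.segment_id))
    return adjacency


def trace_route(adjacency, start, goal):
    """Return segment ids along a fewest-segments path, or None if unreachable."""
    if start == goal:
        return []
    previous = {start: None}  # node -> (node before it, segment used to arrive)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, seg_id in adjacency.get(node, []):
            if neighbor in previous:
                continue
            previous[neighbor] = (node, seg_id)
            if neighbor == goal:
                # Walk back from the goal to the start, collecting segments.
                path = []
                current = goal
                while previous[current] is not None:
                    before, seg = previous[current]
                    path.append(seg)
                    current = before
                return path[::-1]
            queue.append(neighbor)
    return None


# Made-up example data: three connected segments and one isolated segment.
SEGMENTS = [
    Segment(101, "4TH AVE", 700, 798, from_node=1, to_node=2),
    Segment(102, "CHERRY ST", 400, 498, from_node=2, to_node=3),
    Segment(103, "5TH AVE", 700, 798, from_node=3, to_node=4),
    Segment(104, "EXAMPLE WALKWAY", 1, 99, from_node=8, to_node=9),
]
ADJACENCY = build_adjacency(SEGMENTS)
ROUTE = trace_route(ADJACENCY, 1, 4)
```

Real routing on this dataset would also need one-way restrictions, segment lengths, and segment types, none of which are modeled here.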

  20. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures for each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

    This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

    Metadata (including data dictionary)
    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code

    Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    Description
    • “CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
    • “Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Optional Information (complete as necessary)

    Required R packages:
    • For running “CWVS_LMC.txt”:
      • msm: Sampling from the truncated normal distribution
      • mnormt: Sampling from the multivariate normal distribution
      • BayesLogit: Sampling from the Polya-Gamma distribution
    • For running “Results_Summary.txt”:
      • plotrix: Plotting the posterior means and credible intervals

    Instructions for Use

    Reproducibility (Mandatory)

    What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.

    How to use the information:
    • Load the “Simulated_Dataset.RData” workspace
    • Run the code contained in “CWVS_LMC.txt”
    • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

    Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set:

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

    Description

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures for each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
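
The weekly standardization described in this entry (subtract the week's median, then divide by the week's IQR) can be sketched in a few lines. This is an illustrative Python sketch only, not the authors' R code: the function names are hypothetical, the exposure matrix is assumed to have one row per individual and one column per week, and the quartile convention (`method="inclusive"`) is an assumption, since the description does not say how the IQR was computed.

```python
from statistics import median, quantiles


def standardize_week(values):
    """Return (value - median) / IQR for one week's exposures."""
    # Assumption: quartiles use the "inclusive" method; other conventions
    # give slightly different IQRs on small samples.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; exposures cannot be standardized")
    week_median = median(values)
    return [(v - week_median) / iqr for v in values]


def standardize_exposures(z):
    """Standardize each column (week) of a rows-by-weeks exposure matrix."""
    columns = [list(col) for col in zip(*z)]        # one list per week
    standardized = [standardize_week(col) for col in columns]
    return [list(row) for row in zip(*standardized)]  # back to rows-by-weeks


# Made-up raw exposures: 5 individuals (rows) by 2 weeks (columns).
RAW = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
STANDARDIZED = standardize_exposures(RAW)
```

Because the medians and IQRs are withheld, the transformation cannot be inverted from the released matrix, which is the stated reason for publishing only the standardized values.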

Cite
U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
Meta data and supporting documentation

Dataset updated
Nov 12, 2020
Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
Description

We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

Format:

Abstract: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Description

Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures for each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

File format: R workspace file.

Metadata (including data dictionary)
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
