17 datasets found
  1. Simulation Data Set

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures in each week by subtracting off the median exposure amount for that week and dividing by the interquartile range (IQR), as in the actual application to the true NC birth records data. The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way; the medians and IQRs themselves are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is not publicly accessible because EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects, and because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed.

    File format: R workspace file, “Simulated_Dataset.RData”.

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    “CWVS_LMC.txt”: This code is delivered as a .txt file containing R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the code in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.

    “Results_Summary.txt”: This code is also delivered as a .txt file containing R statistical software code. Once the “CWVS_LMC.txt” code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Required R packages:
    • For running “CWVS_LMC.txt”: msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
    • For running “Results_Summary.txt”: plotrix (plotting the posterior means and credible intervals)

    Instructions for Use / Reproducibility: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 of the presented simulation study.
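    A minimal R sketch of the per-week standardization described above (illustrative only: "z_raw" and "standardize_weeks" are hypothetical names, and the actual medians/IQRs are withheld):

    standardize_weeks <- function(z_raw) {
      # One row per individual, one column per week; center each week at its
      # median and scale by its interquartile range.
      apply(z_raw, 2, function(wk) (wk - median(wk)) / IQR(wk))
    }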
    How to use the information:
    • Load the “Simulated_Dataset.RData” workspace.
    • Run the code contained in “CWVS_LMC.txt”.
    • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.
    A minimal sketch of these steps appears after this description.

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This also allows the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Permissions: These are simulated data without any identifying information or informative birth-level covariates. The weekly pollution exposures are standardized as described above, and the medians and IQRs are not given, which further protects identifiability of the spatial locations used in the analysis.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
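    The replication steps above, as a minimal R sketch (assumes the files have been obtained from the listed contact and placed in the working directory):

    library(msm)         # sampling from the truncated normal distribution
    library(mnormt)      # sampling from the multivariate normal distribution
    library(BayesLogit)  # sampling from the Polya-Gamma distribution
    library(plotrix)     # plotting posterior means and credible intervals

    load("Simulated_Dataset.RData")  # provides y, x, z, n, m, p, alpha_true
    source("CWVS_LMC.txt")           # fit the LMC version of CWVS
    source("Results_Summary.txt")    # summarize/plot the estimated critical windows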

  2. Data from: Graph-based deep learning models for thermodynamic property...

    • figshare.com
    csv
    Updated Oct 30, 2024
    Cite
    Bowen Deng; Thijs Stuyver (2024). Graph-based deep learning models for thermodynamic property prediction: The interplay between target definition, data distribution, featurization, and model architecture [Dataset]. http://doi.org/10.6084/m9.figshare.27262947.v3
    Available download formats: csv
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Bowen Deng; Thijs Stuyver
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This folder contains the formation energies of the BDE-db, QM9, PC9, QMugs, and QMugs1.1 datasets after filtering (the training, test, and validation sets were randomly split in a ratio of 0.8, 0.1, and 0.1, respectively). The filtering procedure is described in the article "Graph-based deep learning models for thermodynamic property prediction: The interplay between target definition, data distribution, featurization, and model architecture", and the code can be found at https://github.com/chimie-paristech-CTM/thermo_GNN. After application of the filtering procedure described in the article, final versions of QM9 (127,007 data points), BDE-db (289,639 data points), PC9 (96,634 data points), QMugs (636,821 data points), and QMugs1.1 (70,546 data points) were obtained and used throughout the study.
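    A minimal R sketch of the 0.8/0.1/0.1 random split described above (the file name is a placeholder; the authors' actual splitting code is in the linked repository):

    set.seed(1)
    df  <- read.csv("qm9_filtered.csv")  # placeholder file name
    idx <- sample(c("train", "val", "test"), nrow(df),
                  replace = TRUE, prob = c(0.8, 0.1, 0.1))
    train <- df[idx == "train", ]
    val   <- df[idx == "val", ]
    test  <- df[idx == "test", ]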

  3. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
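    A brief R sketch of working with the long vs. wide tables described above (the ID column name "film_id" is an assumption; see the codebook for the actual variable names):

    films_long <- read.csv("1_film-dataset_festival-program_long.csv")
    table(duplicated(films_long$film_id))  # films screened at several festivals repeat
    # Keeping each film's first row mirrors the wide table's "first sample festival" rule:
    films_first <- films_long[!duplicated(films_long$film_id), ]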


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes eight text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found in the advanced search, potentially using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: the cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations.
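    The two matching methods named above can be illustrated with the stringdist package (whether the original scripts use this particular package is an assumption):

    library(stringdist)
    core       <- "The Watermelon Woman"
    candidates <- c("The Watermelon Woman", "Watermelon Woman, The", "The Watermelon Women")
    cos <- stringsim(core, candidates, method = "cosine", q = 2)  # high for similar titles
    osa <- stringsim(core, candidates, method = "osa")            # tolerant of typos/transpositions
    data.frame(candidates, cosine = round(cos, 2), osa = round(osa, 2))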

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  4. MNIST-Federated-Learning

    • zenodo.org
    csv, zip
    Updated Jul 3, 2023
    Cite
    Ferraguig Lynda; Benoit Alexandre; Bettinelli Mickael; Lin-Kwong-Chon Christophe (2023). MNIST-Federated-Learning [Dataset]. http://doi.org/10.5281/zenodo.8104408
    Available download formats: csv, zip
    Dataset updated
    Jul 3, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ferraguig Lynda; Benoit Alexandre; Bettinelli Mickael; Lin-Kwong-Chon Christophe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please find below descriptions of the three configurations for partitioning the MNIST training dataset across 10 clients, followed by the layout of the provided data:

    1. Balanced Distribution: In the first configuration, the MNIST dataset is partitioned among 10 clients in a balanced manner. This means that the data samples from each class are evenly distributed among the clients. Each client receives a roughly equal number of images from each digit class, ensuring that the distribution of samples across clients is proportional and representative of the overall dataset. [ Config 1]
    2. Heterogeneous Distribution (One Class per Client): In the second configuration, the MNIST dataset is partitioned in a heterogeneous manner, where each client is assigned a single digit class exclusively. This means that one client will only receive images of the digit '0', another client will receive images of the digit '1', and so on. In this setup, each client becomes an expert in classifying a specific digit, allowing for specialized training and evaluation. [ Config 2]
    3. Mixed Distribution: In the third configuration, the MNIST dataset is partitioned using a mixed distribution approach. This means that the data samples from all digit classes are distributed among the 10 clients, but the distribution is not necessarily balanced. The number of samples assigned to each client may vary for different digit classes, resulting in an uneven distribution across the clients. This configuration aims to capture both the overall diversity of the dataset and the varying difficulty levels of classifying different digits. [ Config 3 ]

    Mnist-dataset/
    ├── config1/
    │ ├── client-1/
    │ │ └── data.csv
    │ ├── client-2/
    │ │ └── data.csv
    │ ├── client-3/
    │ │ └── data.csv
    │ └── ...
    ├── config2/
    │ ├── client-1/
    │ │ └── data.csv
    │ ├── client-2/
    │ │ └── data.csv
    │ ├── client-3/
    │ │ └── data.csv
    │ └── ...
    ├── config3/
    │ ├── client-1/
    │ │ └── data.csv
    │ ├── client-2/
    │ │ └── data.csv
    │ ├── client-3/
    │ │ └── data.csv
    │ └── ...
    └── mnist_test.csv
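    A minimal R sketch of loading one configuration from the layout above (assumes the archive has been unzipped to "Mnist-dataset/"; inspect the CSVs for their exact column layout):

    client_files <- sprintf("Mnist-dataset/config2/client-%d/data.csv", 1:10)
    clients      <- lapply(client_files, read.csv)
    test_set     <- read.csv("Mnist-dataset/mnist_test.csv")
    sapply(clients, nrow)  # per-client counts; in config 2 each client holds a single digit class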

    ***

    License: Yann LeCun and Corinna Cortes hold the copyright of the MNIST dataset, which is a derivative work from original NIST datasets. The MNIST dataset is made available under the terms of the Creative Commons Attribution-Share Alike 3.0 license.

    ***

  5. Dataset for Development of Volatility Distribution for Organic Matter in...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 6, 2022
    Cite
    U.S. EPA Office of Research and Development (ORD) (2022). Dataset for Development of Volatility Distribution for Organic Matter in Biomass Burning Emissions [Dataset]. https://catalog.data.gov/dataset/dataset-for-development-of-volatility-distribution-for-organic-matter-in-biomass-burning-e
    Dataset updated
    Nov 6, 2022
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset includes volatility information derived from various biomass burning samples. Emission factors, volatility distributions, mass fractions, and response factors are reported. A data dictionary is provided to define each variable in the dataset. This dataset is associated with the following publication: Sinha, A., I. George, A. Holder, W. Preston, M. Hays, and A. Grieshop. Development of Volatility Distribution for Organic Matter in Biomass Burning Emissions. Environmental Science: Atmospheres. Royal Society of Chemistry, Cambridge, UK, 0000, (2022).

  6. Gender, Age, and Emotion Detection from Voice

    • kaggle.com
    Updated May 29, 2021
    Cite
    Rohit Zaman (2021). Gender, Age, and Emotion Detection from Voice [Dataset]. https://www.kaggle.com/datasets/rohitzaman/gender-age-and-emotion-detection-from-voice/suggestions
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 29, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rohit Zaman
    Description

    Context

    Our target was to predict gender, age, and emotion from audio. We found labeled audio datasets from Mozilla and RAVDESS. Using the R programming language, 20 statistical features were extracted, and after adding the labels these datasets were formed. Audio files were collected from "Mozilla Common Voice" and the “Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)”.

    Content

    The datasets contain 20 feature columns and 1 column denoting the label. The 20 statistical features were extracted through frequency spectrum analysis using the R programming language. They are:

    1. meanfreq - The mean frequency (in kHz), a pitch measure that assesses the center of the distribution of power across frequencies.
    2. sd - The standard deviation of frequency, a statistical measure that describes the dataset’s dispersion relative to its mean; it is calculated as the square root of the variance.
    3. median - The median frequency (in kHz), the middle number in the sorted list of values.
    4. Q25 - The first quartile (in kHz), referred to as Q1: the median of the lower half of the data set, so about 25 percent of the values lie below Q1 and about 75 percent above it.
    5. Q75 - The third quartile (in kHz), referred to as Q3: the central point between the median and the highest value.
    6. IQR - The interquartile range (in kHz), a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles.
    7. skew - The skewness, the degree of distortion from the normal distribution; it measures the lack of symmetry in the data distribution.
    8. kurt - The kurtosis, a statistical measure of how much the tails of the distribution differ from the tails of a normal distribution; effectively a measure of the outliers present in the distribution.
    9. sp.ent - The spectral entropy, a measure of signal irregularity that sums up the normalized spectral power.
    10. sfm - The spectral flatness or tonality coefficient, also known as Wiener entropy, used in digital signal processing to characterize an audio spectrum; usually measured in decibels, it offers a way to quantify how tone-like, as opposed to noise-like, a sound is.
    11. mode - The mode frequency, the most frequently observed value in the data set.
    12. centroid - The spectral centroid, a metric used to describe a spectrum in digital signal processing; it indicates where the spectrum’s center of mass is located.
    13. meanfun - The average fundamental frequency measured across the acoustic signal.
    14. minfun - The minimum fundamental frequency measured across the acoustic signal.
    15. maxfun - The maximum fundamental frequency measured across the acoustic signal.
    16. meandom - The average dominant frequency measured across the acoustic signal.
    17. mindom - The minimum dominant frequency measured across the acoustic signal.
    18. maxdom - The maximum dominant frequency measured across the acoustic signal.
    19. dfrange - The range of dominant frequency measured across the acoustic signal.
    20. modindx - The modulation index, which quantifies the degree of frequency modulation, expressed numerically as the ratio of the frequency deviation to the frequency of the modulating signal for a pure tone modulation.
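    A few of these features can be illustrated with a base-R sketch (not the authors' extraction code; "spectral_feats" is a hypothetical helper operating on a mono waveform x sampled at sr Hz):

    spectral_feats <- function(x, sr) {
      sp  <- Mod(fft(x))[seq_len(length(x) %/% 2)]           # one-sided magnitude spectrum
      f   <- seq(0, sr / 2, length.out = length(sp)) / 1000  # frequency axis, kHz
      w   <- sp / sum(sp)                                    # treat spectrum as a distribution
      cdf <- cumsum(w)
      q   <- function(p) f[which(cdf >= p)[1]]               # weighted quantile
      mu  <- sum(f * w)
      c(meanfreq = mu,
        sd       = sqrt(sum(w * (f - mu)^2)),
        median   = q(0.50),
        Q25      = q(0.25),
        Q75      = q(0.75),
        IQR      = q(0.75) - q(0.25))
    }
    spectral_feats(sin(2 * pi * 440 * seq(0, 1, by = 1 / 8000)), sr = 8000)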

    Acknowledgements

    Gender and age audio data source: https://commonvoice.mozilla.org/en. Emotion audio data source: https://smartlaboratory.org/ravdess/

  7. Data from: Dataset of lightning flashovers on medium voltage distribution...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 31, 2023
    Cite
    Sarajcev, Petar (2023). Dataset of lightning flashovers on medium voltage distribution lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7382547
    Dataset updated
    Jan 31, 2023
    Dataset authored and provided by
    Sarajcev, Petar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This synthetic dataset was generated from Monte Carlo simulations of lightning flashovers on medium voltage (MV) distribution lines. It is suitable for training machine learning models for classifying lightning flashovers on distribution lines. The dataset is hierarchical in nature (see below for more information) and class imbalanced.

    The following five types of lightning interaction with the MV distribution line have been simulated: (1) direct strike to a phase conductor (when there is no shield wire present on the line), (2) direct strike to a phase conductor with shield wire(s) present on the line (i.e., shielding failure), (3) direct strike to a shield wire with a backflashover event, (4) indirect nearby lightning strike to ground where a shield wire is not present, and (5) indirect nearby lightning strike to ground where a shield wire is present on the line. The last two types of interaction induce overvoltages on the phase conductors through EM fields radiated from the strike channel that couple to the line conductors. Three different methods of indirect strike analysis have been implemented: Rusck's model, the Chowdhuri-Gross model, and the Liew-Mar model. Shield wire(s) provide shielding effects against direct strikes, as well as screening effects against indirect strikes.

    The dataset covers two independent distribution lines, with heights of 12 m and 15 m, each with a flat configuration of phase conductors. Twin shield wires, if present, are 1.5 m above the phase conductors and 3 m apart [2]. The CFO level of the 12 m distribution line is 150 kV and that of the 15 m distribution line is 160 kV. The dataset consists of 10,000 simulations for each of the distribution lines.

    The dataset contains the following variables (features):

    'dist': perpendicular distance of the lightning strike location from the distribution line axis (m), generated from the Uniform distribution [0, 500] m,

    'ampl': lightning current amplitude of the strike (kA), generated from the Log-Normal distribution (see IEC 60071 for additional information),

    'front': lightning current wave-front time (us), generated from the Log-Normal distribution; it needs to be emphasized that amplitudes (ampl) and wave-front times (front), as random variables, have been generated from the appropriate bivariate probability distribution which includes statistical correlation between these variates,

    'veloc': velocity of the lightning return-stroke current, defined indirectly through a parameter "w" generated from the Uniform distribution [50, 500] m/us and used to compute the velocity as v = c/sqrt(1 + w/I), where "c" is the speed of light in free space (300 m/us) and "I" is the lightning current amplitude (see the sketch after this list),

    'shield': binary indicator that signals presence or absence of the shield wire(s) on the line (0/1), generated from the Bernoulli distribution with a 50% probability,

    'Ri': average value of the impulse impedance of the tower's grounding (Ohm), generated from the Normal distribution (clipped at zero on the left side) with a median value of 50 Ohm and a standard deviation of 12.5 Ohm; it should be mentioned that the impulse impedance is often much larger than the associated grounding resistance value, which is why a rather high value of 50 Ohm has been used here,

    'EGM': electrogeometric model used for analyzing striking distances of the distribution line's tower; following options are available: 'Wagner', 'Young', 'AW', 'BW', 'Love', and 'Anderson', where 'AW' stands for Armstrong & Whitehead, while 'BW' means Brown & Whitehead model; statistical distribution of EGM models follows a user-defined discrete categorical distribution with respective probabilities: p = [0.1, 0.2, 0.1, 0.1, 0.3, 0.2],

    'ind': indirect stroke model used for analyzing near-by indirect lightning strikes; following options were implemented: 'rusk' for the Rusck's model, 'chow' for the Chowdhuri-Gross model (with Jakubowski modification) and 'liew' for the Liew-Mar model; statistical distribution of these three models follows a user-defined discrete categorical distribution with respective probabilities: p = [0.6, 0.2, 0.2],

    'CFO': critical flashover voltage level of the distribution line's insulation (kV),

    'height': height of the phase conductors of the distribution line (m),

    'flash': binary indicator that signals if the flashover has been recorded (1) or not (0). This variable is the outcome/label (i.e. binary class).
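    How the sampled features above relate can be sketched in R as follows (this is not the generator used to build the dataset; the Log-Normal parameters are placeholders, and the documented correlation between amplitude and wave-front time is omitted):

    n      <- 5
    dist   <- runif(n, 0, 500)                          # strike distance, m
    ampl   <- rlnorm(n, meanlog = log(30), sdlog = 0.5) # amplitude, kA (placeholder parameters)
    w      <- runif(n, 50, 500)                         # velocity parameter, m/us
    c0     <- 300                                       # speed of light, m/us
    veloc  <- c0 / sqrt(1 + w / ampl)                   # return-stroke velocity, m/us
    shield <- rbinom(n, 1, 0.5)                         # shield wire present (1) or absent (0)
    Ri     <- pmax(rnorm(n, 50, 12.5), 0)               # impulse grounding impedance, Ohm
    EGM    <- sample(c("Wagner", "Young", "AW", "BW", "Love", "Anderson"),
                     n, replace = TRUE, prob = c(0.1, 0.2, 0.1, 0.1, 0.3, 0.2))
    ind    <- sample(c("rusk", "chow", "liew"), n, replace = TRUE, prob = c(0.6, 0.2, 0.2))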

    Mathematical background used for the analysis of lightning interaction with the MV distribution line can be found in the references cited below.

    References:

    A. R. Hileman, "Insulation Coordination for Power Systems", CRC Press, Boca Raton, FL, 1999.

    J. A. Martinez and F. Gonzalez-Molina, "Statistical evaluation of lightning overvoltages on overhead distribution lines using neural networks," in IEEE Transactions on Power Delivery, vol. 20, no. 3, pp. 2219-2226, July 2005.

    A. Borghetti, C. A. Nucci and M. Paolone, "An Improved Procedure for the Assessment of Overhead Line Indirect Lightning Performance and Its Comparison with the IEEE Std. 1410 Method," IEEE Transactions on Power Delivery, vol. 22, no. 1, pp. 684-692, 2007.

  8. Results and analysis using the Lean Six-Sigma define, measure, analyze,...

    • researchdata.up.ac.za
    docx
    Updated Mar 12, 2024
    Cite
    Modiehi Mophethe (2024). Results and analysis using the Lean Six-Sigma define, measure, analyze, improve, and control (DMAIC) Framework [Dataset]. http://doi.org/10.25403/UPresearchdata.25370374.v1
    Available download formats: docx
    Dataset updated
    Mar 12, 2024
    Dataset provided by
    University of Pretoria
    Authors
    Modiehi Mophethe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This section presents a discussion of the research data. The data were received as secondary data; however, they were originally collected using time study techniques. Data validation is a crucial step in the data analysis process to ensure that the data are accurate, complete, and reliable. Descriptive statistics were used to validate the data: the mean, mode, standard deviation, variance, and range provide a summary of the data distribution and assist in identifying outliers or unusual patterns.

    The dataset reports the measures of central tendency: the mean, median, and mode. The mean signifies the average value of each of the factors presented in the tables; it is the balance point of the dataset and describes its typical value and behaviour. The median is the middle value of the dataset for each of the factors presented: the point where the dataset is divided into two parts, with half of the values lying below it and the other half above it. This is important for skewed distributions. The mode shows the most common value in the dataset and was used to describe the most typical observation. These values describe the central value around which the data are distributed. Because the mean, mode, and median are neither similar nor close to one another, they indicate a skewed distribution.

    The dataset also presents the results and a discussion of the results. This section focuses on the customisation of the DMAIC (Define, Measure, Analyse, Improve, Control) framework to address the specific concerns outlined in the problem statement. To gain a comprehensive understanding of the current process, value stream mapping was employed, further enhanced by measuring the factors that contribute to inefficiencies. These factors were then analysed and ranked based on their impact, utilising factor analysis. To mitigate the impact of the most influential factor on project inefficiencies, a solution is proposed using the EOQ (Economic Order Quantity) model. The implementation of the 'CiteOps' software facilitates improved scheduling, monitoring, and task delegation in the construction project through digitalisation; furthermore, project progress and efficiency are monitored remotely and in real time. In summary, the DMAIC framework was tailored to the requirements of the specific project, incorporating techniques from inventory management, project management, and statistics to effectively minimise inefficiencies within the construction project.
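    For reference, the EOQ model mentioned above reduces to the classic formula EOQ = sqrt(2DS/H); a quick R sketch with illustrative inputs (not the study's actual figures):

    # D: annual demand (units/yr), S: cost per order, H: holding cost per unit per year
    eoq <- function(D, S, H) sqrt(2 * D * S / H)
    eoq(D = 1200, S = 150, H = 4)  # 300 units per order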

  9. NADA-SynShapes: A synthetic shape benchmark for testing probabilistic deep...

    • zenodo.org
    text/x-python, zip
    Updated Apr 16, 2025
    Cite
    Giulio Del Corso; Volpini Federico; Claudia Caudai; Davide Moroni; Sara Colantonio (2025). NADA-SynShapes: A synthetic shape benchmark for testing probabilistic deep learning models [Dataset]. http://doi.org/10.5281/zenodo.15194187
    Available download formats: zip, text/x-python
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Giulio Del Corso; Volpini Federico; Claudia Caudai; Davide Moroni; Sara Colantonio
    License

    Attribution-NonCommercial-NoDerivs 2.5 (CC BY-NC-ND 2.5): https://creativecommons.org/licenses/by-nc-nd/2.5/
    License information was derived automatically

    Time period covered
    Dec 18, 2024
    Description

    NADA (Not-A-Database) is an easy-to-use geometric shape data generator that allows users to define non-uniform multivariate parameter distributions to test novel methodologies. The full open-source package is provided at GIT:NA_DAtabase. See Technical Report for details on how to use the provided package.

    This database includes 3 repositories:

    • NADA_Dis: Is the model able to correctly characterize/Disentangle a complex latent space?
      The repository contains 3x100,000 synthetic black and white images to test the ability of the models to correctly define a proper latent space (e.g., autoencoders) and disentangle it. The first 100,000 images contain 4 shapes and uniform parameter space distributions, while the other images have a more complex underlying distribution (truncated Gaussian and correlated marginal variables).

    • NADA_OOD: Does the model identify Out-Of-Distribution images?
      The repository contains 100,000 training images (4 different shapes with 3 possible colors located in the upper left corner of the canvas) and 6x100,000 increasingly different sets of images (changing the color class balance, reducing the radius of the shape, moving the shape to the lower left corner) providing increasingly challenging out-of-distribution images.
      This can help to test not only the capability of a model, but also methods that produce reliability estimates and should correctly classify OOD elements as "unreliable" as they are far from the original distributions.

    • NADA_AlEp: Does the model distinguish between different types (Aleatoric/Epistemic) of uncertainties?
      The repository contains 5x100,000 images with different type of noise/uncertainties:
      • NADA_AlEp_0_Clean: Dataset clean of noise to use as a possible training set.
      • NADA_AlEp_1_White_Noise: Epistemic white noise dataset. Each image is perturbed with an amount of white noise randomly sampled from 0% to 90%.
      • NADA_AlEp_2_Deformation: Dataset with epistemic deformation noise. Each image is deformed by a random amount uniformly sampled between 0% and 90%, where 0% corresponds to the original image and 100% is a full deformation to the circumscribing circle.
      • NADA_AlEp_3_Label: Dataset with label noise. Formally, 20% of triangles of a given color are misclassified as a square with a random color (among blue, orange, and brown), and vice versa (squares to triangles). Label noise introduces aleatoric uncertainty because it is inherent in the data and cannot be reduced.
      • NADA_AlEp_4_Combined: Combined dataset with all previous sources of uncertainty.
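    As an illustration of the first noise type, a white-noise perturbation in the spirit of NADA_AlEp_1 can be sketched in R (the exact mechanics are an assumption; see the Technical Report and the open-source generator for the real implementation):

    perturb_white <- function(img) {  # img: grayscale matrix with values in [0, 1]
      p <- runif(1, 0, 0.9)           # noise amount, randomly sampled from 0% to 90%
      (1 - p) * img + p * matrix(runif(length(img)), nrow = nrow(img))
    }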

    Each image can be used for classification (shape/color) or regression (radius/area) tasks.

    All datasets can be modified and adapted to the user's research question using the included open source data generator.

  10. Coho Distribution [ds326]

    • catalog.data.gov
    • data.ca.gov
    Updated Jul 24, 2025
    Cite
    California Department of Fish and Wildlife (2025). Coho Distribution [ds326] [Dataset]. https://catalog.data.gov/dataset/coho-distribution-ds326-38277
    Dataset updated
    Jul 24, 2025
    Dataset provided by
    California Department of Fish and Wildlife
    Description

    June 2016 Version. This dataset represents the "Observed Distribution" of coho salmon in California, using only observations made between 1990 and the present. It was developed for the express purpose of assisting with species recovery planning efforts. The process for developing this dataset was to collect as many observations of the species as possible and derive the stream-based geographic distribution of the species based solely on these positive observations.

    For the purpose of this dataset, an observation is defined as a report of a sighting or other evidence of the presence of the species at a given place and time. As such, observations are modeled by year observed as point locations in the GIS. All such observations were collected with information regarding who reported the observation, their agency/organization/affiliation, the date that they observed the species, who compiled the information, etc. This information is maintained in the developer's file geodatabase (© Environmental Systems Research Institute (ESRI) 2016).

    To develop this distribution dataset, the species observations were applied to California Streams, a CDFW derivative of USGS National Hydrography Dataset (NHD) High Resolution hydrography. For each observation, a path was traced down the hydrography from the point of observation to the ocean, thereby deriving the shortest migration route from the point of observation to the sea. By appending all of these migration paths together, the "Observed Distribution" for the species was developed.

    It is important to note that this layer does not attempt to model the entire possible distribution of the species; rather, it only represents the known distribution based on where the species has been observed and reported. While some observations indeed represent the upstream extent of the species (e.g., an observation made at a hard barrier), the majority of observations only indicate where the species was sampled for or otherwise observed. Because of this, this dataset likely underestimates the absolute geographic distribution of the species.

    It is also important to note that the species may not be found on an annual basis in all indicated reaches due to natural variations in run size, water conditions, and other environmental factors. As such, the information in this dataset should not be used to verify that the species is currently present in a given stream. Conversely, the absence of distribution linework for a given stream does not necessarily indicate that the species does not occur in that stream.

    The observation data were compiled from a variety of disparate sources including, but not limited to, CDFW, USFS, NMFS, timber companies, and the public. Forms of documentation include CDFW administrative reports, personal communications with biologists, observation reports, and literature reviews. The source of each feature (to the best available knowledge) is included in the data attributes for the observations in the geodatabase, but not for the resulting linework. The spatial data have been referenced to California Streams, a CDFW derivative of USGS NHD High Resolution hydrography.

    Usage of this dataset. Examples of appropriate uses include:
    • Species recovery planning
    • Evaluation of future survey sites for the species
    • Validating species distribution models

    Examples of inappropriate uses include:
    • Assuming absence of a line feature means that the species is not present in that stream.
    • Using this data to make parcel or ground-level land use management decisions.
    • Using this dataset to prove or support non-existence of the species at any spatial scale.
    • Assuming that the line feature represents the maximum possible extent of species distribution.

    All users of this data should seek the assistance of qualified professionals such as surveyors, hydrologists, or fishery biologists as needed to ensure that they possess complete, precise, and up-to-date information on species distribution and water body location. Any copy of this dataset is considered a snapshot of the species distribution at the time of release. It is incumbent upon the user to ensure that they have the most recent version prior to making management or planning decisions. Please refer to the "Use Constraints" section below.

  11. 3D Clustering Dataset

    • kaggle.com
    Updated May 29, 2024
    Cite
    Umair Zia (2024). 3D Clustering Dataset [Dataset]. https://www.kaggle.com/datasets/stealthtechnologies/cluster3d-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Umair Zia
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is ideal for practicing and evaluating advanced clustering techniques, allowing for the exploration of different methods to handle complex data distributions and improve cluster separability.

    x: Type: Continuous Description: The x-coordinate of a point in the 3D space. Each cluster has its own center and spread in the x dimension, contributing to the overall distribution of points.

    y: Type: Continuous Description: The y-coordinate of a point in the 3D space. Like the x-coordinate, each cluster has a specific center and spread in the y dimension, affecting the clustering structure.

    z: Type: Continuous Description: The z-coordinate of a point in the 3D space. The z dimension, along with x and y, defines the 3D position of each point, with each cluster having distinct central and spread values.

    color: Type: Categorical
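    Since the description of the color column is cut off above, here is only a quick, hedged R sketch of using the coordinates for clustering practice (the file name and cluster count are assumptions):

    pts <- read.csv("cluster3d.csv")  # placeholder file name
    km  <- kmeans(scale(pts[, c("x", "y", "z")]), centers = 4, nstart = 20)
    table(cluster = km$cluster, label = pts$color)  # compare clusters to labeled colors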

  12. ABS - Personal Income - Total Income Distribution (SA3) 2017-2018 - Dataset...

    • data.aurin.org.au
    Updated Mar 5, 2025
    Cite
    (2025). ABS - Personal Income - Total Income Distribution (SA3) 2017-2018 - Dataset - AURIN [Dataset]. https://data.aurin.org.au/dataset/au-govt-abs-abs-personal-income-total-income-distribution-sa3-2017-18-sa3-2016
    Dataset updated
    Mar 5, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset presents information about total income distribution. The data covers the financial year 2017-2018 and is based on Statistical Area Level 3 (SA3) according to the 2016 edition of the Australian Statistical Geography Standard (ASGS).

    Total Income is the sum of all reported income derived from employee income, own unincorporated business, superannuation, investments, and other income. Total income does not include the non-lodger population. Government pensions, benefits, or allowances are excluded from the Australian Bureau of Statistics (ABS) income data and do not appear in Other income or Total income. Pension recipients can fall below the income threshold that necessitates lodging a tax return, or they may only receive tax-free pensions or allowances; hence they will be missing from the personal income tax data set. Recent estimates from the ABS Survey of Income and Housing (which records government pensions and allowances) suggest that this component can account for between 9% and 11% of Total income.

    All monetary values are presented as gross pre-tax dollars, as far as possible. This means they reflect income before deductions and losses, and before any taxation or levies (e.g., the Medicare levy or the temporary budget repair levy) are applied. The amounts shown are nominal; they have not been adjusted for inflation. The income presented in this release has been categorised into income types devised by the ABS to align closely with ABS definitions of income.

    The statistics in this release are compiled from the Linked Employer-Employee Dataset (LEED), a cross-sectional database based on administrative data from the Australian taxation system. The LEED includes more than 120 million tax records over seven consecutive years between 2011-12 and 2017-18. Please note: all personal income tax statistics included in LEED were provided in de-identified form with no home address or date of birth. Addresses were coded to the ASGS, and date of birth was converted to an age at 30 June of the reference year prior to data provision.

  13. A dataset of spatiotemporal changes in potential desertification areas in...

    • scidb.cn
    Updated Sep 10, 2024
    Cite
    Cai Yifei; wang feng; Pan Xubin; Zhang Fangmin; Ren Guoyu; Lu Qi (2024). A dataset of spatiotemporal changes in potential desertification areas in China (1930–2100) [Dataset]. http://doi.org/10.57760/sciencedb.08395
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Cai Yifei; wang feng; Pan Xubin; Zhang Fangmin; Ren Guoyu; Lu Qi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    The dataset is based on the definition of the potential geographical distribution of desertification according to the United Nations Convention to Combat Desertification. Using the Climatic Research Unit Time Series 4.07 (CRU TS4.07) meteorological dataset and data from 10 climate models in the Coupled Model Intercomparison Project 6 (CMIP6), along with future CO2 concentration data, and considering the impact of future changes in CO2 concentration on potential evapotranspiration, we calculated the spatial distribution of the potential geographical distribution of desertification in China from 1930 to 2100, yielding a long-term time-series spatial distribution dataset. The dataset can provide scientific data support for research on the long-term spatial distribution of potential desertification in China and related studies, as well as for China's desertification control strategy and the construction of major ecological projects in northern China.

  14. Data from: Covid19Kerala.info-Data: A collective open dataset of COVID-19...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 6, 2020
    Cite
    Hritwik N Edavalath (2020). Covid19Kerala.info-Data: A collective open dataset of COVID-19 outbreak in the south Indian state of Kerala [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3818096
    Dataset updated
    Sep 6, 2020
    Dataset provided by
    Nishad Thalhath
    Kumar Sujith
    Jijo Ulahannan
    Jeevan Uthaman
    Sharadh Manian
    Nikhil Narayanan
    Sreehari Pillai
    E Rajeevan
    Shabeesh Balan
    Sooraj P Suresh
    Unnikrishnan Sureshkumar
    Prem Prabhakaran
    Sindhu Joseph
    Neetha Nanoth Vellichirammal
    Akhil Balakrishnan
    Hritwik N Edavalath
    Manoj Karingamadathil
    Musfir Mohammed
    Sreekanth Chaliyeduth
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Area covered
    South India, India, Kerala
    Description

    Covid19Kerala.info-Data is a consolidated multi-source open dataset of metadata from the COVID-19 outbreak in the Indian state of Kerala. It is created and maintained by volunteers of ‘Collective for Open Data Distribution-Keralam’ (CODD-K), a nonprofit consortium of individuals formed for the distribution and longevity of open datasets. Covid19Kerala.info-Data covers a set of correlated temporal and spatial metadata of SARS-CoV-2 infections and prevention measures in Kerala. Static snapshot releases of this dataset are manually produced from a live database maintained as a set of publicly accessible Google sheets. This dataset is made available under the Open Data Commons Attribution License v1.0 (ODC-BY 1.0).

    Schema and data package: A datapackage with a schema definition is accessible at https://codd-k.github.io/covid19kerala.info-data/datapackage.json. The provided datapackage and schema are based on the Frictionless Data "Data Package" specification.
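    A minimal sketch of inspecting that datapackage in R with the jsonlite package (the package choice is ours, not the maintainers'; the resource fields follow the Frictionless specification):

    library(jsonlite)
    dp <- fromJSON("https://codd-k.github.io/covid19kerala.info-data/datapackage.json",
                   simplifyVector = FALSE)
    dp$name
    vapply(dp$resources, function(r) r$path, character(1))  # list the dataset resources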

    Temporal and Spatial Coverage

    This dataset covers COVID-19 outbreak and related data from the state of Kerala, India, from January 31, 2020 till the date of the publication of this snapshot. The dataset shall be maintained throughout the entirety of the COVID-19 outbreak.

    The spatial coverage of the data lies within the geographical boundaries of the Kerala state, which includes its 14 administrative subdivisions. The state is further divided into Local Self Governing (LSG) Bodies. Reference to this spatial information is included on appropriate data facets. Available spatial information on regions outside Kerala is mentioned, but only as a reference to the possible origins of infection clusters or the movement of individuals.

    Longevity and Provenance

    The dataset snapshot releases are published and maintained in a designated GitHub repository maintained by CODD-K team. Periodic snapshots from the live database will be released at regular intervals. The GitHub commit logs for the repository will be maintained as a record of provenance, and archived repository will be maintained at the end of the project lifecycle for the longevity of the dataset.

    Data Stewardship

    CODD-K expects all administrators, managers, and users of its datasets to manage, access, and utilize them in a manner consistent with the consortium's need for security and confidentiality and with relevant legal frameworks in all geographies, especially Kerala and India. As a responsible steward that maintains and makes this dataset accessible, CODD-K absolves itself of all liability for damages, if any, caused by inaccuracies in the dataset.

    License

    This dataset is made available by the CODD-K consortium under ODC-BY 1.0 license. The Open Data Commons Attribution License (ODC-By) v1.0 ensures that users of this dataset are free to copy, distribute and use the dataset to produce works and even to modify, transform and build upon the database, as long as they attribute the public use of the database or works produced from the same, as mentioned in the citation below.

    Disclaimer

    Covid19Kerala.info-Data is provided under the ODC-BY 1.0 license as-is. Though every attempt is made to ensure that the data is error-free and up to date, the CODD-K consortium does not bear any responsibility for inaccuracies in the dataset or any losses, monetary or otherwise, that users of this dataset may incur.

  15. Retail Sales Dataset

    • kaggle.com
    Updated Aug 22, 2023
    Cite
    Mohammad Talib (2023). Retail Sales Dataset [Dataset]. https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mohammad Talib
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Welcome to the Retail Sales and Customer Demographics Dataset! This synthetic dataset has been meticulously crafted to simulate a dynamic retail environment, providing an ideal playground for those eager to sharpen their data analysis skills through exploratory data analysis (EDA). With a focus on retail sales and customer characteristics, this dataset invites you to unravel intricate patterns, draw insights, and gain a deeper understanding of customer behavior.

    Dataset Overview:

    This dataset is a snapshot of a fictional retail landscape, capturing essential attributes that drive retail operations and customer interactions. It includes key details such as Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. These attributes enable a multifaceted exploration of sales trends, demographic influences, and purchasing behaviors.
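    As a starting point, the sketch below loads the data and computes two of the groupings suggested above. This is a minimal sketch that assumes the CSV has been downloaded locally as retail_sales_dataset.csv; the file name and exact column spellings are assumptions.

        # Minimal EDA sketch; file name and column names are assumptions.
        # read.csv's default check.names = TRUE turns "Product Category"
        # into "Product.Category", and likewise for other multi-word columns.
        sales <- read.csv("retail_sales_dataset.csv", stringsAsFactors = FALSE)

        # Average spend and quantity per product category
        aggregate(cbind(Total.Amount, Quantity) ~ Product.Category,
                  data = sales, FUN = mean)

        # Total spend by gender and age band
        sales$AgeBand <- cut(sales$Age, breaks = c(0, 25, 40, 60, Inf))
        aggregate(Total.Amount ~ Gender + AgeBand, data = sales, FUN = sum)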

    Why Explore This Dataset?

    • Realistic Representation: Though synthetic, the dataset mirrors real-world retail scenarios, allowing you to practice analysis within a familiar context.
    • Diverse Insights: From demographic insights to product preferences, the dataset offers a broad spectrum of factors to investigate.
    • Hypothesis Generation: As you perform EDA, you'll have the chance to formulate hypotheses that can guide further analysis and experimentation.
    • Applied Learning: Uncover actionable insights that retailers could use to enhance their strategies and customer experiences.

    Questions to Explore:

    • How do customer age and gender influence purchasing behavior?
    • Are there discernible patterns in sales across different time periods? (One such check is sketched after this list.)
    • Which product categories hold the highest appeal among customers?
    • What are the relationships between age, spending, and product preferences?
    • How do customers adapt their shopping habits during seasonal trends?
    • Are there distinct purchasing behaviors based on the number of items bought per transaction?
    • What insights can be gleaned from the distribution of product prices within each category?
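    For the time-period question flagged above, a companion sketch follows; it uses the same assumed file and column names, and the YYYY-MM-DD date format is also an assumption.

        # Monthly sales totals; assumes Date is stored as "YYYY-MM-DD".
        sales <- read.csv("retail_sales_dataset.csv", stringsAsFactors = FALSE)
        sales$Month <- format(as.Date(sales$Date), "%Y-%m")

        monthly <- aggregate(Total.Amount ~ Month, data = sales, FUN = sum)
        barplot(monthly$Total.Amount, names.arg = monthly$Month, las = 2)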

    Your EDA Journey:

    Prepare to immerse yourself in a world of data-driven exploration. Through data visualization, statistical analysis, and correlation examination, you'll uncover the nuances that define retail operations and customer dynamics. EDA isn't just about numbers—it's about storytelling with data and extracting meaningful insights that can influence strategic decisions.

    Embrace the Retail Sales and Customer Demographics Dataset as your canvas for discovery. As you traverse the landscape of this synthetic retail environment, you'll refine your analytical skills, pose intriguing questions, and contribute to the ever-evolving narrative of the retail industry. Happy exploring!

  16. CommunitySurvey2023unweighted

    • catalog.data.gov
    • datasets.ai
    • +8 more
    Updated Sep 20, 2024
    + more versions
    Cite
    City of Tempe (2024). CommunitySurvey2023unweighted [Dataset]. https://catalog.data.gov/dataset/communitysurvey2023unweighted
    Explore at:
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    City of Tempe
    Description

    These data include the individual responses for the City of Tempe Annual Community Survey conducted by ETC Institute. This dataset has two layers and includes both the weighted data and the unweighted data. Weighting is a statistical method in which datasets are adjusted through calculations to more accurately represent the population being studied; the weighted data are used in the final published PDF report. (A short sketch contrasting weighted and unweighted estimates appears after this entry's metadata.)

    These data help determine priorities for the community as part of the City's ongoing strategic planning process. Averaged Community Survey results are used as indicators for several city performance measures. The summary data for each performance measure is provided as an open dataset for that measure (separate from this dataset). The performance measures with indicators from the survey include the following (as of 2023):

    1. Safe and Secure Communities
       1.04 Fire Services Satisfaction
       1.06 Crime Reporting
       1.07 Police Services Satisfaction
       1.09 Victim of Crime
       1.10 Worry About Being a Victim
       1.11 Feeling Safe in City Facilities
       1.23 Feeling of Safety in Parks
    2. Strong Community Connections
       2.02 Customer Service Satisfaction
       2.04 City Website Satisfaction
       2.05 Online Services Satisfaction Rate
       2.15 Feeling Invited to Participate in City Decisions
       2.21 Satisfaction with Availability of City Information
    3. Quality of Life
       3.16 City Recreation, Arts, and Cultural Centers
       3.17 Community Services Programs
       3.19 Value of Special Events
       3.23 Right of Way Landscape Maintenance
       3.36 Quality of City Services
    4. Sustainable Growth & Development: no performance measures in this category presently relate directly to the Community Survey.
    5. Financial Stability & Vitality: no performance measures in this category presently relate directly to the Community Survey.

    Methods:
    The survey is mailed to a random sample of households in the City of Tempe. Follow-up emails and texts are also sent to encourage participation, and a link to the survey is provided with each communication. To prevent people who do not live in Tempe, or who were not selected as part of the random sample, from completing the survey, everyone who completed the survey was required to provide their address. These addresses were then matched to those used for the random representative sample; if the respondent’s address did not match, the response was not used. To better understand how services are being delivered across the city, individual results were mapped to determine overall distribution across the city. Additionally, demographic data were used to monitor the distribution of responses to ensure the responding population of each survey is representative of the city population.

    Processing and Limitations:
    The location data in this dataset are generalized to the block level to protect privacy: only the first two digits of an address are used to map the location. When the data are shared with the city, only the latitude/longitude of the block-level address points are provided, which results in points that overlap. To better visualize the data, overlapping points were randomly dispersed to remove the overlap. Together, these two adjustments ensure that points are not tied to a specific address but are still close enough to allow insights about service delivery in different areas of the city.
    The weighted data are used by the ETC Institute in the final published PDF report. The 2023 Annual Community Survey report is available on data.tempe.gov or by visiting https://www.tempe.gov/government/strategic-management-and-innovation/signature-surveys-research-and-data. The individual survey questions, as well as the definition of the response scale (for example, 1 means “very dissatisfied” and 5 means “very satisfied”), are provided in the data dictionary.

    Additional Information
    Source: Community Attitude Survey
    Contact (author): Adam Samuels
    Contact E-Mail (author): Adam_Samuels@tempe.gov
    Contact (maintainer):
    Contact E-Mail (maintainer):
    Data Source Type: Excel table
    Preparation Method: Data received from vendor after report is completed
    Publish Frequency: Annual
    Publish Method: Manual
    Data Dictionary
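    To make the weighted versus unweighted distinction concrete, here is a minimal sketch in R. The column names q1 (a response item) and wt (a survey weight) are hypothetical placeholders, since the real field names are defined in the data dictionary.

        # Minimal sketch: unweighted vs. weighted estimate of a survey item.
        # Column names "q1" and "wt" are hypothetical placeholders.
        resp <- data.frame(
          q1 = c(5, 4, 2, 5, 3),            # satisfaction responses (1-5 scale)
          wt = c(1.2, 0.8, 1.5, 0.9, 1.1)   # weights adjusting sample to population
        )

        mean(resp$q1)                         # unweighted estimate
        weighted.mean(resp$q1, w = resp$wt)   # weighted estimate, as in the report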

  17. Data from: Estimated Historical Distribution of Grassland Communities of the Southern Great Plains

    • catalog.data.gov
    • data.usgs.gov
    • +4 more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Estimated Historical Distribution of Grassland Communities of the Southern Great Plains [Dataset]. https://catalog.data.gov/dataset/estimated-historical-distribution-of-grassland-communities-of-the-southern-great-plains
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The purpose of this project was to map the estimated distribution of grassland communities of the Southern Great Plains prior to Euro-American settlement. The Southern Great Plains Rapid Ecoregional Assessment (REA), under the direction of the Bureau of Land Management and the Great Plains Landscape Conservation Cooperative, includes four ecoregions: the High Plains, Central Great Plains, Southwestern Tablelands, and the Nebraska Sand Hills. The REA advisors and stakeholders determined that the mapping accuracy of available national land-cover maps was insufficient in many areas to adequately address management questions for the REA.

    Based on the recommendation of the REA stakeholders, we estimated the potential historical distribution of 10 grassland communities within the Southern Great Plains project area using data on soils, climate, and vegetation from the Natural Resources Conservation Service (NRCS), including the Soil Survey Geographic Database (SSURGO) and the Ecological Site Information System (ESIS). The dominant grassland communities of the Southern Great Plains addressed as conservation elements for the REA area are shortgrass, mixed-grass, and sand prairies. We also mapped tall-grass, mid-grass, northwest mixed-grass, and cool season bunchgrass prairies; saline and foothill grasslands; and semi-desert grassland and steppe. Grassland communities were primarily defined using the annual productivity of dominant species in the ESIS data. The historical grassland community classification was linked to the SSURGO data using vegetation types associated with the predominant component of mapped soil units as defined in the ESIS data. We augmented the NRCS data with Landscape Fire and Resource Management Planning Tools (LANDFIRE) Biophysical Settings classifications (1) where NRCS data were unavailable and (2) where fifth-level watersheds intersected the boundary of the High Plains ecoregion in Wyoming.

    Spatial data representing the estimated historical distribution of grassland communities of the Southern Great Plains are provided as a 30 x 30-meter gridded surface (raster dataset). This information will help to address the priority management questions for grassland communities for the Southern Great Plains REA and can be used to inform other regional-level land management decisions.
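    Since the distribution is delivered as a 30 x 30-meter raster, the following minimal sketch reads and summarizes it in R. The file name is hypothetical, and the terra package is simply one common choice for gridded data.

        # Minimal sketch: load the gridded surface and tabulate mapped classes.
        # The file name "sgp_grassland_communities.tif" is hypothetical.
        library(terra)

        grass <- rast("sgp_grassland_communities.tif")
        res(grass)    # expected cell size: 30 x 30 meters
        freq(grass)   # cell counts per grassland community class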


Simulation Data Set

Cite
U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set

How to use the information:
• Load the “Simulated_Dataset.RData” workspace.
• Run the code contained in “CWVS_LMC.txt”.
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”. (These steps are sketched as an R session at the end of this entry.)

Format: Below is the replication procedure for the attached data set, for the portion of the analyses using a simulated data set.

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining the confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, Oxford, UK, 1-30, (2019).
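Those steps, written out as a short R session (the file names, required packages, and workspace objects are the ones documented above; the working directory is assumed to contain the files):

    # Replication steps from the documentation above, as a short R session.
    # The .txt files contain R code, so source() runs them directly.
    # install.packages(c("msm", "mnormt", "BayesLogit", "plotrix"))  # if needed

    load("Simulated_Dataset.RData")    # provides y, x, z, n, m, p, alpha_true

    source("CWVS_LMC.txt")             # fit the CWVS-LMC model
    source("Results_Summary.txt")      # plot critical windows and inclusion probabilities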
