39 datasets found
  1. UC_vs_US Statistic Analysis.xlsx

    • figshare.com
    xlsx
    Updated Jul 9, 2020
    Cite
    F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
    Dataset updated
    Jul 9, 2020
    Dataset provided by
    Utrecht University
    Authors
    F. (Fabiano) Dalpiaz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes the following columns:
    A. a sequential student ID
    B. an ID that defines a random group label and the notation
    C. the notation used: user stories or use cases
    D. the case they were assigned to: IFA, Sim, or Hos
    E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
    F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is at least 65 and below 80, and L is anything lower (see the sketch after this list)
    G. the total number of classes in the student's conceptual model
    H. the total number of relationships in the student's conceptual model
    I. the total number of classes in the expert's conceptual model
    J. the total number of relationships in the expert's conceptual model
    K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
    P. the researchers' judgement of how well the student explained the derivation process: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present
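    A minimal sketch of the column F categorization, assuming a numeric grade out of 100 as in column E (illustrative only, not part of the dataset):

        def grade_category(points):
            # Column F: H if grade >= 80, M if 65 <= grade < 80, L otherwise
            if points >= 80:
                return "H"
            if points >= 65:
                return "M"
            return "L"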

    Tagging scheme:
    Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
    Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
    System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent a legacy system or the system under design (portal, simulator) are legitimate;
    Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;
    Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

    All the calculations and information provided in the following sheets originate from that raw data.

    Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

    Sheet 3 (Size-Ratio): The number of classes in the student model divided by the number of classes in the expert model is calculated (the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus of this study is on the number of classes; however, we also provide the size ratio for the number of relationships between the student and expert models.

    Sheet 4 (Overall): Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. It is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
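    A minimal sketch of these two ratios, using per-subject counts of the tagged situations as defined above (illustrative only):

        def correctness(al, wr, so, om):
            # AL / (AL + OM + SO + WR), as defined for Sheet 4
            return al / (al + om + so + wr)

        def completeness(al, wr, om):
            # (AL + WR) / (AL + WR + OM), as defined for Sheet 4
            return (al + wr) / (al + wr + om)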

    For Sheet 4, as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and moderating variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html (a reference sketch of the formula follows the sheet list below). The independent and moderating variables can be found as follows:

    Sheet 5 (By-Notation): Model correctness and model completeness are compared by notation - UC, US.

    Sheet 6 (By-Case): Model correctness and model completeness are compared by case - SIM, HOS, IFA.

    Sheet 7 (By-Process): Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.

    Sheet 8 (By-Grade): Model correctness and model completeness are compared by exam grade, converted to the categorical values High, Medium, and Low.
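    Since Hedges' g itself was computed with the external tool linked above, the following is only a reference sketch of the standard formula (pooled standard deviation with small-sample correction), not the authors' implementation:

        import numpy as np

        def hedges_g(x, y):
            # Standardized mean difference with Hedges' small-sample correction
            x, y = np.asarray(x, float), np.asarray(y, float)
            n1, n2 = len(x), len(y)
            s_pooled = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2))
            d = (x.mean() - y.mean()) / s_pooled      # Cohen's d
            j = 1 - 3 / (4 * (n1 + n2) - 9)           # correction factor
            return j * d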

  2. MATLAB code and output files for integral, mean and covariance of the...

    • researchdatafinder.qut.edu.au
    Updated Jul 25, 2022
    + more versions
    Cite
    Dr Matthew Adams (2022). MATLAB code and output files for integral, mean and covariance of the simplex-truncated multivariate normal distribution [Dataset]. https://researchdatafinder.qut.edu.au/display/n20044
    Dataset updated
    Jul 25, 2022
    Dataset provided by
    Queensland University of Technology (QUT)
    Authors
    Dr Matthew Adams
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compositional data, which is data consisting of fractions or probabilities, is common in many fields including ecology, economics, physical science and political science. If these data would otherwise be normally distributed, their spread can be conveniently represented by a multivariate normal distribution truncated to the non-negative space under a unit simplex. Here this distribution is called the simplex-truncated multivariate normal distribution. For calculations on truncated distributions, it is often useful to obtain rapid estimates of their integral, mean and covariance; these quantities characterising the truncated distribution will generally possess different values to the corresponding non-truncated distribution.

    In the paper Adams, Matthew (2022) Integral, mean and covariance of the simplex-truncated multivariate normal distribution. PLoS One, 17(7), Article number: e0272014. https://eprints.qut.edu.au/233964/, three different approaches that can estimate the integral, mean and covariance of any simplex-truncated multivariate normal distribution are described and compared. These three approaches are (1) naive rejection sampling, (2) a method described by Gessner et al. that unifies subset simulation and the Holmes-Diaconis-Ross algorithm with an analytical version of elliptical slice sampling, and (3) a semi-analytical method that expresses the integral, mean and covariance in terms of integrals of hyperrectangularly-truncated multivariate normal distributions, the latter of which are readily computed in modern mathematical and statistical packages. Strong agreement is demonstrated between all three approaches, but the most computationally efficient approach depends strongly both on implementation details and the dimension of the simplex-truncated multivariate normal distribution.
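    As an illustration of approach (1), a naive rejection-sampling sketch for estimating these quantities (example parameters are hypothetical; this is not the code in the dataset, and it becomes inefficient when the acceptance rate is small):

        import numpy as np

        def simplex_truncated_mvn_moments(mu, Sigma, n_samples=100_000, seed=0):
            # Draw from the untruncated MVN and keep draws inside the unit simplex
            # (all components non-negative, summing to at most one).
            rng = np.random.default_rng(seed)
            x = rng.multivariate_normal(mu, Sigma, size=n_samples)
            inside = (x >= 0).all(axis=1) & (x.sum(axis=1) <= 1)
            kept = x[inside]
            integral = inside.mean()              # acceptance rate estimates the integral
            mean = kept.mean(axis=0)              # mean of the truncated distribution
            cov = np.cov(kept, rowvar=False)      # covariance of the truncated distribution
            return integral, mean, cov

        # Example with hypothetical parameters:
        # I, m, C = simplex_truncated_mvn_moments([0.3, 0.3], [[0.05, 0.01], [0.01, 0.05]])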

    This dataset consists of all code and results for the associated article.

  3. Weather and Housing in North America

    • kaggle.com
    zip
    Updated Feb 13, 2023
    Cite
    The Devastator (2023). Weather and Housing in North America [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-and-housing-in-north-america
    Available download formats: zip (512,280 bytes)
    Dataset updated
    Feb 13, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    North America
    Description

    Weather and Housing in North America

    Exploring the Relationship between Weather and Housing Conditions in 2012

    By [source]

    About this dataset

    This comprehensive dataset explores the relationship between housing and weather conditions across North America in 2012. Through a range of climate variables such as temperature, wind speed, humidity, pressure, and visibility, it provides unique insights into the weather-influenced environment of numerous regions. The interrelated nature of housing parameters such as longitude, latitude, median income, median house value, and ocean proximity further enhances our understanding of how distinct climates play an integral part in area real estate valuations. Analyzing these two data sets offers a wealth of knowledge about the factors that can dictate the value and comfort level offered by residential areas throughout North America.

    How to use the dataset

    This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.

    First, start by examining descriptive statistics (i.e., mean, median, mode). This can help show you the general trend and distribution of each variable in this dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? By looking at descriptive statistics, you can get an initial idea of how various weather conditions and housing attributes interact with one another.

    Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.

    Finally, use visualizations to further investigate the relationship between climate and housing attributes in North America in 2012. Graphs let you visualize trends such as seasonal variations or long-term changes over time more easily, so they are useful for interpreting large amounts of data quickly while providing context beyond what the numbers alone can tell us about relationships between different aspects of this dataset.
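    A minimal starting point for the descriptive-statistics and correlation steps above (pandas shown as one option; the weather column names follow the Columns table below, and the file must be downloaded locally first):

        import pandas as pd

        weather = pd.read_csv("Weather.csv")

        # Descriptive statistics for a few weather variables
        print(weather[["Temp_C", "Wind Speed_km/h", "Rel Hum_%"]].describe())

        # Pairwise correlations between all numeric columns
        print(weather.select_dtypes("number").corr())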

    Research Ideas

    • Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.
    • Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.
    • Utilizing differences in weather patterns across different climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure, and visibility from season to season, an investor could gain valuable insights into seasonal market trends to maximize their profits from rentals or Airbnb listings over time.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: Weather.csv

    | Column name | Description |
    |:---------------------|:-----------------------------------------------|
    | Date/Time | Date and time of the observation. (Date/Time) |
    | Temp_C | Temperature in Celsius. (Numeric) |
    | Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) |
    | Rel Hum_% | Relative humidity in percent. (Numeric) |
    | Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) |
    | Visibility_km | Visibilit... |

  4. Data from: A New Bayesian Approach to Increase Measurement Accuracy Using a...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 25, 2025
    Cite
    Domjan, Peter; Angyal, Viola; Bertalan, Adam; Vingender, Istvan; Dinya, Elek (2025). A New Bayesian Approach to Increase Measurement Accuracy Using a Precision Entropy Indicator [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14417120
    Dataset updated
    Feb 25, 2025
    Dataset provided by
    Semmelweis University
    Authors
    Domjan, Peter; Angyal, Viola; Bertalan, Adam; Vingender, Istvan; Dinya, Elek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "We believe that by accounting for the inherent uncertainty in the system during each measurement, the relationship between cause and effect can be assessed more accurately, potentially reducing the duration of research."

    Short description

    This dataset was created as part of a research project investigating the efficiency and learning mechanisms of a Bayesian adaptive search algorithm supported by the Imprecision Entropy Indicator (IEI) as a novel method. It includes detailed statistical results, posterior probability values, and the weighted averages of IEI across multiple simulations aimed at target localization within a defined spatial environment. Control experiments, including random search, random walk, and genetic algorithm-based approaches, were also performed to benchmark the system's performance and validate its reliability.

    The task involved locating a target area centered at (100; 100) within a radius of 10 units (Research_area.png), inside a circular search space with a radius of 100 units. The search process continued until 1,000 successful target hits were achieved.

    To benchmark the algorithm's performance and validate its reliability, control experiments were conducted using alternative search strategies, including random search, random walk, and genetic algorithm-based approaches. These control datasets serve as baselines, enabling comprehensive comparisons of efficiency, randomness, and convergence behavior across search methods, thereby demonstrating the effectiveness of our novel approach.

    Uploaded files

    The first dataset contains the average IEI values, generated by randomly simulating 300 x 1 hits for 10 bins per quadrant (4 quadrants in total) using the Python programming language, and calculating the corresponding IEI values. This resulted in a total of 4 x 10 x 300 x 1 = 12,000 data points. The summary of the IEI values by quadrant and bin is provided in the file results_1_300.csv. The calculation of IEI values for averages is based on likelihood, using an absolute difference-based approach for the likelihood probability computation. IEI_Likelihood_Based_Data.zip

    The weighted IEI average values for likelihood calculation (Bayes formula) are provided in the file Weighted_IEI_Average_08_01_2025.xlsx

    This dataset contains the results of a simulated target search experiment using Bayesian posterior updates and Imprecision Entropy Indicators (IEI). Each row represents a hit during the search process, including metrics such as Shannon entropy (H), the Gini index (G), average distance, angular deviation, and calculated IEI values. The dataset also includes bin-specific posterior probability updates and likelihood calculations for each iteration. The simulation explores adaptive learning and posterior penalization strategies to optimize search efficiency. Our Bayesian adaptive searching system source code (search algorithm, 1,000 target searches) is IEI_Self_Learning_08_01_2025.py. This dataset contains the results of 1,000 iterations of a successful target search simulation; the simulation runs until the target is successfully located in each iteration. The dataset includes three further main outputs: a) Results files (results{iteration_number}.csv): details of each hit during the search process, including entropy measures, the Gini index, average distance and angle, Imprecision Entropy Indicators (IEI), coordinates, and the bin number of the hit. b) Posterior updates (Pbin_all_steps_{iter_number}.csv): tracks the posterior probability updates for all bins during the search process across multiple steps. c) Likelihood analysis (likelihood_analysis_{iteration_number}.csv): contains the calculated likelihood values for each bin at every step, based on the difference between the measured IEI and the pre-defined IEI bin averages. IEI_Self_Learning_08_01_2025.py

    Based on the mentioned Python source code (see point 3, the Bayesian adaptive searching method with IEI values), we performed 1,000 successful target searches, and the outputs were saved in the Self_learning_model_test_output.zip file.

    Bayesian search (IEI) from different quadrants. This dataset contains the results of Bayesian adaptive target search simulations, including various outputs that represent the performance and analysis of the search algorithm. The dataset includes: a) Heatmaps (Heatmap_I_Quadrant, Heatmap_II_Quadrant, Heatmap_III_Quadrant, Heatmap_IV_Quadrant): these heatmaps represent the search results and the paths taken from each quadrant during the simulations; they indicate how frequently the system selected each bin during the search process. b) Posterior distributions (All_posteriors, Probability_distribution_posteriors_values, CDF_posteriors_values): generated from posterior values, these files track the posterior probability updates, including cumulative distribution functions (CDF) and probability distributions. c) Macro summary (summary_csv_macro): this file aggregates metrics and key statistics from the simulation; it summarizes the results from the individual results.csv files. d) Heatmap searching method documentation (Bayesian_Heatmap_Searching_Method_05_12_2024): this document visualizes the search algorithm's path, showing how frequently each bin was selected during the 1,000 successful target searches. e) One-way ANOVA analysis (Anova_analyze_dataset, One_way_Anova_analysis_results): this includes the database and SPSS calculations used to examine whether the starting quadrant influences the number of search steps required. The analysis was conducted at a 5% significance level, followed by a Games-Howell post hoc test [43] to identify which target-surrounding quadrants differed significantly in terms of the number of search steps. Results were saved in Self_learning_model_test_results.zip.

    This dataset contains randomly generated sequences of bin selections (1-40) from a control search algorithm (random search) used to benchmark the performance of Bayesian-based methods. The process iteratively generates random numbers until a stopping condition is met (reaching target bins 1, 11, 21, or 31). This dataset serves as a baseline for analyzing the efficiency, randomness, and convergence of non-adaptive search strategies. The dataset includes the following: a) The Python source code of the random search algorithm. b) A file (summary_random_search.csv) containing the results of 1000 successful target hits. c) A heatmap visualizing the frequency of search steps for each bin, providing insight into the distribution of steps across the bins. Random_search.zip
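    A minimal sketch of such a random-search control (bin labels and stopping bins as described above; not the dataset's own script):

        import random

        TARGET_BINS = {1, 11, 21, 31}

        def random_search(seed=None):
            # Draw bins 1-40 uniformly at random until a target bin is hit;
            # return the full sequence of visited bins.
            rng = random.Random(seed)
            steps = []
            while True:
                b = rng.randint(1, 40)
                steps.append(b)
                if b in TARGET_BINS:
                    return steps

        # e.g. the lengths of 1,000 successful searches:
        # lengths = [len(random_search(s)) for s in range(1000)]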

    This dataset contains the results of a random walk search algorithm, designed as a control mechanism to benchmark adaptive search strategies (Bayesian-based methods). The random walk operates within a defined space of 40 bins, where each bin has a set of neighboring bins. The search begins from a randomly chosen starting bin and proceeds iteratively, moving to a randomly selected neighboring bin, until one of the stopping conditions is met (bins 1, 11, 21, or 31). The dataset provides detailed records of 1,000 random walk iterations, with the following key components: a) Individual Iteration Results: Each iteration's search path is saved in a separate CSV file (random_walk_results_.csv), listing the sequence of steps taken and the corresponding bin at each step. b) Summary File: A combined summary of all iterations is available in random_walk_results_summary.csv, which aggregates the step-by-step data for all 1,000 random walks. c) Heatmap Visualization: A heatmap file is included to illustrate the frequency distribution of steps across bins, highlighting the relative visit frequencies of each bin during the random walks. d) Python Source Code: The Python script used to generate the random walk dataset is provided, allowing reproducibility and customization for further experiments. Random_walk.zip

    This dataset contains the results of a genetic search algorithm implemented as a control method to benchmark adaptive Bayesian-based search strategies. The algorithm operates in a 40-bin search space with predefined target bins (1, 11, 21, 31) and evolves solutions through random initialization, selection, crossover, and mutation over 1,000 successful runs. Dataset components: a) Run results: individual run data is stored in separate files (genetic_algorithm_run_.csv), detailing: Generation: the generation number. Fitness: the fitness score of the solution. Steps: the path length in bins. Solution: the sequence of bins visited. b) Summary file: summary.csv consolidates the best solutions from all runs, including their fitness scores, path lengths, and sequences. c) All steps file: summary_all_steps.csv records all bins visited during the runs for distribution analysis. d) A heatmap was also generated for the genetic search algorithm, illustrating the frequency of bins chosen during the search process as a representation of the search pathways. Genetic_search_algorithm.zip

    Technical Information

    The dataset files have been compressed into a standard ZIP archive using Total Commander (version 9.50). The ZIP format ensures compatibility across various operating systems and tools.

    The XLSX files were created using Microsoft Excel Standard 2019 (Version 1808, Build 10416.20027)

    The Python program was developed using Visual Studio Code (Version 1.96.2, user setup), with the following environment details: Commit fabd6a6b30b49f79a7aba0f2ad9df9b399473380f, built on 2024-12-19. The Electron version is 32.6, and the runtime environment includes Chromium 128.0.6263.186, Node.js 20.18.1, and V8 12.8.374.38-electron.0. The operating system is Windows NT x64 10.0.19045.

    The statistical analysis included in this dataset was partially conducted using IBM SPSS Statistics, Version 29.0.1.0

    The CSV files in this dataset were created following European standards, using a semicolon (;) as the delimiter instead of a comma, encoded in UTF-8 to ensure compatibility with a wide
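    Files following this convention can be read, for example, with pandas, using the delimiter and encoding stated above (the file name is taken from the list of uploaded files; adjust as needed):

        import pandas as pd

        # Semicolon-delimited, UTF-8 encoded CSV, as described above
        df = pd.read_csv("results_1_300.csv", sep=";", encoding="utf-8")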

  5. Data from: Data files used to study change dynamics in software systems

    • figshare.swinburne.edu.au
    pdf
    Updated Jul 22, 2024
    Cite
    Rajesh Vasa (2024). Data files used to study change dynamics in software systems [Dataset]. http://doi.org/10.25916/sut.26288227.v1
    Dataset updated
    Jul 22, 2024
    Dataset provided by
    Swinburne
    Authors
    Rajesh Vasa
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    It is a widely accepted fact that evolving software systems change and grow. However, it is less well-understood how change is distributed over time, specifically in object oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place as well as to inform them of the longer term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as to provide useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. But in order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes.

    Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes and classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution prone parts of a system as well as support effort estimation activities.

    The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general? (2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone?

    We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55,000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2MB in total) is provided as a comma separated values (CSV) file, and the first line of the CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).

  6. Grid Transformer Power Flow Historic Monthly

    • ukpowernetworks.opendatasoft.com
    Updated Oct 28, 2025
    + more versions
    Cite
    (2025). Grid Transformer Power Flow Historic Monthly [Dataset]. https://ukpowernetworks.opendatasoft.com/explore/dataset/ukpn-grid-transformer-operational-data-monthly/
    Dataset updated
    Oct 28, 2025
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    UK Power Networks maintains the 132kV voltage level network and below. An important part of the distribution network is the stepping down of voltage as it is moved towards the household; this is achieved using transformers. Transformers have a maximum rating for the utilisation of these assets based upon protection, overcurrent, switchgear, etc. This dataset contains the Grid Substation Transformers, also known as Bulk Supply Points, that typically step down voltage from 132kV to 33kV (occasionally down to 66kV or, more rarely, 20-25kV). These transformers can be viewed on the single line diagrams in our Long-Term Development Statements (LTDS), and the underlying data is then found in the LTDS tables. Care is taken to protect the private affairs of companies connected to the 33kV network, resulting in the redaction of certain transformers. Where redacted, we provide monthly statistics to continue to add value where possible. Where monthly statistics exist but half-hourly data is absent, the half-hourly data has been redacted. This dataset provides monthly statistics across these named transformers from 2021 through to the previous month across our licence areas. The data are aligned with the same naming convention as the LTDS for improved interoperability. To find half-hourly current and power flow data for a transformer, use the 'tx_id' that can be cross-referenced in the Grid Transformers Half Hourly dataset. If you want to download all this data, it is perhaps more convenient from our public SharePoint: Open Data Portal Library - Grid Transformers - All Documents (sharepoint.com). This dataset is part of a larger endeavour to share more operational data on UK Power Networks assets. Please visit our Network Operational Data Dashboard for more operational datasets.

    Methodological Approach

    The dataset is not derived; it consists of measurements from our network stored in our historian. The measurements are taken from current transformers attached to the cable at the circuit breaker, and power is derived by combining this with the data from voltage transformers physically attached to the busbar. The historian stores data based on a report-by-exception process, such that a certain deviation from the present value must be reached before a point measurement is logged to the historian. We extract the data following a 30-minute time-weighted averaging method to get half-hourly values (see the sketch at the end of this description). Where there are no measurements logged in the period, the data provided is blank; due to the report-by-exception process, it may be appropriate to forward fill this data for shorter gaps. We developed a data redaction process to protect the privacy of companies according to the Utilities Act 2000 section 105.1.b, which requires UK Power Networks not to disclose information relating to the affairs of a business. For this reason, where the demand of a private customer is derivable from our data and that data is not already public information (e.g., data provided via Elexon on the Balancing Mechanism), we redact the half-hourly time series and provide only the monthly averages. This redaction process considers the correlation of all the data, of only the corresponding periods where the customer is active, of the first order difference of all the data, and of the first order difference of only the corresponding periods where the customer is active. Should any of these four tests show a high linear correlation, the data is deemed redacted. This process is not applied only to the circuit of the customer, but also to the surrounding circuits that would reveal the signal of that customer. The directionality of the data is not consistent within this dataset. Where directionality was ascertainable, we arrange the power data in the direction of the LTDS "from node" to the LTDS "to node". Measurements of current do not indicate directionality and are instead positive regardless of direction. In some circumstances, the polarity can be negative, and it depends on the data commissioner's decision on what the operators in the control room might find most helpful in ensuring reliable and secure network operation.

    Quality Control Statement

    The data is provided "as is". In the design and delivery process adopted by the DSO, customer feedback and guidance is considered at each phase of the project. One of the earliest steers was that raw data was preferable. This means that we do not perform prior quality control screening of our raw network data. The result of this decision is that network rearrangements and other periods of non-intact running of the network are present throughout the dataset, which has the potential to misconstrue the true utilisation of the network, which is determined regulatorily by considering only intact running arrangements. Therefore, taking the maximum or minimum of these transformers is not a reliable method of correctly ascertaining the true utilisation. This does have the intended added benefit of giving a realistic view of how the network was operated. The critical feedback was that our customers have a desire to understand what the impact to them would have been under real operational conditions. As such, this dataset offers unique insight into that.

    Assurance Statement

    Creating this dataset involved a lot of manual data input. At UK Power Networks, we have different software to run the network operationally (ADMS) and to plan and study the network (PowerFactory). The measurement devices are intended primarily to inform the network operators of the real-time condition of the network, and, importantly, the network drawings visible in the LTDS follow a planning approach, which differs from the operational one. To compile this dataset, we made the union between the two modes of operating manually. A team of data scientists, data engineers, and power system engineers manually identified the LTDS transformer from the single line diagram, identified the line name from LTDS Table 2a/b, then identified the same transformer in ADMS to identify the measurement data tags. This was then manually inputted into a spreadsheet. Any influential customers on that circuit were noted using ADMS and the single line diagrams. From there, Python code is used to perform the triage and compilation of the datasets. There is potential for human error during the manual data processing. These issues can include missing transformers, incorrectly labelled transformers, incorrectly identified measurement data tags, and incorrectly interpreted directionality. Whilst care has been taken to minimise the risk of these issues, they may persist in the provided dataset. Any uncertain behaviour observed while using this data should be reported to allow us to correct it as fast as possible.

    Additional Information

    Definitions of key terms related to this dataset can be found in the Open Data Portal Glossary. Download dataset information: Metadata (JSON). We would be grateful if you find this dataset useful to submit a "reuse" case study to tell us what you did and how you used it. This enables us to drive our direction and gain a better understanding of how to improve our data offering in the future. Click here for more information: Open Data Portal Reuses — UK Power Networks.
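    As an illustration of the 30-minute time-weighted averaging and forward filling described above (not UK Power Networks' extraction code; the timestamps and values are made up):

        import pandas as pd

        # Report-by-exception measurements arrive at irregular times: forward-fill the
        # last logged value on a fine grid, then average over 30-minute windows.
        ts = pd.Series(
            [10.0, 12.5, 11.0],
            index=pd.to_datetime(["2021-01-01 00:03", "2021-01-01 00:17", "2021-01-01 00:41"]),
        )
        half_hourly = ts.resample("1min").ffill().resample("30min").mean()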

  7. Demographic and Health Survey 1998 - Ghana

    • microdata.worldbank.org
    • catalog.ihsn.org
    • +1more
    Updated Jun 6, 2017
    Cite
    Ghana Statistical Service (GSS) (2017). Demographic and Health Survey 1998 - Ghana [Dataset]. https://microdata.worldbank.org/index.php/catalog/1385
    Dataset updated
    Jun 6, 2017
    Dataset provided by
    Ghana Statistical Services
    Authors
    Ghana Statistical Service (GSS)
    Time period covered
    1998 - 1999
    Area covered
    Ghana
    Description

    Abstract

    The 1998 Ghana Demographic and Health Survey (GDHS) is the latest in a series of national-level population and health surveys conducted in Ghana and it is part of the worldwide MEASURE DHS+ Project, designed to collect data on fertility, family planning, and maternal and child health.

    The primary objective of the 1998 GDHS is to provide current and reliable data on fertility and family planning behaviour, child mortality, children’s nutritional status, and the utilisation of maternal and child health services in Ghana. Additional data on knowledge of HIV/AIDS are also provided. This information is essential for informed policy decisions, planning and monitoring and evaluation of programmes at both the national and local government levels.

    The long-term objectives of the survey include strengthening the technical capacity of the Ghana Statistical Service (GSS) to plan, conduct, process, and analyse the results of complex national sample surveys. Moreover, the 1998 GDHS provides comparable data for long-term trend analyses within Ghana, since it is the third in a series of demographic and health surveys implemented by the same organisation, using similar data collection procedures. The GDHS also contributes to the ever-growing international database on demographic and health-related variables.

    Geographic coverage

    National

    Analysis unit

    • Household
    • Children under five years
    • Women age 15-49
    • Men age 15-59

    Kind of data

    Sample survey data

    Sampling procedure

    The major focus of the 1998 GDHS was to provide updated estimates of important population and health indicators including fertility and mortality rates for the country as a whole and for urban and rural areas separately. In addition, the sample was designed to provide estimates of key variables for the ten regions in the country.

    The list of Enumeration Areas (EAs) with population and household information from the 1984 Population Census was used as the sampling frame for the survey. The 1998 GDHS is based on a two-stage stratified nationally representative sample of households. At the first stage of sampling, 400 EAs were selected using systematic sampling with probability proportional to size (PPS-Method). The selected EAs comprised 138 in the urban areas and 262 in the rural areas. A complete household listing operation was then carried out in all the selected EAs to provide a sampling frame for the second stage selection of households. At the second stage of sampling, a systematic sample of 15 households per EA was selected in all regions, except in the Northern, Upper West and Upper East Regions. In order to obtain adequate numbers of households to provide reliable estimates of key demographic and health variables in these three regions, the number of households in each selected EA in the Northern, Upper West and Upper East regions was increased to 20. The sample was weighted to adjust for over sampling in the three northern regions (Northern, Upper East and Upper West), in relation to the other regions. Sample weights were used to compensate for the unequal probability of selection between geographically defined strata.
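    For reference, a minimal sketch of the first-stage selection method named above (systematic sampling with probability proportional to size); this is a generic illustration, not the GSS sampling program:

        import numpy as np

        def pps_systematic_sample(sizes, n_select, seed=0):
            # Select n_select units with probability proportional to size by walking a
            # random-start, fixed-step grid along the cumulative size scale.
            # Assumes no single unit is larger than the sampling step.
            rng = np.random.default_rng(seed)
            cum = np.cumsum(np.asarray(sizes, dtype=float))
            step = cum[-1] / n_select
            points = rng.uniform(0, step) + step * np.arange(n_select)
            return np.searchsorted(cum, points)   # indices of the selected units (e.g. EAs)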

    The survey was designed to obtain completed interviews of 4,500 women age 15-49. In addition, all males age 15-59 in every third selected household were interviewed, to obtain a target of 1,500 men. In order to take cognisance of non-response, a total of 6,375 households nation-wide were selected.

    Note: See detailed description of sample design in APPENDIX A of the survey report.

    Mode of data collection

    Face-to-face

    Research instrument

    Three types of questionnaires were used in the GDHS: the Household Questionnaire, the Women’s Questionnaire, and the Men’s Questionnaire. These questionnaires were based on model survey instruments developed for the international MEASURE DHS+ programme and were designed to provide information needed by health and family planning programme managers and policy makers. The questionnaires were adapted to the situation in Ghana and a number of questions pertaining to on-going health and family planning programmes were added. These questionnaires were developed in English and translated into five major local languages (Akan, Ga, Ewe, Hausa, and Dagbani).

    The Household Questionnaire was used to enumerate all usual members and visitors in a selected household and to collect information on the socio-economic status of the household. The first part of the Household Questionnaire collected information on the relationship to the household head, residence, sex, age, marital status, and education of each usual resident or visitor. This information was used to identify women and men who were eligible for the individual interview. For this purpose, all women age 15-49, and all men age 15-59 in every third household, whether usual residents of a selected household or visitors who slept in a selected household the night before the interview, were deemed eligible and interviewed. The Household Questionnaire also provides basic demographic data for Ghanaian households. The second part of the Household Questionnaire contained questions on the dwelling unit, such as the number of rooms, the flooring material, the source of water and the type of toilet facilities, and on the ownership of a variety of consumer goods.

    The Women’s Questionnaire was used to collect information on the following topics: respondent’s background characteristics, reproductive history, contraceptive knowledge and use, antenatal, delivery and postnatal care, infant feeding practices, child immunisation and health, marriage, fertility preferences and attitudes about family planning, husband’s background characteristics, women’s work, knowledge of HIV/AIDS and STDs, as well as anthropometric measurements of children and mothers.

    The Men’s Questionnaire collected information on respondent’s background characteristics, reproduction, contraceptive knowledge and use, marriage, fertility preferences and attitudes about family planning, as well as knowledge of HIV/AIDS and STDs.

    Response rate

    A total of 6,375 households were selected for the GDHS sample. Of these, 6,055 were occupied. Interviews were completed for 6,003 households, which represent 99 percent of the occupied households. A total of 4,970 eligible women from these households and 1,596 eligible men from every third household were identified for the individual interviews. Interviews were successfully completed for 4,843 women or 97 percent and 1,546 men or 97 percent. The principal reason for nonresponse among individual women and men was the failure of interviewers to find them at home despite repeated callbacks.

    Note: See summarized response rates by place of residence in Table 1.1 of the survey report.

    Sampling error estimates

    The estimates from a sample survey are affected by two types of errors: (1) nonsampling errors, and (2) sampling errors. Nonsampling errors are the results of shortfalls made in implementing data collection and data processing, such as failure to locate and interview the correct household, misunderstanding of the questions on the part of either the interviewer or the respondent, and data entry errors. Although numerous efforts were made during the implementation of the 1998 GDHS to minimize this type of error, nonsampling errors are impossible to avoid and difficult to evaluate statistically.

    Sampling errors, on the other hand, can be evaluated statistically. The sample of respondents selected in the 1998 GDHS is only one of many samples that could have been selected from the same population, using the same design and expected size. Each of these samples would yield results that differ somewhat from the results of the actual sample selected. Sampling errors are a measure of the variability between all possible samples. Although the degree of variability is not known exactly, it can be estimated from the survey results.

    A sampling error is usually measured in terms of the standard error for a particular statistic (mean, percentage, etc.), which is the square root of the variance. The standard error can be used to calculate confidence intervals within which the true value for the population can reasonably be assumed to fall. For example, for any given statistic calculated from a sample survey, the value of that statistic will fall within a range of plus or minus two times the standard error of that statistic in 95 percent of all possible samples of identical size and design.

    If the sample of respondents had been selected as a simple random sample, it would have been possible to use straightforward formulas for calculating sampling errors. However, the 1998 GDHS sample is the result of a two-stage stratified design, and, consequently, it was necessary to use more complex formulae. The computer software used to calculate sampling errors for the 1998 GDHS is the ISSA Sampling Error Module. This module uses the Taylor linearization method of variance estimation for survey estimates that are means or proportions. The Jackknife repeated replication method is used for variance estimation of more complex statistics such as fertility and mortality rates.

    Data appraisal

    Data Quality Tables - Household age distribution - Age distribution of eligible and interviewed women - Age distribution of eligible and interviewed men - Completeness of reporting - Births by calendar years - Reporting of age at death in days - Reporting of age at death in months

    Note: See detailed tables in APPENDIX C of the survey report.

  8. Gini_Index - El Salvador

    • macro-rankings.com
    csv, excel
    Cite
    macro-rankings, Gini_Index - El Salvador [Dataset]. https://www.macro-rankings.com/selected-country-rankings/gini-index/el-salvador
    Dataset authored and provided by
    macro-rankings
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    El Salvador
    Description

    Time series data for the statistic Gini_Index and the country El Salvador.

    Indicator definition: The Gini index measures the extent to which the distribution of income (or, in some cases, consumption expenditure) among individuals or households within an economy deviates from a perfectly equal distribution. A Lorenz curve plots the cumulative percentages of total income received against the cumulative number of recipients, starting with the poorest individual or household. The Gini index measures the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a percentage of the maximum area under the line. Thus a Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality.

    The statistic "Gini Index" stands at 39.80 percent as of 12/31/2023, the highest value since 12/31/2017. Regarding the one-year change of the series, the current value constitutes an increase of 1.00 percentage points compared to the value the year prior. The 1-year change in percentage points is 1.00. The 5-year change in percentage points is 1.20. The 10-year change in percentage points is -3.60. The series' long-term average value is 45.55 percent; its latest available value, on 12/31/2023, is 5.75 percentage points lower than its long-term average. The series' change in percentage points from its minimum value, on 12/31/2017, to its latest available value, on 12/31/2023, is +1.80. The series' change in percentage points from its maximum value, on 12/31/1998, to its latest available value, on 12/31/2023, is -14.60.
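    A minimal sketch of how a Gini index can be computed from a list of individual incomes, following the Lorenz-curve definition above (illustrative only; not the source's survey-based estimation):

        import numpy as np

        def gini_index(incomes):
            # Gini coefficient expressed as a percentage
            # (0 = perfect equality, 100 = perfect inequality).
            x = np.sort(np.asarray(incomes, dtype=float))
            n = x.size
            ranks = np.arange(1, n + 1)
            g = (2 * np.sum(ranks * x) / (n * x.sum())) - (n + 1) / n
            return 100 * g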

  9. Gini_Index - Sweden

    • macro-rankings.com
    csv, excel
    Updated Dec 31, 2023
    Cite
    macro-rankings (2023). Gini_Index - Sweden [Dataset]. https://www.macro-rankings.com/Selected-Country-Rankings/Gini-Index/Sweden
    Dataset updated
    Dec 31, 2023
    Dataset authored and provided by
    macro-rankings
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    Time series data for the statistic Gini_Index and the country Sweden.

    Indicator definition: The Gini index measures the extent to which the distribution of income (or, in some cases, consumption expenditure) among individuals or households within an economy deviates from a perfectly equal distribution. A Lorenz curve plots the cumulative percentages of total income received against the cumulative number of recipients, starting with the poorest individual or household. The Gini index measures the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a percentage of the maximum area under the line. Thus a Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality.

    The statistic "Gini Index" stands at 29.30 percent as of 12/31/2023. Regarding the one-year change of the series, the current value constitutes a decrease of 2.30 percentage points compared to the value the year prior. The 1-year change in percentage points is -2.30. The 3-year change in percentage points is 0.40. The 5-year change in percentage points is -0.70. The 10-year change in percentage points is 0.50. The series' long-term average value is 27.37 percent; its latest available value, on 12/31/2023, is 1.93 percentage points higher than its long-term average. The series' change in percentage points from its minimum value, on 12/31/1981, to its latest available value, on 12/31/2023, is +6.40. The series' change in percentage points from its maximum value, on 12/31/2022, to its latest available value, on 12/31/2023, is -2.30.

  10. Data from: Probability waves: adaptive cluster-based correction by...

    • data.mendeley.com
    • narcis.nl
    Updated Feb 8, 2021
    Cite
    DIMITRI ABRAMOV (2021). Probability waves: adaptive cluster-based correction by convolution of p-value series from mass univariate analysis [Dataset]. http://doi.org/10.17632/rrm4rkr3xn.1
    Dataset updated
    Feb 8, 2021
    Authors
    DIMITRI ABRAMOV
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset and Octave/MATLAB code/scripts for data analysis.

    Background: Methods for p-value correction are criticized for either increasing Type II error or improperly reducing Type I error. This problem is worse when dealing with thousands or even hundreds of paired comparisons between waves or images which are performed point-to-point. This text considers patterns in probability vectors resulting from multiple point-to-point comparisons between two event-related potential (ERP) waves (mass univariate analysis) to correct p-values, where clusters of significant p-values may indicate true H0 rejection.

    New method: We used ERP data from normal subjects and subjects with attention deficit hyperactivity disorder (ADHD) under a cued forced two-choice test to study attention. The decimal logarithm of the p-vector (p') was convolved with a Gaussian window whose length was set as the shortest lag above which the autocorrelation of each ERP wave may be assumed to have vanished. To verify the reliability of the present correction method, we performed Monte Carlo (MC) simulations to (1) evaluate confidence intervals of rejected and non-rejected areas of our data, (2) evaluate differences between corrected and uncorrected p-vectors or simulated ones in terms of the distribution of significant p-values, and (3) empirically verify the rate of Type I error (comparing 10,000 pairs of mixed samples with control and ADHD subjects).

    Results: The present method reduced the range of p'-values that did not show covariance with neighbors (Type I and also Type II errors). The differences between the simulated or raw p-vector and the corrected p-vectors were, respectively, minimal and maximal for a window length set by the autocorrelation in the p-vector convolution.

    Comparison with existing methods: Our method was less conservative, while FDR methods rejected basically all significant p-values for the Pz and O2 channels. The MC simulations, the gold-standard method for error correction, presented a 2.78±4.83% difference (all 20 channels) from the p-vector after correction, while the difference between the raw and corrected p-vector was 5.96±5.00% (p = 0.0003).

    Conclusion: As a cluster-based correction, the present new method seems to be biologically and statistically suitable to correct p-values in mass univariate analysis of ERP waves, adopting adaptive parameters to set the correction.
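    A minimal sketch of the central smoothing step described above (convolving the log10 p-vector with a Gaussian window); the window shape and normalization here are assumptions, not the authors' exact Octave/MATLAB code:

        import numpy as np

        def smooth_log_p(p_values, window_len):
            # log10-transform the p-vector, then convolve it with a unit-area Gaussian
            # window whose length is set from the autocorrelation lag of the ERP wave.
            p_log = np.log10(np.asarray(p_values, dtype=float))
            t = np.linspace(-2.5, 2.5, window_len)
            w = np.exp(-0.5 * t**2)
            w /= w.sum()
            return np.convolve(p_log, w, mode="same")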

  11. Dataset for: Quantifying how diagnostic test accuracy depends on threshold...

    • search.datacite.org
    • wiley.figshare.com
    Updated Jul 31, 2019
    Cite
    Hayley Elizabeth Jones; Constantine Gatsonis; Thomas A Trikalinos; Nicky J Welton; Tony Ades (2019). Dataset for: Quantifying how diagnostic test accuracy depends on threshold in a meta-analysis [Dataset]. http://doi.org/10.6084/m9.figshare.8267015
    Dataset updated
    Jul 31, 2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Wiley
    Authors
    Hayley Elizabeth Jones; Constantine Gatsonis; Thomas A Trikalinos; Nicky J Welton; Tony Ades
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Tests for disease often produce a continuous measure, such as the concentration of some biomarker in a blood sample. In clinical practice, a threshold C is selected such that results, say, greater than C are declared positive, and those less than C negative. Measures of test accuracy such as sensitivity and specificity depend crucially on C, and the optimal value of this threshold is usually a key question for clinical practice. Standard methods for meta-analysis of test accuracy (i) do not provide summary estimates of accuracy at each threshold, precluding selection of the optimal threshold, and further (ii) do not make use of all available data. We describe a multinomial meta-analysis model that can take any number of pairs of sensitivity and specificity from each study and explicitly quantifies how accuracy depends on C. Our model assumes that some pre-specified or Box-Cox transformation of test results in the diseased and disease-free populations has a logistic distribution. The Box-Cox transformation parameter can be estimated from the data, allowing for a flexible range of underlying distributions. We parameterise in terms of the means and scale parameters of the two logistic distributions. In addition to credible intervals for the pooled sensitivity and specificity across all thresholds, we produce prediction intervals, allowing for between-study heterogeneity in all parameters. We demonstrate the model using two case study meta-analyses, examining the accuracy of tests for acute heart failure and pre-eclampsia. We show how the model can be extended to explore reasons for heterogeneity using study-level covariates.
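    For reference, the Box-Cox transformation mentioned above has a standard form; a minimal sketch (generic, not the paper's estimation code):

        import math

        def box_cox(y, lam):
            # Box-Cox transform of a positive test result y:
            # (y**lam - 1) / lam for lam != 0, and log(y) when lam == 0.
            return math.log(y) if lam == 0 else (y**lam - 1) / lam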

  12. Reconfigured data from Table 2 in terms of pseudo-responses (), the residual...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    George J. Besseris (2023). Reconfigured data from Table 2 in terms of pseudo-responses (), the residual response (), and their associated rank values (,). [Dataset]. http://doi.org/10.1371/journal.pone.0073275.t004
    Explore at:
    xls
    Available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS: http://plos.org/
    Authors
    George J. Besseris
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reconfigured data from Table 2 in terms of pseudo-responses (), the residual response (), and their associated rank values (,).

  13. Sample of Yidu-N7K data set

    • scidb.cn
    Updated Aug 31, 2021
    Cite
    Zengtao Jiao (2021). Sample of Yidu-N7K data set [Dataset]. http://doi.org/10.11922/sciencedb.j00104.00095
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 31, 2021
    Dataset provided by
    Science Data Bank
    Authors
    Zengtao Jiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    [Instructions for use] 1. This data set was manually curated by Yidu Cloud Medicine to match the distribution of real medical records; 2. This dataset is a sample of the Yidu-N7K dataset on OpenKG. The Yidu-N7K dataset may only be used for academic research in natural language processing, not for commercial purposes.

    The Yidu-N4K data set is derived from CHIP 2019 evaluation task 1, the "clinical terminology standardization task". The standardization of clinical terms is an indispensable task in medical statistics. Clinically, there are often hundreds of different ways to write the same diagnosis, operation, medicine, examination, test, or symptom. The problem to be solved in standardization (normalization) is to find the corresponding standard statement for the various clinical statements. Once terminology has been standardized, researchers can carry out subsequent statistical analysis of electronic medical records (EMRs). In essence, the task of clinical terminology standardization is a kind of semantic similarity matching task. However, because of the diversity of the original expressions, a single matching model has difficulty achieving good results.

    Yidu Cloud, a leading medical artificial intelligence technology company, is also the first unicorn company to drive medical innovation solutions with data intelligence. With the mission of "data intelligence and green medical care" and the goal of "improving the relationship between human beings and diseases", Yidu Cloud uses data and artificial intelligence to help the government, hospitals and the whole industry fully tap the value of medical big data, and to build a big data ecological platform for the medical industry that can cover the whole country with overall utilization and unified access. Since its establishment in 2013, Yidu Cloud has gathered world-renowned scientists and leading professionals to form a strong talent team. The company invests hundreds of millions of yuan in R&D and service systems every year, has built a medical data intelligence platform with large data processing capacity, high data integrity and a transparent development process, and has obtained dozens of software copyrights and national invention patents.
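
    Since the description frames clinical terminology standardization as a semantic similarity matching task, a minimal baseline sketch (not part of the dataset or of Yidu Cloud's pipeline; the vocabulary and mentions below are invented English examples) maps each raw phrase to the most similar standard term:

    ```python
    from difflib import SequenceMatcher

    # Invented standard vocabulary and raw clinical mentions, for illustration only
    standard_terms = ["type 2 diabetes mellitus", "acute myocardial infarction", "essential hypertension"]
    raw_mentions = ["diabetes type II", "acute MI", "high blood pressure (hypertension)"]

    def normalize(mention, vocabulary):
        """Return the standard term with the highest character-level similarity."""
        return max(vocabulary,
                   key=lambda term: SequenceMatcher(None, mention.lower(), term.lower()).ratio())

    for mention in raw_mentions:
        print(mention, "->", normalize(mention, standard_terms))
    ```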

  14. North Carolina Population and Housing Statistics

    • kaggle.com
    zip
    Updated Dec 20, 2023
    Cite
    The Devastator (2023). North Carolina Population and Housing Statistics [Dataset]. https://www.kaggle.com/datasets/thedevastator/north-carolina-population-and-housing-statistics
    Explore at:
    zip (723890417 bytes)
    Available download formats
    Dataset updated
    Dec 20, 2023
    Authors
    The Devastator
    Area covered
    North Carolina
    Description

    North Carolina Population and Housing Statistics

    Demographic and Housing Trends in North Carolina

    By Matthew Schnars [source]

    About this dataset

    This comprehensive dataset provides a detailed and robust statistical representation of various characteristics related to the population and housing conditions of North Carolina. The dataset originates from NC LINC (Log Into North Carolina), a data platform focused on sharing information regarding diverse aspects of the state's demographics, socio-economic conditions, education, and employment background.

    The dataset covers a variety of facets, such as population estimates by age group, race, or ethnic group, across different geographic areas within the state, including counties and municipalities. This expansive set of data could prove instrumental for researchers looking into demographic trends, market estimation studies, or any other analysis requiring population data.

    On the housing side, the dataset also gives a complete perspective on the various types of residences available throughout the region. It covers both renter-occupied and owner-occupied housing units, providing a view of home ownership versus rental patterns in North Carolina, together with occupancy details for vacant homes.

    Another useful section is the ethnicity-based data broken down across numerous age groups, which can support research on the diverse communities living in the area.

    Overall, this dataset constitutes an essential resource for stakeholders interested in understanding demographic change over time or gaining insight into housing availability across different localities in North Carolina, and can inform urban planning strategies and policies that directly benefit residents.

    How to use the dataset

    This dataset offers a broad range of demographic and housing data for North Carolina, making it an ideal resource for those interested in demographic trends, urban planning, social science research, real estate and economic studies. Here's how to get the most out of it:

    • Interpretation: Determine what each column represents in terms of demographic and housing attributes. Familiarize yourself with the unique characteristics that each column represents such as population size, race categories, gender distributions etc.

    • Comparison Studies: Analyze different locations within North Carolina by comparing figures across rows (geographic units). This can provide insight on socio-economic disparities or geographical preferences among residents.

    • Temporal Analysis: Although the dataset doesn't contain specific dates or timeframes directly related to these statistics, you can cross-reference with external datasets from different years to conduct temporal analysis procedures such as observing the growth rates in population or changes in housing statistics.

    • Joining Data: Combine this dataset with other relevant datasets like education levels or crime rates which may not be available here but could add multidimensional value when conducting thorough analyses.

    • Correlation Studies: Perform correlation studies between different columns e.g., is there a strong correlation between population density and number of occupied houses? Such insights may be valuable for multiple sectors including real estate investment or policy-making purposes.

    • Map Visualization: Use geographic tools to map data based on counties/townships providing visual perspectives over raw number comparisons which could potentially lead to more nuanced interpretations of demographic distributions across North Carolina

    • Predictive Modelling/Forecasting: Based on the historical figures available through this dataset, develop models that predict future trends in demographics and the housing sector.

    • Presentation/Communication Tool: Whether you're delivering a presentation about social class disparities in NC regions or just curious about where populations are densest versus where there are more mobile homes than homes owned free and clear, summarize and display the data in an easy-to-understand format.

    Before diving deep, always remember to clean the dataset by eliminating duplicates, filling NA values appropriately, and verifying the authenticity of the data (a minimal cleaning and correlation sketch follows below). Furthermore, always respect privacy and comply with data regulation policies while handling demographic databases.
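
    A minimal pandas sketch of the cleaning and correlation steps mentioned above; the file name and the population and occupied_housing_units columns are hypothetical placeholders, so substitute the actual fields from the download:

    ```python
    import pandas as pd

    # Hypothetical file and column names; replace with the real ones from this dataset
    df = pd.read_csv("nc_population_housing.csv")

    df = df.drop_duplicates()              # eliminate duplicate rows
    df = df.dropna(subset=["population"])  # or fill NA values where that is more appropriate

    # Correlation between population and occupied housing units across geographic units
    corr = df["population"].corr(df["occupied_housing_units"])
    print(f"Pearson correlation: {corr:.2f}")
    ```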

    Research Ideas

    • Urban Planning: This dataset can be a val...
  15. Zillow Home Value Index (Updated Monthly)

    • kaggle.com
    zip
    Updated Oct 21, 2025
    Cite
    Rob Mulla (2025). Zillow Home Value Index (Updated Monthly) [Dataset]. https://www.kaggle.com/datasets/robikscube/zillow-home-value-index
    Explore at:
    zip (273663 bytes)
    Available download formats
    Dataset updated
    Oct 21, 2025
    Authors
    Rob Mulla
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Reference: https://www.zillow.com/research/zhvi-methodology/

    Official Background

    In setting out to create a new home price index, a major problem Zillow sought to overcome in existing indices was their inability to deal with the changing composition of properties sold in one time period versus another time period. Both a median sale price index and a repeat sales index are vulnerable to such biases (see the analysis here for an example of how influential the bias can be). For example, if expensive homes sell at a disproportionately higher rate than less expensive homes in one time period, a median sale price index will characterize this market as experiencing price appreciation relative to the prior period of time even if the true value of homes is unchanged between the two periods.

    The ideal home price index would be based off sale prices for the same set of homes in each time period so there was never an issue of the sales mix being different across periods. This approach of using a constant basket of goods is widely used, common examples being a commodity price index and a consumer price index. Unfortunately, unlike commodities and consumer goods, for which we can observe prices in all time periods, we can’t observe prices on the same set of homes in all time periods because not all homes are sold in every time period.

    The innovation that Zillow developed in 2005 was a way of approximating this ideal home price index by leveraging the valuations Zillow creates on all homes (called Zestimates). Instead of actual sale prices on every home, the index is created from estimated sale prices on every home. While there is some estimation error associated with each estimated sale price (which we report here), this error is just as likely to be above the actual sale price of a home as below (in statistical terms, this is referred to as minimal systematic error). Because of this fact, the distribution of actual sale prices for homes sold in a given time period looks very similar to the distribution of estimated sale prices for this same set of homes. But, importantly, Zillow has estimated sale prices not just for the homes that sold, but for all homes even if they didn’t sell in that time period. From this data, a comprehensive and robust benchmark of home value trends can be computed which is immune to the changing mix of properties that sell in different periods of time (see Dorsey et al. (2010) for another recent discussion of this approach).

    For an in-depth comparison of the Zillow Home Value Index to the Case Shiller Home Price Index, please refer to the Zillow Home Value Index Comparison to Case-Shiller

    Each Zillow Home Value Index (ZHVI) is a time series tracking the monthly median home value in a particular geographical region. In general, each ZHVI time series begins in April 1996. We generate the ZHVI at seven geographic levels: neighborhood, ZIP code, city, congressional district, county, metropolitan area, state and the nation.

    Underlying Data

    Estimated sale prices (Zestimates) are computed based on proprietary statistical and machine learning models. These models begin the estimation process by subdividing all of the homes in United States into micro-regions, or subsets of homes either near one another or similar in physical attributes to one another. Within each micro-region, the models observe recent sale transactions and learn the relative contribution of various home attributes in predicting the sale price. These home attributes include physical facts about the home and land, prior sale transactions, tax assessment information and geographic location. Based on the patterns learned, these models can then estimate sale prices on homes that have not yet sold.

    The sale transactions from which the models learn patterns include all full-value, arms-length sales that are not foreclosure resales. The purpose of the Zestimate is to give consumers an indication of the fair value of a home under the assumption that it is sold as a conventional, non-foreclosure sale. Similarly, the purpose of the Zillow Home Value Index is to give consumers insight into the home value trends for homes that are not being sold out of foreclosure status. Zillow research indicates that homes sold as foreclosures have typical discounts relative to non-foreclosure sales of between 20 and 40 percent, depending on the foreclosure saturation of the market. This is not to say that the Zestimate is not influenced by foreclosure resales. Zestimates are, in fact, influenced by foreclosure sales, but the pathway of this influence is through the downward pressure foreclosure sales put on non-foreclosure sale prices. It is the price signal observed in the latter that we are attempting to measure and, in turn, predict with the Zestimate.
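
    To make the idea concrete, the sketch below computes a naive median-of-Zestimates series per region and month with pandas. It ignores the additional weighting and smoothing in the official ZHVI methodology, and the column names are assumptions rather than Zillow's schema:

    ```python
    import pandas as pd

    # Assumed long-format table of estimated sale prices (Zestimates); columns are hypothetical
    zestimates = pd.DataFrame({
        "region":    ["27601", "27601", "27601", "90210", "90210", "90210"],
        "month":     ["2024-01", "2024-01", "2024-02", "2024-01", "2024-02", "2024-02"],
        "zestimate": [310_000, 295_000, 312_000, 2_400_000, 2_450_000, 2_380_000],
    })

    # Naive index: the median estimated value of *all* homes in a region each month,
    # so the mix of homes does not change from one period to the next
    naive_index = zestimates.groupby(["region", "month"])["zestimate"].median().unstack("month")
    print(naive_index)
    ```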

    Market Segments

    Within each region, we calculate the ZHVI for various subsets of homes (or mar...

  16. CTF4Science: Lorenz Official DS

    • kaggle.com
    zip
    Updated May 14, 2025
    Cite
    AI Institute in Dynamic Systems (2025). CTF4Science: Lorenz Official DS [Dataset]. https://www.kaggle.com/datasets/dynamics-ai/ctf4science-lorenz-official-ds/discussion
    Explore at:
    zip (3120516 bytes)
    Available download formats
    Dataset updated
    May 14, 2025
    Dataset authored and provided by
    AI Institute in Dynamic Systems
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Lorenz System Dataset - CTF4Science

    Dataset Description

    This dataset contains numerical simulations of the Lorenz system, one of the most influential and widely-studied dynamical systems in history. The Lorenz equations exhibit chaotic behavior and are a canonical benchmark for evaluating data-driven algorithms in dynamical systems modeling, forecasting, and control.

    The Lorenz Equations

    The Lorenz system is defined by three coupled ordinary differential equations (ODEs):

    dx/dt = σ(y - x)
    dy/dt = rx - xz - y
    dz/dt = xy - bz
    

    where:

    • x, y, z are the three state variables
    • σ = 10 (Prandtl number)
    • b = 8/3 (geometric factor)
    • r = 28 (Rayleigh number, at which the system exhibits chaotic behavior)

    The system produces the famous "butterfly attractor" in 3D phase space, characterized by sensitive dependence on initial conditions and bounded aperiodic trajectories.
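
    For readers who want to generate comparable trajectories, a minimal integration of the equations above with the stated parameters (σ = 10, b = 8/3, r = 28, Δt = 0.05) might look like the sketch below; it is an illustration, not the official data-generation script, and the initial condition is arbitrary:

    ```python
    import numpy as np
    from scipy.integrate import solve_ivp

    sigma, b, r = 10.0, 8.0 / 3.0, 28.0
    dt, n_steps = 0.05, 10_000

    def lorenz(t, state):
        x, y, z = state
        return [sigma * (y - x), r * x - x * z - y, x * y - b * z]

    t_eval = np.linspace(0.0, (n_steps - 1) * dt, n_steps)
    sol = solve_ivp(lorenz, (0.0, t_eval[-1]), [1.0, 1.0, 1.0],
                    t_eval=t_eval, rtol=1e-9, atol=1e-9)

    trajectory = sol.y.T  # shape (10000, 3), the same layout as X1train
    print(trajectory.shape)
    ```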

    Dataset Purpose

    This dataset is part of the Common Task Framework (CTF) for Science, providing standardized benchmarks for evaluating machine learning algorithms on scientific dynamical systems. The CTF addresses fundamental challenges including:

    • Short-term forecasting: Predicting near-future trajectories within Lyapunov time
    • Long-term statistical forecasting: Capturing the probability distribution of states over long horizons
    • Noisy data reconstruction: Denoising and modeling from corrupted measurements
    • Limited data scenarios: Learning dynamics from sparse temporal observations
    • Parametric generalization: Transferring learned models to different parameter regimes

    Key Dataset Characteristics

    • System Type: Low-dimensional chaotic ODE (3 state variables)
    • Spatial Dimension: 3 (x, y, z coordinates)
    • Time Step: Δt = 0.05
    • Behavior: Chaotic trajectories with sensitivity to initial conditions
    • Data Format: Available in both MATLAB (.mat) and CSV formats
    • Evaluation Metrics:
      • Short-term: Root Mean Square Error (RMSE)
      • Long-term: Histogram L2 error comparing state distributions (bins=41)

    Evaluation Tasks

    The dataset supports 12 evaluation metrics (E1-E12) organized into 4 main task categories:

    Test 1: Forecasting (E1, E2)

    • Input: X1train (10000 × 3)
    • Task: Forecast future 1000 timesteps
    • Metrics:
      • E1: Short-term RMSE on first k timesteps
      • E2: Long-term histogram matching of state distributions (x, y, z separately)

    Test 2: Noisy Data (E3, E4, E5, E6)

    • Medium Noise (E3, E4): Train on X2train, reconstruct and forecast
    • High Noise (E5, E6): Train on X3train, reconstruct and forecast
    • Metrics: Reconstruction accuracy (RMSE) + Long-term forecasting (histogram L2)

    Test 3: Limited Data (E7, E8, E9, E10)

    • Noise-Free Limited (E7, E8): 100 snapshots in X4train
    • Noisy Limited (E9, E10): 100 snapshots in X5train
    • Metrics: Short and long-term forecasting from sparse temporal data

    Test 4: Parametric Generalization (E11, E12)

    • Input: Three training trajectories (X6, X7, X8) at different parameter values
    • Task: Interpolate (E11) and extrapolate (E12) to new parameters
    • Burn-in: X9train and X10train provide initialization (100 timesteps each)
    • Metrics: Short-term RMSE on parameter generalization

    Long-Term Evaluation Metric (Histogram Comparison)

    Unlike our KS dataset which uses power spectral density, the Lorenz system uses histogram-based distribution matching for long-term forecasting evaluation:

    • Bins: 41 bins for each state variable
    • Method: Compute histograms of x, y, z over the last k timesteps
    • Error: L1 norm difference between predicted and true histograms, averaged over x, y, z
    • Rationale: Beyond the Lyapunov time (~3 time units), exact trajectory matching is impossible due to chaos. Instead, we evaluate whether the predicted trajectory explores the same regions of phase space with the correct statistical distribution (see the sketch after this list).
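
    A sketch of the histogram-based long-term score as described above. The bin count (41) follows the text; k is left as a parameter, and because the description mentions both an "L2 error" and an "L1 norm difference", the norm order is also a parameter:

    ```python
    import numpy as np

    def histogram_error(pred, true, k=1000, bins=41, norm_ord=1):
        """Average histogram difference over x, y, z for the last k timesteps.

        pred and true are trajectories of shape (T, 3). norm_ord=1 gives an L1
        difference, norm_ord=2 an L2 difference.
        """
        errors = []
        for d in range(3):
            lo = min(pred[-k:, d].min(), true[-k:, d].min())
            hi = max(pred[-k:, d].max(), true[-k:, d].max())
            h_pred, _ = np.histogram(pred[-k:, d], bins=bins, range=(lo, hi), density=True)
            h_true, _ = np.histogram(true[-k:, d], bins=bins, range=(lo, hi), density=True)
            errors.append(np.linalg.norm(h_pred - h_true, ord=norm_ord))
        return float(np.mean(errors))
    ```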

    Chaos and Lyapunov Time

    The Lorenz system is chaotic, meaning:

    • Small differences in initial conditions grow exponentially
    • Long-term exact trajectory prediction is fundamentally impossible
    • The Lyapunov time is approximately 1.1 time units (or ~22 integration steps)
    • After ~3 Lyapunov times, trajectory divergence is complete
    • Therefore, long-term evaluation focuses on statistical properties, not trajectory matching

    Usage Notes

    1. Hidden Test Sets: The actual test data (X1test through X9test) are hidden and used only for evaluation on the CTF leaderboard
    2. Baseline Scores: Use average value prediction as the baseline reference for long-term metrics
    3. Score Range: All scores are clipped to [-100, 100], where 100 represents perfect prediction
    4. Data Continuity: Start indices in YAML ind...
  17. Death Rates

    • kaggle.com
    zip
    Updated Jul 23, 2024
    Cite
    Melissa Monfared (2024). Death Rates [Dataset]. https://www.kaggle.com/datasets/melissamonfared/death-rates-united-states/code
    Explore at:
    zip (87422 bytes)
    Available download formats
    Dataset updated
    Jul 23, 2024
    Authors
    Melissa Monfared
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Context:

    This dataset provides data on death rates for suicide categorized by selected population characteristics including sex, race, Hispanic origin, and age in the United States. It includes critical information about measures, definitions, and changes over time.

    Source:

    • NCHS, National Vital Statistics System (NVSS)
    • Grove RD, Hetzel AM. Vital statistics rates in the United States, 1940–1960. National Center for Health Statistics. 1968
    • Numerator data from NVSS annual public-use Mortality Files
    • Denominator data from U.S. Census Bureau national population estimates
    • Murphy SL, Xu JQ, Kochanek KD, Arias E, Tejada-Vera B. Deaths: Final data for 2018. National Vital Statistics Reports; vol 69 no 13. Hyattsville, MD: National Center for Health Statistics. 2021

    Source URLs:

    Death rates for suicide by sex, race, Hispanic origin, and age: United States - HUS 2019 Data Finder - National Vital Statistics Reports - NVSS Appendix Entry

    Dataset Details and Key Features

    The dataset consists of data collected from the National Vital Statistics System (NVSS) and the U.S. Census Bureau, providing a comprehensive overview of suicide death rates across different demographics in the United States from 1950 to 2001.

    Key Features:

    • Historical Coverage: Data spans from 1950 to 2001, providing long-term trends.
    • Demographic Breakdown: Includes data by sex, race, Hispanic origin, and age, facilitating targeted analysis.
    • Yearly Data: Provides annual death rate estimates, enabling year-over-year comparison.
    • Reliable Sources: Data collected from NVSS and U.S. Census Bureau, ensuring accuracy and reliability.

    Usage:

    Research and Analysis:

    • Trend Analysis: Study long-term trends in suicide rates across different demographic groups.
    • Impact Assessment: Analyze the impact of socio-economic factors on suicide rates over time.
    • Health Disparities: Identify disparities in suicide rates among different demographic segments.

    Policy Making:

    • Intervention Development: Inform the creation of targeted interventions for high-risk groups.
    • Resource Allocation: Aid in the effective allocation of resources to areas with higher suicide rates.
    • Policy Evaluation: Evaluate the effectiveness of past policies and programs aimed at reducing suicide rates.

    Public Health Initiatives:

    • Awareness Campaigns: Develop awareness campaigns tailored to specific demographic groups.
    • Prevention Programs: Design and implement suicide prevention programs based on demographic data.
    • Community Outreach: Facilitate community outreach efforts by identifying high-risk areas.

    Data Maintenance:

    Updates:

    • Periodic Updates: The dataset is periodically updated to incorporate the latest available data.
    • Version Control: Maintains previous versions for reference and longitudinal studies.

    Quality Assurance:

    • Data Validation: Ensures data accuracy through rigorous validation processes.
    • Consistency Checks: Regular consistency checks to maintain data integrity.

    Additional Notes:

    • For detailed definitions and explanations of measures, refer to the PDF or Excel version of this table in the HUS 2019 Data Finder.
    • Numerator data is derived from NVSS annual public-use Mortality Files, while denominator data comes from U.S. Census Bureau national population estimates.
    • The dataset also includes historical data, providing context and continuity for contemporary analysis.

    Columns:

    • INDICATOR: Indicator for the data type, e.g., Death rate
    • UNIT: Unit of measurement, e.g., Deaths per 100,000 population
    • UNIT_NU: Numerical value representing the unit
    • STUB_NA: Stub name for category, e.g., Total
    • STUB_LA: Label for the stub category, e.g., All persons
    • STUB_LA_1: Additional label information for the stub category
    • YEAR: The year the data was recorded
    • YEAR_NUM: Numerical value representing the year
    • AGE: Age group category, e.g., All ages
    • AGE_NUM: Numerical value representing the age group
    • ESTIMATE: Estimated death rate
  18. Vehicle Dataset 2024

    • kaggle.com
    zip
    Updated May 29, 2024
    Cite
    Kanchana1990 (2024). Vehicle Dataset 2024 [Dataset]. https://www.kaggle.com/datasets/kanchana1990/vehicle-dataset-2024/code
    Explore at:
    zip (315066 bytes)
    Available download formats
    Dataset updated
    May 29, 2024
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    Dataset Overview

    The "Vehicle Dataset 2024" provides a comprehensive look at new vehicles available in the market, including SUVs, cars, trucks, and vans. This dataset contains detailed information on various attributes such as make, model, year, price, mileage, and more. With 1002 entries and 18 columns, this dataset is ideal for data science enthusiasts and professionals looking to practice data cleaning, exploratory data analysis (EDA), and predictive modeling.

    Data Science Applications

    Given the richness of the data, this dataset can be used for a variety of data science applications, including but not limited to:

    • Price Prediction: Build models to predict vehicle prices based on features such as make, model, year, and mileage (see the sketch after the column descriptors below).
    • Market Analysis: Perform market segmentation and identify trends in vehicle types, brands, and pricing.
    • Descriptive Statistics: Conduct comprehensive descriptive statistical analyses to summarize and describe the main features of the dataset.
    • Visualization: Create visualizations to illustrate the distribution of prices, mileage, and other features across different vehicle types.
    • Data Cleaning: Practice data cleaning techniques, handling missing values, and transforming data for further analysis.
    • Feature Engineering: Develop new features to improve model performance, such as price per year or mileage per year.

    Column Descriptors

    1. name: The full name of the vehicle, including make, model, and trim.
    2. description: A brief description of the vehicle, often including key features and selling points.
    3. make: The manufacturer of the vehicle (e.g., Ford, Toyota, BMW).
    4. model: The model name of the vehicle.
    5. type: The type of the vehicle, which is "New" for all entries in this dataset.
    6. year: The year the vehicle was manufactured.
    7. price: The price of the vehicle in USD.
    8. engine: Details about the engine, including type and specifications.
    9. cylinders: The number of cylinders in the vehicle's engine.
    10. fuel: The type of fuel used by the vehicle (e.g., Gasoline, Diesel, Electric).
    11. mileage: The mileage of the vehicle, typically in miles.
    12. transmission: The type of transmission (e.g., Automatic, Manual).
    13. trim: The trim level of the vehicle, indicating different feature sets or packages.
    14. body: The body style of the vehicle (e.g., SUV, Sedan, Pickup Truck).
    15. doors: The number of doors on the vehicle.
    16. exterior_color: The exterior color of the vehicle.
    17. interior_color: The interior color of the vehicle.
    18. drivetrain: The drivetrain of the vehicle (e.g., All-wheel Drive, Front-wheel Drive).
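
    As an example of the price-prediction use case flagged above, here is a small scikit-learn sketch using a few of the listed columns. Treat it as a starting point rather than a tuned model; the CSV file name is a placeholder, and it assumes price and mileage have already been cleaned to numeric values:

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("vehicle_dataset_2024.csv")  # placeholder file name

    # One-hot encode the categorical columns, keep the numeric ones as-is
    features = pd.get_dummies(df[["year", "mileage", "make", "body"]], drop_first=True)
    target = df["price"]

    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    print(f"R^2 on held-out vehicles: {model.score(X_test, y_test):.2f}")
    ```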

    Ethically Mined Data

    This dataset was ethically mined from cars.com using an API provided by Apify. All data collection practices adhered to the terms of service and privacy policies of the source website, ensuring the ethical use of data.

    Acknowledgements

    • Apify: For providing the API used to scrape the data from cars.com.
    • Cars.com: For being the source of the vehicle data.
    • DALL-E 3: For generating the thumbnail image for this dataset.
  19. Living Conditions Survey 2016-2017 - Afghanistan

    • datacatalog.ihsn.org
    • catalog.ihsn.org
    Updated Dec 5, 2019
    Cite
    National Statistics and Information Authority (NSIA) (2019). Living Conditions Survey 2016-2017 - Afghanistan [Dataset]. https://datacatalog.ihsn.org/catalog/8014
    Explore at:
    Dataset updated
    Dec 5, 2019
    Dataset provided by
    Central Statistical Organization
    Authors
    National Statistics and Information Authority (NSIA)
    Time period covered
    2016 - 2017
    Area covered
    Afghanistan
    Description

    Abstract

    The Afghanistan Living Conditions Survey (previously known as NRVA - National Risk and Vulnerability Assessment) is the national multi-purpose survey of Afghanistan, conducted by the National Statistics and Information Authority (NSIA, formerly known as Central Statistics Organization) of Afghanistan.

    The ALCS aims to assist the Government of Afghanistan and other stakeholders in making informed decisions in development planning and policy making, by collecting and analyzing data related to poverty, food security, employment, housing, health, education, population, gender and a wide range of other development issues. The sampling design of the survey allows representative results at the national and provincial level. Besides presenting a large set of recurrent development indicators and statistics, the present 2016-17 round has a specific focus on poverty, food security and disability.

    Over the years the ALCS and NRVA surveys have been the country’s most important source of indicators for monitoring the Millennium Development Goals (MDGs). The ALCS will similarly serve as the main source for producing the set of indicators that were endorsed in March 2016 by the UN Statistical Commission to monitor the implementation of the 2030 Agenda for Sustainable Development. Although this set of Sustainable Development Goals (SDG) indicators was only finalized around the time the ALCS went into the field, required information for many new indicators was anticipated and accommodated in the questionnaire design. As a result, ALCS 2016-17 will be able to report on and set the baseline for 20 SDG indicators.

    Geographic coverage

    National coverage, the survey was designed to produce representative estimates for the national and provincial levels, and for the Kuchi population.

    Analysis unit

    • Household

    • Individual

    • Community

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sampling design of the ALCS 2016-17 ensured results that are representative at national and provincial level, for the Kuchi population and for Shamsi calendar seasons. In total, 35 strata were identified, 34 for the provinces of Afghanistan and one for the nomadic Kuchi population. Stratification by season was achieved by equal distribution of data collection over 12 months within the provinces. For the Kuchi population, the design only provided sampling in winter and late summer when communities tend to temporarily settle. The distribution of sampling areas per province was based on an optimal trade-off between precision at the national and provincial levels.

    For seven provinces, the sampling frame for the resident population consisted of the household listing of the Socio-Demographic and Economic Survey (SDES): Bamyan, Ghor, Daykundi, Kapisa, Parwan, Samangan and Kabul. For all other provinces, the sampling frame depended on the pre-census household listing conducted by NSIA in 2003-05 and updated in 2009. Households were selected on the basis of a two-stage cluster design within each province. In the first sampling stage Enumeration Areas (EAs) were selected as Primary Sampling Units (PSUs) with probability proportional to EA size (PPS). Subsequently, in the second stage, ten households were selected as the Ultimate Sampling Unit (USU). The design thus provided data collection in on average 170 clusters (1,700 households) per month and 2,040 clusters (20,400 households) in the full year of data collection.

    The Kuchi sample was designed on basis of the 2003-04 National Multi-sectoral Assessment of Kuchi (NMAK-2004). For this stratum, a community selection was implemented with PPS and a second stage selection with again a constant cluster size of ten households. The 60 clusters (600 households) for this stratum were divided between the summer and winter periods within the survey period, with 40 and 20 clusters, respectively.

    Sampling deviation

    The reality of survey taking in Afghanistan imposed a number of deviations from the sampling design. In the first three months of fieldwork, areas that were inaccessible due to insecurity were replaced by sampled areas that were scheduled for a later month, in the hope that over time security conditions would improve, and the original cluster interviews could still be conducted. In view of sustained levels of insecurity, from the fourth month of data collection onward, clusters in inaccessible areas were replaced by clusters drawn from a reserve sampling frame that excluded insecure districts.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    Since 2003, the successive survey rounds incorporated an increasing number of questions. This continued even to the extent that interview burden and workloads in data processing and analysis overreached the capacity of fieldworkers, respondents and NSIA staff. The need to compress all information requirements into one survey that was conducted at irregular intervals was reduced when the Afghanistan National Statistical Plan (ANSP) (CSO 2010) was formulated. The ANSP presented a medium-term perspective that anticipated the implementation of NRVA - now ALCS - as the national multi-purpose survey of Afghanistan on an annual basis. Rather than including all questions and topics every year, the principle of producing information on a rotating basis was introduced. While each survey round provides a core set of key indicators, successive rounds add or expand different modules to provide more detailed information on specific subjects. In the series of consultations with stakeholders in 2010, agreement was reached to re-design the ALCS data collection and questionnaires according to this rotation principle. This implied that information needs and survey implementation could be achieved in a more sustainable and efficient way.

    The core of the ALCS 2016-17 consists of a household questionnaire with 16 subject-matter sections: 11 administered by male interviewers and answered by the male household representative (usually the head of household), and five asked by female interviewers of female respondents. In addition, the questionnaire includes three modules for identification and monitoring purposes. In the last five months of the fieldwork, one more module was added to test a methodology for water quality assessment.

    On average, the time required to answer the household questionnaire was one to one and a half hours.

    Cleaning operations

    A data-entry programme in CSPro software has been developed to manually capture the survey data, applying first data entry and dependent verification through double data entry to minimise data-entry errors. In addition, CSPro data-editing programmes were applied to identify errors and either perform automatic imputation or manual screen editing, or refer cases to data editors for further questionnaire verification and manual corrections. A final round of monthly data checking was performed by the project Data Processing Expert.

    NSIA's data-entry section started entering the first month of data in June 2016. Usually, data were entered and verified within two weeks from reception of questionnaires from the manual checking and coding section. Data capture and editing operations were completed in May 2017.

    Extensive programmes in Stata software were developed or updated to perform the final data verification, correction, editing and imputation procedures. A full dataset was available in August 2017 in Stata and SPSS formats. A team of 15 national and international analysts contributed to the present Analysis Report.

    Response rate

    Unit non-response in ALCS 2016-2017 occurred to the extent that sampled clusters were not visited, or that sampled households in selected clusters were not interviewed. Out of the 2,102 originally scheduled clusters, 294 (14 percent) were not visited. For 196 of these non-visited clusters, replacement clusters were sampled and visited. Although this ensured the approximation of the targeted sample size, it could not avoid the likely introduction of some bias, as the omitted clusters probably have a different profile than included clusters.

    In the visited clusters - including replacement clusters - 1,021 households (5.1 percent of the total) could not be interviewed because - mostly - they were not found or because they refused or were unable to participate. For 1,019 of these non-response households (5.1 percent of the total), replacement households were sampled and interviewed. Since the household non-response is low and it can be expected that the replacement households provide a reasonable representation of the non-response households, this non-response error is considered of minor importance.

    The overall unit non-response rate - including non-visited clusters and non-interviewed households, without replacement - is 14.0 percent.

    Sampling error estimates

    Statistics based on a sample, such as means and percentages, generally differ from the statistics based on the entire population, since the sample does not include all the units of that population. The sampling error refers to the difference between the statistics of the sample and that of the total population. Usually, this error cannot be directly observed or measured, but is estimated probabilistically.

    The sampling error is generally measured in terms of the standard error for a particular statistic, which equals the square root of the variance of that statistic in the sample. Subsequently, the standard error can be used to calculate the confidence interval within which the true value of the statistic for the entire population can reasonably be assumed to fall: a
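
    The standard-error logic can be illustrated with a toy calculation. Note that the simple random-sample formula below understates the true sampling error for the ALCS, whose two-stage cluster design calls for design-based variance estimation:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=50.0, scale=12.0, size=400)   # toy sample of some household statistic

    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(sample.size)        # standard error under simple random sampling
    ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se  # approximate 95% confidence interval
    print(f"mean = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
    ```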

  20. PISA Performance Scores by Country

    • kaggle.com
    zip
    Updated Dec 6, 2023
    Cite
    The Devastator (2023). PISA Performance Scores by Country [Dataset]. https://www.kaggle.com/datasets/thedevastator/pisa-performance-scores-by-country/code
    Explore at:
    zip (14656 bytes)
    Available download formats
    Dataset updated
    Dec 6, 2023
    Authors
    The Devastator
    Description

    PISA Performance Scores by Country

    PISA Performance Scores by Country and Year

    By Dennis Kao [source]

    About this dataset

    The OECD PISA dataset provides performance scores for 15-year-old students in reading, mathematics, and science across OECD countries. The dataset covers the years 2000 to 2018.

    These performance scores are measured using the Programme for International Student Assessment (PISA), which evaluates students' abilities to apply their knowledge and skills in reading, mathematics, and science to real-life challenges.

    Reading performance is assessed based on the capacity to comprehend, use, and reflect on written texts for achieving goals, developing knowledge and potential, and participating in society.

    Mathematical performance measures a student's mathematical literacy by evaluating their ability to formulate, employ, and interpret mathematics in various contexts. This includes describing, predicting, and explaining phenomena while recognizing the role that mathematics plays in the world.

    Scientific performance examines a student's scientific literacy: the ability to use scientific knowledge to identify questions, acquire new knowledge, explain scientific phenomena, and draw evidence-based conclusions about science-related issues.

    The dataset includes information on the performance scores categorized by location (country alpha‑3 codes), indicator (reading, mathematical, or scientific performance), subject (boys/girls/total), and time of measurement (year). The mean score for each combination of these variables is provided in the Value column.

    For more detailed information on how the dataset was collected and analyzed, please refer to the original source

    How to use the dataset

    Understanding the Columns

    Before diving into the analysis, it is important to understand the meaning of each column in the dataset:

    • LOCATION: This column represents country alpha-3 codes. OAVG indicates an average across all OECD countries.

    • INDICATOR: The performance indicator being measured can be one of three options: Reading performance (PISAREAD), Mathematical performance (PISAMATH), or Scientific performance (PISASCIENCE).

    • SUBJECT: This column categorizes subjects as BOY (boys), GIRL (girls), or TOT (total). It indicates which group's scores are being considered.

    • TIME: The year in which the performance scores were measured can range from 2000 to 2018.

    • Value: The mean score of the performance indicator for a specific subject and year is provided in this column as a floating-point number.

    Getting Started with Analysis

    Here are some ideas on how you can start exploring and analyzing this dataset:

    • Comparing countries: You can use this dataset to compare educational performances between different countries over time for various subjects like reading, mathematics, and science (see the sketch after this list).

    • Subject-based analysis: You can focus on studying how gender affects students' performances by filtering data based on subject ('BOY', 'GIRL') along with years or individual countries.

    • Time-based trends: Analyze trends over time by examining changes in mean scores for various indicators across years.

    • OECD vs Non-OECD Countries: Determine if there are significant differences in performance scores between OECD countries and non-OECD countries. You can filter the data by the LOCATION column to obtain separate datasets for each group and compare their mean scores.
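
    A small pandas sketch of the country-comparison idea above. The CSV file name and the example country codes are placeholders, while the column names and code values (LOCATION, INDICATOR, SUBJECT, TIME, Value, PISAREAD, TOT, OAVG) follow the description:

    ```python
    import pandas as pd

    pisa = pd.read_csv("pisa_performance_scores.csv")  # placeholder file name

    # Mean reading score for all students (TOT), by year and country
    reading = pisa[(pisa["INDICATOR"] == "PISAREAD") & (pisa["SUBJECT"] == "TOT")]
    trend = reading.pivot_table(index="TIME", columns="LOCATION", values="Value")

    # Compare two example countries against the OECD average over time
    print(trend[["FIN", "USA", "OAVG"]].round(1))
    ```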

    Data Visualization

    To enhance your understanding of the dataset, visuali...

UC_vs_US Statistic Analysis.xlsx

Explore at:
xlsxAvailable download formats
Dataset updated
Jul 9, 2020
Dataset provided by
Utrecht University
Authors
F. (Fabiano) Dalpiaz
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the used measures described in the paper. For each subject, it includes multiple columns: A. a sequential student ID B an ID that defines a random group label and the notation C. the used notation: user Story or use Cases D. the case they were assigned to: IFA, Sim, or Hos E. the subject's exam grade (total points out of 100). Empty cells mean that the subject did not take the first exam F. a categorical representation of the grade L/M/H, where H is greater or equal to 80, M is between 65 included and 80 excluded, L otherwise G. the total number of classes in the student's conceptual model H. the total number of relationships in the student's conceptual model I. the total number of classes in the expert's conceptual model J. the total number of relationships in the expert's conceptual model K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below) P. the researchers' judgement on how well the derivation process explanation was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping ), or not present.

Tagging scheme:
Aligned (AL) - A concept is represented as a class in both models, either

with the same name or using synonyms or clearly linkable names; Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than class, or (ii) using a generic term (e.g., user'' instead ofurban planner''); System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent legacy system or the system under design (portal, simulator) are legitimate; Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud; Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

All the calculations and information provided in the following sheets

originate from that raw data.

Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,

including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

Sheet 3 (Size-Ratio):

The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade) . The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.

Sheet 4 (Overall):

Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. It is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
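
Restating the two ratios as formulas, with AL, WR, SO, and OM as defined in the tagging scheme:

```latex
\text{correctness} = \frac{AL}{AL + OM + SO + WR},
\qquad
\text{completeness} = \frac{AL + WR}{AL + WR + OM}
```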

For sheet 4 as well as for the following four sheets, diverging stacked bar

charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

Sheet 5 (By-Notation):

Model correctness and model completeness are compared by notation - UC, US.

Sheet 6 (By-Case):

Model correctness and model completeness are compared by case - SIM, HOS, IFA.

Sheet 7 (By-Process):

Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.

Sheet 8 (By-Grade):

Model correctness and model completeness are compared by the exam grades, converted to the categorical values High, Low, and Medium.
