83 datasets found
  1. Data from: WiBB: An integrated method for quantifying the relative...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Explore at:
    zip (available download formats)
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Field Museum of Natural History
    Beijing Normal University
    Authors
    Qin Li; Xiaojun Kou
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to compare their performance with the WiBB method in ranking predictor importance under various scenarios. We further applied it to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance, and hence in reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s approach (2014) with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of differences between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
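
    As a rough illustration of the simulation design, here is a minimal Python sketch (the paper's own scripts are in R and use mvtnorm::rmvnorm; this mirrors the preset correlation idea under the simplifying assumption of mutually uncorrelated predictors, not the authors' exact code):

    ```python
    # Simulate one dataset (n = 1000) with the preset correlation structure
    # (0.6, 0.4, 0.2, 0.0) between the response y and predictors x1..x4.
    import numpy as np

    r = np.array([0.6, 0.4, 0.2, 0.0])   # target y-vs-predictor correlations
    cov = np.eye(5)                      # variable order: y, x1, x2, x3, x4
    cov[0, 1:] = cov[1:, 0] = r          # predictors left mutually uncorrelated here

    rng = np.random.default_rng(1)
    data = rng.multivariate_normal(mean=np.zeros(5), cov=cov, size=1000)
    y, X = data[:, 0], data[:, 1:]

    # Check the realized correlations against the preset structure.
    print(np.corrcoef(data, rowvar=False)[0, 1:].round(2))
    ```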

  2. Data From: The 1014F knockdown resistance mutation is not a strong correlate...

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data From: The 1014F knockdown resistance mutation is not a strong correlate of phenotypic resistance to pyrethroids in Florida populations of Culex quinquefasciatus [Dataset]. https://catalog.data.gov/dataset/data-from-the-1014f-knockdown-resistance-mutation-is-not-a-strong-correlate-of-phenotypic--78b35
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Culex quinquefasciatus is an important target for vector control because of its ability to transmit pathogens that cause disease. Most populations are resistant to pyrethroids, and often to organophosphates, the two most common classes of active ingredients used by public health agencies. A knockdown resistance (kdr) mutation, resulting in a change from a leucine to a phenylalanine in the voltage-gated sodium channel, is one mechanism contributing to the pyrethroid-resistant phenotype. Enzymatic resistance has also been shown to play a very important role. Recent studies have shown strong resistance in populations even when kdr is relatively low, which indicates that factors other than kdr may be larger contributors to resistance. In this study, we examined, on a statewide scale (over 70 populations), the strength of the correlation between resistance in the CDC bottle bioassay and the kdr genotypes and allele frequencies. Spearman correlation analysis showed only moderate (-0.51) and weak (-0.29) correlations between the kdr genotype and permethrin and deltamethrin resistance, respectively. The frequency of the kdr allele was an even weaker correlate. These results indicate that, in contrast to Aedes aegypti, assessing kdr in populations of Culex quinquefasciatus is not a good surrogate for phenotypic resistance testing.
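
    The Spearman analysis reported above is straightforward to reproduce; a minimal sketch with toy numbers (the real bioassay and genotype values live in the dataset itself):

    ```python
    # Rank correlation between kdr genotype frequency and bottle-bioassay
    # mortality across populations (toy values, for illustration only).
    import numpy as np
    from scipy import stats

    kdr_genotype_freq = np.array([0.20, 0.35, 0.50, 0.60, 0.80, 0.90])
    percent_mortality = np.array([95, 80, 82, 65, 40, 35])

    rho, p = stats.spearmanr(kdr_genotype_freq, percent_mortality)
    print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
    ```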

  3. Dataset for: Comparison of Two Correlated ROC Surfaces at a Given Pair of...

    • wiley.figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Leonidas Bantis; Ziding Feng (2023). Dataset for: Comparison of Two Correlated ROC Surfaces at a Given Pair of True Classification Rates [Dataset]. http://doi.org/10.6084/m9.figshare.6527219.v1
    Explore at:
    xlsx (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    Wiley
    Authors
    Leonidas Bantis; Ziding Feng
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The receiver operating characteristic (ROC) curve is typically employed when one wants to evaluate the discriminatory capability of a continuous or ordinal biomarker in the case where two groups are to be distinguished, commonly the 'healthy' and the 'diseased'. There are cases for which the disease status has three categories. Such cases employ the ROC surface, which is a natural generalization of the ROC curve to three classes. In this paper, we explore new methodologies for comparing two continuous biomarkers that refer to a trichotomous disease status, when both markers are applied to the same patients. Comparisons based on the volume under the surface have been proposed, but that measure is often not clinically relevant. Here, we focus on comparing two correlated ROC surfaces at given pairs of true classification rates, which are more relevant to patients and physicians. We propose delta-based parametric techniques, power transformations to normality, and bootstrap-based smooth nonparametric techniques to investigate the performance of an appropriate test. We evaluate our approaches through an extensive simulation study and apply them to a real data set from prostate cancer screening.
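
    For intuition about the ROC surface itself, here is a minimal sketch of its summary measure: for three ordered groups, the volume under the surface (VUS) equals P(X1 < X2 < X3), estimated empirically below on toy data (the paper goes further, comparing two correlated surfaces at fixed true classification rates):

    ```python
    # Empirical VUS for a trichotomous disease status: the fraction of
    # (healthy, intermediate, diseased) triples that are correctly ordered.
    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    g1 = rng.normal(0.0, 1.0, 40)   # 'healthy' marker values
    g2 = rng.normal(1.0, 1.0, 40)   # intermediate group
    g3 = rng.normal(2.0, 1.0, 40)   # 'diseased' group

    vus = np.mean([a < b < c for a, b, c in product(g1, g2, g3)])
    print(f"empirical VUS = {vus:.3f} (chance level is 1/6)")
    ```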

  4. Data from: Spatio-Chromatic Adaptation via Higher-Order Canonical...

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Feb 12, 2014
    Cite
    Hyvärinen, Aapo; Laparra, Valero; Gutmann, Michael U.; Malo, Jesús (2014). Spatio-Chromatic Adaptation via Higher-Order Canonical Correlation Analysis of Natural Images [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001237164
    Explore at:
    Dataset updated
    Feb 12, 2014
    Authors
    Hyvärinen, Aapo; Laparra, Valero; Gutmann, Michael U.; Malo, Jesús
    Description

    Independent component and canonical correlation analysis are two general-purpose statistical methods with wide applicability. In neuroscience, independent component analysis of chromatic natural images explains the spatio-chromatic structure of primary cortical receptive fields in terms of properties of the visual environment. Canonical correlation analysis similarly explains chromatic adaptation to different illuminations. But, as we show in this paper, neither of the two methods generalizes well to explain both spatio-chromatic processing and adaptation at the same time. We propose a statistical method which combines the desirable properties of independent component and canonical correlation analysis: It finds independent components in each data set which, across the two data sets, are related to each other via linear or higher-order correlations. The new method is as widely applicable as canonical correlation analysis, and also to more than two data sets. We call it higher-order canonical correlation analysis. When applied to chromatic natural images, we found that it provides a single (unified) statistical framework which accounts for both spatio-chromatic processing and adaptation. Filters with spatio-chromatic tuning properties as in the primary visual cortex emerged and corresponding-colors psychophysics was reproduced reasonably well. We used the new method to make a theory-driven testable prediction on how the neural response to colored patterns should change when the illumination changes. We predict shifts in the responses which are comparable to the shifts reported for chromatic contrast habituation.
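
    For contrast with the paper's method, here is a minimal sketch of ordinary (linear) canonical correlation analysis with scikit-learn on synthetic data; the proposed higher-order variant additionally links independent components across data sets via higher-order correlations, which this sketch does not do:

    ```python
    # Ordinary CCA: find paired projections of two data sets whose
    # correlation is maximal.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    shared = rng.normal(size=(500, 2))    # latent structure common to both sets
    X = shared @ rng.normal(size=(2, 10)) + 0.5 * rng.normal(size=(500, 10))
    Y = shared @ rng.normal(size=(2, 8)) + 0.5 * rng.normal(size=(500, 8))

    cca = CCA(n_components=2).fit(X, Y)
    Xc, Yc = cca.transform(X, Y)
    for k in range(2):
        r = np.corrcoef(Xc[:, k], Yc[:, k])[0, 1]
        print(f"component {k}: canonical correlation = {r:.2f}")
    ```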

  5. Weather and Housing in North America

    • kaggle.com
    zip
    Updated Feb 13, 2023
    Cite
    The Devastator (2023). Weather and Housing in North America [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-and-housing-in-north-america
    Explore at:
    zip (512280 bytes; available download formats)
    Dataset updated
    Feb 13, 2023
    Authors
    The Devastator
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Area covered
    North America
    Description

    Weather and Housing in North America

    Exploring the Relationship between Weather and Housing Conditions in 2012

    By [source]

    About this dataset

    This comprehensive dataset explores the relationship between housing and weather conditions across North America in 2012. Through a range of climate variables such as temperature, wind speed, humidity, pressure, and visibility, it provides unique insights into the weather-influenced environment of numerous regions. The interrelated nature of housing parameters such as longitude, latitude, median income, median house value, and ocean proximity further enhances our understanding of how distinct climates play an integral part in area real estate valuations. Analyzing these two data sets offers a wealth of knowledge about which factors can dictate the value and comfort level offered by residential areas throughout North America.


    How to use the dataset

    This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.

    First, start by examining descriptive statistics (i.e., mean, median, mode). This can help show you the general trend and distribution of each variable in this dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? By looking at descriptive statistics, you can get an initial idea of how various weather conditions and housing attributes interact with one another.

    Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.

    Finally, use visualizations to further investigate the relationship between climate and housing attributes in North America in 2012. Graphs let you see trends like seasonal variations or long-term changes over time more easily, so they are useful for interpreting large amounts of data quickly while providing context beyond what numbers alone can tell us about relationships between different aspects of this dataset.
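
    A minimal pandas sketch of the first two steps (descriptive statistics, then correlations), assuming the Weather.csv file documented in the column table below:

    ```python
    # Descriptive statistics and pairwise correlations for the weather table.
    import pandas as pd

    weather = pd.read_csv("Weather.csv")

    print(weather.describe())                               # mean, std, quartiles
    print(weather.select_dtypes("number").corr().round(2))  # Pearson correlations
    ```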

    Research Ideas

    • Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.
    • Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.
    • Utilizing differences in weather patterns across different climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure, and visibility from season to season, an investor could gain valuable insights into seasonal market trends to maximize their profits from rentals or Airbnb listings over time.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: Weather.csv

    | Column name | Description |
    |:---|:---|
    | Date/Time | Date and time of the observation. (Date/Time) |
    | Temp_C | Temperature in Celsius. (Numeric) |
    | Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) |
    | Rel Hum_% | Relative humidity in percent. (Numeric) |
    | Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) |
    | Visibility_km | Visibilit... |

  6. COVID-19 Combined Data-set with Improved Measurement Errors

    • data.mendeley.com
    Updated May 13, 2020
    Cite
    Afshin Ashofteh (2020). COVID-19 Combined Data-set with Improved Measurement Errors [Dataset]. http://doi.org/10.17632/nw5m4hs3jr.3
    Explore at:
    Dataset updated
    May 13, 2020
    Authors
    Afshin Ashofteh
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Public health-related decision-making on policies aimed at controlling the COVID-19 pandemic outbreak depends on complex epidemiological models that are compelled to be robust and use all relevant available data. This data article provides a new combined worldwide COVID-19 dataset obtained from official data sources with improved systematic measurement errors and a dedicated dashboard for online data visualization and summary. The dataset adds new measures and attributes to the normal attributes of official data sources, such as daily mortality, and fatality rates. We used comparative statistical analysis to evaluate the measurement errors of COVID-19 official data collections from the Chinese Center for Disease Control and Prevention (Chinese CDC), World Health Organization (WHO) and European Centre for Disease Prevention and Control (ECDC). The data is collected by using text mining techniques and reviewing pdf reports, metadata, and reference data. The combined dataset includes complete spatial data such as countries area, international number of countries, Alpha-2 code, Alpha-3 code, latitude, longitude, and some additional attributes such as population. The improved dataset benefits from major corrections on the referenced data sets and official reports such as adjustments in the reporting dates, which suffered from a one to two days lag, removing negative values, detecting unreasonable changes in historical data in new reports and corrections on systematic measurement errors, which have been increasing as the pandemic outbreak spreads and more countries contribute data for the official repositories. Additionally, the root mean square error of attributes in the paired comparison of datasets was used to identify the main data problems. The data for China is presented separately and in more detail, and it has been extracted from the attached reports available on the main page of the CCDC website. This dataset is a comprehensive and reliable source of worldwide COVID-19 data that can be used in epidemiological models assessing the magnitude and timeline for confirmed cases, long-term predictions of deaths or hospital utilization, the effects of quarantine, stay-at-home orders and other social distancing measures, the pandemic’s turning point or in economic and social impact analysis, helping to inform national and local authorities on how to implement an adaptive response approach to re-opening the economy, re-open schools, alleviate business and social distancing restrictions, design economic programs or allow sports events to resume.
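
    The paired-comparison idea is simple to sketch; here is a minimal example (column names are illustrative, not the dataset's actual schema) of computing the RMSE of one attribute between two sources aligned by country and date:

    ```python
    # RMSE of daily new cases, WHO vs ECDC, after aligning the two sources.
    import numpy as np
    import pandas as pd

    who = pd.read_csv("who_daily.csv")     # hypothetical per-source extracts
    ecdc = pd.read_csv("ecdc_daily.csv")

    merged = who.merge(ecdc, on=["country", "date"], suffixes=("_who", "_ecdc"))
    diff = merged["new_cases_who"] - merged["new_cases_ecdc"]
    print(f"RMSE: {np.sqrt(np.mean(diff ** 2)):.1f}")
    ```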

  7. Student Performance

    • kaggle.com
    zip
    Updated Oct 7, 2022
    + more versions
    Cite
    Aman Chauhan (2022). Student Performance [Dataset]. https://www.kaggle.com/datasets/whenamancodes/student-performance
    Explore at:
    zip (106753 bytes; available download formats)
    Dataset updated
    Oct 7, 2022
    Authors
    Aman Chauhan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    These data describe student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st- and 2nd-period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
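
    A minimal sketch verifying the G1/G2/G3 correlation noted above (assuming the Maths.csv file listed below; the original UCI release uses ';' as the field separator, which may or may not apply to this copy):

    ```python
    # Correlation matrix of the three period grades.
    import pandas as pd

    df = pd.read_csv("Maths.csv", sep=";")  # adjust sep if the file is comma-separated
    print(df[["G1", "G2", "G3"]].corr().round(2))
    ```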

    Attributes for both Maths.csv (Math course) and Portuguese.csv (Portuguese language course) datasets:

    | Column | Description |
    |:---|:---|
    | school | student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) |
    | sex | student's sex (binary: 'F' - female or 'M' - male) |
    | age | student's age (numeric: from 15 to 22) |
    | address | student's home address type (binary: 'U' - urban or 'R' - rural) |
    | famsize | family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) |
    | Pstatus | parent's cohabitation status (binary: 'T' - living together or 'A' - apart) |
    | Medu | mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
    | Fedu | father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
    | Mjob | mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
    | Fjob | father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
    | reason | reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') |
    | guardian | student's guardian (nominal: 'mother', 'father' or 'other') |
    | traveltime | home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) |
    | studytime | weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) |
    | failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
    | schoolsup | extra educational support (binary: yes or no) |
    | famsup | family educational support (binary: yes or no) |
    | paid | extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) |
    | activities | extra-curricular activities (binary: yes or no) |
    | nursery | attended nursery school (binary: yes or no) |
    | higher | wants to take higher education (binary: yes or no) |
    | internet | Internet access at home (binary: yes or no) |
    | romantic | with a romantic relationship (binary: yes or no) |
    | famrel | quality of family relationships (numeric: from 1 - very bad to 5 - excellent) |
    | freetime | free time after school (numeric: from 1 - very low to 5 - very high) |
    | goout | going out with friends (numeric: from 1 - very low to 5 - very high) |
    | Dalc | workday alcohol consumption (numeric: from 1 - very low to 5 - very high) |
    | Walc | weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) |
    | health | current health status (numeric: from 1 - very bad to 5 - very good) |
    | absences | number of school absences (numeric: from 0 to 93) |

    These grades are related to the course subject, Math or Portuguese:

    | Grade | Description |
    |:---|:---|
    | G1 | first period grade (numeric: from 0 to 20) |
    | G2 | second period grade (numeric: from 0 to 20) |
    | G3 | final grade (numeric: from 0 to 20, output target) |


  8. Employee Attrition and Factors

    • kaggle.com
    Updated Feb 11, 2023
    Cite
    The Devastator (2023). Employee Attrition and Factors [Dataset]. https://www.kaggle.com/datasets/thedevastator/employee-attrition-and-factors
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Employee Attrition and Factors

    Examining Performance, Financials, and Job Role for Impact on Retention

    By [source]

    About this dataset

    This dataset offers a comprehensive and varied analysis of an organization's employees, focusing on employee attrition, personal and job-related factors, and financials. It includes parameters such as Age, Gender, Marital Status, Business Travel Frequency, Daily Rate of Pay, Department, Distance From Home, and Education Level, along with job-related parameters such as Job Involvement, Job Level (relative to similar roles within the same organization), Job Role, and total working hours per week/month/year, whether overtime or standard hours for a given role. Further details include Percent Salary Hike during tenure (from promotion or otherwise), Performance Rating based on criteria established by leadership, Relationship Satisfaction among peers at the workplace (also taking into account outside family members who can influence stress levels in varying capacities), Monthly Income at hire compared against the current monthly pay rate with overtime hours included where applicable, and the Number of Companies Worked for previously, if any. Lastly, Retirement Status, commonly known as Attrition, indicates whether the employee intended to stay with one employer through retirement age or left earlier, whether or not for reasons within their control. Through this dataset you can gain insight into major aspects of today's workforce-management philosophies, which have changed drastically over time due to advancements in technology.


    How to use the dataset

    • Understand the variables that make up this dataset. It includes several personal and job-related variables such as Age, Gender, Marital Status, Business Travel, Daily Rate, Department, Distance From Home, Education, Education Field, Employee Count, Employee Number, Environment Satisfaction, and Hourly Rate. Knowing what each variable means individually will help when exploring employee attrition as a whole.
    • Analyze the data for patterns as well as outliers or anomalies, either at an individual level or across all data points together. Identifying these patterns or discrepancies can offer insight into factors related to employee attrition.
    • Visualize the data using charts and graphs for easier understanding of which relationships might be causing higher levels of employees leaving over time. Dimensions like age or job role can be key factors in attrition rates, and visually displaying how they relate to one another can clarify what needs to change within an organization to reduce attrition.
    • Explore relationships between pairs of variables through correlation analysis. Correlations measure how strongly two variables are related; when studying retention, it is important to analyze correlations both at an individual level and across all variables together, showing which pairings most influence employee decisions (see the sketch after this list).
    • Use descriptive analytics methods such as scatter plots, histograms, and boxplots with aggregated values from each field (average age, average monthly income, etc.). These analytics help build a deeper understanding of where changes need to be made internally.
    • Utilize predictive analytics with more advanced techniques such as regression, clustering, or decision trees to identify trends from past data points, then build models on those insights from different perspectives, helping prepare the organization against potentially high levels of employee departures.
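
    A minimal sketch of that correlation step (assuming a CSV with an "Attrition" column coded Yes/No; the file and column names are illustrative):

    ```python
    # Rank numeric features by the strength of their correlation with attrition.
    import pandas as pd

    df = pd.read_csv("employee_attrition.csv")             # hypothetical file name
    df["AttritionFlag"] = (df["Attrition"] == "Yes").astype(int)

    corr = (
        df.select_dtypes("number")
          .corrwith(df["AttritionFlag"])
          .drop("AttritionFlag")
          .sort_values(key=abs, ascending=False)
    )
    print(corr.head(10))
    ```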

    Research Ideas

    • Identifying performance profiles of employees at risk for attrition through predictive analytics and using this insight to create personalized development plans or retention strategies.
    • Using the data to assess the impact of different financial incentives or variations in job role/structure on employee attitudes, satisfaction and ultimately attrition rates.
    • Analyzing different age groups' responses to various perks or turnover patterns in order to understand how organizations can better engage different demographic segments

    Acknowledgements

    If you use this dataset in your research, pl...

  9. Computing a correlation length scale from MFLL-OCO2 CO2 differences, and...

    • zenodo.org
    • data.niaid.nih.gov
    • +1 more
    application/gzip, bin +1
    Updated Feb 10, 2021
    Cite
    David F Baker; David F Baker; Emily Bell; Kenneth J. Davis; Joel F. Campbell; Bing Lin; Jeremy Dobler; Emily Bell; Kenneth J. Davis; Joel F. Campbell; Bing Lin; Jeremy Dobler (2021). Computing a correlation length scale from MFLL-OCO2 CO2 differences, and accounting for correlated errors when assimilating OCO-2 data [Dataset]. http://doi.org/10.5281/zenodo.4399884
    Explore at:
    bin, txt, application/gzip (available download formats)
    Dataset updated
    Feb 10, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David F Baker; David F Baker; Emily Bell; Kenneth J. Davis; Joel F. Campbell; Bing Lin; Jeremy Dobler; Emily Bell; Kenneth J. Davis; Joel F. Campbell; Bing Lin; Jeremy Dobler
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains code and data used in 'A new exponentially-decaying error correlation model for assimilating OCO-2 column-average CO2 data, using a length scale computed from airborne lidar measurements' by David F. Baker, Emily Bell, Kenneth J. Davis, Joel F. Campbell, Bing Lin, and Jeremy Dobler, submitted to Geoscientific Model Development.

    In particular, the MATLAB script used to compute the autocorrelation spectrum of Multi-functional Fiber Laser LiDAR (MFLL) and Orbiting Carbon Observatory (OCO-2) column CO2 differences (in Section 2.2 of the paper) is given here as file

    comp_MFLL_OCO2_autocorrl_spectrum.m

    along with the needed MFLL and OCO-2 CO2 data for each of the six flights analyzed (as described in Section 2.1 of the paper) in files:

    20160727_mfll_averaged_L1_RA_GMAO_ACTadj.h5
    20160805_mfll_averaged_L1_RA_GMAO_ACTadj.h5
    20170215_mfll_averaged_L1_RA_GMAO_ACTadj.h5
    20170308_mfll_averaged_L1_RA_GMAO_ACTadj.h5
    20171022_mfll_averaged_L1_RA_GMAO_ACTadj.h5
    20171027_mfll_averaged_L1_RA_GMAO_ACTadj.h5
    20160727_oco_averaged_B9_GMAO_ACTadj.h5
    20160805_oco_averaged_B9_GMAO_ACTadj.h5
    20170215_oco_averaged_B9_GMAO_ACTadj.h5
    20170308_oco_averaged_B9_GMAO_ACTadj.h5
    20171022_oco_averaged_B9_GMAO_ACTadj.h5
    20171027_oco_averaged_B9_GMAO_ACTadj.h5

    The MFLL data given here were downloaded in late 2018 in the form of L1b files (calibrated radiances), as described in Bell et al. (2020).

    In the second part of the paper, different error correlation models are presented and applied to the averaging of OCO-2 column CO2 data. The original bias-corrected OCO-2 data, in the form of daily OCO-2 version 10 "Lite" files, were obtained from NASA's GES DISC data repository, here:
    https://disc.gsfc.nasa.gov/datasets/OCO2_L2_Lite_FP_10r/summary?keywords=OCO2_L2_Lite_FP

    The bias-corrected column CO2 retrievals, their uncertainties, and other parameters needed for this analysis were extracted from these
    "Lite" files and saved to daily files, which have been packaged up in the following compressed tarball:
    OCO2_XCO2_2014_2020.tar.gz

    These daily files are read in and averaged across 2-second and 10-second spans (as described in Section 3.5 of the paper), using the different error correlation models outlined in the paper. The code that implements these averages is given in the following two FORTRAN programs:

    Make_OCO2_2sec_averages.f90
    Make_OCO2_10sec_averages.f90

    which need the following list of days having good OCO-2 data:

    OCO2_dates.txt

    Program "Make_OCO2_2sec_averages.f90" averages the OCO-2 data across a 2-second (~13.5 km long) span along the groundtrack, collapsing the relatively thin data swath into a one-dimensional data record, upon which the one-dimensional averaging models describe in Sections 3.1 and 3.2 of the paper may be applied. Program "Make_OCO2_10sec_averages.f90" implements these averaging models, which average the 2-second averages across longer, 10-second (~67.5 km) spans. Please see the manuscript for more information on the data and methods provided here.

  10. Two-time correlation function based on speckle patterns from x-ray photon...

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1 more
    txt
    Updated Aug 2, 2024
    Cite
    Birte Riechers; Birte Riechers; Robert Maaß; Robert Maaß (2024). Two-time correlation function based on speckle patterns from x-ray photon correlation spectroscopy associated with "Intermittent cluster dynamics and temporal fractional diffusion in a bulk metallic glass" (scientific article published in Nature Communications, 2024) [Dataset]. http://doi.org/10.5281/zenodo.12684513
    Explore at:
    txt (available download formats)
    Dataset updated
    Aug 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Birte Riechers; Birte Riechers; Robert Maaß; Robert Maaß
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset consists of contrast data, i.e., the two-time correlation function, based on speckle patterns measured at the 8ID-E beamline of the Advanced Photon Source at Argonne National Laboratory.

    Experimental details are stated in the paper specified under "related work" and in the accompanying supplementary information.

    You are welcome to use this dataset in compliance with the CC BY 4.0 licence assigned to this dataset.

    Any questions regarding the data can be addressed to birte.riechers@bam.de who would also appreciate a note if you find the data useful.


    The data consists of 32 text files in total, which correspond to the main and lower panels of Figure 2 of the main publication.

    30 of these text files are contrast data, named "contrast_DT250s_nn.text" with "nn" as the identifier of consecutive data sets running from 1 to 30. Each data set consists of p rows and q columns; DT250s denotes the time resolution of the data points, which is 250 s along both row and column values.

    The data set called "Time_Contrast_1to30s.txt" states the start time in seconds of the first data point of each of the thirty contrast data sets.

    The data set called "ScatteredIntensity.txt" states the scattered intensity at full time resolution, i.e. 2.5 s.

    The files are plain text files with the data points separated by "space" along rows and "new line" along columns.
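
    A minimal sketch for loading the files under the layout described above (the exact zero-padding of the "nn" identifier is not specified, so the first file name below is illustrative):

    ```python
    # Load one contrast block plus the time index and intensity record.
    import numpy as np

    contrast = np.loadtxt("contrast_DT250s_01.text")   # p x q block; adjust "01" to the nn identifier
    t_start = np.loadtxt("Time_Contrast_1to30s.txt")   # start time (s) of each of the 30 blocks
    intensity = np.loadtxt("ScatteredIntensity.txt")   # scattered intensity at 2.5 s resolution

    rows, cols = contrast.shape
    t1 = t_start[0] + 250.0 * np.arange(rows)          # 250 s per step along rows
    t2 = t_start[0] + 250.0 * np.arange(cols)          # ... and along columns
    print(contrast.shape, t1[0], t2[-1])
    ```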

  11. Behavioral Risk Factors: HRQOL

    • kaggle.com
    zip
    Updated Jan 21, 2023
    Cite
    The Devastator (2023). Behavioral Risk Factors: HRQOL [Dataset]. https://www.kaggle.com/datasets/thedevastator/behavioral-risk-factors-hrqol/suggestions
    Explore at:
    zip (2247473 bytes; available download formats)
    Dataset updated
    Jan 21, 2023
    Authors
    The Devastator
    Description

    Behavioral Risk Factors: HRQOL

    Analyzing Health-Related Quality of Life in the United States

    By Health [source]

    About this dataset

    The Behavioral Risk Factor Surveillance System (BRFSS) is an annual state-based, telephone survey of adults in the United States. It collects a variety of health-related data, including Health Related Quality of Life (HRQOL). This dataset contains results from the HRQOL survey within a range of locations across the US for the year indicated.

    This dataset includes 14 columns which summarize and quantify different aspects of HRQOL topics. The year, location abbreviation, location description, and geo-location provide background context that defines each row. The question column records the survey question posed to respondents, while the category column classifies it into overarching groupings. Additionally, there are columns covering sample size and data-value attributes such as standard error, unit, and type, all of which chip away at informative insights into how Americans' quality of life is changing over time, cleverly presented in this one concise dataset!


    How to use the dataset

    In order to analyze this dataset, it is important to have a good understanding of the columns included in it. The columns provide various pieces of information such as the year collected, location abbreviation, location name, and the type of data value collected. Understanding what each column means is essential for proper interpretation and analysis; for example, knowing that 'Data_Value %' indicates what percentage responded a certain way, or that 'Sample_Size' shows how many people were surveyed, can help you make better decisions when looking for patterns within the data set.

    Once you understand the general structure of this dataset, you should also familiarize yourself with some basic statistical analysis tools, such as mean/median/mode calculations and comparative/correlative analysis, so you can gain real insight into how health-related quality of life affects different populations across countries or regions. To get even more meaningful results, you might also consider adding other variables or datasets that correlate with HRQOL, like poverty rate or average income level, so you can draw clearer conclusions about potential contributing factors to the insights you uncover while using this dataset alone.
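
    A minimal sketch of such a summary (column names follow the table below; 'Data_Value' and 'Sample_Size' are assumed from the description):

    ```python
    # Mean reported HRQOL value per state, with total sample sizes.
    import pandas as pd

    df = pd.read_csv("rows.csv")
    summary = (
        df.groupby("LocationDesc")
          .agg(mean_value=("Data_Value", "mean"), total_n=("Sample_Size", "sum"))
          .sort_values("mean_value", ascending=False)
    )
    print(summary.head())
    ```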

    Research Ideas

    • Identifying trends between geolocation and health-related quality of life indicators to better understand how environmental factors may impact specific communities.
    • Visualizing the correlations between health-related quality of life variables across different locations over time to gain insights on potential driving developmental or environmental issues.
    • Monitoring the effects of public health initiatives dealing with qualitative health data such as those conducted by CDC, Department of Health and Human Services, and other organizations by tracking changes in different aspects of HRQOL measures over time across multiple locations

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    See the dataset description for more information.

    Columns

    File: rows.csv

    | Column name | Description |
    |:---|:---|
    | Year | Year when the data was collected. (Integer) |
    | LocationAbbr | Abbreviations of various locations where data was recorded. (String) |
    | LocationDesc | Full names of states whose records are included in this survey. (String) |
    | Category | Particular topic chosen for research such as "Healthy People 2010 Topics" or "Older Adults Issues". (String) |
    | Question | Each question corresponds to metrics tracked within each topic. (String) |
    | DataSource | Source from which survey responses were collected. (String) |
    | Data_Value_Unit | Units taken for recording survey types... |

  12. Student Performance in Secondary Education

    • kaggle.com
    zip
    Updated Sep 4, 2025
    Cite
    Adil Shamim (2025). Student Performance in Secondary Education [Dataset]. https://www.kaggle.com/datasets/adilshamim8/personalized-learning-and-adaptive-education-dataset
    Explore at:
    zip (12345 bytes; available download formats)
    Dataset updated
    Sep 4, 2025
    Authors
    Adil Shamim
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains information on secondary school student performance collected from two Portuguese schools. It was originally introduced by Cortez & Silva in the paper "Using Data Mining to Predict Secondary School Student Performance."

    The data was gathered through school reports and student questionnaires, covering demographic, social, and academic-related variables. Two separate datasets are provided:

    • student-mat.csv → Math course performance
    • student-por.csv → Portuguese language course performance

    Number of instances: 395 (Mathematics) + 649 (Portuguese)
    Number of features: 30 input variables + 3 grade outputs (G1, G2, G3)
    Target variable: G3 (final grade, 0–20 scale)
    Missing values: none

    Objective

    The main goal is to predict student academic success, especially the final grade G3.

    • Since G1 (first period grade) and G2 (second period grade) are highly correlated with G3, experiments can be designed with or without these features:

      • Easier task → Predicting G3 using G1 and G2
      • Harder task (more useful) → Predicting G3 without G1 and G2

    This dataset is suitable for:

    • Regression (predicting numeric grades)
    • Classification (e.g., pass/fail, grade levels)
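
    A minimal sketch contrasting the easier and harder tasks described above (assuming the original UCI file student-mat.csv, which uses ';' separators):

    ```python
    # Cross-validated R^2 for predicting G3 with and without G1/G2.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("student-mat.csv", sep=";")
    X = pd.get_dummies(df.drop(columns=["G3"]))   # one-hot encode categoricals
    y = df["G3"]

    for label, drop in [("with G1, G2", []), ("without G1, G2", ["G1", "G2"])]:
        scores = cross_val_score(
            RandomForestRegressor(n_estimators=200, random_state=0),
            X.drop(columns=drop), y, cv=5, scoring="r2",
        )
        print(f"{label}: mean R^2 = {scores.mean():.2f}")
    ```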

    Features

    The dataset includes 30 attributes from multiple categories:

    • Demographics: sex, age, address, famsize, Pstatus
    • Parental background: Medu, Fedu, Mjob, Fjob, guardian
    • School-related: school, reason, traveltime, studytime, failures
    • Support systems: schoolsup, famsup, paid, activities, nursery, higher, internet
    • Lifestyle & social: romantic, freetime, goout, Dalc, Walc, health, absences
    • Performance indicators: G1, G2, G3

    Key Insights

    • Strong G1/G2 ↔ G3 correlation: Final grade is heavily dependent on earlier grades.
    • Student overlap: 382 students appear in both datasets (Math & Portuguese), identifiable by matching attributes.
    • No missing data: Dataset is clean and ready for modeling.

    Use Cases

    • Predicting final grades for early intervention.
    • Identifying at-risk students who may need extra support.
    • Exploring socio-economic and lifestyle factors influencing education.
    • Testing feature engineering and model comparison strategies.

    Reference

    • Cortez, P., & Silva, A. Using Data Mining to Predict Secondary School Student Performance. Proceedings of the 5th Annual Future Business Technology Conference.

    This dataset is a playground for classification & regression tasks, ideal for experimenting with feature selection, ensemble methods, and interpretable ML approaches.

  13. Student Score - Hypothesis Testing (T Test)

    • kaggle.com
    zip
    Updated Sep 21, 2023
    Cite
    vikram amin (2023). Student Score - Hypothesis Testing (T Test) [Dataset]. https://www.kaggle.com/datasets/vikramamin/student-score-hypothesis-testing-t-test
    Explore at:
    zip (7328 bytes; available download formats)
    Dataset updated
    Sep 21, 2023
    Authors
    vikram amin
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    • Data cleaning
    • Convert data types of the required variables
    • Load libraries: dplyr, ggplot2, tidyverse, tidyr
    • Find out the count of male vs. female students
    • Keep only two columns, 'Sex' and 'G3', and remove the other columns

    • t = -2.0651 is the test statistic: the distance of the observed difference from 0, in standard-error units
    • df = 390.57 is the Welch-approximated degrees of freedom, related to the sample size: how many free data points are available for making comparisons
    • p-value = 0.03958 is the probability value; since it is less than alpha (0.05), we can reject the null hypothesis, so the result is statistically significant
    • The 95% confidence interval suggests that the true difference in means lies between -1.85 and -0.04 (with 95% confidence)
    • The difference in means between the two groups is 10.91 - 9.96 = 0.95
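
    The output interpreted above comes from an R t.test call; a minimal Python analogue (assuming the UCI file naming, student-mat.csv with ';' separators; this Kaggle copy may differ) is:

    ```python
    # Welch two-sample t-test of final grade G3 by sex.
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("student-mat.csv", sep=";")
    female = df.loc[df["sex"] == "F", "G3"]
    male = df.loc[df["sex"] == "M", "G3"]

    # equal_var=False gives Welch's test, matching the fractional df above.
    t, p = stats.ttest_ind(female, male, equal_var=False)
    print(f"t = {t:.4f}, p = {p:.5f}")
    print(f"female mean = {female.mean():.2f}, male mean = {male.mean():.2f}")
    ```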


    • Both the histogram and the density plot indicate that there are students who got 0. Could this be due to non-attendance of exams? Let us find out the number of students who got 0.


    • 38 students out of 395 (9.62%) got a score of 0.
    • Let us check the means for both groups after removing the students who got zeros.
    • We created a new data frame, student2, which includes a total of 357 students with no zero marks.


    • Conclusion:
    • The mean for females is 11.20 and for males 11.86. The difference in means between the two groups is 0.66, compared with the earlier mean difference of 0.95.
    • The p-value is 0.05335. To reject the null hypothesis, the p-value should be less than 0.05.
    • Therefore the difference is not statistically significant at the 0.05 level.
  14. Music & Affect 2020 Dataset Study 2.csv

    • psycharchives.org
    Updated Sep 17, 2020
    + more versions
    Cite
    (2020). Music & Affect 2020 Dataset Study 2.csv [Dataset]. https://www.psycharchives.org/handle/20.500.12034/3089
    Explore at:
    Dataset updated
    Sep 17, 2020
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Dataset for: Leipold, B. & Loepthien, T. (2021). Attentive and emotional listening to music: The role of positive and negative affect. Jahrbuch Musikpsychologie, 30. https://doi.org/10.5964/jbdgm.78 In a cross-sectional study, associations of global affect with two ways of listening to music, attentive-analytical listening (AL) and emotional listening (EL), were examined; more specifically, the degrees to which AL and EL are differentially correlated with positive and negative affect. In Study 1, a sample of 1,291 individuals responded to questionnaires on listening to music, positive affect (PA), and negative affect (NA). We used the PANAS, which measures PA and NA as high-arousal dimensions. AL was positively correlated with PA, EL with NA. Moderation analyses showed stronger associations between PA and AL when NA was low. Study 2 (499 participants) differentiated between three facets of affect and focused, in addition to PA and NA, on the role of relaxation. Similar to the findings of Study 1, AL was correlated with PA, and EL with NA and PA. Moderation analyses indicated that the degree to which PA is associated with an individual's tendency to listen to music attentively depends on their degree of relaxation. In addition, the correlation between pleasant activation and EL was stronger for individuals who were more relaxed; for individuals who were less relaxed, the correlation between unpleasant activation and EL was stronger. In sum, the results demonstrate not only simple bivariate correlations, but also that the expected associations vary depending on the different affective states. We argue that the results reflect a dual function of listening to music, which includes emotional regulation and information processing. (Dataset for Study 2.)
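
    The moderation analyses described above amount to testing an interaction term in a regression; a minimal sketch with statsmodels (file and column names are illustrative, not the archive's actual variable names):

    ```python
    # Does relaxation moderate the association between positive affect (PA)
    # and emotional listening (EL)? The PA:relaxation term carries the effect.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("music_affect_study2.csv")   # hypothetical file name
    model = smf.ols("EL ~ PA * relaxation", data=df).fit()
    print(model.summary())
    ```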

  15. Data from: Temporal changes in taxon abundances are positively correlated...

    • borealisdata.ca
    • datasetcatalog.nlm.nih.gov
    Updated Nov 14, 2024
    Cite
    Gavia Lertzman-Lepofsky; Aleksandra Dolezal; Mia Waters; Alexandre Fuster-Calvo; Emily Black; Stephanie Flaman; Samantha Straus; Ryan Langendorf; Isaac Eckert; Sophia Fan; Haley Branch; Nathalie Chardon; Courtney G. G. Collins (2024). Temporal changes in taxon abundances are positively correlated but poorly predicted at the global scale [Dataset]. http://doi.org/10.5683/SP3/FV7PTK
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 14, 2024
    Dataset provided by
    Borealis
    Authors
    Gavia Lertzman-Lepofsky; Aleksandra Dolezal; Mia Waters; Alexandre Fuster-Calvo; Emily Black; Stephanie Flaman; Samantha Straus; Ryan Langendorf; Isaac Eckert; Sophia Fan; Haley Branch; Nathalie Chardon; Courtney G. G. Collins
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Dataset funded by
    Swiss National Science Foundation
    Canadian Institute of Ecology and Evolution
    Description

    Abstract: Linking changes in taxon abundance to biotic and abiotic drivers over space and time is critical for understanding biodiversity responses to global change. Furthermore, deciphering temporal trends in relationships among taxa, including correlated abundance changes (e.g. synchrony), can facilitate predictions of future shifts. However, the drivers of these correlated changes over large scales are complex and understudied, impeding our ability to predict shifts in ecological communities. We use two global datasets containing abundance time-series (BioTIME) and biotic interactions (GloBI) to quantify correlations among yearly changes in the abundance of pairs of geographically proximal taxa (genus pairs). We use a hierarchical linear model and cross-validation to test the overall magnitude, direction, and predictive accuracy of correlated abundance changes among genera at the global scale. We then test how correlated abundance changes are influenced by latitude, biotic interactions, disturbance, and time-series length while accounting for differences among studies and taxonomic categories. We find that abundance changes between genus pairs are, on average, positively correlated over time, suggesting synchrony at the global scale. Furthermore, we find that abundance changes are more positively correlated in longer time-series, between genera with known biotic interactions, and in disturbed habitats. However, these ecological drivers alone are relatively weak, with model predictive accuracy increasing approximately two-fold with the inclusion of study identity and taxonomic category. This suggests that while patterns in abundance correlations are shaped by ecological drivers at the global scale, these drivers have limited utility in forecasting changes in abundances among unknown taxa or in the context of future global change. Our study indicates that including taxonomy and known ecological drivers can improve predictions of biodiversity loss over large spatial and temporal scales, but also that idiosyncrasies of different studies continue to weaken our ability to make global predictions. Methods: This dataset was collected by downloading and curating existing data from BioTIME and GloBI. The BioTIME data were filtered (see Figure 2 of the manuscript) by excluding biomass, marine, and aquatic surveys and aggregating abundance to the genus level per plot. We subset the data to include only time series that contain 10+ consecutive overlapping years. For each genus, we calculated the log proportional change in abundance at each time step to remove temporal autocorrelation. We used GloBI to identify whether there are known interactions between each genus pair.
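
    A minimal sketch of the log proportional change used in the methods (for each genus, the change at each time step is log(N_{t+1}/N_t), and pairwise synchrony is the correlation of these changes; the abundances below are toy data):

    ```python
    # Log proportional changes and their pairwise correlation.
    import numpy as np

    genus_a = np.array([12.0, 15.0, 9.0, 11.0, 11.0])   # yearly abundances
    genus_b = np.array([30.0, 33.0, 21.0, 26.0, 27.0])

    d_a = np.diff(np.log(genus_a))   # log(N_{t+1} / N_t)
    d_b = np.diff(np.log(genus_b))

    r = np.corrcoef(d_a, d_b)[0, 1]
    print(f"correlated abundance change: r = {r:.2f}")
    ```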

  16. Collection of bibliography included in the study; Co-occurrence matrix set...

    • scidb.cn
    Updated Mar 29, 2023
    Cite
    Wu Shengnan (2023). Collection of bibliography included in the study; Co-occurrence matrix set of subject words included in the study; Opportunity code; Trust code; Triple, Code of open triangle and closed triangle; Code run and software calculation result set [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00224
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 29, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Wu Shengnan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

This study took the literature on the adverse drug reactions of metformin published from 1991 to 2020 as its data source, divided the period into three-year segments, and collected the bibliographic records for each segment (see "Collection of bibliographies included in the study.zip"). Subject words were then extracted with the Bicomb 2021 software to construct co-occurrence matrices, yielding 10 matrices in total (see "Co-occurrence matrix set of subject words included in the study.zip"). The 10 matrices were fed into self-designed Python and R code (see "Opportunity Code.zip", "Trust Code.zip", and "Code of open triangle and closed triangle.zip") to obtain the opportunity value, trust value, and number of edge triangles of each node pair in the 10 networks. Gephi 0.9.7 was used to calculate the motivation value of each node pair. Excel was used to calculate the global clustering coefficient of each network and the edge clustering coefficient of each node pair, and to build scatter diagrams of node-pair opportunity, trust, and motivation values against edge clustering coefficients. SPSS was used to calculate the correlations of node-pair opportunity, trust, and motivation values with the edge clustering coefficient and with the number of closed triangles (see "Code run and software calculation result set.zip").
In addition, the bibliographic records from 2000 to 2009 were used to build panel data (see "Collection of bibliographies included in the study.zip"). The same Python and R code again produced the opportunity value, trust value, and number of edge triangles of each node pair in the 10 overlapping-window networks; Gephi 0.9.7 supplied the motivation value, closeness centrality, betweenness centrality, eigenvector centrality, and average path length of node pairs; and these were imported into Stata/MP 17.0 to estimate the relationships between node attributes and network characteristics (see "Code run and software calculation result set.zip").
The contents of each archive are as follows:
1. Collection of bibliographies included in the study.zip: two folders, named "literature collection 1991-2020" and "literature collection 2000-2009". The former stores the bibliographic data of the 10 time segments from 1991 to 2020; the latter stores the bibliographic data of the 10 overlapping windows from 2000 to 2009.
2. Co-occurrence matrix set of subject words included in the study.zip: two folders, one holding the subject-word co-occurrence matrices of the 10 time segments from 1991 to 2020 and one holding those of the 10 overlapping windows from 2000 to 2009. In each matrix, the first row and first column contain the subject words, and each number is the co-occurrence count of the corresponding word pair.
3. Opportunity Code.zip: code for calculating the opportunity value of each node pair. The input is a co-occurrence matrix in .csv format.
4. Trust Code.zip: code for calculating the trust value of each node pair. The input is a co-occurrence matrix in .csv format.
5. Code of open triangle and closed triangle.zip: code for calculating the number of closed and open triangles on the edge of each node pair. The input is a co-occurrence matrix in .csv format.
6. Code run and software calculation result set.zip: two folders, named "1991-2020 calculation results" and "2000-2009 calculation results". The former stores the calculation results and scatter diagrams of the 10 time segments from 1991 to 2020; taking 1991-1993 as an example, the first row of each table labels the opportunity, comprehensive trust, motivation, edge clustering coefficient, and number of closed triangles, and the end of each table gives the means of opportunity, trust, and motivation together with their Pearson correlation coefficients with the edge clustering coefficient and the number of closed triangles. The latter stores the panel data, the opportunity, trust, and motivation values, and the Stata estimates of the relationships between node attributes and network characteristics.
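The study's own Python and R code is not reproduced in this description, but the triangle counts it refers to are standard network quantities. Below is a minimal Python sketch, not the authors' code, that counts the closed and open triangles on each edge of the binarised co-occurrence network; the input layout follows the description above (first row and column hold the subject words), while the file name and the exact open-triangle convention are assumptions.

    # Hedged sketch: closed/open triangles per edge of a co-occurrence network.
    # Assumes a .csv matrix whose first row/column are subject words (file name hypothetical).
    import numpy as np
    import pandas as pd

    cooc = pd.read_csv("cooc_matrix_1991_1993.csv", index_col=0)  # hypothetical file
    A = (cooc.values > 0).astype(int)   # an edge exists wherever two words co-occur
    np.fill_diagonal(A, 0)

    common = A @ A                      # common[i, j] = number of common neighbours of i and j
    deg = A.sum(axis=1)

    rows = []
    for i, j in zip(*np.triu_indices_from(A, k=1)):
        if A[i, j]:
            closed = common[i, j]                                # triangles closed over the edge
            open_tri = (deg[i] - 1) + (deg[j] - 1) - 2 * closed  # 2-paths left open (assumed convention)
            rows.append((cooc.index[i], cooc.columns[j], closed, open_tri))

    edges = pd.DataFrame(rows, columns=["word_i", "word_j", "closed", "open"])

The edge clustering coefficient discussed above can then be derived from such counts (for example, closed triangles divided by the maximum number possible for that edge), which is presumably what the Excel step computes.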

  17. Coexpression Analysis of Human Genes Across Many Microarray Data Sets

    • borealisdata.ca
    • search.dataone.org
    Updated Mar 12, 2019
    Cite
    Homin K Lee; Amy K Hsu; Jon Sajdak; Jie Qin; Paul Pavlidis (2019). Coexpression Analysis of Human Genes Across Many Microarray Data Sets [Dataset]. http://doi.org/10.5683/SP2/JOJYOP
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2019
    Dataset provided by
    Borealis
    Authors
    Homin K Lee; Amy K Hsu; Jon Sajdak; Jie Qin; Paul Pavlidis
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    ABCF
    Description

    We present a large-scale analysis of mRNA coexpression based on 60 large human data sets containing a total of 3924 microarrays. We sought pairs of genes that were reliably coexpressed (based on the correlation of their expression profiles) in multiple data sets, establishing a high-confidence network of 8805 genes connected by 220,649 “coexpression links” that are observed in at least three data sets. Confirmed positive correlations between genes were much more common than confirmed negative correlations. We show that confirmation of coexpression in multiple data sets is correlated with functional relatedness, and show how cluster analysis of the network can reveal functionally coherent groups of genes. Our findings demonstrate how the large body of accumulated microarray data can be exploited to increase the reliability of inferences about gene function.
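    As a hedged illustration of the confirmation idea, not the authors' pipeline: the Python sketch below counts, for every gene pair, the number of data sets in which the pair's expression correlation clears a cutoff, and keeps pairs supported by at least three data sets. The cutoff value and input layout are assumptions.

        # Hedged sketch: "coexpression links" confirmed in >= 3 data sets.
        import numpy as np

        def confirmed_links(datasets, r_cut=0.6, min_support=3):
            """datasets: list of (n_genes x n_samples) arrays with a shared gene order."""
            n_genes = datasets[0].shape[0]
            support = np.zeros((n_genes, n_genes), dtype=int)
            for expr in datasets:
                r = np.corrcoef(expr)                # gene-by-gene correlation within one data set
                support += (r >= r_cut).astype(int)  # count data sets where the pair is coexpressed
            iu = np.triu_indices(n_genes, k=1)
            keep = support[iu] >= min_support
            return list(zip(iu[0][keep], iu[1][keep], support[iu][keep]))

        # toy smoke test: 5 data sets of 50 genes x 30 arrays
        rng = np.random.default_rng(0)
        print(len(confirmed_links([rng.normal(size=(50, 30)) for _ in range(5)])))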

  18. Data from: Data on Dispute Related Violence in a Northeastern City, United States, 2010 to 2012

    • catalog.data.gov
    • icpsr.umich.edu
    Updated Nov 14, 2025
    + more versions
    Cite
    National Institute of Justice (2025). Data on Dispute Related Violence in a Northeastern City, United States, 2010 to 2012 [Dataset]. https://catalog.data.gov/dataset/data-on-dispute-related-violence-in-a-northeastern-city-united-states-2010-to-2012-451c7
    Explore at:
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    National Institute of Justice (http://nij.ojp.gov/)
    Area covered
    United States
    Description

    These data are part of NACJD's Fast Track Release and are distributed as they were received from the data depositor. The files have been zipped by NACJD for release, but not checked or processed except for the removal of direct identifiers. Users should refer to the accompanying readme file for a brief description of the files available with this collection and consult the investigator(s) if further information is needed. The objective of this project was to enhance understanding of violent disputes by examining the use of aggression to rectify a perceived wrong. It also sought to identify the factors that determine whether retaliatory violence occurs within disputes, how long retaliatory disputes last, and what factors lead to their termination. This collection includes two SPSS data files: "Dispute_Database_for_NACJD.sav" (40 variables, 111 cases) and "Northeastern_City_Violence_Database_NACJD_submission.sav" (164 variables, 1,303 cases).
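    For readers working outside SPSS, a minimal sketch (assuming the pyreadstat package; any .sav reader would do) for loading the two files named above into pandas data frames:

        # Hedged sketch: load the collection's SPSS files in Python.
        import pyreadstat

        disputes, meta_d = pyreadstat.read_sav("Dispute_Database_for_NACJD.sav")
        violence, meta_v = pyreadstat.read_sav("Northeastern_City_Violence_Database_NACJD_submission.sav")

        print(disputes.shape)  # expected: 111 cases x 40 variables
        print(violence.shape)  # expected: 1,303 cases x 164 variables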

  19. Subthalamic nucleus correlates of force adaptation

    • data.mrc.ox.ac.uk
    • ora.ox.ac.uk
    Updated 2023
    Cite
    Damian M Herz; Sergiu Groppa; Peter Brown (2023). Subthalamic nucleus correlates of force adaptation [Dataset]. http://doi.org/10.5287/ora-9ovjdypbb
    Explore at:
    Dataset updated
    2023
    Authors
    Damian M Herz; Sergiu Groppa; Peter Brown
    Time period covered
    2023
    Dataset funded by
    Independent Research Fund Denmark
    Medical Research Council, UKRI
    Description

    This code analyses behavioural data from a group of 16 Parkinson's disease patients and 15 healthy control participants performing an action adaptation task, in which participants need to continuously adapt the applied force based on the feedback they receive. The first feedback ranges from 0 (worst) to 10 (best) points depending on the error between actual force and target force (Value-cue), and the second feedback indicates whether the force was too low or too high (Direction-feedback). The main behavioural outcomes are measures of force production and force adaptation (folder 1, used for figure 1 in the published article). In patients, local field potentials were recorded during the task, and the corresponding code is stored in folder 2 (figures 2 & 3). In 14 patients, burst deep brain stimulation was applied during a second session; its effects on behaviour and local field potentials are analysed with code from folders 3 and 4 (figures 4 & 5). The results have been published in a paper entitled ‘Neural underpinnings of action adaptation in the subthalamic nucleus’ by Herz et al.

    The code has been tested on a MacBook Pro, macOS Mojave 10.14.6. All data were analysed in Matlab (2019a, requires a software license) and FieldTrip. Installation guides can be found at https://matlab.mathworks.com/ and https://www.fieldtriptoolbox.org/download/. Run times of the different scripts are usually short (from under 1 minute to about 5 minutes) for most analyses, except for cluster-based permutation tests of linear mixed effects models, which take a few hours.

    Example data is provided for the behavioural analysis for 2 healthy control (HC) participants. Of note, this is not the actual data from HC01 & HC02 from the study, but it allows testing the behavioural scripts (see below for instructions).

    (i) Behavioral data:

    Scripts: ‘CompareLevodopaDemographicsMVC’ compares demographics and the maximum voluntary contraction (MVC) between patients and HC, and tests the effect of levodopa on the Unified Parkinson’s Disease Rating Scale (UPDRS). ‘GetEvents’ (_PD & _HC) imports the events file from PsychoPy using ‘ExtractData’ (& _HC) and saves it as a mat-file. ‘GetForce’ (_PD & _HC) computes several variables reflecting force production and adaptation using several helper functions: ‘Forceparameters’ computes measures of force production and the time of peak force and can illustrate single trials; ‘Forces_within’ computes the mean and standard error of the mean (SEM) of single-subject force traces; ‘Stat_within’ computes several single-subject correlations and measures of force adaptation; ‘Forces_across’ computes the mean and SEM of group force traces; and ‘stat_across’ computes group-mean trajectories of actual force and target force. The results are saved as a mat-file for subject-averaged data and a csv-file for single-subject data. ‘Plot_Stats’ computes statistics of these measures of force production and adaptation and plots the results.
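    The scripts themselves are Matlab; purely as a hedged illustration of the within-subject averaging step that ‘Forces_within’ is described as performing, a small numpy sketch:

        # Hedged sketch: mean and SEM of single-trial force traces (not the authors' code).
        import numpy as np

        def force_mean_sem(trials):
            """trials: (n_trials x n_timepoints) array of force traces."""
            mean = trials.mean(axis=0)
            sem = trials.std(axis=0, ddof=1) / np.sqrt(trials.shape[0])
            return mean, sem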

    (ii) Local field potential data:

    Scripts: ‘GetLFP_FirstLevel.m’: loads data and applies preprocessing, time-frequency analysis and re-aligning of data using FieldTrip. It uses the custom-written functions ‘MakeMontage_AllBipolar’ (which creates a bipolar montage from the monopolar data) and ‘EpochData_TF’ (which epochs the continuous data aligned to the feedback cue and peak force). The epoched spectra are saved. ‘GetLFP_SecondLevel_PlotSpectra.m’: loads the spectra from the first-level analysis and plots the grand average as well as group-averaged beta, alpha (for feedback-aligned data) and gamma traces (for movement-aligned data).
    ‘GetLFP_SecondLevel_LME.m’: loads the spectra from the first-level analysis and computes LME analyses with variables of interest using moving windows of single-trial beta power. For cluster-based permutation tests (which take several hours) the function ‘PermTests_LME’ is used. ‘GetLFP_SecondLevel_controlLME.m’: loads the spectra from the first-level analysis and computes control LME analyses: the effect of Value and Direction on alpha power in the feedback period (where it showed an increase), and the effect of change in force and absolute change in force on gamma power before peak force (where it showed an increase) and on beta power after the Value feedback (where it showed a correlation with Value).
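    The montage step is Matlab/FieldTrip code; as a hedged sketch of the idea behind ‘MakeMontage_AllBipolar’, bipolar channels can be formed as differences of adjacent monopolar contacts (the exact contact pairing used in the study is an assumption):

        # Hedged sketch: bipolar re-referencing of monopolar DBS recordings.
        import numpy as np

        def bipolar_montage(monopolar):
            """monopolar: (n_contacts x n_samples); returns (n_contacts - 1 x n_samples)."""
            return monopolar[:-1, :] - monopolar[1:, :]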

    (iii) DBS effects on behaviour:

    Scripts: ‘GetEvents_Stim’: imports the events file from PsychoPy using ‘ExtractData’ and saves it as a mat-file. ‘GetForce_Stim’: computes several variables reflecting force production and adaptation using several helper functions: ‘Forceparameters’ computes measures of force production and the time of peak force and can illustrate single trials; ‘Forces_within’ computes the mean and standard error of the mean (SEM) of single-subject force traces; ‘Forces_across’ computes the mean and SEM of group force traces. The results are saved as a mat-file for subject-averaged data and a csv-file for single-subject data. ‘GetToS’ loads a file with the stimulation trace during the task, calls the function ‘ToS_DownsampleBinaryRemoveRamp’ (which downsamples the data to 1000 Hz, makes stimulation binary (1 for ON, 0 for OFF) and removes the ramping so that only stimulation at effective intensities counts as stimulation) and loads the relevant behavioural data (change in force and absolute change in force). It then calls the functions ‘ToS_WindowedStim’ (which computes for each trial whether or not stimulation was given in any 100 ms moving window for cue- and movement-aligned data) and ‘ToS_Windowed_nexttrial’ (which computes change in force and absolute change in force for windows in which stimulation was applied vs. was not applied). The results are saved in a mat-file. ‘Plot_ToS’ loads this data, plots effects of stimulation on absolute change in force and change in force, and provides statistics using cluster-based permutation tests (‘PermTests_ToS’). It also saves single-trial behavioural data with a column stating whether DBS was applied in the critical time windows (which is used for the analysis of DBS effects on local field potentials).
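    Again the originals are Matlab; the sketch below illustrates, under stated assumptions, two of the stimulation-trace steps described for ‘ToS_DownsampleBinaryRemoveRamp’ and ‘ToS_WindowedStim’: binarising a downsampled stimulation trace and flagging 100 ms moving windows that contain stimulation (ramp removal is omitted here, and the threshold is hypothetical):

        # Hedged sketch: binarise a stimulation trace and flag 100 ms stimulation windows.
        import numpy as np

        def binarise_stim(trace, fs_in, fs_out=1000, thresh=0.0):
            step = max(1, int(round(fs_in / fs_out)))
            down = trace[::step]                # crude decimation to ~1000 Hz
            return (down > thresh).astype(int)  # 1 = stimulation ON, 0 = OFF

        def stim_in_windows(stim, fs=1000, win_ms=100):
            win = int(fs * win_ms / 1000)
            return np.array([stim[k:k + win].any() for k in range(len(stim) - win + 1)])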

    (iv) DBS effects on local field potentials:

    Scripts: ‘GetLFP_FirstLevel_Stim.m’: Loads data and applies preprocessing, time-frequency analysis and re-aligning of data using FieldTrip analogously to the script described under (ii) except that it also detrends and demeans the data, applies a low-pass filter at 100 Hz and excludes noisy data points, which are then interpolated. The epoched spectra aligned to feedback and movement are saved.

    ‘GetLFP_FirstLevel_Stim_TrigOnset.m’: Same as above, but aligned to onset of stimulation bursts.

    ‘GetLFP_SecondLevel_Stim.m’: loads data from the previous analysis, loads single-trial data with information on whether DBS was applied in the critical time windows, and plots these beta traces together with beta power off stimulation for the time windows of interest. Cluster-based permutation tests are applied using the function ‘PermTests_ToS’. Grand-average spectra and beta power irrespective of stimulation timing are also plotted.

    ‘GetLFP_SecondLevel_TrigOnset.m’: Loads data from the previous _TrigOnset analysis and plots the group average.
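    A hedged scipy sketch of the extra cleaning named in ‘GetLFP_FirstLevel_Stim.m’ (detrending, demeaning, and low-pass filtering at 100 Hz); the fourth-order Butterworth design is an assumption, since the original relies on FieldTrip:

        # Hedged sketch: detrend (which also demeans) and low-pass filter an LFP channel.
        from scipy.signal import detrend, butter, filtfilt

        def clean_lfp(x, fs):
            x = detrend(x)                                 # removes linear trend and mean
            b, a = butter(4, 100 / (fs / 2), btype="low")  # 100 Hz cutoff; assumes fs > 200 Hz
            return filtfilt(b, a, x)                       # zero-phase filtering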

    (v) Downloaded scripts:

    The following scripts were downloaded from mathworks.com: ‘computeCohen_d’ (measure of effect size), ‘jblill’ (filling significant clusters from permutation tests), ‘shadedErrorBar’ (illustrating mean and SEM).
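    ‘computeCohen_d’ is a Mathworks download, but the quantity itself is standard; for reference, a small Python equivalent for two independent groups:

        # Cohen's d with a pooled standard deviation (two independent samples).
        import numpy as np

        def cohen_d(x, y):
            nx, ny = len(x), len(y)
            pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                                 (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
            return (np.mean(x) - np.mean(y)) / pooled_sd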

    (vi) Testing example data:

    Two example datasets are provided (termed Kont01 & Kont02), which allow testing the behavioural force analysis. To do this, open the script GetForce_HC.m. DirName and EventPath should be adjusted to the actual path. Line 24 should be changed to ‘for Subj=1:2’, and lines 103-108 should be commented out, i.e. not used. Setting plotforce to 1 (line 12) plots the single-subject and group-average force spectra. Setting check to 1 (line 11) plots single-trial force data; for this, only use subject 1 or 2, not both (i.e. in line 24 use ‘for Subj=1’ or ‘for Subj=2’).

  20. Data from: Is my model fit for purpose? Validating a population model for predicting freshwater fish responses to flow management

    • zenodo.org
    • datasetcatalog.nlm.nih.gov
    • +5more
    bin, csv
    Updated Jul 31, 2023
    Cite
    Robin Hale; Jian Yen; Charles Todd; Ivor Stuart; Henry Wootton; Jason Thiem; John Koehn; Zeb Tonkin; Jarod Lyon; Michael McCarthy; Tomas Bird; Ben Fanson (2023). Is my model fit for purpose? Validating a population model for predicting freshwater fish responses to flow management [Dataset]. http://doi.org/10.5061/dryad.5qfttdzbz
    Explore at:
    bin, csv. Available download formats
    Dataset updated
    Jul 31, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robin Hale; Jian Yen; Charles Todd; Ivor Stuart; Henry Wootton; Jason Thiem; John Koehn; Zeb Tonkin; Jarod Lyon; Michael McCarthy; Tomas Bird; Ben Fanson
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Models based on ecological processes ("process-explicit models") are often used to predict ecosystem responses to environmental changes or management scenarios. However, models are imperfect and need to be validated, ideally by testing their assumptions and outputs against independent empirical data sets. Examples of validation of process-explicit models are rare. Recently, stochastic population models have been developed to predict the likely responses (over 10-120 years) of a riverine fish (golden perch, Macquaria ambigua) to flow management in the Murray-Darling Basin (MDB) in eastern Australia, one of the world's most regulated river basins. Declines of golden perch (and other species) are a direct consequence of altered hydrology, and managers require information to predict how fish will respond to possible future hydrological conditions to guide the substantial investments in flow management. Here, we use two independent field data sets to validate our population model. We compared model predictions to observed trends to ask: (1) how do predicted population sizes and growth rates compare to observed data? (2) does the correlation between predicted and observed population sizes and growth rates vary among populations? (3) does the correlation between predicted and observed population sizes and growth rates vary across observed hydrological conditions? and (4) how do modelled and observed fish movement rates compare? We found reasonable correlations between fish population sizes and growth rates as predicted by the model and observed in independent data sets for several populations (Aim 1), but the strength of these correlations varied among populations (Aim 2) and hydrological conditions (Aim 3). Predicted and observed fish movement rates were strongly correlated (Aim 4). Population models are frequently used in conservation decision-making but are rarely validated. We demonstrate that: (1) validation can identify model strengths and weaknesses; (2) observed data sets often have inherent limitations that can preclude robust validations; (3) validation is likely to be more common if appropriate observed data sets are available; and (4) validation should consider the purpose of modelling. Wider consideration of these messages would contribute to more critical examinations of models so they can be most appropriately used in conservation decision-making.
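    As a hedged sketch of the headline validation step (Aims 1-2), not the authors' code: correlate model-predicted population sizes, and the log growth rates derived from them, with their observed counterparts. The inputs, and the choice of Pearson correlation, are illustrative assumptions.

        # Hedged sketch: compare predicted vs. observed population trajectories.
        import numpy as np
        from scipy.stats import pearsonr

        def validate(predicted, observed):
            """predicted, observed: positive abundance series of equal length."""
            size_r, size_p = pearsonr(predicted, observed)
            g_pred = np.diff(np.log(predicted))  # log population growth rates
            g_obs = np.diff(np.log(observed))
            growth_r, growth_p = pearsonr(g_pred, g_obs)
            return {"size_r": size_r, "size_p": size_p,
                    "growth_r": growth_r, "growth_p": growth_p}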
