100+ datasets found
  1. Median values, interquartile range (IQR) and Number of outliers.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 16, 2015
    Cite
    Whaley, Dana H.; Denis, Max; Alizad, Azra; Pruthi, Sandhya; Mehrmohammadi, Mohammad; Chen, Shigao; Song, Pengfei; Meixner, Duane D.; Fatemi, Mostafa; Fazzio, Robert T. (2015). Median values, interquartile range (IQR) and Number of outliers. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001856309
    Dataset updated
    Mar 16, 2015
    Authors
    Whaley, Dana H.; Denis, Max; Alizad, Azra; Pruthi, Sandhya; Mehrmohammadi, Mohammad; Chen, Shigao; Song, Pengfei; Meixner, Duane D.; Fatemi, Mostafa; Fazzio, Robert T.
    Description

    Median values, interquartile range (IQR) and Number of outliers.

  2. Median, interquartile range (IQR) and significance level of the difference...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xls
    Updated Jun 1, 2023
    Cite
    Matthias Gilgien; Philip Crivelli; Jörg Spörri; Josef Kröll; Erich Müller (2023). Median, interquartile range (IQR) and significance level of the difference between discipline medians and distributions for all parameters, and percentage of DH for GS and SG. [Dataset]. http://doi.org/10.1371/journal.pone.0118119.t001
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Matthias Gilgien; Philip Crivelli; Jörg Spörri; Josef Kröll; Erich Müller
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DH represents 100% for the relative measure. Differences between medians and distributions were significant between all disciplines if indicated with *, significantly different between GS and SG when marked with 1, significantly different between GS and DH if marked with 2, and significantly different between SG and DH if marked with 3. If no parameter was significantly different, the column is empty. Columns marked with “—” indicate that the measure was not calculated.

    Median, interquartile range (IQR) and significance level of the difference between discipline medians and distributions for all parameters, and percentage of DH for GS and SG.

  3. Descriptive statistics, mean ± SD, range, median and interquartile range...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Hélène Follet; Delphine Farlay; Yohann Bala; Stéphanie Viguet-Carrin; Evelyne Gineyts; Brigitte Burt-Pichat; Julien Wegrzyn; Pierre Delmas; Georges Boivin; Roland Chapurlat (2023). Descriptive statistics, mean ± SD, range, median and interquartile range (IQR). [Dataset]. http://doi.org/10.1371/journal.pone.0055232.t001
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Hélène Follet; Delphine Farlay; Yohann Bala; Stéphanie Viguet-Carrin; Evelyne Gineyts; Brigitte Burt-Pichat; Julien Wegrzyn; Pierre Delmas; Georges Boivin; Roland Chapurlat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Descriptive statistics, mean ± SD, range, median and interquartile range (IQR).

  4. Characteristics of women, overall and according to BMI categories; data...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 6, 2023
    Cite
    Julie A. Pasco; Geoffrey C. Nicholson; Sharon L. Brennan; Mark A. Kotowicz (2023). Characteristics of women, overall and according to BMI categories; data presented as mean (±SD), median (IQR) or frequency (%). [Dataset]. http://doi.org/10.1371/journal.pone.0029580.t003
    Available download formats: xls
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Julie A. Pasco; Geoffrey C. Nicholson; Sharon L. Brennan; Mark A. Kotowicz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    *n = 1041 (35 missing data). BMI = body mass index (kg/m²); SD = standard deviation; IQR = interquartile range; EI = energy intake (MJ/d); BMR = basal metabolic rate (MJ/d).

  5. Meta data and supporting documentation

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Meta data and supporting documentation [Dataset]. https://catalog.data.gov/dataset/meta-data-and-supporting-documentation
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    We include a description of the data sets in the meta-data as well as sample code and results from a simulated data set. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

    Format:

    Abstract: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

    Description. Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    File format: R workspace file.

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: preterm birth, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
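The standardization described in this entry (subtract each week's median exposure, then divide by that week's IQR) can be sketched in Python as follows. This is only an illustration, not the authors' code, which is written in R. The function name `standardize_by_week`, the array shape and the log-normal input values are all assumptions made for the example.

```python
import numpy as np


def standardize_by_week(exposures: np.ndarray) -> np.ndarray:
    """Subtract each week's median and divide by that week's IQR.

    `exposures` is assumed to have shape (n_individuals, m_weeks),
    i.e. one column per exposure week.
    """
    median = np.median(exposures, axis=0)
    q25, q75 = np.percentile(exposures, [25, 75], axis=0)
    iqr = q75 - q25
    return (exposures - median) / iqr


# Hypothetical raw exposures: 1000 individuals, 37 weeks (made-up values).
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=2.0, sigma=0.5, size=(1000, 37))

z = standardize_by_week(raw)
```

After this transformation each week's column has median 0 and IQR 1, so the original medians and IQRs cannot be read back from the standardized values alone.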

  6. Simulation Data Set

    • catalog.data.gov
    • s.cnmilf.com
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Simulation Data Set [Dataset]. https://catalog.data.gov/dataset/simulation-data-set
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

    Metadata (including data dictionary):
    • y: Vector of binary responses (1: adverse outcome, 0: control)
    • x: Matrix of covariates; one row for each simulated individual
    • z: Matrix of standardized pollution exposures
    • n: Number of simulated individuals
    • m: Number of exposure time periods (e.g., weeks of pregnancy)
    • p: Number of columns in the covariate design matrix
    • alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

    Code

    Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

    Description:
    • “CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
    • “Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

    Optional Information (complete as necessary)

    Required R packages:
    • For running “CWVS_LMC.txt”:
      • msm: Sampling from the truncated normal distribution
      • mnormt: Sampling from the multivariate normal distribution
      • BayesLogit: Sampling from the Polya-Gamma distribution
    • For running “Results_Summary.txt”:
      • plotrix: Plotting the posterior means and credible intervals

    Instructions for Use

    Reproducibility (Mandatory)

    What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.

    How to use the information:
    • Load the “Simulated_Dataset.RData” workspace
    • Run the code contained in “CWVS_LMC.txt”
    • Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

    Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set:

    Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

    Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics, and requires an appropriate data use agreement.

    Description. Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis.

    This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).

  7. The median, interquartile range (IQR) and range of the minimum (Factors I,...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Aug 21, 2015
    Cite
    Scorgie, Fiona E.; Gnanathasan, Christeine A.; Lincz, Lisa F.; Shahmy, Seyed; Isbister, Geoffrey K.; Maduwage, Kalana; Karunathilake, Harendra; Mohamed, Fahim; O’Leary, Margaret A.; Abeysinghe, Chandana (2015). The median, interquartile range (IQR) and range of the minimum (Factors I, II, V, VII, VIII, IX, X) or maximum (PT/INR, aPTT, D-Dimer) factor concentrations/clotting times measured for the 146 patients during their hospital admission. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001888501
    Dataset updated
    Aug 21, 2015
    Authors
    Scorgie, Fiona E.; Gnanathasan, Christeine A.; Lincz, Lisa F.; Shahmy, Seyed; Isbister, Geoffrey K.; Maduwage, Kalana; Karunathilake, Harendra; Mohamed, Fahim; O’Leary, Margaret A.; Abeysinghe, Chandana
    Description

    The median, interquartile range (IQR) and range of the minimum (Factors I, II, V, VII, VIII, IX, X) or maximum (PT/INR, aPTT, D-Dimer) factor concentrations/clotting times measured for the 146 patients during their hospital admission.

  8. Median (interquartile range) of percentage of adult respondents with need...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    • +1 more
    xls
    Updated Jun 10, 2023
    Cite
    Anita K. Wagner; Amy J. Graves; Zhengyu Fan; Saul Walker; Fang Zhang; Dennis Ross-Degnan (2023). Median (interquartile range) of percentage of adult respondents with need for and access to care in 53 countries. [Dataset]. http://doi.org/10.1371/journal.pone.0057228.t002
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Anita K. Wagner; Amy J. Graves; Zhengyu Fan; Saul Walker; Fang Zhang; Dennis Ross-Degnan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Median (interquartile range) of percentage of adult respondents with need for and access to care in 53 countries.

  9. Numpy , pandas and matplot lib practice

    • kaggle.com
    zip
    Updated Jul 16, 2023
    Cite
    pratham saraf (2023). Numpy , pandas and matplot lib practice [Dataset]. https://www.kaggle.com/datasets/prathamsaraf1389/numpy-pandas-and-matplot-lib-practise/suggestions
    Available download formats: zip (385020 bytes)
    Dataset updated
    Jul 16, 2023
    Authors
    pratham saraf
    License

    https://cdla.io/permissive-1-0/

    Description

    The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.

    Specifics of the Dataset:

    The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.

    One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:

    • Certain columns are randomly selected to be populated with NaN values, effectively simulating the common challenge of missing data.
    • The proportion of these missing values in each column varies randomly between 1% and 70%.
    • Statistical noise has been introduced in the dataset. For numerical values in some features, this noise adheres to a distribution with mean 0 and standard deviation 0.1.
    • Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
    • Outliers have also been embedded in the dataset, in keeping with the Interquartile Range (IQR) rule.

    Context of the Dataset:

    The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.

    Sources of the Dataset:

    The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
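The data challenges listed in this entry (missing values, additive noise, and outliers judged by the IQR rule) can be reproduced on a single column with a short sketch like the one below. It does not reflect how the Kaggle dataset was actually generated. The column name `feature_1`, the normal base distribution, the Gaussian form of the noise and the conventional 1.5 × IQR fence multiplier are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 5000

# Base numeric feature plus additive noise with mean 0 and std 0.1
# (Gaussian noise is an assumption; the entry only gives mean and std).
values = rng.normal(loc=50.0, scale=10.0, size=n_rows)
values = values + rng.normal(loc=0.0, scale=0.1, size=n_rows)

# Blank out a random share of rows, drawn between 1% and 70%.
missing_fraction = rng.uniform(0.01, 0.70)
mask = rng.random(n_rows) < missing_fraction
values[mask] = np.nan

feature = pd.Series(values, name="feature_1")  # hypothetical column name

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
# Series.quantile skips NaN by default.
q1, q3 = feature.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (feature < lower) | (feature > upper)

print(f"missing: {feature.isna().mean():.1%}, flagged outliers: {is_outlier.sum()}")
```

Comparisons against NaN evaluate to False, so missing rows are never flagged as outliers in this sketch.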

  10. Proportion of positive results, interquartile range (IQR), minimum-maximum...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    • +1 more
    Updated Apr 17, 2014
    Cite
    Brienen, Eric A. T.; Kahama, Anthony I.; Melchers, Natalie V. S. Vinkeles; van Dam, Govert J.; Shaproski, David; Vennervald, Birgitte J.; van Lieshout, Lisette (2014). Proportion of positive results, interquartile range (IQR), minimum-maximum range, and median per diagnostic test at three different time points (baseline) of 24 S. haematobium-positive subjects. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001188275
    Dataset updated
    Apr 17, 2014
    Authors
    Brienen, Eric A. T.; Kahama, Anthony I.; Melchers, Natalie V. S. Vinkeles; van Dam, Govert J.; Shaproski, David; Vennervald, Birgitte J.; van Lieshout, Lisette
    Description

    Proportion of positive results, interquartile range (IQR), minimum-maximum range, and median per diagnostic test at three different time points (baseline) of 24 S. haematobium-positive subjects.

  11. Median response times in seconds (interquartile range in parenthesis) as a...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Nov 3, 2017
    Cite
    Hunt, Thomas E.; Ball, Linden J.; Stupple, Edward J. N.; Steel, Richard; Pitchford, Melanie (2017). Median response times in seconds (interquartile range in parenthesis) as a function of response type and CRT problem. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001816057
    Dataset updated
    Nov 3, 2017
    Authors
    Hunt, Thomas E.; Ball, Linden J.; Stupple, Edward J. N.; Steel, Richard; Pitchford, Melanie
    Description

    Median response times in seconds (interquartile range in parenthesis) as a function of response type and CRT problem.

  12. Time Series Data of Carbon Monoxide Concentrations

    • kaggle.com
    Updated Aug 10, 2024
    Cite
    REDNAM MANIKANTA SAI NEERAJ (2024). Time Series Data of Carbon Monoxide Concentrations [Dataset]. https://www.kaggle.com/datasets/manikantasai18/time-series-data-of-carbon-monoxide-concentrations
    Dataset updated
    Aug 10, 2024
    Dataset provided by
    Kaggle
    Authors
    REDNAM MANIKANTA SAI NEERAJ
    Description

    The dataset provides the median, 25th percentile, and 75th percentile of carbon monoxide (CO) concentrations in Delhi, measured in moles per square meter and vertically integrated over a 9-day mean period. This data offers insights into the distribution and variability of CO levels over time.

    The data, collected from July 10, 2018, to August 10, 2024, is sourced from the Tropomi Explorer.

    CO is a harmful gas that can significantly impact human health. High levels of CO can lead to respiratory issues, cardiovascular problems, and even be life-threatening in extreme cases. Forecasting CO levels helps in predicting and managing air quality to protect public health.

    CO is often emitted from combustion processes, such as those in vehicles and industrial activities. Forecasting CO levels can help in monitoring the impact of these sources and evaluating the effectiveness of emission control measures.

    Accurate CO forecasts can assist in urban planning and pollution control strategies, especially in densely populated areas where air quality issues are more pronounced.

    Columns and Data Description:
    • system:time_start: This column represents the date when the CO measurements were taken.
    • p25: This likely represents the 25th percentile value of CO levels for the given date, providing insight into the lower range of the distribution.
    • Median: The median CO level for the given date, which is the middle value of the dataset and represents a typical value.
    • IQR: The Interquartile Range, which measures the spread of the middle 50% of the data. It’s calculated as the difference between the 75th percentile (p75) and the 25th percentile (p25) values.
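The relationship between the columns (IQR = p75 − p25) can be checked with a few lines of pandas. The numbers below are illustrative placeholders, not values from the dataset, and the presence of a `p75` column in the file is an assumption based on the description.

```python
import pandas as pd

# Placeholder rows shaped like the described columns (values are made up).
df = pd.DataFrame(
    {
        "system:time_start": pd.to_datetime(["2018-07-10", "2018-07-19", "2018-07-28"]),
        "p25": [0.030, 0.031, 0.029],
        "Median": [0.033, 0.034, 0.032],
        "p75": [0.037, 0.036, 0.036],
    }
)

# IQR is the spread of the middle 50% of the data: p75 minus p25.
df["IQR"] = df["p75"] - df["p25"]

# Sanity check: the median must lie between the two quartiles.
quartiles_ordered = ((df["p25"] <= df["Median"]) & (df["Median"] <= df["p75"])).all()
```

If only `p25` and `IQR` were available, the upper quartile could be recovered the same way as `p25 + IQR`.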

  13. Data from: GEOMACS (Geological and Oceanographic Model of Australias...

    • researchdata.edu.au
    • data.gov.au
    • +1 more
    Updated Jul 24, 2008
    + more versions
    Cite
    Australian Ocean Data Network (2008). GEOMACS (Geological and Oceanographic Model of Australias Continental Shelf) Interquartile range [Dataset]. https://researchdata.edu.au/geomacs-geological-oceanographic-interquartile-range/691522
    Dataset updated
    Jul 24, 2008
    Dataset provided by
    Australian Ocean Data Network
    Description

    Geoscience Australia's GEOMACS model was utilised to produce hindcast hourly time series of continental shelf (~20 to 300 m depth) bed shear stress (unit of measure: Pascal, Pa) on a 0.1 degree grid covering the period March 1997 to February 2008 (inclusive). The hindcast data represents the combined contribution to the bed shear stress by waves, tides, wind and density-driven circulation. Included in the parameters that will be calculated to represent the magnitude of the bulk of the data are the quartiles of the distribution; Q25, Q50 and Q75 (i.e. the values for which 25, 50 and 75 percent of the observations fall below). The interquartile range of the GEOMACS output takes the observations from between Q25 and Q75 to provide an accurate representation of the spread of observations. The interquartile range was shown to provide a more robust representation of the observations than the standard deviation, which produced highly skewed observations (Hughes and Harris 2008). This dataset is a contribution to the CERF Marine Biodiversity Hub and is hosted temporarily by CMAR on behalf of Geoscience Australia.
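The robustness claim in this entry can be illustrated with a small numerical sketch: a handful of extreme values barely moves the quartiles but inflates the standard deviation. The exponential distribution, the sample size and the three extreme values are assumptions chosen for illustration and are not GEOMACS output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed "bed shear stress" series in Pa (made-up values).
stress = rng.exponential(scale=0.2, size=10_000)


def summarize(x: np.ndarray) -> dict:
    """Return the quartiles, the IQR and the sample standard deviation."""
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    return {"Q25": q25, "Q50": q50, "Q75": q75, "IQR": q75 - q25, "SD": np.std(x, ddof=1)}


base = summarize(stress)

# Append three extreme events and recompute the same summaries.
with_extremes = summarize(np.concatenate([stress, [50.0, 80.0, 120.0]]))

print(f"IQR: {base['IQR']:.4f} -> {with_extremes['IQR']:.4f}")
print(f"SD:  {base['SD']:.4f} -> {with_extremes['SD']:.4f}")
```

Three points out of roughly ten thousand shift the percentile ranks only slightly, while their squared distance from the mean dominates the variance.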

  14. Ames Housing Dataset with Engineered Features

    • kaggle.com
    zip
    Updated Aug 29, 2025
    Cite
    fazelsamar (2025). Ames Housing Dataset with Engineered Features [Dataset]. https://www.kaggle.com/datasets/fazelsamar/ames-housing-dataset-with-engineered-features
    Available download formats: zip (393857 bytes)
    Dataset updated
    Aug 29, 2025
    Authors
    fazelsamar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description: Ames Housing Dataset with Engineered Features

    This dataset is a cleaned and enhanced version of the popular Ames Housing Dataset, originally compiled by Dean De Cock. It is designed for regression tasks, specifically predicting house sale prices.

    Key Transformations and Features:

    • Missing Value Handling: Missing values have been addressed through dropping columns with excessive missing data and imputing remaining missing values using appropriate strategies (mode for categorical, median for numerical).
    • Categorical Encoding: Categorical features have been converted into numerical formats using a combination of Ordinal Encoding for variables with a natural order and One-Hot Encoding for nominal variables.
    • Feature Engineering: Several new features have been created to potentially improve model performance, including:
      • HouseAge: The age of the house calculated from the year it was built and the year it was sold.
      • Log_LotArea: A log transformation of the 'Lot Area' to address skewness.
      • TotalSF: The total square footage of the house, combining basement, first floor, and second floor areas.
    • Feature Selection: Highly correlated features have been identified and some have been removed to mitigate multicollinearity.
    • Outlier Handling: Outliers in numerical features have been capped using the Interquartile Range (IQR) rule.
    • Skewness Handling: Skewed numerical features have been transformed using a log transformation to achieve a more normal distribution.
    • Duplicate Removal: Duplicate rows have been identified and removed.
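The outlier-capping step in the list above can be sketched as follows. The function name `cap_outliers_iqr`, the 1.5 multiplier and the sample values are assumptions; the dataset description does not state which multiplier or which columns were used.

```python
import pandas as pd


def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up lot areas with one extreme value at the end.
lot_area = pd.Series([8450, 9600, 11250, 9550, 14260, 215245], name="Lot Area")
capped = cap_outliers_iqr(lot_area)
```

Capping keeps every row and only pulls extreme values back to the fence, unlike dropping outliers, which would shrink the dataset.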

    Potential Use Cases:

    This dataset is suitable for various regression modeling tasks, including:

    • Building predictive models for house prices.
    • Exploring the impact of different features on sale price.
    • Practicing data preprocessing and feature engineering techniques.

    This cleaned and engineered dataset provides a solid foundation for developing accurate and robust house price prediction models.
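The three engineered features named under Key Transformations (HouseAge, Log_LotArea, TotalSF) could be derived along these lines. The source column names (`Year Built`, `Yr Sold`, `Total Bsmt SF`, `1st Flr SF`, `2nd Flr SF`) follow the commonly distributed Ames file but are assumptions here, as are the natural-log choice and the two placeholder rows.

```python
import numpy as np
import pandas as pd

# Two placeholder rows using assumed Ames column names (values are made up).
df = pd.DataFrame(
    {
        "Year Built": [2003, 1976],
        "Yr Sold": [2008, 2007],
        "Lot Area": [8450, 9600],
        "Total Bsmt SF": [856, 1262],
        "1st Flr SF": [856, 1262],
        "2nd Flr SF": [854, 0],
    }
)

# Age of the house at the time of sale.
df["HouseAge"] = df["Yr Sold"] - df["Year Built"]

# Natural log of lot area to reduce right skew (lot areas are positive).
df["Log_LotArea"] = np.log(df["Lot Area"])

# Total square footage: basement plus first and second floors.
df["TotalSF"] = df["Total Bsmt SF"] + df["1st Flr SF"] + df["2nd Flr SF"]
```

The published dataset may use different source columns or a `log1p` variant, so check the column list before relying on these definitions.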

  15. Median and interquartile range of R0 by serotype and by province.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Feb 18, 2016
    Cite
    Grenfell, Bryan T.; Yu, Hongjie; Xing, Weijia; Liu, Fengfeng; Hsiao, Victor Y.; Wu, Joseph T.; Metcalf, C. Jessica E.; van Doorn, H. Rogier; Takahashi, Saki; Leung, Gabriel M.; Liao, Qiaohong; Zhang, Jing; Farrar, Jeremy J.; Van Boeckel, Thomas P.; Cowling, Benjamin J.; Chang, Zhaorui; Sun, Junling (2016). Median and interquartile range of R0 by serotype and by province. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001600804
    Dataset updated
    Feb 18, 2016
    Authors
    Grenfell, Bryan T.; Yu, Hongjie; Xing, Weijia; Liu, Fengfeng; Hsiao, Victor Y.; Wu, Joseph T.; Metcalf, C. Jessica E.; van Doorn, H. Rogier; Takahashi, Saki; Leung, Gabriel M.; Liao, Qiaohong; Zhang, Jing; Farrar, Jeremy J.; Van Boeckel, Thomas P.; Cowling, Benjamin J.; Chang, Zhaorui; Sun, Junling
    Description

    Median and interquartile range of R0 by serotype and by province.

  16. Median (InterQuartile Range, IQR) of air polltants and adjusteda odds ratio...

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Payam Dadvand; Mark J. Nieuwenhuijsen; Xavier Basagaña; Mar Alvarez-Pedrerol; Albert Dalmau-Bueno; Marta Cirach; Ioar Rivas; Bert Brunekreef; Xavier Querol; Ian G. Morgan; Jordi Sunyer (2023). Median (InterQuartile Range, IQR) of air polltants and adjusteda odds ratio (95% confidence intervals) of the use of spectacles associated with one Inter-Quartile Range (IQR) increase in exposure to each pollutant. [Dataset]. http://doi.org/10.1371/journal.pone.0167046.t002
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Payam Dadvand; Mark J. Nieuwenhuijsen; Xavier Basagaña; Mar Alvarez-Pedrerol; Albert Dalmau-Bueno; Marta Cirach; Ioar Rivas; Bert Brunekreef; Xavier Querol; Ian G. Morgan; Jordi Sunyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Median (InterQuartile Range, IQR) of air pollutants and adjusted odds ratio (95% confidence intervals) of the use of spectacles associated with one Inter-Quartile Range (IQR) increase in exposure to each pollutant.

  17. The sample size (n), median and interquartile range (IQR) of the 2-year...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 8, 2020
    + more versions
    Xu, Steven Jingliang; Lee, Fred Wang-Fat; Ho, Simon Yat-Fan (2020). The sample size (n), median and interquartile range (IQR) of the 2-year measurements taken by Citizen Science Leaders (CSLs) compared with those taken by the Environmental Protection Department of Hong Kong (EPD) where two locations were about 100 m apart from each other. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000449992
    Dataset updated
    Sep 8, 2020
    Authors
    Xu, Steven Jingliang; Lee, Fred Wang-Fat; Ho, Simon Yat-Fan
    Description

    The sample size (n), median and interquartile range (IQR) of the 2-year measurements taken by Citizen Science Leaders (CSLs) compared with those taken by the Environmental Protection Department of Hong Kong (EPD) where two locations were about 100 m apart from each other.

  18. Precipitation Interquartile Range Fall Estimation (PERSIANN) 1984-2014

    • noaa.hub.arcgis.com
    Updated Dec 17, 2024
    + more versions
    NOAA GeoPlatform (2024). Precipitation Interquartile Range Fall Estimation (PERSIANN) 1984-2014 [Dataset]. https://noaa.hub.arcgis.com/maps/8ddea6c7812e45b6b1c9848e6d93ad38
    Dataset updated
    Dec 17, 2024
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Authors
    NOAA GeoPlatform
    Description

    The Precipitation Estimation from Remotely Sensed Information using an Artificial Neural Network-Climate Data Record (PERSIANN-CDR) is a satellite-based precipitation dataset for hydrological and climate studies, spanning from 1983 to present. It is the longest satellite-based precipitation record available, with daily data at 0.25° resolution for the 60°S–60°N latitude band.

    PERSIANN rain rate estimates are generated at 0.25° resolution and calibrated to a monthly merged in-situ and satellite product from the Global Precipitation Climatology Project (GPCP). The model uses Gridded Satellite (GridSat-B1) infrared data at 3-hourly time steps, with the raw output (PERSIANN-B1) bias-corrected and accumulated to produce the daily PERSIANN-CDR.

    The maps show 31 years (1984–2014) of annual and seasonal median and interquartile range (IQR) data. The median represents the 50th percentile of precipitation, and the IQR reflects the range between the 75th and 25th percentiles, showing data variability. Median and IQR are preferred over mean and standard deviation as they are less influenced by extreme values and better represent non-normally distributed data, such as precipitation, which is skewed and zero-limited.

    Data and Metadata: NCEI

    This is a component of the Gulf Data Atlas (V1.0) for the Physical topic area.
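
The point about robustness can be illustrated with a minimal Python sketch. The precipitation values are made up for illustration, the function name `median_and_iqr` is hypothetical, and the quartile interpolation rule (`method="inclusive"`) is an assumption; the description does not say which percentile definition the maps use.

```python
import statistics

def median_and_iqr(values):
    """Return (median, IQR), with IQR = 75th percentile - 25th percentile.

    The "inclusive" interpolation rule is an assumption made for this sketch.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3 - q1

# Hypothetical daily precipitation totals (mm): mostly dry, one extreme event.
rain = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 100.0]

median, iqr = median_and_iqr(rain)
mean = statistics.mean(rain)
stdev = statistics.stdev(rain)

# The single 100 mm day pulls the mean and standard deviation far above
# the typical values, while the median and IQR stay close to them.
print(f"median={median}, IQR={iqr}, mean={mean:.2f}, stdev={stdev:.2f}")
```

With skewed, zero-limited data like this, the mean lands well above most of the observations, which is the behaviour the description cites as the reason for reporting median and IQR.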

  19. United States Climate Reference Network (USCRN) Standardized Soil Moisture...

    • catalog.data.gov
    • s.cnmilf.com
    • +2 more
    Updated Sep 19, 2023
    NOAA National Centers for Environmental Information (Point of Contact) (2023). United States Climate Reference Network (USCRN) Standardized Soil Moisture and Soil Moisture Climatology [Dataset]. https://catalog.data.gov/dataset/united-states-climate-reference-network-uscrn-standardized-soil-moisture-and-soil-moisture-clim2
    Dataset updated
    Sep 19, 2023
    Dataset provided by
    National Centers for Environmental Information (https://www.ncei.noaa.gov/)
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Area covered
    United States
    Description

    The U.S. Climate Reference Network (USCRN) was designed to monitor the climate of the United States using research quality instrumentation located within representative pristine environments. This Standardized Soil Moisture (SSM) and Soil Moisture Climatology (SMC) product set is derived using the soil moisture observations from the USCRN.

    The hourly soil moisture anomaly (SMANOM) is derived by subtracting the MEDIAN from the soil moisture volumetric water content (SMVWC) and dividing the difference by the interquartile range (IQR = 75th percentile - 25th percentile) for that hour: SMANOM = (SMVWC - MEDIAN) / (IQR). The soil moisture percentile (SMPERC) is derived by taking all the values that were used to create the empirical cumulative distribution function (ECDF) that yielded the hourly MEDIAN and adding the current observation to the set, recalculating the ECDF, and determining the percentile value of the current observation. Finally, the soil temperature for the individual layers is provided for the convenience of dataset users.

    The SMC files contain the MEAN, MEDIAN, IQR, and decimal fraction of available data that are valid for each hour of the year at 5, 10, 20, 50, and 100 cm depth soil layers as well as for a top soil layer (TOP) and column soil layer (COLUMN). The TOP layer consists of an average of the 5 and 10 cm depths, while the COLUMN layer includes all available depths at a location, either two layers or five layers depending on soil depth. The SSM files contain the mean VWC, SMANOM, SMPERC, and TEMPERATURE for each of the depth layers described above.

    File names are structured as CRNSSM0101-STATIONNAME.csv and CRNSMC0101-STATIONNAME.csv. SSM stands for Standardized Soil Moisture and SMC stands for Soil Moisture Climatology. The first two digits of the trailing integer indicate the major version and the second two digits the minor version of the product.
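
The SMANOM and SMPERC definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and sample values are hypothetical, the quartile interpolation rule is assumed, and the ECDF is taken here as the fraction of values less than or equal to the observation, which the description does not specify.

```python
import statistics

def soil_moisture_anomaly(smvwc, history):
    """SMANOM = (SMVWC - MEDIAN) / IQR for one hour of the year.

    `history` holds the past observations for that hour (hypothetical input).
    The "inclusive" quartile rule is an assumption of this sketch.
    """
    q1, median, q3 = statistics.quantiles(history, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; anomaly is undefined")
    return (smvwc - median) / iqr

def soil_moisture_percentile(smvwc, history):
    """SMPERC: add the current observation to the set, then evaluate the ECDF.

    The ECDF convention (fraction of values <= observation) is assumed.
    """
    sample = list(history) + [smvwc]
    return sum(1 for v in sample if v <= smvwc) / len(sample)

# Hypothetical volumetric water content values for one hour of the year.
history = [10.0, 20.0, 30.0, 40.0, 50.0]
anomaly = soil_moisture_anomaly(45.0, history)
percentile = soil_moisture_percentile(45.0, history)
print(anomaly, percentile)
```

Dividing by the IQR rather than the standard deviation keeps the anomaly scale from being dominated by a few extreme wet or dry hours.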

  20. Human Activity Recognition Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2023
    + more versions
    Aruna S (2023). Human Activity Recognition Dataset [Dataset]. https://www.kaggle.com/datasets/arunasivapragasam/human-activity-recognition-dataset
    Available download format: zip (51310476 bytes)
    Dataset updated
    Feb 21, 2023
    Authors
    Aruna S
    Description

    The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.

    The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low-frequency components; therefore, a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
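
The windowing arithmetic can be checked with a short Python sketch: at 50 Hz, a 2.56 sec window holds 128 readings, and 50% overlap means each window starts 64 readings after the previous one. The function name `sliding_windows` and the decision to drop a trailing partial window are assumptions of this sketch, not details from the dataset description.

```python
def sliding_windows(samples, rate_hz=50, width_s=2.56, overlap=0.5):
    """Split a signal into fixed-width windows with fractional overlap.

    Trailing samples that do not fill a whole window are dropped
    (an assumption made for this sketch).
    """
    width = round(rate_hz * width_s)     # 128 readings per window
    step = round(width * (1 - overlap))  # 64 readings between window starts
    return [samples[i:i + width]
            for i in range(0, len(samples) - width + 1, step)]

# Hypothetical 6.4 sec recording (320 readings) on one axis.
signal = list(range(320))
windows = sliding_windows(signal)
print(len(windows), [w[0] for w in windows])
```

Each reading after the first 64 therefore appears in two consecutive windows, except at the end of the recording.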

    The features selected for this database come from the accelerometer and gyroscope 3-axial raw signals tAcc-XYZ and tGyro-XYZ. These time-domain signals (prefix 't' to denote time) were captured at a constant rate of 50 Hz. Then they were filtered using a median filter and a 3rd order low pass Butterworth filter with a corner frequency of 20 Hz to remove noise. Similarly, the acceleration signal was then separated into the body and gravity acceleration signals (tBodyAcc-XYZ and tGravityAcc-XYZ) using another low pass Butterworth filter with a corner frequency of 0.3 Hz.

    Subsequently, the body linear acceleration and angular velocity were derived in time to obtain Jerk signals (tBodyAccJerk-XYZ and tBodyGyroJerk-XYZ). Also, the magnitude of these three-dimensional signals was calculated using the Euclidean norm (tBodyAccMag, tGravityAccMag, tBodyAccJerkMag, tBodyGyroMag, tBodyGyroJerkMag).
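
A rough Python sketch of these two steps follows. The description only says the signals were "derived in time"; approximating that derivative with a first-order finite difference is an assumption here, and the function names and sample values are hypothetical.

```python
import math

def jerk(signal, rate_hz=50.0):
    """Approximate the time derivative with a forward finite difference.

    The difference scheme is an assumption; the source does not state it.
    """
    return [(b - a) * rate_hz for a, b in zip(signal, signal[1:])]

def magnitude(xs, ys, zs):
    """Euclidean norm of a 3-axial signal, sample by sample."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(xs, ys, zs)]

# Hypothetical body acceleration on one axis, sampled at 50 Hz.
acc_x = [0.0, 0.5, 1.5]
print(jerk(acc_x))                       # change per second between samples
print(magnitude([3.0], [4.0], [12.0]))   # one 3-axial sample
```

Applying `magnitude` to the three jerk components would give the analogue of a signal such as tBodyAccJerkMag.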

    Finally a Fast Fourier Transform (FFT) was applied to some of these signals producing fBodyAcc-XYZ, fBodyAccJerk-XYZ, fBodyGyro-XYZ, fBodyAccJerkMag, fBodyGyroMag, fBodyGyroJerkMag. (Note the 'f' to indicate frequency domain signals).

    These signals were used to estimate variables of the feature vector for each pattern: '-XYZ' is used to denote 3-axial signals in the X, Y, and Z directions.

    tBodyAcc-XYZ
    tGravityAcc-XYZ
    tBodyAccJerk-XYZ
    tBodyGyro-XYZ
    tBodyGyroJerk-XYZ
    tBodyAccMag
    tGravityAccMag
    tBodyAccJerkMag
    tBodyGyroMag
    tBodyGyroJerkMag
    fBodyAcc-XYZ
    fBodyAccJerk-XYZ
    fBodyGyro-XYZ
    fBodyAccMag
    fBodyAccJerkMag
    fBodyGyroMag
    fBodyGyroJerkMag

    The set of variables that was estimated from these signals is:

    mean(): Mean value
    std(): Standard deviation
    mad(): Median absolute deviation
    max(): Largest value in array
    min(): Smallest value in array
    sma(): Signal magnitude area
    energy(): Energy measure. Sum of the squares divided by the number of values.
    iqr(): Interquartile range
    entropy(): Signal entropy
    arCoeff(): Autoregression coefficients with Burg order equal to 4
    correlation(): correlation coefficient between two signals
    maxInds(): index of the frequency component with the largest magnitude
    meanFreq(): Weighted average of the frequency components to obtain a mean frequency
    skewness(): skewness of the frequency domain signal
    kurtosis(): kurtosis of the frequency domain signal
    bandsEnergy(): Energy of a frequency interval within the 64 bins of the FFT of each window.
    angle(): Angle between two vectors.
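
Two of the less familiar variables in this list, mad() and energy(), can be sketched in Python from the one-line definitions given. The function names mirror the variable labels but are otherwise hypothetical, and the sample window is made up; the dataset's own implementation may differ in details such as scaling.

```python
import statistics

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def energy(values):
    """Energy measure: sum of the squares divided by the number of values."""
    return sum(v * v for v in values) / len(values)

# Hypothetical window with one spike.
window = [1.0, 2.0, 3.0, 4.0, 100.0]
print(mad(window))                    # barely affected by the spike
print(statistics.stdev(window))       # strongly affected by the spike
print(energy([1.0, 2.0, 3.0, 4.0]))
```

Like the IQR, the median absolute deviation is a spread measure that a single outlier in the window does not dominate.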

    Additional vectors are obtained by averaging the signals in a signal window sample. These are used on the angle() variable:

    gravityMean
    tBodyAccMean
    tBodyAccJerkMean
    tBodyGyroMean
    tBodyGyroJerkMean

    This data set consists of the following columns:

    1 tBodyAcc-mean()-X
    2 tBodyAcc-mean()-Y
    3 tBodyAcc-mean()-Z
    4 tBodyAcc-std()-X
    5 tBodyAcc-std()-Y
    6 tBodyAcc-std()-Z
    7 tBodyAcc-mad()-X
    8 tBodyAcc-mad()-Y
    9 tBodyAcc-mad()-Z
    10 tBodyAcc-max()-X
    11 tBodyAcc-max()-Y
    12 tBodyAcc-max()-Z
    13 tBodyAcc-min()-X
    14 tBodyAcc-min()-Y
    15 tBodyAcc-min()-Z
    16 tBodyAcc-sma()
    17 tBodyAcc-energy()-X
    18 tBodyAcc-energy()-Y
    19 tBodyAcc-energy()-Z
    20 tBodyAcc-iqr()-X
    21 tBodyAcc-iqr()-Y
    22 tBodyAcc-iqr()-Z
    23 tBodyAcc-entropy()-X
    24 tBodyAcc-entropy()-Y
    25 tBodyAcc-entropy()-Z
    26 tBodyAcc-arCoeff()-X,1
    27 tBodyAcc-arCoeff()-X,2
    28 tBodyAcc-arCoeff()-X,3
    29 tBodyAcc-arCoeff()-X,4
    30 tBodyAcc-arCoeff()-Y,1
    31 tBodyAcc-arCoeff()-Y,2
    32 tBodyAcc-arCoeff()-Y,3
    33 tBodyAcc-arCoeff()-Y,4
    34 tBodyAcc-arCoeff()-Z,1
    35 tBodyAcc-arCoeff()-Z,2
    36 tBodyAcc-arCoeff()-Z,3
    37 tBodyAcc-arCoeff()-Z,4
    38 tBodyAcc-correlation()-X,Y
    39 tBodyAcc-correlation()-X,Z
    40 tBodyAcc-correlation()-Y,Z
    41 tGravityAcc-mean()-X
    42 tGravit...
