License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This synthetic dataset contains 4,362 rows and five columns, including both numerical and categorical data. It is designed for data cleaning, imputation, and analysis tasks, featuring structured missing values at varying percentages (63%, 4%, 47%, 31%, and 9%).
The dataset includes:
- Category (Categorical): Product category (A, B, C, D)
- Price (Numerical): Randomized product prices
- Rating (Numerical): Ratings between 1 to 5
- Stock (Categorical): Availability status (In Stock, Out of Stock)
- Discount (Numerical): Discount percentage
This dataset is ideal for practicing missing data handling, exploratory data analysis (EDA), and machine learning preprocessing.
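A first pass at the cleaning and imputation task this dataset is built for might look as follows. The frame below is a small synthetic stand-in with the described schema and missingness rates; all values are illustrative, and the simple median/mode strategy is just one of many reasonable choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Build a small stand-in frame matching the described schema.
df = pd.DataFrame({
    "Category": rng.choice(["A", "B", "C", "D"], n),
    "Price": rng.uniform(5, 500, n).round(2),
    "Rating": rng.integers(1, 6, n).astype(float),
    "Stock": rng.choice(["In Stock", "Out of Stock"], n),
    "Discount": rng.uniform(0, 50, n).round(1),
})
# Inject structured missingness at the stated per-column rates.
for col, frac in {"Category": 0.63, "Price": 0.04, "Rating": 0.47,
                  "Stock": 0.31, "Discount": 0.09}.items():
    df.loc[df.sample(frac=frac, random_state=1).index, col] = np.nan

# Impute: median for numeric columns, mode for categorical ones.
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())
```

After this loop the frame contains no missing values, and the same two-branch pattern extends to any mixed numeric/categorical table.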
License: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.7910/DVN/29606
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependence, difficult computation, or inapplicability with multiple mismeasured variables. We develop an easy-to-use alternative without these problems; it generalizes the popular multiple imputation (MI) framework by treating missing data problems as a limiting special case of extreme measurement error, and corrects for both. Like MI, the proposed framework is a simple two-step procedure, so that in the second step researchers can use whatever statistical method they would have if there had been no problem in the first place. We also offer empirical illustrations, open source software that implements all the methods described herein, and a companion paper with technical details and extensions (Blackwell, Honaker, and King, 2014b). Notes: This is the first of two articles to appear in the same issue of the same journal by the same authors. The second is “A Unified Approach to Measurement Error and Missing Data: Details and Extensions.” See also: Missing Data
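The combining step of the two-step MI procedure described above is conventionally done with Rubin's rules: the pooled point estimate is the mean of the m completed-data estimates, and the pooled variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). A generic numeric sketch (not the authors' software):

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Combine m completed-data analyses via Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w_bar + (1 + 1 / m) * b         # total variance
    return q_bar, t

# Hypothetical coefficient estimates and variances from m = 3 imputations:
q, t = pool_estimates([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

Because the second step is just ordinary analysis repeated per completed dataset, any estimator that returns a point estimate and variance can be pooled this way.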
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Ecologists use classifications of individuals in categories to understand composition of populations and communities. These categories might be defined by demographics, functional traits, or species. Assignment of categories is often imperfect, but frequently treated as observations without error. When individuals are observed but not classified, these “partial” observations must be modified to include the missing data mechanism to avoid spurious inference.
We developed two hierarchical Bayesian models to overcome the assumption of perfect assignment to mutually exclusive categories in the multinomial distribution of categorical counts, when classifications are missing. These models incorporate auxiliary information to adjust the posterior distributions of the proportions of membership in categories. In one model, we use an empirical Bayes approach, where a subset of data from one year serves as a prior for the missing data the next. In the other approach, we use a small random sample of data within a year to inform the distribution of the missing data.
We performed a simulation to show the bias that occurs when partial observations were ignored and demonstrated the altered inference for the estimation of demographic ratios. We applied our models to demographic classifications of elk (Cervus elaphus nelsoni) to demonstrate improved inference for the proportions of sex and stage classes.
We developed multiple modeling approaches using a generalizable nested multinomial structure to account for partially observed data that were missing not at random for classification counts. Accounting for classification uncertainty is important to accurately understand the composition of populations and communities in ecological studies.
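A toy simulation (not the authors' hierarchical Bayesian model) reproduces the bias described above: when one class is systematically harder to classify, the proportion computed from classified individuals alone is biased, and auxiliary information about classification rates is needed to correct it. The classification probabilities below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
true_p_female = 0.6

# Simulate sex of each individual, then a classification process in which
# females are observed-but-unclassified more often than males (MNAR).
is_female = rng.random(n) < true_p_female
p_classified = np.where(is_female, 0.5, 0.9)
classified = rng.random(n) < p_classified

# Naive estimate ignores the partial observations entirely (biased low):
naive = is_female[classified].mean()

# Given the classification rates (the kind of auxiliary information the
# models supply), inverse-probability weighting recovers the truth:
w = 1.0 / p_classified[classified]
corrected = np.average(is_female[classified], weights=w)
```

The naive estimate lands near 0.45 rather than 0.6, illustrating the spurious inference that follows from treating partial observations as ignorable.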
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Download all files to get the dataset. Please check the MD5 checksums after download.
1. Run `cat ppt4j_data.tar.xz.part* > ppt4j_data.tar.xz` to assemble the complete archive.
2. Run `awk '{print $2 " " $1}' MD5.txt > MD5.chk && md5sum --ignore-missing --check MD5.chk` to check the integrity of the downloaded files. The format of MD5.txt is not compatible with md5sum, so the awk command is employed to fix this. Sorry for the inconvenience.
3. Extract ppt4j_data.tar.xz, then follow the instructions at https://github.com/pan2013e/ppt4j. The tarball was created on macOS with bsdtar, and you may notice warnings like `tar: Ignoring unknown extended header keyword 'XXX'` if you extract it on Linux with GNU tar. You can ignore these warnings, as long as the checksum is okay.
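Since the real multi-part archive cannot be reproduced here, the reassembly-and-verify workflow can be sanity-checked end to end with dummy stand-in files (all names and contents below are illustrative):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
# Dummy stand-ins for the ppt4j_data.tar.xz.part* files:
printf 'hello ' > demo.tar.xz.part0
printf 'world'  > demo.tar.xz.part1
cat demo.tar.xz.part* > demo.tar.xz      # step 1: reassemble the archive
# MD5.txt ships as "filename checksum", which md5sum cannot parse directly:
echo 'demo.tar.xz 5eb63bbbe01eeed093cb22bb8f5acdc3' > MD5.txt
# Step 2: swap the columns into md5sum's "checksum  filename" format
# (two spaces here, the canonical md5sum separator):
awk '{print $2 "  " $1}' MD5.txt > MD5.chk
md5sum --ignore-missing --check MD5.chk
```

With `set -e`, the script exits non-zero the moment reassembly or the checksum check fails, which makes it safe to drop into a download script.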
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Using current climate models, regional-scale changes for Florida over the next 100 years are predicted to include warming over terrestrial areas and very likely increases in the number of high temperature extremes. No uniform definition of a heat wave exists. Most past research on heat waves has focused on evaluating the aftermath of known heat waves, with minimal consideration of missing exposure information.
Objectives: To identify and discuss methods of handling and imputing missing weather data and how those methods can affect identified periods of extreme heat in Florida.
Methods: In addition to ignoring missing data, temporal, spatial, and spatio-temporal models are described and utilized to impute missing historical weather data from 1973 to 2012 from 43 Florida weather monitors. Calculated thresholds are used to define periods of extreme heat across Florida.
Results: Modeling of missing data and imputing missing values can affect the identified periods of extreme heat, through the missing data itself or through the computed thresholds. The differences observed are related to the amount of missingness during June, July, and August, the warmest months of the warm season (April through September).
Conclusions: Missing data considerations are important when defining periods of extreme heat. Spatio-temporal methods are recommended for data imputation. A heat wave definition that incorporates information from all monitors is advised.
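As a sketch of the simplest of the imputation options mentioned, here is a temporal-interpolation example in pandas on a hypothetical daily temperature series, followed by a percentile-based extreme-heat threshold; the study's spatial and spatio-temporal models are considerably more elaborate:

```python
import numpy as np
import pandas as pd

# Hypothetical daily maximum temperatures (°C) with missing observations.
idx = pd.date_range("2012-06-01", periods=10, freq="D")
temps = pd.Series([31.0, 32.5, np.nan, np.nan, 34.0,
                   33.5, np.nan, 35.0, 34.5, 33.0], index=idx)

# Temporal imputation: linear interpolation between neighboring days.
filled = temps.interpolate(method="time")

# An illustrative extreme-heat threshold: the 90th percentile of the
# (now complete) series; days at or above it count as extreme heat.
threshold = filled.quantile(0.90)
hot_days = filled[filled >= threshold].index
```

Note how the imputed values feed directly into the threshold computation, which is exactly the pathway by which the choice of imputation method can shift the identified periods of extreme heat.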
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset contains 545 entries (rows) and 13 features (columns). It is a clean dataset with no missing values across all columns, meaning you can skip the standard null-value imputation step. The dataset consists of 7 numerical columns and 6 categorical columns (including the target price): Given that the data is clean (no missing values), the best next step is to start your Exploratory Data Analysis (EDA).
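Verifying the no-missing-values claim before skipping imputation takes one line in pandas. The frame below is a tiny stand-in with hypothetical column names, not the actual 545-row file:

```python
import pandas as pd

# Stand-in for the 545-row dataset described above.
df = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "area": [7420, 8960, 9960],
    "furnishingstatus": ["furnished", "furnished", "semi-furnished"],
})

assert df.isnull().sum().sum() == 0   # confirms it is safe to skip imputation
summary = df.describe(include="all")  # a natural first step of EDA
```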
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary statistics of the number of observations for the 581 patients
The integration of proteomic datasets generated by non-cooperating laboratories using different LC-MS/MS setups can overcome limitations of statistically underpowered sample cohorts, but has not been demonstrated to this day. In proteomics, differences in sample preservation and preparation strategies, chromatography and mass spectrometry approaches, and the quantification strategy used distort protein abundance distributions in integrated datasets. The removal of these technical batch effects requires setup-specific normalization and strategies that can deal with missing at random (MAR) and missing not at random (MNAR) type values at the same time. Algorithms for batch effect removal, such as the ComBat algorithm commonly used for other omics types, disregard proteins with MNAR missing values and significantly reduce the informational yield and the effect size for combined datasets. Here, we present a strategy for data harmonization across different tissue preservation techniques, LC-MS/MS instrumentation setups, and quantification approaches. To enable batch effect removal without the need for data reduction or error-prone imputation, we developed an extension to the ComBat algorithm, ComBat HarmonizR, that performs data harmonization with appropriate handling of MAR and MNAR missing values by matrix dissection. The ComBat HarmonizR-based strategy enables the combined analysis of independently generated proteomic datasets for the first time. Furthermore, we found ComBat HarmonizR to be superior for removing batch effects between different Tandem Mass Tag (TMT)-plexes, compared to commonly used internal reference scaling (iRS). Due to the matrix dissection approach, which avoids data imputation, the HarmonizR algorithm can be applied to any type of -omics data while assuring minimal data loss.
By Granger Huntress [source]
This dataset provides a comprehensive look at the world of men's professional tennis throughout the Open Era. Every year, a new crop of tennis players has emerged to challenge long-standing traditions, while others have continued to maintain their place near the top. Through this dataset you will uncover which players reached or maintained top ranking positions in the record books, and how they navigated changing eras in men's professional tennis. Dive into what makes these successful athletes stand out, with data covering each player's first name, birthdate, country of origin, handedness, the date range of records kept, and, most importantly, their ATP year-end rankings. Whether you want a snapshot view of long-term trends or insights into why top players succeed, this collection provides invaluable resources for exploring men's ATP rankings throughout the Open Era.
This dataset is a comprehensive source of men's year-end rankings during the Open Era. Each record includes information on the player's ranking, name, birthdate, country of origin, and handedness. This dataset can be used to study the trends in professional tennis throughout this time period and analyze how they have changed over time.
To use this dataset effectively one should first explore the data by reviewing some basic statistics. Examples include summary statistics such as total players by country or average ranking across years. Summarizing the data will help get a quick understanding of what the data is composed of and any existing patterns that may be present in it.
Another important step before deeper analysis is to check for missing values or outliers that could affect your results if ignored or handled inappropriately. Understanding potential issues like these up front can save you from misinterpreting results later in the analysis.
Once an overview of your dataset has been established and potential issues have been addressed, it is time to conduct a more detailed exploration and answer questions about professional tennis during this period, such as: How did various nations perform over different years? Who was consistently ranked among the top 10 players throughout this period? Are there any trends associated with handedness? Answering such questions properly requires finding appropriate ways to analyze them given the available variables, so keep that in mind when trying to pin down connections between variables using techniques like correlation analysis or linear regression. In addition, visualizations can help you make sense of the complex multivariable relationships that may exist between sets of parameters, so don't forget to include them whenever possible. This way you can maximize accuracy when uncovering patterns in both individual players and holistic summary statistics for tennis rankings across the years covered by the Open Era.
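A sketch of those exploration steps in pandas, using a toy frame with hypothetical column names (PLAYER, COUNTRY, YEAR, RANK) standing in for the actual file's schema:

```python
import pandas as pd

# Toy year-end rankings standing in for the real file.
df = pd.DataFrame({
    "PLAYER":  ["Alice", "Bob", "Carol", "Alice", "Bob", "Dan"],
    "COUNTRY": ["USA", "ESP", "USA", "USA", "ESP", "SRB"],
    "YEAR":    [2020, 2020, 2020, 2021, 2021, 2021],
    "RANK":    [1, 2, 3, 1, 4, 2],
})

# Summary statistics: players per country and average rank per year.
players_per_country = df.groupby("COUNTRY")["PLAYER"].nunique()
avg_rank_by_year = df.groupby("YEAR")["RANK"].mean()

# Players ranked in the top 10 in every year covered by the data:
top10 = df[df["RANK"] <= 10]
years = df["YEAR"].nunique()
consistent = top10.groupby("PLAYER")["YEAR"].nunique().eq(years)
consistent_top10 = consistent[consistent].index.tolist()
```

The same groupby patterns extend directly to handedness splits or country-by-year trend tables once the real columns are loaded.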
- Analyzing the global trends in men's tennis in the Open Era over time by examining shifts in countries represented at each year-end ranking.
- Examining the effectiveness of players against different opponents based on handedness (right or left) throughout the Open Era.
- Tracking and predicting future player rankings based on birthdates, country, and other relevant factors that influence performance
If you use this dataset in your research, please credit the original authors.
Unknown License - Please check the dataset description for more information.
File: ltdPlayerMaster.csv

| Column name | Description                               |
|:------------|:------------------------------------------|
| FIRST       | First name of the player. (String)        |
| LAST        | Last name of the player. (String)         |
| HAND        | Handedness of the player (Right or Left). |
| ...         | ...                                       |
Various instruments are used to create images of the Earth and other objects in the universe in a diverse set of wavelength bands with the aim of understanding natural phenomena. Sometimes these instruments are built in a phased approach, with additional measurement capabilities added in later phases. In other cases, technology may mature to the point that the instrument offers new measurement capabilities that were not planned in the original design of the instrument. In still other cases, high resolution spectral measurements may be too costly to perform on a large sample and therefore lower resolution spectral instruments are used to take the majority of measurements. Many applied science questions that are relevant to the Earth science remote sensing community require analysis of enormous amounts of data that were generated by instruments with disparate measurement capabilities. This paper addresses this problem using Virtual Sensors: a method that uses models trained on spectrally rich (high spectral resolution) data to "fill in" unmeasured spectral channels in spectrally poor (low spectral resolution) data. The models we use in this paper are Multi-Layer Perceptrons (MLPs), Support Vector Machines (SVMs) with Radial Basis Function (RBF) kernels, and SVMs with Mixture Density Mercer Kernels (MDMK). We demonstrate this method by using models trained on the high spectral resolution Terra MODIS instrument to estimate what the equivalent of the MODIS 1.6 micron channel would be for the NOAA AVHRR/2 instrument. The scientific motivation for the simulation of the 1.6 micron channel is to improve the ability of the AVHRR/2 sensor to detect clouds over snow and ice.
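The Virtual Sensors idea can be illustrated with ordinary least squares standing in for the paper's MLPs and SVMs: fit the unmeasured channel as a function of the channels both instruments share, using the spectrally rich instrument's data, then predict that channel for the spectrally poor instrument. All channel values below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
# Synthetic "spectrally rich" training data: two shared channels plus a
# target channel (the 1.6-micron analogue), here a noisy linear mix.
shared = rng.uniform(0, 1, size=(n, 2))
target = 0.7 * shared[:, 0] - 0.3 * shared[:, 1] + 0.05 * rng.normal(size=n)

# Fit target ~ shared channels (with intercept) on the rich instrument...
X = np.column_stack([np.ones(n), shared])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)

# ...then "fill in" the unmeasured channel for the poor instrument.
poor_shared = rng.uniform(0, 1, size=(10, 2))
virtual_channel = np.column_stack([np.ones(10), poor_shared]) @ coef
```

Swapping the linear fit for an MLP or kernel SVM changes only the model-fitting lines; the train-on-rich, predict-on-poor structure is the method itself.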
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Bandit algorithms defined by allocation probability π_{k,t} or index value I_{k,t}.
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Data needed for macroecological analyses are difficult to compile and often hidden away in supplementary material under non-standardized formats. Phylogenies, range data, and trait data often use conflicting taxonomies and require ad hoc decisions to synonymize species or fill in large amounts of missing data. Furthermore, most available data sets ignore the large impact that humans have had on species ranges and diversity. Ignoring these impacts can lead to drastic differences in diversity patterns and estimates of the strength of biological rules. To help overcome these issues, we assembled PHYLACINE, The Phylogenetic Atlas of Mammal Macroecology. This taxonomically integrated platform contains phylogenies, range maps, trait data, and threat status for all 5,831 known mammal species that lived since the last interglacial (~130,000 years ago until present). PHYLACINE is ready to use directly, as all taxonomy and metadata are consistent across the different types of data, and files are provided in easy-to-use formats. The atlas includes both maps of current species ranges and present natural ranges, which represent estimates of where species would live without anthropogenic pressures. Trait data include body mass and coarse measures of life habit and diet. Data gaps have been minimized through extensive literature searches and clearly labelled imputation of missing values. The PHYLACINE database will be archived here as well as hosted online so that users may easily contribute updates and corrections to continually improve the data. This database will be useful to any researcher who wishes to investigate large-scale ecological patterns. Previous versions of the database have already provided valuable information and have, for instance, shown that megafauna extinctions caused substantial changes in vegetation structure and nutrient transfer patterns across the globe. All data and metadata provided here represent PHYLACINE Version 1.2.0.
Source: https://search.gesis.org/research_data/datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de441083
Abstract (en): Of the 14 nations included in the original study, these data cover the following ten: Brazil, Cuba, Dominican Republic, India, Israel, Nigeria, Panama, United States, West Germany, and Yugoslavia. (The data for Egypt, Japan, the Philippines, and Poland are not available through ICPSR.) In India and Israel the interviews were conducted in two waves, with different samples. Besides ascertaining the usual personal information, the study employed a "Self-Anchoring Striving Scale," an open-ended scale asking the respondent to define hopes and fears for self and the nation, to determine the two extremes of a self-defined spectrum on each of several variables. After these subjective ratings were obtained, the respondents indicated their perceptions of where they and their nations stood on a hypothetical ladder at three different points in time. Demographic variables include the respondents' age, gender, marital status, and level of education. For more information on the samples, coding, and the means of measurement, see the related publication listed below. ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection: checked for undocumented or out-of-range codes. The universe is the adult population of Brazil, Cuba, Dominican Republic, India, Israel, Nigeria, Panama, United States, West Germany, and Yugoslavia. Separate samples were drawn in each country. All samples were intended to be national in scope, except for the kibbutz sample in Israel. However, both India samples underrepresent females, and the sample from Cuba was drawn exclusively from urban areas.
In addition, the samples from Brazil, Cuba, the Dominican Republic, India, Nigeria, Panama, and the United States were weighted to achieve the intended representation. 2006-01-12: All files were removed from dataset 13 and flagged as study-level files, so that they will accompany all downloads. (1) Because the original data format included some multiply punched variables, it is inappropriate to assume that the first response of a multiple response variable is more important than the rest: the current order of responses is an artifact of the technology used to record and recover them. It is even possible to have a missing data code followed by further substantive responses in some cases. (2) These data files were originally released separately, under ICPSR study numbers 7023-7031, 7085-7086, and 7258. They are now concatenated into one data collection as 7023. References in the codebooks to the old study numbers should be ignored. (3) The codebooks are also available together in one bound volume available upon request from ICPSR. (4) The codebook is provided by ICPSR as a Portable Document Format (PDF) file. The PDF file format was developed by Adobe Systems Incorporated and can be accessed using PDF reader software, such as Adobe Acrobat Reader. Information on how to obtain a copy of the Acrobat Reader is provided on the ICPSR Web site.
Some of the information in the open data files below may not yet reflect the data used to calculate the most recent tax year's property value. If you see missing or incorrect info about your property, use this form to contact OPA to report the issue. Property characteristic and assessment history from the Office of Property Assessment for all properties in Philadelphia. See more information on how OPA assesses property and their reports on the quality of assessments. This data updates nightly. Please ignore the 'created by' date below - the date of August 2015 shows when this webpage, not the data, was created.
GPS locations for an adult female wild boar

Sequence of locations for an adult female wild boar fitted with a GPS collar. Values are times in minutes and co-ordinates in metres (from an arbitrary origin). Data are extracted from the study described in Quy, R. J., Massei, G., Lambert, M. S., Coats, J., Miller, L. A., and Cowan, D. P. (2014) Effects of a GnRH vaccine on the movement and activity of free-living wild boar (Sus scrofa), Wildlife Research 41, 185-193.

File: WildBoarMEE.txt
This data package describes long-term trends in metrics describing population stability and used as statistical early warnings of regime shifts in 29 fish species that inhabit the San Francisco Bay-Delta in central California, USA. Metrics used in this study include spatial synchrony, temporal coefficient of variation (CV), and lag-1 temporal autocorrelation. Trends were measured using ordinary least squares linear regression.
These derived data were developed from abundance (as CPUE) time series based on three long-term fish monitoring studies included in https://doi.org/10.6073/pasta/a29a6e674b0f8797e13fbc4b08b92e5b; the Fall Midwater Trawl Survey, Delta Juvenile Monitoring Program, and Bay Study. Selected data were from fall months (September to December) in 1980-2023, from midwater trawl and beach seine surveys for which sampling effort (e.g., tow volume) was recorded. Data on fish exceeding maximum length thresholds for age-0 fish were discarded, except for white sturgeon, where the maximum length threshold corresponded to approximately 10 years of age, the onset of reproductive maturity. Observations from different sampling stations were aggregated into 10 sub-regions (South San Francisco Bay, Central San Francisco Bay, San Pablo Bay, Napa River, Suisun Bay, Delta Confluence, South Delta, North Delta, San Joaquin River, Sacramento River), and midwater trawl samples and beach seine samples were considered separately because the methods sample distinct habitat types. Combinations of sub-region and sampling method were considered distinct spatial units.
EWI metrics were measured in 5-year rolling windows to permit assessment of changes over time. The temporal CV and lag-1 autocorrelation were measured on individual spatial unit time series, ignoring windows with >1 year of missing data. The coefficient of variation divides the standard deviation by the mean. Lag-1 autocorrelation was measured as Pearson correlation. Spatial synchrony was measured across spatial units, ignoring spatial units with >1 year of missing data, and ignoring rolling windows where <3 spatial units had sufficient data. Spatial synchrony was measured as the mean of pairwise Spearman correlations. Trends in EWI metrics were measured only when there were at least 5 rolling window measurements spanning at least 10 years.
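For a single 5-year window, the three metrics defined above can be sketched as follows; this is a simplified illustration of the definitions, not the package's code, and the CPUE values are hypothetical:

```python
import numpy as np
import pandas as pd

# One 5-year window of CPUE for three hypothetical spatial units.
window = pd.DataFrame({
    "unit_a": [2.0, 3.0, 2.5, 4.0, 3.5],
    "unit_b": [1.0, 1.5, 1.2, 2.0, 1.8],
    "unit_c": [5.0, 4.0, 4.5, 3.0, 3.2],
})

# Temporal CV: standard deviation divided by the mean, per spatial unit.
cv = window.std(ddof=1) / window.mean()

# Lag-1 autocorrelation as a Pearson correlation with the shifted series.
lag1 = {c: window[c].autocorr(lag=1) for c in window}

# Spatial synchrony: mean of pairwise Spearman correlations across units.
rho = window.corr(method="spearman").to_numpy()
pairs = rho[np.triu_indices_from(rho, k=1)]
synchrony = pairs.mean()
```

Rolling these computations across overlapping 5-year windows, then regressing each metric on window midpoint year, yields the OLS trends described above.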
Our proprietary intent data is more expansive than what is available from data co-ops or single-source providers, delivering a comprehensive base for your advanced intent analysis. We monitor intent behavior by both executive and managerial customer personas, to help you develop a complete picture of an organization's buying dynamics.
Our exclusive Identity Graph technology goes beyond simple reverse IP lookup to identify small and midsize companies that do not have dedicated IP addresses. Our advanced triangulation technologies are based on dozens of variables and pinpoint accounts, locations, and specific individuals who are expressing intent. This critical intent intelligence is either missing or ignored in most other data streams.
Our AI, machine learning, and natural language analysis of content identifies precise topical interest and maps intent activity to our taxonomy of more than 7,000 B2B topics. And we can easily add new topics based upon customer requirements.
The True Influence Relevance Engine™ analyzes intent activity on more than just frequency. We include activity type, topical relevance, and historical trends to find patterns that make intent a strategic differentiator for your solution. Our intent data can take your data-driven sales and marketing solution or service to the next level.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data.
License: Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The present data compilation includes ciliate growth rates, grazing rates, and gross growth efficiencies determined either in the field or in laboratory experiments. From the existing literature, we synthesized all data that we could find on ciliates. Some sources might be missing but none were purposefully ignored. Field data on microzooplankton grazing mostly comprise grazing rates obtained using the dilution technique with a 24 h incubation period. Laboratory grazing and growth data are focused on pelagic ciliates and heterotrophic dinoflagellates. The experiments measured grazing or growth as a function of prey concentration or at saturating prey concentration (maximal grazing rate). When considering every single data point available (each measured rate for a defined predator-prey pair and a certain prey concentration), there is a total of 1,485 data points for the ciliates, counting experiments that measured growth and grazing simultaneously as one data point.