In this paper we develop a local, distributed, privacy-preserving algorithm for feature selection in a large peer-to-peer environment. Feature selection is often used in machine learning for data compaction and efficient learning, and to mitigate the curse of dimensionality. Many solutions exist for feature selection when the data is located at a central location. However, it becomes extremely challenging to perform the same task when the data is distributed across a large number of peers or machines. Centralizing the entire dataset, or portions of it, can be very costly and impractical because of the large number of data sources, the asynchronous nature of peer-to-peer networks, the dynamic nature of the data and network, and privacy concerns. The solution proposed in this paper performs feature selection in an asynchronous fashion with low communication overhead, where each peer can specify its own privacy constraints. The algorithm works through local interactions among participating nodes. We present results on real-world datasets in order to demonstrate the performance of the proposed algorithm.
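The abstract does not reproduce the paper's protocol, but its local building block — each peer ranking features on its own data before any communication — can be sketched as follows. The information-gain score, the toy data, and all function names below are illustrative assumptions, not the paper's actual algorithm:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_col, labels):
    """Reduction in label entropy after splitting on one discrete feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature_col):
        subset = [y for x, y in zip(feature_col, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def select_features(rows, labels, k):
    """Rank features by information gain on this peer's local data; keep top k."""
    scores = [(info_gain([r[j] for r in rows], labels), j)
              for j in range(len(rows[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Tiny example: feature 0 predicts the label perfectly, feature 1 is noise.
rows   = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = [0, 0, 1, 1]
print(select_features(rows, labels, 1))  # -> [0]
```

In the distributed setting, such local scores would then be combined across neighboring peers under each peer's privacy constraints; that aggregation step is the substance of the paper and is not sketched here.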
This dataset contains a selection of six socioeconomic indicators of public health significance and a “hardship index,” by Chicago community area, for the years 2008 – 2012. The indicators are the percent of occupied housing units with more than one person per room (i.e., crowded housing); the percent of households living below the federal poverty level; the percent of persons in the labor force over the age of 16 years that are unemployed; the percent of persons over the age of 25 years without a high school diploma; the percent of the population under 18 or over 64 years of age (i.e., dependency); and per capita income. Indicators for Chicago as a whole are provided in the final row of the table. See the full dataset description for more information at: https://data.cityofchicago.org/api/views/fwb8-6aw5/files/A5KBlegGR2nWI1jgP6pjJl32CTPwPbkl9KU3FxlZk-A?download=true&filename=P:\EPI\OEPHI\MATERIALS\REFERENCES\ECONOMIC_INDICATORS\Dataset_Description_socioeconomic_indicators_2012_FOR_PORTAL_ONLY.pdf
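As a quick illustration of how this table might be consumed programmatically, the snippet below ranks community areas by the hardship index. The two sample rows and the column names are illustrative placeholders, not values from the dataset; consult the linked dataset description for the real schema:

```python
import csv
import io

# Illustrative placeholder rows; real column names and values come from the portal export.
sample = """community_area,percent_households_below_poverty,per_capita_income,hardship_index
Area A,56.5,8201,98
Area B,12.9,88669,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Rank community areas from most to least hardship.
ranked = sorted(rows, key=lambda r: int(r["hardship_index"]), reverse=True)
print([r["community_area"] for r in ranked])  # -> ['Area A', 'Area B']
```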
As of 2024, around 72 percent of organizations ran databases (NoSQL, SQL, etc.) in Kubernetes environments. Additionally, 67 percent of organizations ran analytics workloads (data processing/ELT/ETL).
This dataset provides information on MCG recruitment and selection activities, including the volume of applications received for each job vacancy, the number of applicants hired, applicant statuses, and the type of hires (Permanent, Temporary, Rehire) for the respective fiscal year. Update Frequency: Annually
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We consider estimating binary response models on an unbalanced panel, where the outcome of the dependent variable may be missing due to nonrandom selection, or there is self-selection into a treatment. In the present paper, we first consider estimation of sample selection models and treatment effects using a fully parametric approach, where the error distribution is assumed to be normal in both primary and selection equations. Arbitrary time dependence in errors is permitted. Estimation of both coefficients and partial effects, as well as tests for selection bias, are discussed. Furthermore, we consider a semiparametric estimator of binary response panel data models with sample selection that is robust to a variety of error distributions. The estimator employs a control function approach to account for endogenous selection and permits consistent estimation of scaled coefficients and relative effects.
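For concreteness, one standard parametric formulation consistent with this setup — a panel sample selection model with normal errors; the paper's exact specification and notation may differ — is:

```latex
\[
\begin{aligned}
s_{it} &= \mathbf{1}\{ z_{it}\gamma + v_{it} > 0 \} && \text{(selection equation)} \\
y_{it} &= \mathbf{1}\{ x_{it}\beta + u_{it} > 0 \}, \quad \text{observed only if } s_{it} = 1 && \text{(primary equation)}
\end{aligned}
\]
```

Correlation between $u_{it}$ and $v_{it}$ induces selection bias; a control function (for example, the inverse Mills ratio $\lambda(z_{it}\hat\gamma)$ from a first-stage probit) can be added to the primary equation to account for endogenous selection.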
Simulation code for Warren et al. 2019 - Journal of Biogeography
Simulation code to accompany Warren et al. 2019, examining the relationship between discrimination accuracy and functional accuracy for ENM/SDM studies.
sim-code-Warren-et-al-2019-master.zip
File S1
1) AlphaDrop: executable for Linux
2) macs: MaCS executable for Linux
3) msformatter: MaCS executable for Linux
4) Seed.txt: a file containing a random seed for initialising AlphaDrop
5) RunMacs.sh: a shell script called by AlphaDrop when it runs MaCS
6) AlphaDropSpec.txt: the specification file for AlphaDrop
7) Pedigree.txt: an example externally supplied pedigree file
8) MaCsSimulationParameters.xlsx: an Excel sheet with which MaCS parameters can be calculated
9) Ne100.sh: example of what to put into RunMacs.sh (Ne100 population of Hickey et al., 2011 Genetics Selection Evolution)
10) Ne1000.sh: example of what to put into RunMacs.sh (Ne1000 population of Hickey et al., 2011 Genetics Selection Evolution)
FileS1.zip
Simulated Data - Part 1
Ten replicates of a livestock data structure were simulated. The structure was designed to cover a spectrum of QTL distributions, relationship structures, and SNP densities and to mimic some of the scenarios where genomic selection is ap...
During a survey carried out among decision-makers in charge of customer engagement/retention strategy in 20 countries worldwide, 84 percent of respondents stated that they thought it was important or critical to collect customer channel engagement data; three in four respondents named real-time experience in this context.
At Echo, our dedication to data curation is unmatched; we focus on providing our clients with an in-depth picture of a physical location based on activity in and around the point of interest (POI) over time. Our dataset empowers you to explore the cross-shopping patterns from your visitors by allowing you to dig deeper into consumer profiles, eliminate gaps in your trade area and discover untapped sites of action.
This sample of our Market Analysis solution helps you determine the geographical reach of your store or facility based on the brands or categories most visited by consumers who visit your specific POI. This empowers your location strategy. This particular dataset is for Europe.
Additional Information:
Information about our country offering and data schema can be found here:
1) Data Schema: https://docs.echo-analytics.com/activity/data-schema
2) Country Availability: https://docs.echo-analytics.com/activity/country-coverage
3) Methodology: https://docs.echo-analytics.com/activity/methodology
Echo's commitment to customer service is evident in our exceptional data quality and dedicated team, providing 360° support throughout your location intelligence journey. We handle the complex tasks to deliver analysis-ready datasets to you.
Business Needs:
- Site Selection and Lease Renegotiation: Leverage foot traffic data for optimal site selection and advantageous lease renegotiations. This approach enables you to pinpoint ideal store locations and secure lease terms that align with business objectives, optimizing operational efficiency and cost-effectiveness.
- Market Intelligence: Outsmart your competition by understanding competitor foot traffic trends, allowing you to identify growth opportunities and gain a competitive advantage. Analyze regional consumer behaviors and preferences to pinpoint new markets and assess the competitive landscape for strategic expansion.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The usefulness of genomic prediction in crop and livestock breeding programs has prompted efforts to develop new and improved genomic prediction algorithms, such as artificial neural networks and gradient tree boosting. However, the performance of these algorithms has not been compared in a systematic manner using a wide range of datasets and models. Using data on 18 traits across six plant species with different marker densities and training population sizes, we compared the performance of six linear and six non-linear algorithms. First, we found that hyperparameter selection was necessary for all non-linear algorithms and that feature selection prior to model training was critical for artificial neural networks when the markers greatly outnumbered the training lines. Across all species and trait combinations, no single algorithm performed best; however, predictions based on a combination of results from multiple algorithms (i.e., ensemble predictions) performed consistently well. While linear and non-linear algorithms performed best for a similar number of traits, the performance of non-linear algorithms varied more between traits. Although artificial neural networks did not perform best for any trait, we identified strategies (i.e., feature selection, seeded starting weights) that boosted their performance to near the level of other algorithms. Our results highlight the importance of algorithm selection for the prediction of trait values.
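The ensemble idea mentioned above can be sketched in a few lines — here an unweighted average of per-line predictions from several fitted models. The algorithm names and numbers are hypothetical placeholders, not results from the study:

```python
def ensemble_predict(predictions):
    """Equal-weight average of per-line predictions from several algorithms."""
    n_models = len(predictions)
    return [sum(vals) / n_models for vals in zip(*predictions)]

# Hypothetical trait predictions for three lines from three algorithms.
ridge = [1.0, 2.0, 3.0]
gbm   = [1.2, 1.8, 3.4]
ann   = [0.8, 2.2, 2.8]

print(ensemble_predict([ridge, gbm, ann]))  # first two entries: 1.0 and 2.0
```

More refined variants weight each algorithm by its cross-validated accuracy, but even this equal-weight average captures why ensembles performed consistently well: it smooths out the trait-to-trait variability of the individual algorithms.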
The relations between unobserved events and observed outcomes can be characterized by a bipartite graph. We propose an algorithm that explores the structure of the graph to construct the "exact Core Determining Class," i.e., the set of irredundant inequalities. We prove that in general the exact Core Determining Class does not depend on the probability measure of the outcomes but only on the structure of the graph. For more general linear inequality selection problems, we propose a statistical procedure similar to the Dantzig Selector to select the truly informative constraints. We demonstrate the performance of our procedures in Monte Carlo experiments.
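For reference, the classical Dantzig Selector that the proposed selection procedure resembles solves, for a design matrix $X$, response $y$, and tuning parameter $\lambda$:

```latex
\[
\min_{\beta} \|\beta\|_1 \quad \text{subject to} \quad \bigl\| X^{\top} (y - X\beta) \bigr\|_{\infty} \le \lambda,
\]
```

i.e., among all coefficient vectors whose residuals are nearly uncorrelated with every predictor, it picks the sparsest in the $\ell_1$ sense. How this carries over to selecting informative inequality constraints is the paper's contribution and is not reproduced here.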
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data period selection for the EU ETS and China’s carbon trading pilots.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We focus on the general partially linear model without any structure assumption on the nonparametric component. For such a model with both linear and nonlinear predictors being multivariate, we propose a new variable selection method. Our new method is a unified approach in the sense that it can select both linear and nonlinear predictors simultaneously by solving a single optimization problem. We prove that the proposed method achieves consistency. Both simulation examples and a real data example are used to demonstrate the new method’s competitive finite-sample performance. Supplementary materials for this article are available online.
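A sketch of the model class in question (the notation here is assumed for illustration, not taken from the article):

```latex
\[
y_i = x_i^{\top}\beta + g(z_i) + \varepsilon_i, \qquad i = 1, \dots, n,
\]
```

where $x_i$ is the multivariate linear predictor, $g$ is an unknown (unstructured) nonparametric function of the multivariate predictor $z_i$, and the method selects the relevant components of both $x_i$ and $z_i$ by solving a single optimization problem.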
Xverum’s Point of Interest (POI) Data is a comprehensive dataset of 230M+ verified locations, covering businesses, commercial properties, and public places across 5000+ industry categories. Our dataset enables retailers, investors, and GIS professionals to make data-driven decisions for business expansion, location intelligence, and geographic analysis.
With regular updates and continuous POI discovery, Xverum ensures your mapping and business location models have the latest data on business openings, closures, and geographic trends. Delivered in bulk via S3 Bucket or cloud storage, our dataset integrates seamlessly into geospatial analysis, market research, and navigation platforms.
🔥 Key Features:
📌 Comprehensive POI Coverage ✅ 230M+ global business & location data points, spanning 5000+ industry categories. ✅ Covers retail stores, corporate offices, hospitality venues, service providers & public spaces.
🌍 Geographic & Business Location Insights ✅ Latitude & longitude coordinates for accurate mapping & navigation. ✅ Country, state, city, and postal code classifications. ✅ Business status tracking – Open, temporarily closed, permanently closed.
🆕 Continuous Discovery & Regular Updates ✅ New business locations & POIs added continuously. ✅ Regular updates to reflect business openings, closures & relocations.
📊 Rich Business & Location Data ✅ Company name, industry classification & category insights. ✅ Contact details, including phone number & website (if available). ✅ Consumer review insights, including rating distribution (optional feature).
📍 Optimized for Business & Geographic Analysis ✅ Supports GIS, navigation systems & real estate site selection. ✅ Enhances location-based marketing & competitive analysis. ✅ Enables data-driven decision-making for business expansion & urban planning.
🔐 Bulk Data Delivery (NO API) ✅ Delivered in bulk via S3 Bucket or cloud storage. ✅ Available in structured formats (.csv, .json, .xml) for seamless integration.
🏆 Primary Use Cases:
📈 Business Expansion & Market Research 🔹 Identify key business locations & competitors for strategic growth. 🔹 Assess market saturation & regional industry presence.
📊 Geographic Intelligence & Mapping Solutions 🔹 Enhance GIS platforms & navigation systems with precise POI data. 🔹 Support smart city & infrastructure planning with location insights.
🏪 Retail Site Selection & Consumer Insights 🔹 Analyze high-traffic locations for new store placements. 🔹 Understand customer behavior through business density & POI patterns.
🌍 Location-Based Advertising & Geospatial Analytics 🔹 Improve targeted marketing with location-based insights. 🔹 Leverage geographic data for precision advertising & customer segmentation.
💡 Why Choose Xverum’s POI Data? - 230M+ Verified POI Records – One of the largest & most structured business location datasets available. - Global Coverage – Spanning 249+ countries, covering all major business categories. - Regular Updates & New POI Discoveries – Ensuring accuracy. - Comprehensive Geographic & Business Data – Coordinates, industry classifications & category insights. - Bulk Dataset Delivery (NO API) – Direct access via S3 Bucket or cloud storage. - 100% GDPR & CCPA-Compliant – Ethically sourced & legally compliant.
Access Xverum’s 230M+ POI Data for business location intelligence, geographic analysis & market research. Request a free sample or contact us to customize your dataset today!
Most U.S. consumers are open to sharing information with insurance providers, although a 2019 survey finds that this willingness quickly decreases the more personal the information becomes. According to the survey, around two-thirds of consumers would be willing to share driving and claims history. However, just 31 percent of respondents are willing to share social media information, and only 28 percent are comfortable sharing mobile phone data.
During the second quarter of 2024, the largest number of Smart Home mobile applications examined reported crash data to their publishers. Overall, 325 mobile apps in this category collected crash reports for functionality analytics. Approximately 294 apps collected e-mail addresses, while 286 collected product interaction data from their users. Smart Home applications can serve several functions, from regulating homes' thermostats to operating motion sensors and pet cameras.
Phenotypic data on flowering time, size, and relative fitness. See the read me file.
Dryad_control_data.txt
carter-houle-evol2011-description: This file contains descriptions of data column headings in other files. It is attached to each other file as a readme.
carter-houle-evol2011-U1: Data for the U1 line as described in the paper.
carter-houle-evol2011-U2: Data for the U2 line as described in the paper.
carter-houle-evol2011-D1: Data for the D1 line as described in the paper.
carter-houle-evol2011-D2: Data for the D2 line as described in the paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These data sets are used in the linked publication, which proposes the two novel approaches mutual forest impact (MFI) and mutual impurity reduction (MIR). Simulation study 1 was conducted to analyze the bias of importance and relation measures and contains two null scenarios, one with an increasing number of expression possibilities (A) and one with increasing minor allele frequencies (B). For each scenario, a classification, a regression, and a survival outcome were simulated. The data contains scripts for the simulation and the simulated data. Simulation study 2 was conducted to analyze the selection of variables in the presence of correlations. The data contains scripts for the simulation and the simulated data. Simulation study 3 was conducted to compare the feature selection approaches under realistic correlation structures. It is based on a realistic covariance matrix (mvn.RData) generated from an RNA-microarray dataset of breast cancer patients with 12,592 genes obtained from The Cancer Genome Atlas. The data contains only scripts for the simulation. The data of the real data application is published in two csv files: "vcf.csv" contains the SNP data of the subset of the plastid genome data set of Solanum Section Petota species (Huang et al., 2019) in a variant call format (VCF) file. For this, multiple sequence alignments of 43 genes were conducted with QIAGEN CLC Genomics Workbench 22.0.2 (digitalinsights.qiagen.com), and SNP-sites was subsequently used to generate VCF files. These files were merged into a file of 257 SNPs for further analysis. "vcf_input_withCountry.csv" contains the same data but with an additional country category, in a ready-to-use format for further analysis.
https://doi.org/10.5061/dryad.3r2280ggv
Data was collected for the analysis of the evolutionary relationships among milkweeds. The remaining data was used to test the PickMe algorithm for sample selection in the context of phylogenomic analysis.
Data Descriptions
- Milkweed-Sequence-Files.zip: Contains sequence data for the analysis. By the time of publication, all sequences will be referenced on GenBank.
- estimated-gene-trees-NJ-Uncorrected and estimated-gene-trees-RAxML: Contain all estimated milkweed gene trees as described in the associated article. Sample names were cleaned up for the main manuscript; a log for matching is listed in a text file.
- OldSpeciesTree.cf.tree: The species tree referenced in the paper, based ...