Collect and combine data from multiple internal and external data sources for exposure to consumers. Data for any individual is made available via a standard set of hierarchical HTTP resources through the Read Service. The VRS calls the ISIC external Producer endpoints to fetch and aggregate Care Coordinator Profiles VLER document type data and convert it to an XML Atom feed format for the Consumer.
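As a rough illustration of this aggregation pattern, the sketch below pulls profile records from a set of producer endpoints and renders them as an Atom feed. The endpoint URLs, JSON field names, and feed metadata are illustrative assumptions, not the actual ISIC/VRS interface.

```python
# Minimal sketch (Python): aggregate records from several producer endpoints and
# expose them as an Atom feed. Endpoint URLs, field names, and the JSON shape are
# illustrative assumptions, not the actual ISIC/VRS contract.
import json
import urllib.request
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

PRODUCER_ENDPOINTS = [
    "https://producer-a.example/profiles",   # hypothetical endpoint
    "https://producer-b.example/profiles",   # hypothetical endpoint
]
ATOM_NS = "http://www.w3.org/2005/Atom"

def fetch_profiles(url):
    """Fetch a list of profile records (assumed to be a JSON array) from one producer."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def build_atom_feed(records):
    """Render the aggregated records as an Atom feed document."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(f"{{{ATOM_NS}}}feed")
    ET.SubElement(feed, f"{{{ATOM_NS}}}title").text = "Care Coordinator Profiles"
    ET.SubElement(feed, f"{{{ATOM_NS}}}updated").text = datetime.now(timezone.utc).isoformat()
    for rec in records:
        entry = ET.SubElement(feed, f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = str(rec.get("id", ""))
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = rec.get("name", "profile")
        ET.SubElement(entry, f"{{{ATOM_NS}}}content").text = json.dumps(rec)
    return ET.tostring(feed, encoding="unicode")

if __name__ == "__main__":
    aggregated = []
    for url in PRODUCER_ENDPOINTS:
        aggregated.extend(fetch_profiles(url))
    print(build_atom_feed(aggregated))
```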
Envestnet® | Yodlee®'s Retail Transaction Data (Aggregate/Row) Panels consist of de-identified, near-real-time (T+1) USA credit/debit/ACH transaction-level data, offering a wide view of the consumer activity ecosystem. The underlying data is sourced from end users leveraging the aggregation portion of the Envestnet® | Yodlee® financial technology platform.
Envestnet | Yodlee Consumer Panels (Aggregate/Row) include data relating to millions of transactions, including ticket size and merchant location. The dataset includes de-identified credit/debit card and bank transactions (such as a payroll deposit, account transfer, or mortgage payment). Our coverage offers insights into areas such as consumer, TMT, energy, REITs, internet, utilities, ecommerce, MBS, CMBS, equities, credit, commodities, FX, and corporate activity. We apply rigorous data science practices to deliver key KPIs daily that are focused, relevant, and ready to put into production.
We offer free trials. Our team is available to provide support for loading, validation, sample scripts, or other services you may need to generate insights from our data.
Investors, corporate researchers, and corporates can use our data to answer key business questions such as:
- How much are consumers spending with specific merchants/brands, and how is that changing over time?
- Is the share of consumer spend at a specific merchant increasing or decreasing?
- How are consumers reacting to new products or services launched by merchants?
- For loyal customers, how is the share of spend changing over time?
- What is the company's market share in a region for similar customers?
- Is the company's loyal user base increasing or decreasing?
- Is the lifetime customer value increasing or decreasing?
Additional Use Cases:
- Use spending data to analyze sales/revenue broadly (sector-wide) or granularly (company-specific). Historically, our tracked consumer spend has correlated above 85% with company-reported data from thousands of firms. Users can sort and filter by many metrics and KPIs, such as sales and transaction growth rates and online or offline transactions, as well as view customer behavior within a geographic market at a state or city level.
- Reveal cohort consumer behavior to decipher long-term behavioral consumer spending shifts. Measure market share, wallet share, loyalty, consumer lifetime value, retention, demographics, and more.
- Study the effects of inflation via metrics such as increased total spend, ticket size, and number of transactions.
- Seek out alpha-generating signals or manage your business strategically with essential, aggregated transaction and spending data analytics.
Use Case Categories (our data supports countless use cases, and we look forward to working with new ones): 1. Market Research: Company Analysis, Company Valuation, Competitive Intelligence, Competitor Analysis, Competitor Analytics, Competitor Insights, Customer Data Enrichment, Customer Data Insights, Customer Data Intelligence, Demand Forecasting, Ecommerce Intelligence, Employee Pay Strategy, Employment Analytics, Job Income Analysis, Job Market Pricing, Marketing, Marketing Data Enrichment, Marketing Intelligence, Marketing Strategy, Payment History Analytics, Price Analysis, Pricing Analytics, Retail, Retail Analytics, Retail Intelligence, Retail POS Data Analysis, and Salary Benchmarking
2. Investment Research: Financial Services, Hedge Funds, Investing, Mergers & Acquisitions (M&A), Stock Picking, Venture Capital (VC)
3. Consumer Analysis: Consumer Data Enrichment, Consumer Intelligence
4. Market Data: Analytics, B2C Data Enrichment, Bank Data Enrichment, Behavioral Analytics, Benchmarking, Customer Insights, Customer Intelligence, Data Enhancement, Data Enrichment, Data Intelligence, Data Modeling, Ecommerce Analysis, Ecommerce Data Enrichment, Economic Analysis, Financial Data Enrichment, Financial Intelligence, Local Economic Forecasting, Location-based Analytics, Market Analysis, Market Analytics, Market Intelligence, Market Potential Analysis, Market Research, Market Share Analysis, Sales, Sales Data Enrichment, Sales Enablement, Sales Insights, Sales Intelligence, Spending Analytics, Stock Market Predictions, and Trend Analysis
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Protein aggregation occurs when misfolded or unfolded proteins physically bind together and can promote the development of various amyloid diseases. This study aimed to construct surrogate models for predicting protein aggregation via data-driven methods using two types of databases. First, an aggregation propensity score database was constructed by calculating scores for protein structures in the Protein Data Bank using Aggrescan3D 2.0. Feature- and graph-based models for predicting protein aggregation were then developed using this database. The graph-based model outperformed the feature-based model, achieving an R² of 0.95, although it intrinsically requires protein structures. Second, for the experimental data, a feature-based model was built using the Curated Protein Aggregation Database 2.0 to predict the aggregated intensity curves. In summary, this study suggests approaches that are more effective in predicting protein aggregation, depending on the type of descriptor and the database.
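As a rough illustration of the feature-based approach, the sketch below fits a regression model mapping per-protein descriptors to an aggregation propensity score and reports a test R². The features and scores are synthetic stand-ins; the study's actual descriptors, Aggrescan3D scores, and model choices are not reproduced here.

```python
# Minimal sketch (Python/scikit-learn) of a feature-based surrogate model that maps
# per-protein descriptors to an aggregation propensity score. All data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_proteins, n_features = 500, 20                      # e.g. composition/hydrophobicity descriptors
X = rng.normal(size=(n_proteins, n_features))
y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n_proteins)  # synthetic scores

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test R2:", r2_score(y_test, model.predict(X_test)))
```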
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains interpolated and aggregated soil and climate data for the region of North Rhine-Westphalia (Germany). The data is provided on grids of 1, 10, 25, 50 and 100 km resolution. These grids are spatial aggregations of climate data at approximately 1 km resolution and soil data at approximately 300 m resolution. The data is intended as input for crop models and therefore contains the key soil and climate variables needed to run them. Additionally, the data is specifically designed to analyze effects of scale and resolution in crop models, e.g. data aggregation effects. It has been used for several studies on spatial scales with regard to different scaling approaches, crops, crop models, model output variables, production situations and crop management, among others.
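A minimal sketch of the kind of spatial aggregation involved, assuming a simple block average from a fine grid to coarser grids; real inputs would be read from raster files rather than generated.

```python
# Minimal sketch (Python/NumPy): aggregate a fine-resolution raster to a coarser grid
# by block averaging, analogous to aggregating ~1 km climate rasters to 10/25/50/100 km
# grids. The array here is synthetic; real data would come from a GeoTIFF/NetCDF reader.
import numpy as np

def block_aggregate(raster, factor, reducer=np.nanmean):
    """Aggregate a 2-D raster by an integer factor using the given reducer."""
    rows, cols = raster.shape
    rows -= rows % factor                 # trim edges that do not fill a full block
    cols -= cols % factor
    trimmed = raster[:rows, :cols]
    blocks = trimmed.reshape(rows // factor, factor, cols // factor, factor)
    return reducer(blocks, axis=(1, 3))

fine = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=(1000, 1200))  # ~1 km cells
coarse_10km = block_aggregate(fine, 10)   # mean over 10x10 blocks
coarse_100km = block_aggregate(fine, 100)
print(fine.shape, coarse_10km.shape, coarse_100km.shape)
```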
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this study, we merge the results of two recent directions in efficiency analysis research (aggregation and bootstrap), applied, as an example, to one of the most popular point estimators of individual efficiency: the data envelopment analysis (DEA) estimator. A natural context for the methodology developed here is a study of the efficiency of a particular economic system (e.g., an industry) as a whole, or a comparison of the efficiencies of distinct groups within such a system (e.g., regulated vs. non-regulated firms or private vs. public firms). Our methodology is justified by (neoclassical) economic theory and is supported by carefully adapted statistical methods.
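For illustration, the sketch below implements the standard input-oriented, constant-returns DEA estimator of individual efficiency with a linear-programming solver and combines the scores into an output-share-weighted aggregate. It uses synthetic data and does not reproduce the paper's bootstrap procedure for inference on the aggregate.

```python
# Minimal sketch (Python/SciPy) of the input-oriented CRS DEA estimator of individual
# efficiency plus a simple output-weighted aggregate over firms. Data are synthetic;
# the bootstrap inference developed in the paper is not reproduced here.
import numpy as np
from scipy.optimize import linprog

def dea_efficiency(X, Y, o):
    """Input-oriented CRS DEA score for unit o, given inputs X (n x m) and outputs Y (n x s)."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.r_[1.0, np.zeros(n)]                          # minimize theta
    A_in = np.hstack([-X[o].reshape(m, 1), X.T])         # sum_j lam_j x_ij - theta x_io <= 0
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])          # -sum_j lam_j y_rj <= -y_ro
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.r_[np.zeros(m), -Y[o]]
    bounds = [(0, None)] * (n + 1)                       # theta >= 0, lambda >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[0]

rng = np.random.default_rng(2)
X = rng.uniform(1, 10, size=(30, 2))                     # 30 firms, 2 inputs
Y = (X.sum(axis=1) * rng.uniform(0.5, 1.0, 30)).reshape(-1, 1)  # 1 output
scores = np.array([dea_efficiency(X, Y, o) for o in range(len(X))])
weights = Y[:, 0] / Y[:, 0].sum()                        # output-share weights
print("aggregate (output-weighted) efficiency:", float(scores @ weights))
```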
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
MATLAB code to reproduce results presented in the paper "Privacy-Preserving Data Aggregation with Probabilistic Range Validation".
The source code is available as a git repository.
The source code was published by the paper's authors several years after the paper was published.
Git repository
Relevant code is stored in the src directory.
The scripts measure and visualise the various metrics shown in the paper. The settings in the scripts correspond exactly to those used to produce the results in the paper. The code is fully deterministic and gives exactly the same results each time.
Unfortunately, Figure 6 in the paper was generated with a version of this code in which the seed for the random number generator was not configured correctly, and as a result Figure 6 cannot be recreated exactly. However, the outputs of the scripts in the repository are not significantly different from the published Figure 6, and do not undermine or alter the conclusions in any significant way.
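The reproducibility point is general: stochastic scripts only give identical outputs when the random number generator seed is fixed. A tiny illustration in Python (not the repository's MATLAB code):

```python
# Minimal illustration: fixing the RNG seed makes stochastic scripts reproducible.
import numpy as np

def simulate(seed):
    rng = np.random.default_rng(seed)
    return rng.normal(size=5).round(3).tolist()

print(simulate(42))   # identical on every run
print(simulate(42))   # same output: the seed pins down the random stream
print(simulate(7))    # a different seed gives a different, but still reproducible, stream
```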
See ARTIFACT-EVALUATION.md in the root folder for detailed end-user instructions.
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7910/DVN/OOIEAO
Most measures of social conflict processes are derived from primary and secondary source reports. In many cases, reports are used to create event-level data sets by aggregating information from multiple, and often conflicting, reports to single event observations. We argue this pre-aggregation is less innocuous than it seems, costing applied researchers opportunities for improved inference. First, researchers cannot evaluate the consequences of different methods of report aggregation. Second, aggregation discards report-level information (i.e., variation across reports) that is useful in addressing measurement error inherent in event data. Therefore, we advocate that data should be supplied and analyzed at the report level. We demonstrate the consequences of using aggregated event data as a predictor or outcome variable, and how analysis can be improved using report-level information directly. These gains are demonstrated with simulated-data experiments and in the analysis of real-world data, using the newly available Mass Mobilization in Autocracies Database (MMAD).
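A small simulated illustration of the argument: an event-level covariate built by averaging conflicting reports carries measurement error that attenuates regression slopes, and the within-event spread of reports (lost after pre-aggregation) supports a standard errors-in-variables correction. All numbers are simulated; this is not the MMAD analysis itself.

```python
# Minimal sketch (Python/NumPy): why keeping report-level data helps. The within-event
# variance of conflicting reports estimates the measurement-error variance of the
# pre-aggregated (averaged) predictor, allowing a classical attenuation correction.
import numpy as np

rng = np.random.default_rng(3)
n_events, n_reports = 500, 3
true_x = rng.normal(size=n_events)                     # true event characteristic
reports = true_x[:, None] + rng.normal(scale=1.0, size=(n_events, n_reports))
y = 2.0 * true_x + rng.normal(scale=0.5, size=n_events)

x_bar = reports.mean(axis=1)                           # pre-aggregated predictor
var_xbar = np.var(x_bar, ddof=1)
naive_slope = np.cov(x_bar, y)[0, 1] / var_xbar        # attenuated by measurement error

# Report-level information: within-event variance estimates the error variance of x_bar.
err_var = reports.var(axis=1, ddof=1).mean() / n_reports
reliability = (var_xbar - err_var) / var_xbar
corrected_slope = naive_slope / reliability

print(f"naive slope: {naive_slope:.2f}, corrected: {corrected_slope:.2f} (true 2.0)")
```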
This table includes platform data for Facebook participants in the Deactivation experiment. Each row of the dataset corresponds to data from a participant’s Facebook user account. Each column contains a value, or set of values, that aggregates log data for this specific participant over a certain period of time.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data used in the forthcoming “The modifiable areal unit problem in geospatial least-cost electrification modelling” publication.
The work describes how different methods of aggregating population data affect the results produced by the Open Source Spatial Electrification Tool (OnSSET, https://github.com/OnSSET). In the initial study, three countries were assessed: Benin, Malawi and Namibia. These countries were chosen for their different national population densities and starting electrification rates. This repository includes three zipped files, one for each country, containing the 26 input files used in the study. These input files were generated with the QGIS tools published in the OnSSET repository (https://github.com/onsset). The repository also contains a file describing the naming conventions for the results and the summary files generated with OnSSET.
For more information on how to generate these datasets, please refer to the following GitHub repository https://github.com/babakkhavari/MAUP and the corresponding publication (To Be Added)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Main soil types in North Rhine-Westphalia as influenced by aggregation.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Recent changes in institutional cyberinfrastructure and collections data storage methods have dramatically improved accessibility of specimen-based data through the use of digital databases and data aggregators. This analysis of digitized fish collections in the U.S. demonstrates how information from data aggregators, in this case iDigBio, can be extracted and analyzed. Data from U.S. institutional fish collections in iDigBio were explored through a strictly programmatic approach using the ridigbio package and fishfindR web application. iDigBio facilitates the aggregation of collections data on a purely voluntary basis that requires collection staff to consent to sharing of their data. Not all collections are sharing their data with iDigBio, but the data harvested from 38 of the 143 known fish collections in the U.S. that are in iDigBio account for the majority of fish specimens housed in U.S. collections. In the 22 years since publication of the last survey providing information on these 38 collections, 1,219,168 specimen records (lots), 15,225,744 specimens, 3,192 primary types, and 32,868 records of secondary types have been added. This is an increase of 65.1% in the number of cataloged records and an increase of 56.1% in the number of specimens. In addition to providing specimen-based data for research, education, and various outreach activities, data that are accessible via data aggregators can be used to develop accurate, up-to-date reports of information on institutional collections. Such reports present collections data in an organized and accessible fashion and can guide targeted efforts by collections personnel to meet discipline-specific needs and make data more transparent to downstream users. Data from this survey will be updated and published regularly in a dynamic web application that will aid collections staff in communicating collections value while simultaneously giving stakeholders a way to explore collections holdings as they relate to the institutions in which they are housed. It is through this resource that collections will be able to leverage their data against those of similar collections to aid in the procurement of financial and institutional support.
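For readers who want to reproduce this kind of harvest outside R, the sketch below queries what I understand to be iDigBio's public search API (the same service the ridigbio package wraps); the endpoint path, query fields, and institution code are assumptions to check against the API documentation.

```python
# Minimal sketch (Python): pull specimen records from the iDigBio search API, roughly
# what ridigbio does in R. The endpoint and query field names below are assumptions;
# the institution code is a placeholder.
import json
import urllib.parse
import urllib.request

BASE = "https://search.idigbio.org/v2/search/records/"   # assumed public endpoint

def search_records(rq, limit=10):
    """Run a record query (rq) against the iDigBio search API and return parsed JSON."""
    params = urllib.parse.urlencode({"rq": json.dumps(rq), "limit": limit})
    with urllib.request.urlopen(f"{BASE}?{params}", timeout=30) as resp:
        return json.load(resp)

# Example: ray-finned fish records from one (placeholder) institution code.
result = search_records({"class": "actinopterygii", "institutioncode": "EXAMPLE"})
print(result.get("itemCount"), "matching records")
```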
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Health data and environmental data are commonly collected at different levels of aggregation. A persistent challenge of using a spatial regression model to link these data is that their associations can vary as a function of aggregation. This results in ecological fallacy if an association at one aggregation level is used for inference at another level. We address this challenge by presenting a hierarchically adaptable spatial regression model. In essence, the model extends the spatially varying coefficient model to allow the response to be count data at larger aggregation levels than that of the covariates. A Bayesian hierarchical approach is used to infer the model parameters. Robust inference and optimal prediction over geographical space and at different spatial aggregation levels are studied using simulated data sets. The spatial associations at different spatial supports are largely different, but can be efficiently inferred when prior knowledge of the associations is available. The model is applied to study hand, foot and mouth disease (HFMD) in Da Nang city, Viet Nam. A decrease in vegetated areas corresponds with elevated HFMD risk. A study of the identifiability of the parameters shows a strong need for a highly informative prior distribution. We conclude that the model is robust to the underlying aggregation levels of the calibrating data for association inference and it is ready for application in health geography.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The DIAMAS project investigates Institutional Publishing Service Providers (IPSP) in the broadest sense, with a special focus on those publishing initiatives that do not charge fees to authors or readers. To collect information on Institutional Publishing in the ERA, a survey was conducted among IPSPs between March and May 2024. This dataset contains aggregated data from the 685 valid responses to the DIAMAS survey on Institutional Publishing.
The dataset supplements D2.3 Final IPSP landscape Report Institutional Publishing in the ERA: results from the DIAMAS survey.
The data
Basic aggregate tabular data
Full individual survey responses are not being shared, to prevent the easy identification of respondents (in line with conditions set out in the survey questionnaire). This dataset contains full tables with aggregate data for all questions from the survey, with the exception of free-text responses, from all 685 survey respondents. This includes, per question, overall totals and percentages for the answers given, as well as the breakdown by both IPSP types: institutional publishers (IPs) and service providers (SPs). Tables at country level have not been shared, as cell values were often so low that respondents could potentially be identified. The data is available in csv and docx formats, with csv files grouped and packaged into ZIP files. Metadata describing data type, question type, as well as question response rate, is available in csv format. The R code used to generate the aggregate tables is made available as well.
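For illustration, the sketch below shows the kind of per-question aggregation these tables contain (counts and percentages overall and by IPSP type), written in Python with made-up column names; the code that actually produced the released tables is the R scripts listed below.

```python
# Minimal sketch (Python/pandas) of per-question aggregation: counts and percentages
# per answer option, overall and broken down by respondent type (IP vs SP).
# Column names and responses are illustrative, not the DIAMAS survey schema.
import pandas as pd

responses = pd.DataFrame({
    "respondent_type": ["IP", "IP", "SP", "IP", "SP"],
    "Q1_charges_fees": ["No", "No", "Yes", "No", "No"],
})

def aggregate_question(df, question):
    """Counts and percentages for one question, overall and per respondent type."""
    overall = df[question].value_counts().rename("n").to_frame()
    overall["pct"] = (100 * overall["n"] / overall["n"].sum()).round(1)
    by_type = (
        df.groupby("respondent_type")[question]
          .value_counts()
          .unstack(fill_value=0)
    )
    return overall, by_type

overall, by_type = aggregate_question(responses, "Q1_charges_fees")
print(overall)
print(by_type)
```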
Files included in this dataset
survey_questions_data_description.csv - metadata describing data type, question type, as well as question response rate per survey question.
tables_raw_all.zip - raw tables (csv format) with aggregated data per question for all respondents, with the exception of free-text responses. Questions with multiple answers have a table for each answer option. Zip file contains 180 csv files.
tables_raw_IP.zip - as tables_raw_all.zip, for responses from institutional publishers (IP) only. Zip file contains 180 csv files.
tables_raw_SP.zip - as tables_raw_all.zip, for responses from service providers (SP) only. Zip file contains 170 csv files.
tables_formatted_all.docx - formatted tables (docx format) with aggregated data per question for all respondents, with the exception of free-text responses. Questions with multiple answers have a table for each answer option.
tables_formatted_IP.docx - as tables_formatted_all.docx, for responses from institutional publishers (IP) only.
tables_formatted_SP.docx - as tables_formatted_all.docx, for responses from service providers (SP) only.
DIAMAS_Tables_single.R - R script used to generate raw tables with aggregated data for all single response questions
DIAMAS_Tables_multiple.R - R script used to generate raw tables with aggregated data for all multiple response questions
DIAMAS_Tables_layout.R - R script used to generate document with formatted tables from raw tables with aggregated data
DIAMAS Survey on Institutional Publishing - data availability statement (pdf)
All data are made available under a CC0 license.
Biomass, Biodiversity Effects, and Diversity for all three years: SPACEdata_Dryad.xlsx
We investigate the potential of transparency to influence committee decision-making. We present a model in which career-concerned committee members receive private information of different type-dependent accuracy, deliberate, and vote. We study three levels of transparency under which career concerns are predicted to affect behavior differently and test the model's key predictions in a laboratory experiment. The model's predictions are largely borne out: transparency negatively affects information aggregation at the deliberation and voting stages, leading to sharply different committee error rates than under secrecy. This occurs despite subjects revealing more information under transparency than theory predicts.
Aggregation of the generic tables describing the noise zones for an infrastructure, with infrastructure type ROUTE (road, R), map type C, and the LD (Lden) index.
Road infrastructure concerned: A68, C1_albi, C1_castres, D100, D1012, D13, D612, D622, D630, D631, D69, D800, D81, D84, D87, D88, D912, D926, D968, D988, D999A, D999, N112, N126, N88
Limit value exceedance maps (or "type c" maps) are the maps to be produced within the framework of the strategic noise maps (CBS) pursuant to Article 3-II-1°-c of the Decree of 24 March 2006. These are two maps representing the areas where the Lden limit values are exceeded for the year in which the maps are drawn up.
The Lden sound level indicator stands for Level day-evening-night. It corresponds to an equivalent sound level over 24 hours in which evening and night noise levels are increased by 5 and 10 dB(A), respectively, to reflect the greater discomfort caused during these periods.
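For reference, a small sketch of the standard Lden calculation described above (12 h day, 4 h evening, 8 h night, with the 5 and 10 dB(A) penalties), as defined in the EU Environmental Noise Directive; the input levels are illustrative.

```python
# Minimal sketch (Python): the standard Lden formula combining day, evening, and night
# levels with the evening/night penalties. Input values are illustrative.
import math

def lden(l_day, l_evening, l_night):
    """Day-evening-night level in dB(A) from the three period levels."""
    return 10 * math.log10(
        (12 * 10 ** (l_day / 10)
         + 4 * 10 ** ((l_evening + 5) / 10)
         + 8 * 10 ** ((l_night + 10) / 10)) / 24
    )

print(round(lden(65, 60, 55), 1))  # 65.0: all three penalized period levels happen to equal 65
```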
Aggregation obtained by the QGIS MIZOGEO plugin made available by CEREMA.
Data source by infrastructure: CEREMA.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and inferred networks accompanying the manuscript entitled “Aggregation of recount3 RNA-seq data improves the inference of consensus and context-specific gene co-expression networks”.
Authors: Prashanthi Ravichandran, Princy Parsana, Rebecca Keener, Kaspar Hansen, Alexis Battle
Affiliations: Johns Hopkins University School of Medicine, Johns Hopkins University Department of Computer Science, Johns Hopkins University Bloomberg School of Public Health
Description:
This folder includes data produced in the analysis contained in the manuscript and inferred consensus and context-specific networks from graphical lasso and WGCNA with varying numbers of edges. Contents include:
all_metadata.rds: File including meta-data columns of study accession ID, sample ID, assigned tissue category, cancer status and disease status obtained through manual curation for the 95,484 RNA-seq samples used in the study.
all_counts.rds: log2-transformed, RPKM-normalized read counts for 5,999 genes and 95,484 RNA-seq samples, which were used for dimensionality reduction and data exploration.
precision_matrices.zip: Zipped folder including networks inferred by graphical lasso for different experiments presented in the paper using weighted covariance aggregation following PC correction.
The networks are organised as follows. First, select the folder corresponding to the network of interest (for example, Blood). This folder contains two or more subfolders indicating the level of data aggregation used (for blood-specific networks, either all samples or GTEx), each of which holds precision matrices inferred across a range of penalization parameters. To view the precision matrix inferred for a particular value of the penalization parameter X, open the file labeled lambda_X.rds.
For select networks, we have included the computed centrality measures which can be accessed at centrality_X.rds for a particular value of the penalization parameter X.
We have also included .rds files that list the hub genes from the consensus networks inferred from non-cancerous samples (“normal_hubs.rds”) and from the consensus networks inferred from cancerous samples (“cancer_hubs.rds”).
The file “context_specific_selected_networks.csv” includes the networks that were selected for downstream biological interpretation based on the scale-free criterion which is also summarized in the Supplementary Tables.
WGCNA.zip: A zipped folder containing gene modules inferred from WGCNA for sequentially aggregated GTEx, SRA, and blood studies. Select the aggregated data source and the number of studies based on the folder names. For example, blood networks inferred from 20 studies can be accessed at blood/consensus/net_20. The individual networks correspond to distinct cut heights and include information on the cut height used, the genes the network was inferred over, merged module labels, and merged module colors.
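As a generic illustration of the graphical-lasso side of this workflow, the sketch below estimates a sparse precision matrix from expression-like data at a single penalty and reads edges off its non-zero off-diagonal entries. The data are synthetic, and the paper's aggregation and PC-correction steps are not reproduced.

```python
# Minimal sketch (Python/scikit-learn) of graphical-lasso network inference: estimate a
# sparse precision matrix at one penalty, then treat non-zero off-diagonal entries as edges.
# Data are synthetic stand-ins for aggregated expression.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(4)
n_samples, n_genes = 300, 40
expr = rng.normal(size=(n_samples, n_genes))          # stand-in for aggregated expression

model = GraphicalLasso(alpha=0.3).fit(expr)           # alpha plays the role of the penalization parameter
precision = model.precision_

edges = np.argwhere(np.triu(np.abs(precision) > 1e-6, k=1))
print(f"{len(edges)} edges at this penalty")
```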
This dataset collects the slides that were presented at the Data Collaborations Across Boundaries session in SciDataCon 2022, part of the International Data Week.
The following session proposal was prepared by Tyng-Ruey Chuang and submitted to SciDataCon 2022 organizers for consideration on 2022-02-28. The proposal was accepted on 2022-03-28. Six abstracts were submitted and accepted to this session. Five presentations were delivered online in a virtual session on 2022-06-21.
Data Collaborations Across Boundaries
There are many good stories about data collaborations across boundaries. We need more. We also need to share the lessons each of us has learned from collaborating with parties and communities not in our familiar circles.
By boundaries, we mean not just the regulatory borders in between the nation states about data sharing but the various barriers, readily conceivable or not, that hinder collaboration in aggregating, sharing, and reusing data for social good. These barriers to collaboration exist between the academic disciplines, between the economic players, and between the many user communities, just to name a few. There are also cross-domain barriers, for example those that lie among data practitioners, public administrators, and policy makers when they are articulating the why, what, and how of "open data" and debating its economic significance and fair distribution. This session aims to bring together experiences and thoughts on good data practices in facilitating collaborations across boundaries and domains.
The success of Wikipedia proves that collaborative content production and service, by ways of copyleft licenses, can be sustainable when coordinated by a non-profit and funded by the general public. Collaborative code repositories like GitHub and GitLab demonstrate the enormous value and mass scale of systems-facilitated integration of user contributions that run across multiple programming languages and developer communities. Research data aggregators and repositories such as GBIF, GISAID, and Zenodo have served numerous researchers across academic disciplines. Citizen science projects and platforms, for instance eBird, Galaxy Zoo, and Taiwan Roadkill Observation Network (TaiRON), not only collect data from diverse communities but also manage and release datasets for research use and public benefit (e.g. TaiRON datasets being used to improve road design and reduce animal mortality). At the same time large scale data collaborations depend on standards, protocols, and tools for building registries (e.g. Archival Resource Key), ontologies (e.g. Wikidata and schema.org), repositories (e.g. CKAN and Omeka), and computing services (e.g. Jupyter Notebook). There are many types of data collaborations. The above lists only a few.
This session proposal calls for contributions to bring forward lessons learned from collaborative data projects and platforms, especially about those that involve multiple communities and/or across organizational boundaries. Presentations focusing on the following (non-exclusive) topics are sought after:
Support mechanisms and governance structures for data collaborations across organizations/communities.
Data policies --- such as data sharing agreements, memorandum of understanding, terms of use, privacy policies, etc. --- for facilitating collaborations across organizations/communities.
Traditional and non-traditional funding sources for data collaborations across multiple parties; sustainability of data collaboration projects, platforms, and communities.
Data workflows --- collection, processing, aggregation, archiving, and publishing, etc. --- designed with considerations of (external) collaboration.
Collaborative web platforms for data acquisition, curation, analysis, visualization, and education.
Examples and insights from data trusts, data coops, as well as other formal and informal forms of data stewardship.
Debates on the pros and cons of centralized, distributed, and/or federated data services.
Practical lessons learned from data collaboration stories: failure, success, incidence, unexpected turn of event, aftermath, etc. (no story is too small!).
On the 8th of September 2022 we carried out a search in the Web of Science with the search string “(Ripley's K function) AND (forest)”. The search yielded 356 hits. We screened those 356 studies for eligibility, first based on the suitability of their article titles and second based on their abstracts (Figure S1). The 240 eligible studies were subsequently screened manually upon reading the entire article based on the following inclusion criteria: (1) The study reported on univariate Ripley's K or L statistics or else it was possible to extract those from figures or maps. (2) The study had been carried out in a woody ecosystem or a rangeland. (3) The univariate Ripley’s K statistics described the distribution of individuals from a single plant species. (4) &...
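For orientation, a minimal sketch of the univariate Ripley's K estimator these studies report, without edge correction and on simulated points; published analyses typically apply edge-corrected estimators.

```python
# Minimal sketch (Python/NumPy) of the univariate Ripley's K estimator:
# K(r) = (A / (n(n-1))) * number of ordered pairs within distance r.
# No edge correction is applied; points are simulated.
import numpy as np

def ripley_k(points, radii, area):
    """Naive (uncorrected) Ripley's K for a 2-D point pattern over the given radii."""
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)                   # exclude self-pairs
    return np.array([area * (dists <= r).sum() / (n * (n - 1)) for r in radii])

rng = np.random.default_rng(5)
pts = rng.uniform(0, 100, size=(200, 2))              # e.g. trees in a 100 m x 100 m plot
radii = np.array([5.0, 10.0, 20.0])
k_hat = ripley_k(pts, radii, area=100 * 100)
print(k_hat, "vs CSR expectation", np.pi * radii ** 2)
```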
Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
This table contains 1140 series, with data for years 1961 - 2008 (not all combinations necessarily have data for all years). This table contains data described by the following dimensions (Not all combinations are available): Geography (1 items: Canada ...) Final demand categories (42 items: Total final demand: final expenditure on gross domestic product (GDP); Personal expenditures; furniture and household appliances; Personal expenditures; motor vehicles; parts and repairs; Personal expenditures; other durable goods ...) Commodity (104 items: Total; final demand; Grains; Live animals; Other agricultural products ...).