100+ datasets found

d
An example data set for exploration of Multiple Linear Regression
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
This data set contains example data for exploration of the theory of regression based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II data base in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
B
Data Management Plan Examples Database
borealisdata.ca
search.dataone.org
Updated Aug 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rebeca Gaston Jothyraj; Shrey Acharya; Isaac Pratt; Danica Evering; Sarthak Behal (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP3/SDITUG
Dataset updated
Aug 27, 2024
Dataset provided by
Borealis
Authors
Rebeca Gaston Jothyraj; Shrey Acharya; Isaac Pratt; Danica Evering; Sarthak Behal
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Time period covered
2011 - 2024
Description
This dataset is comprised of a collection of example DMPs from a wide array of fields; obtained from a number of different sources outlined in the README. Data included/extracted from the examples included the discipline and field of study, author, institutional affiliation and funding information, location, date modified, title, research and data-type, description of project, link to the DMP, and where possible external links to related publications, grant pages, or French language versions. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
f
Clustering of samples and variables with mixed-type data
plos.figshare.com
tiff
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider (2023). Clustering of samples and variables with mixed-type data [Dataset]. http://doi.org/10.1371/journal.pone.0188274
Explore at:
tiffAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0188274
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Manuela Hummel; Dominic Edelmann; Annette Kopp-Schneider
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods ClustOfVar and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package CluMix available from https://cran.r-project.org/web/packages/CluMix.
Data Integration Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Jun 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Data Integration Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-integration-market-global-industry-analysis
Explore at:
pptx, csv, pdfAvailable download formats
Dataset updated
Jun 28, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Data Integration Market Outlook

According to our latest research, the global Data Integration market size reached USD 15.2 billion in 2024, propelled by the increasing need for seamless data management across organizations worldwide. The market is witnessing a robust growth trajectory, registering a CAGR of 11.3% from 2025 to 2033. By the end of 2033, the Data Integration market is forecasted to achieve a remarkable value of USD 40.1 billion. This growth is primarily attributed to the rapid adoption of cloud-based solutions, the proliferation of big data analytics, and the rising demand for real-time data access and management across diverse industry verticals.

One of the most significant growth factors driving the Data Integration market is the exponential rise in data volumes generated by organizations, particularly due to the widespread adoption of digital technologies. Enterprises are increasingly leveraging data integration tools and services to consolidate disparate data sources, streamline business processes, and enhance decision-making capabilities. The shift towards data-driven business models necessitates robust data integration frameworks that can manage structured, semi-structured, and unstructured data efficiently. Furthermore, the growing prevalence of IoT devices and the surge in cloud computing adoption have amplified the need for advanced data integration solutions that can handle real-time data processing and synchronization across multiple platforms.

Another key growth driver is the escalating demand for business intelligence and analytics solutions. Organizations are recognizing the strategic value of integrating data from various sources to gain actionable insights and maintain a competitive edge. Data integration solutions are increasingly being implemented to support advanced analytics, machine learning, and artificial intelligence applications. This trend is particularly pronounced in industries such as BFSI, healthcare, and retail, where timely and accurate data integration is critical for operational efficiency, regulatory compliance, and personalized customer experiences. The integration of data silos also enhances data quality, governance, and security, further fueling market growth.

The surge in regulatory requirements and data privacy mandates across regions has also contributed to the expansion of the Data Integration market. Organizations must ensure compliance with standards such as GDPR, HIPAA, and CCPA, which demand robust data management and integration practices. This has led to increased investments in data integration tools that offer features like data lineage, auditing, and secure data transfer. Additionally, the growing trend of mergers and acquisitions across industries necessitates seamless data integration to unify disparate IT systems and databases, creating further opportunities for market expansion.

From a regional perspective, North America continues to dominate the Data Integration market, accounting for the largest share in 2024, followed closely by Europe and Asia Pacific. The presence of leading technology providers, high adoption rates of advanced IT solutions, and a mature digital infrastructure in North America are key factors supporting this dominance. Meanwhile, Asia Pacific is experiencing the fastest growth, driven by rapid digital transformation initiatives, increasing investments in cloud infrastructure, and the expansion of SMEs. Latin America and the Middle East & Africa are also witnessing steady growth, albeit at a slower pace, as organizations in these regions increasingly recognize the value of data integration for business agility and innovation.

Component Analysis

The Data Integration market is segmented by component into Tools and Services, each playing a pivotal role in enabling organizations to achieve seamless data management and integration. Data integration tools are at the core of this market, providing the essential software infrastructure needed for extracting, tr
d
Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...
catalog.data.gov
data.nasa.gov
+1more
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). Distributed Anomaly Detection using 1-class SVM for Vertically Partitioned Data [Dataset]. https://catalog.data.gov/dataset/distributed-anomaly-detection-using-1-class-svm-for-vertically-partitioned-data
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
There has been a tremendous increase in the volume of sensor data collected over the last decade for different monitoring tasks. For example, petabytes of earth science data are collected from modern satellites, in-situ sensors and different climate models. Similarly, huge amount of flight operational data is downloaded for different commercial airlines. These different types of datasets need to be analyzed for finding outliers. Information extraction from such rich data sources using advanced data mining methodologies is a challenging task not only due to the massive volume of data, but also because these datasets are physically stored at different geographical locations with only a subset of features available at any location. Moving these petabytes of data to a single location may waste a lot of bandwidth. To solve this problem, in this paper, we present a novel algorithm which can identify outliers in the entire data without moving all the data to a single location. The method we propose only centralizes a very small sample from the different data subsets at different locations. We analytically prove and experimentally verify that the algorithm offers high accuracy compared to complete centralization with only a fraction of the communication cost. We show that our algorithm is highly relevant to both earth sciences and aeronautics by describing applications in these domains. The performance of the algorithm is demonstrated on two large publicly available datasets: (1) the NASA MODIS satellite images and (2) a simulated aviation dataset generated by the ‘Commercial Modular Aero-Propulsion System Simulation’ (CMAPSS).
E
The Human Know-How Dataset
dtechtive.com
find.data.gov.scot
pdf, zip
Updated Apr 29, 2016
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2016). The Human Know-How Dataset [Dataset]. http://doi.org/10.7488/ds/1394
Explore at:
pdf(0.0582 MB), zip(19.67 MB), zip(0.0298 MB), zip(9.433 MB), zip(13.06 MB), zip(0.2837 MB), zip(5.372 MB), zip(69.8 MB), zip(20.43 MB), zip(5.769 MB), zip(14.86 MB), zip(19.78 MB), zip(43.28 MB), zip(62.92 MB), zip(92.88 MB), zip(90.08 MB)Available download formats
Unique identifier
https://doi.org/10.7488/ds/1394
Dataset updated
Apr 29, 2016
Description
The Human Know-How Dataset describes 211,696 human activities from many different domains. These activities are decomposed into 2,609,236 entities (each with an English textual label). These entities represent over two million actions and half a million pre-requisites. Actions are interconnected both according to their dependencies (temporal/logical orders between actions) and decompositions (decomposition of complex actions into simpler ones). This dataset has been integrated with DBpedia (259,568 links). For more information see: - The project website: http://homepages.inf.ed.ac.uk/s1054760/prohow/index.htm - The data is also available on datahub: https://datahub.io/dataset/human-activities-and-instructions ---------------------------------------------------------------- * Quickstart: if you want to experiment with the most high-quality data before downloading all the datasets, download the file '9of11_knowhow_wikihow', and optionally files 'Process - Inputs', 'Process - Outputs', 'Process - Step Links' and 'wikiHow categories hierarchy'. * Data representation based on the PROHOW vocabulary: http://w3id.org/prohow# Data extracted from existing web resources is linked to the original resources using the Open Annotation specification * Data Model: an example of how the data is represented within the datasets is available in the attached Data Model PDF file. The attached example represents a simple set of instructions, but instructions in the dataset can have more complex structures. For example, instructions could have multiple methods, steps could have further sub-steps, and complex requirements could be decomposed into sub-requirements. ---------------------------------------------------------------- Statistics: * 211,696: number of instructions. From wikiHow: 167,232 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 44,464 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide). * 2,609,236: number of RDF nodes within the instructions From wikiHow: 1,871,468 (datasets 1of11_knowhow_wikihow to 9of11_knowhow_wikihow). From Snapguide: 737,768 (datasets 10of11_knowhow_snapguide to 11of11_knowhow_snapguide). * 255,101: number of process inputs linked to 8,453 distinct DBpedia concepts (dataset Process - Inputs) * 4,467: number of process outputs linked to 3,439 distinct DBpedia concepts (dataset Process - Outputs) * 376,795: number of step links between 114,166 different sets of instructions (dataset Process - Step Links)
d
Coresignal | Employee Data | From the Largest Professional Network | Global...
datarade.ai
.json, .csv
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Coresignal, Coresignal | Employee Data | From the Largest Professional Network | Global / 712M+ Records / 5 Years of Historical Data / Updated Daily [Dataset]. https://datarade.ai/data-products/public-resume-data-coresignal
Explore at:
.json, .csvAvailable download formats
Dataset authored and provided by
Coresignal
Area covered
Christmas Island, Latvia, Eritrea, Réunion, Russian Federation, French Guiana, Palestine, Macao, Bosnia and Herzegovina, Brunei Darussalam
Description
➡️ You can choose from multiple data formats, delivery frequency options, and delivery methods;

➡️ You can select raw or clean and AI-enriched datasets;

➡️ Multiple APIs designed for effortless search and enrichment (accessible using a user-friendly self-service tool);

➡️ Fresh data: daily updates, easy change tracking with dedicated data fields, and a constant flow of new data;

➡️ You get all necessary resources for evaluating our data: a free consultation, a data sample, or free credits for testing our APIs.

Coresignal's employee data enables you to create and improve innovative data-driven solutions and extract actionable business insights. These datasets are popular among companies from different industries, including HR and sales technology and investment.

Employee Data use cases:

✅ Source best-fit talent for your recruitment needs

Coresignal's Employee Data can help source the best-fit talent for your recruitment needs by providing the most up-to-date information on qualified candidates globally.

✅ Fuel your lead generation pipeline

Enhance lead generation with 712M+ up-to-date employee records from the largest professional network. Our Employee Data can help you develop a qualified list of potential clients and enrich your own database.

✅ Analyze talent for investment opportunities

Employee Data can help you generate actionable signals and identify new investment opportunities earlier than competitors or perform deeper analysis of companies you're interested in.

➡️ Why 400+ data-powered businesses choose Coresignal:

Experienced data provider (in the market since 2016);

Exceptional client service;

Responsible and secure data collection.
f
Collection of example datasets used for the book - R Programming -...
figshare.com
txt
Updated Dec 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24728073.v1
Dataset updated
Dec 4, 2023
Dataset provided by
figshare
Authors
Kingsley Okoye; Samira Hosseini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.
d
GapMaps Live Location Intelligence Platform | Map Data | Easy-to-use| One...
datarade.ai
.csv
Updated Aug 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
GapMaps (2024). GapMaps Live Location Intelligence Platform | Map Data | Easy-to-use| One Login for Global access [Dataset]. https://datarade.ai/data-products/gapmaps-live-location-intelligence-platform-map-data-easy-gapmaps
Explore at:
.csvAvailable download formats
Dataset updated
Aug 14, 2024
Dataset authored and provided by
GapMaps
Area covered
Malaysia, Thailand, United Arab Emirates, India, Oman, Morocco, Kenya, Egypt, United States of America, Hong Kong
Description
GapMaps Live is an easy-to-use location intelligence platform available across 25 countries globally that allows you to visualise your own store data, combined with the latest demographic, economic and population movement intel right down to the micro level so you can make faster, smarter and surer decisions when planning your network growth strategy.

With one single login, you can access the latest estimates on resident and worker populations, census metrics (eg. age, income, ethnicity), consuming class, retail spend insights and point-of-interest data across a range of categories including fast food, cafe, fitness, supermarket/grocery and more.

Some of the world's biggest brands including McDonalds, Subway, Burger King, Anytime Fitness and Dominos use GapMaps Live Map Data as a vital strategic tool where business success relies on up-to-date, easy to understand, location intel that can power business case validation and drive rapid decision making.

Primary Use Cases for GapMaps Live Map Data include:

Retail Site Selection - Identify optimal locations for future expansion and benchmark performance across existing locations.

Customer Profiling: get a detailed understanding of the demographic profile of your customers and where to find more of them.

Analyse your catchment areas at a granular grid levels using all the key metrics

Target Marketing: Develop effective marketing strategies to acquire more customers.

Marketing / Advertising (Billboards/OOH, Marketing Agencies, Indoor Screens)

Customer Profiling

Target Marketing

Market Share Analysis

Some of features our clients love about GapMaps Live Map Data include: - View business locations, competitor locations, demographic, economic and social data around your business or selected location - Understand consumer visitation patterns (“where from” and “where to”), frequency of visits, dwell time of visits, profiles of consumers and much more. - Save searched locations and drop pins - Turn on/off all location listings by category - View and filter data by metadata tags, for example hours of operation, contact details, services provided - Combine public data in GapMaps with views of private data Layers - View data in layers to understand impact of different data Sources - Share maps with teams - Generate demographic reports and comparative analyses on different locations based on drive time, walk time or radius. - Access multiple countries and brands with a single logon - Access multiple brands under a parent login - Capture field data such as photos, notes and documents using GapMaps Connect and integrate with GapMaps Live to get detailed insights on existing and proposed store locations.
Data Warehousing Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Jun 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Data Warehousing Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-warehousing-market-global-industry-analysis
Explore at:
csv, pdf, pptxAvailable download formats
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Data Warehousing Market Outlook

According to our latest research, the global data warehousing market size reached USD 29.7 billion in 2024, reflecting robust demand across a range of industries. Driven by the growing need for advanced analytics, real-time data integration, and scalable storage solutions, the market is expected to register a CAGR of 9.2% during the forecast period. By 2033, the market size is projected to reach USD 65.2 billion, underscoring the transformative impact of data-driven decision-making and digital transformation initiatives worldwide. The expansion is propelled by the proliferation of big data, cloud adoption, and the increasing complexity of business operations as organizations strive for enhanced agility and competitiveness.

A significant growth factor for the data warehousing market is the accelerating adoption of cloud-based solutions. Enterprises are increasingly migrating from traditional on-premises data warehouses to cloud-native platforms due to their scalability, cost-effectiveness, and ability to handle vast volumes of structured and unstructured data. The flexibility offered by cloud deployment enables organizations to scale resources dynamically based on workload demands, driving operational efficiencies and reducing capital expenditures. Furthermore, the integration of artificial intelligence and machine learning within cloud data warehouses is empowering businesses to extract actionable insights, automate data management tasks, and support predictive analytics, further fueling market growth.

Another key driver is the surge in demand for advanced analytics and business intelligence tools. As organizations recognize the value of data-driven decision-making, there is a heightened focus on leveraging data warehousing solutions to consolidate disparate data sources, enable real-time analytics, and foster collaboration across business units. The rise of self-service analytics platforms and intuitive data visualization tools is democratizing data access, allowing non-technical users to generate insights independently and accelerating the pace of innovation. Additionally, regulatory compliance and data governance requirements are compelling enterprises to invest in robust data warehousing infrastructure to ensure data accuracy, security, and traceability.

The rapid digital transformation across verticals such as BFSI, healthcare, retail, and manufacturing is also contributing to the expansion of the data warehousing market. In sectors like healthcare and finance, the need for secure, compliant, and high-performance data storage and analytics solutions is paramount due to the sensitive nature of the data involved. Retailers and e-commerce platforms are leveraging data warehousing to personalize customer experiences, optimize inventory management, and enhance supply chain visibility. Meanwhile, manufacturers are utilizing data warehouses to improve operational efficiency, monitor equipment performance, and drive innovation through predictive maintenance and IoT integration.

From a regional perspective, North America continues to dominate the data warehousing market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology vendors, early adoption of advanced analytics, and a strong emphasis on digital transformation among enterprises. Europe follows closely, supported by stringent data privacy regulations and increasing investments in cloud infrastructure. The Asia Pacific region is witnessing the fastest growth, driven by rapid urbanization, expanding digital economies, and government initiatives promoting smart city development and digital governance. Latin America and the Middle East & Africa are also emerging as promising markets, with organizations in these regions gradually embracing data-driven strategies to enhance competitiveness and operational resilience.

Offering Analysis

The data warehousing market by offering is segmented into ETL solutions, data warehouse databases, data warehouse software, and services. ETL (Extract, Transform, L
Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm
plos.figshare.com
docx
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tracey L. Weissgerber; Natasa M. Milic; Stacey J. Winham; Vesna D. Garovic (2023). Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm [Dataset]. http://doi.org/10.1371/journal.pbio.1002128
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pbio.1002128
Dataset updated
May 31, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Tracey L. Weissgerber; Natasa M. Milic; Stacey J. Winham; Vesna D. Garovic
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Figures in scientific publications are critically important because they often show the data supporting key findings. Our systematic review of research articles published in top physiology journals (n = 703) suggests that, as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies. Papers rarely included scatterplots, box plots, and histograms that allow readers to critically evaluate continuous data. Most papers presented continuous data in bar and line graphs. This is problematic, as many different data distributions can lead to the same bar or line graph. The full data may suggest different conclusions from the summary statistics. We recommend training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies. Investigators can quickly make univariate scatterplots for small sample size studies using our Excel templates.
Enterprise Data Management (Edm) Market Analysis North America, Europe,...
technavio.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Technavio, Enterprise Data Management (Edm) Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, Canada, China, Germany - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/enterprise-data-management-market-industry-analysis
Explore at:
Dataset provided by
TechNavio
Authors
Technavio
Time period covered
2021 - 2025
Area covered
United States, Canada, United Kingdom, Global
Description
Snapshot img

Enterprise Data Management Market Size 2024-2028

The enterprise data management market size is estimated to grow by USD 126.2 billion, at a CAGR of 16.83% between 2023 and 2028. The market is experiencing significant growth, driven by the increasing demand for data integration and visual analytics to support informed business decision-making. Technological developments, such as cloud computing, artificial intelligence, and machine learning, are revolutionizing data management processes, enabling organizations to handle large volumes of data more efficiently. However, integration challenges persist, particularly with unscalable applications and disparate data sources. Addressing these challenges requires strong EDM solutions that ensure data accuracy, security, and accessibility. The market is expected to continue its expansion, fueled by the growing recognition of data as a strategic asset and the need for organizations to derive actionable insights from their data to gain a competitive edge.

What will be the Size of the Enterprise Data Management Market During the Forecast Period?

To learn more about this enterprise data management market report, Request Free Sample

Enterprise Data Management Market Segmentation

The enterprise data management market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD Billion' for the period 2024 to 2028, as well as historical data from 2018 to 2022 for the following segments.

End-user Outlook BFSI Healthcare Manufacturing Retail Others Deployment Outlook On-premise Cloud-based Region Outlook North America The U.S. Canada Europe U.K. Germany France Rest of Europe APAC China India Middle East & Africa Saudi Arabia South Africa Rest of the Middle East & Africa South America Chile Brazil Argentina

By End User

The market share growth by the BFSI segment will be significant during the forecast period. The BFSI segment dominated the market and will continue to hold a major share of the market during the forecast period. The complete digitization of core processes, the adoption of customer-centric approaches, and the rising volume of data drive the growth of the segment. The enterprise data management market is growing with advancements in data governance, master data management (MDM), and cloud-based data management. Solutions such as data integration, big data management, and data security ensure seamless operations. Enterprise data analytics, data warehousing, and real-time data processing enhance decision-making. With data quality management, business intelligence tools, and data as a service (DaaS), businesses achieve robust insights and efficient data handling.

Get a glance at the market contribution of various segments. Request PDF Sample

The BFSI segment was valued at USD 18.30 billion in 2018. The deployment allows financial institutions to manage data generated from diverse systems and processes such as loan processing, claims management, customer data management, and financial transactions electronically. Hence, it improves customer-centricity. The deployment also allows financial institutions to address sectoral challenges, which range from compliance requirements to data management, data security, transparency, and availability across platforms, time, and geographies. The growth of the BFSI segment is also driven by the need to reduce processing costs, improve operational efficiency, and ensure adherence to compliance standards.

Moreover, solutions such applications provide enterprises with financial planning, budgeting, forecasting, and financial and operational reporting abilities. BFSI companies adopt to streamline their financial planning and budgeting processes in line with their business strategies and plans. The adoption enables funds transfer pricing and provides suitable applications for the accurate calculation of the profitability of the enterprise. Thus, the growth of the BFSI will positively impact enterprise data management market growth during the forecast period.

Regional Analysis

For more insights on the market share of various regions, Request PDF Sample now!

North America is estimated to contribute 38% to the growth of the global enterprise data management market during the forecast period. Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period. Several industries, especially in the US and Canada, are early adopters of advanced technologies. Hence, the volume of data generated is high, which necessitates its use in North America. The US is the leading market in North America. It is the technological capital of the world and is one of the early adopters of cutting-edge innov
d
Data to create and evaluate distribution models for invasive species for...
catalog.data.gov
data.usgs.gov
+1more
Updated Jul 6, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2024). Data to create and evaluate distribution models for invasive species for different geographic extents [Dataset]. https://catalog.data.gov/dataset/data-to-create-and-evaluate-distribution-models-for-invasive-species-for-different-geograp
Explore at:
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Surveyhttp://www.usgs.gov/
Description
We developed habitat suitability models for invasive plant species selected by Department of Interior land management agencies. We applied the modeling workflow developed in Young et al. 2020 to species not included in the original case studies. Our methodology balanced trade-offs between developing highly customized models for a few species versus fitting non-specific and generic models for numerous species. We developed a national library of environmental variables known to physiologically limit plant distributions (Engelstad et al. 2022 Table S1: https://doi.org/10.1371/journal.pone.0263056) and relied on human input based on natural history knowledge to further narrow the variable set for each species before developing habitat suitability models. We developed models using five algorithms with VisTrails: Software for Assisted Habitat Modeling [SAHM 2.1.2]. We accounted for uncertainty related to sampling bias by using two alternative sources of background samples, and constructed model ensembles using the 10 models for each species (five algorithms by two background methods) for three different thresholds (conservative to targeted). The mergedDataset_regionalization.csv file contains predictor values associated with pixels underlying each presence and background point. The testStripPoints_regionalization.csv file contains the locations of the modeled species occurring in the different geographic test strips.
C
Synthetic Integrated Services Data
data.wprdc.org
csv, html, pdf, zip
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Allegheny County (2024). Synthetic Integrated Services Data [Dataset]. https://data.wprdc.org/dataset/synthetic-integrated-services-data
Explore at:
csv(1375554033), html, pdf, zip(39231637)Available download formats
Dataset updated
Jun 25, 2024
Dataset provided by
Allegheny County
Description
Motivation

This dataset was created to pilot techniques for creating synthetic data from datasets containing sensitive and protected information in the local government context. Synthetic data generation replaces actual data with representative data generated from statistical models; this preserves the key data properties that allow insights to be drawn from the data while protecting the privacy of the people included in the data. We invite you to read the Understanding Synthetic Data white paper for a concise introduction to synthetic data.

This effort was a collaboration of the Urban Institute, Allegheny County’s Department of Human Services (DHS) and CountyStat, and the University of Pittsburgh’s Western Pennsylvania Regional Data Center.

Collection

The source data for this project consisted of 1) month-by-month records of services included in Allegheny County's data warehouse and 2) demographic data about the individuals who received the services. As the County’s data warehouse combines this service and client data, this data is referred to as “Integrated Services data”. Read more about the data warehouse and the kinds of services it includes here.

Preprocessing

Synthetic data are typically generated from probability distributions or models identified as being representative of the confidential data. For this dataset, a model of the Integrated Services data was used to generate multiple versions of the synthetic dataset. These different candidate datasets were evaluated to select for publication the dataset version that best balances utility and privacy. For high-level information about this evaluation, see the Synthetic Data User Guide.

For more information about the creation of the synthetic version of this data, see the technical brief for this project, which discusses the technical decision making and modeling process in more detail.

Recommended Uses

This disaggregated synthetic data allows for many analyses that are not possible with aggregate data (summary statistics). Broadly, this synthetic version of this data could be analyzed to better understand the usage of human services by people in Allegheny County, including the interplay in the usage of multiple services and demographic information about clients.

Known Limitations/Biases

Some amount of deviation from the original data is inherent to the synthetic data generation process. Specific examples of limitations (including undercounts and overcounts for the usage of different services) are given in the Synthetic Data User Guide and the technical report describing this dataset's creation.

Feedback

Please reach out to this dataset's data steward (listed below) to let us know how you are using this data and if you found it to be helpful. Please also provide any feedback on how to make this dataset more applicable to your work, any suggestions of future synthetic datasets, or any additional information that would make this more useful. Also, please copy wprdc@pitt.edu on any such feedback (as the WPRDC always loves to hear about how people use the data that they publish and how the data could be improved).

Further Documentation and Resources

1) A high-level overview of synthetic data generation as a method for protecting privacy can be found in the Understanding Synthetic Data white paper.
2) The Synthetic Data User Guide provides high-level information to help users understand the motivation, evaluation process, and limitations of the synthetic version of Allegheny County DHS's Human Services data published here.
3) Generating a Fully Synthetic Human Services Dataset: A Technical Report on Synthesis and Evaluation Methodologies describes the full technical methodology used for generating the synthetic data, evaluating the various options, and selecting the final candidate for publication.
4) The WPRDC also hosts the Allegheny County Human Services Community Profiles dataset, which provides annual updates on human-services usage, aggregated by neighborhood/municipality. That data can be explored using the County's Human Services Community Profile web site.
S
The global industrial value-added dataset under different global change...
scidb.cn
Updated Aug 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Song Wei; li huan huan; Duan Jianping; Li Han; Xue Qian; Zhang Xuyang (2024). The global industrial value-added dataset under different global change scenarios (2010, 2030, and 2050) [Dataset]. http://doi.org/10.57760/sciencedb.11406
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.11406
Dataset updated
Aug 6, 2024
Dataset provided by
Science Data Bank
Authors
Song Wei; li huan huan; Duan Jianping; Li Han; Xue Qian; Zhang Xuyang
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Temporal Coverage of Data: The data collection periods are 2010, 2030, and 2050.2. Spatial Coverage and Projection:Spatial Coverage: GlobalLongitude: -180° - 180°Latitude: -90° - 90°Projection: GCS_WGS_19843. Disciplinary Scope: The data pertains to the fields of Earth Sciences and Geography.4. Data Volume: The total data volume is approximately 31.5 MB.5. Data Type: Raster (GeoTIFF)6. Thumbnail (illustrating dataset content or observation process/scene): · 7. Field (Feature) Name Explanation:a. Name Explanation: IND: Industrial Value Addedb. Unit of Measurement: Unit: US Dollars (USD)8. Data Source Description:a. Remote Sensing Data:2010 Global Vegetation Index data (Enhanced Vegetation Index, EVI, from MODIS monthly average data) and 2010 Nighttime Light Remote Sensing data (DMSP/OLS)b. Meteorological Data:From the CMCC-CM model in the Fifth International Coupled Model Intercomparison Project (CMIP5) published by the United Nations Intergovernmental Panel on Climate Change (IPCC)c. Statistical Data:From the World Development Indicators dataset of the World Bank and various national statistical agenciesd. Gross Domestic Product Data:Sourced from the project "Study on the Harmful Processes of Population and Economic Systems under Global Change" under the National Key R&D Program "Mechanisms and Assessment of Risks in Population and Economic Systems under Global Change," led by Researcher Sun Fubao at the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciencese. Other Data:Rivers, roads, settlements, and DEM, sourced from the National Oceanic and Atmospheric Administration (NOAA), Global Risk Data Platform, and Natural Earth9. Data Processing Methods(1) Spatialization of Baseline Industrial Value Added: Using 2010 global EVI vegetation index data and nighttime light remote sensing data, we addressed the oversaturation issue in nighttime light data by constructing an adjusted nighttime light index to obtain the optimal global light data. The EANTIL model was developed using NTL, NTLn, and EVI data, with the following formula:Here, EANTLI represents the adjusted nighttime light index, NTL represents the original nighttime light intensity value, and NTLn represents the normalized nighttime light intensity value. Based on the optimal light index EANTLI and the industrial value-added data from the World Bank, we constructed a regression allocation model to derive industrial value added (I), generating the global 2010 industrial value-added data with the formula:Here, I represents the industrial value added for each grid cell, and Ii represents the industrial value added for each country, EANTLi derived from ArcGIS statistical analysis and the regression allocation model.(2) Spatial Boundaries for Future Industrial Value Added: Using the Logistic-CA-Markov simulation principle and global land use data from 2010 and 2015 (from the European Space Agency), we simulated national land use changes for 2030 and 2050 and extracted urban land data as the spatial boundaries for future industrial value added. To comprehensively characterize the influence of different factors on land use and considering the research scale, we selected elevation, slope, population, GDP, distance to rivers, and distance to roads as land use driving factors. Accuracy validation using global 2015 land use data showed an average accuracy of 91.89%.(3) Estimation of Future Industrial Value Added: Based on machine learning and using the random forest model, we constructed spatialization models for industrial value added under different climate change scenarios: Here, tem represents temperature, prep represents precipitation, GDP represents national economic output, L represents urban land, D represents slope, and P represents population. The random forest model was constructed using factors such as 2010 industrial value added, urban land distribution, elevation, slope, distances to rivers, roads, railways (considering transportation), and settlements (considering noise and environmental pollution from industrial buildings), along with temperature and precipitation as climate scenario data. Except for varying temperature and precipitation values across scenarios, other variables remained constant. The model comprised 100 decision trees, with each iteration randomly selecting 90% of the samples for model construction and using the remaining 10% as test data, achieving a training sample accuracy of 0.94 and a test sample accuracy of 0.81.By analyzing the proportion of industrial value added to GDP (average from 2000 to 2020, data from the World Bank) and projected GDP under future Shared Socioeconomic Pathways (SSPs), we derived future industrial value added for each country under different SSP scenarios. Using these projections, we constructed regression models to allocate future industrial value added proportionally, resulting in spatial distribution data for 2030 and 2050 under different SSP scenarios.10. Applications and Achievements of the Dataseta. Primary Application Areas: This dataset is mainly applied in environmental protection, ecological construction, pollution prevention and control, and the prevention and forecasting of natural disasters.b. Achievements in Application (Awards, Published Reports and Articles):Achievements: Developed a method for downscaling national-scale industrial value-added data by integrating DMSP/OLS nighttime light data, vegetation distribution, and other data. Published the global industrial value-added dataset.
Multi-race Human Body Data | 300,000 ID | Computer Vision Data| Image/Video...
datarade.ai
Updated Mar 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2024). Multi-race Human Body Data | 300,000 ID | Computer Vision Data| Image/Video Deep Learning (DL) Data [Dataset]. https://datarade.ai/data-products/nexdata-multi-race-human-body-data-300-000-id-image-vi-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Mar 16, 2024
Dataset authored and provided by
Nexdata
Area covered
Dominican Republic, El Salvador, State of, Albania, Japan, Macedonia (the former Yugoslav Republic of), Vietnam, Peru, Armenia, Latvia
Description
Specifications Data size : 200,000 ID

Race distribution : Asians, Caucasians, black people

Gender distribution : gender balance

Age distribution : ranging from teenager to the elderly, the middle-aged and young people are the majorities

Collecting environment : including indoor and outdoor scenes

Data diversity : different shooting heights, different ages, different light conditions, different collecting environment, clothes in different seasons, multiple human poses

Device : cameras

Data format : the data format is .jpg/mp4, the annotation file format is .json, the camera parameter file format is .json, the point cloud file format is .pcd

Accuracy : based on the accuracy of the poses, the accuracy exceeds 97%;the accuracy of labels of gender, race, age, collecting environment and clothes are more than 97%

About Nexdata Nexdata owns off-the-shelf PB-level Large Language Model(LLM) Data, 1 million hours of Audio Data and 800TB of Annotated Imagery Data. These ready-to-go machine learning (ML) data support instant delivery, quickly improve the accuracy of AI models. For more details, please visit us at hhttps://www.nexdata.ai/datasets/computervision?source=Datarade
Z
Data from: Open-data release of aggregated Australian school-level...
data.niaid.nih.gov
zenodo.org
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Monteiro Lobato, (2020). Open-data release of aggregated Australian school-level information. Edition 2016.1 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_46086
Explore at:
Dataset updated
Jan 24, 2020
Dataset authored and provided by
Monteiro Lobato,
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The file set is a freely downloadable aggregation of information about Australian schools. The individual files represent a series of tables which, when considered together, form a relational database. The records cover the years 2008-2014 and include information on approximately 9500 primary and secondary school main-campuses and around 500 subcampuses. The records all relate to school-level data; no data about individuals is included. All the information has previously been published and is publicly available but it has not previously been released as a documented, useful aggregation. The information includes: (a) the names of schools (b) staffing levels, including full-time and part-time teaching and non-teaching staff (c) student enrolments, including the number of boys and girls (d) school financial information, including Commonwealth government, state government, and private funding (e) test data, potentially for school years 3, 5, 7 and 9, relating to an Australian national testing programme know by the trademark 'NAPLAN'

Documentation of this Edition 2016.1 is incomplete but the organization of the data should be readily understandable to most people. If you are a researcher, the simplest way to study the data is to make use of the SQLite3 database called 'school-data-2016-1.db'. If you are unsure how to use an SQLite database, ask a guru.

The database was constructed directly from the other included files by running the following command at a command-line prompt: sqlite3 school-data-2016-1.db < school-data-2016-1.sql Note that a few, non-consequential, errors will be reported if you run this command yourself. The reason for the errors is that the SQLite database is created by importing a series of '.csv' files. Each of the .csv files contains a header line with the names of the variable relevant to each column. The information is useful for many statistical packages but it is not what SQLite expects, so it complains about the header. Despite the complaint, the database will be created correctly.

Briefly, the data are organized as follows. (a) The .csv files ('comma separated values') do not actually use a comma as the field delimiter. Instead, the vertical bar character '|' (ASCII Octal 174 Decimal 124 Hex 7C) is used. If you read the .csv files using Microsoft Excel, Open Office, or Libre Office, you will need to set the field-separator to be '|'. Check your software documentation to understand how to do this. (b) Each school-related record is indexed by an identifer called 'ageid'. The ageid uniquely identifies each school and consequently serves as the appropriate variable for JOIN-ing records in different data files. For example, the first school-related record after the header line in file 'students-headed-bar.csv' shows the ageid of the school as 40000. The relevant school name can be found by looking in the file 'ageidtoname-headed-bar.csv' to discover that the the ageid of 40000 corresponds to a school called 'Corpus Christi Catholic School'. (3) In addition to the variable 'ageid' each record is also identified by one or two 'year' variables. The most important purpose of a year identifier will be to indicate the year that is relevant to the record. For example, if one turn again to file 'students-headed-bar.csv', one sees that the first seven school-related records after the header line all relate to the school Corpus Christi Catholic School with ageid of 40000. The variable that identifies the important differences between these seven records is the variable 'studentyear'. 'studentyear' shows the year to which the student data refer. One can see, for example, that in 2008, there were a total of 410 students enrolled, of whom 185 were girls and 225 were boys (look at the variable names in the header line). (4) The variables relating to years are given different names in each of the different files ('studentsyear' in the file 'students-headed-bar.csv', 'financesummaryyear' in the file 'financesummary-headed-bar.csv'). Despite the different names, the year variables provide the second-level means for joining information acrosss files. For example, if you wanted to relate the enrolments at a school in each year to its financial state, you might wish to JOIN records using 'ageid' in the two files and, secondarily, matching 'studentsyear' with 'financialsummaryyear'. (5) The manipulation of the data is most readily done using the SQL language with the SQLite database but it can also be done in a variety of statistical packages. (6) It is our intention for Edition 2016-2 to create large 'flat' files suitable for use by non-researchers who want to view the data with spreadsheet software. The disadvantage of such 'flat' files is that they contain vast amounts of redundant information and might not display the data in the form that the user most wants it. (7) Geocoding of the schools is not available in this edition. (8) Some files, such as 'sector-headed-bar.csv' are not used in the creation of the database but are provided as a convenience for researchers who might wish to recode some of the data to remove redundancy. (9) A detailed example of a suitable SQLite query can be found in the file 'school-data-sqlite-example.sql'. The same query, used in the context of analyses done with the excellent, freely available R statistical package (http://www.r-project.org) can be seen in the file 'school-data-with-sqlite.R'.
w
Multiple Indicator Cluster Survey 2006 - Viet Nam
microdata.worldbank.org
catalog.ihsn.org
+2more
Updated Oct 26, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Social and Environmental Statistics Department (2023). Multiple Indicator Cluster Survey 2006 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/31
Explore at:
Dataset updated
Oct 26, 2023
Dataset authored and provided by
Social and Environmental Statistics Department
Time period covered
2006
Area covered
Vietnam
Description
Abstract

The Multiple Indicator Cluster Survey (MICS) is a household survey programme developed by UNICEF to assist countries in filling data gaps for monitoring human development in general and the situation of children and women in particular. MICS is capable of producing statistically sound, internationally comparable estimates of social indicators. The Viet Nam Multiple Indicator Cluster Survey provides valuable information on the situation of children and women in Viet Nam, and was based, in large part, on the needs to monitor progress towards goals and targets emanating from recent international agreements: the Millennium Declaration, adopted by all 191 United Nations Member States in September 2000, and the Plan of Action of A World Fit For Children, adopted by 189 Member States at the United Nations Special Session on Children in May 2002. Both of these commitments build upon promises made by the international community at the 1990 World Summit for Children.

Survey Objectives: The 2006 Viet Nam Multiple Indicator Cluster Survey has as its primary objectives: - To provide up-to-date information for assessing the situation of children and women in Viet Nam; - To furnish data needed for monitoring progress toward goals established by the Millennium Development Goals, the goals of A World Fit For Children (WFFC), and other internationally agreed upon goals, as a basis for future action; - To provide valuable information for the 3rd and 4th National Report of Vietnam's implementation of the Convention on the child rights in the period 2002-2007 as well as for monitoring the National Plan of Action for Children 2001-2010.
- To contribute to the improvement of data and monitoring systems in Viet Nam and to strengthen technical expertise in the design, implementation, and analysis of such systems.

Survey Content Following the MICS global questionnaire templates, the questionnaires were designed in a modular fashion customized to the needs of Viet Nam. The questionnaires consist of a household questionnaire, a questionnaire for women aged 15-49 and a questionnaire for children under the age of five (to be administered to the mother or caretaker).

Survey Implementation The Viet Nam Multiple Indicator Cluster Survey (MICS) was carried by General Statistics Office of Viet Nam (GSO) in collaboration with Viet Nam Committee for Population, Family and Children (VCPFC). Financial and technical support was provided by the United Nations Children's Fund (UNICEF). Technical assistance and training for the survey was provided through a series of regional workshops organised by UNICEF covering questionnaire content, sampling and survey implementation; data processing; data quality and data analysis; report writing and dissemination.

Geographic coverage

The survey is nationally representative and covers the whole of Viet Nam.

Analysis unit

Households (defined as a group of persons who usually live and eat together)

Household members (defined as members of the household who usually live in the household, which may include people who did not sleep in the household the previous night, but does not include visitors who slept in the household the previous night but do not usually live in the household)

Women aged 15-49

Children aged 0-4

Universe

The survey covered all de jure household members (usual residents), all women aged 15-49 years resident in the household, and all children aged 0-4 years (under age 5) resident in the household.

Kind of data

Sample survey data [ssd]

Sampling procedure

The sample for the Viet Nam Multiple Indicator Cluster Survey (MICS) was designed to provide reliable estimates on a large number of indicators on the situation of children and women at the national level, for urban and rural areas, and for 8 regions: Red River Delta, North West, North East, North Central Coast, South Central Coast, Central Highlands, South East, and Mekong River Delta. Regions were identified as the main sampling domains and the sample was selected in two stages. At the first stage 250 census enumeration areas (EA) were selected, of which all 240 EAs of MICS2 with systematic method were reselected and 10 new EAs were added. The addition of 10 more EAs (together with the increase in the sample size) was to increase the reliability level for regional estimates. Consequently, within each region, 30-33 EAs were selected for MICS3. After a household listing was carried out within the selected enumeration areas, a systematic sample of 1/3 of households in each EA was drawn. The survey managed to visit all of 250 selected EAs during the fieldwork period. The sample was stratified by region and is not self-weighting. For reporting national level results, sample weights are used. A more detailed description of the sample design can be found in the technical documents and in Appendix A of the final report.

Sampling deviation

No major deviations from the original sample design were made. All sample enumeration areas were accessed and successfully interviewed with good response rates.

Mode of data collection

Face-to-face

Research instrument

The questionnaires are based on the MICS3 model questionnaire. From the MICS3 model English version, the questionnaires were translated in to Vietnamese and were pretested in one province (Bac Giang) during July 2006. Based on the results of this pre-test, modifications were made to the wording and translation of the questionnaires.

Cleaning operations

Data editing took place at a number of stages throughout the processing (see Other processing), including: a) Office editing and coding b) During data entry c) Structure checking and completeness d) Secondary editing e) Structural checking of SPSS data files

Detailed documentation of the editing of data can be found in the data processing guidelines in the MICS manual http://www.childinfo.org/mics/mics3/manual.php.

Response rate

8356 households were selected for the sample. Of these all were found to be occupied households and 8355 were successfully interviewed for a response rate of 100%. Within these households, 10063 eligible women aged 15-49 were identified for interview, of which 9473 were successfully interviewed (response rate 94.1%), and 2707 children aged 0-4 were identified for whom the mother or caretaker was successfully interviewed for 2680 children (response rate 99%).

Sampling error estimates

Estimates from a sample survey are affected by two types of errors: 1) non-sampling errors and 2) sampling errors. Non-sampling errors are the results of mistakes made in the implementation of data collection and data processing. Numerous efforts were made during implementation of the MICS - 3 to minimize this type of error, however, non-sampling errors are impossible to avoid and difficult to evaluate statistically.

Sampling errors can be evaluated statistically. The sample of respondents to the MICS - 3 is only one of many possible samples that could have been selected from the same population, using the same design and expected size. Each of these samples would yield results that different somewhat from the results of the actual sample selected. Sampling errors are a measure of the variability in the results of the survey between all possible samples, and, although, the degree of variability is not known exactly, it can be estimated from the survey results. The sampling errors are measured in terms of the standard error for a particular statistic (mean or percentage), which is the square root of the variance. Confidence intervals are calculated for each statistic within which the true value for the population can be assumed to fall. Plus or minus two standard errors of the statistic is used for key statistics presented in MICS, equivalent to a 95 percent confidence interval.

If the sample of respondents had been a simple random sample, it would have been possible to use straightforward formulae for calculating sampling errors. However, the MICS - 3 sample is the result of a two-stage stratified design, and consequently needs to use more complex formulae. The SPSS complex samples module has been used to calculate sampling errors for the MICS - 3. This module uses the Taylor linearization method of variance estimation for survey estimates that are means or proportions. This method is documented in the SPSS file CSDescriptives.pdf found under the Help, Algorithms options in SPSS.

Sampling errors have been calculated for a select set of statistics (all of which are proportions due to the limitations of the Taylor linearization method) for the national sample, urban and rural areas, and for each of the five regions. For each statistic, the estimate, its standard error, the coefficient of variation (or relative error -- the ratio between the standard error and the estimate), the design effect, and the square root design effect (DEFT -- the ratio between the standard error using the given sample design and the standard error that would result if a simple random sample had been used), as well as the 95 percent confidence intervals (+/-2 standard errors).

Data appraisal

A series of data quality tables and graphs are available to review the quality of the data and include the following:

Age distribution of the household population Age distribution of eligible women and interviewed women Age distribution of eligible children and children for whom the mother or caretaker was interviewed Age distribution of children under age 5 by 3 month groups Age and period ratios at
Enterprise Survey 2009-2019, Panel Data - Slovenia
microdata.worldbank.org
catalog.ihsn.org
Updated Aug 6, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
World Bank Group (WBG) (2020). Enterprise Survey 2009-2019, Panel Data - Slovenia [Dataset]. https://microdata.worldbank.org/index.php/catalog/3762
Explore at:
Dataset updated
Aug 6, 2020
Dataset provided by
European Bank for Reconstruction and Developmenthttp://ebrd.com/
World Bankhttps://www.worldbank.org/
European Investment Bank (EIB)
Time period covered
2008 - 2019
Area covered
Slovenia
Description
Abstract

The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.

The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.

As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.

Geographic coverage

National

Analysis unit

The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.

Universe

As it is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).

Kind of data

Sample survey data [ssd]

Sampling procedure

The sample for Slovenia ES 2009, 2013, 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for Slovenia 2009 ES and for Slovenia 2013 ES, and in the Sampling Note for 2019 Slovenia ES.

Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information of the industries and regions chosen are included in the attached Excel file (Sampling Report.xls.) for Slovenia 2009 ES. For Slovenia 2013 and 2019 ES, specific information of the industries and regions chosen is described in the "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.

For the Slovenia 2009 ES, industry stratification was designed in the way that follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries sample sizes were inflated by about 17% to account for potential non-response cases when requesting sensitive financial data and also because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals) sample sizes were inflated by about 12% to account for under sampling in firms in service industries.

For Slovenia 2013 ES, industry stratification was designed in the way that follows: the universe was stratified into one manufacturing industry, and two service industries (retail, and other services).

Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).

For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.

For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.

For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.

Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).

Mode of data collection

Computer Assisted Personal Interview [capi]

Research instrument

Questionnaires have common questions (core module) and respectfully additional manufacturing- and services-specific questions. The eligible manufacturing industries have been surveyed using the Manufacturing questionnaire (includes the core module, plus manufacturing specific questions). Retail firms have been interviewed using the Services questionnaire (includes the core module plus retail specific questions) and the residual eligible services have been covered using the Services questionnaire (includes the core module). Each variation of the questionnaire is identified by the index variable, a0.

Response rate

Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.

Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.

For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.

For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates in the Slovenia may be selection bias and not frame inaccuracy.

For 2013, the number of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The number of rejections per contact was 44%.

Finally, for 2019, the number of interviews per contacted establishments was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
w
Multiple Indicator Cluster Survey 2000 - Viet Nam
microdata.worldbank.org
catalog.ihsn.org
+2more
Updated Oct 26, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
General Statistics Office (2023). Multiple Indicator Cluster Survey 2000 - Viet Nam [Dataset]. https://microdata.worldbank.org/index.php/catalog/722
Explore at:
Dataset updated
Oct 26, 2023
Dataset authored and provided by
General Statistics Office
Time period covered
2000
Area covered
Vietnam
Description
Abstract

The Viet Nam Multiple Indicator Cluster Survey (MICS) was carried by General Statistics Office of Viet Nam (GSO) in collaboration with Viet Nam Committee for Population, Family and Children (VCPFC). Financial and technical support by the United Nations Children's Fund (UNICEF).

In the World Summit for children held in New York in 1990, the Government of Vietnam committed itself to the implementation of the World Declaration and Plan of Action for children.

In implementation of directive 34/1999/CT-TTg on 27 December 1999 on promoting the implementation of the end-decade goals for children, reviewing the National Plan of Action for children, 1991-2000 and designing the National Plan of Action for children, 2001-2010, in the framework of the “Development of Social Indicators” project, the General Statistical Office (GSO) has chaired and coordinated with the Viet Nam Committee for the Protection and Care for Children (CPCC) to conduct the survey evaluating the end- decade goals for children, 1991-2000 (MICS). MICS has covered a sample size of 7628 households in 240 communes and wards representing the whole country, the urban area, the rural area and the 8 geographical areas in 61 towns/provinces. Field activities to collect data lasted 2 months, May- June/2000. The survey was technically supported by statisticians from EAPRO, UNICEF regional offices, UNICEF Hanoi on sample and questionnaire designing, data input software, not least the software analyzing and calculating the estimates generalizing the results of survey.

Survey Objectives: The end-decade survey on children is aimed at. · Providing up-to-date and reliable data to analyse the situation of children and women in 2000. · Providing data to assess the implementation of the World summit goals for children and of the National Plan of Action for Vietnamese Children, 1991-2000. · Serving as a basis (with baseline data and information) for development of the National Plan of Action for Children, 2001-2010. · Building professional capacity in monitoring, managing and evaluating all the goals of child protection, care and education at all levels.

Geographic coverage

The 2000 MICS of Vietnam was a nationally representative sample survey.

Analysis unit

Households, Women, Child.

Kind of data

Sample survey data [ssd]

Sampling procedure

The sample for the Viet Nam Multiple Indicator Cluster Survey (MICSII) was designed to provide reliable estimates on a large number of indicators on the situation of children and women at the national level, for urban and rural areas, and for 8 regions: Red River Delta, North West, North East, North Central Coast, South Central Coast, Central Highlands, South East, and Mekong River Delta. Regions were identified as the main sampling domains and the sample was selected in two stages: At the first stage, 240 EAs are sellected. After a household listing was carried out within the selected enumeration areas, a systematic sample of 1/3 of households in each EA was drawn. The survey managed to visit all of 240 selected EAs during the fieldwork period. The sample was stratified by region and is not self-weighting. For reporting national level results, sample weights are used.

Sampling deviation

No major deviations from the original sample design were made. All sample enumeration areas were accessed and successfully interviewed with good response rates.

Mode of data collection

Face-to-face [f2f]

Research instrument

The questionnaires for MICS in Vietnam are based on the New York UNICEF module questionnaires with some modifications and additions to fit in with Vietnam's context and to evaluate the goals set out in the National Plan of Action. The questionnaires have been arranged in such a way as to prevent the loss of questionnaire sheets and to facilitate the logic control between the items in the modules. Questionnaires include 3 sections. Section 1: general questions to be administered to families and family members. Section 2: questions for child bearing-age women (aged 15-49). Section 3: for children under 5.

Section 1: Household questionnaire Part A: Household information panel Part B: Household listing form Part C: Education Part D: Child labour Part E: Maternal mortality Part F: Water and sanitation Part G: Salt iodization

Section 2: Questionnaire for child bearing-age women Part A: Child mortality Part B: Tetanus toxoid (TT) Part C: Maternal and newborn health Part D: Contraceptive use Part E: HIV/AIDS

Section 3: Questionnaire for children under five Part A:Birth registration and early learning Part B: Vitamin A Part C: Breastfeeding Part D: Care of illness Part E: Malaria Part F: Immunization Part G: Anthropometry

Apart from the questionnaires to collect information at family level, questionnaires are also designed to gather information at community level supplementary to some indicators that can not have data collected at family level. The information garnered includes local population, socio-economic and physical conditions, education, health and progress of projects/plans of actions for children.

Cleaning operations

To minimize the errors made by data entry staff members, all the records were double- entered by two different members. Any error detected between the two entries was re-checked to find out which one is wrong. Data cleaning started in to early September. This process was closely observed to ensure the accuracy, quality and practicality of all the data collected.

To minimize the errors due to wrong statements of respondents or wrong registration by interviewers, a cleaning programme was used to check the consistency and logic in the items of questionnaires and between the questionnaires. The cleaning programme printed out all the errors, then questionnaires were checked by qualified officials.

Response rate

8356 households were selected for the sample. Of these all were found to be occupied households and 8355 were successfully interviewed for a response rate of 100%. Within these households, 10063 eligible women aged 15-49 were identified for interview, of which 9473 were successfully interviewed (response rate 94.1%), and 2707 children aged 0-4 were identified for whom the mother or caretaker was successfully interviewed for 2680 children (response rate 99%).

Sampling error estimates

Estimates from a sample survey are affected by two types of errors: 1) non-sampling errors and 2) sampling errors. Non-sampling errors are the results of mistakes made in the implementation of data collection and data processing. Numerous efforts were made during implementation of the MICS - 3 to minimize this type of error, however, non-sampling errors are impossible to avoid and difficult to evaluate statistically.

Sampling errors can be evaluated statistically. The sample of respondents to the MICS - 3 is only one of many possible samples that could have been selected from the same population, using the same design and expected size. Each of these samples would yield results that different somewhat from the results of the actual sample selected. Sampling errors are a measure of the variability in the results of the survey between all possible samples, and, although, the degree of variability is not known exactly, it can be estimated from the survey results. The sampling errors are measured in terms of the standard error for a particular statistic (mean or percentage), which is the square root of the variance. Confidence intervals are calculated for each statistic within which the true value for the population can be assumed to fall. Plus or minus two standard errors of the statistic is used for key statistics presented in MICS, equivalent to a 95 percent confidence interval.

If the sample of respondents had been a simple random sample, it would have been possible to use straightforward formulae for calculating sampling errors. However, the MICS - 3 sample is the result of a two-stage stratified design, and consequently needs to use more complex formulae. The SPSS complex samples module has been used to calculate sampling errors for the MICS - 3. This module uses the Taylor linearization method of variance estimation for survey estimates that are means or proportions. This method is documented in the SPSS file CSDescriptives.pdf found under the Help, Algorithms options in SPSS.

Sampling errors have been calculated for a select set of statistics (all of which are proportions due to the limitations of the Taylor linearization method) for the national sample, urban and rural areas, and for each of the five regions. For each statistic, the estimate, its standard error, the coefficient of variation (or relative error -- the ratio between the standard error and the estimate), the design effect, and the square root design effect (DEFT -- the ratio between the standard error using the given sample design and the standard error that would result if a simple random sample had been used), as well as the 95 percent confidence intervals (+/-2 standard errors).

Data appraisal

A series of data quality tables and graphs are available to review the quality of the data and include the following:

Age distribution of the household population Age distribution of eligible women and interviewed women Age distribution of eligible children and children for whom the mother or caretaker was interviewed Age distribution of children under age 5 by 3 month groups Age and period ratios at boundaries of eligibility Percent of observations with missing information on selected variables Presence of mother in

Facebook

Twitter

Click to copy link

Link copied

Cite

U.S. Geological Survey (2024). An example data set for exploration of Multiple Linear Regression [Dataset]. https://catalog.data.gov/dataset/an-example-data-set-for-exploration-of-multiple-linear-regression

An example data set for exploration of Multiple Linear Regression

Explore at:

Dataset updated

Jul 6, 2024

Dataset provided by

United States Geological Surveyhttp://www.usgs.gov/

Description

This data set contains example data for exploration of the theory of regression based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II data base in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.

Clear search

Close search

Google apps

Main menu

An example data set for exploration of Multiple Linear Regression

Data Management Plan Examples Database

Clustering of samples and variables with mixed-type data

Data Integration Market Research Report 2033

Data Integration Market Outlook

Component Analysis

Data from: Distributed Anomaly Detection using 1-class SVM for Vertically...

The Human Know-How Dataset

Coresignal | Employee Data | From the Largest Professional Network | Global...

Collection of example datasets used for the book - R Programming -...

GapMaps Live Location Intelligence Platform | Map Data | Easy-to-use| One...

Data Warehousing Market Research Report 2033

Data Warehousing Market Outlook

Offering Analysis

Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm

Enterprise Data Management (Edm) Market Analysis North America, Europe,...

Snapshot img

Data to create and evaluate distribution models for invasive species for...

Synthetic Integrated Services Data

Motivation

Collection

Preprocessing

Recommended Uses

Known Limitations/Biases

Feedback

Further Documentation and Resources

The global industrial value-added dataset under different global change...

Multi-race Human Body Data | 300,000 ID | Computer Vision Data| Image/Video...

Data from: Open-data release of aggregated Australian school-level...

Multiple Indicator Cluster Survey 2006 - Viet Nam

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Sampling deviation

Mode of data collection

Research instrument

Cleaning operations

Response rate

Sampling error estimates

Data appraisal

Enterprise Survey 2009-2019, Panel Data - Slovenia

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Response rate

Multiple Indicator Cluster Survey 2000 - Viet Nam

Abstract

Geographic coverage

Analysis unit

Kind of data

Sampling procedure

Sampling deviation

Mode of data collection

Research instrument

Cleaning operations

Response rate

Sampling error estimates

Data appraisal

An example data set for exploration of Multiple Linear Regression