Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by yvonne gatwiri
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, the most popular imputation methods generally require scripting skills and are implemented across various packages with differing syntax. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often treated as a separate exercise from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, that is Python-based and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
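As a rough illustration of the kind of gradient-boosted, tree-based imputation the abstract describes (not the ImputEHR tool itself, which is a graphical application), the following sketch uses scikit-learn's IterativeImputer with a gradient-boosted estimator; the toy table and its column names are hypothetical:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import HistGradientBoostingRegressor

# Toy EHR-like table with missing values (hypothetical columns)
ehr = pd.DataFrame({
    "age":   [63, 47, np.nan, 71, 55],
    "sbp":   [142, np.nan, 128, 150, np.nan],
    "hba1c": [7.1, 5.8, 6.4, np.nan, 6.0],
})

# Each column with missing values is modeled from the others using a
# gradient-boosted tree regressor, iterating until the estimates stabilize.
imputer = IterativeImputer(estimator=HistGradientBoostingRegressor(), max_iter=10, random_state=0)
completed = pd.DataFrame(imputer.fit_transform(ehr), columns=ehr.columns)
print(completed)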
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please also see the latest version of the repository.
The explosion in the volume of biological imaging data challenges the available technologies for data interrogation and its intersection with related published bioinformatics data sets. Moreover, intersection of highly rich and complex datasets from different sources provided as flat csv files requires advanced informatics skills, which is time-consuming and not accessible to all. Here, we provide a "user manual" to our new paradigm for systematically filtering and analysing a dataset with more than 1300 microscopy data figures using Multi-Dimensional Viewer (MDV) (link), a solution for interactive multimodal data visualisation and exploration. The primary data we use are derived from our published study, "Systematic analysis of 200 YFP traps reveals common discordance between mRNA and protein across the nervous system" (eprint link). This manual provides the raw image data together with the expert annotations of the mRNA and protein distribution as well as associated bioinformatics data. We provide an explanation, with specific examples, of how to use MDV to make the multiple data types interoperable and explore them together. We also provide the open-source python code (github link) used to annotate the figures, which could be adapted to any other kind of data annotation task.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The explosion in biological data generation challenges the available technologies and methodologies for data interrogation. Moreover, highly rich and complex datasets together with diverse linked data are difficult to explore when provided in flat files. Here we provide a way to systematically filter and analyse a dataset with more than 18 thousand data points using Zegami, a solution for interactive data visualisation and exploration. The primary data we use are derived from "A systematic analysis of 200 YFP gene traps reveals common discordance between mRNA and protein across the nervous system", which is submitted elsewhere. This manual provides the raw image data together with annotations and associated data, and explains how to use Zegami to explore all these data types together by providing specific examples. We also provide the open source python code used to annotate the figures.
https://www.mordorintelligence.com/privacy-policy
Big Data in the oil and gas exploration and production market is segmented by Product (Hardware, Software, and Services) and Geography (North America, Europe, Asia-Pacific, South America, and the Middle-East and Africa).
This data set contains example data for exploration of the theory of regression based regionalization. The 90th percentile of annual maximum streamflow is provided as an example response variable for 293 streamgages in the conterminous United States. Several explanatory variables are drawn from the GAGES-II data base in order to demonstrate how multiple linear regression is applied. Example scripts demonstrate how to collect the original streamflow data provided and how to recreate the figures from the associated Techniques and Methods chapter.
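A minimal sketch of the regression step, assuming hypothetical file and column names rather than the actual example scripts distributed with the data, could look like this in Python with statsmodels:

import pandas as pd
import statsmodels.api as sm

# Hypothetical table: one row per streamgage, with the response and
# GAGES-II-style basin characteristics as columns
gages = pd.read_csv("example_gages.csv")

# q90_annual_max: 90th percentile of annual maximum streamflow (response)
y = gages["q90_annual_max"]
# Hypothetical explanatory variables (drainage area, precipitation, elevation)
X = sm.add_constant(gages[["drain_area_km2", "precip_mm", "elev_m"]])

model = sm.OLS(y, X).fit()
print(model.summary())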
https://www.kaggle.com/tpmeli/missing-data-exploration-mean-iterative-more
These data contain the results of GC-MS, LC-MS and immunochemistry analyses of mask sample extracts. The data include tentatively identified compounds through library searches and compound abundance. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: the data cannot be accessed publicly. Format: The dataset contains the identification of compounds found in the mask samples as well as the abundance of those compounds for individuals who participated in the trial. This dataset is associated with the following publication: Pleil, J., M. Wallace, J. McCord, M. Madden, J. Sobus, and G. Ferguson. How do cancer-sniffing dogs sort biological samples? Exploring case-control samples with non-targeted LC-Orbitrap, GC-MS, and immunochemistry methods. Journal of Breath Research. Institute of Physics Publishing, Bristol, UK, 14(1): 016006, (2019).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The values of betweenness, closeness, and eigenvector centrality for one particular subset within the analyzed medical curriculum.
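For orientation, these three measures can be computed with NetworkX as in the sketch below; the toy edge list is purely illustrative and is not the curriculum graph behind this dataset:

import networkx as nx

# Hypothetical toy graph standing in for a curriculum subset
G = nx.Graph([("anatomy", "physiology"),
              ("physiology", "pharmacology"),
              ("pharmacology", "clinical_skills"),
              ("anatomy", "clinical_skills")])

betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)

for node in G.nodes:
    print(node, betweenness[node], closeness[node], eigenvector[node])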
These interview data are part of the project "Looking for data: information seeking behaviour of survey data users", a study of secondary data users' information-seeking behaviour. The overall goal of this study was to create evidence of actual information practices of users of one particular retrieval system for social science data in order to inform the development of research data infrastructures that facilitate data sharing. In the project, data were collected based on a mixed methods design. The research design included a qualitative study in the form of expert interviews and – building on the results found therein – a quantitative web survey of secondary survey data users.

For the qualitative study, expert interviews with six reference persons of a large social science data archive were conducted. They were interviewed in their role as intermediaries who provide guidance for secondary users of survey data. The knowledge from their reference work was expected to provide a condensed view of goals, practices, and problems of people who are looking for survey data. The anonymized transcripts of these interviews are provided here. They can be reviewed or reused upon request. The survey dataset from the quantitative study of secondary survey data users is downloadable through this data archive after registration.

The core result of the Looking for data study is that community involvement plays a pivotal role in survey data seeking. The analyses show that survey data communities are an important determinant in survey data users' information seeking behaviour and that community involvement facilitates data seeking and has the capacity to reduce problems or barriers.

The qualitative part of the study was designed and conducted using constructivist grounded theory methodology as introduced by Kathy Charmaz (2014). In line with grounded theory methodology, the interviews did not follow a fixed set of questions, but were conducted based on a guide that included areas of exploration with tentative questions. This interview guide can be obtained together with the transcript. For the Looking for data project, the data were coded and scrutinized by constant comparison, as proposed by grounded theory methodology. This analysis resulted in core categories that make up the "theory of problem-solving by community involvement". This theory was exemplified in the quantitative part of the study. For this exemplification, the following hypotheses were drawn from the qualitative study:
(1) The data seeking hypotheses: (1a) When looking for data, information seeking through personal contact is used more often than impersonal ways of information seeking. (1b) Ways of information seeking (personal or impersonal) differ with experience.
(2) The experience hypotheses: (2a) Experience is positively correlated with having ambitious goals. (2b) Experience is positively correlated with having more advanced requirements for data. (2c) Experience is positively correlated with having more specific problems with data.
(3) The community involvement hypothesis: Experience is positively correlated with community involvement.
(4) The problem solving hypothesis: Community involvement is positively correlated with problem solving strategies that require personal interactions.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0 licensed Python and R notebook versions on Kaggle used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.
By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.
Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.
The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!
While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.
The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.
The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
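A small sketch of mapping a KernelVersions id to its location in this layout is shown below; whether folder names are zero-padded and which file extension a given version uses (.py, .r, .ipynb) are assumptions, so adjust as needed:

def kernel_version_path(kernel_version_id: int, extension: str = "ipynb") -> str:
    # Top-level folder groups ids by millions, e.g. 123 for 123,456,789
    top = kernel_version_id // 1_000_000
    # Sub-folder groups ids by thousands, e.g. 456 for 123,456,789
    sub = (kernel_version_id // 1_000) % 1_000
    return f"{top}/{sub}/{kernel_version_id}.{extension}"

print(kernel_version_path(123_456_789))  # -> 123/456/123456789.ipynb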
The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays
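As a sketch of downloading a single notebook from the requester-pays bucket with the google-cloud-storage Python client, where the billing project id and the object path are placeholders you must replace:

from google.cloud import storage

client = storage.Client(project="your-billing-project")
# user_project is the project billed for requester-pays downloads
bucket = client.bucket("kaggle-meta-kaggle-code-downloads", user_project="your-billing-project")
blob = bucket.blob("123/456/123456789.ipynb")  # placeholder object path
blob.download_to_filename("123456789.ipynb")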
We love feedback! Let us know in the Discussion tab.
Happy Kaggling!
Data Visualization Tools Market Size 2025-2029
The data visualization tools market size is forecast to increase by USD 7.95 billion at a CAGR of 11.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing demand for business intelligence and AI-powered insights. With the rising complexity and voluminous data being generated across industries, there is a pressing need for effective data visualization tools to make data-driven decisions. This trend is particularly prominent in sectors such as healthcare, finance, and retail, where large datasets are common. Moreover, the automation of data visualization is another key driver, enabling organizations to save time and resources by streamlining the data analysis process. However, challenges such as data security concerns, lack of standardization, and integration issues persist, necessitating continuous innovation and investment in advanced technologies. Companies seeking to capitalize on this market opportunity must focus on addressing these challenges through user-friendly interfaces, security features, and seamless integration capabilities. Additionally, partnerships and collaborations with industry leaders and emerging technologies, such as machine learning and artificial intelligence, can provide a competitive edge in this rapidly evolving market.
What will be the Size of the Data Visualization Tools Market during the forecast period?
The market is experiencing growth, driven by the increasing demand for intuitive and interactive ways to analyze complex data. The market encompasses a range of solutions, including visual analytics tools and cloud-based services. The services segment, which includes integration services, is also gaining traction due to the growing need for customized and comprehensive data visualization solutions. Small and medium-sized enterprises (SMEs) are increasingly adopting these tools to gain insights into customer behavior and enhance decision-making. Cloud-based data visualization tools are becoming increasingly popular due to their flexibility, scalability, and cost-effectiveness. Security remains a key concern, with data security features becoming a priority for companies. Additionally, the integration of advanced technologies such as artificial intelligence (AI), machine learning (ML), augmented reality (AR), and virtual reality (VR) is transforming the market, enabling more immersive and interactive data exploration experiences. Overall, the market is poised for continued expansion, offering significant opportunities for businesses seeking to gain a competitive edge through data-driven insights.
How is this Data Visualization Tools Industry segmented?
The data visualization tools industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment: On-premises, Cloud
Customer Type: Large enterprises, SMEs
Component: Software, Services
Application: Human resources, Finance, Others
End-user: BFSI, IT and telecommunication, Healthcare, Retail, Others
Geography: North America (US, Canada), Europe (France, Germany, Italy, UK), APAC (China, India, Japan), South America (Brazil), Middle East and Africa
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period. The market has experienced substantial growth due to the increasing demand for data-driven insights in businesses. On-premises deployment of these tools allows organizations to maintain control over their data, ensuring data security, privacy, and adherence to regulatory requirements. This deployment model is well suited to enterprises dealing with sensitive information, as it avoids transmitting data to cloud-based solutions. Cloud-based solutions, in contrast, offer real-time data analysis, innovative solutions, integration services, customized dashboards, and mobile access. Advanced technologies such as artificial intelligence (AI), machine learning (ML), augmented reality (AR), virtual reality (VR), and business intelligence (BI) are integrated into these tools to provide strategic insights from unstructured data. Data collection, maintenance, sharing, and analysis are simplified, enabling businesses to make informed decisions based on customer behavior and preferences. Key players in this market provide professional expertise and resources for data scientists and programmers using various programming languages.
The On-premises segment was valued at USD 4.15 billion in 2019 and is expected to increase gradually over the forecast period.
Regional Analysis
North America is estimated to contribute 31% to the growth of the global market during the forecast period.
Australian mineral exploration is at a 20-year low in real terms. After doubling in line with global exploration activity during the 1990s, exploration expenditure peaked in 1996/97 and then fell sharply. The current decline differs from previous downturns in exploration that have occurred as part of the economic cycle, as it is accompanied by major structural changes in the industry. Forces resulting in these changes are strongly inter-related and include:
• cost cutting to stay competitive in the face of low (declining) commodity prices
• demand for greater return on shareholder investment
• consolidation in response to globalisation
• intense competition for risk capital (particularly for junior companies) from new sources
• loss of confidence in exploration as an economic activity following declining rates of discovery and land access issues.
These factors have changed and continue to change the face of the industry.
Published in the Australasian Institute of Mining and Metallurgy Bulletin No. 1 Jan/Feb 2002, 45-52.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset comprises metadata for the 225,819 train files from the Google Research - Identify Contrails to Reduce Global Warming challenge.
The file sizes were obtained by using a simple bash script:
# Enable recursive globbing, include dotfiles, and expand to nothing when there is no match
shopt -s globstar dotglob nullglob
for pathname in train/**/*; do
    # Keep regular files only; skip directories and symlinks
    if [[ -f $pathname ]] && [[ ! -h $pathname ]]; then
        # Print size in bytes and path, tab-separated
        stat -c $'%s\t%n' "$pathname"
    fi
done > train_file_sizes.csv
After the bash script, the file was preprocessed with the following Python code:
import pandas as pd

# Read the whitespace/tab-separated listing produced by the bash script
train_sizes = pd.read_csv('data/train_file_sizes.csv', delim_whitespace=True, names=['file_size', 'file_path'])
# The record id is the second path component (train/<record_id>/...)
train_sizes['record_id'] = train_sizes.file_path.str.split('/', expand=True)[1].astype(int)
train_sizes.to_csv('data/train_file_sizes.csv', index=False)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Statistics for ecologists using R and Excel: data collection, exploration, analysis and presentation is a book. It was written by Mark Gardener and published by Pelagic in 2012.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Introduction to Primate Data Exploration and Linear Modeling with R was created with the goal of providing training to undergraduate biology research students on data management and statistical analysis using authentic data of Cayo Santiago rhesus macaques.
https://www.archivemarketresearch.com/privacy-policy
The global Exploration Software market is projected to reach $230.5 million by 2033, expanding at a CAGR of 6.9% from 2025 to 2033. The increasing demand for efficient and cost-effective exploration solutions, coupled with the growing adoption of digital technologies in the oil and gas industry, is driving market growth. The market is segmented based on type (cloud-based and web-based) and application (large enterprises and SMEs). Key market players include Schlumberger, Sintef, Petrel E&P, Quorum, geoSCOUT, Exprodat, and others. The market is primarily driven by the rising need for accurate and real-time data in exploration activities. Exploration software provides comprehensive data analysis, visualization, and modeling capabilities, enabling geologists and engineers to make informed decisions. The adoption of cloud-based solutions is further fueling market growth, as it offers flexibility, scalability, and cost-effectiveness. However, factors such as data security concerns and the availability of skilled professionals may restrain market growth to some extent. Geographically, North America and Europe are expected to be major contributors to the market, while Asia Pacific is projected to witness significant growth potential in the coming years.
This submission contains an update to the previous Exploration Gap Assessment funded in 2012, which identified high-potential hydrothermal areas where critical data are needed (a gap analysis on exploration data). The uploaded data are contained in two data files for each data category: a shape (SHP) file containing the grid, and a data file (CSV) containing the individual layers that intersected with the grid. This CSV can be joined with the map to retrieve a list of datasets that are available at any given site. A grid of the contiguous U.S. was created with 88,000 10-km by 10-km grid cells, and each cell was populated with the status of data availability corresponding to five data types: well data, geologic maps, fault maps, geochemistry data, and geophysical data.
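A minimal sketch of the join described above, using geopandas and assuming hypothetical file names and a hypothetical "cell_id" join key (the actual column names in the submission may differ):

import geopandas as gpd
import pandas as pd

grid = gpd.read_file("exploration_grid.shp")    # 10-km by 10-km cells
layers = pd.read_csv("exploration_layers.csv")  # per-cell data availability layers

# Attach the layer records to the grid cells so availability can be mapped
joined = grid.merge(layers, on="cell_id", how="left")

# List the datasets recorded as available for one grid cell
print(joined.loc[joined["cell_id"] == 12345, "dataset_name"])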
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
DEEPEN stands for DE-risking Exploration of geothermal Plays in magmatic ENvironments.
As part of the development of the DEEPEN 3D play fairway analysis (PFA) methodology for magmatic plays (conventional hydrothermal, superhot EGS, and supercritical), weights needed to be developed for use in the weighted sum of the different favorability index models produced from geoscientific exploration datasets. This was done using two different approaches: one based on expert opinions, and one based on statistical learning. This GDR submission includes the datasets used to produce the statistical learning-based weights.
While expert opinions allow us to include more nuanced information in the weights, expert opinions are subject to human bias. Data-centric or statistical approaches help to overcome these potential human biases by focusing on and drawing conclusions from the data alone. The drawback is that, to apply these types of approaches, a dataset is needed. Therefore, we attempted to build comprehensive standardized datasets mapping anomalies in each exploration dataset to each component of each play. This data was gathered through a literature review focused on magmatic hydrothermal plays along with well-characterized areas where superhot or supercritical conditions are thought to exist. Datasets were assembled for all three play types, but the hydrothermal dataset is the least complete due to its relatively low priority.
For each known or assumed resource, the dataset states what anomaly in each exploration dataset is associated with each component of the system. The data are only semi-quantitative, where values are either high, medium, or low, relative to background levels. In addition, the dataset has significant gaps, as not every possible exploration dataset has been collected and analyzed at every known or suspected geothermal resource area, in the context of all possible play types. The following training sites were used to assemble this dataset:
- Conventional magmatic hydrothermal: Akutan (from AK PFA), Oregon Cascades PFA, Glass Buttes OR, Mauna Kea (from HI PFA), Lanai (from HI PFA), Mt St Helens Shear Zone (from WA PFA), Wind River Valley (from WA PFA), Mount Baker (from WA PFA).
- Superhot EGS: Newberry (EGS demonstration project), Coso (EGS demonstration project), Geysers (EGS demonstration project), Eastern Snake River Plain (EGS demonstration project), Utah FORGE, Larderello, Kakkonda, Taupo Volcanic Zone, Acoculco, Krafla.
- Supercritical: Coso, Geysers, Salton Sea, Larderello, Los Humeros, Taupo Volcanic Zone, Krafla, Reykjanes, Hengill.
Disclaimer: Treat the supercritical fluid anomalies with skepticism. They are based on assumptions due to the general lack of confirmed supercritical fluid encounters and samples at the sites included in this dataset at the time of assembling the dataset. The main assumption was that the supercritical fluid in a given geothermal system has shared properties with the hydrothermal fluid, which may not be the case in reality.
Once the datasets were assembled, principal component analysis (PCA) was applied to each. PCA is an unsupervised statistical learning technique (labels on the data are not required) that summarizes the directions of variance in the data. This approach was chosen because our labels are not certain, i.e., we do not know with 100% confidence that superhot resources exist at all the assumed positive areas. We also do not have data for any known non-geothermal areas, meaning that it would be challenging to apply a supervised learning technique. To generate weights from the PCA, an analysis of the PCA loading values was conducted. PCA loading values represent how much a feature contributes to each principal component, and therefore to the overall variance in the data.
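As a rough sketch of deriving per-feature weights from PCA loadings with scikit-learn: the aggregation below (absolute loadings weighted by each component's explained variance, then normalized) is one reasonable choice and not necessarily the exact DEEPEN scheme, and the data and feature names are stand-ins:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in matrix: rows = training sites, columns = exploration datasets
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
feature_names = ["resistivity", "seismicity", "heat_flow", "gravity", "gas_flux"]

Z = StandardScaler().fit_transform(X)
pca = PCA().fit(Z)

# loadings[i, j] is the contribution of feature j to principal component i
loadings = pca.components_
weights = np.abs(loadings).T @ pca.explained_variance_ratio_
weights /= weights.sum()  # normalize so the weights sum to 1

for name, w in zip(feature_names, weights):
    print(f"{name}: {w:.3f}")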
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This submission contains raster files associated with several datasets that include earthquake density, Na/K geothermometers, fault density, heat flow, and gravity. Integrated together using spatial modeler tools in ArcGIS, these files can be used for play fairway analysis in regard to geothermal exploration.
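For readers working outside ArcGIS, a weighted-sum combination of co-registered rasters can be sketched with rasterio as below; the file names and weights are placeholders, and the rasters are assumed to share the same grid, extent, and nodata handling:

import rasterio

# Placeholder layer names and weights for the weighted-sum favorability surface
layers = {"earthquake_density.tif": 0.25,
          "heat_flow.tif": 0.35,
          "fault_density.tif": 0.20,
          "gravity.tif": 0.20}

favorability = None
profile = None
for path, weight in layers.items():
    with rasterio.open(path) as src:
        band = src.read(1).astype("float32")
        profile = profile or src.profile
        favorability = band * weight if favorability is None else favorability + band * weight

profile.update(dtype="float32", count=1)
with rasterio.open("favorability.tif", "w", **profile) as dst:
    dst.write(favorability, 1)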