Loan Data from Prosper.
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. This data dictionary explains the variables in the data set.
License: MIT (https://opensource.org/licenses/MIT)
This file contains anonymized employee records for hands-on data wrangling and cleaning tasks. The structure matches common corporate HR exports, with a mix of numerical, categorical, and date fields plus representative real-world issues.
Rows: ~10,990 (some duplicates present); Columns: 9
| Column | Description |
|---|---|
| Age | Age of employee (float, may be missing) |
| Salary | Annual salary in USD (integer, no missing values) |
| Experience | Years of experience as text (e.g. '17 years', may be empty) |
| Performance_Score | Last evaluation score (integer, no missing values) |
| Gender | M, F, Male, Female, or blank |
| Department | Department (HR, Engineering, Sales, Marketing, or blank) |
| Hired | Yes, No, Y, N, or blank |
| Hiring_Date | ISO date string (may be empty, various validity) |
| Location | Work city (Austin, Seattle, Boston, New York, or blank) |
Hiring_Date is stored as free text, with some values missing or invalid. Sample rows:

| Age | Salary | Experience | Performance_Score | Gender | Department | Hired | Hiring_Date | Location |
|---|---|---|---|---|---|---|---|---|
| 39 | 46945 | 17 years | 9 |  | Marketing | Yes | 2021-02-24 | Austin |
| 27 | 27102 | 4 years | 3 | M | Sales | Y | 2013-07-19 | Seattle |
| 29 | 50624 | 10 years | 8 | Male | Engineering | Yes | 2015-03-28 | Austin |
Typical cleaning tasks (a pandas sketch follows below):
- Convert Experience to integer
- Normalize the Gender and Hired fields to a single format
- Parse Hiring_Date to pandas datetime (handling errors)

See the provided notebook (hr_records_data_cleaning) for typical cleaning techniques. The scripts demonstrate type conversion, missing-value imputation, categorical mapping, date parsing, and duplicate removal.
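As a rough illustration of those steps, here is a minimal pandas sketch; the file name is assumed, and the imputation choice (median age) is just one reasonable option:

```python
import pandas as pd

# File name assumed; adjust to your copy of the export.
df = pd.read_csv("hr_records.csv")

# Experience: '17 years' -> 17, as a nullable integer (empty -> <NA>).
df["Experience"] = pd.to_numeric(
    df["Experience"].str.extract(r"(\d+)", expand=False), errors="coerce"
).astype("Int64")

# Gender and Hired: collapse the mixed encodings onto a single format.
df["Gender"] = df["Gender"].replace({"Male": "M", "Female": "F"})
df["Hired"] = df["Hired"].replace({"Y": "Yes", "N": "No"})

# Hiring_Date: parse ISO strings, coercing invalid values to NaT.
df["Hiring_Date"] = pd.to_datetime(df["Hiring_Date"], errors="coerce")

# Remove exact duplicate rows and impute missing ages with the median.
df = df.drop_duplicates()
df["Age"] = df["Age"].fillna(df["Age"].median())
```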
Synthetic data for skill-building and demonstration only. No real identities are included.
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR...
License: Public Domain (https://creativecommons.org/licenses/publicdomain/)
This repository is associated with NSF DBI 2033973, RAPID Grant: Rapid Creation of a Data Product for the World's Specimens of Horseshoe Bats and Relatives, a Known Reservoir for Coronaviruses (https://www.nsf.gov/awardsearch/showAward?AWD_ID=2033973). Specifically, this repository contains (1) raw data from iDigBio (http://portal.idigbio.org) and GBIF (https://www.gbif.org), (2) R code for reproducible data wrangling and improvement, (3) protocols associated with data enhancements, and (4) enhanced versions of the dataset published at various project milestones. Additional code associated with this grant can be found in the BIOSPEX repository (https://github.com/iDigBio/Biospex). Long-term data management of the enhanced specimen data created by this project is expected to be accomplished by the natural history collections curating the physical specimens, a list of which can be found in this Zenodo resource.
Grant abstract: "The award to Florida State University will support research contributing to the development of georeferenced, vetted, and versioned data products of the world's specimens of horseshoe bats and their relatives for use by researchers studying the origins and spread of SARS-like coronaviruses, including the causative agent of COVID-19. Horseshoe bats and other closely related species are reported to be reservoirs of several SARS-like coronaviruses. Species of these bats are primarily distributed in regions where these viruses have been introduced to populations of humans. Currently, data associated with specimens of these bats are housed in natural history collections that are widely distributed both nationally and globally. Additionally, information tying these specimens to localities are mostly vague, or in many instances missing. This decreases the utility of the specimens for understanding the source, emergence, and distribution of SARS-COV-2 and similar viruses. This project will provide quality georeferenced data products through the consolidation of ancillary information linked to each bat specimen, using the extended specimen model. The resulting product will serve as a model of how data in biodiversity collections might be used to address emerging diseases of zoonotic origin. Results from the project will be disseminated widely in opensource journals, at scientific meetings, and via websites associated with the participating organizations and institutions. Support of this project provides a quality resource optimized to inform research relevant to improving our understanding of the biology and spread of SARS-CoV-2. The overall objectives are to deliver versioned data products, in formats used by the wider research and biodiversity collections communities, through an open-access repository; project protocols and code via GitHub and described in a peer-reviewed paper, and; sustained engagement with biodiversity collections throughout the project for reintegration of improved data into their local specimen data management systems improving long-term curation.
This RAPID award will produce and deliver a georeferenced, vetted and consolidated data product for horseshoe bats and related species to facilitate understanding of the sources, distribution, and spread of SARS-CoV-2 and related viruses, a timely response to the ongoing global pandemic caused by SARS-CoV-2 and an important contribution to the global effort to consolidate and provide quality data that are relevant to understanding emergent and other properties the current pandemic. This RAPID award is made by the Division of Biological Infrastructure (DBI) using funds from the Coronavirus Aid, Relief, and Economic Security (CARES) Act.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria."
Files included in this resource
9d4b9069-48c4-4212-90d8-4dd6f4b7f2a5.zip: Raw data from iDigBio, DwC-A format
0067804-200613084148143.zip: Raw data from GBIF, DwC-A format
0067806-200613084148143.zip: Raw data from GBIF, DwC-A format
1623690110.zip: Full export of this project's data (enhanced and raw) from BIOSPEX, CSV format
bionomia-datasets-attributions.zip: Directory containing 103 Frictionless Data packages for datasets with attributions involving Rhinolophids or Hipposiderids; each package also contains a CSV file of mismatches between a person's date of birth/death and the specimen eventDate. The file bionomia-datasets-attributions-key_2021-02-25.csv, included in this directory, provides a key between dataset identifier (how the Frictionless Data package files are named) and dataset name.
bionomia-problem-dates-all-datasets_2021-02-25.csv: List of 21 Hipposiderid or Rhinolophid records whose eventDate or dateIdentified mismatches a wikidata recipient’s date of birth or death across all datasets.
flagEventDate.txt: file containing term definition to reference in DwC-A
flagExclude.txt: file containing term definition to reference in DwC-A
flagGeoreference.txt: file containing term definition to reference in DwC-A
flagTaxonomy.txt: file containing term definition to reference in DwC-A
georeferencedByID.txt: file containing term definition to reference in DwC-A
identifiedByNames.txt: file containing term definition to reference in DwC-A
instructions-to-get-people-data-from-bionomia-via-datasetKey: instructions given to data providers
RAPID-code_collection-date.R: code associated with enhancing collection dates
RAPID-code_compile-deduplicate.R: code associated with compiling and deduplicating raw data
RAPID-code_external-linkages-bold.R: code associated with enhancing external linkages
RAPID-code_external-linkages-genbank.R: code associated with enhancing external linkages
RAPID-code_external-linkages-standardize.R: code associated with enhancing external linkages
RAPID-code_people.R: code associated with enhancing data about people
RAPID-code_standardize-country.R: code associated with standardizing country data
RAPID-data-dictionary.pdf: metadata about terms included in this project’s data, in PDF format
RAPID-data-dictionary.xlsx: metadata about terms included in this project’s data, in spreadsheet format
rapid-data-providers_2021-05-03.csv: list of data providers and number of records provided to rapid-joined-records_country-cleanup_2020-09-23.csv
rapid-final-data-product_2021-06-29.zip: Enhanced data from BIOSPEX, DwC-A format
rapid-final-gazetteer.zip: Gazetteer providing georeference data and metadata for 10,341 localities assessed as part of this project
rapid-joined-records_country-cleanup_2020-09-23.csv: data product initial version where raw data has been compiled and deduplicated, and country data has been standardized
RAPID-protocol_collection-date.pdf: protocol associated with enhancing collection dates
RAPID-protocol_compile-deduplicate.pdf: protocol associated with compiling and deduplicating raw data
RAPID-protocol_external-linkages.pdf: protocol associated with enhancing external linkages
RAPID-protocol_georeference.pdf: protocol associated with georeferencing
RAPID-protocol_people.pdf: protocol associated with enhancing data about people
RAPID-protocol_standardize-country.pdf: protocol associated with standardizing country data
RAPID-protocol_taxonomic-names.pdf: protocol associated with enhancing taxonomic name data
RAPIDAgentStrings1_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
recordedByNames.txt: file containing term definition to reference in DwC-A
Rhinolophid-HipposideridAgentStrings_and_People2_archivedCopy_30March2021.ods: resource used in conjunction with RAPID people protocol
wikidata-notes-for-bat-collectors_leachman_2020: please see https://zenodo.org/record/4724139 for this resource
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Additional file 1. Example of translation from VCF into GDM format for genomic region data: This .xlsx (MS Excel) spreadsheet exemplifies the transformation of the original 1KGP mutations, expressed in VCF format, into GDM genomic regions. As a demonstrative example, some variants on chromosome X have been selected from the source data (in VCF format) and listed in the first table at the top of the file. The values of the columns #CHROM, POS, REF and ALT appear as in the source. From the column INFO we removed the details that are unnecessary for the transformation. The column FORMAT contains exclusively the value "GT", meaning that the following columns contain only the genotype of the samples (this and other conventions are defined in the VCF specification document and in the header section of each VCF file).

In multiallelic variants (examples e, f.1 and f.2), the genotype indicates with a number which of the alternative alleles in ALT is present in the corresponding samples (e.g., the number 2 means that the second variant is present); otherwise, it only assumes the values 0 (mutation absent) or 1 (mutation present). Additionally, the genotype indicates whether one or both chromosome copies contain the mutation and which one, i.e., the left one or the right one; the mutated alleles are normally separated by a pipe ("|"), if not otherwise specified in the header section. We do not know which chromosome copy is maternal or paternal, but as the 1KGP mutations are "phased", we know that the "left chromosome" is the same in every mutation located on the same chromosome of the same donor. As in this example there is only one column after FORMAT, the mutations described are relative to only one sample, called "HG123456". This sample does not actually exist in the source, but serves to demonstrate several mutation types found in the original data. The table reports six variants in VCF format, with the last one repeated twice to show how different genotype values lead to different translations (indeed, examples f.1 and f.2 differ only in the last column).

Below in the same file, the same variants appear converted into GDM format. The transformation outputs the chr, left, right, strand, AL1, AL2, ref, alt, mut_type and length columns. The value of strand is positive in every mutation, as clarified by the 1KGP Consortium after the release of the data collections. The values of AL1 and AL2 express on which chromatid the mutation occurs and depend on the value of the original genotype (column HG123456). The values of the other columns, namely chr, left, right, ref, alt, mut_type and length, are obtained from the variant's original values after splitting multi-allelic variants, transforming the original position into 0-based coordinates, and removing repeated nucleotide bases from the original REF and ALT columns.

In 0-based coordinates, a nucleotide base occupies the space between the coordinates x and x + 1. So, SNPs (examples a and f.2) are encoded as the replacement of ref at the position between left and right with alt. Insertions (examples c and f.1) are described as the addition of the sequence of bases in alt at the position indicated by left and right, i.e., in between two nucleotide bases. Deletions (example b) are represented as the substitution of ref between positions left and right with an empty value (alt is indeed empty in this case). Finally, structural variants (examples d and e), such as copy number variations and large deletions, have an empty ref because, according to the VCF specification document, the original column REF reports a nucleotide (called the padding base) that is located before the scope of the variant on the genome and is unnecessary in a 0-based representation.

In this file, we report only the columns relevant to understanding the transformation method regarding the mutation coordinates and the reference and alternative alleles. In addition to those reported in the second table, the transformation adds further columns, named after the attributes in the original INFO column, to capture a selection of the attributes present in the original file.
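To make the coordinate logic concrete, here is a minimal Python sketch of the 1-based VCF to 0-based GDM conversion for simple variants. The helper name is hypothetical: multiallelic splitting is assumed to have happened already, and symbolic structural-variant ALT alleles are not handled.

```python
def vcf_to_gdm(pos: int, ref: str, alt: str):
    """Translate one bi-allelic VCF variant (1-based POS) into 0-based
    GDM coordinates, returning (left, right, ref, alt, mut_type)."""
    if len(ref) == 1 and len(alt) == 1:  # SNP: replace the base in [pos-1, pos)
        return pos - 1, pos, ref, alt, "SNP"
    if ref[0] == alt[0]:  # indels share a leading padding base; strip it
        ref, alt = ref[1:], alt[1:]
    if not ref:  # insertion: alt added between two bases (zero-length interval)
        return pos, pos, "", alt, "INS"
    if not alt:  # deletion: ref between left and right replaced by nothing
        return pos, pos + len(ref), ref, "", "DEL"
    return pos, pos + len(ref), ref, alt, "SUB"  # block substitution

# Deletion at POS=100 with REF="TAC", ALT="T": the padding base "T" is
# dropped and "AC" (1-based 101-102) maps to the 0-based interval [100, 102).
print(vcf_to_gdm(100, "TAC", "T"))  # -> (100, 102, 'AC', '', 'DEL')
```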
This comprehensive dataset integrates three heterogeneous data sources to analyze the relationship between air quality, population mobility patterns, and weather conditions across major Indonesian cities from September 2024 to October 2025. The dataset provides valuable insights for environmental monitoring, urban planning, and public health research in Indonesia.
Key Findings (mobility): Over 95% of movements occur within 0-10 km of home, indicating predominantly local mobility patterns. Long-distance travel remains minimal (<0.4%).
Key Findings (air quality): PM2.5 levels consistently exceed WHO guidelines throughout 2024, with critical peaks during May (65-132 μg/m³) and significant improvement in December. Seasonal patterns show higher pollution during the dry months (April-October) due to biomass burning and decreased precipitation.
Key Findings (weather): Consistent tropical monsoon characteristics, with stable temperatures (23-30°C), erratic rainfall patterns, and high humidity. Temperature correlates with both AQI (r = 0.39) and wind speed (r = 0.57).
This dataset is ideal for:
The integrated dataset contains the following columns: ...
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The repository contains data on party strength for each state, as shown on each state's corresponding party-strength Wikipedia page (for example, Virginia's).
Each state's Wikipedia page has a table giving a detailed summary of the state of its governing and representing bodies, but no dataset collates these entries. I scraped each state's Wikipedia table and collated the entries into a single dataset. The data are stored in state_party_strength.csv and state_party_strength_cleaned.csv; the code that generated them can be found in the corresponding Python notebooks.
The data contain information from 1980 on each state's:
1. governor and party
2. state house and senate composition
3. state representative composition in Congress
4. electoral votes
Data in the cleaned version has been cleaned and processed substantially. Namely:
- all columns now contain homogeneous data within the column
- names and Wiki-citations have been removed
- only the party counts and party identification have been kept

The notebook that created this file is here.

Otherwise, the data contained herein have not been altered from their Wikipedia tables except in two instances:
- column names were forced to be in accord across states
- any needed data modifications (i.e., concatenated string columns) were made to retain information when combining columns
Please note that the right encoding for the dataset is "ISO-8859-1", not "utf-8"; in future versions I will try to fix that to make it more accessible.
This means that you will likely have to perform further data wrangling prior to doing any substantive analysis. The notebook that has been used to create this data file is located here
The raw scraped data can be found in the pickle. This file contains a Python dictionary where each key is a US state name and each element is the raw scraped table in Pandas DataFrame format.
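A minimal sketch of loading both artifacts with pandas, assuming a pickle file name (the description does not give it exactly):

```python
import pickle

import pandas as pd

# The CSVs are encoded as ISO-8859-1, not utf-8 (see the note above).
clean = pd.read_csv("state_party_strength_cleaned.csv", encoding="ISO-8859-1")

# The raw scrape is a dict mapping each US state name to its scraped
# Wikipedia table as a pandas DataFrame (pickle file name assumed).
with open("state_party_strength.pkl", "rb") as f:
    raw = pickle.load(f)

print(clean.head())
print(raw["Virginia"].head())
```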
I hope it proves useful to you for analyzing political patterns at the state level in the US, for political and policy research.
A data science approach to predict and understand the applicant's profile to minimize the risk of future loan defaults.
The dataset contains information about credit applicants. Banks around the world use this kind of dataset to build models that help decide whom to accept or refuse for a loan. After the exploratory data analysis, cleansing, and handling of all the anomalies we might (will) find along the way, the patterns of a good/bad applicant will be exposed for machine learning models to learn.
Machine learning problem and objectives: We're dealing with a supervised binary classification problem. The goal is to train the best machine learning model, maximizing its predictive capability by deeply understanding past customers' profiles, in order to minimize the risk of future loan defaults.
Performance metric: The metric used for model evaluation is ROC AUC, given that we're dealing with highly imbalanced data.
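As a hedged sketch of scoring a classifier with ROC AUC on imbalanced classes (synthetic data stands in for the prepared features; logistic regression is just a placeholder model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and bad_loan labels,
# with a 9:1 class imbalance in the spirit of loan-default data.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard class labels.
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```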
Project structure: The project is divided into three parts:
- EDA: exploratory data analysis
- Data wrangling: cleansing and feature selection
- Machine learning: predictive modelling
The dataset: You can download the data set here.
Feature description:
id: Unique ID of the loan application.
grade: LC assigned loan grade.
annual_inc: The self-reported annual income provided by the borrower during registration.
short_emp: 1 when employed for 1 year or less.
emp_length_num: Employment length in years. Possible values are - between 0 and 10 where 0 means less than one year and 10 means ten or more years.
home_ownership: Type of home ownership.
dti (Debt-To-Income Ratio): A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.
purpose: A category provided by the borrower for the loan request.
term: The number of payments on the loan. Values are in months and can be either 36 or 60.
last_delinq_none: 1 when the borrower had at least one event of delinquency.
last_major_derog_none: 1 when the borrower had at least 90 days of a bad rating.
revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
total_rec_late_fee: Late fees received to date.
od_ratio: Overdraft ratio.
bad_loan: 1 when a loan was not paid.
Note 😃: this dataset is intended for practicing data analysis. 🤝🎉
If you find it useful, please consider an upvote. 👍
Thank you! ❤️
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The dataset has been obtained by web scraping a Wikipedia page, for which the code is linked below: https://www.kaggle.com/amruthayenikonda/simple-web-scraping-using-pandas
This dataset can be used to practice data cleaning and manipulation, for example dropping unwanted columns, handling null values, removing symbols, etc., as shown in the sketch below.
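A minimal pandas sketch of those operations; the file and column names here are hypothetical placeholders, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical file name; adjust to the downloaded CSV.
df = pd.read_csv("scraped_wikipedia_table.csv")

# Drop unwanted columns (ignore if a column is absent).
df = df.drop(columns=["Notes"], errors="ignore")

# Drop rows containing null values.
df = df.dropna(how="any")

# Strip symbols like commas or % from a numeric column, then cast.
df["Population"] = pd.to_numeric(
    df["Population"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)
```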
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Description: The dataset is intentionally provided for data cleansing and for applying EDA techniques, which makes it fun for data geeks to explore and wrangle. The data is very close to its original form, so dive in and happy exploring.
Features: In total, the dataset contains 121 features. Details are given below.
SK_ID_CURR: ID of loan in our sample
TARGET: Target variable (1 = client with payment difficulties: he/she had late payment of more than X days on at least one of the first Y installments of the loan in our sample; 0 = all other cases)
NAME_CONTRACT_TYPE: Identification if loan is cash or revolving
CODE_GENDER: Gender of the client
FLAG_OWN_CAR: Flag if the client owns a car
FLAG_OWN_REALTY: Flag if the client owns a house or flat
CNT_CHILDREN: Number of children the client has
AMT_INCOME_TOTAL: Income of the client
AMT_CREDIT: Credit amount of the loan
AMT_ANNUITY: Loan annuity
AMT_GOODS_PRICE: For consumer loans, the price of the goods for which the loan is given
NAME_TYPE_SUITE: Who was accompanying the client when applying for the loan
NAME_INCOME_TYPE: Client's income type (businessman, working, maternity leave, …)
NAME_EDUCATION_TYPE: Level of highest education the client achieved
NAME_FAMILY_STATUS: Family status of the client
NAME_HOUSING_TYPE: Housing situation of the client (renting, living with parents, ...)
REGION_POPULATION_RELATIVE: Normalized population of the region where the client lives (a higher number means the client lives in a more populated region)
DAYS_BIRTH: Client's age in days at the time of application
DAYS_EMPLOYED: How many days before the application the person started current employment
DAYS_REGISTRATION: How many days before the application the client changed his registration
DAYS_ID_PUBLISH: How many days before the application the client changed the identity document with which he applied for the loan
OWN_CAR_AGE: Age of the client's car
FLAG_MOBIL: Did client provide mobile phone (1=YES, 0=NO)
FLAG_EMP_PHONE: Did client provide work phone (1=YES, 0=NO)
FLAG_WORK_PHONE: Did client provide home phone (1=YES, 0=NO)
FLAG_CONT_MOBILE: Was mobile phone reachable (1=YES, 0=NO)
FLAG_PHONE: Did client provide home phone (1=YES, 0=NO)
FLAG_EMAIL: Did client provide email (1=YES, 0=NO)
OCCUPATION_TYPE: What kind of occupation the client has
CNT_FAM_MEMBERS: How many family members the client has
REGION_RATING_CLIENT: Our rating of the region where the client lives (1, 2, 3)
REGION_RATING_CLIENT_W_CITY: Our rating of the region where the client lives, taking the city into account (1, 2, 3)
WEEKDAY_APPR_PROCESS_START: On which day of the week the client applied for the loan
HOUR_APPR_PROCESS_START: Approximately at what hour the client applied for the loan
REG_REGION_NOT_LIVE_REGION: Flag if the client's permanent address does not match contact address (1=different, 0=same, at region level)
REG_REGION_NOT_WORK_REGION: Flag if the client's permanent address does not match work address (1=different, 0=same, at region level)
LIVE_REGION_NOT_WORK_REGION: Flag if the client's contact address does not match work address (1=different, 0=same, at region level)
REG_CITY_NOT_LIVE_CITY: Flag if the client's permanent address does not match contact address (1=different, 0=same, at city level)
REG_CITY_NOT_WORK_CITY: Flag if the client's permanent address does not match work address (1=different, 0=same, at city level)
LIVE_CITY_NOT_WORK_CITY: Flag if the client's contact address does not match work address (1=different, 0=same, at city level)
ORGANIZATION_TYPE: Type of organization where the client works
EXT_SOURCE_1: Normalized score from external data source
EXT_SOURCE_2: Normalized score from external data source
EXT_SOURCE_3: Normalized score from external data source
APARTMENTS_AVG, BASEMENTAREA_AVG, YEARS_BEGINEXPLUATATION_AVG, YEARS_BUILD_AVG, ...: Normalized information about the building where the client lives: the average (_AVG suffix), modus (_MODE suffix), or median (_MEDI suffix) of apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, and number of floors
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
These graphs were created in Looker Studio, Power BI, and Tableau:
![graph1](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2Fcc028c19723f992a06fafed25acb3c0a%2Fgraph1.jpg?generation=1754162133477099&alt=media)
![graph2](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2F29b3afed376627bc7506b4e7168d50db%2Fgraph2.jpg?generation=1754162139850838&alt=media)
![graph3](https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F16731800%2Fed520c5b3cb94c2c22973a0fce44185a%2Fgraph3.png?generation=1754162145812743&alt=media)
The People sample dataset on Datablist is a privacy-safe, synthetic census designed for demos, data-wrangling drills and performance benchmarks. Each row carries an incremental Index that doubles as a primary key, a unique User Id token, the person's First Name and Last Name, a binary Sex flag ("Male" or "Female"), a well-formed Email, a phone number in varied international formats under Phone, a Date of birth in YYYY-MM-DD style for age maths, and a realistic Job Title ranging from clerk to C-suite.

The same schema is cloned across seven file sizes so you can scale your experiments in a single keystroke: people-100.csv and its zipped twin give you 100 lines; the pattern repeats for 1,000, 10,000, and 100,000 rows; once you hit half a million (people-500000.csv) the files ship raw, followed by one- and two-million-record behemoths.

Every download begins with a header line, all data are random, and the open-source generator script means you can fork your own flavour if you crave extra columns. Load it into pandas, Excel, BigQuery or anything that speaks CSV, stress-test your ETL, train regexes on the emails, or time how long your SQL engine takes to count birthdays. No compliance team will raise an eyebrow.
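For instance, a small pandas sketch using the smallest file and the column names listed above, deriving an approximate age from Date of birth:

```python
import pandas as pd

# Load the 100-row sample (file name as listed in the description).
people = pd.read_csv("people-100.csv")

# Parse the ISO dates and derive a rough age column for sanity checks.
people["Date of birth"] = pd.to_datetime(people["Date of birth"])
people["Age"] = (pd.Timestamp.today() - people["Date of birth"]).dt.days // 365

print(people[["First Name", "Last Name", "Age", "Job Title"]].head())
```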