100+ datasets found
  1. f

    DataSheet1_Exploratory data analysis (EDA) machine learning approaches for...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    World
    Description

    Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.

  2. Collection of example datasets used for the book - R Programming -...

    • figshare.com
    txt
    Updated Dec 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kingsley Okoye; Samira Hosseini (2023). Collection of example datasets used for the book - R Programming - Statistical Data Analysis in Research [Dataset]. http://doi.org/10.6084/m9.figshare.24728073.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    Kingsley Okoye; Samira Hosseini
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provides a wide range of functions for programming and analyzing of data. Unlike many of the existing statistical softwares, R has the added benefit of allowing the users to write more efficient codes by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allows the users to define their own (customized) functions on how they expect the program to behave while handling the data, which can also be stored in the simple object system.For all intents and purposes, this book serves as both textbook and manual for R statistics particularly in academic research, data analytics, and computer programming targeted to help inform and guide the work of the R users or statisticians. It provides information about different types of statistical data analysis and methods, and the best scenarios for use of each case in R. It gives a hands-on step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for reliability and validity of the available datasets. Different research experiments, case scenarios and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R particularly for research purposes with examples. Ranging from how to import and store datasets in R as Objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, congruence of Statistics and Computer programming for Research.

  3. w

    Dataset of books called Data analysis in business research : a step-by-step...

    • workwithdata.com
    Updated Apr 17, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Work With Data (2025). Dataset of books called Data analysis in business research : a step-by-step nonparametric approach [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Data+analysis+in+business+research+%3A+a+step-by-step+nonparametric+approach
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 1 row and is filtered where the book is Data analysis in business research : a step-by-step nonparametric approach. It features 7 columns including author, publication date, language, and book publisher.

  4. Z

    Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Washington and Lee University
    College of William and Mary
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we omitted to use any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce 35 features into 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships on the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR od DL4SE”. 5.Interpretation/Evaluation. We used the Knowledge Discover to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = Number of occurrences this statement is true divided by the amount of statements Confidence = The support of the statement divided by the number of occurrences of the premise

  5. Basic R for Data Analysis

    • kaggle.com
    zip
    Updated Dec 8, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kebba Ndure (2024). Basic R for Data Analysis [Dataset]. https://www.kaggle.com/datasets/kebbandure/basic-r-for-data-analysis/code
    Explore at:
    zip(279031 bytes)Available download formats
    Dataset updated
    Dec 8, 2024
    Authors
    Kebba Ndure
    Description

    ABOUT DATASET

    This is the R markdown notebook. It contains step by step guide for working on Data Analysis with R. It helps you with installing the relevant packages and how to load them. it also provides a detailed summary of the "dplyr" commands that you can use to manipulate your data in the R environment.

    Anyone new to R and wish to carry out some data analysis on R can check it out!

  6. A data analysis framework for biomedical big data: Application on mesoderm...

    • plos.figshare.com
    txt
    Updated Jun 3, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Benjamin Ulfenborg; Alexander Karlsson; Maria Riveiro; Caroline Améen; Karolina Åkesson; Christian X. Andersson; Peter Sartipy; Jane Synnergren (2023). A data analysis framework for biomedical big data: Application on mesoderm differentiation of human pluripotent stem cells [Dataset]. http://doi.org/10.1371/journal.pone.0179613
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Benjamin Ulfenborg; Alexander Karlsson; Maria Riveiro; Caroline Améen; Karolina Åkesson; Christian X. Andersson; Peter Sartipy; Jane Synnergren
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The development of high-throughput biomolecular technologies has resulted in generation of vast omics data at an unprecedented rate. This is transforming biomedical research into a big data discipline, where the main challenges relate to the analysis and interpretation of data into new biological knowledge. The aim of this study was to develop a framework for biomedical big data analytics, and apply it for analyzing transcriptomics time series data from early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. To this end, transcriptome profiling by microarray was performed on differentiating human pluripotent stem cells sampled at eleven consecutive days. The gene expression data was analyzed using the five-stage analysis framework proposed in this study, including data preparation, exploratory data analysis, confirmatory analysis, biological knowledge discovery, and visualization of the results. Clustering analysis revealed several distinct expression profiles during differentiation. Genes with an early transient response were strongly related to embryonic- and mesendoderm development, for example CER1 and NODAL. Pluripotency genes, such as NANOG and SOX2, exhibited substantial downregulation shortly after onset of differentiation. Rapid induction of genes related to metal ion response, cardiac tissue development, and muscle contraction were observed around day five and six. Several transcription factors were identified as potential regulators of these processes, e.g. POU1F1, TCF4 and TBP for muscle contraction genes. Pathway analysis revealed temporal activity of several signaling pathways, for example the inhibition of WNT signaling on day 2 and its reactivation on day 4. This study provides a comprehensive characterization of biological events and key regulators of the early differentiation of human pluripotent stem cells towards the mesoderm and cardiac lineages. The proposed analysis framework can be used to structure data analysis in future research, both in stem cell differentiation, and more generally, in biomedical big data analytics.

  7. f

    A Two-Step Method for smFRET Data Analysis

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    Updated Jul 14, 2016
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Piecco, Kurt Waldo Sy; Pyle, Joseph R.; Chen, Jixin; Kolomeisky, Anatoly B.; Landes, Christy F. (2016). A Two-Step Method for smFRET Data Analysis [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001570037
    Explore at:
    Dataset updated
    Jul 14, 2016
    Authors
    Piecco, Kurt Waldo Sy; Pyle, Joseph R.; Chen, Jixin; Kolomeisky, Anatoly B.; Landes, Christy F.
    Description

    We demonstrate a two-step data analysis method to increase the accuracy of single-molecule Förster Resonance Energy Transfer (smFRET) experiments. Most current smFRET studies are at a time resolution on the millisecond level. When the system also contains molecular dynamics on the millisecond level, simulations show that large errors are present (e.g., > 40%) because false state assignment becomes significant during data analysis. We introduce and confirm an additional step after normal smFRET data analysis that is able to reduce the error (e.g., < 10%). The idea is to use Monte Carlo simulation to search ideal smFRET trajectories and compare them to the experimental data. Using a mathematical model, we are able to find the matches between these two sets, and back guess the hidden rate constants in the experimental results.

  8. Google Data Analytics Capstone Project

    • kaggle.com
    zip
    Updated Sep 27, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anamingear (2024). Google Data Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/anamingear/google-data-analytics-capstone-project
    Explore at:
    zip(40894695 bytes)Available download formats
    Dataset updated
    Sep 27, 2024
    Authors
    Anamingear
    Description

    ## Case Study: How Does a Bike-Share Navigate Speedy Success?

    Introduction Welcome to the Cyclistic bike-share analysis case study! In this case study, you work for a fictional company, Cyclistic, along with some key team members. In order to answer the business questions, follow the steps of the data analysis process: Ask, Prepare, Process,Analyze, Share, and Act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

    Scenario You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

    Characters and teams ● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.

    ● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

    ● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals—as well as how you, as a junior data analyst, can help Cyclistic achieve them.

    ● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program. About the company In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

    Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

    Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

    Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

    This project will completed using the 6 stages of Data Analysis:

    Stage 1: Ask- How Do annual members and casual riders use Cyclistic Bikes differently?

    Stage 2: Prepare- Collect the data, identify how it is organized, and check for bias and determine credibility of the data.

    Stage 3: Process- select the tool for cleaning and organizing the data, ensure data's integrity, check for errors and transform data to work with effectively.

    Stage 4: Analyze- aggregate data so it is useful and accessible, organize and format data, perform calculation, and identify trends and relationships.

    Stage 5: Share- Determine how to best share findings, create effective visualizations, and present findings.

    Stage 6: Act- create portfolio, and ad...

  9. Daily Machine Learning Practice

    • kaggle.com
    zip
    Updated Nov 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Astrid Villalobos (2025). Daily Machine Learning Practice [Dataset]. https://www.kaggle.com/datasets/astridvillalobos/daily-machine-learning-practice
    Explore at:
    zip(1019861 bytes)Available download formats
    Dataset updated
    Nov 9, 2025
    Authors
    Astrid Villalobos
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Daily Machine Learning Practice – 1 Commit per Day

    Author: Astrid Villalobos Location: Montréal, QC LinkedIn: https://www.linkedin.com/in/astridcvr/

    Objective The goal of this project is to strengthen Machine Learning and data analysis skills through small, consistent daily contributions. Each commit focuses on a specific aspect of data processing, feature engineering, or modeling using Python, Pandas, and Scikit-learn.

    Dataset Source: Kaggle – Sample Sales Data File: data/sales_data_sample.csv Variables: ORDERNUMBER, QUANTITYORDERED, PRICEEACH, SALES, COUNTRY, etc. Goal: Analyze e-commerce performance, predict sales trends, segment customers, and forecast demand.

    **Project Rules **Rule Description 🟩 1 Commit per Day Minimum one line of code daily to ensure consistency and discipline 🌍 Bilingual Comments Code and documentation in English and French 📈 Visible Progress Daily green squares = daily learning 🧰 Tech Stack

    Languages: Python Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn Tools: Jupyter Notebook, GitHub, Kaggle

    Learning Outcomes By the end of this challenge: Develop a stronger understanding of data preprocessing, modeling, and evaluation. Build consistent coding habits through daily practice. Apply ML techniques to real-world sales data scenarios.

  10. Additional file 1 of Simple but powerful interactive data analysis in R with...

    • springernature.figshare.com
    zip
    Updated Nov 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Svetlana Ovchinnikova; Simon Anders (2024). Additional file 1 of Simple but powerful interactive data analysis in R with R/LinkedCharts [Dataset]. http://doi.org/10.6084/m9.figshare.26677037.v2
    Explore at:
    zipAvailable download formats
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Svetlana Ovchinnikova; Simon Anders
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Zip file containing the interactive supplement.

  11. f

    Data from: HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE...

    • datasetcatalog.nlm.nih.gov
    • scielo.figshare.com
    Updated May 27, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Helito, Camilo Partezani; Gonçalves, Romeu Krause; de Lima, Lana Lacerda; Clazzer, Renata; de Lima, Diego Ariel; de Camargo, Olavo Pires (2022). HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE USING R SOFTWARE AND RSTUDIO [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000403452
    Explore at:
    Dataset updated
    May 27, 2022
    Authors
    Helito, Camilo Partezani; Gonçalves, Romeu Krause; de Lima, Lana Lacerda; Clazzer, Renata; de Lima, Diego Ariel; de Camargo, Olavo Pires
    Description

    ABSTRACT Meta-analysis is an adequate statistical technique to combine results from different studies, and its use has been growing in the medical field. Thus, not only knowing how to interpret meta-analysis, but also knowing how to perform one, is fundamental today. Therefore, the objective of this article is to present the basic concepts and serve as a guide for conducting a meta-analysis using R and RStudio software. For this, the reader has access to the basic commands in the R and RStudio software, necessary for conducting a meta-analysis. The advantage of R is that it is a free software. For a better understanding of the commands, two examples were presented in a practical way, in addition to revising some basic concepts of this statistical technique. It is assumed that the data necessary for the meta-analysis has already been collected, that is, the description of methodologies for systematic review is not a discussed subject. Finally, it is worth remembering that there are many other techniques used in meta-analyses that were not addressed in this work. However, with the two examples used, the article already enables the reader to proceed with good and robust meta-analyses. Level of Evidence V, Expert Opinion.

  12. Netflix Data Analysis

    • kaggle.com
    Updated Oct 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ankul Sharma (2024). Netflix Data Analysis [Dataset]. https://www.kaggle.com/datasets/ankulsharma150/netflix-data-analysis
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 15, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Ankul Sharma
    Description

    Introduction

    This datasets about Netflix Movies & TV Shows. Datasets have 12 columns with some null values. To analysis of dataset are used Pandas, plotly.express and Datetime libraries. Analysis process I divided into several parts for step wise analysis and to find out trending questions on social media for Bollywood actors and actress.

    Data Manipulation

    Missing Data

    There are many representations of missing data. They are Null values, missing values. I used some of methods used in data analysis process to clean missing values.

    Data Munging

    String Method

    There I used some string method on column such as 'cast', 'Lested_in' to extract data

    Datetime data type

    Converting an object type into datatype objects with the to_datetime function then we have a datatime object, can extract various part of data such as year, month and day

    EDA

    Here, I find out several eye catching question. the following questions are like as- - Show the all Movies & TV Shows released by month - Count the all types of unique rating & which rating are with most number - Salman, Shah Rukh and Akshay Kumar all movie - Find out the Movies & Series have Maximum time length - Year on Year show added on Netflix by its type - Akshay Kumar all comedies movies, Shah Rukh movies with Kajol and Salman-Akshay Movies - Who Director has made the most TV Shows - Actors and Actress who have given most Number of Movies - Find out which types of genre has most movies and TV Shows

  13. f

    Data analysis scripts VR Acceptance

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated Nov 23, 2018
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Schraepen, Brenda; Gillebert, Céline R.; Huygelier, Hanne; Abeele, Vero Vanden; van Ee, Raymond (2018). Data analysis scripts VR Acceptance [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000607141
    Explore at:
    Dataset updated
    Nov 23, 2018
    Authors
    Schraepen, Brenda; Gillebert, Céline R.; Huygelier, Hanne; Abeele, Vero Vanden; van Ee, Raymond
    Description

    These R scripts were used to preprocess and analyze the data of the VR acceptance study. The script "RunAnalyses" executes each data-analysis step. The script "SummarizeData" will preprocess and create the datasets used for the analyses. Some parts of this script may not run accurately on the raw dataset that is publicly shared, since the raw data of the demographics were adjusted to protect participant's privacy.

  14. f

    Data from: Inflect: Optimizing Computational Workflows for Thermal Proteome...

    • acs.figshare.com
    xlsx
    Updated Jun 7, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Neil A. McCracken; Sarah A. Peck Justice; Aruna B. Wijeratne; Amber L. Mosley (2023). Inflect: Optimizing Computational Workflows for Thermal Proteome Profiling Data Analysis [Dataset]. http://doi.org/10.1021/acs.jproteome.0c00872.s002
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    ACS Publications
    Authors
    Neil A. McCracken; Sarah A. Peck Justice; Aruna B. Wijeratne; Amber L. Mosley
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The CETSA and Thermal Proteome Profiling (TPP) analytical methods are invaluable for the study of protein–ligand interactions and protein stability in a cellular context. These tools have increasingly been leveraged in work ranging from understanding signaling paradigms to drug discovery. Consequently, there is an important need to optimize the data analysis pipeline that is used to calculate protein melt temperatures (Tm) and relative melt shifts from proteomics abundance data. Here, we report a user-friendly analysis of the melt shift calculation workflow where we describe the impact of each individual calculation step on the final output list of stabilized and destabilized proteins. This report also includes a description of how key steps in the analysis workflow quantitatively impact the list of stabilized/destabilized proteins from an experiment. We applied our findings to develop a more optimized analysis workflow that illustrates the dramatic sensitivity of chosen calculation steps on the final list of reported proteins of interest in a study and have made the R based program Inflect available for research community use through the CRAN repository [McCracken, N. Inflect: Melt Curve Fitting and Melt Shift Analysis. R package version 1.0.3, 2021]. The Inflect outputs include melt curves for each protein which passes filtering criteria in addition to a data matrix which is directly compatible with downstream packages such as UpsetR for replicate comparisons and identification of biologically relevant changes. Overall, this work provides an essential resource for scientists as they analyze data from TPP and CETSA experiments and implement their own analysis pipelines geared toward specific applications.

  15. e

    Data from: MGVB: a new proteomics toolset for fast and efficient data...

    • ebi.ac.uk
    Updated Nov 15, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Metodi Metodiev (2024). MGVB: a new proteomics toolset for fast and efficient data analysis [Dataset]. https://www.ebi.ac.uk/pride/archive/projects/PXD051331
    Explore at:
    Dataset updated
    Nov 15, 2024
    Authors
    Metodi Metodiev
    Variables measured
    Proteomics
    Description

    MGVB is a collection of tools for proteomics data analysis. It covers data processing from in silico digestion of protein sequences to comprehensive identification of postranslational modifications and solving the protein inference problem. The toolset is developed with efficiency in mind. It enables analysis at a fraction of the resources cost typically required by existing commercial and free tools. MGVB, as it is a native application, is much faster than existing proteomics tools such as MaxQuant and MSFragger and, in the same time, finds very similar, in some cases even larger number of peptides at a chosen level of statistical significance. It implements a probabilistic scoring function to match spectra to sequences, and a novel combinatorial search strategy for finding post-translational modifications, and a Bayesian approach to locate modification sites. This report describes the algorithms behind the tools, presents benchmarking data sets analysis comparing MGVB performance to MaxQuant/Andromeda, and provides step by step instructions for using it in typical analytical scenarios. The toolset is provided free to download and use for academic research and in software projects, but is not open source at the present. It is the intention of the author that it will be made open source in the near future—following rigorous evaluations and feedback from the proteomics research community.

  16. Titanic- exploratory data analysis

    • kaggle.com
    zip
    Updated Jul 19, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Karthik (2025). Titanic- exploratory data analysis [Dataset]. https://www.kaggle.com/datasets/pandureddy123/titanic-exploratory-data-analysis
    Explore at:
    zip(962897 bytes)Available download formats
    Dataset updated
    Jul 19, 2025
    Authors
    Karthik
    Description

    One more step towards Machine learning! This is a titatic dataset with exploratory data analysis html file. I used pandas-profiling for fast analysis.

  17. d

    WDFW Status and Trends Analysis of Salmon Abundance Data (Step 1, 2022)

    • catalog.data.gov
    • data.wa.gov
    • +1more
    Updated Jun 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    data.wa.gov (2025). WDFW Status and Trends Analysis of Salmon Abundance Data (Step 1, 2022) [Dataset]. https://catalog.data.gov/dataset/wdfw-status-and-trends-analysis-of-salmon-abundance-data-step-1-2022
    Explore at:
    Dataset updated
    Jun 29, 2025
    Dataset provided by
    data.wa.gov
    Description

    WDFW Status and Trends Analysis of Salmon Abundance Data (Step 1, 2022)

  18. t

    Data is Useless Unless You Can Act on It - Data Analysis

    • tomtunguz.com
    Updated Jul 15, 2014
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tomasz Tunguz (2014). Data is Useless Unless You Can Act on It - Data Analysis [Dataset]. https://tomtunguz.com/data-is-useless-unless-you-can-act-on-it/
    Explore at:
    Dataset updated
    Jul 15, 2014
    Dataset provided by
    Theory Ventures
    Authors
    Tomasz Tunguz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Learn why data collection alone isn't enough for startups. Discover the 3-step pattern behind successful data-driven companies and how actionable insights drive real business value.

  19. i

    Data for Analysis of a Single-Phase Single-Stage Boost Inverter Topology for...

    • ieee-dataport.org
    Updated Mar 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Awingot Akparibo (2025). Data for Analysis of a Single-Phase Single-Stage Boost Inverter Topology for PV Applications [Dataset]. https://ieee-dataport.org/documents/data-analysis-single-phase-single-stage-boost-inverter-topology-pv-applications
    Explore at:
    Dataset updated
    Mar 6, 2025
    Authors
    Awingot Akparibo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this study

  20. d

    Data from: A protocol for conducting and presenting results of...

    • search.dataone.org
    • datadryad.org
    Updated Apr 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alain F. Zuur; Elena N. Ieno (2025). A protocol for conducting and presenting results of regression-type analyses [Dataset]. http://doi.org/10.5061/dryad.v4t42
    Explore at:
    Dataset updated
    Apr 18, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Alain F. Zuur; Elena N. Ieno
    Time period covered
    Jan 1, 2017
    Description

    Scientific investigation is of value only insofar as relevant results are obtained and communicated, a task that requires organizing, evaluating, analysing and unambiguously communicating the significance of data. In this context, working with ecological data, reflecting the complexities and interactions of the natural world, can be a challenge. Recent innovations for statistical analysis of multifaceted interrelated data make obtaining more accurate and meaningful results possible, but key decisions of the analyses to use, and which components to present in a scientific paper or report, may be overwhelming. We offer a 10-step protocol to streamline analysis of data that will enhance understanding of the data, the statistical models and the results, and optimize communication with the reader with respect to both the procedure and the outcomes. The protocol takes the investigator from study design and organization of data (formulating relevant questions, visualizing data collection, data...

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst (2023). DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx [Dataset]. http://doi.org/10.3389/fspas.2023.1134141.s001

DataSheet1_Exploratory data analysis (EDA) machine learning approaches for ocean world analog mass spectrometry.docx

Related Article
Explore at:
docxAvailable download formats
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Victoria Da Poian; Bethany Theiling; Lily Clough; Brett McKinney; Jonathan Major; Jingyi Chen; Sarah Hörst
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered
World
Description

Many upcoming and proposed missions to ocean worlds such as Europa, Enceladus, and Titan aim to evaluate their habitability and the existence of potential life on these moons. These missions will suffer from communication challenges and technology limitations. We review and investigate the applicability of data science and unsupervised machine learning (ML) techniques on isotope ratio mass spectrometry data (IRMS) from volatile laboratory analogs of Europa and Enceladus seawaters as a case study for development of new strategies for icy ocean world missions. Our driving science goal is to determine whether the mass spectra of volatile gases could contain information about the composition of the seawater and potential biosignatures. We implement data science and ML techniques to investigate what inherent information the spectra contain and determine whether a data science pipeline could be designed to quickly analyze data from future ocean worlds missions. In this study, we focus on the exploratory data analysis (EDA) step in the analytics pipeline. This is a crucial unsupervised learning step that allows us to understand the data in depth before subsequent steps such as predictive/supervised learning. EDA identifies and characterizes recurring patterns, significant correlation structure, and helps determine which variables are redundant and which contribute to significant variation in the lower dimensional space. In addition, EDA helps to identify irregularities such as outliers that might be due to poor data quality. We compared dimensionality reduction methods Uniform Manifold Approximation and Projection (UMAP) and Principal Component Analysis (PCA) for transforming our data from a high-dimensional space to a lower dimension, and we compared clustering algorithms for identifying data-driven groups (“clusters”) in the ocean worlds analog IRMS data and mapping these clusters to experimental conditions such as seawater composition and CO2 concentration. Such data analysis and characterization efforts are the first steps toward the longer-term science autonomy goal where similar automated ML tools could be used onboard a spacecraft to prioritize data transmissions for bandwidth-limited outer Solar System missions.

Search
Clear search
Close search
Google apps
Main menu