41 datasets found
  1. Frontiers of Data Visualization Workshop II: Data Wrangling Workshop Summary...

    • catalog.data.gov
    Updated May 14, 2025
    + more versions
    Cite
    NCO NITRD (2025). Frontiers of Data Visualization Workshop II: Data Wrangling Workshop Summary [Dataset]. https://catalog.data.gov/dataset/frontiers-of-data-visualization-workshop-ii-data-wrangling-workshop-summary
    Explore at:
    Dataset updated
    May 14, 2025
    Dataset provided by
    NCO NITRD
    Description

    The Data Visualization Workshop II: Data Wrangling was a web-based event held on October 18, 2017. This workshop report summarizes the individual perspectives of a group of visualization experts from the public, private, and academic sectors who met online to discuss how to improve the creation and use of high-quality visualizations. The specific focus of this workshop was on the complexities of "data wrangling". Data wrangling includes finding the appropriate data sources that are both accessible and usable and then shaping and combining that data to facilitate the most accurate and meaningful analysis possible. The workshop was organized as a 3-hour web event and moderated by the members of the Human Computer Interaction and Information Management Task Force of the Networking and Information Technology Research and Development Program's Big Data Interagency Working Group. Report prepared by the Human Computer Interaction And Information Management Task Force, Big Data Interagency Working Group, Networking & Information Technology Research & Development Subcommittee, Committee On Technology Of The National Science & Technology Council...

  2. Prosper loan data.

    • kaggle.com
    zip
    Updated Jun 7, 2021
    Cite
    Shikhar Sharma (2021). Prosper loan data. [Dataset]. https://www.kaggle.com/shikhar07/prosper-loan-data
    Explore at:
    Available download formats: zip (23591647 bytes)
    Dataset updated
    Jun 7, 2021
    Authors
    Shikhar Sharma
    Description

    Context

    Loan Data from Prosper.

    Content

    This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. This data dictionary explains the variables in the data set.

  3. Enriched NYTimes COVID19 U.S. County Dataset

    • kaggle.com
    zip
    Updated Jun 14, 2020
    Cite
    ringhilterra17 (2020). Enriched NYTimes COVID19 U.S. County Dataset [Dataset]. https://www.kaggle.com/ringhilterra17/enrichednytimescovid19
    Explore at:
    Available download formats: zip (11291611 bytes)
    Dataset updated
    Jun 14, 2020
    Authors
    ringhilterra17
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Overview and Inspiration

    I wanted to make some geospatial visualizations to convey the current severity of COVID19 in different parts of the U.S.

    I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases / deaths per day, per capita calculations, and county demographics.

    After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytic, modeling, and visualization efforts.

    This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per day cases / deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.

    UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.

    How this data can be used

    Geospatial analysis and visualization
    • Which counties are currently getting hit the hardest (per capita and totals)?
    • What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat / lons)
    • Do county population densities play a role in how quickly the virus spreads?
    • How do a specific county's or state's cases and deaths compare to other counties or states?

    Join with other county-level datasets easily (with the fips code column)

    Content Details

    See the column descriptions for more details on the dataset

    Visualizations and Analysis Examples

    COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)

    https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true

    Other Data Notes

    • Please review nytimes README for detailed notes on Covid-19 data - https://github.com/nytimes/covid-19-data/
    • The only update I made regarding 'Geographic Exceptions' concerns the 'New York City' county provided in the Covid-19 data, which holds all cases for the five boroughs of New York City (New York, Kings, Queens, Bronx and Richmond counties). I replaced the missing FIPS for those rows with the 'New York County' FIPS code 36061 so that I could join to a geometry. I then used the sum of the five boroughs' population estimates as the 'New York City' estimate, which allowed me to calculate 'per capita' metrics for 'New York City' entries in the Covid-19 dataset.
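
    The New York City adjustment described in the note above can be sketched in pandas. This is a minimal illustration under stated assumptions, not the author's code: the column names (`county`, `fips`, `cases`, `population`) and all sample values are hypothetical, and only the FIPS codes are real.

```python
import pandas as pd

# Hypothetical sample rows; case counts and populations are made up.
covid = pd.DataFrame({
    "county": ["New York City", "Albany"],
    "fips": [None, "36001"],
    "cases": [1000, 50],
})
population = pd.DataFrame({
    "fips": ["36061", "36047", "36081", "36005", "36085", "36001"],
    "population": [1600, 2500, 2200, 1400, 500, 300],
})

NYC_FIPS = "36061"  # New York County, used as the stand-in for all of NYC
# New York, Kings, Queens, Bronx and Richmond counties
BOROUGH_FIPS = ["36061", "36047", "36081", "36005", "36085"]

# Replace the missing FIPS on 'New York City' rows so they can be joined.
is_nyc = covid["county"].eq("New York City")
covid.loc[is_nyc, "fips"] = NYC_FIPS

# Use the summed borough populations as the 'New York City' estimate.
in_boroughs = population["fips"].isin(BOROUGH_FIPS)
nyc_population = population.loc[in_boroughs, "population"].sum()
pop_lookup = population.set_index("fips")["population"].copy()
pop_lookup[NYC_FIPS] = nyc_population

# Join population onto the case data and compute a per capita metric.
covid["population"] = covid["fips"].map(pop_lookup)
covid["cases_per_capita"] = covid["cases"] / covid["population"]
```

    Overriding the lookup entry for 36061, rather than editing the population table itself, keeps the original county estimates intact for any other join.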

    Acknowledgements

  4. Data Prep Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Jun 23, 2025
    Cite
    Market Research Forecast (2025). Data Prep Report [Dataset]. https://www.marketresearchforecast.com/reports/data-prep-547253
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Prep market is booming, projected to reach $12 Billion by 2033 with a 13.7% CAGR. Discover key trends, leading companies (Alteryx, Informatica, IBM), and regional insights in this comprehensive market analysis. Learn how self-service tools and cloud solutions are transforming data preparation.

  5. Airbnb-NYC-Cleaned

    • kaggle.com
    zip
    Updated Aug 25, 2022
    Cite
    sandeep majumdar (2022). Airbnb-NYC-Cleaned [Dataset]. https://www.kaggle.com/sandeepmajumdar/airbnbnyccleaned
    Explore at:
    Available download formats: zip (7294486 bytes)
    Dataset updated
    Aug 25, 2022
    Authors
    sandeep majumdar
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    New York
    Description

    IF YOU WANT TO START WITH DATA VISUALIZATION DIRECTLY, USE THIS DATASET

    But if you want to start with data cleaning, find the original dataset below: This is a cleaned version of the Airbnb open data found at the following link: https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata

    The original message from Arian: "This dataset is part of Airbnb Inside but I tried to make new columns and many data inconsistency issue to create a new dataset to practice data cleaning. The original source can be found here http://insideairbnb.com/explore/

    Arian Azmoudeh"

  6. Drought Machine Learning Data Example

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    zip
    Updated Aug 22, 2023
    Cite
    Bryce Pulver (2023). Drought Machine Learning Data Example [Dataset]. https://www.hydroshare.org/resource/9024db8a67fd4afdab2358d1b75e7e85
    Explore at:
    Available download formats: zip (518.5 MB)
    Dataset updated
    Aug 22, 2023
    Dataset provided by
    HydroShare
    Authors
    Bryce Pulver
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1980 - Dec 1, 2020
    Area covered
    Description

    This repository showcases some examples of data wrangling and visualization using the output of a USGS drought prediction model on the Colorado River Basin and example ecology site data.

  7. Global Data Wrangling Market Research Report: By Application (Data...

    • wiseguyreports.com
    Updated Jan 1, 2025
    Cite
    (2025). Global Data Wrangling Market Research Report: By Application (Data Preparation, Data Cleaning, Data Integration, Data Transformation), By Deployment Mode (Cloud-Based, On-Premises, Hybrid), By End User (BFSI, Healthcare, Retail, Telecommunications), By Tool Type (Self-Service Tools, Data Analytics Tools, Data Visualization Tools) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/de/reports/data-wrangling-market
    Explore at:
    Dataset updated
    Jan 1, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Sep 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 5.18 (USD Billion)
    MARKET SIZE 2025: 5.7 (USD Billion)
    MARKET SIZE 2035: 15.0 (USD Billion)
    SEGMENTS COVERED: Application, Deployment Mode, End User, Tool Type, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: Increasing data volume, Need for data quality, Rising adoption of AI, Growing demand for analytics, Cost-effective solutions
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: Microsoft, Apache Software Foundation, Google, Pentaho, IBM, Talend, RapidMiner, SAS Institute, Informatica, Deloitte, TIBCO Software, Crimson Hexagon, Oracle, Trifacta, DataRobot, Alteryx
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: Increased demand for big data, Growing need for real-time analytics, Rise of AI-driven data solutions, Expansion in cloud-based services, Emerging trends in data privacy compliance
    COMPOUND ANNUAL GROWTH RATE (CAGR): 10.1% (2025 - 2035)

  8. Data from: Designing data science workshops for data-intensive environmental...

    • data.niaid.nih.gov
    • datasetcatalog.nlm.nih.gov
    • +1more
    zip
    Updated Dec 8, 2020
    Cite
    Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    Montana State University
    California State Polytechnic University
    Authors
    Allison Theobold; Stacey Hancock; Sara Mannheimer
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

    Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

    Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

    • The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files.
    • The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose names include survey contain raw data from the pre-workshop surveys, and the data files whose names include assessment contain raw data from the post-workshop assessment survey.
    • The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively.
    • The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
    • The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
    
  9. NiftyOptionChainDataset

    • kaggle.com
    zip
    Updated May 13, 2021
    Cite
    Nikhil Ulahannan (2021). NiftyOptionChainDataset [Dataset]. https://www.kaggle.com/nikhilulahannan/niftyoptionchaindataset
    Explore at:
    Available download formats: zip (31131 bytes)
    Dataset updated
    May 13, 2021
    Authors
    Nikhil Ulahannan
    Description

    Context

    Option chain data is a product of complex calculations, yet it is unorganised because of its inherently non-uniform data relevance structure, which makes it harder to use for data analytics.

    Content

    The dataset contains raw option chain data (calls, puts, IV, etc.) for 3 adjacent weeks in May 2021. An additional data file with minor modifications (clean-sample) is included for better use of the data explorer features.

    Acknowledgements

    National Stock Exchange (NSE) website.

    Inspiration

    • Develop a code framework for data cleaning, wrangling and visualization of option chain data.
    • Exploratory Data Analysis (EDA).
    • Analyse the evolution of option premiums, IV, etc. and its impact over a month.
    • Insights for better straddles and strangles (option strategies).

  10. Netflix

    • kaggle.com
    zip
    Updated Jul 29, 2025
    Cite
    Prasanna@82 (2025). Netflix [Dataset]. https://www.kaggle.com/datasets/prasanna82/netflix/code
    Explore at:
    Available download formats: zip (1400865 bytes)
    Dataset updated
    Jul 29, 2025
    Authors
    Prasanna@82
    Description

    Netflix Dataset Exploration and Visualization

    This project involves an in-depth analysis of the Netflix dataset to uncover key trends and patterns in the streaming platform’s content offerings. Using Python libraries such as Pandas, NumPy, and Matplotlib, this notebook visualizes and interprets critical insights from the data.

    Objectives:

    Analyze the distribution of content types (Movies vs. TV Shows)

    Identify the most prolific countries producing Netflix content

    Study the ratings and duration of shows

    Handle missing values using techniques like interpolation, forward-fill, and custom replacements

    Enhance readability with bar charts, horizontal plots, and annotated visuals

    Key Visualizations:

    Bar charts for type distribution and country-wise contributions

    Handling missing data in rating, duration, and date_added

    Annotated plots showing values for clarity

    Tools Used:

    Python 3

    Pandas for data wrangling

    Matplotlib for visualizations

    Jupyter Notebook for hands-on analysis

    Outcome: This project provides a clear view of Netflix's content library, helping data enthusiasts and beginners understand how to process, clean, and visualize real-world datasets effectively.

    Feel free to fork, adapt, and extend the work.
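
    The missing-value techniques named in the objectives (custom replacement, interpolation, forward-fill) can be sketched with pandas as follows. This is an illustrative example rather than the notebook's actual code: the sample rows are made up, and the numeric `duration_min` column is a hypothetical stand-in for the dataset's `duration` field.

```python
import pandas as pd

# Hypothetical sample with one missing value per column.
df = pd.DataFrame({
    "rating": ["PG-13", None, "TV-MA"],
    "duration_min": [90.0, None, 110.0],
    "date_added": ["2021-01-05", None, "2021-03-01"],
})

# Custom replacement: label unknown ratings explicitly.
df["rating"] = df["rating"].fillna("Unknown")

# Linear interpolation between the neighbouring numeric values.
df["duration_min"] = df["duration_min"].interpolate()

# Forward-fill: carry the last known date down into the gap.
df["date_added"] = pd.to_datetime(df["date_added"]).ffill()
```

    Which technique is appropriate depends on the column: interpolation only makes sense for ordered numeric data, while forward-fill assumes that neighbouring rows are related.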

  11. Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America...

    • technavio.com
    pdf
    Updated Jan 11, 2025
    Cite
    Technavio (2025). Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), Middle East and Africa (UAE), APAC (China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/data-analytics-market-industry-analysis
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jan 11, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Description

    Data Analytics Market Size 2025-2029

    The data analytics market size is forecast to increase by USD 288.7 billion, at a CAGR of 14.7% between 2024 and 2029.

    The market is driven by the extensive use of modern technology in company operations, enabling businesses to extract valuable insights from their data. The prevalence of the Internet and the increased use of linked and integrated technologies have facilitated the collection and analysis of vast amounts of data from various sources. This trend is expected to continue as companies seek to gain a competitive edge by making data-driven decisions. However, the integration of data from different sources poses significant challenges. Ensuring data accuracy, consistency, and security is crucial as companies deal with large volumes of data from various internal and external sources. Additionally, the complexity of data analytics tools and the need for specialized skills can hinder adoption, particularly for smaller organizations with limited resources. Companies must address these challenges by investing in robust data management systems, implementing rigorous data validation processes, and providing training and development opportunities for their employees. By doing so, they can effectively harness the power of data analytics to drive growth and improve operational efficiency.

    What will be the Size of the Data Analytics Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    In the dynamic and ever-evolving market, entities such as explainable AI, time series analysis, data integration, data lakes, algorithm selection, feature engineering, marketing analytics, computer vision, data visualization, financial modeling, real-time analytics, data mining tools, and KPI dashboards continue to unfold and intertwine, shaping the industry's landscape. The application of these technologies spans various sectors, from risk management and fraud detection to conversion rate optimization and social media analytics. ETL processes, data warehousing, statistical software, data wrangling, and data storytelling are integral components of the data analytics ecosystem, enabling organizations to extract insights from their data. Cloud computing, deep learning, and data visualization tools further enhance the capabilities of data analytics platforms, allowing for advanced data-driven decision making and real-time analysis. Marketing analytics, clustering algorithms, and customer segmentation are essential for businesses seeking to optimize their marketing strategies and gain a competitive edge. Regression analysis, data visualization tools, and machine learning algorithms are instrumental in uncovering hidden patterns and trends, while predictive modeling and causal inference help organizations anticipate future outcomes and make informed decisions. Data governance, data quality, and bias detection are crucial aspects of the data analytics process, ensuring the accuracy, security, and ethical use of data. Supply chain analytics, healthcare analytics, and financial modeling are just a few examples of the diverse applications of data analytics, demonstrating the industry's far-reaching impact. Data pipelines, data mining, and model monitoring are essential for maintaining the continuous flow of data and ensuring the accuracy and reliability of analytics models. The integration of various data analytics tools and techniques continues to evolve, as the industry adapts to the ever-changing needs of businesses and consumers alike.

    How is this Data Analytics Industry segmented?

    The data analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
    • Component: Services, Software, Hardware
    • Deployment: Cloud, On-premises
    • Type: Prescriptive Analytics, Predictive Analytics, Customer Analytics, Descriptive Analytics, Others
    • Application: Supply Chain Management, Enterprise Resource Planning, Database Management, Human Resource Management, Others
    • Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan, South Korea), South America (Brazil), Rest of World (ROW)

    By Component Insights

    The services segment is estimated to witness significant growth during the forecast period. The market is experiencing significant growth as businesses increasingly rely on advanced technologies to gain insights from their data. Natural language processing is a key component of this trend, enabling more sophisticated analysis of unstructured data. Fraud detection and data security solutions are also in high demand, as companies seek to protect against threats and maintain customer trust. Data analytics platforms, including cloud-based offerings, are driving innovatio

  12. AI-Enabled Testing Tools Market By technology (natural language processing...

    • zionmarketresearch.com
    pdf
    Updated Nov 14, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zion Market Research (2025). AI-Enabled Testing Tools Market By technology (natural language processing (NLP), machine learning & pattern recognition, and computer vision and image processing), By solution (services, which include professional services & managed services, and AI-based tools which are reduction & feature selection, data pre-processing & wrangling, data visualization, and others), By application (efficiency and time-to-market, further categorized into test automation, data analytics, and infrastructure optimization, agility & coverage) And By Region: - Global And Regional Industry Overview, Market Intelligence, Comprehensive Analysis, Historical Data, And Forecasts, 2024-2032 [Dataset]. https://www.zionmarketresearch.com/report/ai-enabled-testing-tools-market
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 14, 2025
    Dataset authored and provided by
    Zion Market Research
    License

    https://www.zionmarketresearch.com/privacy-policy

    Time period covered
    2022 - 2030
    Area covered
    Global
    Description

    The global AI-Enabled Testing Tools Market was valued at $437.56 Million in 2023 and is projected to reach $1,693.95 Million by 2032, at a CAGR of 16.23%.
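
    The growth rate quoted above can be checked with the standard compound annual growth rate formula, (end / start) ** (1 / years) - 1. The sketch below assumes the compounding period runs the nine years from 2023 to 2032; the listing does not state how the figure was derived, and the function name `cagr` is illustrative.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10%)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the description, in USD million; the 9-year span is an assumption.
rate = cagr(437.56, 1693.95, 2032 - 2023)
print(f"{rate:.2%}")  # roughly 16.2%, in line with the stated CAGR
```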

  13. Replication Data for: dxpr: An R package for generating analysis-ready data...

    • dataverse.lib.nycu.edu.tw
    bin, png +1
    Updated Jun 22, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    NYCU Dataverse (2022). Replication Data for: dxpr: An R package for generating analysis-ready data from electronic health records—diagnoses and procedures. [Dataset]. http://doi.org/10.57770/ZRNVCN
    Explore at:
    Available download formats: png (7908), bin (11118), png (6980), bin (5446), text/markdown (25651), png (8091), text/markdown (11422), text/markdown (172)
    Dataset updated
    Jun 22, 2022
    Dataset provided by
    NYCU Dataverse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Enriched electronic health records (EHRs) contain crucial information related to disease progression, and this information can help with decision-making in the health care field. Data analytics in health care is deemed as one of the essential processes that help accelerate the progress of clinical research. However, processing and analyzing EHR data are common bottlenecks in health care data analytics. The dxpr R package provides mechanisms for integration, wrangling, and visualization of clinical data, including diagnosis and procedure records. First, the dxpr package helps users transform International Classification of Diseases (ICD) codes to a uniform format. After code format transformation, the dxpr package supports four strategies for grouping clinical diagnostic data. For clinical procedure data, two grouping methods can be chosen. After EHRs are integrated, users can employ a set of flexible built-in querying functions for dividing data into case and control groups by using specified criteria and splitting the data into before and after an event based on the record date. Subsequently, the structure of integrated long data can be converted into wide, analysis-ready data that are suitable for statistical analysis and visualization. We conducted comorbidity data processes based on a cohort of newborns from Medical Information Mart for Intensive Care-III (n = 7,833) by using the dxpr package. We first defined patent ductus arteriosus (PDA) cases as patients who had at least one PDA diagnosis (ICD, Ninth Revision, Clinical Modification [ICD-9-CM] 7470*). Controls were defined as patients who never had PDA diagnosis. In total, 381 and 7,452 patients with and without PDA, respectively, were included in our study population. Then, we grouped the diagnoses into defined comorbidities. 
Finally, we observed a statistically significant difference in 8 of the 16 comorbidities among patients with and without PDA, including fluid and electrolyte disorders, valvular disease, and others.
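
    The case definition and the long-to-wide conversion described above can be illustrated with a small pandas sketch. This is not the dxpr API (dxpr is an R package) and not the authors' code: the column names, patient IDs, and sample records are hypothetical, and only the "7470" prefix for PDA comes from the description.

```python
import pandas as pd

# Hypothetical long-format records: one row per patient per diagnosis code.
long_df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "icd9": ["7470", "2760", "4240", "7470"],
})

# Cases: patients with at least one code starting with "7470" (PDA).
# Controls: patients who never had such a code.
is_pda = long_df["icd9"].str.startswith("7470")
case_ids = set(long_df.loc[is_pda, "patient_id"])

# Long to wide: one row per patient, one 0/1 indicator column per code.
wide = pd.crosstab(long_df["patient_id"], long_df["icd9"]).clip(upper=1)
wide["pda_case"] = wide.index.isin(case_ids)
```

    The wide table has one row per patient, which is the shape most statistical routines expect when comparing comorbidities between cases and controls.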

  14. Cyclistic_Divvy_data

    • kaggle.com
    zip
    Updated Jun 11, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rami Ghaith (2023). Cyclistic_Divvy_data [Dataset]. https://www.kaggle.com/datasets/ramighaith/cyclistic-divvy-data
    Explore at:
    Available download formats: zip (21440758 bytes)
    Dataset updated
    Jun 11, 2023
    Authors
    Rami Ghaith
    Description

    The following data shows riding information for members vs casual riders at the company Cyclistic (a made-up name). This is a dataset used as a case study for the Google Data Analytics Certificate.

    The Changes Done to the Data in Excel:
    • Removed all duplicates (none were found).
    • Added a ride_length column by subtracting started_at from ended_at using the formula "=C2-B2", then formatted the result as a Time (37:30:55).
    • Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to display the day the ride took place on, 1 = Sunday through 7 = Saturday.
    • Some cells display as ########. That data was left unchanged; it simply represents negative values and should be treated as 0.

    Processing the Data in RStudio:
    • Installed required packages such as tidyverse for data import and wrangling, lubridate for date functions and ggplot for visualization.
    • Step 1: Read the csv files into R to collect the data.
    • Step 2: Made sure the data all contained the same column names, because I want to merge them into one.
    • Step 3: Renamed all column names to make sure they align, then merged them into one combined dataset.
    • Step 4: More data cleaning and analyzing.
    • Step 5: Once my data was cleaned and clearly telling a story, I began to visualize it. The visualizations done can be seen below.
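
    The two derived columns from the Excel step (ride_length and day_of_week) can be reproduced in Python as a cross-check. This is a sketch, not the author's workflow: the function name `ride_features` and the timestamp format are assumptions, while the 1 = Sunday through 7 = Saturday numbering follows the WEEKDAY(B2,1) formula quoted above.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout; adjust to match the actual csv files.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def ride_features(started_at: str, ended_at: str) -> tuple[timedelta, int]:
    """Return (ride_length, day_of_week) for one ride."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(ended_at, TIMESTAMP_FORMAT)
    ride_length = end - start
    # datetime.weekday() is 0 = Monday .. 6 = Sunday; shift it so that
    # 1 = Sunday .. 7 = Saturday, like Excel's WEEKDAY(date, 1).
    day_of_week = (start.weekday() + 1) % 7 + 1
    return ride_length, day_of_week


length, dow = ride_features("2023-06-11 08:00:00", "2023-06-11 08:25:30")
```

    A ride whose end precedes its start yields a negative timedelta here, which corresponds to the ######## cells Excel shows for negative times.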

  15. AI-Based Testing Tools Market By technology (natural language...

    • zionmarketresearch.com
    pdf
    Updated Nov 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zion Market Research (2025). AI-Based Testing Tools Market By Technology (natural language processing (NLP), machine learning and pattern recognition, computer vision and image processing), By Solution (services, which include professional services and managed services, and AI-based tools, which are feature reduction and selection, data pre-processing and wrangling, data visualization, and others), By Application (efficiency and time-to-market, categorized into test automation, data analytics, and infrastructure optimization, agility and coverage) And By Region: Global and regional industry overview, market intelligence, comprehensive analysis, historical data, and forecasts, 2024-2032 [Dataset]. https://www.zionmarketresearch.com/fr/report/ai-enabled-testing-tools-market
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 22, 2025
    Dataset authored and provided by
    Zion Market Research
    License

    https://www.zionmarketresearch.com/privacy-policy

    Time period covered
    2022 - 2030
    Area covered
    Global
    Description

    The global AI-based testing tools market was valued at $437.56 million in 2023 and is expected to reach $1,693.95 million by 2032, at a CAGR of 16.23%.

  16. WeRateDogs Data Analysis

    • kaggle.com
    zip
    Updated Aug 12, 2025
    Cite
    Amr Tamer (2025). WeRateDogs Data Analysis [Dataset]. https://www.kaggle.com/datasets/amrtmansour/weratedogs-data-analysis/versions/4
    Available download formats: zip (556230 bytes)
    Dataset updated
    Aug 12, 2025
    Authors
    Amr Tamer
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This dataset contains tweet data from the popular Twitter account WeRateDogs (@dog_rates), known for humorously rating dogs with numerators greater than 10 ("they're good dogs Brent"). The archive includes 5000+ tweets as they stood on August 1, 2017, and is the basis for a full data wrangling, analysis, and visualization project.

    The dataset was originally provided to Udacity students for the Data Wrangling project, and I am sharing it here to enable others to practice gathering, assessing, cleaning, and analyzing real-world social media data.

  17. AI-Enabled Testing Tools Market by Technology (Natural Language...

    • zionmarketresearch.com
    pdf
    Updated Nov 10, 2025
    Cite
    Zion Market Research (2025). AI-Enabled Testing Tools Market by Technology (Natural Language Processing (NLP), Machine Learning and Pattern Recognition, and Computer Vision and Image Processing), by Solution (Services, including professional services and managed services, and AI-based tools such as data reduction and feature selection, data preprocessing and preparation, data visualization, and others), by Application (efficiency and time to market, further categorized into test automation, data analysis and infrastructure optimization, agility, and coverage), and by Region: Global and Regional Industry Overview, Market Intelligence, Comprehensive Analysis, Historical Data, and Forecasts, 2024-2032 [Dataset]. https://www.zionmarketresearch.com/de/report/ai-enabled-testing-tools-market
    Available download formats: pdf
    Dataset updated
    Nov 10, 2025
    Dataset authored and provided by
    Zion Market Research
    License

    https://www.zionmarketresearch.com/privacy-policy

    Time period covered
    2022 - 2030
    Area covered
    Global
    Description

    The global AI-enabled testing tools market was valued at USD 437.56 million in 2023 and is projected to reach USD 1,693.95 million by 2032, at a compound annual growth rate of 16.23%.

  18. Airbnb Las Vegas Listings 🏠

    • kaggle.com
    zip
    Updated Feb 23, 2024
    Cite
    Kanchana1990 (2024). Airbnb Las Vegas Listings 🏠 [Dataset]. https://www.kaggle.com/datasets/kanchana1990/airbnb-las-vegas-listings/discussion
    Available download formats: zip (70030 bytes)
    Dataset updated
    Feb 23, 2024
    Authors
    Kanchana1990
    License

    Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
    License information was derived automatically

    Area covered
    Las Vegas
    Description

    Airbnb Las Vegas Listings 🏠

    Overview: Welcome to our cozy corner of data, featuring a curated selection of Airbnb listings from the vibrant city of Las Vegas! Dive into the unique stays Vegas has to offer, from luxurious condos to private rooms that promise an unforgettable stay.

    Data Science Applications: This dataset is your playground for various data science projects. Whether you're predicting prices, analyzing guest preferences, or exploring the impact of locations on ratings, there's something here for everyone. It's perfect for those looking to practice their data wrangling, visualization, and machine learning skills in a real-world context. The price column is null here, so handling it can also serve as a data cleaning exercise.

    Column Descriptors:
    - roomType: Discover the type of accommodation.
    - stars: Check out the guest ratings.
    - address: Know where you'll be staying.
    - numberOfGuests: Find out the guest capacity.
    - primaryHost/smartName: Get to know your host.
    - price: Peek at the listing prices. (Note: Some data may be missing here, so creativity in handling this could be a fun challenge!)
    - firstReviewComments: Read what the first guests had to say.
    - firstReviewRating: See how the first guests rated their stay.

    Ethically Mined Data: We're committed to ethical data practices. This dataset has been carefully compiled, respecting privacy and data sharing norms. It's all about fostering learning and innovation, without stepping over any lines.

    A Big Thank You: We extend our heartfelt gratitude to Airbnb and the platforms that share data openly, making projects like this possible. Their commitment to community and openness enriches the data science world.

    Dive in, explore, and let the data spark your curiosity and creativity! Happy analyzing! 🌟

  19. Wrangling Phosphoproteomic Data to Elucidate Cancer Signaling Pathways

    • plos.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Mark L. Grimes; Wan-Jui Lee; Laurens van der Maaten; Paul Shannon (2023). Wrangling Phosphoproteomic Data to Elucidate Cancer Signaling Pathways [Dataset]. http://doi.org/10.1371/journal.pone.0052884
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mark L. Grimes; Wan-Jui Lee; Laurens van der Maaten; Paul Shannon
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The interpretation of biological data sets is essential for generating hypotheses that guide research, yet modern methods of global analysis challenge our ability to discern meaningful patterns and then convey results in a way that can be easily appreciated. Proteomic data is especially challenging because mass spectrometry detectors often miss peptides in complex samples, resulting in sparsely populated data sets. Using the R programming language and techniques from the field of pattern recognition, we have devised methods to resolve and evaluate clusters of proteins related by their pattern of expression in different samples in proteomic data sets. We examined tyrosine phosphoproteomic data from lung cancer samples. We calculated dissimilarities between the proteins based on Pearson or Spearman correlations and on Euclidean distances, whilst dealing with large amounts of missing data. The dissimilarities were then used as feature vectors in clustering and visualization algorithms. The quality of the clusterings and visualizations were evaluated internally based on the primary data and externally based on gene ontology and protein interaction networks. The results show that t-distributed stochastic neighbor embedding (t-SNE) followed by minimum spanning tree methods groups sparse proteomic data into meaningful clusters more effectively than other methods such as k-means and classical multidimensional scaling. Furthermore, our results show that using a combination of Spearman correlation and Euclidean distance as a dissimilarity representation increases the resolution of clusters. Our analyses show that many clusters contain one or more tyrosine kinases and include known effectors as well as proteins with no known interactions. Visualizing these clusters as networks elucidated previously unknown tyrosine kinase signal transduction pathways that drive cancer. 
    Our approach can be applied to other data types, and can be easily adopted because open source software packages are employed.

  20. Understanding the Influence of Parameter Value Uncertainty on Climate Model...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2 more
    zip
    Updated May 30, 2024
    Cite
    Sofia Ingersoll; Heather Childers; Sujan Bhattarai (2024). Understanding the Influence of Parameter Value Uncertainty on Climate Model Output: Developing an Interactive Web Dashboard [Dataset]. http://doi.org/10.5061/dryad.vq83bk422
    Available download formats: zip
    Dataset updated
    May 30, 2024
    Dataset provided by
    University of California, Santa Barbara
    Authors
    Sofia Ingersoll; Heather Childers; Sujan Bhattarai
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Scientists at the National Center for Atmospheric Research have recently carried out several experiments to better understand the uncertainties associated with future climate projections. In particular, the NCAR Climate and Global Dynamics Lab (CGDL) working group has completed a large Parameter Perturbation Experiment (PPE) utilizing the Community Land Model (CLM), testing the effects of 32 parameters over thousands of simulations over a range of 250 years. The CLM model experiment is focused on understanding uncertainty around biogeophysical parameters that influence the balance of chemical cycling and sequestration variables. The current website for displaying model results is not intuitive or informative to the broader scientific audience or the general public. The goal of this project is to develop an improved data visualization dashboard for communicating the results of the CLM PPE. The interactive dashboard would provide an interface where new or experienced users can query the experiment database to ask which environmental processes are affected by a given model parameter, or vice versa. Improving the accessibility of the data will allow professionals to use the most recent land parameter data when evaluating the impact of a policy or action on climate change.

    Methods

    Data Source:

    University of California, Santa Barbara – Climate and Global Dynamics Lab, National Center for Atmospheric Research: Parameter Perturbation Experiment (CGD NCAR PPE-5). https://webext.cgd.ucar.edu/I2000/PPEn11_OAAT/ (the only public version of the data currently accessible; the data leveraged in this project is stored on the NCAR server and is not publicly available). https://www.cgd.ucar.edu/events/seminar/2023/katie-dagon-and-daniel-kennedy-132940 (learn more about this complex data via this amazing presentation by Katie Dagon & Daniel Kennedy). The Parameter Perturbation Experiment data leveraged by our project was generated utilizing the Community Land Model v5 (CLM5) predictions. https://www.earthsystemgrid.org/dataset/ucar.cgd.ccsm4.CLM_LAND_ONLY.html

    Data Processing: We worked on NCAR’s CASPER HPC cluster, which gave us direct access to the raw data files. We created a script to read in 500 LHC PPE simulations as a data set, with inputs for a climate variable and time range. When the cluster of simulations is read in, a preprocess function performs dimensional reduction to simplify the data set for wrangling later. Once the data sets of interest were loaded, they needed some dimensional corrections, which are quirks that come with using CESM data. Our friends at NCAR CGDL provided us with the correction for the time-paring bug. Our team wrote the other functions, which weight each grid cell by land area, weight each month by its contribution to the number of days in a year, and calculate the global average of each simulation, to wrangle the data so that it is suitable for emulation. These files were saved so they could be reused later via a built-in if-else statement within the read_n_wrangle() function. The preprocessed data is then used in the GPR ML Emulator to make 100 predictions for a climate variable of interest and 32 individual parameters. To summarize briefly, our GPR emulator does 3 things:

    - Simplifies the LHC data so it can look at 1 parameter at a time and assess its relationship with a climate variable.
    - Applies Fourier Amplitude Sensitivity Analysis to identify relationships between parameters and climate variables. It helps us see what the key influencers are.
    - In the full chaotic LHC, it can assess the covariance of the parameter-parameter predictions simultaneously (this is the R^2 value you’ll see on your accuracy inset plot later).

    Additionally, it ‘pickles’ and saves the predictions and the trained gpr_model so they can be used for further analysis, exploration, and visualization. The linked notebook ("Attributes and structures defined in this notebook") outlines the workflow used to generate the data in this repo. It pulls functions from utils.py to execute the desired commands. Below we will look at the utils.py functions that are not explicitly defined in the notebook. General side note: if you decide to explore that notebook explaining how the data was made, you’ll notice you are taken to another repo in this organization, GaiaFuture. That’s our prototype playground! It’s a little messy because that’s where we spent the second half of this project tinkering. The official repository is https://github.com/GaiaFuture/CLM5_PPE_Emulator.
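The land-area and month-length weighting described in the Data Processing paragraph can be illustrated with a small standalone sketch. This is not the project's utils.py code. The function names, the flat list layout, and the no-leap calendar are assumptions made purely for illustration, and a real workflow on CESM output would more likely operate on gridded arrays.

```python
# Days per month for a no-leap calendar (assumed here for illustration).
DAYS_PER_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def area_weighted_mean(cell_values, land_areas):
    """Mean over grid cells, weighting each cell by its land area."""
    total_area = sum(land_areas)
    return sum(v * a for v, a in zip(cell_values, land_areas)) / total_area


def day_weighted_annual_mean(monthly_values):
    """Mean over 12 months, weighting each month by its number of days."""
    total_days = sum(DAYS_PER_MONTH)
    weighted = sum(v * d for v, d in zip(monthly_values, DAYS_PER_MONTH))
    return weighted / total_days


def global_annual_mean(monthly_grids, land_areas):
    """Global, annual mean of one simulation year.

    monthly_grids: 12 lists (one per month), each holding one value per grid cell.
    land_areas: land area of each grid cell, in the same cell order.
    """
    monthly_global = [area_weighted_mean(grid, land_areas) for grid in monthly_grids]
    return day_weighted_annual_mean(monthly_global)


# Hypothetical two-cell grid: the second cell has three times the land area.
land_areas = [1.0, 3.0]
monthly_grids = [[1.0, 3.0] for _ in range(12)]

print(global_annual_mean(monthly_grids, land_areas))
```

Weighting by days matters because a plain average of twelve monthly means gives February the same influence as July. Weighting by land area keeps small or mostly-ocean cells from counting as much as large land cells.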
