41 datasets found

Frontiers of Data Visualization Workshop II: Data Wrangling Workshop Summary...
catalog.data.gov
Updated May 14, 2025
Cite
NCO NITRD (2025). Frontiers of Data Visualization Workshop II: Data Wrangling Workshop Summary [Dataset]. https://catalog.data.gov/dataset/frontiers-of-data-visualization-workshop-ii-data-wrangling-workshop-summary
Dataset updated
May 14, 2025
Dataset provided by
NCO NITRD
Description
The Data Visualization Workshop II: Data Wrangling was a web-based event held on October 18, 2017. This workshop report summarizes the individual perspectives of a group of visualization experts from the public, private, and academic sectors who met online to discuss how to improve the creation and use of high-quality visualizations. The specific focus of this workshop was on the complexities of "data wrangling". Data wrangling includes finding the appropriate data sources that are both accessible and usable and then shaping and combining that data to facilitate the most accurate and meaningful analysis possible. The workshop was organized as a 3-hour web event and moderated by the members of the Human Computer Interaction and Information Management Task Force of the Networking and Information Technology Research and Development Program's Big Data Interagency Working Group. Report prepared by the Human Computer Interaction And Information Management Task Force, Big Data Interagency Working Group, Networking & Information Technology Research & Development Subcommittee, Committee On Technology Of The National Science & Technology Council...
Prosper loan data.
kaggle.com
zip
Updated Jun 7, 2021
Cite
Shikhar Sharma (2021). Prosper loan data. [Dataset]. https://www.kaggle.com/shikhar07/prosper-loan-data
Available download formats: zip (23591647 bytes)
Dataset updated
Jun 7, 2021
Authors
Shikhar Sharma
Description
Context

Loan Data from Prosper.

Content

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. This data dictionary explains the variables in the data set.
Enriched NYTimes COVID19 U.S. County Dataset
kaggle.com
zip
Updated Jun 14, 2020
Cite
ringhilterra17 (2020). Enriched NYTimes COVID19 U.S. County Dataset [Dataset]. https://www.kaggle.com/ringhilterra17/enrichednytimescovid19
Available download formats: zip (11291611 bytes)
Dataset updated
Jun 14, 2020
Authors
ringhilterra17
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Area covered
United States
Description
Overview and Inspiration

I wanted to make some geospatial visualizations to convey the current severity of COVID19 in different parts of the U.S.

I liked the NYTimes COVID dataset, but it lacked county boundary shape data, population per county, new cases / deaths per day, per capita calculations, and county demographics.

After a lot of work tracking down the different data sources I wanted and doing all of the data wrangling and joins in Python, I wanted to open-source the final enriched data set in order to give others a head start in their COVID-19 related analytic, modeling, and visualization efforts.

This dataset is enriched with county shapes, county center point coordinates, 2019 census population estimates, county population densities, cases and deaths per capita, and calculated per day cases / deaths metrics. It contains daily data per county back to January, allowing for analyzing changes over time.
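
The per-day and per-capita metrics described above can be derived from cumulative counts roughly as follows. This is a minimal plain-Python sketch, not the author's actual code: the field names `date`, `fips`, and `cases`, the per-100,000 scaling, and the population figure are assumptions made for illustration.

```python
from collections import defaultdict

def add_daily_and_per_capita(rows, population):
    """Add new-cases-per-day and cases-per-100k to cumulative county rows.

    rows: dicts with 'date' (ISO string), 'fips', and cumulative 'cases'.
    population: dict mapping fips -> population estimate.
    """
    previous = defaultdict(int)  # last cumulative count seen per county
    enriched = []
    # ISO dates sort correctly as strings, so this orders each county by day.
    for row in sorted(rows, key=lambda r: (r["fips"], r["date"])):
        fips = row["fips"]
        new_cases = row["cases"] - previous[fips]
        previous[fips] = row["cases"]
        pop = population.get(fips)
        per_100k = row["cases"] / pop * 100_000 if pop else None
        enriched.append({**row, "new_cases": new_cases, "cases_per_100k": per_100k})
    return enriched

# Hypothetical sample data, for illustration only.
rows = [
    {"date": "2020-03-02", "fips": "53033", "cases": 14},
    {"date": "2020-03-01", "fips": "53033", "cases": 10},
    {"date": "2020-03-03", "fips": "53033", "cases": 21},
]
population = {"53033": 2_000_000}  # made-up population figure
enriched = add_daily_and_per_capita(rows, population)
```

Sorting by county and date first keeps the difference between consecutive rows meaningful even when the input arrives unordered.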

UPDATE: I have also included demographic information per county, including ages, races, and gender breakdown. This could help determine which counties are most susceptible to an outbreak.

How this data can be used

Geospatial analysis and visualization:
- Which counties are currently getting hit the hardest (per capita and totals)?
- What patterns are there in the spread of the virus across counties? (network-based spread simulations using county center lat / lons)
- Do county population densities play a role in how quickly the virus spreads?
- How do a specific county's or state's cases and deaths compare to those of other counties or states?

Join with other county-level datasets easily (with the fips code column).

Content Details

See the column descriptions for more details on the dataset

Visualizations and Analysis Examples

COVID-19 U.S. Time-lapse: Confirmed Cases per County (per capita)

https://github.com/ringhilterra/enriched-covid19-data/blob/master/example_viz/covid-cases-final-04-06.gif?raw=true

Other Data Notes

Please review nytimes README for detailed notes on Covid-19 data - https://github.com/nytimes/covid-19-data/

The only update I made with regard to 'Geographic Exceptions' is that I took the 'New York City' county provided in the Covid-19 data, which has all cases for the five boroughs of New York City (New York, Kings, Queens, Bronx and Richmond counties), and replaced the missing FIPS for those rows with the 'New York County' FIPS code 36061. That way I could join to a geometry. I then used the sum of those five boroughs' population estimates for the 'New York City' estimate, which allowed me to calculate 'per capita' metrics for 'New York City' entries in the Covid-19 dataset.
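
The New York City adjustment described above might look something like the sketch below. It is an illustration under stated assumptions rather than the author's code: the field names `county`, `fips`, and `population` are hypothetical, and the population values are placeholders, not census figures.

```python
NYC_FIPS = "36061"  # New York County code, as given in the note above
# New York, Kings, Queens, Bronx, and Richmond counties.
BOROUGH_FIPS = ["36061", "36047", "36081", "36005", "36085"]

def patch_nyc(rows, population):
    """Fill the missing FIPS on 'New York City' rows and attach a combined population.

    rows: dicts with 'county' and 'fips' (empty or None when missing).
    population: dict mapping fips -> population estimate.
    """
    nyc_population = sum(population[f] for f in BOROUGH_FIPS)
    patched = []
    for row in rows:
        if row["county"] == "New York City" and not row["fips"]:
            # Copy the row instead of mutating the caller's data.
            row = {**row, "fips": NYC_FIPS, "population": nyc_population}
        patched.append(row)
    return patched

# Placeholder populations (100 per borough), for illustration only.
population = {fips: 100 for fips in BOROUGH_FIPS}
rows = [
    {"county": "New York City", "fips": "", "cases": 500},
    {"county": "Albany", "fips": "36001", "cases": 20},
]
patched = patch_nyc(rows, population)
```

Rows that already carry a FIPS code pass through unchanged, so the patch can be applied to the whole table.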

Acknowledgements

Special thanks to NYTimes for all of their hard work gathering and consolidating all of the U.S. COVID19 related data on a daily basis. Their git repo: https://github.com/nytimes/covid-19-data/

Also, thanks to ykzeng for the county population density estimates: https://github.com/ykzeng/covid-19/tree/master/data-
Data Prep Report
marketresearchforecast.com
doc, pdf, ppt
Updated Jun 23, 2025
Cite
Market Research Forecast (2025). Data Prep Report [Dataset]. https://www.marketresearchforecast.com/reports/data-prep-547253
Available download formats: pdf, doc, ppt
Dataset updated
Jun 23, 2025
Dataset authored and provided by
Market Research Forecast
License
https://www.marketresearchforecast.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The Data Prep market is booming, projected to reach $12 Billion by 2033 with a 13.7% CAGR. Discover key trends, leading companies (Alteryx, Informatica, IBM), and regional insights in this comprehensive market analysis. Learn how self-service tools and cloud solutions are transforming data preparation.
Airbnb-NYC-Cleaned
kaggle.com
zip
Updated Aug 25, 2022
Cite
sandeep majumdar (2022). Airbnb-NYC-Cleaned [Dataset]. https://www.kaggle.com/sandeepmajumdar/airbnbnyccleaned
Available download formats: zip (7294486 bytes)
Dataset updated
Aug 25, 2022
Authors
sandeep majumdar
License
http://opendatacommons.org/licenses/dbcl/1.0/
Area covered
New York
Description
IF YOU WANT TO START WITH DATA VISUALIZATION DIRECTLY, USE THIS DATASET

But if you want to start with data cleaning, find the original dataset below: This is a cleaned version of the Airbnb open data found at the following link: https://www.kaggle.com/datasets/arianazmoudeh/airbnbopendata

The original message from Arian: "This dataset is part of Airbnb Inside but I tried to make new columns and many data inconsistency issue to create a new dataset to practice data cleaning. The original source can be found here http://insideairbnb.com/explore/

Arian Azmoudeh"
Drought Machine Learning Data Example
hydroshare.org
beta.hydroshare.org
zip
Updated Aug 22, 2023
Cite
Bryce Pulver (2023). Drought Machine Learning Data Example [Dataset]. https://www.hydroshare.org/resource/9024db8a67fd4afdab2358d1b75e7e85
Available download formats: zip (518.5 MB)
Dataset updated
Aug 22, 2023
Dataset provided by
HydroShare
Authors
Bryce Pulver
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1980 - Dec 1, 2020
Description
This repository showcases some examples of data wrangling and visualization using output from a USGS drought prediction model for the Colorado River Basin and example ecology site data.

Global Data Wrangling Market Research Report: By Application (Data...

wiseguyreports.com
Updated Jan 1, 2025
Cite
(2025). Global Data Wrangling Market Research Report: By Application (Data Preparation, Data Cleaning, Data Integration, Data Transformation), By Deployment Mode (Cloud-Based, On-Premises, Hybrid), By End User (BFSI, Healthcare, Retail, Telecommunications), By Tool Type (Self-Service Tools, Data Analytics Tools, Data Visualization Tools) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/de/reports/data-wrangling-market

Dataset updated
Jan 1, 2025
License
https://www.wiseguyreports.com/pages/privacy-policy
Time period covered
Sep 25, 2025
Area covered
Global
Description
BASE YEAR	2024
HISTORICAL DATA	2019 - 2023
REGIONS COVERED	North America, Europe, APAC, South America, MEA
REPORT COVERAGE	Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
MARKET SIZE 2024	5.18(USD Billion)
MARKET SIZE 2025	5.7(USD Billion)
MARKET SIZE 2035	15.0(USD Billion)
SEGMENTS COVERED	Application, Deployment Mode, End User, Tool Type, Regional
COUNTRIES COVERED	US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
KEY MARKET DYNAMICS	Increasing data volume, Need for data quality, Rising adoption of AI, Growing demand for analytics, Cost-effective solutions
MARKET FORECAST UNITS	USD Billion
KEY COMPANIES PROFILED	Microsoft, Apache Software Foundation, Google, Pentaho, IBM, Talend, RapidMiner, SAS Institute, Informatica, Deloitte, TIBCO Software, Crimson Hexagon, Oracle, Trifacta, DataRobot, Alteryx
MARKET FORECAST PERIOD	2025 - 2035
KEY MARKET OPPORTUNITIES	Increased demand for big data, Growing need for real-time analytics, Rise of AI-driven data solutions, Expansion in cloud-based services, Emerging trends in data privacy compliance
COMPOUND ANNUAL GROWTH RATE (CAGR)	10.1% (2025 - 2035)

Data from: Designing data science workshops for data-intensive environmental...
data.niaid.nih.gov
datasetcatalog.nlm.nih.gov
zip
Updated Dec 8, 2020
Cite
Allison Theobold; Stacey Hancock; Sara Mannheimer (2020). Designing data science workshops for data-intensive environmental science research [Dataset]. http://doi.org/10.5061/dryad.7wm37pvp7
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.7wm37pvp7
Dataset updated
Dec 8, 2020
Dataset provided by
Montana State University
California State Polytechnic University
Authors
Allison Theobold; Stacey Hancock; Sara Mannheimer
License
https://spdx.org/licenses/CC0-1.0.html
Description
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.

Methods: Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.

The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files. The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw. The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey. The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean. The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
NiftyOptionChainDataset
kaggle.com
zip
Updated May 13, 2021
Cite
Nikhil Ulahannan (2021). NiftyOptionChainDataset [Dataset]. https://www.kaggle.com/nikhilulahannan/niftyoptionchaindataset
Available download formats: zip (31131 bytes)
Dataset updated
May 13, 2021
Authors
Nikhil Ulahannan
Description
Context

Option chain data is the product of complex calculations, yet it is unorganised because of its inherently non-uniform data relevance structure, which makes it harder to use for data analytics.

Content

The dataset contains raw option chain data (calls, puts, IV, etc.) for 3 adjacent weeks in May 2021. An additional data file with minor modifications (clean-sample) is included for better use of the data explorer features.

Acknowledgements

National Stock Exchange (NSE) website.

Inspiration

- Develop a code framework for data cleaning, wrangling, and visualization of option chain data.
- Exploratory Data Analysis (EDA).
- Analyse the evolution of option premiums, IV, etc. and its impact over a month.
- Insights for better straddles and strangles (option strategies).
Netflix
kaggle.com
zip
Updated Jul 29, 2025
Cite
Prasanna@82 (2025). Netflix [Dataset]. https://www.kaggle.com/datasets/prasanna82/netflix/code
Available download formats: zip (1400865 bytes)
Dataset updated
Jul 29, 2025
Authors
Prasanna@82
Description
Netflix Dataset Exploration and Visualization

This project involves an in-depth analysis of the Netflix dataset to uncover key trends and patterns in the streaming platform’s content offerings. Using Python libraries such as Pandas, NumPy, and Matplotlib, this notebook visualizes and interprets critical insights from the data.

Objectives:

Analyze the distribution of content types (Movies vs. TV Shows)

Identify the most prolific countries producing Netflix content

Study the ratings and duration of shows

Handle missing values using techniques like interpolation, forward-fill, and custom replacements

Enhance readability with bar charts, horizontal plots, and annotated visuals

Key Visualizations:

Bar charts for type distribution and country-wise contributions

Handling missing data in rating, duration, and date_added

Annotated plots showing values for clarity

Tools Used:

Python 3

Pandas for data wrangling

Matplotlib for visualizations

Jupyter Notebook for hands-on analysis

Outcome: This project provides a clear view of Netflix's content library, helping data enthusiasts and beginners understand how to process, clean, and visualize real-world datasets effectively.

Feel free to fork, adapt, and extend the work.
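
The missing-value techniques named in the objectives (forward-fill and interpolation) can be sketched as below. The notebook itself uses Pandas; this plain-Python version is only an illustration of the two ideas, and the function names and sample values are hypothetical.

```python
def forward_fill(values, default=None):
    """Replace each None with the most recent non-missing value."""
    filled, last = [], default
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def interpolate(values):
    """Linearly fill runs of None that sit between two known numbers."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result

# Hypothetical columns, for illustration only.
ratings = forward_fill(["TV-MA", None, "PG", None])
durations = interpolate([90, None, None, 120])
```

Forward-fill suits categorical columns such as a rating, while interpolation only makes sense for numeric columns with a meaningful ordering. Leading or trailing gaps are left as `None` by `interpolate`, so they would still need a custom replacement.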
Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America...
technavio.com
pdf
Updated Jan 11, 2025
Cite
Technavio (2025). Data Analytics Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, and UK), Middle East and Africa (UAE), APAC (China, India, Japan, and South Korea), South America (Brazil), and Rest of World (ROW) [Dataset]. https://www.technavio.com/report/data-analytics-market-industry-analysis
Available download formats: pdf
Dataset updated
Jan 11, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Description

Data Analytics Market Size 2025-2029

The data analytics market size is forecast to increase by USD 288.7 billion, at a CAGR of 14.7% between 2024 and 2029.

The market is driven by the extensive use of modern technology in company operations, enabling businesses to extract valuable insights from their data. The prevalence of the Internet and the increased use of linked and integrated technologies have facilitated the collection and analysis of vast amounts of data from various sources. This trend is expected to continue as companies seek to gain a competitive edge by making data-driven decisions. However, the integration of data from different sources poses significant challenges. Ensuring data accuracy, consistency, and security is crucial as companies deal with large volumes of data from various internal and external sources. Additionally, the complexity of data analytics tools and the need for specialized skills can hinder adoption, particularly for smaller organizations with limited resources. Companies must address these challenges by investing in robust data management systems, implementing rigorous data validation processes, and providing training and development opportunities for their employees. By doing so, they can effectively harness the power of data analytics to drive growth and improve operational efficiency.

What will be the Size of the Data Analytics Market during the forecast period?

Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.

In the dynamic and ever-evolving market, entities such as explainable AI, time series analysis, data integration, data lakes, algorithm selection, feature engineering, marketing analytics, computer vision, data visualization, financial modeling, real-time analytics, data mining tools, and KPI dashboards continue to unfold and intertwine, shaping the industry's landscape. The application of these technologies spans various sectors, from risk management and fraud detection to conversion rate optimization and social media analytics. ETL processes, data warehousing, statistical software, data wrangling, and data storytelling are integral components of the data analytics ecosystem, enabling organizations to extract insights from their data. Cloud computing, deep learning, and data visualization tools further enhance the capabilities of data analytics platforms, allowing for advanced data-driven decision making and real-time analysis. Marketing analytics, clustering algorithms, and customer segmentation are essential for businesses seeking to optimize their marketing strategies and gain a competitive edge. Regression analysis, data visualization tools, and machine learning algorithms are instrumental in uncovering hidden patterns and trends, while predictive modeling and causal inference help organizations anticipate future outcomes and make informed decisions. Data governance, data quality, and bias detection are crucial aspects of the data analytics process, ensuring the accuracy, security, and ethical use of data. Supply chain analytics, healthcare analytics, and financial modeling are just a few examples of the diverse applications of data analytics, demonstrating the industry's far-reaching impact. Data pipelines, data mining, and model monitoring are essential for maintaining the continuous flow of data and ensuring the accuracy and reliability of analytics models. The integration of various data analytics tools and techniques continues to evolve, as the industry adapts to the ever-changing needs of businesses and consumers alike.

How is this Data Analytics Industry segmented?

The data analytics industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
- Component: Services, Software, Hardware
- Deployment: Cloud, On-premises
- Type: Prescriptive Analytics, Predictive Analytics, Customer Analytics, Descriptive Analytics, Others
- Application: Supply Chain Management, Enterprise Resource Planning, Database Management, Human Resource Management, Others
- Geography: North America (US, Canada), Europe (France, Germany, UK), Middle East and Africa (UAE), APAC (China, India, Japan, South Korea), South America (Brazil), Rest of World (ROW)

By Component Insights

The services segment is estimated to witness significant growth during the forecast period. The market is experiencing significant growth as businesses increasingly rely on advanced technologies to gain insights from their data. Natural language processing is a key component of this trend, enabling more sophisticated analysis of unstructured data. Fraud detection and data security solutions are also in high demand, as companies seek to protect against threats and maintain customer trust. Data analytics platforms, including cloud-based offerings, are driving innovatio
AI-Enabled Testing Tools Market By technology (natural language processing...
zionmarketresearch.com
pdf
Updated Nov 14, 2025
Cite
Zion Market Research (2025). AI-Enabled Testing Tools Market By technology (natural language processing (NLP), machine learning & pattern recognition, and computer vision and image processing), By solution (services, which include professional services & managed services, and AI-based tools which are reduction & feature selection, data pre-processing & wrangling, data visualization, and others), By application (efficiency and time-to-market, further categorized into test automation, data analytics, and infrastructure optimization, agility & coverage) And By Region: - Global And Regional Industry Overview, Market Intelligence, Comprehensive Analysis, Historical Data, And Forecasts, 2024-2032 [Dataset]. https://www.zionmarketresearch.com/report/ai-enabled-testing-tools-market
Available download formats: pdf
Dataset updated
Nov 14, 2025
Dataset authored and provided by
Zion Market Research
License
https://www.zionmarketresearch.com/privacy-policy
Time period covered
2022 - 2030
Area covered
Global
Description
The global AI-Enabled Testing Tools Market was valued at USD 437.56 million in 2023 and is projected to reach USD 1,693.95 million by 2032, at a CAGR of 16.23%.
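
The growth rate quoted above can be checked against the two market sizes with the standard compound annual growth rate formula. The short sketch below assumes nine compounding years (2023 to 2032); the function name is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end_value / start_value) ** (1 / years) - 1

# USD 437.56 million in 2023 growing to USD 1,693.95 million in 2032.
rate = cagr(437.56, 1693.95, 2032 - 2023)
print(f"Implied CAGR: {rate:.2%}")
```

Under that assumption the implied rate comes out at roughly 16.2%, which is consistent with the figure stated in the description.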
Replication Data for: dxpr: An R package for generating analysis-ready data...
dataverse.lib.nycu.edu.tw
bin, png +1
Updated Jun 22, 2022
Cite
NYCU Dataverse (2022). Replication Data for: dxpr: An R package for generating analysis-ready data from electronic health records—diagnoses and procedures. [Dataset]. http://doi.org/10.57770/ZRNVCN
Available download formats: png (7908), bin (11118), png (6980), bin (5446), text/markdown (25651), png (8091), text/markdown (11422), text/markdown (172)
Unique identifier
https://doi.org/10.57770/ZRNVCN
Dataset updated
Jun 22, 2022
Dataset provided by
NYCU Dataverse
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Enriched electronic health records (EHRs) contain crucial information related to disease progression, and this information can help with decision-making in the health care field. Data analytics in health care is deemed as one of the essential processes that help accelerate the progress of clinical research. However, processing and analyzing EHR data are common bottlenecks in health care data analytics. The dxpr R package provides mechanisms for integration, wrangling, and visualization of clinical data, including diagnosis and procedure records. First, the dxpr package helps users transform International Classification of Diseases (ICD) codes to a uniform format. After code format transformation, the dxpr package supports four strategies for grouping clinical diagnostic data. For clinical procedure data, two grouping methods can be chosen. After EHRs are integrated, users can employ a set of flexible built-in querying functions for dividing data into case and control groups by using specified criteria and splitting the data into before and after an event based on the record date. Subsequently, the structure of integrated long data can be converted into wide, analysis-ready data that are suitable for statistical analysis and visualization. We conducted comorbidity data processes based on a cohort of newborns from Medical Information Mart for Intensive Care-III (n = 7,833) by using the dxpr package. We first defined patent ductus arteriosus (PDA) cases as patients who had at least one PDA diagnosis (ICD, Ninth Revision, Clinical Modification [ICD-9-CM] 7470*). Controls were defined as patients who never had PDA diagnosis. In total, 381 and 7,452 patients with and without PDA, respectively, were included in our study population. Then, we grouped the diagnoses into defined comorbidities. 
Finally, we observed a statistically significant difference in 8 of the 16 comorbidities among patients with and without PDA, including fluid and electrolyte disorders, valvular disease, and others.
Cyclistic_Divvy_data
kaggle.com
zip
Updated Jun 11, 2023
Cite
Rami Ghaith (2023). Cyclistic_Divvy_data [Dataset]. https://www.kaggle.com/datasets/ramighaith/cyclistic-divvy-data
Available download formats: zip (21440758 bytes)
Dataset updated
Jun 11, 2023
Authors
Rami Ghaith
Description
The following data shows riding information for members vs. casual riders at the company Cyclistic (a made-up name). This dataset is used as a case study for the Google Data Analytics Certificate.

The Changes Done to the Data in Excel:
- Removed all duplicates (none were found).
- Added a ride_length column by subtracting started_at from ended_at using the formula "=C2-B2", then formatted that column as a Time (37:30:55).
- Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to display the day the ride took place on, from 1 = Sunday through 7 = Saturday.
- Some cells display as ########. That data was left unchanged; it simply represents negative values and should be treated as 0.
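
For readers working outside Excel, the two formulas above have a close Python equivalent. This is a minimal sketch, not part of the original case study; the timestamp format and the function name are assumptions.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of started_at / ended_at

def ride_features(started_at, ended_at):
    """Return (ride_length, day_of_week) for one ride.

    ride_length mirrors "=C2-B2" (ended_at minus started_at).
    day_of_week mirrors "=WEEKDAY(B2,1)": 1 = Sunday through 7 = Saturday.
    """
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(ended_at, TIMESTAMP_FORMAT)
    ride_length = end - start  # a timedelta; negative if the end precedes the start
    # isoweekday() gives Monday=1 .. Sunday=7; shift so that Sunday becomes 1.
    day_of_week = start.isoweekday() % 7 + 1
    return ride_length, day_of_week

# June 11, 2023 was a Sunday.
length, day = ride_features("2023-06-11 08:00:00", "2023-06-11 08:25:30")
```

Unlike the ######## cells in Excel, a negative duration here shows up as a negative `timedelta`, which makes such rows easy to filter or clamp to zero explicitly.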

Processing the Data in RStudio:
- Installed required packages such as tidyverse for data import and wrangling, lubridate for date functions, and ggplot2 for visualization.
- Step 1: I read the csv files into R to collect the data.
- Step 2: Made sure the data all contained the same column names, because I wanted to merge them into one.
- Step 3: Renamed all column names to make sure they align, then merged them into one combined data set.
- Step 4: More data cleaning and analyzing.
- Step 5: Once my data was cleaned and clearly telling a story, I began to visualize it. The visualizations done can be seen below.
AI-Based Testing Tools Market by Technology (Natural Language...
zionmarketresearch.com
pdf
Updated Nov 22, 2025
Cite
Zion Market Research (2025). AI-Based Testing Tools Market by Technology (Natural Language Processing (NLP), Machine Learning and Pattern Recognition, Computer Vision and Image Processing), by Solution (Services, Which Include Professional Services and Managed Services, and AI-Based Tools, Which Are Feature Reduction and Selection, Data Pre-Processing and Wrangling, Data Visualization, and Others), by Application (Efficiency and Time-to-Market, Categorized into Test Automation, Data Analytics, and Infrastructure Optimization, Agility and Coverage) and by Region: Global and Regional Industry Overview, Market Intelligence, Comprehensive Analysis, Historical Data, and Forecasts, 2024-2032 [Dataset]. https://www.zionmarketresearch.com/fr/report/ai-enabled-testing-tools-market
Available download formats: pdf
Dataset updated
Nov 22, 2025
Dataset authored and provided by
Zion Market Research
License
https://www.zionmarketresearch.com/privacy-policy
Time period covered
2022 - 2030
Area covered
Global
Description
The global AI-based testing tools market was valued at USD 437.56 million in 2023 and is expected to reach USD 1,693.95 million by 2032, at a CAGR of 16.23%.
WeRateDogs Data Analysis
kaggle.com
zip
Updated Aug 12, 2025
Cite
Amr Tamer (2025). WeRateDogs Data Analysis [Dataset]. https://www.kaggle.com/datasets/amrtmansour/weratedogs-data-analysis/versions/4
Available download formats: zip (556230 bytes)
Dataset updated
Aug 12, 2025
Authors
Amr Tamer
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
This dataset contains tweet data from the popular Twitter account WeRateDogs (@dog_rates), known for humorously rating dogs with numerators greater than 10 ("they're good dogs Brent"). The archive includes 5000+ tweets as they stood on August 1, 2017, and is the basis for a full data wrangling, analysis, and visualization project.

The dataset was originally provided to Udacity students for the Data Wrangling project, and I am sharing it here to enable others to practice gathering, assessing, cleaning, and analyzing real-world social media data.
Market for AI-Supported Testing Tools by Technology (Natural Language...
zionmarketresearch.com
pdf
Updated Nov 10, 2025
Cite
Zion Market Research (2025). Markt für KI-gestützte Testtools nach Technologie (Natural Language Processing (NLP), maschinelles Lernen und Mustererkennung sowie Computer Vision und Bildverarbeitung), nach Lösung (Dienste, darunter professionelle Dienste und Managed Services sowie KI-basierte Tools wie Datenreduzierung und Merkmalsauswahl, Datenvorverarbeitung und -aufbereitung, Datenvisualisierung und andere), nach Anwendung (Effizienz und Markteinführungszeit, weiter kategorisiert in Testautomatisierung, Datenanalyse und Infrastrukturoptimierung, Agilität und Abdeckung) und nach Region: – Globaler und regionaler Branchenüberblick, Marktinformationen, umfassende Analysen, historische Daten und Prognosen, 2024–2032 [Dataset]. https://www.zionmarketresearch.com/de/report/ai-enabled-testing-tools-market
Explore at:
Available download formats: pdf
Dataset updated
Nov 10, 2025
Dataset authored and provided by
Zion Market Research
License
https://www.zionmarketresearch.com/privacy-policy
Time period covered
2022 - 2030
Area covered
Global
Description
The global market for AI-enabled testing tools was valued at USD 437.56 million in 2023 and is projected to reach USD 1693.95 million by 2032, at a compound annual growth rate (CAGR) of 16.23%.
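The growth rate quoted in this description can be cross-checked against its two market values. The snippet below is a minimal sketch, not the publisher's method: it assumes the rate is a compound annual growth rate and that the 2023 and 2032 figures are nine compounding periods apart. The function and variable names are illustrative only.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` periods apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the description, in millions of US dollars.
value_2023 = 437.56
value_2032 = 1693.95
years = 2032 - 2023  # assumption: nine compounding periods

rate = cagr(value_2023, value_2032, years)
print(f"Implied CAGR: {rate:.2%}")
```

Under these assumptions the implied rate lands very close to the quoted 16.23%, so the three numbers in the description are mutually consistent.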
Airbnb Las Vegas Listings 🏠
kaggle.com
zip
Updated Feb 23, 2024
Cite
Kanchana1990 (2024). Airbnb Las Vegas Listings 🏠 [Dataset]. https://www.kaggle.com/datasets/kanchana1990/airbnb-las-vegas-listings/discussion
Explore at:
Available download formats: zip (70030 bytes)
Dataset updated
Feb 23, 2024
Authors
Kanchana1990
License
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
Area covered
Las Vegas
Description
Airbnb Las Vegas Listings 🏠

Overview: Welcome to our cozy corner of data, featuring a curated selection of Airbnb listings from the vibrant city of Las Vegas! Dive into the unique stays Vegas has to offer, from luxurious condos to private rooms that promise an unforgettable stay.

Data Science Applications: This dataset is your playground for various data science projects. Whether you're predicting prices, analyzing guest preferences, or exploring the impact of locations on ratings, there's something here for everyone. It's perfect for those looking to practice their data wrangling, visualization, and machine learning skills in a real-world context. The price column is null in this dataset, so handling it can also serve as a data cleaning exercise.

Column Descriptors:
- roomType: Discover the type of accommodation.
- stars: Check out the guest ratings.
- address: Know where you'll be staying.
- numberOfGuests: Find out the guest capacity.
- primaryHost/smartName: Get to know your host.
- price: Peek at the listing prices. (Note: Some data may be missing here, so creativity in handling this could be a fun challenge!)
- firstReviewComments: Read what the first guests had to say.
- firstReviewRating: See how the first guests rated their stay.

Ethically Mined Data: We're committed to ethical data practices. This dataset has been carefully compiled, respecting privacy and data sharing norms. It's all about fostering learning and innovation, without stepping over any lines.

A Big Thank You: We extend our heartfelt gratitude to Airbnb and the platforms that share data openly, making projects like this possible. Their commitment to community and openness enriches the data science world.

Dive in, explore, and let the data spark your curiosity and creativity! Happy analyzing! 🌟
Wrangling Phosphoproteomic Data to Elucidate Cancer Signaling Pathways
plos.figshare.com
pdf
Updated May 31, 2023
Cite
Mark L. Grimes; Wan-Jui Lee; Laurens van der Maaten; Paul Shannon (2023). Wrangling Phosphoproteomic Data to Elucidate Cancer Signaling Pathways [Dataset]. http://doi.org/10.1371/journal.pone.0052884
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0052884
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Mark L. Grimes; Wan-Jui Lee; Laurens van der Maaten; Paul Shannon
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The interpretation of biological data sets is essential for generating hypotheses that guide research, yet modern methods of global analysis challenge our ability to discern meaningful patterns and then convey results in a way that can be easily appreciated. Proteomic data is especially challenging because mass spectrometry detectors often miss peptides in complex samples, resulting in sparsely populated data sets. Using the R programming language and techniques from the field of pattern recognition, we have devised methods to resolve and evaluate clusters of proteins related by their pattern of expression in different samples in proteomic data sets. We examined tyrosine phosphoproteomic data from lung cancer samples. We calculated dissimilarities between the proteins based on Pearson or Spearman correlations and on Euclidean distances, whilst dealing with large amounts of missing data. The dissimilarities were then used as feature vectors in clustering and visualization algorithms. The quality of the clusterings and visualizations were evaluated internally based on the primary data and externally based on gene ontology and protein interaction networks. The results show that t-distributed stochastic neighbor embedding (t-SNE) followed by minimum spanning tree methods groups sparse proteomic data into meaningful clusters more effectively than other methods such as k-means and classical multidimensional scaling. Furthermore, our results show that using a combination of Spearman correlation and Euclidean distance as a dissimilarity representation increases the resolution of clusters. Our analyses show that many clusters contain one or more tyrosine kinases and include known effectors as well as proteins with no known interactions. Visualizing these clusters as networks elucidated previously unknown tyrosine kinase signal transduction pathways that drive cancer. 
Our approach can be applied to other data types, and can be easily adopted because open source software packages are employed.
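The abstract describes computing dissimilarities from Spearman correlations and Euclidean distances while coping with missing values. The following is a minimal Python sketch of that idea, not the authors' R code. It assumes pairwise-complete comparison (only columns observed in both rows are used), a dissimilarity of 1 minus the rank correlation, and a Euclidean distance rescaled by the fraction of shared columns. All function names, the `min_shared` threshold, and the toy data are hypothetical.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman_dissimilarity(a, b, min_shared=3):
    """1 - Spearman rho over columns observed in both rows (NaN if too few)."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if shared.sum() < min_shared:
        return np.nan
    ra, rb = average_ranks(a[shared]), average_ranks(b[shared])
    if ra.std() == 0 or rb.std() == 0:
        return np.nan  # correlation is undefined for constant ranks
    return 1.0 - np.corrcoef(ra, rb)[0, 1]


def euclidean_dissimilarity(a, b):
    """Euclidean distance over shared columns, rescaled for missing ones."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return np.nan
    squared = np.sum((a[shared] - b[shared]) ** 2)
    return np.sqrt(squared * len(a) / shared.sum())


def dissimilarity_matrix(data, metric):
    """Symmetric matrix of pairwise dissimilarities between rows."""
    n = data.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = out[j, i] = metric(data[i], data[j])
    return out


# Toy data: rows are proteins, columns are samples, NaN marks a missed peptide.
data = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, np.nan, 1.0],
])
spearman_d = dissimilarity_matrix(data, spearman_dissimilarity)
euclid_d = dissimilarity_matrix(data, euclidean_dissimilarity)
print(spearman_d)
print(euclid_d)
```

Each row of the resulting matrix can then serve as a feature vector for a clustering or embedding step, which is the role the abstract assigns to the dissimilarities.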
Understanding the Influence of Parameter Value Uncertainty on Climate Model...
data.niaid.nih.gov
search.dataone.org
+2 more
zip
Updated May 30, 2024
Cite
Sofia Ingersoll; Heather Childers; Sujan Bhattarai (2024). Understanding the Influence of Parameter Value Uncertainty on Climate Model Output: Developing an Interactive Web Dashboard [Dataset]. http://doi.org/10.5061/dryad.vq83bk422
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.vq83bk422
Dataset updated
May 30, 2024
Dataset provided by
University of California, Santa Barbara
Authors
Sofia Ingersoll; Heather Childers; Sujan Bhattarai
License
https://spdx.org/licenses/CC0-1.0.html
Description
Scientists at the National Center for Atmospheric Research have recently carried out several experiments to better understand the uncertainties associated with future climate projections. In particular, the NCAR Climate and Global Dynamics Lab (CGDL) working group has completed a large Parameter Perturbation Experiment (PPE) utilizing the Community Land Model (CLM), testing the effects of 32 parameters over thousands of simulations over a range of 250 years. The CLM model experiment is focused on understanding uncertainty around biogeophysical parameters that influence the balance of chemical cycling and sequestration variables. The current website for displaying model results is not intuitive or informative to the broader scientific audience or the general public. The goal of this project is to develop an improved data visualization dashboard for communicating the results of the CLM PPE. The interactive dashboard would provide an interface where new or experienced users can query the experiment database to ask which environmental processes are affected by a given model parameter, or vice versa. Improving the accessibility of the data will allow professionals to use the most recent land parameter data when evaluating the impact of a policy or action on climate change.

Methods

Data Source:

University of California, Santa Barbara – Climate and Global Dynamics Lab, National Center for Atmospheric Research: Parameter Perturbation Experiment (CGD NCAR PPE-5). https://webext.cgd.ucar.edu/I2000/PPEn11_OAAT/ (This is the only version of the data that is currently publicly accessible. The data used in this project is stored on the NCAR server and is not publicly available.) To learn more about this complex data, see the presentation by Katie Dagon and Daniel Kennedy (https://www.cgd.ucar.edu/events/seminar/2023/katie-dagon-and-daniel-kennedy-132940). The Parameter Perturbation Experiment data used in our project was generated from Community Land Model v5 (CLM5) predictions. https://www.earthsystemgrid.org/dataset/ucar.cgd.ccsm4.CLM_LAND_ONLY.html

Data Processing: We worked inside NCAR’s CASPER cluster HPC server, which gave us direct access to the raw data files. We created a script that reads in 500 LHC PPE simulations as a data set, with inputs for a climate variable and time range. When the cluster of simulations is read in, a preprocess function performs dimensional reduction to simplify the data set for later wrangling. Once the data sets of interest were loaded, they needed some dimensional corrections to handle quirks that come with using CESM data. Our friends at NCAR CGDL provided us with the correction for the time-paring bug. The other functions, which weight each grid cell by land area, weight each month by its share of the days in a year, and calculate the global average of each simulation, were written by our team to wrangle the data so it is suitable for emulation. These files were saved so they could be reused later through a built-in if-else statement within the read_n_wrangle() function. The preprocessed data is then used in the GPR ML emulator to make 100 predictions for a climate variable of interest and 32 individual parameters. To summarize briefly, without getting into the nitty-gritty, our GPR emulator does 3 things:

- Simplifies the LHC data so it can look at 1 parameter at a time and assess its relationship with a climate variable.
- Applies Fourier Amplitude Sensitivity Analysis to identify relationships between parameters and climate variables. It helps us see what the key influencers are.
- In the full chaotic LHC, it can assess the covariance of the parameter-parameter predictions simultaneously (this is the R^2 value you’ll see on your accuracy inset plot later).
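The wrangling steps named in the data processing description (weighting grid cells by land area, weighting months by their number of days, and taking a global average) can be sketched with NumPy. This is an illustrative sketch only and not the project's utils.py code. It assumes a monthly field of shape (12, lat, lon), a land-area array of shape (lat, lon), NaN over non-land cells, and a no-leap calendar. The names `global_mean`, `annual_mean`, and `DAYS_PER_MONTH` are hypothetical.

```python
import numpy as np

# Assumption: a 365-day (no-leap) calendar.
DAYS_PER_MONTH = np.array(
    [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], dtype=float
)


def global_mean(field, land_area):
    """Land-area-weighted mean over the last two (lat, lon) axes.

    Cells that are NaN in `field` get zero weight.
    """
    weights = np.where(np.isnan(field), 0.0, land_area)
    total = np.sum(np.nan_to_num(field) * weights, axis=(-2, -1))
    return total / np.sum(weights, axis=(-2, -1))


def annual_mean(monthly, days=DAYS_PER_MONTH):
    """Average 12 monthly values, weighting each month by its length."""
    return np.sum(monthly * days) / np.sum(days)


# Toy 2x2 grid: two land cells (areas 1 and 3) and two ocean cells.
land_area = np.array([[1.0, 3.0], [0.0, 0.0]])
base = np.array([[10.0, 20.0], [np.nan, np.nan]])
# Monthly field: the base pattern plus the month index 0..11.
field = base[None, :, :] + np.arange(12.0)[:, None, None]

monthly_global = global_mean(field, land_area)  # shape (12,)
annual_global = annual_mean(monthly_global)
print(monthly_global)
print(annual_global)
```

Because months differ in length, the day-weighted annual value differs slightly from a plain average of the twelve monthly values. That small difference is the reason for the month weighting.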

Additionally, it ‘pickles’ and saves the predictions and the trained gpr_model so they can be used for further analysis, exploration, and visualization. The notebook linked as "Attributes and structures defined in this notebook" outlines the workflow used to generate the data in this repo. It pulls functions from utils.py to execute the desired commands. Below we will look at the utils.py functions that are not explicitly defined in the notebook. General side note: if you decide to explore the "Attributes and structures defined in this notebook" link explaining how the data was made, you’ll notice that you are transported to another repo in this organization, GaiaFuture. That’s our prototype playground! It’s a little messy because that’s where we spent the second half of this project tinkering. The official repository is https://github.com/GaiaFuture/CLM5_PPE_Emulator.
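The description mentions saving preprocessed files and pickled results so they can be reloaded through an if-else check instead of being recomputed. A minimal sketch of that caching pattern with Python's standard library follows. It is not the project's read_n_wrangle() implementation. The helper `load_or_compute`, the stand-in `expensive_wrangle`, and the variable name "GPP" are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute, *args, **kwargs):
    """Return the cached result if the file exists, else compute and save it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    else:
        result = compute(*args, **kwargs)
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        with cache_path.open("wb") as fh:
            pickle.dump(result, fh)
        return result


calls = []  # records how often the expensive step actually runs


def expensive_wrangle(variable):
    """Stand-in for a slow read-and-wrangle step (hypothetical)."""
    calls.append(variable)
    return {"variable": variable, "global_mean": [1.0, 2.0, 3.0]}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "GPP.pkl"
    first = load_or_compute(path, expensive_wrangle, "GPP")   # computes, saves
    second = load_or_compute(path, expensive_wrangle, "GPP")  # loads from disk

print(first == second, len(calls))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.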
