License: https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0)
Problem Statements for Data Visualization – Supermarket Sales Dataset

1. Sales Performance Across Branches
Management wants to understand how sales performance varies across supermarket branches in Lagos, Abuja, Ogun, and Port Harcourt to identify the best-performing locations and areas that need improvement.
Suggested Visualizations:
• Bar chart comparing total sales and profit by branch
• Map chart showing sales by city
• KPI cards: Total Sales, Profit, and Average Transaction Value per branch

2. Customer Purchase Behavior
The marketing team needs insights into how different customer types (Member vs Normal) and genders influence purchase trends and average spending.
Suggested Visualizations:
• Pie chart for customer type distribution
• Bar chart for average spend by gender
• Segmented comparison of total sales by customer type

3. Product Line Performance
The business wants to know which product categories drive the highest revenue, quantity sold, and customer satisfaction to optimize stock levels and marketing focus.
Suggested Visualizations:
• Bar chart showing total sales by product line
• Column chart comparing average rating per product line
• Profit margin chart by product line

4. Sales Trends Over Time
The management team wants to monitor sales trends over time to identify peak periods, track seasonal variations, and plan future promotions accordingly.
Suggested Visualizations:
• Line chart showing monthly or weekly sales trend
• Seasonal decomposition (sales by month)
• Trendline showing revenue growth

5. Payment Method Analysis
The finance department needs to evaluate payment method usage (Cash, E-wallet, Credit Card) across cities to improve payment convenience and reduce transaction delays.
Suggested Visualizations:
• Donut or bar chart showing share of payment methods
• City-level breakdown of preferred payment type
• Correlation between payment method and average transaction value

6. Customer Satisfaction Insights
The customer experience team wants to explore how customer ratings relate to sales amount, product type, and branch performance to identify drivers of customer satisfaction.
Suggested Visualizations:
• Scatter plot of rating vs total purchase amount
• Heat map of average rating by branch and product line
• KPI card showing average customer rating
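As a concrete starting point for problem 1, here is a minimal pandas/matplotlib sketch; the file name supermarket_sales.csv and the Branch and Total column names are assumptions, not guaranteed by the dataset description:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("supermarket_sales.csv")  # assumed file name

# Problem 1: bar chart of total sales by branch, plus simple KPI values.
sales_by_branch = df.groupby("Branch")["Total"].sum()  # assumed column names
sales_by_branch.plot(kind="bar", ylabel="Total sales")
plt.show()

print("Total sales:", df["Total"].sum())
print("Average transaction value:", df["Total"].mean())
```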
License: https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0)
Domain-Specific Dataset and Visualization Guide
This package contains 20 realistic datasets in CSV format across different industries, along with 20 text files suggesting visualization ideas. Each dataset includes about 300 rows of synthetic but domain-appropriate data. They are designed for data analysis, visualization practice, machine learning projects, and dashboard building.
What’s inside
20 CSV files, one for each domain:
20 TXT files, each listing 10 relevant graphing options for the dataset.
MASTER_INDEX.csv, which summarizes all domains with their column names.
Use cases
Example
The Education dataset has columns like StudentName, Class, Subject, Marks, and AttendancePercent. Suggested graphs: bar chart of average marks by subject, scatter plot of marks vs attendance percent, line chart of attendance over time.
The E-Commerce dataset has columns like OrderDate, Product, Category, Price, Quantity, and Total. Suggested graphs: line chart of revenue trend, bar chart of revenue by category, pie chart of payment mode share.
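To make one of these suggestions concrete, a minimal sketch of the E-Commerce revenue-trend line chart, assuming a file named ecommerce.csv with the columns listed above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Line chart of the monthly revenue trend for the E-Commerce dataset.
df = pd.read_csv("ecommerce.csv", parse_dates=["OrderDate"])
revenue = df.groupby(df["OrderDate"].dt.to_period("M"))["Total"].sum()
revenue.plot(kind="line", xlabel="Month", ylabel="Revenue")
plt.show()
```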
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Resize the window as you like and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
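The full R script is only provided as a PowerPoint slide. For orientation, here is a rough Python rendering of the same kind of figure (jittered dots coloured by replicate, boxplots superimposed), assuming the three-column CSV from Step 1; this is a sketch, not the authors' script:

```python
# Rough Python rendering of the protocol's figure, not the authors' R script.
# Assumes the three-column CSV from Step 1: Replicate, Condition, Value.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # input file from Step 1 (assumed name)
conditions = list(df["Condition"].unique())

fig, ax = plt.subplots()

# Boxplots superimposed per condition, black outlines as in the R version.
groups = [df.loc[df["Condition"] == c, "Value"] for c in conditions]
ax.boxplot(groups, positions=range(len(conditions)),
           medianprops={"color": "black"})

# Jittered dots coloured by biological replicate.
rng = np.random.default_rng(0)
for replicate, grp in df.groupby("Replicate"):
    x = np.array([conditions.index(c) for c in grp["Condition"]], dtype=float)
    ax.scatter(x + rng.uniform(-0.15, 0.15, len(x)), grp["Value"],
               label=str(replicate), alpha=0.8)

ax.set_xticks(range(len(conditions)))
ax.set_xticklabels(conditions)
ax.set_ylabel("Value")
ax.legend(title="Replicate")
fig.savefig("categorical_scatterplot.pdf")  # Step 3: export as .pdf
```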
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035.
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128.
License: https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0)
This project involves the creation of an interactive Excel dashboard for SwiftAuto Traders to analyze and visualize car sales data. The dashboard includes several visualizations to provide insights into car sales, profits, and performance across different models and manufacturers. The project makes use of various charts and slicers in Excel for the analysis.
Objective: The primary goal of this project is to showcase the ability to manipulate and visualize car sales data effectively using Excel. The dashboard aims to provide:
- Profit and Sales Analysis for each dealer.
- Sales Performance across various car models and manufacturers.
- Resale Value Analysis comparing prices and resale values.
- Insights into Retention Percentage by car model.

Files in this Project:
- Car_Sales_Kaggle_DV0130EN_Lab3_Start.xlsx: the original dataset used to create the dashboard.
- dashboards.xlsx: the final Excel file that contains the complete dashboard with interactive charts and slicers.

Key Visualizations:
- Average Price and Year Resale Value: a bar chart comparing the average price and resale value of various car models.
- Power Performance Factor: a column chart displaying the performance across different car models.
- Unit Sales by Model: a donut chart showcasing unit sales by car model.
- Retention Percentage: a pie chart illustrating customer retention by car model.

Tools Used:
- Microsoft Excel for creating and organizing the visualizations and dashboard.
- Excel Slicers for interactive filtering.
- Charts: bar charts, pie charts, column charts, and sunburst charts.

How to Use:
1. Download the dataset: the Car_Sales_Kaggle_DV0130EN_Lab3_Start.xlsx file is available on Kaggle; follow the steps to create a similar dashboard in Excel.
2. Open the dashboard: the dashboards.xlsx file contains the final version of the dashboard. Simply open it in Excel and start exploring the interactive charts and slicers.
We start by cleaning our data. Tasks during this section include:
- Dropping NaN values from the DataFrame
- Removing rows based on a condition
- Changing the type of columns (to_numeric, to_datetime, astype)
Once we have cleaned up our data a bit, we move to the data exploration section. In this section we explore 5 high-level business questions related to our data:
- What was the best month for sales? How much was earned that month?
- What city sold the most product?
- What time should we display advertisements to maximize the likelihood of customers buying products?
- What products are most often sold together?
- What product sold the most? Why do you think it sold the most?
To answer these questions we walk through many different pandas & matplotlib methods. They include:
- Concatenating multiple CSVs together to create a new DataFrame (pd.concat)
- Adding columns
- Parsing cells as strings to make new columns (.str)
- Using the .apply() method
- Using groupby to perform aggregate analysis
- Plotting bar charts and line graphs to visualize our results
- Labeling our graphs
A condensed sketch of this pipeline appears below.
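The sketch assumes monthly CSVs with columns such as 'Order Date', 'Quantity Ordered', and 'Price Each'; these names are assumptions, not confirmed by the description:

```python
import glob
import pandas as pd
import matplotlib.pyplot as plt

# Concatenate the monthly CSVs into one DataFrame (file pattern assumed).
df = pd.concat(pd.read_csv(f) for f in glob.glob("sales_data/Sales_*.csv"))

# Cleaning: drop empty rows, remove rows on a condition, fix column types.
df = df.dropna(how="all")
df = df[df["Order Date"] != "Order Date"]  # assumed: repeated header rows
df["Quantity Ordered"] = pd.to_numeric(df["Quantity Ordered"])
df["Price Each"] = pd.to_numeric(df["Price Each"])
df["Order Date"] = pd.to_datetime(df["Order Date"])

# Add columns, then aggregate: what was the best month for sales?
df["Month"] = df["Order Date"].dt.month
df["Sales"] = df["Quantity Ordered"] * df["Price Each"]
monthly = df.groupby("Month")["Sales"].sum()

monthly.plot(kind="bar", xlabel="Month", ylabel="Sales (USD)")
plt.show()
```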
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains a list of 186 Digital Humanities projects leveraging information visualisation techniques. Each project has been classified according to visualisation and interaction methods, narrativity and narrative solutions, domain, methods for the representation of uncertainty and interpretation, and the employment of critical and custom approaches to visually represent humanities data.
The project_id column contains unique internal identifiers assigned to each project. Meanwhile, the last_access column records the most recent date (in DD/MM/YYYY format) on which each project was reviewed based on the web address specified in the url column.
The remaining columns can be grouped into descriptive categories aimed at characterising projects according to different aspects:
Narrativity. It reports the presence of information visualisation techniques employed within narrative structures. Here, the term narrative encompasses both author-driven linear data stories and more user-directed experiences where the narrative sequence is determined by user exploration [1]. We define 2 columns to identify projects using visualisation techniques in narrative or non-narrative sections. Both conditions can be true for projects employing visualisations in both contexts. Columns:
non_narrative (boolean)
narrative (boolean)
Domain. The humanities domain to which the project is related. We rely on [2] and the chapters of the first part of [3] to abstract a set of general domains. Column:
domain (categorical):
History and archaeology
Art and art history
Language and literature
Music and musicology
Multimedia and performing arts
Philosophy and religion
Other: both extra-list domains and cases of collections without a unique or specific thematic focus.
Visualisation of uncertainty and interpretation. Building upon the frameworks proposed by [4] and [5], a set of categories was identified, highlighting a distinction between precise and impressional communication of uncertainty. Precise methods explicitly represent quantifiable uncertainty such as missing, unknown, or uncertain data, precisely locating and categorising it using visual variables and positioning. Two sub-categories are interactive distinction, when uncertain data is not visually distinguishable from the rest of the data but can be dynamically isolated or included/excluded categorically through interaction techniques (usually filters); and visual distinction, when uncertainty visually “emerges” from the representation by means of dedicated glyphs and spatial or visual cues and variables. On the other hand, impressional methods communicate the constructed and situated nature of data [6], exposing the interpretative layer of the visualisation and indicating more abstract and unquantifiable uncertainty using graphical aids or interpretative metrics. Two sub-categories are: ambiguation, when the use of graphical expedients (like permeable glyph boundaries or broken lines) visually conveys the ambiguity of a phenomenon; and interpretative metrics, when expressive, non-scientific, or non-punctual metrics are used to build a visualisation. Column:
uncertainty_interpretation (categorical):
Interactive distinction
Visual distinction
Ambiguation
Interpretative metrics
Critical adaptation. We identify projects in which, for at least one visualisation, the following criteria are fulfilled: 1) avoid repurposing of prepackaged, generic-use, or ready-made solutions; 2) being tailored and unique to reflect the peculiarities of the phenomena at hand; 3) avoid simplifications to embrace and depict complexity, promoting time-consuming visualisation-based inquiry. Column:
critical_adaptation (boolean)
Non-temporal visualisation techniques. We adopt and partially adapt the terminology and definitions from [7]. A column is defined for each type of visualisation and accounts for its presence within a project, also including stacked layouts and more complex variations. Columns and inclusion criteria:
plot (boolean): visual representations that map data points onto a two-dimensional coordinate system.
cluster_or_set (boolean): sets or cluster-based visualisations used to unveil possible inter-object similarities.
map (boolean): geographical maps used to show spatial insights. While we do not specify the variants of maps (e.g., pin maps, dot density maps, flow maps, etc.), we make an exception for maps where each data point is represented by another visualisation (e.g., a map where each data point is a pie chart) by accounting for the presence of both in their respective columns.
network (boolean): visual representations highlighting relational aspects through nodes connected by links or edges.
hierarchical_diagram (boolean): tree-like structures such as tree diagrams, radial trees, but also dendrograms. They differ from networks in their strictly hierarchical structure and absence of closed connection loops.
treemap (boolean): still hierarchical, but highlighting quantities expressed by means of area size. It also includes circle packing variants.
word_cloud (boolean): clouds of words, where each instance’s size is proportional to its frequency in a related context.
bars (boolean): includes bar charts, histograms, and variants. It coincides with “bar charts” in [7] but with a more generic term to refer to all bar-based visualisations.
line_chart (boolean): the display of information as sequential data points connected by straight-line segments.
area_chart (boolean): similar to a line chart but with a filled area below the segments. It also includes density plots.
pie_chart (boolean): circular graphs divided into slices which can also use multi-level solutions.
plot_3d (boolean): plots that use a third dimension to encode an additional variable.
proportional_area (boolean): representations used to compare values through area size. Typically, using circle- or square-like shapes.
other (boolean): it includes all other types of non-temporal visualisations that do not fall into the aforementioned categories.
Temporal visualisations and encodings. In addition to non-temporal visualisations, a group of techniques to encode temporality is considered in order to enable comparisons with [7]. Columns:
timeline (boolean): the display of a list of data points or spans in chronological order. They include timelines working either with a scale or simply displaying events in sequence. As in [7], we also include structured solutions resembling Gantt chart layouts.
temporal_dimension (boolean): to report when time is mapped to any dimension of a visualisation, with the exclusion of timelines. We use the term “dimension” and not “axis” as in [7] as more appropriate for radial layouts or more complex representational choices.
animation (boolean): temporality is perceived through an animation changing the visualisation according to time flow.
visual_variable (boolean): another visual encoding strategy is used to represent any temporality-related variable (e.g., colour).
Interactions. A set of categories to assess affordable interactions based on the concept of user intent [8] and user-allowed perceptualisation data actions [9]. The following categories roughly match the manipulative subset of methods of the “how” an interaction is performed in the conception of [10]. Only interactions that affect the aspect of the visualisation or the visual representation of its data points, symbols, and glyphs are taken into consideration. Columns:
basic_selection (boolean): the demarcation of an element either for the duration of the interaction or more permanently until the occurrence of another selection.
advanced_selection (boolean): the demarcation involves both the selected element and connected elements within the visualisation or leads to brush and link effects across views. Basic selection is tacitly implied.
navigation (boolean): interactions that allow moving, zooming, panning, rotating, and scrolling the view but only when applied to the visualisation and not to the web page. It also includes “drill” interactions (to navigate through different levels or portions of data detail, often generating a new view that replaces or accompanies the original) and “expand” interactions generating new perspectives on data by expanding and collapsing nodes.
arrangement (boolean): the organisation of visualisation elements (symbols, glyphs, etc.) or multi-visualisation layouts spatially through drag and drop or
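Although the interaction column list is cut off above, a minimal sketch of how this codebook might be queried with pandas can still be given; the file name dh_projects.csv is an assumption, and the boolean columns are assumed to parse as True/False:

```python
import pandas as pd

df = pd.read_csv("dh_projects.csv")  # assumed file name for the codebook

# Projects using visualisations in narrative sections that include a network
# (boolean columns assumed to load as True/False values).
narrative_networks = df[df["narrative"] & df["network"]]
print(len(narrative_networks), "narrative projects with network visualisations")

# Frequency of each non-temporal technique across the 186 projects.
technique_cols = ["plot", "cluster_or_set", "map", "network",
                  "hierarchical_diagram", "treemap", "word_cloud", "bars",
                  "line_chart", "area_chart", "pie_chart", "plot_3d",
                  "proportional_area", "other"]
print(df[technique_cols].sum().sort_values(ascending=False))

# Cross-tabulation of uncertainty-representation methods by domain.
print(pd.crosstab(df["domain"], df["uncertainty_interpretation"]))
```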
Data for Figure SPM.4 from the Summary for Policymakers (SPM) of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6). Figure SPM.4 panel a shows global emissions projections for CO2 and a set of key non-CO2 climate drivers, for the core set of five IPCC AR6 scenarios. Figure SPM.4 panel b shows attributed warming in 2081-2100 relative to 1850-1900 for total anthropogenic, CO2, other greenhouse gases, and other anthropogenic forcings for five Shared Socio-economic Pathway (SSP) scenarios.

How to cite this dataset
When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates:
IPCC, 2021: Summary for Policymakers. In: Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 3−32, doi:10.1017/9781009157896.001.

Figure subpanels
The figure has two panels, with data provided for both panels in subdirectories named panel_a and panel_b.

List of data provided
This dataset contains:
- Projected emissions from 2015 to 2100 for the five scenarios of the AR6 WGI core scenario set (SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5)
- Projected warming for all anthropogenic forcers, CO2 only, non-CO2 greenhouse gases (GHGs) only, and other anthropogenic components for 2081-2100 relative to 1850-1900, for SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5
The five illustrative SSP (Shared Socio-economic Pathway) scenarios are described in Box SPM.1 of the Summary for Policymakers and Section 1.6.1.1 of Chapter 1.

Data provided in relation to figure
Panel a: the first column contains the years, while the remaining columns contain the data per scenario and per climate forcer for the line graphs.
- Data file: Carbon_dioxide_Gt_CO2_yr.csv relates to the carbon dioxide emissions panel
- Data file: Methane_Mt_CO2_yr.csv relates to the methane emissions panel
- Data file: Nitrous_oxide_Mt N2O_yr.csv relates to the nitrous oxide emissions panel
- Data file: Sulfur_dioxide_Mt SO2_yr.csv relates to the sulfur dioxide emissions panel
Panel b:
- Data file: ts_warming_ranges_1850-1900_base_panel_b.csv. Rows 2 to 5 relate to the first bar chart (cyan), rows 6 to 9 to the second (blue), rows 10 to 13 to the third (orange), rows 14 to 17 to the fourth (red), and rows 18 to 21 to the fifth (brown).
Sources of additional information
The following weblinks are provided in the Related Documents section of this catalogue record:
- Link to the report webpage, which includes the report component containing the figure (Summary for Policymakers) and the Supplementary Material for Chapter 1, which contains details on the input data used in Table 1.SM.1 (Cross-Chapter Box 1.4, Figure 2).
- Link to the related publication for input data used in panel a.
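For orientation, a minimal sketch of how one of the panel a files could be plotted, assuming (as described above) that the first column holds years and each remaining column one scenario:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Panel a, carbon dioxide: years in the first column, one column per scenario.
df = pd.read_csv("panel_a/Carbon_dioxide_Gt_CO2_yr.csv")
years = df.iloc[:, 0]
for scenario in df.columns[1:]:
    plt.plot(years, df[scenario], label=scenario)
plt.xlabel("Year")
plt.ylabel("CO2 emissions (Gt CO2/yr)")
plt.legend()
plt.show()
```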
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the sponsoring U.S. employer must submit the H1B petition to the immigration department. This is the most common visa status applied for by international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and trends among foreign applicants and employers for H1B visa applications. Because locations, employers, job titles, and salary ranges make up most of an H1B petition, different visualization tools are used to analyze and interpret H1B visa trends and provide recommendations to applicants. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.

Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js

Dataset
The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.
Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm

Running the code
Open Index.html.

Data Processing
- Preprocess the raw data into an understandable format.
- Find and combine other external datasets, such as the FY2017 dataset, to enrich the analysis.
- Develop variables and compile them into the visualization programs to make appropriate visualizations.
- Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
- Extract key aspects and analyze the changes in employers' preferences, as well as forecasts for future trends.

Visualizations
- Combo chart: overall volume of receipts and approval rate.
- Scatter plot: beneficiary country of birth.
- Geo map: H1B petitions filed across all states.
- Line chart: top 10 states for H1B petitions filed.
- Pie chart: comparison of education level and occupations for petitions, FY2011 vs FY2017.
- Tree map: the top employers who submit the greatest number of applications.
- Side-by-side bar chart: overall comparison of Data Scientist and Data Analyst.
- Highlight table: mean wage of a Data Scientist and a Data Analyst with case status certified.
- Bubble chart: top 10 companies for Data Scientist and Data Analyst.

Related Research
- The H-1B Visa Debate, Explained (Harvard Business Review): https://hbr.org/2017/05/the-h-1b-visa-debate-explained
- Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
- Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
- H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
- H-1B visa (Wikipedia): https://en.wikipedia.org/wiki/H-1B_visa

Key Findings
- The analysis shows the government cutting down the number of H1B approvals in 2017.
- In the past decade, given the demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
- Technical jobs such as Computer Systems Analyst and Software Developer fill the majority of the top 10 jobs among foreign workers.
- Employers located in metro areas strive to find foreign workers who can fill their open technical positions.
- States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are prime locations for foreign workers and provide many job opportunities.
- Top companies such as Infosys, Tata, and IBM India, which submit the most H1B visa applications, are India-based software and IT services companies.
- The Data Scientist position has experienced exponential growth in H1B visa applications, with jobs clustered most heavily in the West region.

Visualization programs
HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
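As an example of the aggregation behind the 'top 10 states' chart, a hedged pandas sketch; the file name and the WORKSITE column format ('CITY, STATE') are assumptions about the Kaggle dump:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("h1b_kaggle.csv")  # assumed file name

# Top 10 worksite states by number of petitions. WORKSITE is assumed to
# look like "CITY, STATE"; the state is taken after the last comma.
df["STATE"] = df["WORKSITE"].str.split(",").str[-1].str.strip()
top_states = df["STATE"].value_counts().head(10)
top_states.plot(kind="bar", ylabel="Petitions filed")
plt.show()
```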
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
About Dataset:
Domain: Marketing
Project: User Profiling and Segmentation
Datasets: user_profile_for_ads
Dataset Type: Excel Data
Dataset Size: 16k+ records
KPIs:
1. Distribution of key demographic variables:
   a. Count of Age
   b. Count of Gender
   c. Count of Education Level
   d. Count of Income Level
   e. Count of Device Usage
2. Understanding online behavior:
   a. Count of Time Spent Online (hrs/Weekday)
   b. Count of Time Spent Online (hrs/Weekend)
3. Ad interaction metrics:
   a. Count of Likes and Reactions
   b. Count of Click-Through Rates (CTR)
   c. Count of Conversion Rate
   d. Count of Ad Interaction Time (secs)
   e. Count of Ad Interaction Time by Top Interests
Process:
1. Understanding the problem
2. Data collection
3. Exploring and analyzing the data
4. Interpreting the results

This project uses a stacked column chart, a stacked bar chart, a pie chart, a dashboard, slicers, and page navigation buttons.
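A hedged pandas sketch of KPIs 1 and 2; the workbook and column names are assumptions based on the KPI list above:

```python
import pandas as pd

df = pd.read_excel("user_profile_for_ads.xlsx")  # assumed file name

# KPI 1: distribution of key demographic variables (column names assumed).
for col in ["Age", "Gender", "Education Level", "Income Level", "Device Usage"]:
    print(df[col].value_counts(), "\n")

# KPI 2: online behaviour, weekday vs weekend time online.
print(df[["Time Spent Online (hrs/Weekday)",
          "Time Spent Online (hrs/Weekend)"]].mean())
```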
Ecommerce transaction analysis is a great way to learn data visualization with Power BI or Tableau. Your visualization should reveal customer sales, product sales, regional sales, monthly sales, and time-of-day sales to gain valuable insights for business planning. You may use combo charts, cards, bar charts, tables, or line charts; for the customer segmentation page, you could employ column charts, bubble charts, point maps, tables, etc.
By Gove Allen [source]
The Law and Order Dataset is a comprehensive collection of data related to the popular television series Law and Order that aired from 1990 to 2010. This dataset, compiled by IMDB.com, provides detailed information about each episode of the show, including its title, summary, airdate, director, writer, guest stars, and IMDb rating.
With over 450 episodes spanning 20 seasons of the original series as well as its spin-offs like Law and Order: Special Victims Unit, this dataset offers a wealth of information for analyzing various facets of criminal justice and law enforcement portrayed in the show. Whether you are a student or researcher studying crime-related topics or simply an avid fan interested in exploring behind-the-scenes details about your favorite episodes or actors involved in them, this dataset can be a valuable resource.
By examining this extensive collection of data using SQL queries or other analytical techniques, one can gain insights into patterns such as common tropes used in different seasons or characters that appeared most frequently throughout the series. Additionally, researchers can investigate correlations between factors like episode directors/writers and their impact on viewer ratings.
This dataset allows users to dive deep into analyzing aspects like crime types covered within episodes (e.g., homicide cases versus white-collar crimes), how often certain guest stars made appearances (including famous actors who had early roles on the show), or which writers/directors contributed most consistently high-rated episodes. Such analyses provide opportunities for uncovering trends over time within Law and Order's narrative structure while also shedding light on societal issues addressed by the series.
By making this dataset available for educational purposes at collegiate levels specifically aimed at teaching SQL skills—a powerful tool widely used in data analysis—the intention is to empower students with real-world examples they can explore hands-on while honing their database querying abilities. The graphical representation accompanying this dataset further enhances understanding by providing visualizations that illustrate key relationships between different variables.
Whether you are a seasoned data analyst, a budding criminologist, or simply looking to understand the intricacies of one of the most successful crime dramas in television history, the Law and Order Dataset offers you a vast array of information ripe for exploration and analysis.
Understanding the Columns
Before diving into analyzing the data, it's important to understand what each column represents. Here is an overview:
Episode: The episode number within its respective season.
Title: The title of each episode.
Season: The season number to which each episode belongs.
Year: The year in which each episode was released.
Rating: IMDB rating for each episode (on a scale from 0-10).
Votes: Number of votes received by each episode on IMDB.
Description: Brief summary or description of each episode's plot.
Director: Director(s) responsible for directing an episode.
Writers: Writer(s) credited for writing an episode.
Stars: Actor(s) who starred in an individual episode.

Exploring Episode Data
The dataset allows you to explore various aspects of individual episodes as well as broader trends throughout different seasons:
1. Analyzing Ratings:
   - You can examine how ratings vary across seasons using aggregation functions like average (AVG), minimum (MIN), maximum (MAX), etc., depending on your analytical goals.
   - Identify popular episodes by sorting based on highest ratings or most votes received.
2. Trends over Time:
   - Investigate how ratings have changed over time by visualizing them using line charts or bar graphs based on release years or seasons.
   - Examine if there are any significant fluctuations in ratings across different seasons or years.
3. Directors and Writers:
   - Identify episodes directed by a specific director or written by particular writers by filtering the dataset based on their names.
   - Analyze the impact of different directors or writers on episode ratings.
4. Popular Actors:
   - Explore episodes featuring popular actors from the show such as Mariska Hargitay (Olivia Benson), Christopher Meloni (Elliot Stabler), etc.
   - Investigate whether episodes with popular actors received higher ratings compared to ...
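Since the dataset is aimed at teaching SQL, here is a small sketch of analysis 1 run through Python's built-in sqlite3; the CSV file name is an assumption, and the column names follow the list above:

```python
import sqlite3
import pandas as pd

# Load the episode data into an in-memory SQLite database
# (the CSV file name is an assumption; columns are as listed above).
conn = sqlite3.connect(":memory:")
pd.read_csv("law_and_order_episodes.csv").to_sql("episodes", conn, index=False)

# Analysis 1: how ratings vary across seasons.
query = """
SELECT Season,
       AVG(Rating) AS avg_rating,
       MIN(Rating) AS min_rating,
       MAX(Rating) AS max_rating,
       SUM(Votes)  AS total_votes
FROM episodes
GROUP BY Season
ORDER BY avg_rating DESC;
"""
print(pd.read_sql_query(query, conn))
```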
License: MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The dataset contains information on nearly 150,000 products listed on Myntra. Each entry includes:
Data Analysis on Myntra Dataset
Data analysis on the Myntra dataset, represented using pivot tables and an interactive dashboard.
In this data analysis project, I undertook a comprehensive approach to enhance and visualize the Myntra real-time dataset. The key steps involved in the process were as follows:
Data Cleaning and Preparation:
Remove Unwanted Columns: I reviewed the dataset to identify and eliminate irrelevant columns, such as size and discounted amount, that did not contribute to the analysis objectives. This step streamlined the dataset, focusing on the most pertinent data.
Data Cleaning: Addressed inconsistencies, missing values, and errors within the dataset. This involved standardizing data formats, correcting inaccuracies, and filling in or removing incomplete records to ensure the dataset's integrity.
Data Analysis:
Pivot Tables Creation: Developed pivot tables to summarize and analyze key metrics. This allowed for the aggregation of data across various dimensions such as product categories, sales performance, and customer demographics, providing insightful summaries and trends.
Interactive Dashboard:
Dashboard Development: Created an interactive dashboard to visualize real-time data. This dashboard includes dynamic charts, filters, and visualizations that let users interact with the dataset, facilitating real-time insights and decision-making.
Visualization: Implemented various types of visualizations, such as bar charts and column charts, to effectively communicate data trends and patterns.
Overall, this project aimed to deliver a clean, organized, and insightful view of the Myntra dataset through advanced analysis and interactive visualization techniques. The resulting dashboard offers a powerful tool for monitoring and analyzing real-time data, supporting data-driven decision-making processes.
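As a flavour of what the pivot-table step looks like outside Excel, a hedged pandas sketch; the file and column names (price, category, gender) are assumptions:

```python
import pandas as pd

df = pd.read_csv("myntra_products.csv")  # assumed file name

# Pivot table analogous to the Excel ones described above
# (price/category/gender column names are assumptions).
pivot = pd.pivot_table(df, values="price", index="category",
                       columns="gender", aggfunc=["mean", "count"])
print(pivot)
```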
Portfolio_Adidas_Dataset: a set of real-world dataset tasks completed using the Python pandas and matplotlib libraries.
Background Information: In this portfolio, we use Python pandas and matplotlib to analyze and answer business questions about five products' worth of sales data. The data contains hundreds of thousands of footwear store purchases broken down by product type, cost, region, state, city, and so on.
We start by cleaning our data. Tasks during this section include:
Once we have cleaned up our data a bit, we move to the data exploration section. In this section we explore 5 high-level business questions related to our data:
To answer these questions we walk through many different openpyxl, pandas, and matplotlib methods. They include:
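The task lists above are cut off in the source; as a flavour of such a pipeline, a hedged sketch in which the workbook and column names (adidas_sales.xlsx, Total Sales, Region) are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sketch: load the footwear sales workbook and answer one
# high-level question (all file and column names are assumptions).
df = pd.read_excel("adidas_sales.xlsx")

df = df.dropna(how="all")
df["Total Sales"] = pd.to_numeric(df["Total Sales"], errors="coerce")

# Which region sold the most?
by_region = df.groupby("Region")["Total Sales"].sum().sort_values()
by_region.plot(kind="barh", xlabel="Total sales (USD)")
plt.show()
```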
By IBM Watson AI XPRIZE - Environment [source]
This dataset from Kaggle contains global land and surface temperature data from major cities around the world. By relying on the raw temperature reports that form the foundation of their averaging system, researchers are able to accurately track climate change over time. With this dataset, we can observe monthly averages and create detailed gridded temperature fields to analyze localized data on a country-by-country basis. The information in this dataset has allowed us to gain a better understanding of our changing planet and how certain regions are being impacted more than others by climate change. With such insights, we can look towards developing better responses and strategies as our temperatures continue to increase over time.
Introduction
This guide will show you how to use this dataset to explore global climate change trends over time.
Exploring the Dataset
1. Select one or more countries with df[df['Country'] == 'countryname'] to filter out information unrelated to those countries.
2. Use df.groupby('City')['AverageTemperature'] to group all cities together with their respective average temperatures.
3. Compute basic summary statistics for each group, such as the mean or median, with .mean() or .median() according to your statistical requirements.
4. Plot a graph comparing these results as line plots or bar charts with the pandas plot function, e.g. df[column].plot(kind='line') or df[column].plot(kind='bar'), to help visualize trends across these groups.
You can also use the latitude/longitude coordinates provided with every record to further decompose records by location using the folium library in Python; folium provides zoomable maps and many other rendering options, such as mapping locations with different color shades and sizes based on different parameters. These are just some ways you could visualize your data; there are plenty more possibilities! A consolidated sketch of steps 1-4 follows.
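The sketch uses the by-country file and column names documented below (dt, AverageTemperature, Country); the chosen country is just an example:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("GlobalLandTemperaturesByCountry.csv")

# Steps 1-4 above: filter one country, group by year, summarize, plot.
india = df[df["Country"] == "India"].copy()
india["dt"] = pd.to_datetime(india["dt"])
yearly = india.groupby(india["dt"].dt.year)["AverageTemperature"].mean()
yearly.plot(kind="line", xlabel="Year", ylabel="Average temperature (°C)")
plt.show()
```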
- Analyzing temperature changes across different countries to identify regional climate trends and abnormalities.
- Investigating how global warming is affecting urban areas by looking at the average temperatures of major cities over time.
- Comparing historic average temperatures for a given region to current day average temperatures to quantify the magnitude of global warming in that region.
If you use this dataset in your research, please credit the original authors. Data Source
License: Dataset copyright by authors.
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: distribute your contributions under the same license as the original.
  - Keep intact: all notices that refer to this license, including copyright notices.
File: GlobalLandTemperaturesByCountry.csv

| Column name | Description |
|:---|:---|
| dt | Date of the temperature measurement. (Date) |
| AverageTemperature | Average temperature for the given date. (Float) |
| AverageTemperatureUncertainty | Uncertainty of the average temperature measurement. (Float) |
| Country | Country where the temperature measurement was taken. (String) |

File: GlobalLandTemperaturesByMajorCity.csv

| Column name | Description |
|:---|:---|
| dt | Date... |
By Throwback Thursday [source]
This dataset offers a comprehensive analysis of recorded music revenue in the United States, specifically focusing on the 10th week of the year. The data is meticulously categorized based on different formats, shedding light on the diverse ways in which music is consumed and purchased by individuals. The dataset includes key columns that provide relevant information, such as Format, Year, Units, Revenue, and Revenue (Inflation Adjusted). These columns offer valuable insights into the specific format of music being consumed or purchased, the respective year in which this data was recorded, the number of units of music sold within each format category, and both the total revenue generated from sales and its corresponding inflation-adjusted amount. Analyzing this dataset, with its extensive information about recorded music revenue across formats during a specific week of each year in the United States market, can surface meaningful patterns and trends that help industry professionals make informed decisions about marketing strategies or investments.
Introduction:
Familiarize Yourself with Columns:
- Format: This column categorizes how music is consumed or purchased.
- Year: This column represents the year when each data point was recorded.
- Units: The number of units of music sold within a particular format during a given week.
- Revenue: The total revenue generated from sales of music within a specific format during a given week.
- Revenue (Inflation Adjusted): The total revenue generated from sales of music adjusted for inflation within a specific format during a given week.
Understanding Categorical Formats: In this dataset, formats refer to different ways in which music is consumed or purchased. Examples include physical formats like CDs and vinyl records, as well as digital formats such as downloads and streaming services.
Analyzing Trends over Time: By exploring data across multiple years, you can identify trends and patterns related to how formats have evolved over time. Use statistical techniques or visualization tools like line graphs or bar charts to gain insights into any fluctuations or consistent growth.
Comparing Units Sold vs Revenue Generated: Analyze both units sold and revenue generated columns simultaneously to understand if there are any significant differences between different formats' popularity versus their financial performance.
Examining Adjusted Revenue for Inflation Effects: Comparison between Revenue and Revenue (Inflation Adjusted) can provide insights into whether changes in revenue are due solely to changes in purchasing power caused by inflation or influenced by other factors affecting format popularity.
Identifying Format Preferences: Explore how units and revenue differ across various formats to determine whether consumer preferences are shifting towards digital formats or experiencing a resurgence in physical formats like vinyl.
Comparing Revenue Performance Between Formats: Use statistical analysis or data visualization techniques to compare revenue performance between different formats. Identify which format generates the highest revenue and whether there have been any changes in dominance over time.
Supplementary Research Opportunities: Combine this dataset with external sources on music industry trends, technological advancements, or major events like album releases to gain a deeper understanding of the factors influencing recorded music sales
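To make a step like 'Comparing Units Sold vs Revenue Generated' concrete, a hedged pandas sketch; the file name is an assumption, while the column names follow the list above:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("recorded_music_revenue.csv")  # assumed file name

# Units sold vs revenue by format (columns as documented above).
by_format = df.groupby("Format")[["Units", "Revenue",
                                  "Revenue (Inflation Adjusted)"]].sum()
print(by_format.sort_values("Revenue", ascending=False))

# Inflation-adjusted revenue trend per format over the years.
pivot = df.pivot_table(values="Revenue (Inflation Adjusted)",
                       index="Year", columns="Format", aggfunc="sum")
pivot.plot(kind="line", ylabel="Revenue (inflation adjusted)")
plt.show()
```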
- Trend analysis: This dataset can be used to analyze the trends in recorded music revenue by format over the years. By examining the revenue and units sold for each format, one can identify which formats are growing in popularity and which ones are declining.
- Comparison of revenue vs inflation-adjusted revenue: The dataset includes both total revenue and inflation-adjusted revenue for each format. This allows for a comparison of the actual revenue generated with the potential impact of inflation on that revenue. It can provide insights into whether the increase or decrease in revenue is solely due to changes in market demand or if it is influenced by changes in purchasing power.
- Format preference analysis: By analyzing the units sold for each format, one can identify which formats are preferred by consumers during a particular week. This information can be useful for music industry professionals and marketers to under...
License: Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
About Datasets:
Domain: Finance
Project: Bank loan of customers
Datasets: Finance_1.xlsx & Finance_2.xlsx
Dataset Type: Excel Data
Dataset Size: each Excel file has 39k+ records
KPIs:
1. Year-wise loan amount stats
2. Grade- and sub-grade-wise revolving balance
3. Total payment for verified status vs. total payment for non-verified status
4. State-wise loan status
5. Month-wise loan status
6. Further insights based on your understanding of the data

Process:
1. Understanding the problem
2. Data collection
3. Data cleaning
4. Exploring and analyzing the data
5. Interpreting the results

This project uses Power Query, Power Pivot, merged data, a clustered bar chart, a clustered column chart, a line chart, a 3D pie chart, a dashboard, slicers, a timeline, and formatting techniques.
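A hedged pandas sketch of KPIs 1 and 2 after merging the two workbooks; the join key and column names are assumptions, since only the KPI list is documented:

```python
import pandas as pd

# Merge the two Excel files, then compute KPIs 1 and 2.
# File names are as listed above; the join key and column names
# (id, issue_year, loan_amount, grade, sub_grade, revol_bal) are assumptions.
f1 = pd.read_excel("Finance_1.xlsx")
f2 = pd.read_excel("Finance_2.xlsx")
loans = f1.merge(f2, on="id")

print(loans.groupby("issue_year")["loan_amount"].sum())          # KPI 1
print(loans.groupby(["grade", "sub_grade"])["revol_bal"].sum())  # KPI 2
```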
By Coronavirus (COVID-19) Data Hub [source]
The COVID-19 Global Time Series Case and Death Data is a comprehensive collection of global COVID-19 case and death information recorded over time. This dataset includes data from various sources such as JHU CSSE COVID-19 Data and The New York Times.
The dataset consists of several columns providing detailed information on different aspects of the COVID-19 situation. The COUNTRY_SHORT_NAME column represents the short name of the country where the data is recorded, while the Data_Source column indicates the source from which the data was obtained.
Other important columns include Cases, which denotes the number of COVID-19 cases reported, and Difference, which indicates the difference in case numbers compared to the previous day. Additionally, there are columns such as CONTINENT_NAME, DATA_SOURCE_NAME, COUNTRY_ALPHA_3_CODE, COUNTRY_ALPHA_2_CODE that provide additional details about countries and continents.
Furthermore, this dataset also includes information on deaths related to COVID-19. The column PEOPLE_DEATH_NEW_COUNT shows the number of new deaths reported on a specific date.
To provide more context to the data, certain columns offer demographic details about locations. For instance, Population_Count provides population counts for different areas. Moreover, a FIPS code is available for provincial/state regions for identification purposes.
It is important to note that this dataset covers both confirmed cases (Case_Type: confirmed) as well as probable cases (Case_Type: probable). These classifications help differentiate between various types of COVID-19 infections.
Overall, this dataset offers a comprehensive picture of the global COVID-19 situation by providing accurate and up-to-date information on cases, deaths, demographic details (such as population count or FIPS code), source references (such as JHU CSSE or NY Times), and geographical information (country names coded with ALPHA codes), making it useful for researchers studying patterns and trends associated with this pandemic.
Understanding the Dataset Structure:
- The dataset is available in two files: COVID-19 Activity.csv and COVID-19 Cases.csv.
- Both files contain different columns that provide information about the COVID-19 cases and deaths.
- Some important columns to look out for are:
  a. PEOPLE_POSITIVE_CASES_COUNT: the total number of confirmed positive COVID-19 cases.
  b. COUNTY_NAME: the name of the county where the data is recorded.
  c. PROVINCE_STATE_NAME: the name of the province or state where the data is recorded.
  d. REPORT_DATE: the date when the data was reported.
  e. CONTINENT_NAME: the name of the continent where the data is recorded.
  f. DATA_SOURCE_NAME: the name of the data source.
  g. PEOPLE_DEATH_NEW_COUNT: the number of new deaths reported on a specific date.
  h. COUNTRY_ALPHA_3_CODE: the three-letter alpha code representing the country.
  i. Lat, Long: latitude and longitude coordinates representing the location.
  j. Country_Region or COUNTRY_SHORT_NAME: the country or region where cases were reported.
Choosing Relevant Columns: It's important to determine which columns are relevant to your analysis or research question before proceeding with further analysis.
Exploring Data Patterns: Use various statistical techniques like summarizing statistics, creating visualizations (e.g., bar charts, line graphs), etc., to explore patterns in different variables over time or across regions/countries.
Filtering Data: You can filter your dataset based on specific criteria using column(s) such as COUNTRY_SHORT_NAME, CONTINENT_NAME, or PROVINCE_STATE_NAME to focus on specific countries, continents, or regions of interest.
Combining Data: You can combine data from different sources (e.g., COVID-19 cases and deaths) to perform advanced analysis or create insightful visualizations.
Analyzing Trends: Use the dataset to analyze and identify trends in COVID-19 cases and deaths over time. You can examine factors such as population count, testing count, hospitalization count, etc., to gain deeper insights into the impact of the virus.
Comparing Countries/Regions: Compare COVID-19
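As one concrete instance of the filtering and trend-analysis steps above, a hedged pandas sketch using column names documented in this description (the file name matches the listed COVID-19 Activity.csv):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("COVID-19 Activity.csv")

# Filter to one country and plot new deaths over time (columns as listed above).
us = df[df["COUNTRY_SHORT_NAME"] == "United States"].copy()
us["REPORT_DATE"] = pd.to_datetime(us["REPORT_DATE"])
daily = us.groupby("REPORT_DATE")["PEOPLE_DEATH_NEW_COUNT"].sum()
daily.plot(kind="line", ylabel="New deaths per day")
plt.show()
```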
- Trend Analysis: This dataset can be used to analyze and track the trends of COVID-19 cases and deaths over time. It provides comprehensive global data, allowing researchers and po...
Hello all! This dataset involves various factors affecting cancer; based on those factors, I have created histograms of various columns of the table that relate to heart disease. A histogram is a bar-graph-like representation of data that buckets a range of outcomes into columns along the x-axis. The y-axis represents the count or percentage of occurrences in the data for each column and can be used to visualize data distributions. Finally, I created a combined histogram of the entire table involving all the columns. Adding titles, x-axis and y-axis names, sizes, and colors is also done in this notebook.
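A minimal matplotlib sketch of the kind of histograms described; the file name health_data.csv and the age column are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("health_data.csv")  # assumed file name

# Histogram of one column, with a title, axis names, size, and color,
# as described above ('age' is an assumed column name).
plt.figure(figsize=(8, 5))
plt.hist(df["age"], bins=20, color="steelblue")
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

# Combined histograms of every numeric column in the table.
df.hist(figsize=(12, 10))
plt.show()
```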
License: https://creativecommons.org/publicdomain/zero/1.0/ (CC0 1.0)
In this data analysis, I used the dataset ‘Restaurant Orders’ from https://mavenanalytics.io/data-playground, which is released under a Public Domain license. Public-domain work is free for anyone to use for any purpose without restriction under copyright law; it is the most open form of licensing, since no one owns or controls the material in any way. The ‘Restaurant Orders’ dataset has 3 dataframes in CSV format:
- restaurant_db_data_dictionary.csv: a description of the relationships between tables.
- order_details.csv: columns order_details_id, order_id, order_date, order_time, item_id.
- menu_items.csv: columns menu_item_id, item_name, category, price.
Using these 3 dataframes, we will create a new dataframe, ‘order_details_table’ (the result dataframe in the Power BI file restaurant_orders_result.pbix). Based on this new dataframe, we will generate various chart visualizations in the file restaurant_orders_result_charts.pbix and also attach the charts here. Below is a more detailed description of how I created the new dataframe ‘order_details_table’ and the visualizations, including bar charts and pie charts.
I will use Power BI in this project.
1. Delete all rows where the value is ‘NULL’ in the ‘item_id’ column of the ‘order_details’ dataframe. For this, I use the Power Query Editor and the ‘Keep Rows’ function, keeping all rows except ‘NULL’ values.
2. Combine the two columns ‘order_date’ and ‘order_time’ into one column, ‘order_date_time’, in the format MM/DD/YY HH:MM:SS.
3. Merge the two dataframes into one dataframe, ‘order_details_table’, using the ‘Merge Queries’ function in the Power Query Editor with an inner join (only matching rows). The dataframe ‘restaurant_db_data_dictionary.csv’ tells us that the ‘item_id’ column in the ‘order_details’ table matches ‘menu_item_id’ in the ‘menu_items’ table, so we combine the 2 tables on the common id columns ‘menu_item_id’ and ‘item_id’.
4. Remove the columns we don’t need and create a new ‘order_id’ with a unique number for each order.

As a result, the new dataframe ‘order_details_table’ has 6 columns:
- order_details_id: a unique identifier for each dish within an order
- order_id: the unique identifier for each order or transaction
- order_date_time: the date and time when the order was created (MM/DD/YY HH:MM:SS)
- menu_item_category: the category to which the dish belongs
- menu_item_name: the name of the dish on the menu
- menu_item_price: the price of the dish
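For readers outside Power BI, a rough pandas equivalent of these four Power Query steps, using the column names documented above (the date/time columns are assumed to load as text):

```python
import pandas as pd

order_details = pd.read_csv("order_details.csv")
menu_items = pd.read_csv("menu_items.csv")

# Step 1: keep all rows except NULL item_id.
order_details = order_details.dropna(subset=["item_id"])

# Step 2: combine order_date and order_time into one datetime column
# (both assumed to load as text).
order_details["order_date_time"] = pd.to_datetime(
    order_details["order_date"] + " " + order_details["order_time"])

# Step 3: inner join on item_id / menu_item_id.
order_details_table = order_details.merge(
    menu_items, left_on="item_id", right_on="menu_item_id", how="inner")

# Step 4: rename and keep only the six documented columns
# (the existing order_id is reused here instead of regenerating it).
order_details_table = order_details_table.rename(columns={
    "category": "menu_item_category", "item_name": "menu_item_name",
    "price": "menu_item_price"})[[
    "order_details_id", "order_id", "order_date_time",
    "menu_item_category", "menu_item_name", "menu_item_price"]]
```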
Table order_details_table from the Power BI file restaurant_orders_result.pbix:
[Screenshot: the order_details_table dataframe]
I have also created bar charts and pie charts to display the results from the new dataframe. These plots are included in the file ‘restaurant_orders_result_charts.pbix’ . And you can find pictures of charts below.
[Screenshots: four chart images (bar and pie charts) from restaurant_orders_result_charts.pbix]
I also attached the original and new files to this project, thank you.
License: MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This file is a dataset likely used to study or predict Vitamin D deficiency using lifestyle and demographic data from different individuals.
🧾 Possible Columns in the Dataset (Example):

| Column Name | Explanation |
|:---|:---|
| Age | Age of the individual |
| Gender | Male or Female |
| BMI | Body Mass Index (based on height and weight) |
| Sun Exposure | Amount of daily sunlight exposure |
| Diet Type | Type of diet followed (e.g., vegetarian, balanced) |
| Physical Activity | Level of physical exercise per day/week |
| Vitamin D Level | Blood vitamin D level (e.g., Normal, Deficient, Insufficient) |
🎯 Purpose of the Dataset: This dataset can be used to:
Analyze how lifestyle choices impact Vitamin D levels
Conduct health research
Train machine learning models to predict if a person is at risk of Vitamin D deficiency
🔬 Example Insights You Can Discover:
Whether people under 30 have less sun exposure
If females are more likely to be deficient
How diet and physical activity affect Vitamin D levels
✅ What You Can Do with It:
Summary statistics
Build prediction models (e.g., using machine learning)
Visualizations like:
Bar graphs (e.g., deficiency by gender)
Pie charts (e.g., distribution of vitamin D levels)
Correlation heatmaps (e.g., link between BMI and deficiency)
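A hedged sketch of two of these visualizations, assuming a file named vitamin_d.csv with the example columns from the table above:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("vitamin_d.csv")  # assumed file name

# Bar graph: deficiency by gender (column names per the example table above).
pd.crosstab(df["Gender"], df["Vitamin D Level"]).plot(kind="bar")
plt.ylabel("Count")
plt.show()

# Correlation heatmap over the numeric columns (e.g., BMI vs others).
corr = df.select_dtypes("number").corr()
plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
plt.yticks(range(len(corr)), corr.columns)
plt.colorbar(label="Correlation")
plt.show()
```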