11 datasets found
  1. US Regional Sales Data

    • kaggle.com
    Updated Aug 14, 2023
    Cite
    Abu Talha (2023). US Regional Sales Data [Dataset]. https://www.kaggle.com/datasets/talhabu/us-regional-sales-data/discussion
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abu Talha
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides comprehensive insights into US regional sales data across different sales channels, including In-Store, Online, Distributor, and Wholesale. With a total of 17,992 rows and 15 columns, this dataset encompasses a wide range of information, from order and product details to sales performance metrics. It offers a comprehensive overview of sales transactions and customer interactions, enabling deep analysis of sales patterns, trends, and potential opportunities.

    Columns in the dataset:

    • OrderNumber: A unique identifier for each order.
    • Sales Channel: The channel through which the sale was made (In-Store, Online, Distributor, Wholesale).
    • WarehouseCode: Code representing the warehouse involved in the order.
    • ProcuredDate: Date when the products were procured.
    • OrderDate: Date when the order was placed.
    • ShipDate: Date when the order was shipped.
    • DeliveryDate: Date when the order was delivered.
    • SalesTeamID: Identifier for the sales team involved.
    • CustomerID: Identifier for the customer.
    • StoreID: Identifier for the store.
    • ProductID: Identifier for the product.
    • Order Quantity: Quantity of products ordered.
    • Discount Applied: Applied discount for the order.
    • Unit Cost: Cost of a single unit of the product.
    • Unit Price: Price at which the product was sold.

    This dataset serves as a valuable resource for analysing sales trends, identifying popular products, assessing the performance of different sales channels, and optimising pricing strategies for different regions.

    Visualization Ideas:

    • Time Series Analysis: Plot sales trends over time to identify seasonal patterns and changes in demand.
    • Sales Channel Comparison: Compare sales performance across different channels using bar charts or line graphs.
    • Product Analysis: Visualise the distribution of sales across different products using pie charts or bar plots.
    • Discount Analysis: Analyse the impact of discounts on sales using scatter plots or line graphs.
    • Regional Performance: Create maps to visualise sales performance across different regions.

    Data Modelling and Machine Learning Ideas (Price Prediction):

    • Linear Regression: Build a linear regression model to predict the unit price based on features such as order quantity, discount applied, and unit cost.
    • Random Forest Regression: Use a random forest regression model to predict the price, taking into account multiple features and their interactions.
    • Neural Networks: Train a neural network to predict unit price using deep learning techniques, which can capture complex relationships in the data.
    • Feature Importance Analysis: Identify the most influential features affecting price prediction using techniques like feature importance scores from tree-based models.
    • Time Series Forecasting: Develop a time series forecasting model to predict future prices based on historical sales data.

    These visualisation and modelling ideas can help you gain valuable insights from the sales data and create predictive models to optimise pricing strategies and improve sales performance.
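
    A minimal sketch of the price-prediction idea above (the file name is a placeholder and the column labels follow the list in this description; numeric columns may need cleaning of currency formatting first):

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # File name assumed; column names taken from the description above.
    sales = pd.read_csv("US_Regional_Sales_Data.csv")

    features = ["Order Quantity", "Discount Applied", "Unit Cost"]
    X = sales[features]
    y = sales["Unit Price"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Compare a linear baseline against a random forest on held-out data
    for model in (LinearRegression(), RandomForestRegressor(n_estimators=200, random_state=42)):
        model.fit(X_train, y_train)
        print(type(model).__name__, "R^2:", round(r2_score(y_test, model.predict(X_test)), 3))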

  2. Petre_Slide_CategoricalScatterplotFigShare.pptx

    • figshare.com
    pptx
    Updated Sep 19, 2016
    Cite
    Benj Petre; Aurore Coince; Sophien Kamoun (2016). Petre_Slide_CategoricalScatterplotFigShare.pptx [Dataset]. http://doi.org/10.6084/m9.figshare.3840102.v1
    Available download formats: pptx
    Dataset updated
    Sep 19, 2016
    Dataset provided by
    figshare
    Authors
    Benj Petre; Aurore Coince; Sophien Kamoun
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Categorical scatterplots with R for biologists: a step-by-step guide

    Benjamin Petre1, Aurore Coince2, Sophien Kamoun1

    1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK

    Weissgerber and colleagues (2015) recently stated that ‘as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies’. They called for more scatterplot and boxplot representations in scientific papers, which ‘allow readers to critically evaluate continuous data’ (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.

    Protocol

    • Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the PowerPoint slide. The first column ‘Replicate’ indicates the biological replicates. In the example, the month and year during which the replicate was performed is indicated. The second column ‘Condition’ indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column ‘Value’ contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in ‘File Format’, select .csv). This .csv file is the input file to import into R.

    • Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the PowerPoint slide and paste it into the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.

    • Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the PowerPoint slide for an example.

    Notes

    • Note 1: install the ggplot2 package. The R script requires the package ‘ggplot2’ to be installed. To install it, Packages & Data -> Package Installer -> enter ‘ggplot2’ in the Package Search space and click on ‘Get List’. Select ‘ggplot2’ in the Package column and click on ‘Install Selected’. Install all dependencies as well.

    • Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.

    # 7 Display the graph in a separate window. Dot colors indicate replicates
    graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()

    References

    Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.

    Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035

    Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128

    https://cran.r-project.org/

    http://ggplot2.org/

  3. US Tobacco Use Prevalence

    • kaggle.com
    Updated Dec 19, 2023
    Cite
    The Devastator (2023). US Tobacco Use Prevalence [Dataset]. https://www.kaggle.com/datasets/thedevastator/us-tobacco-use-prevalence/suggestions?status=pending&yourSuggestions=true
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 19, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    US Tobacco Use Prevalence

    US Tobacco Use Prevalence by Year, State, Type, and Age

    By Throwback Thursday [source]

    About this dataset

    This dataset contains comprehensive information on tobacco use in the United States from 2011 to 2016. The data is sourced from the CDC Behavioral Risk Factor Survey, a reliable and extensive survey that captures important data about tobacco use behaviors across different states in the United States.

    The dataset includes various key variables such as the year of data collection, state abbreviation indicating where the data was collected, and specific tobacco types explored in the survey. It also provides valuable insight into the prevalence of tobacco use through quantitative measures represented by numeric values. The unit of measurement for these values, such as percentages or numbers, is included as well.

    Moreover, this dataset offers an understanding of how different age groups are affected by tobacco use, with age being categorized into distinct groups. This ensures that researchers and analysts can assess variations in tobacco consumption and its associated health implications across different age demographics.

    With all these informative attributes arranged in a convenient tabular format, this dataset serves as a valuable resource for investigating patterns and trends related to tobacco use within varying contexts over a six-year period.

    How to use the dataset

    Introduction:

    Step 1: Familiarize Yourself with the Columns

    Before diving into any analysis, it is important to understand the structure of the dataset by familiarizing yourself with its columns. Here are the key columns in this dataset:

    • Year: The year in which the data was collected (Numeric)
    • State Abbreviation: The abbreviation of the state where the data was collected (String)
    • Tobacco Type: The type of tobacco product used (String)
    • Data Value: The percentage or number representing prevalence of tobacco use (Numeric)
    • Data Value Unit: The unit of measurement for data value (e.g., percentage, number) (String)
    • Age: The age group to which the data value corresponds (String)

    Step 2: Determine Your Research Questions or Objectives

    To make effective use of this dataset, it is essential to clearly define your research questions or objectives. Some potential research questions related to this dataset could be:

    • How has tobacco use prevalence changed over time?
    • Which states have the highest and lowest rates of tobacco use?
    • What are the most commonly used types of tobacco products?
    • Is there a correlation between age group and tobacco use?

    By defining your research questions or objectives upfront, you can focus your analysis accordingly.

    Step 3: Analyzing Trends Over Time

    To analyze trends over time using this dataset:

    • Group and aggregate relevant columns such as Year and Data Value.
    • Plot the data using line graphs or bar charts to visualize the changes in tobacco use prevalence over time.
    • Interpret the trends and draw conclusions from your analysis.

    Step 4: Comparing States

    To compare states and their tobacco use prevalence:

    • Group and aggregate relevant columns such as State Abbreviation and Data Value.
    • Sort the data based on prevalence rates to identify states with the highest and lowest rates of tobacco use.
    • Visualize this comparison using bar charts or maps for a clearer understanding.
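
    As a concrete illustration of Steps 3 and 4, a minimal pandas sketch (the file name is a placeholder and the column labels follow Step 1; check the exact spelling in the CSV header):

    import pandas as pd
    import matplotlib.pyplot as plt

    tobacco = pd.read_csv("us_tobacco_use_prevalence.csv")  # file name assumed

    # Step 3: average prevalence per year
    by_year = tobacco.groupby("Year")["Data Value"].mean()
    by_year.plot(marker="o", title="Mean tobacco use prevalence by year")
    plt.ylabel("Data Value")
    plt.show()

    # Step 4: states ranked by mean prevalence
    by_state = tobacco.groupby("State Abbreviation")["Data Value"].mean().sort_values()
    print("Lowest:", by_state.head(5).round(1).to_dict())
    print("Highest:", by_state.tail(5).round(1).to_dict())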

    Step 5: Understanding Tobacco Types

    To gain insights into different types of tobacco products used: - Analyze the Tobacco

    Research Ideas

    • Analyzing trends in tobacco use: This dataset can be used to analyze the prevalence of tobacco use over time and across different states. It can help identify patterns and trends in tobacco consumption, which can be valuable for public health research and policy-making.
    • Assessing the impact of anti-smoking campaigns: Researchers or organizations working on anti-smoking campaigns can use this dataset to evaluate the effectiveness of their interventions. By comparing the data before and after a campaign, they can determine whether there has been a decrease in tobacco use and if specific groups or regions have responded better to the campaign.
    • Understanding demographic factors related to tobacco use: The dataset includes information on age groups, allowing for analysis of how different age demographics are affected by tobacco use. By examining data value variations across age groups, researchers can gain insights into which populations are most vulnerable to smoking-related issues and design targeted prevention programs an...
  4. Airlines Flights Data

    • kaggle.com
    Updated Jul 29, 2025
    Cite
    Data Science Lovers (2025). Airlines Flights Data [Dataset]. https://www.kaggle.com/datasets/rohitgrewal/airlines-flights-data
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science Lovers
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Project video available on YouTube: https://youtu.be/gu3Ot78j_Gc

    Airlines Flights Dataset for Different Cities

    The flight booking dataset for various airlines was scraped date-wise from a well-known travel website and is provided in a structured format. It contains records of flights between cities in India, with features such as source and destination city, arrival and departure time, duration, and price of the flight.

    This data is available as a CSV file, which we analyze using a Pandas DataFrame.

    This analysis will be helpful for those working in the airline and travel domains.

    Using this dataset, we answered multiple questions with Python in our Project.

    Q.1. What are the airlines in the dataset, accompanied by their frequencies?

    Q.2. Show Bar Graphs representing the Departure Time & Arrival Time.

    Q.3. Show Bar Graphs representing the Source City & Destination City.

    Q.4. Does the price vary with the airline?

    Q.5. Does ticket price change based on the departure time and arrival time?

    Q.6. How does the price change with the source and destination cities?

    Q.7. How is the price affected when tickets are bought in just 1 or 2 days before departure?

    Q.8. How does the ticket price vary between Economy and Business class?

    Q.9. What is the average price of a Vistara flight from Delhi to Hyderabad in Business class?
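
    A minimal pandas sketch for a few of these questions (the file name is a placeholder, and the snake_case column names are assumptions that must be adjusted to the actual CSV header; the available features are listed below):

    import pandas as pd

    flights = pd.read_csv("airlines_flights_data.csv")  # file name assumed

    # Q.1: airlines and their frequencies
    print(flights["airline"].value_counts())

    # Q.4: does the price vary with the airline?
    print(flights.groupby("airline")["price"].mean().sort_values())

    # Q.9: average Business-class price for a Vistara flight from Delhi to Hyderabad
    mask = (
        (flights["airline"] == "Vistara")
        & (flights["source_city"] == "Delhi")
        & (flights["destination_city"] == "Hyderabad")
        & (flights["class"] == "Business")
    )
    print(flights.loc[mask, "price"].mean())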

    These are the main Features/Columns available in the dataset :

    1) Airline: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.

    2) Flight: Flight stores information regarding the plane's flight code. It is a categorical feature.

    3) Source City: City from which the flight takes off. It is a categorical feature having 6 unique cities.

    4) Departure Time: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.

    5) Stops: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.

    6) Arrival Time: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.

    7) Destination City: City where the flight will land. It is a categorical feature having 6 unique cities.

    8) Class: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.

    9) Duration: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.

    10) Days Left: This is a derived feature calculated by subtracting the booking date from the trip date.

    11) Price: The target variable; it stores the ticket price.

  5. Summary for Policymakers of the Working Group I Contribution to the IPCC...

    • catalogue.ceda.ac.uk
    • data-search.nerc.ac.uk
    Updated Mar 9, 2024
    Cite
    Joeri Rogelj; Chris Smith; Gian-Kasper Plattner; Malte Meinshausen; Sophie Szopa; Sebastian Milinski; Jochem Marotzke (2024). Summary for Policymakers of the Working Group I Contribution to the IPCC Sixth Assessment Report - data for Figure SPM.4 (v20210809) [Dataset]. https://catalogue.ceda.ac.uk/uuid/bd65331b1d344ccca44852e495d3a049
    Dataset updated
    Mar 9, 2024
    Dataset provided by
    Centre for Environmental Data Analysis (http://www.ceda.ac.uk/)
    Authors
    Joeri Rogelj; Chris Smith; Gian-Kasper Plattner; Malte Meinshausen; Sophie Szopa; Sebastian Milinski; Jochem Marotzke
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2015 - Dec 31, 2100
    Area covered
    Earth
    Description

    Data for Figure SPM.4 from the Summary for Policymakers (SPM) of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6).

    Figure SPM.4 panel a shows global emissions projections for CO2 and a set of key non-CO2 climate drivers, for the core set of five IPCC AR6 scenarios. Figure SPM.4 panel b shows attributed warming in 2081-2100 relative to 1850-1900 for total anthropogenic, CO2, other greenhouse gases, and other anthropogenic forcings for five Shared Socio-economic Pathway (SSP) scenarios.

    How to cite this dataset

    When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates:

    IPCC, 2021: Summary for Policymakers. In: Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. PĂ©an, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 3−32, doi:10.1017/9781009157896.001.

    Figure subpanels

    The figure has two panels, with data provided for all panels in subdirectories named panel_a and panel_b.

    List of data provided

    This dataset contains:

    • Projected emissions from 2015 to 2100 for the five scenarios of the AR6 WGI core scenario set (SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5)
    • Projected warming for all anthropogenic forcers, CO2 only, non-CO2 greenhouse gases (GHGs) only, and other anthropogenic components for 2081-2100 relative to 1850-1900, for SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0 and SSP5-8.5.

    The five illustrative SSP (Shared Socio-economic Pathway) scenarios are described in Box SPM.1 of the Summary for Policymakers and Section 1.6.1.1 of Chapter 1.

    Data provided in relation to figure

    Panel a:

    The first column includes the years, while the next columns include the data per scenario and per climate forcer for the line graphs.

    • Data file: Carbon_dioxide_Gt_CO2_yr.csv relates to the carbon dioxide emissions panel.
    • Data file: Methane_Mt_CO2_yr.csv relates to the methane emissions panel.
    • Data file: Nitrous_oxide_Mt N2O_yr.csv relates to the nitrous oxide emissions panel.
    • Data file: Sulfur_dioxide_Mt SO2_yr.csv relates to the sulfur dioxide emissions panel.

      Panel b:

    • Data file: ts_warming_ranges_1850-1900_base_panel_b.csv. [Rows 2 to 5 relate to the first bar chart (cyan). Rows 6 to 9 relate to the second bar chart (blue). Rows 10 to 13 relate to the third bar chart (orange). Rows 14 to 17 relate to the fourth bar chart (red). Rows 18 to 21 relate to the fifth bar chart (brown).].
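
    A minimal sketch of reading one of the panel a files listed above and plotting the emissions trajectories, assuming the layout described above (first column years, one column per scenario); the local path is a placeholder:

    import pandas as pd
    import matplotlib.pyplot as plt

    co2 = pd.read_csv("panel_a/Carbon_dioxide_Gt_CO2_yr.csv")  # path within the downloaded dataset assumed

    years = co2.iloc[:, 0]
    for scenario in co2.columns[1:]:
        plt.plot(years, co2[scenario], label=scenario)

    plt.xlabel("Year")
    plt.ylabel("CO2 emissions (Gt CO2 / yr)")
    plt.legend()
    plt.show()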

    Sources of additional information

    The following weblinks are provided in the Related Documents section of this catalogue record:

    • Link to the report webpage, which includes the report component containing the figure (Summary for Policymakers) and the Supplementary Material for Chapter 1, which contains details on the input data used in Table 1.SM.1 (Cross-Chapter Box 1.4, Figure 2).
    • Link to the related publication for input data used in panel a.

  6. machine learning models on the WDBC dataset

    • scidb.cn
    Updated Apr 15, 2025
    Cite
    Mahdi Aghaziarati (2025). machine learning models on the WDBC dataset [Dataset]. http://doi.org/10.57760/sciencedb.23537
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Mahdi Aghaziarati
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset used in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, originally provided by the University of Wisconsin and obtained via Kaggle. It consists of 569 observations, each corresponding to a digitized image of a fine needle aspirate (FNA) of a breast mass. The dataset contains 32 attributes: one identifier column (discarded during preprocessing), one diagnosis label (malignant or benign), and 30 continuous real-valued features that describe the morphology of cell nuclei. These features are grouped into three statistical descriptors—mean, standard error (SE), and worst (mean of the three largest values)—for ten morphological properties including radius, perimeter, area, concavity, and fractal dimension.

    All feature values were normalized using z-score standardization to ensure uniform scale across models sensitive to input ranges. No missing values were present in the original dataset. Label encoding was applied to the diagnosis column, assigning 1 to malignant and 0 to benign cases. The dataset was split into training (80%) and testing (20%) sets while preserving class balance via stratified sampling.

    The accompanying Python source code (breast_cancer_classification_models.py) performs data loading, preprocessing, model training, evaluation, and result visualization. Four lightweight classifiers—Decision Tree, Naïve Bayes, Perceptron, and K-Nearest Neighbors (KNN)—were implemented using the scikit-learn library (version 1.2 or later). Performance metrics including Accuracy, Precision, Recall, F1-score, and ROC-AUC were calculated for each model. Confusion matrices and ROC curves were generated and saved as PNG files for interpretability.

    All results are saved in a structured CSV file (classification_results.csv) that contains the performance metrics for each model. Supplementary visualizations include all_feature_histograms.png (distribution plots for all standardized features), model_comparison.png (metric-wise bar plot), and feature_correlation_heatmap.png (Pearson correlation matrix of all 30 features). The data files are in standard CSV and PNG formats and can be opened using any spreadsheet or image viewer, respectively. No rare file types are used, and all scripts are compatible with any Python 3.x environment. This data package enables reproducibility and offers a transparent overview of how baseline machine learning models perform in the domain of breast cancer diagnosis using a clinically-relevant dataset.
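
    As an illustration, a minimal scikit-learn sketch of the pipeline described above for one of the four models (KNN); the file name, column names, and M/B label coding are assumptions based on the common Kaggle export of this dataset:

    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import StandardScaler

    wdbc = pd.read_csv("wdbc.csv")  # file name assumed
    wdbc = wdbc.dropna(axis=1, how="all")  # drop fully empty columns, if any
    X = wdbc.drop(columns=["id", "diagnosis"], errors="ignore")
    y = wdbc["diagnosis"].map({"M": 1, "B": 0})  # 1 = malignant, 0 = benign

    # z-score standardization, then a stratified 80/20 split as described above
    X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    pred = knn.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, pred))
    print("F1-score:", f1_score(y_test, pred))
    print("ROC-AUC:", roc_auc_score(y_test, knn.predict_proba(X_test)[:, 1]))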

  7. Classification of web-based Digital Humanities projects leveraging...

    • zenodo.org
    csv, tsv
    Updated Aug 27, 2025
    Cite
    Tommaso Battisti; Tommaso Battisti (2025). Classification of web-based Digital Humanities projects leveraging information visualisation techniques [Dataset]. http://doi.org/10.5281/zenodo.14192758
    Available download formats: tsv, csv
    Dataset updated
    Aug 27, 2025
    Dataset provided by
    Zenodo
    Authors
    Tommaso Battisti; Tommaso Battisti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a list of 186 Digital Humanities projects leveraging information visualisation methods. Each project has been classified according to visualisation and interaction techniques, narrativity and narrative solutions, domain, methods for the representation of uncertainty and interpretation, and the employment of critical and custom approaches to visually represent humanities data.

    Classification schema: categories and columns

    The project_id column contains unique internal identifiers assigned to each project. Meanwhile, the last_access column records the most recent date (in DD/MM/YYYY format) on which each project was reviewed based on the web address specified in the url column.
    The remaining columns can be grouped into descriptive categories aimed at characterising projects according to different aspects:

    Narrativity. It reports the presence of information visualisation techniques employed within narrative structures. Here, the term narrative encompasses both author-driven linear data stories and more user-directed experiences where the narrative sequence is determined by user exploration [1]. We define 2 columns to identify projects using visualisation techniques in narrative, or non-narrative sections. Both conditions can be true for projects employing visualisations in both contexts. Columns:

    • non_narrative (boolean)

    • narrative (boolean)

    Domain. The humanities domain to which the project is related. We rely on [2] and the chapters of the first part of [3] to abstract a set of general domains. Column:

    • domain (categorical):

      • History and archaeology

      • Art and art history

      • Language and literature

      • Music and musicology

      • Multimedia and performing arts

      • Philosophy and religion

      • Other: both extra-list domains and cases of collections without a unique or specific thematic focus.

    Visualisation of uncertainty and interpretation. Building upon the frameworks proposed by [4] and [5], a set of categories was identified, highlighting a distinction between precise and impressional communication of uncertainty. Precise methods explicitly represent quantifiable uncertainty such as missing, unknown, or uncertain data, precisely locating and categorising it using visual variables and positioning. Two sub-categories are interactive distinction, when uncertain data is not visually distinguishable from the rest of the data but can be dynamically isolated or included/excluded categorically through interaction techniques (usually filters); and visual distinction, when uncertainty visually “emerges” from the representation by means of dedicated glyphs and spatial or visual cues and variables. On the other hand, impressional methods communicate the constructed and situated nature of data [6], exposing the interpretative layer of the visualisation and indicating more abstract and unquantifiable uncertainty using graphical aids or interpretative metrics. Two sub-categories are: ambiguation, when the use of graphical expedients—like permeable glyph boundaries or broken lines—visually conveys the ambiguity of a phenomenon; and interpretative metrics, when expressive, non-scientific, or non-punctual metrics are used to build a visualisation. Column:

    • uncertainty_interpretation (categorical):

      • Interactive distinction

      • Visual distinction

      • Ambiguation

      • Interpretative metrics

    Critical adaptation. We identify projects in which, with regards to at least a visualisation, the following criteria are fulfilled: 1) avoid repurposing of prepackaged, generic-use, or ready-made solutions; 2) being tailored and unique to reflect the peculiarities of the phenomena at hand; 3) avoid simplifications to embrace and depict complexity, promoting time-consuming visualisation-based inquiry. Column:

    • critical_adaptation (boolean)

    Non-temporal visualisation techniques. We adopt and partially adapt the terminology and definitions from [7]. A column is defined for each type of visualisation and accounts for its presence within a project, also including stacked layouts and more complex variations. Columns and inclusion criteria:

    • plot (boolean): visual representations that map data points onto a two-dimensional coordinate system.

    • cluster_or_set (boolean): sets or cluster-based visualisations used to unveil possible inter-object similarities.

    • map (boolean): geographical maps used to show spatial insights. While we do not specify the variants of maps (e.g., pin maps, dot density maps, flow maps, etc.), we make an exception for maps where each data point is represented by another visualisation (e.g., a map where each data point is a pie chart) by accounting for the presence of both in their respective columns.

    • network (boolean): visual representations highlighting relational aspects through nodes connected by links or edges.

    • hierarchical_diagram (boolean): tree-like structures such as tree diagrams, radial trees, but also dendrograms. They differ from networks for their strictly hierarchical structure and absence of closed connection loops.

    • treemap (boolean): still hierarchical, but highlighting quantities expressed by means of area size. It also includes circle packing variants.

    • word_cloud (boolean): clouds of words, where each instance’s size is proportional to its frequency in a related context

    • bars (boolean): includes bar charts, histograms, and variants. It coincides with “bar charts” in [7] but with a more generic term to refer to all bar-based visualisations.

    • line_chart (boolean): the display of information as sequential data points connected by straight-line segments.

    • area_chart (boolean): similar to a line chart but with a filled area below the segments. It also includes density plots.

    • pie_chart (boolean): circular graphs divided into slices which can also use multi-level solutions.

    • plot_3d (boolean): plots that use a third dimension to encode an additional variable.

    • proportional_area (boolean): representations used to compare values through area size. Typically, using circle- or square-like shapes.

    • other (boolean): it includes all other types of non-temporal visualisations that do not fall into the aforementioned categories.

    Temporal visualisations and encodings. In addition to non-temporal visualisations, a group of techniques to encode temporality is considered in order to enable comparisons with [7]. Columns:

    • timeline (boolean): the display of a list of data points or spans in chronological order. They include timelines working either with a scale or simply displaying events in sequence. As in [7], we also include structured solutions resembling Gantt chart layouts.

    • temporal_dimension (boolean): to report when time is mapped to any dimension of a visualisation, with the exclusion of timelines. We use the term “dimension” and not “axis” as in [7] as more appropriate for radial layouts or more complex representational choices.

    • animation (boolean): temporality is perceived through an animation changing the visualisation according to time flow.

    • visual_variable (boolean): another visual encoding strategy is used to represent any temporality-related variable (e.g., colour).

    Interaction techniques. A set of categories to assess affordable interaction techniques based on the concept of user intent [8] and user-allowed data actions [9]. The following categories roughly match the “processing”, “mapping”, and “presentation” actions from [9] and the manipulative subset of methods of the “how” an interaction is performed in the conception of [10]. Only interactions that affect the visual representation or the aspect of data points, symbols, and glyphs are taken into consideration. Columns:

    • basic_selection (boolean): the demarcation of an element either for the duration of the interaction or more permanently until the occurrence of another selection.

    • advanced_selection (boolean): the demarcation involves both the selected element and connected elements within the visualisation or leads to brush and link effects across views. Basic selection is tacitly implied.

    • navigation (boolean): interactions that allow moving, zooming, panning, rotating, and scrolling the view but only when applied to the visualisation and not to the web page. It also includes “drill” interactions (to navigate through different levels or portions of data detail, often generating a new view that replaces or accompanies the original) and “expand” interactions generating new perspectives on data by expanding and collapsing nodes.

    • arrangement (boolean): methods to organise visualisation elements (symbols, glyphs, etc.) or
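
    A small pandas sketch of how the classification schema above can be queried once the file is downloaded (the file name is a placeholder; the column names follow the schema, and the boolean columns are assumed to parse as booleans or 0/1 values):

    import pandas as pd

    projects = pd.read_csv("dh_projects.csv")  # file name assumed; use sep="\t" for the TSV export

    # Narrative projects with at least one critically adapted visualisation, counted per domain
    subset = projects[(projects["narrative"] == True) & (projects["critical_adaptation"] == True)]
    print(subset["domain"].value_counts())

    # How often maps co-occur with networks, timelines, and plots
    print(projects.loc[projects["map"] == True, ["network", "timeline", "plot"]].sum())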

  8. OECD Alcohol Consumption per Capita

    • kaggle.com
    Updated Dec 4, 2023
    Cite
    The Devastator (2023). OECD Alcohol Consumption per Capita [Dataset]. https://www.kaggle.com/datasets/thedevastator/oecd-alcohol-consumption-per-capita
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 4, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    Description

    OECD Alcohol Consumption per Capita

    Alcohol Consumption in OECD Countries

    By Andy Kriebel [source]

    About this dataset

    How to use the dataset

    Here is a step-by-step guide on how to effectively use this dataset:

    Step 1: Understanding the Columns

    The dataset consists of several columns that provide important information about the data. Let's briefly explain each column:

    • LOCATION: Represents the country or region for which the data is reported (Categorical).
    • INDICATOR: Refers to the specific indicator or measurement being reported (Categorical).
    • SUBJECT: Indicates the subject or topic to which the indicator relates (Categorical).
    • MEASURE: Represents the unit of measurement for each indicator (Categorical).
    • FREQUENCY: Specifies how frequently data is reported, such as annually or quarterly (Categorical).
    • TIME: Represents the time period for which data is reported. This column contains numeric values without specific dates.
    • LITRES/CAPITA: Shows the amount of alcohol consumed per capita, measured in litres per person (Numeric).
    • Flag Codes: Includes codes indicating any flags or notes associated with the data (Categorical).

    Step 2: Identifying Location and Indicator

    • To start analyzing this dataset, you need to decide on a specific location(s) and indicator(s) that interest you.
    • Scan through unique values in columns like LOCATION, INDICATOR, SUBJECT, and MEASURE to identify interesting locations and indicators related to your analysis goals.

    Step 3: Filtering Data

    Once you have identified your preferred location(s) and indicator(s), filter out irrelevant rows using these criteria. For example:

    Select rows where LOCATION equals United States AND INDICATOR equals Total alcohol consumption AND MEASURE equals Litres per capita.

    Step 4: Visualizing the Data

    • After filtering out relevant data, you can perform various visualizations and statistical analysis to gain insights.
    • Plotting a line graph over time (using TIME on the x-axis and LITRES/CAPITA on the y-axis) can help identify trends in alcohol consumption per capita.
    • You can also compare multiple locations by creating grouped bar charts or stacked area plots.
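
    A minimal pandas sketch of Steps 2-4 (the file name is a placeholder; column labels follow Step 1, and the actual LOCATION codes should be checked against the data):

    import pandas as pd
    import matplotlib.pyplot as plt

    oecd = pd.read_csv("oecd_alcohol_consumption.csv")  # file name assumed

    # Step 2: inspect the available locations and indicators
    print(oecd["LOCATION"].unique())
    print(oecd["INDICATOR"].unique())

    # Step 3: filter to one location (value assumed; use whatever code the LOCATION column actually contains)
    usa = oecd[oecd["LOCATION"] == "USA"].sort_values("TIME")

    # Step 4: litres per capita over time
    plt.plot(usa["TIME"], usa["LITRES/CAPITA"], marker="o")
    plt.xlabel("Year")
    plt.ylabel("Litres per capita")
    plt.title("Alcohol consumption per capita over time")
    plt.show()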

    Step 5: Exploring Flags and Notes - Check the Flag Codes column for any associated flags or notes that

    Research Ideas

    • Comparative Analysis: This dataset can be used to compare alcohol consumption per capita across different countries or regions. Researchers or policymakers can analyze the data to identify trends, patterns, and variations in alcohol consumption levels among OECD countries.
    • Health Impact Assessment: The dataset can be utilized to assess the impact of alcohol consumption on public health. By examining the per capita alcohol consumption rates in different countries and regions, researchers can analyze the correlation between alcohol consumption and health outcomes such as liver diseases, accidents, and other related health issues.
    • Policy Evaluation: The dataset provides valuable information for evaluating the effectiveness of alcohol-related policies implemented by various governments or organizations in OECD countries. It allows policymakers to assess whether certain policies aimed at reducing excessive drinking have had a significant impact on per capita alcohol consumption rates over time

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and the original data source.

    License

    License: Dataset copyright by authors.

    You are free to:
    • Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt: remix, transform, and build upon the material for any purpose, even commercially.

    You must:
    • Give appropriate credit: provide a link to the license, and indicate if changes were made.
    • ShareAlike: distribute your contributions under the same license as the original.
    • Keep intact: all notices that refer to this license, including copyright notices.

    Columns

    • LOCATION: The country or region for which the alcohol consumption data is...
  9. Classification and Quantification of Strawberry Fruit Shape

    • data.niaid.nih.gov
    Updated Apr 24, 2020
    Cite
    Feldmann, Mitchell J. (2020). Classification and Quantification of Strawberry Fruit Shape [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3365714
    Dataset updated
    Apr 24, 2020
    Dataset authored and provided by
    Feldmann, Mitchell J.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "Classification and Quantification of Strawberry Fruit Shape" is a dataset that includes raw RGB images and binary images of strawberry fruit. These folders contain JPEG images taken from the same experimental units on 2 different harvest dates. Images in each folder are labeled according to the 4 digit plot ID from the field experiment (####_) and the 10 digit individual ID (_##########).

    "H1" and "H2" folders contain RGB images of multiple fruits. Each fruit was extracted and binarized to become the images in "H1_indiv" and "H2_indiv".

    "H1_indiv" and "H2_indiv" folders contain images of individual fruit. Each fruit is bordered by ten white pixels. There are a total of 6,874 images between these two folders. These images were then resized and scaled to become the images in "ReSized".

    "ReSized" contains 6,874 binary images of individual berries. These images are all square images (1000x1000px) with the object represented by black pixels (0) and background represented with white pixels (1). Each image was scaled so that it would take up the maximum number of pixels in a 1000 x 1000px image and would maintain the aspect ratio.

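    A small sketch of how basic shape descriptors could be read off one of the "ReSized" binary images, using the 0 = fruit / 1 = background convention described above (the file name is a placeholder):

    import numpy as np
    from PIL import Image

    img = np.array(Image.open("ReSized/0001_0000000001.jpg").convert("L"))  # placeholder file name
    fruit = img < 128  # dark pixels are the fruit; light pixels are the background

    area = fruit.sum()  # number of fruit pixels
    rows = np.where(fruit.any(axis=1))[0]
    cols = np.where(fruit.any(axis=0))[0]
    height = rows.max() - rows.min() + 1  # bounding-box height in pixels
    width = cols.max() - cols.min() + 1   # bounding-box width in pixels
    print("area:", area, "aspect ratio (h/w):", round(height / width, 3))
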
    "Fruit_image_data.csv" contains all of the morphometric features extracted from individual images including intermediate values.

    All images title with the form "B##_NA" were discarded prior to any analyses. These images come from the buffer plots, not the experimental units of the study.

    "PPKC_Figures.zip" contains all figures (F1-F7) and supplemental figures (S1-S7) from the manuscript. Captions for the main figures are found in the manuscript. Captions for the supplemental figures are below.

    Fig. S1 Results of PPKC against original cluster assignments. Ordered centroids from k = 2 to k = 8. On the left are the unordered assignments from k-means, and the on the right are the order assignments following PPKC. Cluster position indicated on the right [1, 8].

    Fig. S2 Optimal Value of k. (A) Total within-cluster sum of squares. (B) The inverse of the adjusted R². (C) Akaike information criterion (AIC). (D) Bayesian information criterion (BIC). All metrics were calculated on a random sample of 3,437 images (50%). 10 samples were randomly drawn. The vertical dashed line in each plot represents the optimal value of k. Reported metrics are standardized to be between [0, 1].

    Fig. S3 Hierarchical clustering and distance between classes on PC1. The relationship between clusters at each value of k is represented as both a dendrogram and a bar plot. The labels on the dendrogram (i.e., V1, V2, V3,..., V10) represent the original cluster assignment from k-means. The barplot to the right of each dendrogram depicts the elements of the eigenvector associated with the largest eigenvalue from PPKC. The labels above each line represent the original cluster assignment.

    Fig. S4 BLUPs for 13 selected features. For each plot, the X-axis is the index and the Y-axis is the BLUP value estimated from a linear mixed model. Grey points represent the mean feature value for each individual. Each point is the BLUP for a single genotype.

    Fig. S5 Effects of Eigenfruit, Vertical Biomass, and Horizontal Biomass Analyses. (A) Effects of PC [1, 7] from the Eigenfruit analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the horizontal pixel position. The vertical axis is the vertical pixel position. (B) Effects of PC [1, 3] from the Horizontal Biomass analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the vertical position from the image (height). The vertical axis is the number of activated pixels (RowSum) at the given vertical position. (C) Effects of PC [1, 3] from the Vertical Biomass analysis on the mean shape (center column). The left column is the mean shape minus 1.5× the standard deviation. Right is the mean shape plus 1.5× the standard deviation. The horizontal axis is the horizontal position from the image (width). The vertical axis is the number of activated pixels (ColSum) at the given horizontal position.

    Fig. S6 PPKC with variable sample size. Ordered centroids from k = 2 to k = 5 using different image sets for clustering. For all k = [2, 5], k-means clustering was performed using either 100%, 80%, 50%, or 20% of the total number of images (6,874, 5,500, 3,437, and 1,374 images, respectively). Cluster position indicated on the right [1, 5].

    Fig. S7 Comparison of scale and continuous features. (A.) PPKC 4-unit ordinal scale. (B.) Distributions of the selected features with each level of k = 4 from the PPKC 4-unit ordinal scale. The light gray line is cluster 1, the medium gray line is cluster 2, the dark gray line is cluster 3, and the black line is cluster 4.

  10. Diabetes_Dataset_1.1

    • kaggle.com
    Updated Nov 2, 2023
    Cite
    KIRANMAYI G 777 (2023). Diabetes_Dataset_1.1 [Dataset]. https://www.kaggle.com/datasets/kiranmayig777/diabetes-dataset-1-1/versions/1
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 2, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    KIRANMAYI G 777
    Description

    import pandas as pd
    import numpy as np

    # Load the dataset (file name assumed; adjust to your local copy)
    data = pd.read_csv("diabetes_dataset.csv")

    PERFORMING EDA

    data.head()
    data.info()

    attributes_data = data.iloc[:, 1:]
    attributes_data

    attributes_data.describe()
    attributes_data.corr()

    import seaborn as sns
    import matplotlib.pyplot as plt

    Calculate correlation matrix

    correlation_matrix = attributes_data.corr()
    plt.figure(figsize=(18, 10))

    Create a heatmap

    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.show()

    CHECKING IF DATASET IS LINEAR OR NON-LINEAR

    Calculate correlations between target and predictor columns

    correlations = data.corr()['Diabetes_binary'].drop('Diabetes_binary')

    Create a bar chart

    plt.figure(figsize=(10, 6))
    correlations.plot(kind='bar')
    plt.xlabel('Predictor Columns')
    plt.ylabel('Correlation values')
    plt.title('Correlation between Diabetes_binary and Predictors')
    plt.show()

    CHECKING FOR NULL AND MISSING VALUES, CLEANING THEM

    Count the number of null values in each column

    print(data.isnull().sum())

    to check for missing values in all columns

    print(data.isna().sum())

    LASSO

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV, KFold

    X = data.iloc[:, 1:]
    y = data.iloc[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    GridSearchCV is used to find the optimal combination of hyperparameters for a given model.

    So, in the end, we can select the best parameters from the listed hyperparameters.

    # 500 candidate alphas; the original np.arange(0.00001, 10, 500) would yield only a single value
    parameters = {"alpha": np.linspace(0.00001, 10, 500)}
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    lassoReg = Lasso()
    lasso_cv = GridSearchCV(lassoReg, param_grid=parameters, cv=kfold)
    lasso_cv.fit(X, y)
    print("Best Params {}".format(lasso_cv.best_params_))

    column_names = list(data)
    column_names = column_names[1:]
    column_names

    lassoModel = Lasso(alpha=0.00001)
    lassoModel.fit(X_train, y_train)
    lasso_coeff = np.abs(lassoModel.coef_)  # making all coefficients positive
    plt.bar(column_names, lasso_coeff, color='orange')
    plt.xticks(rotation=90)
    plt.grid()
    plt.title("Feature Selection Based on Lasso")
    plt.xlabel("Features")
    plt.ylabel("Importance")
    plt.ylim(0, 0.16)
    plt.show()

    RFE

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    from sklearn.feature_selection import RFECV
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier()
    rfecv = RFECV(estimator=model, step=1, cv=20, scoring="accuracy")
    rfecv = rfecv.fit(X_train, y_train)

    Cross-validation scores

    # rfecv.ranking_ holds feature ranks, not scores; the mean cross-validated accuracy lives in cv_results_
    cv_scores = rfecv.cv_results_["mean_test_score"]
    num_features_selected = len(cv_scores)

    Plotting the number of features vs. cross-validation score

    plt.figure(figsize=(10, 6))
    plt.xlabel("Number of features selected")
    plt.ylabel("Score (accuracy)")
    plt.plot(range(1, num_features_selected + 1), cv_scores, marker='o', color='r')
    plt.xticks(range(1, num_features_selected + 1))  # Set x-ticks to integers
    plt.grid()
    plt.title("RFECV: Number of Features vs. Score (accuracy)")
    plt.show()

    print("The optimal number of features:", rfecv.n_features_)
    print("Best features:", X_train.columns[rfecv.support_])

    PCA

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = data.drop(["Diabetes_binary"], axis=1)
    y = data["Diabetes_binary"]

    df1 = pd.DataFrame(data=data, columns=data.columns)
    print(df1)

    scaling = StandardScaler()
    scaling.fit(df1)
    Scaled_data = scaling.transform(df1)
    principal = PCA(n_components=3)
    principal.fit(Scaled_data)
    x = principal.transform(Scaled_data)
    print(x.shape)

    principal.components_

    plt.figure(figsize=(10, 10))

    plt.scatter(x[:, 0], x[:, 1], c=data['Diabetes_binary'], cmap='plasma')
    plt.xlabel('pc1')
    plt.ylabel('pc2')

    print(principal.explained_variance_ratio_)

    T-SNE

    from sklearn.manifold import TSNE
    from numpy import reshape
    import seaborn as sns

    tsne = TSNE(n_components=3, verbose=1, random_state=42)
    z = tsne.fit_transform(X)

    df = pd.DataFrame()
    df["y"] = y
    df["comp-1"] = z[:, 0]
    df["comp-2"] = z[:, 1]
    df["comp-3"] = z[:, 2]
    sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                    palette=sns.color_palette("husl", 2),
                    data=df).set(title="Diabetes data T-SNE projection")

  11. NBA Player Performance Stats

    • kaggle.com
    Updated Mar 10, 2023
    Cite
    Abdul Wahab (2023). NBA Player Performance Stats [Dataset]. https://www.kaggle.com/datasets/iabdulw/nba-player-performance-stats/suggestions
    Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abdul Wahab
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The goal of this project was to extract data from an NBA stats website using web scraping techniques and then perform data analysis to create visualizations using Python. The website used was "https://www.basketball-reference.com/", which contains data on players and teams in the NBA. The code for this project can be found on my GitHub repository at "https://github.com/Duggsdaddy/Srihith_I310D.git".

    The data was extracted using the BeautifulSoup library in Python, and the data was stored in a Pandas DataFrame. The data was cleaned and processed to remove any unnecessary columns or rows, and the data types of the columns were checked and corrected where necessary.
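
    A minimal sketch of that scraping step (the URL and column check are illustrative, not taken from the project's code):

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from io import StringIO

    # Illustrative stats page; basketball-reference.com hosts per-game player tables of this kind
    url = "https://www.basketball-reference.com/leagues/NBA_2023_per_game.html"
    html = requests.get(url, timeout=30).text

    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # first stats table on the page
    players = pd.read_html(StringIO(str(table)))[0]

    # Drop the repeated header rows that long tables embed in the body
    players = players[players["Player"] != "Player"]
    print(players.head())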

    The data was analyzed using various Python libraries such as Matplotlib, Seaborn, and Plotly to create visualizations like bar graphs, line graphs, and box plots. The visualizations were used to identify trends and patterns in the data.

    The project follows ethical web scraping practices by not overwhelming the website with too many requests and by giving proper attribution to the website as the source of the data.

    Overall, this project demonstrates how web scraping and data analysis techniques can be used to extract meaningful insights from data available on the internet.

    Here's a data dictionary for the table:

    • Player: string - name of the player
    • Pos (Position): string - position played by the player
    • Age: integer - age of the player as of February 1, 2023
    • Tm (Team): string - team the player belongs to
    • G (Games Played): integer - number of games played by the player
    • GS (Games Started): integer - number of games started by the player
    • MP (Minutes Played): integer - total minutes played by the player
    • FG (Field Goals): integer - number of field goals made by the player
    • FGA (Field Goal Attempts): integer - number of field goal attempts by the player
    • FG% (Field Goal Percentage): float - percentage of field goals made by the player
    • 3P (3-Point Field Goals): integer - number of 3-point field goals made by the player
    • 3PA (3-Point Field Goal Attempts): integer - number of 3-point field goal attempts by the player
    • 3P% (3-Point Field Goal Percentage): float - percentage of 3-point field goals made by the player
    • 2P (2-Point Field Goals): integer - number of 2-point field goals made by the player
    • 2PA (2-Point Field Goal Attempts): integer - number of 2-point field goal attempts by the player
    • 2P% (2-Point Field Goal Percentage): float - percentage of 2-point field goals made by the player
    • eFG% (Effective Field Goal Percentage): float - effective field goal percentage of the player
    • FT (Free Throws): integer - number of free throws made by the player
    • FTA (Free Throw Attempts): integer - number of free throw attempts by the player
    • FT% (Free Throw Percentage): float - percentage of free throws made by the player
    • ORB (Offensive Rebounds): integer - number of offensive rebounds by the player
    • DRB (Defensive Rebounds): integer - number of defensive rebounds by the player
    • TRB (Total Rebounds): integer - total rebounds by the player
    • AST (Assists): integer - number of assists made by the player
    • STL (Steals): integer - number of steals made by the player
    • BLK (Blocks): integer - number of blocks made by the player
    • TOV (Turnovers): integer - number of turnovers made by the player
    • PF (Personal Fouls): integer - number of personal fouls made by the player
    • PTS (Points): integer - total points scored by the player

