Facebook
TwitterLibraries Import:
Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings. Data Loading and Exploration:
Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe(). Univariate Analysis:
Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots. Bivariate Analysis:
Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot. Gender-Based Analysis:
Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap. Univariate Clustering:
Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters. Bivariate Clustering:
Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'. Multivariate Clustering:
Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering. Result Saving:
Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Systematic reviews are the method of choice to synthesize research evidence. To identify main topics (so-called hot spots) relevant to large corpora of original publications in need of a synthesis, one must address the âthree Vsâ of big data (volume, velocity, and variety), especially in loosely defined or fragmented disciplines. For this purpose, text mining and predictive modeling are very helpful. Thus, we applied these methods to a compilation of documents related to digitalization in aesthetic, arts, and cultural education, as a prototypical, loosely defined, fragmented discipline, and particularly to quantitative research within it (QRD-ACE). By broadly querying the abstract and citation database Scopus with terms indicative of QRD-ACE, we identified a corpus of N = 55,553 publications for the years 2013â2017. As the result of an iterative approach of text mining, priority screening, and predictive modeling, we identified n = 8,304 potentially relevant publications of which n = 1,666 were included after priority screening. Analysis of the subject distribution of the included publications revealed video games as a first hot spot of QRD-ACE. Topic modeling resulted in aesthetics and cultural activities on social media as a second hot spot, related to 4 of k = 8 identified topics. This way, we were able to identify current hot spots of QRD-ACE by screening less than 15% of the corpus. We discuss implications for harnessing text mining, predictive modeling, and priority screening in future research syntheses and avenues for future original research on QRD-ACE. Dataset for: Christ, A., Penthin, M., & Kröner, S. (2019). Big Data and Digital Aesthetic, Arts, and Cultural Education: Hot Spots of Current Quantitative Research. Social Science Computer Review, 089443931988845. https://doi.org/10.1177/0894439319888455:
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Project Description:
Title: Pandas Data Manipulation and File Conversion
Overview: This project aims to demonstrate the basic functionalities of Pandas, a powerful data manipulation library in Python. In this project, we will create a DataFrame, perform some data manipulation operations using Pandas, and then convert the DataFrame into both Excel and CSV formats.
Key Objectives:
Tools and Libraries Used:
Project Implementation:
DataFrame Creation:
Data Manipulation:
File Conversion:
to_excel() function.to_csv() function.Expected Outcome:
Upon completion of this project, you will have gained a fundamental understanding of how to work with Pandas DataFrames, perform basic data manipulation tasks, and convert DataFrames into different file formats. This knowledge will be valuable for data analysis, preprocessing, and data export tasks in various data science and analytics projects.
Conclusion:
The Pandas library offers powerful tools for data manipulation and file conversion in Python. By completing this project, you will have acquired essential skills that are widely applicable in the field of data science and analytics. You can further extend this project by exploring more advanced Pandas functionalities or integrating it into larger data processing pipelines.in this data we add number of data and make that data a data frame.and save in single excel file as different sheet name and then convert that excel file in csv file .
Facebook
Twitter[EDIT/UPDATE]
There are a few important updates.
pd.Dataframe as a .csv, the following command should be used to avoid improper interpretation of newline character(s). train_df.to_csv(
"train.csv", index=False,
encoding='utf-8',
quoting=csv.QUOTE_NONNUMERIC # <== THIS IS REQUIRED
)
.csv as a pd.Dataframe, the following command must be used to avoid misinterpretation of NaN like strings (null, nan, ...) as pd.NaN values.train_df = pd.read_csv(
"/kaggle/input/ai4code-train-dataframe/train.csv",
keep_default_na=False # <== THIS IS REQUIRED
)
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in the wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset, containing a plethora of anthropological data, collected unobtrusively for the total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data, will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have MongoDB Database Tools installed from here.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain related information to these collections. Each document in any collection follows the format shown below:
{
_id:
Facebook
TwitterProperty Based Matching Dataset
This dataset is part of the Deep Principle Bench collection.
Files
property_based_matching.csv: Main dataset file
Usage
import pandas as pd from datasets import load_dataset
dataset = load_dataset("yhqu/property_based_matching")
df = pd.read_csv("hf://datasets/yhqu/property_based_matching/property_based_matching.csv")
Citation
Please cite this work if⊠See the full description on the dataset page: https://huggingface.co/datasets/yhqu/property_based_matching.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Pandas is a powerful package to process tabular data. This dataset provides the absolute minimum amount of data needed to start exploring the capabilities of this package.
This dataset contains very basic examples to explore handy operations with Pandas DataFrames. There are 3 CSV files in the dataset:
- thermometer_A.csv and thermometer_B.csv contain synthetic data representing temperature measurements over a full day by two devices.
- fertiliser_plant_growth.csv contains synthetic data represeting the growth of 3 groups of plants (control, fertilizer A and fertilizer B).
Banner image by Sid Balachandran on Unsplash
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a corpus of 56,416 unique privacy policy texts spanning the years 1996-2021.
policy-texts.zip contains a directory of text files with the policy texts. File names are the hashes of the policy text.
policy-metadata.zip contains two CSV files (can be imported into a pandas dataframe) with policy metadata including readability measures for each policy text.
labeled-policies.zip contains CSV files with content labels for each policy. Labeling was done using a BERT classifier.
Details on the methodology can be found in the accompanying paper.
Facebook
TwitterDescriptor Prediction Dataset
This dataset is part of the Deep Principle Bench collection.
Files
descriptor_prediction.csv: Main dataset file
Usage
import pandas as pd from datasets import load_dataset
dataset = load_dataset("yhqu/descriptor_prediction")
df = pd.read_csv("hf://datasets/yhqu/descriptor_prediction/descriptor_prediction.csv")
Citation
Please cite this work if you use⊠See the full description on the dataset page: https://huggingface.co/datasets/yhqu/descriptor_prediction.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
As it contains longer off periods with zeros, the CSV file is nicely compressible.
To extract it use: xz -d DARCK.csv.xz.
The compression leads to a 97% smaller file size (From 4GB to 90.9MB).
To use the dataset in python, you can, e.g., load the csv file into a pandas dataframe.
pythonimport pandas as pd
df = pd.read_csv("DARCK.csv", parse_dates=["time"])
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in docker on a Dell OptiPlex 3020M.
DARCK.csv)The dataset is provided as a single comma-separated value (CSV) file.
Column Name |
Data Type |
Unit |
Description |
time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
[appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
| Aggregate Columns | |||
aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |
| Analysis Columns | |||
inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
main) PostprocessingThe aggregate power data required several cleaning steps to ensure accuracy.
shellies) PostprocessingThe Shelly devices are not prone to the same burst issue as the ESP8266 is. They push a new reading at every change in power drawn. If no power change is observed or the one observed is too small (less than a few Watt), the reading is pushed once a minute, together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
.resample('1s').last().ffill(). time index.NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption.During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Download the dataset
At the moment to download the dataset you should use Pandas DataFrame: import pandas as pd df = pd.read_csv("https://huggingface.co/datasets/cybernetic-m/oldIT2modIT/resolve/main/oldIT2modIT_dataset.csv")
You can visualize the dataset with: df.head()
To convert into Huggingface dataset: from datasets import Dataset dataset = Dataset.from_pandas(df)
Dataset Description
This is an italian dataset formed by 200 old (ancient) italian sentence and⊠See the full description on the dataset page: https://huggingface.co/datasets/cybernetic-m/oldIT2modIT.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The excel files are the measured values (mostly particle properties from microscopic images) in .csv format. The files are readable by pandas dataframe and exported by pandas dataframe. Files with Intensity values are the intensity values from maximum z-stack projection of fluorescent micrographs taken from the particles inside the DLD device
Facebook
Twitterhttps://mit-license.org/https://mit-license.org/
This resource contains "RouteLink" files for version 2.1.6 of the National Water Model which are used to associate feature identifiers for computational reaches to relevant metadata. These data are important for comparing NWM feature data to USGS streamflow and lake observations. The original RouteLink files are in NetCDF format and available here: https://www.nco.ncep.noaa.gov/pmb/codes/nwprod
This resource includes the files in a human-friendlier CSV format for easier use, and a machine-friendlier file in HDF5 format which contains a single pandas.DataFrame. The scripts and supporting utilities are also included for users that wish to rebuild these files. Source code is hosted here: https://github.com/jarq6c/NWM_RouteLinks
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An example of .bin file that have an IndexError when processing.
Consider #120 OxWearables / stepcount issue for more details.
The .csv files are 1-second epoch conversions from the .bin file and contain time, x, y, z columns. The conversion was done by:
reading the .bin with the GENEAread R package.
keeping only the time, x, y and z columns.
saving the data.frame into a .csv file.
The only difference between the .csv files is the column format used for the time column before saving:
time column in XXXXXX_....csv had a string class
time column in XXXXXT....csv had a "POSIXct" "POSIXt" class
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This submission provides csv files with the data files from a comprehensive study aimed at investigating the effects of sublethal concentrations of the insecticide teflubenzuron on the survival, growth, reproduction, and lipid changes of theCollembola Folsomia candida over different exposure periods.
The dataset files are provided in CSV format with Comma Separated Values:
Description of the files
Variables in the files:
File 1:
sample: sample unique ID
Files 2 and 3:
File 4:
[NA stands for samples lost/ not measured]
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie SkĆodowska-Curie grant agreement No 859891.
This publication reflects only the authors' view and the European Commission is not responsible for any use that may be made of the information it contains.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please read the readme.txt !
This depository contains raw and clean data (.csv), as well as the R-scripts (.r) that process the data, create the plots and the models.
We recommend to go through the R-scripts in their chronological order.
Code was developed in the R software:
R version 4.4.1 (2024-06-14 ucrt) -- "Race for Your Life" Copyright (C) 2024 The R Foundation for Statistical Computing Platform: x86_64-w64-mingw32/x64
****** List of files ********************************
---raw
72 files from 72 Hobo data loggers
names: site_position_medium.csv
example: "20_20_down_water.csv" (site = 20, position = 20 m downstream, medium = water)
---clean
site_logger_position_medium.csv list of all sites, their loggers, their position and medium in which they were placed
loggerdata_compiled.csv all raw logger data (see above) compiled into one dataframe, for column names see below
Daily_loggerdata.csv all data aggregated to daily mean, max and min values, for column names see below
CG_site_distance_pairs.csv all logger positions for each stream and their pairwise geographical distance in meters
Discharge_site7.csv Discharge data for the same season as logger data from a reference stream
buffer_width_eniro_CG.csv measured and averaged buffer widths for each of the studied streams (in m)
01_compile_clean_loggerdata.r
02_aggregate_loggerdata.r
03_model_stream_temp_summer.r
03b_model_stream_temp_autumn.r
04_calculate_warming_cooling_rates_summer.r
04b_calculate_warming_cooling_rates_autumn.r
05_model_air_temp_summer.r
05b_model_air_temp_autumn.r
06_plot_representative_time_series_temp_discharge.r
****** Column names ********************************
Most column names are self explaining, and are also explained in the R code.
Below some detailed info on two dataframes (.csv) - the column names are similar in other csv files
File "loggerdata_compiled.csv" [in Data/clean/ ]
"Logger.SN" Logger serial number
"Timestamp" Datetime, YYYY-MM-DD HH:MM:SS
"Temp" temperature in °C
"Illum" light in lux
"Year" YYYY
"Month" MM
"Day" DD
"Hour" HH
"Minute" MM
"Second" SS
"tz" time zone
"path" file path
"site" stream/site ID
"file" file name
"medium" "water" or "air"
"position" one of 6 positions along the stream: up, mid, end, 20, 70, 150
"date" YYYY-MM-DD
File "Daily_loggerdata.csv" [in Data/clean/ ]
"date" ... (see above)
"Logger.SN" Logger serial number
"mean_temp" mean daily temperature
"min_temp" minimum daily temperature
"max_temp" maximum daily temperature
"path" ...
"site" ...
"file" ...
"medium" ...
"position" ...
"buffer" one of 3 buffer categories: no, thin, wide
"Temp.max.ref" maximum daily temperature of the upstream reference logger
"Temp.min.ref" minimum daily temperature of the upstream reference logger
"Temp.mean.ref" mean daily temperature of the upstream reference logger
"Temp.max.dev" max. temperature difference to upstream reference
"Temp.min.dev" min. temperature difference to upstream reference
"Temp.mean.dev" mean temperature difference to upstream reference
Paper abstract:
Clearcutting increases temperatures of forest streams, and in temperate zones, the effects can extend far downstream. Here, we studied whether similar patterns are found in colder, boreal zones and if riparian buffers can prevent stream water from heating up. We recorded temperature at 45 locations across nine streams with varying buffer widths. In these streams, we compared upstream (control) reaches with reaches in clearcuts and up to 150 m downstream. In summer, we found daily maximum water temperature increases on clearcuts up to 4.1 °C with the warmest week ranging from 12.0 to 18.6 °C. We further found that warming was sustained downstream of clearcuts to 150 m in three out of six streams with buffers < 10 m. Surprisingly, temperature patterns in autumn resembled those in summer, yet with lower absolute temperatures (maximum warming was 1.9 °C in autumn). Clearcuts in boreal forests can indeed warm streams, and because these temperature effects are propagated downstream, we risk catchment-scale effects and cumulative warming when streams pass through several clearcuts. In this study, riparian buffers wider than 15 m protected against water temperature increases; hence, we call for a general increase of riparian buffer width along small streams in boreal forests.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lesson files
For all compressed files, go to the Shell and uncompress using `tar -xzvf myarchive.tar.gz`.
1) Bioinformatic files: bioinformatic_tutorial_files.tar.gz
This archive contains the following datasets:
FASTQ files from Arabidopsis leaf RNA-seq:
Arabidopsis thaliana genome assembly and genome annotation:
The sequence of sequencing adapters in adapters.fasta.
2) Gene counts usable with DESeq2 and R: tutorial.tar.gz
This archive contains the following datasets:
contrast = c("infected", "Pseudomonas_syringae_DC3000", "mock")The raw_counts.csv file was obtained by running the `v0.1.1` version of a RNA-Seq bioinformatic pipeline on the mRNA-Seq sequencing files from Vogel et al. (2016): https://www.ebi.ac.uk/ena/data/view/PRJEB13938.
Please read the original study (Vogel et al. 2016): https://nph.onlinelibrary.wiley.com/doi/full/10.1111/nph.14036
====
Exercise files
1) NASA spaceflight
The NASA GeneLab experiment GLDS-38 performed transcriptomics and proteomics of Arabidopsis seedlings in microgravity by sending seedlings to the International Space Station (ISS).
The raw counts, scaled counts and sample to conditions files are available in the ZIP archive
2) Deforges 2019 hormone-treatments: deforges_2019.tar.gz
This archive contains:
The arabidopsis_root_hormones_raw_counts.csv file contains all gene counts from all hormones. Separate datasets were made for each hormone for convenience.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
In this data analysis, I used the dataset âRestaurant Ordersâ , from https://mavenanalytics.io/data-playground .Which has a license (License: Public Domain). Public domain license work is free for use by anyone for any purpose without restriction under copyright law. Public domain is the form of open/free, since no one owns or controls the material in any way. Dataset âRestaurant Ordersâ , from https://mavenanalytics.io/data-playground has 3 dataframes in csv format: ârestaurant_db_data_dictionary.csvâ as an instruction or description of the relationships between tables. âorder_details.csvâ - it has columns order_details_id,order_id, order_date, order_time,item_id âmenu_items.csvâ - it has columns menu_item_id , item_name ,category ,price .
Using 3 dataframes we will create new dataframe âorder_details_table' (result dataframe in Power BI file restaurant_orders_result.pbix). Based on this new dataframe, we will generate various charts visualizations in the file restaurant_orders_result_charts.pbix and also attach the charts here .Below is a more detailed description of how I created the new dataframe 'order_details_table' ,and the visualizations, including bar charts and pie charts.
I will use Power Bi in this project . 1. Delete all rows where value rows is âNULLâ in the column âitem_idâ from the dataframe âorder_detailsâ. For this, I use Power Query Editor and the âKeep Rowsâ function. And keep all rows except for 'NULL' values . 2. Combine 2 columns âorder_dateâ and âorder_timeâ to 1 column âorder_date_timeâ in the format MM/DD/YY HH:MM:SS 3. We also need to merge two dataframes into one dataframe âorder_details_tableâ using the âMerge Queriesâ function in Power Query Editor and choose inner join (only matching rows). In the dataframe ârestaurant_db_data_dictionary.csvâ we find information that column âitem_idâ from âorder_detailsâ table matches the âmenu_item_idâ in the âmenu_itemsâ table and combine 2 tables by common column id âmenu_item_idâ and âitem_idâ . 4. We remove the columns that we donât need and also create a new âorder_idâ with unique number for each order.
As a result we have 6 columns in the new dataframe âorder_details_tableâ , such as: order_details_id: A unique identifier for each dish within an order, order_id : The unique identifier for each order or transaction , order_date_time : The date when the order was created in the format (MM/DD/YY HH:MM:SS) , menu_item_category : The category to which the dish belongs , menu_item_name : The name of the dish on the menu , menu_item_price : The price of the dish .
Table order_detail_tables from Power BI file restaurant_orders_result.pbix
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13670445%2F1098315c0e34255b67ad3419aa113bf0%2Fdataframe.png?generation=1730269164808705&alt=media" alt="">
I have also created bar charts and pie charts to display the results from the new dataframe. These plots are included in the file ârestaurant_orders_result_charts.pbixâ . And you can find pictures of charts below.
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13670445%2F4254696bbd3d7e0fc5f456c226c39114%2Fpicture_1.png?generation=1730269227195114&alt=media" alt="">
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13670445%2F71092cf769862cf7364fe1ccac9fad83%2Fpicture_2.png?generation=1730269249147687&alt=media" alt="">
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13670445%2F528ef51ecf21f006b0c21b65503e03fa%2Fpicture_3.png?generation=1730269284640753&alt=media" alt="">
https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F13670445%2F147c240da4be5bfe9da057a8bc5d5939%2Fpicture_4.png?generation=1730269300346146&alt=media" alt="">
I also attached the original and new files to this project, thank you.
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset consists of four years of technical language annotations from two paper machines in northern Sweden, structured as a Pandas dataframe. The same data is also available as a semicolon-separated .csv file. The data consists of two columns, where the first column corresponds to annotation note contents, and the second column corresponds to annotation titles. The annotations are in Swedish, and processed so that all mentions of personal information are replaced with the string âegennamnâ, meaning âpersonal nameâ in Swedish. Each row corresponds to one annotation with the corresponding title. Data can be accessed in Python with: import pandas as pd annotations_df = pd.read_pickle("Technical_Language_Annotations.pkl") annotation_contents = annotations_df['noteComment'] annotation_titles = annotations_df['title']
Facebook
TwitterLibraries Import:
Importing necessary libraries such as pandas, seaborn, matplotlib, scikit-learn's KMeans, and warnings. Data Loading and Exploration:
Reading a dataset named "Mall_Customers.csv" into a pandas DataFrame (df). Displaying the first few rows of the dataset using df.head(). Conducting univariate analysis by calculating descriptive statistics with df.describe(). Univariate Analysis:
Visualizing the distribution of the 'Annual Income (k$)' column using sns.distplot. Looping through selected columns ('Age', 'Annual Income (k$)', 'Spending Score (1-100)') and plotting individual distribution plots. Bivariate Analysis:
Creating a scatter plot for 'Annual Income (k$)' vs 'Spending Score (1-100)' using sns.scatterplot. Generating a pair plot for selected columns with gender differentiation using sns.pairplot. Gender-Based Analysis:
Grouping the data by 'Gender' and calculating the mean for selected columns. Computing the correlation matrix for the grouped data and visualizing it using a heatmap. Univariate Clustering:
Applying KMeans clustering with 3 clusters based on 'Annual Income (k$)' and adding the 'Income Cluster' column to the DataFrame. Plotting the elbow method to determine the optimal number of clusters. Bivariate Clustering:
Applying KMeans clustering with 5 clusters based on 'Annual Income (k$)' and 'Spending Score (1-100)' and adding the 'Spending and Income Cluster' column. Plotting the elbow method for bivariate clustering and visualizing the cluster centers on a scatter plot. Displaying a normalized cross-tabulation between 'Spending and Income Cluster' and 'Gender'. Multivariate Clustering:
Performing multivariate clustering by creating dummy variables, scaling selected columns, and applying KMeans clustering. Plotting the elbow method for multivariate clustering. Result Saving:
Saving the modified DataFrame with cluster information to a CSV file named "Result.csv". Saving the multivariate clustering plot as an image file ("Multivariate_figure.png").