Facebook
TwitterExcel spreadsheets by species (4 letter code is abbreviation for genus and species used in study, year 2010 or 2011 is year data collected, SH indicates data for Science Hub, date is date of file preparation). The data in a file are described in a read me file which is the first worksheet in each file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read me description of the column in the date set for chemical analysis. In this file one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk , D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee , and M. Plocher. Plant reproduction is altered by simulated herbicide drift toconstructed plant communities. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).
Facebook
TwitterCC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Facebook
TwitterThis dataset was created by Pinky Verma
Facebook
Twitterhttps://digital.nhs.uk/about-nhs-digital/terms-and-conditionshttps://digital.nhs.uk/about-nhs-digital/terms-and-conditions
Warning: Large file size (over 1GB). Each monthly data set is large (over 4 million rows), but can be viewed in standard software such as Microsoft WordPad (save by right-clicking on the file name and selecting 'Save Target As', or equivalent on Mac OSX). It is then possible to select the required rows of data and copy and paste the information into another software application, such as a spreadsheet. Alternatively, add-ons to existing software, such as the Microsoft PowerPivot add-on for Excel, to handle larger data sets, can be used. The Microsoft PowerPivot add-on for Excel is available from Microsoft http://office.microsoft.com/en-gb/excel/download-power-pivot-HA101959985.aspx Once PowerPivot has been installed, to load the large files, please follow the instructions below. Note that it may take at least 20 to 30 minutes to load one monthly file. 1. Start Excel as normal 2. Click on the PowerPivot tab 3. Click on the PowerPivot Window icon (top left) 4. In the PowerPivot Window, click on the "From Other Sources" icon 5. In the Table Import Wizard e.g. scroll to the bottom and select Text File 6. Browse to the file you want to open and choose the file extension you require e.g. CSV Once the data has been imported you can view it in a spreadsheet. What does the data cover? General practice prescribing data is a list of all medicines, dressings and appliances that are prescribed and dispensed each month. A record will only be produced when this has occurred and there is no record for a zero total. For each practice in England, the following information is presented at presentation level for each medicine, dressing and appliance, (by presentation name): - the total number of items prescribed and dispensed - the total net ingredient cost - the total actual cost - the total quantity The data covers NHS prescriptions written in England and dispensed in the community in the UK. Prescriptions written in England but dispensed outside England are included. The data includes prescriptions written by GPs and other non-medical prescribers (such as nurses and pharmacists) who are attached to GP practices. GP practices are identified only by their national code, so an additional data file - linked to the first by the practice code - provides further detail in relation to the practice. Presentations are identified only by their BNF code, so an additional data file - linked to the first by the BNF code - provides the chemical name for that presentation.
Facebook
TwitterThis is a dataset downloaded off excelbianalytics.com created off of random VBA logic. I recently performed an extensive exploratory data analysis on it and I included new columns to it, namely: Unit margin, Order year, Order month, Order weekday and Order_Ship_Days which I think can help with analysis on the data. I shared it because I thought it was a great dataset to practice analytical processes on for newbies like myself.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Greetings , fellow analysts !
(NOTE : This is a random dataset generated using python. It bears no resemblance to any real entity in the corporate world. Any resemblance is a matter of coincidence.)
REC-SSEC Bank is a govt-aided bank operating in the Indian Peninsula. They have regional branches in over 40+ regions of the country. You have been provided with a massive excel sheet containing the transaction details, the total transaction amount and their location and total transaction count.
The dataset is described as follows :
For example , in the very first row , the data can be read as : " On the first of January, 2022 , 1932 transactions of summing upto INR 365554 from Bhuj were reported " NOTE : There are about 2750 transactions every single day. All of this has been given to you.
The bank wants you to answer the following questions :
Facebook
TwitterAttribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.
retail_store_sales.csv| Column Name | Description | Example Values |
|---|---|---|
Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567 |
Customer ID | A unique identifier for each customer. 25 unique customers. | CUST_01 |
Category | The category of the purchased item. | Food, Furniture |
Item | The name of the purchased item. May contain missing values or None. | Item_1_FOOD, None |
Price Per Unit | The static price of a single unit of the item. May contain missing or None values. | 4.00, None |
Quantity | The quantity of the item purchased. May contain missing or None values. | 1, None |
Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, None |
Payment Method | The method of payment used. May contain missing or invalid values. | Cash, Credit Card |
Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Online |
Transaction Date | The date of the transaction. Always present and valid. | 2023-01-15 |
Discount Applied | Indicates if a discount was applied to the transaction. May contain missing values. | True, False, None |
The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_EHE | Blender | 5.0 |
| Item_2_EHE | Microwave | 6.5 |
| Item_3_EHE | Toaster | 8.0 |
| Item_4_EHE | Vacuum Cleaner | 9.5 |
| Item_5_EHE | Air Purifier | 11.0 |
| Item_6_EHE | Electric Kettle | 12.5 |
| Item_7_EHE | Rice Cooker | 14.0 |
| Item_8_EHE | Iron | 15.5 |
| Item_9_EHE | Ceiling Fan | 17.0 |
| Item_10_EHE | Table Fan | 18.5 |
| Item_11_EHE | Hair Dryer | 20.0 |
| Item_12_EHE | Heater | 21.5 |
| Item_13_EHE | Humidifier | 23.0 |
| Item_14_EHE | Dehumidifier | 24.5 |
| Item_15_EHE | Coffee Maker | 26.0 |
| Item_16_EHE | Portable AC | 27.5 |
| Item_17_EHE | Electric Stove | 29.0 |
| Item_18_EHE | Pressure Cooker | 30.5 |
| Item_19_EHE | Induction Cooktop | 32.0 |
| Item_20_EHE | Water Dispenser | 33.5 |
| Item_21_EHE | Hand Blender | 35.0 |
| Item_22_EHE | Mixer Grinder | 36.5 |
| Item_23_EHE | Sandwich Maker | 38.0 |
| Item_24_EHE | Air Fryer | 39.5 |
| Item_25_EHE | Juicer | 41.0 |
| Item Code | Item Name | Price |
|---|---|---|
| Item_1_FUR | Office Chair | 5.0 |
| Item_2_FUR | Sofa | 6.5 |
| Item_3_FUR | Coffee Table | 8.0 |
| Item_4_FUR | Dining Table | 9.5 |
| Item_5_FUR | Bookshelf | 11.0 |
| Item_6_FUR | Bed F... |
Facebook
TwitterTypically e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, The UCI Machine Learning Repository has made this dataset containing actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail".
"This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers."
Per the UCI Machine Learning Repository, this data was made available by Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
Image from stocksnap.io.
Analyses for this dataset could include time series, clustering, classification and more.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original contents. This dataset is a cleaned version of the original version which can be found here. The data consist of contents added to Netflix from 2008 to 2021. The oldest content is as old as 1925 and the newest as 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here .
We are going to: 1. Treat the Nulls 2. Treat the duplicates 3. Populate missing rows 4. Drop unneeded columns 5. Split columns Extra steps and more explanation on the process will be explained through the code comments
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
COUNT(*) FILTER (WHERE date_added IS NULL) AS date_addes_nulls,
COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLS.
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
The director column nulls is about 30% of the whole column, therefore I will not delete them. I will rather find another column to populate it. To populate the director column, we want to find out if there is relationship between movie_cast column and director column
-- Below, we find out if some directors are likely to work with particular cast
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate NULL rows in directors
using their record with movie_cast
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--When I was doing this, I found a less complex and faster way to populate a column which I will use next
Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column with the director column
--Populate the country using the director column
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--To confirm if there are still directors linked to country that refuse to update
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
The date_added rows nulls is just 10 out of over 8000 rows, deleting them cannot affect our analysis or visualization
--Show date_added nulls
SELECT show_id, date_added
FROM netflix_clean
WHERE date_added IS NULL;
--DELETE nulls
DELETE F...
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides synthetic yet realistic data for analyzing and forecasting retail store inventory demand. It contains over 73000 rows of daily data across multiple stores and products, including attributes like sales, inventory levels, pricing, weather, promotions, and holidays.
The dataset is ideal for practicing machine learning tasks such as demand forecasting, dynamic pricing, and inventory optimization. It allows data scientists to explore time series forecasting techniques, study the impact of external factors like weather and holidays on sales, and build advanced models to optimize supply chain performance.
Challenge 1: Time Series Demand Forecasting Predict daily product demand across stores using historical sales and inventory data. Can you build an LSTM-based forecasting model that outperforms classical methods like ARIMA?
Challenge 2: Inventory Optimization Optimize inventory levels by analyzing sales trends and minimizing stockouts while reducing overstock situations.
Challenge 3: Dynamic Pricing Develop a pricing strategy based on demand, competitor pricing, and discounts to maximize revenue.
Date: Daily records from [start_date] to [end_date]. Store ID & Product ID: Unique identifiers for stores and products. Category: Product categories like Electronics, Clothing, Groceries, etc. Region: Geographic region of the store. Inventory Level: Stock available at the beginning of the day. Units Sold: Units sold during the day. Demand Forecast: Predicted demand based on past trends. Weather Condition: Daily weather impacting sales. Holiday/Promotion: Indicators for holidays or promotions.
Exploratory Data Analysis (EDA): Analyze sales trends, visualize data, and identify patterns. Time Series Forecasting: Train models like ARIMA, Prophet, or LSTM to predict future demand. Pricing Analysis: Study how discounts and competitor pricing affect sales.
Facebook
TwitterMarket basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions on itemset that a customer is most likely to purchase .I was given dataset contains data of a retailer; the transaction data provides data around all the transactions that have happened over a period of time. Retailer will use result to grove in his industry and provide for customer suggestions on itemset, we be able increase customer engagement and improve customer experience and identify customer behavior. I will solve this problem with use Association Rules type of unsupervised learning technique that checks for the dependency of one data item on another data item.
Association Rule is most used when you are planning to build association in different objects in a set. It works when you are planning to find frequent patterns in a transaction database. It can tell you what items do customers frequently buy together and it allows retailer to identify relationships between the items.
Assume there are 100 customers, 10 of them bought Computer Mouth, 9 bought Mat for Mouse and 8 bought both of them. - bought Computer Mouth => bought Mat for Mouse - support = P(Mouth & Mat) = 8/100 = 0.08 - confidence = support/P(Mat for Mouse) = 0.08/0.09 = 0.89 - lift = confidence/P(Computer Mouth) = 0.89/0.10 = 8.9 This just simple example. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png">
First, we need to load required libraries. Shortly I describe all libraries.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png">
Next, we need to upload Assignment-1_Data. xlsx to R to read the dataset.Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png">
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png">
After we will clear our data frame, will remove missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png">
To apply Association Rule mining, we need to convert dataframe into transaction data to make all items that are bought together in one invoice will be in ...
Facebook
TwitterDataset Overview This fictional dataset, generated by ChatGPT, is designed for those interested in learning and practicing data visualization, dashboard creation, and data analysis. It contains 10,000 rows of data reflecting the inventory and sales patterns of a typical supermarket, spanning a timeframe from January 1, 2024, to June 30, 2024.
The dataset aims to mimic real-world inventory dynamics and includes product details, stock levels, sales data, supplier performance, and restocking schedules. It's perfect for creating interactive dashboards in tools like Excel, Tableau, or Power BI or for practicing data cleaning and exploratory data analysis (EDA).
Key Features Comprehensive Columns:
Date: Record date. ProductID: Unique identifier for products. ProductName: Product names across diverse supermarket categories. Category: Categories like Dairy, Meat, Produce, etc. Supplier: Fictional supplier names for products. UnitPrice: Realistic product pricing. StockQuantity: Current stock levels. StockValue: Total value of inventory for each product. ReorderLevel: Threshold for triggering a reorder. ReorderQuantity: Recommended reorder quantity. UnitsSold: Number of units sold. SalesValue: Total sales value for each product. LastSoldDate: Last date of sale. LastRestockDate: Date of the last restock. NextRestockDate: Scheduled date for the next restock. DeliveryTimeDays: Delivery lead time from suppliers. DeliveryStatus: Status of the latest delivery (e.g., On Time, Delayed).
Realistic Data Generation:
Products include 50 common supermarket items across 9 categories (Dairy, Bakery, Beverages, Meat, Produce, Frozen, Snacks, Cleaning Supplies, Health & Beauty). Reflects seasonal trends and realistic stock replenishment behaviors. Randomized yet logical patterns for pricing, sales, and stock levels.
Versatile Use Cases:
Ideal for data visualization projects. Suitable for inventory management simulation. Can be used to practice time-series analysis.
Why Use This Dataset? This dataset is a learning resource, crafted to provide aspiring data enthusiasts and professionals with a sandbox to hone their skills in:
Building dashboards in Tableau, Power BI, or Excel. Analyzing inventory trends and forecasting demand. Visualizing data insights using tools like Matplotlib, Seaborn, or Plotly.
Disclaimer This dataset is entirely fictional and was generated by ChatGPT, a large language model created by OpenAI. While the data reflects patterns of a real supermarket, it is not based on any actual business or proprietary data.
Shoutout to ChatGPT for generating this comprehensive dataset and making it available to the Kaggle community! 🎉
Acknowledgments If you find this dataset helpful, feel free to share your visualizations and insights! Let’s make learning data visualization engaging and fun.
Facebook
TwitterSales record of 100,000 items by global regions, country, and sales channel for the year 2010 through 2017. This data is highly useful for practice and hands-on activity as it is unbiased and vetted by reputable data agencies. You can also find same data and more in eforexcel.com. Data points include office supplies, cereal, fruits, household fittings, snacks and cosmetics.
Facebook
TwitterThe data set is uncleaned, which consist of - Missing values - Inconsistent data type - Duplicate rows
If the ML model is trained on uncleaned data, leads to low accuracy and poor performance of the model. So, before performing any kind of analysis the data set should be cleaned.
This data set is a kind of exercise on data cleaning.
Facebook
TwitterMy Grandpa asked if the programs I was using could calculate his Golf League’s handicaps, so I decided to play around with SQL and Google Sheets to see if I could functionally recreate what they were doing.
The goal is to calculate a player’s handicap, which is the average of the last six months of their scores minus 29. The average is calculated based on how many games they have actually played in the last six months, and the number of scores averaged correlates to total games. For example, Clem played over 20 games so his handicap will be calculated with the maximum possible scores accounted for, that being 8. Schomo only played six games, so the lowest 4 will be used for their average. Handicap is always calculated with the lowest available scores.
This league uses Excel, so upon receiving the data I converted it into a CSV and uploaded it into bigQuery.
First thing I did was change column names to best represent what they were and simplify things in the code. It is much easier to remember ‘someone_scores’ than ‘int64_field_number’. It also seemed to confuse SQL less, as int64 can mean something independently.
(ALTER TABLE grandpa-golf.grandpas_golf_35.should only need the one
RENAME COLUMN int64_field_4 TO schomo_scores;)
To Find the average of Clem’s scores:
SELECT AVG(clem_scores)
FROM grandpa-golf.grandpas_golf_35.should only need the one
LIMIT 8; RESULT: 43.1
Remembering that handicap is the average minus 29, the final computation looks like:
SELECT AVG(clem_scores) - 29
FROM grandpa-golf.grandpas_golf_35.should only need the one
LIMIT 8; RESULT: 14.1
Find the average of Schomo’s scores:
SELECT AVG(schomo_scores) - 29
FROM grandpa-golf.grandpas_golf_35.should only need the one
LIMIT 6; RESULT: 10.5
This data was already automated to calculate a handicap in the league’s excel spreadsheet, so I asked for more data to see if i could recreate those functions.
Grandpa provided the past three years of league data. The names were all replaced with generic “Golfer 001, Golfer 002, etc”. I had planned on converting this Excel sheet into a CSV and manipulating it in SQL like with the smaller sample, but this did not work.
Immediately, there were problems. I had initially tried to just convert the file into a CSV and drop it into SQL, but there were functions that did not transfer properly from what was functionally the PDF I had been emailed. So instead of working with SQL, I decided to pull this into google sheets and recreate the functions for this spreadsheet. We only need the most recent 6 months of scores to calculate our handicap, so once I made a working copy I deleted the data from before this time period. Once that was cleaned up, I started working on a function that would pull the working average from these values, which is still determined by how many total values there were. This correlates as follows: for 20 or more scores average the lowest 8, for 15 to 19 scores average the lowest 6, for 6 to 14 scores average the lowest 4 and for 6 or fewer scores average the lowest 2. We also need to ensure that an average value of 0 returns a value of 0 so our handicap calculator works. My formula ended up being:
=IF(COUNT(E2:AT2)>=20, AVERAGE(SMALL(E2:AT2, ROW(INDIRECT("1:"&8)))), IF(COUNT(E2:AT2)>=15, AVERAGE(SMALL(E2:AT2, ROW(INDIRECT("1:"&6)))), IF(COUNT(E2:AT2)>=6, AVERAGE(SMALL(E2:AT2, ROW(INDIRECT("1:"&4)))), IF(COUNT(E2:AT2)>=1, AVERAGE(SMALL(E2:AT2, ROW(INDIRECT("1:"&2)))), IF(COUNT(E2:AT2)=0, 0, "")))))
The handicap is just this value minus 29, so for the handicap column the script is relatively simple: =IF(D2=0,0,IF(D2>47,18,D2-29)) This ensures that we will not get a negative value for our handicap, and pulls the basic average from the right place. It also sets the handicap to zero if there are no scores present.
Now that we have our spreadsheet back in working order with our new scripts, we are functionally done. We have recreated what my Grandpa’s league uses to generate handicaps.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Facebook
TwitterExcel spreadsheets by species (4 letter code is abbreviation for genus and species used in study, year 2010 or 2011 is year data collected, SH indicates data for Science Hub, date is date of file preparation). The data in a file are described in a read me file which is the first worksheet in each file. Each row in a species spreadsheet is for one plot (plant). The data themselves are in the data worksheet. One file includes a read me description of the column in the date set for chemical analysis. In this file one row is an herbicide treatment and sample for chemical analysis (if taken). This dataset is associated with the following publication: Olszyk , D., T. Pfleeger, T. Shiroyama, M. Blakely-Smith, E. Lee , and M. Plocher. Plant reproduction is altered by simulated herbicide drift toconstructed plant communities. ENVIRONMENTAL TOXICOLOGY AND CHEMISTRY. Society of Environmental Toxicology and Chemistry, Pensacola, FL, USA, 36(10): 2799-2813, (2017).