100+ datasets found

w
Dataset of books called An introduction to data analysis in R : hands-on...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=An+introduction+to+data+analysis+in+R+%3A+hands-on+coding%2C+data+mining%2C+visualization+and+statistics+from+scratch
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch. It features 7 columns including author, publication date, language, and book publisher.
Google Data Analytics Capstone Project
kaggle.com
Updated Oct 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Data Rookie (2022). Google Data Analytics Capstone Project [Dataset]. https://www.kaggle.com/datasets/rookieaj1234/google-data-analytics-capstone-project
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 1, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Data Rookie
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Project Name: Divvy Bikeshare Trip Data_Year2020 Date Range: April 2020 to December 2020. Analyst: Ajith Software: R Program, Microsoft Excel IDE: RStudio

The following are the basic system requirements, necessary for the project: Processor: Intel i3 or AMD Ryzen 3 and higher Internal RAM: 8 GB or higher Operating System: Windows 7 or above, MacOS

**Data Usage License: https://ride.divvybikes.com/data-license-agreement ** Introduction:

In this case, study we aim to utilize different data analysis techniques and tools, to understand the rental patterns of the divvy bike sharing company and understand the key business improvement suggestions. This case study is a mandatory project to be submitted to achieve the Google Data Analytics Certification. The data utilized in this case study was licensed based on the provided data usage license. The trips between April 2020 to December 2020 are used to analyse the data.

Scenario: Marketing team needs to design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ.

Objective: The main objective of this case study, is to understand the customer usage patterns and the breakdown of customers, based on their subscription status and the average durations of the rental bike usage.

Introduction to Data: The Data provided for this project, is adhered to the data usage license, laid down by the source company. The source data was provided in the CSV files and are month and quarter breakdowns. A total of 13 columns of data was provided in each csv file.

The following are the columns, which were initially observed across the datasets.

Ride_id Ride_type Start_station_name Start_station_id End_station_name End_station_id Usertype Start_time End_time Start_lat Start_lng End_lat End_lng

Documentation, Cleaning and Preparing Data for Analysis: The total size of the datasets, for the year 2020, is approximately 450 MB, which is tiring job, when you have to upload them to the SQL database and visualize using the BI tools. I wanted to improve my skills into R environment and this is the best opportunity and optimal to use R for the data analysis.

For more insights, installation procedures for R and RStudio, please refer to the following URL, for additional information.

R Projects Document: https://www.r-project.org/other-docs.html RStudio Download: https://www.rstudio.com/products/rstudio/ Installation Guide: https://www.youtube.com/watch?v=TFGYlKvQEQ4
Product Sales Dataset (2023-2024)
kaggle.com
zip
Updated Sep 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yash Yennewar (2025). Product Sales Dataset (2023-2024) [Dataset]. https://www.kaggle.com/datasets/yashyennewar/product-sales-dataset-2023-2024
Explore at:
zip(6012656 bytes)Available download formats
Dataset updated
Sep 30, 2025
Authors
Yash Yennewar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
🛍️ Product Sales Dataset (2023–2024)

📌 Overview

This dataset contains 200,000 synthetic sales records simulating real-world product transactions across different U.S. regions. It is designed for data analysis, business intelligence, and machine learning projects, especially in the areas of sales forecasting, customer segmentation, profitability analysis, and regional trend evaluation.

The dataset provides detailed transactional data including customer names, product categories, pricing, and revenue details, making it highly versatile for both beginners and advanced analysts.

📂 Dataset Structure

Rows: 200,000

Columns: 14

Features

Order_ID – Unique identifier for each order

Order_Date – Date of transaction

Customer_Name – Name of the customer

City – City of the customer

State – State of the customer

Region – Region (East, West, South, Centre)

Country – Country (United States)

Category – Broad product category (e.g., Accessories, Clothing & Apparel)

Sub_Category – Subdivision of category (e.g., Sportswear, Bags)

Product_Name – Product description

Quantity – Units purchased

Unit_Price – Price per unit (USD)

Revenue – Total sales amount (Quantity × Unit Price)

Profit – Net profit earned from the transaction

🎯 Potential Use Cases

Sales Analysis: Track revenue, profit, and performance by product, category, or region.

Customer Analytics: Identify top customers, purchasing frequency, and loyalty patterns.

Profitability Insights: Compare profit margins across categories and sub-categories.

Time-Series Analysis: Study seasonal demand and forecast future sales.

Visualization Projects: Build dashboards in Power BI, Tableau, or Excel.

Machine Learning: Train models for demand prediction, price optimization, or segmentation.

📊 Example Insights

Which region generates the highest revenue?

What are the top 10 most profitable products?

Are some product categories more popular in certain regions?

Which customers contribute the most to total revenue?

🏷️ Tags

business · sales · profitability · forecasting · customer analysis · retail

📜 License

This dataset is synthetic and created for educational and analytical purposes. You are free to use, modify, and share it under the CC BY 4.0 License.

🙌 Acknowledgments

This dataset was generated to provide a realistic foundation for learning and practicing Data Analytics, Power BI, Tableau, Python, and Excel projects.
w
Dataset of books called Data analysis in business research : a step-by-step...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Data analysis in business research : a step-by-step nonparametric approach [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Data+analysis+in+business+research+%3A+a+step-by-step+nonparametric+approach
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is Data analysis in business research : a step-by-step nonparametric approach. It features 7 columns including author, publication date, language, and book publisher.
w
Dataset of books called Core concepts in data analysis : summarization,...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Core concepts in data analysis : summarization, correlation and visualization [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Core+concepts+in+data+analysis+%3A+summarization%2C+correlation+and+visualization
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is Core concepts in data analysis : summarization, correlation and visualization. It features 7 columns including author, publication date, language, and book publisher.
Wikipedia SQLITE Portable DB, Huge 5M+ Rows
kaggle.com
zip
Updated Jun 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
christernyc (2024). Wikipedia SQLITE Portable DB, Huge 5M+ Rows [Dataset]. https://www.kaggle.com/datasets/christernyc/wikipedia-sqlite-portable-db-huge-5m-rows/code
Explore at:
zip(6064169983 bytes)Available download formats
Dataset updated
Jun 29, 2024
Authors
christernyc
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The "Wikipedia SQLite Portable DB" is a compact and efficient database derived from the Kensho Derived Wikimedia Dataset (KDWD). This dataset provides a condensed subset of raw Wikimedia data in a format optimized for natural language processing (NLP) research and applications.

I am not affiliated or partnered with the Kensho in any way, just really like the dataset for giving my agents to query easily.

Key Features:

Contains over 5 million rows of data from English Wikipedia and Wikidata Stored in a portable SQLite database format for easy integration and querying Includes a link-annotated corpus of English Wikipedia pages and a compact sample of the Wikidata knowledge base Ideal for NLP tasks, machine learning, data analysis, and research projects

The database consists of four main tables:

items: Contains information about Wikipedia items, including labels and descriptions

properties: Stores details about Wikidata properties, such as labels and descriptions

pages: Provides metadata for Wikipedia pages, including page IDs, item IDs, titles, and view counts

link_annotated_text: Contains the link-annotated text of Wikipedia pages, divided into sections

This dataset is derived from the Kensho Derived Wikimedia Dataset (KDWD), which is built from the English Wikipedia snapshot from December 1, 2019, and the Wikidata snapshot from December 2, 2019. The KDWD is a condensed subset of the raw Wikimedia data in a form that is helpful for NLP work, and it is released under the CC BY-SA 3.0 license. Credits: The "Wikipedia SQLite Portable DB" is derived from the Kensho Derived Wikimedia Dataset (KDWD), created by the Kensho R&D group. The KDWD is based on data from Wikipedia and Wikidata, which are crowd-sourced projects supported by the Wikimedia Foundation. We would like to acknowledge and thank the Kensho R&D group for their efforts in creating the KDWD and making it available for research and development purposes. By providing this portable SQLite database, we aim to make Wikipedia data more accessible and easier to use for researchers, data scientists, and developers working on NLP tasks, machine learning projects, and other data-driven applications. We hope that this dataset will contribute to the advancement of NLP research and the development of innovative applications utilizing Wikipedia data.

https://www.kaggle.com/datasets/kenshoresearch/kensho-derived-wikimedia-data/data

Tags: encyclopedia, wikipedia, sqlite, database, reference, knowledge-base, articles, information-retrieval, natural-language-processing, nlp, text-data, large-dataset, multi-table, data-science, machine-learning, research, data-analysis, data-mining, content-analysis, information-extraction, text-mining, text-classification, topic-modeling, language-modeling, question-answering, fact-checking, entity-recognition, named-entity-recognition, link-prediction, graph-analysis, network-analysis, knowledge-graph, ontology, semantic-web, structured-data, unstructured-data, data-integration, data-processing, data-cleaning, data-wrangling, data-visualization, exploratory-data-analysis, eda, corpus, document-collection, open-source, crowdsourced, collaborative, online-encyclopedia, web-data, hyperlinks, categories, page-views, page-links, embeddings

Usage with LIKE queries: ``` import aiosqlite import asyncio

class KenshoDatasetQuery: def init(self, db_file): self.db_file = db_file

async def _aenter_(self): self.conn = await aiosqlite.connect(self.db_file) return self async def _aexit_(self, exc_type, exc_val, exc_tb): await self.conn.close() async def search_pages_by_title(self, title): query = """ SELECT pages.page_id, pages.item_id, pages.title, pages.views, items.labels AS item_labels, items.description AS item_description, link_annotated_text.sections FROM pages JOIN items ON pages.item_id = items.id JOIN link_annotated_text ON pages.page_id = link_annotated_text.page_id WHERE pages.title LIKE ? """ async with self.conn.execute(query, (f"%{title}%",)) as cursor: return await cursor.fetchall() async def search_items_by_label_or_description(self, keyword): query = """ SELECT id, labels, description FROM items WHERE labels LIKE ? OR description LIKE ? """ async with self.conn.execute(query, (f"%{keyword}%", f"%{keyword}%")) as cursor: return await cursor.fetchall() async def search_items_by_label(self, label): query = """ SELECT id, labels, description FROM items WHERE labels LIKE ? """ async with self.conn.execute(query, (f"%{label}%",)) as cursor: return await cursor.fetchall() async def search_properties_by_label_or_desc...
t
Trusted Research Environments: Analysis of Characteristics and Data...
researchdata.tuwien.ac.at
bin, csv
Updated Jun 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber (2024). Trusted Research Environments: Analysis of Characteristics and Data Availability [Dataset]. http://doi.org/10.48436/cv20m-sg117
Explore at:
bin, csvAvailable download formats
Unique identifier
https://doi.org/10.48436/cv20m-sg117
Dataset updated
Jun 25, 2024
Dataset provided by
TU Wien
Authors
Martin Weise; Martin Weise; Andreas Rauber; Andreas Rauber
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data with technical organizational and legal measures from (accidentally) being leaked outside the facility. While many TREs exist in Europe, little information is available publicly on the architecture and descriptions of their building blocks & their slight technical variations. To shine light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system description. We further analyze their technical characteristics, especially in their commonalities & variations and provide insight on their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available a majority of available sensitive data records included in this study.
Methodology
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:
Peer-reviewed articles where available,
TRE websites,
TRE metadata catalogs.
The goal for this literature study is to discover existing TREs, analyze their characteristics and data availability to give an overview on available infrastructure for sensitive data research as many European initiatives have been emerging in recent months.
Technical details
This dataset consists of five comma-separated values (.csv) files describing our inventory:
countries.csv: Table of countries with columns id (number), name (text) and code (text, in ISO 3166-A3 encoding, optional)
tres.csv: Table of TREs with columns id (number), name (text), countryid (number, refering to column id of table countries), structureddata (bool, optional), datalevel (one of [1=de-identified, 2=pseudonomized, 3=anonymized], optional), outputcontrol (bool, optional), inceptionyear (date, optional), records (number, optional), datatype (one of [1=claims, 2=linked records]), optional), statistics_office (bool), size (number, optional), source (text, optional), comment (text, optional)
access.csv: Table of access modes of TREs with columns id (number), suf (bool, optional), physical_visit (bool, optional), external_physical_visit (bool, optional), remote_visit (bool, optional)
inclusion.csv: Table of included TREs into the literature study with columns id (number), included (bool), exclusion reason (one of [peer review, environment, duplicate], optional), comment (text, optional)
major_fields.csv: Table of data categorization into the major research fields with columns id (number), life_sciences (bool, optional), physical_sciences (bool, optional), arts_and_humanities (bool, optional), social_sciences (bool, optional).
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:
schema.sql: Schema definition file to create the tables and views used in the analysis.
The analysis was done through Jupyter Notebook which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb

Cafe Sales - Dirty Data for Cleaning Training

kaggle.com

zip

Updated Jan 17, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training

Explore at:

zip(113510 bytes)Available download formats

Dataset updated

Jan 17, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Cafe Sales Dataset

Overview

The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

File Information

File Name: dirty_cafe_sales.csv
Number of Rows: 10,000
Number of Columns: 8

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Item`	The name of the item purchased. May contain missing or invalid values (e.g., "ERROR").	`Coffee`, `Sandwich`
`Quantity`	The quantity of the item purchased. May contain missing or invalid values.	`1`, `3`, `UNKNOWN`
`Price Per Unit`	The price of a single unit of the item. May contain missing or invalid values.	`2.00`, `4.00`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `12.00`
`Payment Method`	The method of payment used. May contain missing or invalid values (e.g., `None`, "UNKNOWN").	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Takeaway`
`Transaction Date`	The date of the transaction. May contain missing or incorrect values.	`2023-01-01`

Data Characteristics

Missing Values:
- Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
Invalid Values:
- Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
Price Consistency:
- Prices for menu items are consistent but may have missing or incorrect values introduced.

Menu Items

The dataset includes the following menu items with their respective price ranges:

Item	Price($)
Coffee	2
Tea	1.5
Sandwich	4
Salad	5
Cake	3
Cookie	1
Smoothie	4
Juice	3

Use Cases

This dataset is suitable for: - Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries. - Exploring EDA techniques like visualizations and summary statistics. - Performing feature engineering for machine learning workflows.

Cleaning Steps Suggestions

To clean this dataset, consider the following steps: 1. Handle Missing Values: - Fill missing numeric values with the median or mean. - Replace missing categorical values with the mode or "Unknown."

Handle Invalid Values:
- Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
Date Consistency:
- Ensure all dates are in a consistent format.
- Fill missing dates with plausible values based on nearby records.
Feature Engineering:
- Create new columns, such as Day of the Week or Transaction Month, for further analysis.

License

This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

Feedback

If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

w
Dataset of books called Statistical and computational methods in data...
workwithdata.com
Updated Apr 17, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Statistical and computational methods in data analysis [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Statistical+and+computational+methods+in+data+analysis
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 2 rows and is filtered where the book is Statistical and computational methods in data analysis. It features 7 columns including author, publication date, language, and book publisher.
S
Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on...
scidb.cn
Updated Apr 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yang Luo (2025). Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on Trust: Evidence from Six Experimental Studies [Dataset]. http://doi.org/10.57760/sciencedb.psych.00565
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.psych.00565
Dataset updated
Apr 30, 2025
Dataset provided by
Science Data Bank
Authors
Yang Luo
Description
This dataset originates from a series of experimental studies titled “Tough on People, Tolerant to AI? Differential Effects of Human vs. AI Unfairness on Trust” The project investigates how individuals respond to unfair behavior (distributive, procedural, and interactional unfairness) enacted by artificial intelligence versus human agents, and how such behavior affects cognitive and affective trust.1 Experiment 1a: The Impact of AI vs. Human Distributive Unfairness on TrustOverview: This dataset comes from an experimental study aimed at examining how individuals respond in terms of cognitive and affective trust when distributive unfairness is enacted by either an artificial intelligence (AI) agent or a human decision-maker. Experiment 1a specifically focuses on the main effect of the “type of decision-maker” on trust.Data Generation and Processing: The data were collected through Credamo, an online survey platform. Initially, 98 responses were gathered from students at a university in China. Additional student participants were recruited via Credamo to supplement the sample. Attention check items were embedded in the questionnaire, and participants who failed were automatically excluded in real-time. Data collection continued until 202 valid responses were obtained. SPSS software was used for data cleaning and analysis.Data Structure and Format: The data file is named “Experiment1a.sav” and is in SPSS format. It contains 28 columns and 202 rows, where each row corresponds to one participant. Columns represent measured variables, including: grouping and randomization variables, one manipulation check item, four items measuring distributive fairness perception, six items on cognitive trust, five items on affective trust, three items for honesty checks, and four demographic variables (gender, age, education, and grade level). The final three columns contain computed means for distributive fairness, cognitive trust, and affective trust.Additional Information: No missing data are present. All variable names are labeled in English abbreviations to facilitate further analysis. The dataset can be directly opened in SPSS or exported to other formats.2 Experiment 1b: The Mediating Role of Perceived Ability and Benevolence (Distributive Unfairness)Overview: This dataset originates from an experimental study designed to replicate the findings of Experiment 1a and further examine the potential mediating role of perceived ability and perceived benevolence.Data Generation and Processing: Participants were recruited via the Credamo online platform. Attention check items were embedded in the survey to ensure data quality. Data were collected using a rolling recruitment method, with invalid responses removed in real time. A total of 228 valid responses were obtained.Data Structure and Format: The dataset is stored in a file named Experiment1b.sav in SPSS format and can be directly opened in SPSS software. It consists of 228 rows and 40 columns. Each row represents one participant’s data record, and each column corresponds to a different measured variable. Specifically, the dataset includes: random assignment and grouping variables; one manipulation check item; four items measuring perceived distributive fairness; six items on perceived ability; five items on perceived benevolence; six items on cognitive trust; five items on affective trust; three items for attention check; and three demographic variables (gender, age, and education). The last five columns contain the computed mean scores for perceived distributive fairness, ability, benevolence, cognitive trust, and affective trust.Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be analyzed directly in SPSS or exported to other formats as needed.3 Experiment 2a: Differential Effects of AI vs. Human Procedural Unfairness on TrustOverview: This dataset originates from an experimental study aimed at examining whether individuals respond differently in terms of cognitive and affective trust when procedural unfairness is enacted by artificial intelligence versus human decision-makers. Experiment 2a focuses on the main effect of the decision agent on trust outcomes.Data Generation and Processing: Participants were recruited via the Credamo online survey platform from two universities located in different regions of China. A total of 227 responses were collected. After excluding those who failed the attention check items, 204 valid responses were retained for analysis. Data were processed and analyzed using SPSS software.Data Structure and Format: The dataset is stored in a file named Experiment2a.sav in SPSS format and can be directly opened in SPSS software. It contains 204 rows and 30 columns. Each row represents one participant’s response record, while each column corresponds to a specific variable. Variables include: random assignment and grouping; one manipulation check item; seven items measuring perceived procedural fairness; six items on cognitive trust; five items on affective trust; three attention check items; and three demographic variables (gender, age, and education). The final three columns contain computed average scores for procedural fairness, cognitive trust, and affective trust.Additional Notes: The dataset contains no missing values. All variables are labeled using standardized English abbreviations to facilitate reuse and secondary analysis. The file can be directly analyzed in SPSS or exported to other formats as needed.4 Experiment 2b: Mediating Role of Perceived Ability and Benevolence (Procedural Unfairness)Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 2a and to further examine the potential mediating roles of perceived ability and perceived benevolence in shaping trust responses under procedural unfairness.Data Generation and Processing: Participants were working adults recruited through the Credamo online platform. A rolling data collection strategy was used, where responses failing attention checks were excluded in real time. The final dataset includes 235 valid responses. All data were processed and analyzed using SPSS software.Data Structure and Format: The dataset is stored in a file named Experiment2b.sav, which is in SPSS format and can be directly opened using SPSS software. It contains 235 rows and 43 columns. Each row corresponds to a single participant, and each column represents a specific measured variable. These include: random assignment and group labels; one manipulation check item; seven items measuring procedural fairness; six items for perceived ability; five items for perceived benevolence; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final five columns contain the computed average scores for procedural fairness, perceived ability, perceived benevolence, cognitive trust, and affective trust.Additional Notes: There are no missing values in the dataset. All variables are labeled using standardized English abbreviations to support future reuse and secondary analysis. The dataset can be directly analyzed in SPSS and easily converted into other formats if needed.5 Experiment 3a: Effects of AI vs. Human Interactional Unfairness on TrustOverview: This dataset comes from an experimental study that investigates how interactional unfairness, when enacted by either artificial intelligence or human decision-makers, influences individuals’ cognitive and affective trust. Experiment 3a focuses on the main effect of the “decision-maker type” under interactional unfairness conditions.Data Generation and Processing: Participants were college students recruited from two universities in different regions of China through the Credamo survey platform. After excluding responses that failed attention checks, a total of 203 valid cases were retained from an initial pool of 223 responses. All data were processed and analyzed using SPSS software.Data Structure and Format: The dataset is stored in the file named Experiment3a.sav, in SPSS format and compatible with SPSS software. It contains 203 rows and 27 columns. Each row represents a single participant, while each column corresponds to a specific measured variable. These include: random assignment and condition labels; one manipulation check item; four items measuring interactional fairness perception; six items for cognitive trust; five items for affective trust; three attention check items; and three demographic variables (gender, age, education). The final three columns contain computed average scores for interactional fairness, cognitive trust, and affective trust.Additional Notes: There are no missing values in the dataset. All variable names are provided using standardized English abbreviations to facilitate secondary analysis. The data can be directly analyzed using SPSS and exported to other formats as needed.6 Experiment 3b: The Mediating Role of Perceived Ability and Benevolence (Interactional Unfairness)Overview: This dataset comes from an experimental study designed to replicate the findings of Experiment 3a and further examine the potential mediating roles of perceived ability and perceived benevolence under conditions of interactional unfairness.Data Generation and Processing: Participants were working adults recruited via the Credamo platform. Attention check questions were embedded in the survey, and responses that failed these checks were excluded in real time. Data collection proceeded in a rolling manner until a total of 227 valid responses were obtained. All data were processed and analyzed using SPSS software.Data Structure and Format: The dataset is stored in the file named Experiment3b.sav, in SPSS format and compatible with SPSS software. It includes 227 rows and
w
Dataset of books called Modeling and data analysis : an introduction with...
workwithdata.com
Updated Apr 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Work With Data (2025). Dataset of books called Modeling and data analysis : an introduction with environmental applications [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=Modeling+and+data+analysis+%3A+an+introduction+with+environmental+applications
Explore at:
Dataset updated
Apr 17, 2025
Dataset authored and provided by
Work With Data
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is about books. It has 1 row and is filtered where the book is Modeling and data analysis : an introduction with environmental applications. It features 7 columns including author, publication date, language, and book publisher.
Data from: Artificial Intelligence in Healthcare: 2023 Year in Review...
figshare.com
txt
Updated Apr 23, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julia Maslinski; Rachel Grasfield B; Raghav Awasthi; Shreya Mishra; Dwarikanath Mahapatra; Jacek B Cywinkski; Ashish K. Khanna; kamal maheshwari; Chintan Dave; Avneesh Khare; Francis A. Papay; Piyush Mathur (2024). Artificial Intelligence in Healthcare: 2023 Year in Review Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.25670019.v3
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25670019.v3
Dataset updated
Apr 23, 2024
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Julia Maslinski; Rachel Grasfield B; Raghav Awasthi; Shreya Mishra; Dwarikanath Mahapatra; Jacek B Cywinkski; Ashish K. Khanna; kamal maheshwari; Chintan Dave; Avneesh Khare; Francis A. Papay; Piyush Mathur
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background:The infodemic we are experiencing with AI related publications in healthcare is unparalleled. The excitement and fear surrounding the adoption of rapidly evolving AI in healthcare applications pose a real challenge. Collaborative learning from published research is one of the best ways to understand the associated opportunities and challenges in the field. To gain a deep understanding of recent developments in this field, we have conducted a quantitative and qualitative review of AI in healthcare research articles published in 2023.Methods:We performed a PubMed search using the terms, “machine learning” or “artificial intelligence” and “2023”, restricted to English language and human subject research as of December 31, 2023 on January 1, 2024. Utilizing a Deep Learning-based approach, we assessed the maturity of publications. Following this, we manually annotated the healthcare specialty, data utilized, and models employed for the identified mature articles. Subsequently, empirical data analysis was performed to elucidate trends and statistics. Similarly, we performed a search for Large Language Model(LLM) based publications for the year 2023.Results:Our PubMed search yielded 23,306 articles, of which 1,612 were classified as mature. Following exclusions, 1,226 articles were selected for final analysis. Among these, the highest number of articles originated from the Imaging specialty (483), followed by Gastroenterology (86), and Ophthalmology (78). Analysis of data types revealed that image data was predominant, utilized in 75.2% of publications, followed by tabular data (12.9%) and text data (11.6%). Deep Learning models were extensively employed, constituting 59.8% of the models used. For the LLM related publications,after exclusions, 584 publications were finally classified into the 26 different healthcare specialties and used for further analysis. The utilization of Large Language Models (LLMs), is highest in general healthcare specialties, at 20.1%, followed by surgery at 8.5%.Conclusion:Image based healthcare specialities such as Radiology, Gastroenterology and Cardiology have dominated the landscape of AI in healthcare research for years. In the future, we are likely to see other healthcare specialties including the education and administrative areas of healthcare be driven by the LLMs and possibly multimodal models in the next era of AI in healthcare research and publications.Data Files Description:Here, we are providing two data files. The first file, named FinalData_2023_YIR, contains 1267 rows with columns including 'DOI', 'Title', 'Abstract', 'Author Name', 'Author Address', 'Specialty', 'Data type', 'Model type', and 'Systematic Reviews'. The columns 'Specialty', 'Data type', 'Model type', and 'Systematic Reviews' were manually annotated by the BrainX AI research team. The second file, named Final_LLM_2023_YIR, consists of 584 rows and columns including 'DOI', 'Title', 'Abstract', 'Author Name', 'Author Address', 'Journal', and 'Specialty'. Here, the 'Specialty' column was also manually annotated by the BrainX AI Research Team.

Retail Store Sales: Dirty for Data Cleaning

kaggle.com

zip

Updated Jan 18, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning

Explore at:

zip(226740 bytes)Available download formats

Dataset updated

Jan 18, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Retail Store Sales Dataset

Overview

The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

File Information

File Name: retail_store_sales.csv
Number of Rows: 12,575
Number of Columns: 11

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Customer ID`	A unique identifier for each customer. 25 unique customers.	`CUST_01`
`Category`	The category of the purchased item.	`Food`, `Furniture`
`Item`	The name of the purchased item. May contain missing values or `None`.	`Item_1_FOOD`, `None`
`Price Per Unit`	The static price of a single unit of the item. May contain missing or `None` values.	`4.00`, `None`
`Quantity`	The quantity of the item purchased. May contain missing or `None` values.	`1`, `None`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `None`
`Payment Method`	The method of payment used. May contain missing or invalid values.	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Online`
`Transaction Date`	The date of the transaction. Always present and valid.	`2023-01-15`
`Discount Applied`	Indicates if a discount was applied to the transaction. May contain missing values.	`True`, `False`, `None`

Categories and Items

The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

Electric Household Essentials

Item Code	Item Name	Price
Item_1_EHE	Blender	5.0
Item_2_EHE	Microwave	6.5
Item_3_EHE	Toaster	8.0
Item_4_EHE	Vacuum Cleaner	9.5
Item_5_EHE	Air Purifier	11.0
Item_6_EHE	Electric Kettle	12.5
Item_7_EHE	Rice Cooker	14.0
Item_8_EHE	Iron	15.5
Item_9_EHE	Ceiling Fan	17.0
Item_10_EHE	Table Fan	18.5
Item_11_EHE	Hair Dryer	20.0
Item_12_EHE	Heater	21.5
Item_13_EHE	Humidifier	23.0
Item_14_EHE	Dehumidifier	24.5
Item_15_EHE	Coffee Maker	26.0
Item_16_EHE	Portable AC	27.5
Item_17_EHE	Electric Stove	29.0
Item_18_EHE	Pressure Cooker	30.5
Item_19_EHE	Induction Cooktop	32.0
Item_20_EHE	Water Dispenser	33.5
Item_21_EHE	Hand Blender	35.0
Item_22_EHE	Mixer Grinder	36.5
Item_23_EHE	Sandwich Maker	38.0
Item_24_EHE	Air Fryer	39.5
Item_25_EHE	Juicer	41.0

Furniture

Item Code	Item Name	Price
Item_1_FUR	Office Chair	5.0
Item_2_FUR	Sofa	6.5
Item_3_FUR	Coffee Table	8.0
Item_4_FUR	Dining Table	9.5
Item_5_FUR	Bookshelf	11.0
Item_6_FUR	Bed F...

HR Analytics Dataset
kaggle.com
Updated Jan 18, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shodolamu Opeyemi (2025). HR Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/hopesb/hr-analytics-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 18, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Shodolamu Opeyemi
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The uploaded dataset contains detailed information about employees, training programs, and other HR-related metrics. Here's an overview:

General Details:

Rows: 3,150

Columns: 39

Column Names:

Unnamed: 0

FirstName

LastName

StartDate

ExitDate

Title

Supervisor

ADEmail

BusinessUnit

EmployeeStatus

EmployeeType

PayZone

EmployeeClassificationType

TerminationType

TerminationDescription

DepartmentType

Division

DOB

State

JobFunctionDescription

GenderCode

LocationCode

RaceDesc

MaritalDesc

Performance Score

Current Employee Rating

Employee ID

Survey Date

Engagement Score

Satisfaction Score

Work-Life Balance Score

Training Date

Training Program Name

Training Type

Training Outcome

Location

Trainer

Training Duration (Days)

Training Cost

Summary:

Employee Data: Contains details such as names, start and exit dates, job titles, and supervisors.

Performance and Survey Metrics: Includes engagement, satisfaction, and work-life balance scores.

Training Information: Covers program names, training types, outcomes, durations, costs, and trainer details.

Diversity Details: Includes gender, race, and marital status.

Status & Classification: Indicates employee status (active/terminated), type, and termination reasons.
Retail Analysis on Large Dataset
kaggle.com
Updated Jun 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sahil Prajapati (2024). Retail Analysis on Large Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/8693643
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/dsv/8693643
Dataset updated
Jun 14, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Sahil Prajapati
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset Description:

The dataset represents retail transactional data. It contains information about customers, their purchases, products, and transaction details. The data includes various attributes such as customer ID, name, email, phone, address, city, state, zipcode, country, age, gender, income, customer segment, last purchase date, total purchases, amount spent, product category, product brand, product type, feedback, shipping method, payment method, and order status.

Key Points:

Customer Information:

Includes customer details like ID, name, email, phone, address, city, state, zipcode, country, age, and gender. Customer segments are categorized into Premium, Regular, and New. ##Transaction Details:

Transaction-specific data such as transaction ID, last purchase date, total purchases, amount spent, total purchase amount, feedback, shipping method, payment method, and order status. ##Product Information:

Contains product-related details such as product category, brand, and type. Products are categorized into electronics, clothing, grocery, books, and home decor. ##Geographic Information:

Contains location details including city, state, and country. Available for various countries including USA, UK, Canada, Australia, and Germany. ##Temporal Information:

Last purchase date is provided along with separate columns for year, month, date, and time. Allows analysis based on temporal patterns and trends. ##Data Quality:

Some rows contain null values, and others are duplicates, which may need to be handled during data preprocessing. Null values are randomly distributed across rows. Duplicate rows are available at different parts of the dataset. ##Potential Analysis:

Customer segmentation analysis based on demographics, purchase behavior, and feedback. Sales trend analysis over time to identify peak seasons or trends. Product performance analysis to determine popular categories, brands, or types. Geographic analysis to understand regional preferences and trends. Payment and shipping method analysis to optimize services. Customer satisfaction analysis based on feedback and order status. ##Data Preprocessing:

Handling null values and duplicates. Parsing and formatting temporal data. Encoding categorical variables. Scaling numerical variables if required. Splitting data into training and testing sets for modeling.
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
Pharma Data Analysis
kaggle.com
zip
Updated Jun 26, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Akanksha Kumari (2024). Pharma Data Analysis [Dataset]. https://www.kaggle.com/datasets/akanksha995579/pharma-data-analysis
Explore at:
zip(25458289 bytes)Available download formats
Dataset updated
Jun 26, 2024
Authors
Akanksha Kumari
Description
The dataset is a comprehensive sales record from Gottlieb-Cruickshank, detailing various transactions that took place in Poland in January 2018. The data includes information on customers, products, and sales teams, with a focus on the pharmaceutical industry. Below is a detailed description of the dataset:

Dataset Description

Columns:

Distributor: The name of the distributing company, which is consistent across all records as "Gottlieb-Cruickshank."

Customer Name: The name of the customer or the purchasing entity.

City: The city in Poland where the customer is located.

Country: The country of the transaction, consistently listed as "Poland."

Latitude: The latitude coordinate of the customer's city.

Longitude: The longitude coordinate of the customer's city.

Channel: The distribution channel, either "Hospital" or "Pharmacy."

Sub-channel: Specifies whether the sub-channel is "Private," "Retail," or "Institution."

Product Name: The name of the pharmaceutical product sold.

Product Class: The classification of the product, such as "Mood Stabilizers," "Antibiotics," or "Analgesics."

Quantity: The number of units sold.

Price: The price per unit of the product.

Sales: The total sales amount, calculated as Quantity * Price.

Month: The month of the transaction, which is "January" for all records.

Year: The year of the transaction, which is "2018" for all records.

Name of Sales Rep: The name of the sales representative handling the transaction.

Manager: The manager overseeing the sales representative.

Sales Team: The sales team to which the sales representative belongs, such as "Delta," "Bravo," or "Alfa."

Example Rows:

Row 1:

Customer Name: Zieme, Doyle and Kunze

City: Lublin

Product Name: Topipizole

Product Class: Mood Stabilizers

Quantity: 4

Price: 368

Sales: 1472

Sales Rep: Mary Gerrard

Manager: Britanny Bold

Sales Team: Delta

Row 2:

Customer Name: Feest PLC

City: Świecie

Product Name: Choriotrisin

Product Class: Antibiotics

Quantity: 7

Price: 591

Sales: 4137

Sales Rep: Jessica Smith

Manager: Britanny Bold

Sales Team: Delta

This dataset can be utilized for various analyses, including sales performance by city, product, and sales teams, as well as geographical distribution of sales within Poland. It provides valuable insights into the pharmaceutical sales strategies and their execution within a specific time frame.
Z
Experimental data and software for: Defaults: a double-edged sword in...
data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Montero-Porras, Eladio (2024). Experimental data and software for: Defaults: a double-edged sword in governing common resources [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10228657
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
Vrije Universiteit Brussel
Authors
Montero-Porras, Eladio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Experimental data and software for the paper: Defaults: a double-edged sword in governing common resources

The experiment consisted in three treatments of the Common Pool Resource Dilemma, where three default interventions were applied: pro-social, self-serving and no default. Plus, the participants had to complete an SVO task and a Risk assessment task.

Description of the data and file structure

In the file called all_participants.csv is the full dataset of all participants that took part of the experiment. This includes participants who will end up excluded and dropouts.

The experimental data files come in two formats: wide and long. The wide version, called data_wide_format.csv contains one row per participant and a column for all the fields, including rounds from 1 to 10 of the CPR task. Also, this file includes all demographic information of the participants, times and payments. The ID shown is generated internally and has no relationship with the participants' Prolific ID.

The long version, called data_long_format.csv, contains 10 rows per participant, and columns for the extraction and other variables necessary for analysis. This version contains the necessary data to reproduce all the figures and statistics detailed in the main manuscript.

In both of the previous files, the participants taken into account were the ones who completed the whole experiment. Those who did not complete the comprehension test, dropped out or did not sign the Informed Consent Form were excluded from the experimental data used. More details in the Methods below.

In the file default_opinions.csv, we manually classified the responses by participants to whether they were influenced by the default presented.

The file "Instructions of the experiment.pdf" contains the instructions of the experiment as shown to participants, also screenshots of the platform.
Case study: Cyclistic bike-share analysis
kaggle.com
zip
Updated Mar 25, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jorge4141 (2022). Case study: Cyclistic bike-share analysis [Dataset]. https://www.kaggle.com/datasets/jorge4141/case-study-cyclistic-bikeshare-analysis
Explore at:
zip(131490806 bytes)Available download formats
Dataset updated
Mar 25, 2022
Authors
Jorge4141
Description
Introduction

This is a case study called Capstone Project from the Google Data Analytics Certificate.

In this case study, I am working as a junior data analyst at a fictitious bike-share company in Chicago called Cyclistic.

Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

Scenario

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members.

****Primary Stakeholders:****

1: Cyclistic Executive Team

2: Lily Moreno, Director of Marketing and Manager

ASK

How do annual members and casual riders use Cyclistic bikes differently?

Why would casual riders buy Cyclistic annual memberships?

How can Cyclistic use digital media to influence casual riders to become members?

# Prepare

The last four quarters were selected for analysis which cover April 01, 2019 - March 31, 2020. These are the datasets used:

Divvy_Trips_2019_Q2 Divvy_Trips_2019_Q3 Divvy_Trips_2019_Q4 Divvy_Trips_2020_Q1

The data is stored in CSV files. Each file contains one month data for a total of 12 .csv files.

Data appears to be reliable with no bias. It also appears to be original, current and cited.

I used Cyclistic’s historical trip data found here: https://divvy-tripdata.s3.amazonaws.com/index.html

The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

Limitations

Financial information is not available.

Process

Used R to analyze and clean data

After installing the R packages, data was collected, wrangled and combined into a single file.

Columns were renamed.

Looked for incongruencies in the dataframes and converted some columns to character type, so they can stack correctly.

Combined all quarters into one big data frame.

Removed unnecessary columns

Analyze

Inspected new data table to ensure column names were correctly assigned.

Formatted columns to ensure proper data types were assigned (numeric, character, etc).

Consolidated the member_casual column.

Added day, month and year columns to aggregate data.

Added ride-length column to the entire dataframe for consistency.

Deleted trip duration rides that showed as negative and bikes out of circulation for quality control.

Replaced the word "member" with "Subscriber" and also replaced the word "casual" with "Customer".

Aggregated data, compared average rides between members and casual users.

Share

After analysis, visuals were created as shown below with R.

Act

Conclusion:

Data appears to show that casual riders and members use bike share differently.

Casual riders' average ride length is more than twice of that of members.

Members use bike share for commuting, casual riders use it for leisure and mostly on the weekends.

Unfortunately, there's no financial data available to determine which of the two (casual or member) is spending more money.

Recommendations

Offer casual riders a membership package with promotions and discounts.
Randomised Synthetic Online Game Purchases Data
kaggle.com
zip
Updated Apr 24, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
zaclovell (2022). Randomised Synthetic Online Game Purchases Data [Dataset]. https://www.kaggle.com/datasets/zaclovell/randomised-synthetic-online-game-purchases-data
Explore at:
zip(1208739 bytes)Available download formats
Dataset updated
Apr 24, 2022
Authors
zaclovell
Description
1. Why build a dataset?

I wanted to run data analysis and machine learning on a large dataset to build my data science skills but I felt out of touch with the various datasets available so I thought... how about I try and build my own dataset?

2. Why gaming data?

I wondered what data should be in the dataset and settled with online digital game purchases since I am an avid gamer. Imagine getting sales data from the PlayStation Store or Xbox Microsoft Store, this is what I was aiming to replicate.

3. Scope of the dataset

I envisaged the dataset to be data created through the purchase of a digital game on either the UK PlayStation Store or Xbox Microsoft Store. Considering this, the scope of dataset varies depending on which column of data you are viewing, for example: - Date and Time: purchases were defined between a start/end date (this can be altered, see point 4) and, of course, anytime across the 24hr clock - Geographically: purchases were setup to come from any postcode in the UK - in total this is over 1,000,000 active postcodes - Purchases: the list of game titles available for purchase is 24 - Registered Banks: the list of registered banks in the UK (as of 03/2022) was 159

4. Over 42,000 rows isn't enough?

To generate the dataset, I built a function in Python. This function, when called with the number of rows you want in your dataset, will generate the dataset. For example, calling function(1000) will provide you with a dataset with 1000 rows.

Considering this, if just over 42,000 rows of data (42,892 to be exact) isn't enough, feel free to check out the code on my GitHub to run the function yourself with as many rows as you want.

Note: You can also edit the start/end dates of the function depending on which timespan you want the dataset to cover.

5. Disclaimer - this is still a work in progress!

Yes, as stated above, this dataset is still a work in progress and is therefore not 100% perfect. There is a backlog of issues that need to be resolved. Feel free to check out the backlog.

One example of this is how on various columns, the distributions of data is equal, when in fact for the dataset to be entirely random, this should not be the case. An example of this issue is the Time column. These issues will be resolved in a later update.

Last updated: 24/04/2022

Facebook

Twitter

Click to copy link

Link copied

Cite

Work With Data (2025). Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=book&fop0=%3D&fval0=An+introduction+to+data+analysis+in+R+%3A+hands-on+coding%2C+data+mining%2C+visualization+and+statistics+from+scratch

Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch

Explore at:

Dataset updated

Apr 17, 2025

Dataset authored and provided by

Work With Data

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This dataset is about books. It has 1 row and is filtered where the book is An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch. It features 7 columns including author, publication date, language, and book publisher.

Clear search

Close search

Google apps

Main menu

Dataset of books called An introduction to data analysis in R : hands-on...

Google Data Analytics Capstone Project

Product Sales Dataset (2023-2024)

🛍️ Product Sales Dataset (2023–2024)

📌 Overview

📂 Dataset Structure

Features

🎯 Potential Use Cases

📊 Example Insights

🏷️ Tags

📜 License

🙌 Acknowledgments

Dataset of books called Data analysis in business research : a step-by-step...

Dataset of books called Core concepts in data analysis : summarization,...

Wikipedia SQLITE Portable DB, Huge 5M+ Rows

Trusted Research Environments: Analysis of Characteristics and Data...

Methodology

Technical details

Cafe Sales - Dirty Data for Cleaning Training

Dirty Cafe Sales Dataset

Overview

File Information

Columns Description

Data Characteristics

Menu Items

Use Cases

Cleaning Steps Suggestions

License

Feedback

Dataset of books called Statistical and computational methods in data...

Experimental Dataset on the Impact of Unfair Behavior by AI and Humans on...

Dataset of books called Modeling and data analysis : an introduction with...

Data from: Artificial Intelligence in Healthcare: 2023 Year in Review...

Retail Store Sales: Dirty for Data Cleaning

Dirty Retail Store Sales Dataset

Overview

File Information

Columns Description

Categories and Items

Electric Household Essentials

Furniture

HR Analytics Dataset

Retail Analysis on Large Dataset

Dataset Description:

Key Points:

Customer Information:

Datasets for Sentiment Analysis

Pharma Data Analysis

Dataset Description

Example Rows:

Experimental data and software for: Defaults: a double-edged sword in...

Case study: Cyclistic bike-share analysis

Introduction

Scenario

****Primary Stakeholders:****

ASK

Limitations

Process

Analyze

Share

Act

Recommendations

Randomised Synthetic Online Game Purchases Data

1. Why build a dataset?

2. Why gaming data?

3. Scope of the dataset

4. Over 42,000 rows isn't enough?

5. Disclaimer - this is still a work in progress!

Last updated: 24/04/2022

Dataset of books called An introduction to data analysis in R : hands-on coding, data mining, visualization and statistics from scratch

Primary Stakeholders: