This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data (see the sketch below).
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
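A minimal EDA sketch in Python, assuming the data ships as a single CSV with the columns listed above; the file name is a hypothetical placeholder.

```python
import pandas as pd

# Hypothetical file name; adjust to the actual download.
df = pd.read_csv("customer_transactions.csv", parse_dates=["Purchase Date"])

# Summary statistics for the monetary columns
print(df[["Gross Amount", "Discount Amount (INR)", "Net Amount"]].describe())

# Category and payment-method distributions
print(df["Product Category"].value_counts())
print(df["Purchase Method"].value_counts())

# A first look at discount impact: average net amount by discount usage
print(df.groupby("Discount Availed")["Net Amount"].mean())
```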
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Presentation Date: Sunday, January 8th, 2023
Location: Seattle, Washington, USA
Abstract: A talk introducing the glue software and its use in astronomy, given at the 2023 AAS meeting. Files included are Keynote slides (in .key and .pdf formats).
Hi. This is my data analysis project and also my first try at using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is an operational data analysis of data from a health monitoring device. For the detailed background story, please check the PDF file (Case 02.pdf) for reference.
In this case study, I use personal health tracker data from Fitbit to evaluate how people use health tracker devices, and then determine if there are any trends or patterns.
My data analysis focuses on two areas: exercise activity and sleeping habits. The exercise activity part studies the relationship between activity type and calories consumed, while the sleeping habit part identifies patterns in user sleep. In this analysis, I also try to use some linear regression models, so that the data can be explained in a quantitative way and predictions become easier.
I understand that I am new to data analysis and my skills and code are very beginner level. But I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.
Stanley Cheng 2021-10-07
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Welcome to the CIC PDF-Malware 2022 dataset! This dataset is meticulously cleaned and curated to support research and development in the field of malware detection within PDF files. The dataset offers a valuable resource for machine learning practitioners, researchers, and data scientists working on cybersecurity projects.
Dataset Overview: The CIC PDF-Malware 2022 dataset comprises a comprehensive collection of features extracted from PDF files, both benign and malicious. It has been thoroughly cleaned to ensure high quality and consistency. Each entry in the dataset includes detailed attributes that can be leveraged for training and testing machine learning models aimed at detecting malware embedded in PDFs.
Key Features:
- Feature-Rich Data: Includes various attributes related to PDF files, making it suitable for in-depth analysis and model training.
- Cleaned and Curated: The dataset has been meticulously cleaned to remove inconsistencies and errors, ensuring reliability and accuracy.
- Visualizations: We provide insightful visualizations to help understand the dataset's characteristics and distribution.

Usage: To facilitate easy utilization of the dataset, we have included example code and tutorials demonstrating how to load and analyze the data. These resources will help you get started quickly and effectively. A minimal loading sketch is shown below.
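A minimal loading sketch, assuming the extracted features are distributed as a single CSV; the file name and the label column name ("Class") are assumptions, not confirmed details of this dataset.

```python
import pandas as pd

# Hypothetical file name; adjust to the actual download.
df = pd.read_csv("PDFMalware2022.csv")

print(df.shape)
print(df.head())

# Inspect the label balance before training any detector
# ("Class" is an assumed name for the benign/malicious label column).
print(df["Class"].value_counts())
```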
Why This Dataset is Valuable:
- Research and Development: Ideal for researchers and practitioners focused on enhancing malware detection mechanisms.
- Benchmarking: Useful for benchmarking new algorithms and models in the context of PDF malware detection.
- Community Engagement: Engage with the dataset through discussions and collaborative projects to advance cybersecurity research.

Getting Started:
1. Download the dataset and explore the included examples and tutorials.
2. Use the provided visualizations to gain insights into the dataset's structure and attributes.
3. Share your findings, contribute to discussions, and collaborate with other Kaggle users to maximize the impact of this dataset.

Feel free to reach out with any questions or feedback. We look forward to seeing how you utilize this dataset to advance the field of malware detection!
Introduction
Robots are being introduced into increasingly social environments. As these robots become more ingrained in social spaces, they will have to abide by the social norms that guide human interactions. At times, however, robots will violate norms and perhaps even deceive their human interaction partners. This study provides some of the first evidence for how people perceive and evaluate robot deception, especially three types of deception behaviors theorized in the technology ethics literature: external state deception (cues that intentionally misrepresent or omit details from the external world, e.g., lying), hidden state deception (cues designed to conceal or obscure the presence of a capacity or internal state the robot possesses), and superficial state deception (cues that suggest a robot has some capacity or internal state that it lacks).

Methods
Participants (N = 498) were assigned to read one of three vignettes, each corresponding to one of the deceptive behavior types. Participants provided responses to qualitative and quantitative measures, which examined to what degree people approved of the behaviors, perceived them to be deceptive, found them to be justified, and believed that other agents were involved in the robots' deceptive behavior.

Results
Participants rated hidden state deception as the most deceptive and approved of it the least among the three deception types. They considered external state and superficial state deception behaviors to be comparably deceptive, but while external state deception was generally approved of, superficial state deception was not. Participants in the hidden state condition often implicated agents other than the robot in the deception.

Conclusion
This study provides some of the first evidence for how people perceive and evaluate the deceptiveness of robot deception behavior types. It found that people distinguish among the three types of deception behaviors, see them as differently deceptive, and approve of them to different degrees. They also see at least hidden state deception as stemming more from the designers than from the robot itself.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the results of an exploratory analysis of CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), focusing on the dimuon invariant mass spectrum in the 10-15 GeV range. The analysis investigates potential anomalies at 11.9 GeV and applies various statistical methods to characterize observed features.
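For context, the dimuon invariant mass that such a spectrum is built from follows directly from the muon four-momenta: m^2 = (E1 + E2)^2 - |p1 + p2|^2 in natural units (c = 1). The sketch below shows the standard computation in Python with toy four-vectors, not actual CMS events.

```python
import numpy as np

def invariant_mass(p1, p2):
    """Invariant mass of a muon pair from (E, px, py, pz) four-vectors, in GeV (c = 1)."""
    e = p1[0] + p2[0]
    p = p1[1:] + p2[1:]
    return np.sqrt(max(e**2 - np.dot(p, p), 0.0))

# Toy four-vectors chosen so the pair mass lands in the 10-15 GeV window
mu_plus = np.array([5.0, 0.7, 0.3, 4.94])
mu_minus = np.array([6.5, -0.5, -0.2, -6.47])
print(f"m(mu+ mu-) = {invariant_mass(mu_plus, mu_minus):.2f} GeV")
```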
Methodology:
Key Analysis Components:
Results Summary: The analysis identifies several features in the dimuon mass spectrum requiring further investigation. Preliminary observations suggest potential anomalies around 11.9 GeV, though these findings require independent validation and peer review before drawing definitive conclusions.
Data Products:
Limitations: This work represents preliminary exploratory analysis. Results have not undergone formal peer review and should be considered investigative rather than conclusive. Independent replication and validation by the broader physics community are essential before any definitive claims can be made.
Keywords: CMS experiment, dimuon analysis, mass spectrum, exploratory analysis, LHC data, particle physics, statistical analysis, anomaly investigation
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Traditional methods of data analysis in animal behavior research are usually based on manually coding a chosen set of behavioral parameters, which is naturally prone to human bias and error and is also a tedious, labor-intensive task. Machine learning techniques are increasingly applied to support researchers in this field, mostly in a supervised manner: for tracking animals, detecting landmarks, or recognizing actions. Unsupervised methods are increasingly used, but remain under-explored in the context of behavior studies and applied contexts such as behavioral testing of dogs. This study explores the potential of unsupervised approaches such as clustering for the automated discovery of patterns in data that have potential behavioral meaning. We aim to demonstrate that such patterns can be useful at exploratory stages of data analysis, before forming specific hypotheses. To this end, we propose a concrete method for grouping video trials of behavioral testing of animal individuals into clusters using a set of potentially relevant features. Using an example protocol for testing in a “Stranger Test”, we compare the discovered clusters against the C-BARQ owner-based questionnaire, which is commonly used for dog behavioral trait assessment, showing that our method separated well between dogs with higher C-BARQ scores for stranger fear and those with lower scores. This demonstrates the potential of such a clustering approach for exploration prior to hypothesis forming and testing in behavioral research.
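The authors' exact feature set and clustering algorithm are not reproduced here; purely as an illustration of the general pipeline described above (per-trial feature vectors, standardization, clustering), a minimal Python sketch might look like this. The feature matrix is random placeholder data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per video trial; columns might encode e.g. time spent near the
# stranger, movement speed, or posture statistics (illustrative random data).
rng = np.random.default_rng(42)
trial_features = rng.normal(size=(60, 5))

X = StandardScaler().fit_transform(trial_features)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The discovered groups can then be compared against external measures
# such as C-BARQ stranger-fear scores.
print(np.bincount(labels))
```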
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Malaria is a mosquito-borne disease spread by an infected vector (an infected female Anopheles mosquito) or through transfusion of Plasmodium-infected blood to susceptible individuals. The disease burden has resulted in high global mortality, particularly among children under the age of five. Many intervention responses have been implemented to control malaria transmission, including blood screening, long-lasting insecticidal nets (LLINs), treatment with anti-malaria drugs, spraying chemicals/pesticides on mosquito breeding sites, and indoor residual spraying, among others. Accordingly, an SIR (Susceptible-Infected-Recovered) model was developed to study the impact of various malaria control and mitigation strategies. The associated basic reproduction number and stability theory are used to investigate the stability of the model's equilibrium points. By constructing an appropriate Lyapunov function, the global stability of the malaria-free equilibrium is investigated. By determining the direction of bifurcation, the implicit function theorem is used to investigate the stability of the model's endemic equilibrium. The model is fitted to malaria data from Benue State, Nigeria, using R and MATLAB, and parameter estimates were obtained. An optimal control model is then developed and analyzed using Pontryagin's Maximum Principle. The malaria-free equilibrium point is locally and globally stable if the basic reproduction number (R0) and the blood transfusion reproduction number (Rα) are both less than or equal to unity. The sensitivity analysis of the model parameters revealed that the mosquito-to-human transmission rate (βmh), the human-to-mosquito transmission rate (βhm), the blood transfusion reproduction number (Rα), and the mosquito recruitment rate (bm) are all sensitive parameters capable of increasing the basic reproduction number (R0), thereby increasing the risk of spreading malaria. The optimal control results show that five possible controls are effective in reducing malaria transmission. The study recommends the combination of all five controls, followed by combinations of four and three controls, as effective in mitigating malaria transmission. The optimal control simulations also revealed that, for communities or areas where resources are scarce, the combination of long-lasting insecticide-treated bednets (u2), treatment (u3), and indoor insecticide spray (u5) is recommended. Numerical simulations are performed to validate the model's analytical results.
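As a toy illustration of the modelling approach (not the paper's full human-vector model with blood transfusion), the sketch below integrates a basic normalized SIR system in Python and reports the reproduction number of that toy system; all parameter values are illustrative, not the fitted Benue State estimates.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta, gamma):
    """Normalized SIR system: s + i + r = 1."""
    s, i, r = y
    ds = -beta * s * i
    di = beta * s * i - gamma * i
    dr = gamma * i
    return [ds, di, dr]

beta, gamma = 0.3, 0.1          # transmission and recovery rates (illustrative)
print("R0 =", beta / gamma)     # basic reproduction number of this toy model
sol = solve_ivp(sir, (0, 160), [0.99, 0.01, 0.0], args=(beta, gamma))
print("peak infected fraction:", sol.y[1].max())
```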
This is a dataset downloaded from excelbianalytics.com, created using random VBA logic. I recently performed an extensive exploratory data analysis on it and added new columns, namely: Unit margin, Order year, Order month, Order weekday, and Order_Ship_Days, which I think can help with analysis of the data (see the sketch below). I shared it because I thought it was a great dataset for newbies like myself to practice analytical processes on.
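A hedged sketch of how such derived columns can be computed with pandas; the file name and the source column names ("Order Date", "Ship Date", "Unit Price", "Unit Cost") are assumptions about the export, not confirmed details.

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual export.
df = pd.read_csv("sales_data.csv", parse_dates=["Order Date", "Ship Date"])

df["Unit margin"] = df["Unit Price"] - df["Unit Cost"]
df["Order year"] = df["Order Date"].dt.year
df["Order month"] = df["Order Date"].dt.month
df["Order weekday"] = df["Order Date"].dt.day_name()
df["Order_Ship_Days"] = (df["Ship Date"] - df["Order Date"]).dt.days
```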
This dataset includes 45,588,785 transactions between 12,094,228 Bitcoin addresses in the Bitcoin network, up to 2013.12.28. We preprocessed the original data by adding synthetic identifiers to addresses and merging addresses that seemed to belong to the same user. Each interaction records the sender address, the destination address, a timestamp, and the transferred quantity in BTC.
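A minimal loading sketch, assuming the transactions ship as a delimited edge list with the four fields described above; the file name, delimiter, and column order are assumptions.

```python
import pandas as pd

# Hypothetical file name and layout; adjust to the actual distribution format.
cols = ["sender", "destination", "timestamp", "btc_amount"]
tx = pd.read_csv("bitcoin_transactions.csv", names=cols)

# Simple sanity checks on the interaction network
print(len(tx), "transactions between", tx["sender"].nunique(), "distinct senders")
print("total BTC transferred:", tx["btc_amount"].sum())
```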
If you're going to use this dataset, please cite our paper: Chrysanthi Kosyfaki, Nikos Mamoulis, "Provenance in Temporal Interaction Networks", 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, May 2022. https://www.cs.uoi.gr/~nikos/icde22.pdf
Presentation Date: Wednesday, June 28, 2023
Location: Center for Astrophysics, Cambridge, MA
Abstract: A demonstration at the 2023 New England Star and Planet Formation Workshop of how the glue exploratory data analysis software has helped with recent discoveries about the structure of the local Milky Way. Files included are Keynote slides (in .key and .pdf formats).
This collection of files is part of a larger dataset uploaded in support of Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin (GPFA-AB, DOE Project DE-EE0006726). Phase 1 of the GPFA-AB project identified potential Geothermal Play Fairways within the Appalachian basin of Pennsylvania, West Virginia and New York. This was accomplished through analysis of 4 key criteria: thermal quality, natural reservoir productivity, risk of seismicity, and heat utilization. Each of these analyses represents a distinct project task, with the fifth task encompassing the combination of the 4 risk factors. Supporting data for all five tasks has been uploaded into the Geothermal Data Repository node of the National Geothermal Data System (NGDS).
This submission comprises the data for Thermal Quality Analysis (project task 1) and includes all of the necessary shapefiles, rasters, datasets, code, and references to code repositories that were used to create the thermal resource and risk factor maps as part of the GPFA-AB project. The identified Geothermal Play Fairways are also provided with the larger dataset. Figures (.png) are provided as examples of the shapefiles and rasters. The regional standardized 1 square km grid used in the project is also provided as points (cell centers), polygons, and as a raster. Two ArcGIS toolboxes are available: 1) RegionalGridModels.tbx for creating resource and risk factor maps on the standardized grid, and 2) ThermalRiskFactorModels.tbx for use in making the thermal resource maps and cross sections. These toolboxes contain item description documentation for each model within the toolbox, and for the toolbox itself. This submission also contains three R scripts: 1) AddNewSeisFields.R to add seismic risk data to attribute tables of seismic risk, 2) StratifiedKrigingInterpolation.R for the interpolations used in the thermal resource analysis, and 3) LeaveOneOutCrossValidation.R for the cross validations used in the thermal interpolations.
Some file descriptions make reference to various 'memos'. These are contained within the final report submitted October 16, 2015.
Each zipped file in the submission contains an 'about' document describing the full Thermal Quality Analysis content available, along with key sources, authors, citation, use guidelines, and assumptions, with the specific file(s) contained within the .zip file highlighted.
UPDATE: Newer version of the Thermal Quality Analysis has been added here: https://gdr.openei.org/submissions/879 (Also linked below) Newer version of the Combined Risk Factor Analysis has been added here: https://gdr.openei.org/submissions/880 (Also linked below) This is one of sixteen associated .zip files relating to thermal resource interpolation results within the Thermal Quality Analysis task of the Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin. This file contains an ArcGIS Toolbox with 6 ArcGIS Models: WellClipsToWormsSections, BufferedRasterToClippedRaster, ExtractThermalPropertiesToCrossSection, AddExtraInfoToCrossSection, and CrossSectionExtraction.
The sixteen files contain the results of the thermal resource interpolation as binary grid (raster) files, images (.png) of the rasters, and toolbox of ArcGIS Models used. Note that raster files ending in “pred” are the predicted mean for that resource, and files ending in “err” are the standard error of the predicted mean for that resource. Leave one out cross validation results are provided for each thermal resource.
Several models were built in order to process the well database with outliers removed. ArcGIS toolbox ThermalRiskFactorModels contains the ArcGIS processing tools used. First, the WellClipsToWormSections model was used to clip the wells to the worm sections (interpolation regions). Then, the 1 square km gridded regions (see series of 14 Worm Based Interpolation Boundaries .zip files) along with the wells in those regions were loaded into R using the rgdal package. Then, a stratified kriging algorithm implemented in the R gstat package was used to create rasters of the predicted mean and the standard error of the predicted mean. The code used to make these rasters is called StratifiedKrigingInterpolation.R Details about the interpolation, and exploratory data analysis on the well data is provided in 9_GPFA-AB_InterpolationThermalFieldEstimation.pdf (Smith, 2015), contained within the final report.
The output rasters from R are brought into ArcGIS for further spatial processing. First, the BufferedRasterToClippedRaster tool is used to clip the interpolations back to the Worm Sections. Then, the Mosaic tool in ArcGIS is used to merge all predicted mean rasters into a single raster, and all error rasters into a single raster for each thermal resource.
A leave one out cross validation was performed on each of the thermal resources. The code used to implement the cross validation is provided in the R script LeaveOneOutCrossValidation.R. The results of the cross validation are given for each thermal resource.
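The project's implementation is the R script LeaveOneOutCrossValidation.R; purely as a language-neutral illustration of the leave-one-out idea, the Python sketch below uses a Gaussian-process regressor (a close relative of kriging) on synthetic point data. Nothing here reproduces the project's actual grids or wells.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 2))                          # well locations (illustrative)
y = np.sin(X[:, 0] * 6) + 0.1 * rng.normal(size=30)   # thermal value proxy

# Refit with each observation held out once; collect prediction errors
errors = []
for train, test in LeaveOneOut().split(X):
    gp = GaussianProcessRegressor().fit(X[train], y[train])
    errors.append(y[test][0] - gp.predict(X[test])[0])

print("LOOCV RMSE:", np.sqrt(np.mean(np.square(errors))))
```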
Other tools provided in this toolbox are useful for creating cross sections of the thermal resource. ExtractThermalPropertiesToCrossSection model extracts the predicted mean and the standard error of predicted mean to the attribute table of a line of cross section. The AddExtraInfoToCrossSection model is then used to add any other desired information, such as state and county boundaries, to the cross section attribute table. These two functions can be combined as a single function, as provided by the CrossSectionExtraction model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains raw data, code, and analysis scripts related to experiments performed in ‘A combined microfluidic deep learning approach for lung cancer cell high throughput screening toward automatic cancer screening applications’. The data, code, and documentation are provided here to facilitate reproducible research and to enable further exploration and analysis of the experimental results.
Analysis Code:
Language: MATLAB 2020a or later with Deep Learning Toolbox
Description: This repository contains MATLAB scripts for data preprocessing, deep learning-based classification, and visualization of lung cancer cell images. The scripts train convolutional neural networks (CNNs) to classify six lung cell lines, including normal and five cancer subtypes.
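The repository's training code is MATLAB with the Deep Learning Toolbox; purely as an illustration of the task shape (a small CNN classifying single-channel cell images into six classes), a minimal Python/Keras sketch might look like the following. The image size is an assumption, and this is not the authors' architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(128, 128, 1)),      # grayscale cell images (assumed size)
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(6, activation="softmax"),  # normal + five cancer subtypes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```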
Documentation:
File: LungCancer_CellLine_Code.zip
Description: This file provides exemplary code and sample images used for the machine learning approach.
File: Supplementary information and instructions.pdf
Description: This file provides instructions and a description of the individual steps from raw data to image analysis.
File: Original Image data and Metadata Example - pc9.zip
Description: This .zip container provides an example of raw data in a native .vsi file format with folders containing the .ets file, with metadata documentation of the imaging parameters for a microfluidic channel imaged with the IX83 microscope.
File: Data augmentation documentation.docx (and Data augmentation documentation.pdf)
Description: This document provides descriptions of how data augmentation was performed.
File: Raw data.zip
Description: This file contains image raw data.
File: GrayCellData.rar
Description: This file contains image data converted to grayscale images.
File: CellData_Full.rar
Description: This file contains RGB image data.
Cell Lines: Normal lung cells and non-small cell lung cancer cells (PC-9, SK-LU-1, H-1975, A-427, and A-549)
Plate Format: Plasma-bonded and coated microfluidic chip platform fabricated with silicone sheets and sterile glass microscope slides.
Surface Coating
Prior to cell seeding, the surface of the polydimethylsiloxane (PDMS) microfluidic chip was treated with collagen to enhance cell adhesion. A 0.1% (w/v) collagen solution was prepared using Type I collagen (derived from rat tail) dissolved in a 0.02 M acetic acid buffer. The PDMS surfaces were incubated with the collagen solution for 2 hours at room temperature to allow for proper coating. Following this, the chips were rinsed with phosphate-buffered saline (PBS) to remove any unbound collagen. Collagen, being a key extracellular matrix component, provides a conducive environment for cell attachment and proliferation. This surface modification was crucial for ensuring that the cells would adhere effectively to the microfluidic architecture, promoting optimal growth conditions. The collagen coating facilitated stronger cell-matrix interactions, thereby improving the overall experimental reliability and enabling accurate analysis of cell behavior in the microfluidic system.
Seeding Density
In this study, various cell types (lung normal cells and non-small cell lung cancer cells: PC-9, SK-LU-1, H-1975, A-427, and A-549) were cultured within a microfluidic chip designed with a total length of 75 mm and a width of 25 mm, featuring three separate chambers, each with a diameter of 900 μm. The seeding density was calculated to be approximately 5,000 cells/mL. Given the chamber dimensions, this density was optimized to ensure that the cells could achieve ~70% confluency within a reasonable timeframe while maintaining their viability and functionality. The initial seeding in a 25 cm² culture flask allowed for efficient expansion and preparation of the cells prior to their transfer to the microfluidic environment (the cell culture medium was DMEM or RPMI supplemented with 10% FBS and 1% PS).
Cultivation Duration
After trypsin treatment of cells cultured in a flask, the cells were allowed to adhere to the microfluidic chip for a duration of 48-72 hours post-injection. This incubation period was essential for the cells to establish stable adhesion to the collagen-coated surfaces, enabling them to regain their morphology and functionality. It ensured that the cellular environment within the microfluidic chambers mimicked in vivo conditions, allowing for proper cell spreading and growth.
Medium Composition
The medium utilized for cell cultivation consisted of DMEM (Dulbecco's Modified Eagle Medium) or RPMI-1640, supplemented with 10% fetal bovine serum (FBS) and 1% penicillin-streptomycin (PS), tailored to the specific cell types used. This composition was chosen to provide the necessary nutrients, growth factors, and antibiotics to support cell proliferation and prevent contamination. DMEM and RPMI are known to support a wide range of mammalian cell types, thereby enhancing the versatility of the experimental setup. The medium was pre-warmed to 37°C before use, and the cells were maintained in a humidified incubator at 37°C with 5% CO₂ during cultivation.
Imaging Setup
The imaging data was acquired using an automated IX83 microscope (Olympus, Japan), featuring a Merzhäuser motorized stage, a Hamamatsu ORCA-Flash4.0 camera, and a Lumencolor Spectra X fluorescent light source. This setup ensures high-resolution fluorescence imaging with precise stage control and sensitive image capture. Data was recorded automatically after adjustment of the z-axis, using a multi-region area of interest on each microfluidic channel with the focus map function (medium density setting) in cellSens Dimension software (Version 2.1-2.3, Olympus). DAPI staining in the blue fluorescence channel was used to facilitate large-area adjustment of the focus map prior to automated imaging. The green fluorescence channel, representing phalloidin staining of F-actin, was exported as single-channel images for the deep learning procedure outlined in the paper.
1. Extract the Raw Data:
Unzip the Raw data.zip file into your working directory.
2. Environment Setup:
Read the documentation Supplementary information and instructions.pdf and the readme.txt in the code for more details on the setup.
3. Running the Analysis:
Open the file Supplementary information and instructions.pdf for a detailed description.
Data Exploration: The analysis scripts include functions for exploratory data analysis (EDA). You can modify these scripts to investigate specific experimental conditions.
Reproducibility
Follow the code comments and documentation to replicate the analyses. Ensure that the environment and dependencies are correctly configured as described in the setup section.
Licensing
This repository is licensed as follows: Code is accessible under BSD 2-Clause "Simplified" license and data under a Creative Commons Attribution 4.0 International license.
Acknowledgement:
This work was supported by the Iran National Science Foundation (INSF) Grant No. 96006759.
For data acquisition:
Abdullah Allahverdi, a-allahverdi@modares.ac.ir;
Hadi Hashemzadeh, Hashemzadeh.hadi@gmail.com;
Mario Rothbauer, mario.rothbauer@tuwien.ac.at
For data processing and augmentation:
Seyedehsamaneh Shojaei, s.shojaie@irost.ir, samane.shojaie@gmail.com
This is an agriculture dataset for the Bihar state of India.
The dataset contains 6 CSV files, 1 PDF file, and 1 image file.
column descriptors
According to its official page, BRFSS is:
BRFSS is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. BRFSS collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
To learn more about the data see the official page.
Complete description about each column of the CSV file can be found in the codebook.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the core data used to investigate what determines Munich’s residents’ engagement in activities for the protection of urban green spaces (UGS). We conducted an exploratory factor analysis and structural equation modelling based on the data from an online and in-person questionnaire.
List of data and content
Data processing
The software used for the exploratory factor analysis and structural equation modelling was Mplus 8.8 (Muthén and Muthén, 1998). The details of the methodological steps are available in the associated publication. To quickly understand the EFA, CFA, and SEM models and settings, details are provided in the hyperlinks.
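Mplus is commercial software; purely as a generic illustration of exploratory factor analysis (not the authors' model or settings), a minimal Python sketch on synthetic Likert-style responses might look like this.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic 1-5 Likert responses: 200 respondents, 10 survey items
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(200, 10)).astype(float)

# Extract three latent factors and inspect the loadings per item
fa = FactorAnalysis(n_components=3, random_state=0).fit(responses)
print(fa.components_.round(2))
```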
Acknowledgments
The authors thank all participants for their participation in the online and in-person questionnaires. Data processing and analysis would not have been possible without the help of Tomomi Saito.
This work was supported by a grant from the Alexander von Humboldt Foundation and by the Leibniz Best Minds Competition, Leibniz-Junior Research Group, Grant J76/2019.
https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-3387
Dataset containing supplemental material for the publication "2D, 2.5D, or 3D? An Exploratory Study on Multilayer Network Visualizations in Virtual Reality". This dataset contains:
1) an archive containing all raw quantitative results,
2) an archive containing all raw qualitative data,
3) an archive containing the graphs used for the experiment (.graphml file format),
4) the code to generate the graph library (C++ files using OGDF),
5) a PDF document containing detailed results (with p-values and more charts),
6) a video showing the experiment from a participant's point of view,
7) the complete graph library generated by our graph generator for the experiment.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset encompasses reviews from the Steam video game platform along with information about bundled games. It includes user reviews, purchases, plays, recommendations, product bundles, and pricing information.
Basic Statistics:
- Reviews: 7,793,069
- Users: 2,567,538
- Items: 15,474
- Bundles: 615
Metadata:
- Reviews
- Purchases, Plays, Recommends ("likes")
- Product Bundles
- Pricing Information
Example (Bundle):
```json
{
  "bundle_final_price": "$29.66",
  "bundle_url": "http://store.steampowered.com/bundle/1482/?utm_source=SteamDB...",
  "bundle_price": "$32.96",
  "bundle_name": "Two Tribes Complete Pack!",
  "bundle_id": "1482",
  "items": [
    {
      "genre": "Casual, Indie",
      "item_id": "38700",
      "discounted_price": "$4.99",
      "item_url": "http://store.steampowered.com/app/38700",
      "item_name": "Toki Tori"
    },
    {
      "genre": "Adventure, Casual, Indie",
      "item_id": "201420",
      "discounted_price": "$14.99",
      "item_url": "http://store.steampowered.com/app/201420",
      "item_name": "Toki Tori 2+"
    },
    {
      "genre": "Strategy, Indie, Casual",
      "item_id": "38720",
      "discounted_price": "$4.99",
      "item_url": "http://store.steampowered.com/app/38720",
      "item_name": "RUSH"
    },
    {
      "genre": "Action, Indie",
      "item_id": "38740",
      "discounted_price": "$7.99",
      "item_url": "http://store.steampowered.com/app/38740",
      "item_name": "EDGE"
    }
  ],
  "bundle_discount": "10%"
}
```
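A minimal parsing sketch, assuming one bundle record in the JSON layout shown above; the file name is hypothetical. If the distributed records are Python-style literals with single quotes, ast.literal_eval can be used in place of the json module.

```python
import json

# Hypothetical file name; adjust to the actual download.
with open("bundle_example.json") as f:
    bundle = json.load(f)

print(bundle["bundle_name"], bundle["bundle_discount"])
for item in bundle["items"]:
    print(" ", item["item_name"], item["discounted_price"])
```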
Citation:
- Self-attentive sequential recommendation, Wang-Cheng Kang, Julian McAuley, ICDM, 2018 [pdf]
- Item recommendation on monotonic behavior chains, Mengting Wan, Julian McAuley, RecSys, 2018 [pdf]
- Generating and personalizing bundle recommendations on Steam, Apurva Pathak, Kshitiz Gupta, Julian McAuley, SIGIR, 2017 [pdf]
Most of the text used in this notebook is from ex1.pdf of the Coursera course.
Look at ex1.pdf to get more intuition about the task.
The task will be implemented in three ways across three notebooks, and it is all about linear regression.
In this part of this exercise, we will implement linear regression with one variable to predict profits for a food truck.
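As a minimal sketch of what that looks like (batch gradient descent on a univariate hypothesis h(x) = theta0 + theta1 * x), assuming toy data rather than the exercise's ex1data1.txt:

```python
import numpy as np

# Toy values: city population (in 10k people) vs. food-truck profit (in $10k)
x = np.array([6.1, 5.5, 8.5, 7.0, 5.9, 8.3, 7.5])
y = np.array([17.6, 9.1, 13.7, 11.9, 6.8, 11.9, 4.3])

theta0, theta1, alpha = 0.0, 0.0, 0.01
for _ in range(5000):
    pred = theta0 + theta1 * x
    # Gradient of the mean squared error cost with respect to each parameter
    theta0 -= alpha * np.mean(pred - y)
    theta1 -= alpha * np.mean((pred - y) * x)

print(f"h(x) = {theta0:.2f} + {theta1:.2f} * x")
```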
This dataset contains race data from the past ten years of NCAA for the men's 100 freestyle event. I collected this data using my own Python script, in which you follow along with a race by pressing the "Enter" button with each stroke (a sketch of the idea appears below). Upon completion of the script, CSV and PDF files are generated containing data from the race. I aggregated this data for the completion of my first project.
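A hedged reconstruction of the idea behind such a script (not the author's actual code): each Enter press records a timestamp, and the stroke times are written to a CSV when you finish.

```python
import csv
import time

timestamps = []
print("Press Enter on each stroke; type 'q' then Enter to finish.")
while input() != "q":
    timestamps.append(time.time())

# Write stroke times relative to the first stroke
with open("race_strokes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["stroke", "seconds_since_start"])
    for i, t in enumerate(timestamps, start=1):
        writer.writerow([i, round(t - timestamps[0], 3)])
```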
In order to aggregate, organize, and visualize the data, I had to use a variety of software such as BigQuery (SQL), Python, Tableau, and Google Sheets. This project shows my ability to use a variety of different tools for data analysis.