100+ datasets found
  1. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Available download formats: zip (2028853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset:
    - CID (Customer ID): A unique identifier for each customer.
    - TID (Transaction ID): A unique identifier for each transaction.
    - Gender: The gender of the customer, categorized as Male or Female.
    - Age Group: Age group of the customer, divided into several ranges.
    - Purchase Date: The timestamp of when the transaction took place.
    - Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
    - Discount Availed: Indicates whether the customer availed any discount (Yes/No).
    - Discount Name: Name of the discount applied (e.g., FESTIVE50).
    - Discount Amount (INR): The amount of discount availed by the customer.
    - Gross Amount: The total amount before applying any discount.
    - Net Amount: The final amount after applying the discount.
    - Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
    - Location: The city where the purchase took place.
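    As a minimal sketch of how a table with this schema can be synthesized (the original dataset was built with Python's Faker library; this illustration uses only the standard library, and the specific category, city, and age-group values below are hypothetical):

```python
import random
from datetime import datetime, timedelta

# Hypothetical value pools mirroring the column schema above; the dataset's
# actual categories, cities, and age ranges may differ.
CATEGORIES = ["Electronics", "Apparel", "Beauty", "Home"]
METHODS = ["Credit Card", "Debit Card", "UPI", "Net Banking"]
AGE_GROUPS = ["under 18", "18-25", "25-45", "45-60", "60 and above"]
CITIES = ["Mumbai", "Delhi", "Bangalore", "Chennai"]

def make_transaction(rng: random.Random) -> dict:
    """Generate one synthetic transaction row in the schema listed above."""
    gross = round(rng.uniform(100.0, 10000.0), 2)
    discounted = rng.random() < 0.4
    discount = round(gross * rng.uniform(0.05, 0.5), 2) if discounted else 0.0
    return {
        "CID": rng.randint(100000, 999999),
        "TID": rng.randint(10**9, 10**10 - 1),
        "Gender": rng.choice(["Male", "Female"]),
        "Age Group": rng.choice(AGE_GROUPS),
        "Purchase Date": datetime(2023, 1, 1) + timedelta(minutes=rng.randint(0, 525600)),
        "Product Category": rng.choice(CATEGORIES),
        "Discount Availed": "Yes" if discounted else "No",
        "Discount Name": "FESTIVE50" if discounted else "",
        "Discount Amount (INR)": discount,
        "Gross Amount": gross,
        "Net Amount": round(gross - discount, 2),
        "Purchase Method": rng.choice(METHODS),
        "Location": rng.choice(CITIES),
    }

rows = [make_transaction(random.Random(i)) for i in range(5)]
print(rows[0]["Gross Amount"], rows[0]["Net Amount"])
```

    A library like Faker mainly adds realistic names, addresses, and timestamps on top of this kind of loop; the row structure is the same either way.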

    Use Cases:
    1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
    2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
    3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
    4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.

  2. Exploratory Data Analysis and the Future, with glue

    • dataverse.harvard.edu
    Updated Sep 14, 2023
    Cite
    Alyssa Goodman (2023). Exploratory Data Analysis and the Future, with glue [Dataset]. http://doi.org/10.7910/DVN/SQSNM4
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 14, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Alyssa Goodman
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Presentation Date: Sunday, January 8, 2023
    Location: Seattle, Washington, USA
    Abstract: A talk introducing the glue software and its use in astronomy, given at the 2023 AAS meeting. Files included are Keynote slides (in .key and .pdf formats).

  3. Healthcare Device Data Analysis with R

    • kaggle.com
    zip
    Updated Oct 7, 2021
    Cite
    stanley888cy (2021). Healthcare Device Data Analysis with R [Dataset]. https://www.kaggle.com/stanley888cy/google-project-02
    Available download formats: zip (353177 bytes)
    Dataset updated
    Oct 7, 2021
    Authors
    stanley888cy
    Description

    Context

    Hi. This is my data analysis project and also my first try at using R in my work. It is the capstone project for the Google Data Analytics Certificate course offered on Coursera (https://www.coursera.org/professional-certificates/google-data-analytics). It is an analysis of operational data from a health monitoring device. For the detailed background story, please check the PDF file (Case 02.pdf) for reference.

    In this case study, I use personal health tracker data from Fitbit to evaluate how people use a health tracker device, and then determine if there are any trends or patterns.

    My data analysis focuses on two areas: exercise activity and sleeping habits. The exercise part studies the relationship between activity type and calories consumed, while the sleep part identifies patterns in users' sleeping. In this analysis, I also try some linear regression models, so that the data can be explained in a quantitative way and predictions made more easily.

    I know I am new to data analysis and my skills and code are at a beginner level, but I am working hard to learn more in both R and the data science field. If you have any ideas or feedback, please feel free to comment.

    Stanley Cheng 2021-10-07

  4. Cleaned CIC PDF-Malware 2022 Dataset

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    Satyaprakash Sethy (2023). Cleaned CIC PDF-Malware 2022 Dataset [Dataset]. https://www.kaggle.com/datasets/satyaprakash138/cleaned-cic-pdf-malware-2022-dataset
    Available download formats: zip (637257 bytes)
    Dataset updated
    Nov 30, 2023
    Authors
    Satyaprakash Sethy
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Welcome to the CIC PDF-Malware 2022 dataset! This dataset is meticulously cleaned and curated to support research and development in the field of malware detection within PDF files. The dataset offers a valuable resource for machine learning practitioners, researchers, and data scientists working on cybersecurity projects.

    Dataset Overview: The CIC PDF-Malware 2022 dataset comprises a comprehensive collection of features extracted from PDF files, both benign and malicious. It has been thoroughly cleaned to ensure high quality and consistency. Each entry in the dataset includes detailed attributes that can be leveraged for training and testing machine learning models aimed at detecting malware embedded in PDFs.

    Key Features:

    - Feature-Rich Data: Includes various attributes related to PDF files, making it suitable for in-depth analysis and model training.
    - Cleaned and Curated: The dataset has been meticulously cleaned to remove inconsistencies and errors, ensuring reliability and accuracy.
    - Visualizations: We provide insightful visualizations to help understand the dataset's characteristics and distribution.
    - Usage: To facilitate easy utilization of the dataset, we have included example code and tutorials demonstrating how to load and analyze the data. These resources will help you get started quickly and effectively.

    Why This Dataset is Valuable:

    - Research and Development: Ideal for researchers and practitioners focused on enhancing malware detection mechanisms.
    - Benchmarking: Useful for benchmarking new algorithms and models in the context of PDF malware detection.
    - Community Engagement: Engage with the dataset through discussions and collaborative projects to advance cybersecurity research.

    Getting Started:

    1. Download the dataset and explore the included examples and tutorials.
    2. Use the provided visualizations to gain insights into the dataset’s structure and attributes.
    3. Share your findings, contribute to discussions, and collaborate with other Kaggle users to maximize the impact of this dataset.

    Feel free to reach out with any questions or feedback. We look forward to seeing how you utilize this dataset to advance the field of malware detection!
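    The description mentions example code for loading and analyzing the data. As a minimal sketch with pandas (the inline sample rows and the column names `pdf_size`, `metadata_size`, `xref_length`, `obj_count`, `js_count`, and `class` are hypothetical stand-ins, since the dataset's exact schema is not shown here):

```python
import io

import pandas as pd

# Hypothetical sample mimicking a PDF-feature table; swap in pd.read_csv("<file>.csv")
# for the real dataset, whose column names may differ.
sample = io.StringIO(
    "pdf_size,metadata_size,xref_length,obj_count,js_count,class\n"
    "4321,512,12,40,0,benign\n"
    "9876,128,30,95,3,malicious\n"
    "1500,256,8,22,0,benign\n"
)
df = pd.read_csv(sample)

# Separate features from the label before feeding a model.
X = df.drop(columns=["class"])
y = (df["class"] == "malicious").astype(int)
print(X.shape, y.tolist())
```

    From here, `df.describe()` and `df["class"].value_counts()` are the usual first checks for class balance and feature ranges.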

  5. Data Sheet 1_Human perceptions of social robot deception behaviors: an...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Sep 5, 2024
    Cite
    Malle, Bertram F.; Kelly, Harris; Dula, Elizabeth; Rosero, Andres; Phillips, Elizabeth K. (2024). Data Sheet 1_Human perceptions of social robot deception behaviors: an exploratory analysis.pdf [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001308044
    Dataset updated
    Sep 5, 2024
    Authors
    Malle, Bertram F.; Kelly, Harris; Dula, Elizabeth; Rosero, Andres; Phillips, Elizabeth K.
    Description

    Introduction: Robots are being introduced into increasingly social environments. As these robots become more ingrained in social spaces, they will have to abide by the social norms that guide human interactions. At times, however, robots will violate norms and perhaps even deceive their human interaction partners. This study provides some of the first evidence for how people perceive and evaluate robot deception, especially three types of deception behaviors theorized in the technology ethics literature: external state deception (cues that intentionally misrepresent or omit details from the external world, e.g., lying), hidden state deception (cues designed to conceal or obscure the presence of a capacity or internal state the robot possesses), and superficial state deception (cues that suggest a robot has some capacity or internal state that it lacks).

    Methods: Participants (N = 498) were assigned to read one of three vignettes, each corresponding to one of the deceptive behavior types. Participants provided responses to qualitative and quantitative measures, which examined to what degree people approved of the behaviors, perceived them to be deceptive, found them to be justified, and believed that other agents were involved in the robots’ deceptive behavior.

    Results: Participants rated hidden state deception as the most deceptive and approved of it the least among the three deception types. They considered external state and superficial state deception behaviors to be comparably deceptive; but while external state deception was generally approved, superficial state deception was not. Participants in the hidden state condition often implicated agents other than the robot in the deception.

    Conclusion: This study provides some of the first evidence for how people perceive and evaluate the deceptiveness of robot deception behavior types. It found that people distinguish among the three types of deception behaviors, see them as differently deceptive, and approve of them differently. They also see at least hidden state deception as stemming more from the designers than from the robot itself.

  6. Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum...

    • zenodo.org
    zip
    Updated Sep 29, 2025
    Cite
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio (2025). Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum Anomalies in the 10-15 GeV Range [Dataset]. http://doi.org/10.5281/zenodo.17220766
    Available download formats: zip
    Dataset updated
    Sep 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the results of an exploratory analysis of CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), focusing on the dimuon invariant mass spectrum in the 10-15 GeV range. The analysis investigates potential anomalies at 11.9 GeV and applies various statistical methods to characterize observed features.

    Methodology:

    • Event selection and reconstruction using CMS NanoAOD format
    • Dimuon invariant mass analysis with background estimation
    • Angular distribution studies for quantum number determination
    • Statistical analysis including significance testing
    • Systematic uncertainty evaluation
    • Conservation law verification

    Key Analysis Components:

    • Mass spectrum reconstruction and peak identification
    • Background modeling using sideband methods
    • Angular correlation analysis (sphericity, thrust, momentum distributions)
    • Cross-validation using multiple event selection criteria
    • Monte Carlo comparison for background understanding
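    The mass spectrum reconstruction step above builds on the dimuon invariant mass. A minimal NumPy sketch of that computation, using the (pt, eta, phi, mass) kinematics convention of CMS NanoAOD (the specific muon values below are illustrative, not taken from the dataset):

```python
import numpy as np

def four_vector(pt, eta, phi, m):
    """Build (E, px, py, pz) from collider kinematics (pt, eta, phi, mass)."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    E = np.sqrt(px**2 + py**2 + pz**2 + m**2)
    return np.array([E, px, py, pz])

def invariant_mass(p1, p2):
    """m^2 = (E1+E2)^2 - |p1+p2|^2 for the summed four-vector."""
    s = p1 + p2
    return np.sqrt(max(s[0]**2 - s[1]**2 - s[2]**2 - s[3]**2, 0.0))

MU_MASS = 0.1057  # GeV
# Illustrative back-to-back muon pair; a real analysis loops over NanoAOD events.
mu1 = four_vector(pt=6.0, eta=0.3, phi=0.0, m=MU_MASS)
mu2 = four_vector(pt=6.0, eta=-0.3, phi=np.pi, m=MU_MASS)
print(f"dimuon mass: {invariant_mass(mu1, mu2):.2f} GeV")  # ~12.55 GeV for this pair
```

    Histogramming this quantity over all opposite-charge muon pairs yields the mass spectrum in which the 10-15 GeV region is examined.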

    Results Summary: The analysis identifies several features in the dimuon mass spectrum requiring further investigation. Preliminary observations suggest potential anomalies around 11.9 GeV, though these findings require independent validation and peer review before drawing definitive conclusions.

    Data Products:

    • Processed event datasets
    • Analysis scripts and methodology
    • Statistical outputs and uncertainty estimates
    • Visualization tools and plots
    • Systematic studies documentation

    Limitations: This work represents preliminary exploratory analysis. Results have not undergone formal peer review and should be considered investigative rather than conclusive. Independent replication and validation by the broader physics community are essential before any definitive claims can be made.

    Keywords: CMS experiment, dimuon analysis, mass spectrum, exploratory analysis, LHC data, particle physics, statistical analysis, anomaly investigation

    # Dark Photon Search at 11.9 GeV

    ## Executive Summary

    **A Historic Search for First Evidence of a Massive Dark Photon**

    We report a search for a new vector gauge boson at 11.9 GeV, identified as a dark photon (A'), which would represent the first confirmed portal anomaly between the Standard Model and a hidden sector. This search, based on CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), investigates a potential signature of physics beyond the Standard Model.

    ## Search Highlights

    ### Anomaly Properties
    - **Mass**: 11.9 ± 0.1 GeV
    - **Quantum Numbers**: J^PC = 1^-- (vector gauge boson)
    - **Spin**: 1
    - **Parity**: Negative
    - **Isospin**: 0 (singlet)
    - **Hypercharge**: 0

    ### Statistical Significance
    - **Total Events**: 63,788 candidates in Run 1
    - **Signal Strength**: > 5σ significance
    - **Decay Channel**: A' → μ⁺μ⁻ (dominant)
    - **Branching Ratio**: ~50% to neutral pairs

    ### Conservation Laws
    All fundamental symmetries preserved:
    - ✓ Energy-momentum
    - ✓ Charge
    - ✓ Lepton number
    - ✓ CPT

    ## Project Structure

    ```
    search/
    ├── README.md                 # This file
    ├── docs/
    │   ├── paper/                # Main search paper
    │   │   ├── manuscript.tex    # LaTeX source
    │   │   ├── abstract.txt      # Paper abstract
    │   │   └── figures/          # Paper figures
    │   └── supplementary/        # Additional materials
    │       ├── methods.pdf       # Detailed methodology
    │       ├── systematics.pdf   # Systematic uncertainties
    │       └── theory.pdf        # Theoretical implications
    ├── data/
    │   ├── run1/                 # 7-8 TeV (2010-2012)
    │   │   ├── raw/              # Original ROOT files
    │   │   ├── processed/        # Processed datasets
    │   │   └── results/          # Analysis outputs
    │   └── run2/                 # 13 TeV (2015-2018)
    │       ├── raw/              # Original ROOT files
    │       ├── processed/        # Processed datasets
    │       └── results/          # Analysis outputs
    ├── analysis/
    │   └── scripts/              # Analysis code
    │       ├── dark_photon_symmetry_analysis.py
    │       ├── hidden_sector_10_150_search.py
    │       ├── hidden_10_15_gev_analysis.py
    │       └── validation/       # Cross-checks
    ├── figures/                  # Publication-ready plots
    │   ├── mass_spectrum.png     # Invariant mass distribution
    │   ├── angular_dist.png      # Angular distributions
    │   ├── symmetry_plots.png    # Symmetry analysis
    │   └── cascade_spectrum.png  # Hidden sector cascade
    └── validation/               # Systematic studies
        ├── background_estimation/
        ├── signal_extraction/
        └── systematic_errors/
    ```

    ## Key Evidence

    ### 1. Quantum Number Determination
    - **Angular Distribution**: ⟨|P₁|⟩ = 0.805 (strong anisotropy)
    - **Quadrupole Moment**: ⟨P₂⟩ = 0.573 (non-zero)
    - **Anomaly Type Score**: Vector = 90/100 (Preliminary)

    ### 2. Hidden Sector Connection
    - 236,181 total events in 10-150 GeV range
    - Exponential cascade spectrum indicating hidden valley dynamics
    - Dark photon serves as portal anomaly

    ### 3. Decay Topology
    - **Sphericity**: 0.161 (jet-like)
    - **Thrust**: 0.686 (moderate collimation)
    - Consistent with two-body decay A' → μ⁺μ⁻

    ## Physical Interpretation

    If confirmed, the anomaly targeted by this search would represent:
    1. **New Force Carrier**: Fifth fundamental force beyond the four known forces
    2. **Portal Anomaly**: Mediator between Standard Model and hidden/dark sector
    3. **Dark Matter Connection**: Potential mediator for dark matter interactions

    ## Theoretical Framework

    ### Kinetic Mixing
    The dark photon arises from kinetic mixing between U(1)_Y (hypercharge) and U(1)_D (dark charge):
    ```
    L_mix = -(ε/2) F^Y_μν F_D^μν
    ```
    where ε is the mixing parameter (~10^-3 based on observed coupling).

    ### Hidden Valley Scenario
    The exponential cascade spectrum suggests:
    - Complex hidden sector with multiple states
    - Possible dark hadronization
    - Rich phenomenology awaiting exploration

    ## Collaborators and Credits

    **Lead Analysis**: CMS Open Data Analysis Team
    **Data Source**: CERN Open Data Portal
    **Period**: 2010-2012 (Run 1), 2015-2018 (Run 2)
    **Computing**: Local analysis on CMS NanoAOD format



    ## How to Reproduce

    ### Requirements
    ```bash
    pip install uproot awkward numpy matplotlib
    ```

    ### Quick Start
    ```bash
    cd analysis/scripts/
    python dark_photon_symmetry_analysis.py
    python hidden_10_15_gev_analysis.py
    ```

    ## Significance Statement

    This search targets what would be the first confirmed evidence of a portal anomaly connecting the Standard Model to a hidden sector. An 11.9 GeV dark photon would open an entirely new frontier in dark sector physics, providing experimental access to previously invisible physics and potentially explaining dark matter interactions.

    ## Contact

    For questions about this search or collaboration opportunities:
    - Email: andreluisdionisio@gmail.com

    ---

    "We're not at the end of particle physics - we're at the beginning of dark sector physics!"

    3665778186 00382C40-4D7F-E211-AD6F-003048FFCBFC.root
    2581315530 0E5F189B-5D7F-E211-9423-002354EF3BE1.root
    2149825126 1AE176AC-5A7F-E211-8E63-00261894397D.root
    1792851725 2044D46B-DE7F-E211-9C82-003048FFD76E.root
    3186214416 4CAE8D51-4A7F-E211-9937-0025905964A2.root
    3220923349 72FDEF89-497F-E211-9CFA-002618943958.root
    2555255008 7A35A5A2-547F-E211-940B-003048678DA2.root
    3875410897 7E942EED-457F-E211-938E-002618FDA28E.root
    2409745919 8406DE2F-407F-E211-A6A5-00261894395F.root
    2421251748 8A61DAA8-3C7F-E211-94A6-002618943940.root
    2315643699 98909097-417F-E211-9009-002618943838.root
    2614932091 A0963AD9-567F-E211-A8AF-002618943901.root
    2438057881 ACE2DF9A-477F-E211-9C29-003048679266.root
    2206652387 B6AA897F-467F-E211-8381-002618943854.root
    2365666837 C09519C8-4B7F-E211-9BCE-003048678B34.root
    2477336101 C68AE3A5-447F-E211-928E-00261894388B.root
    2556444022 C6CEC369-437F-E211-81B0-0026189438BD.root
    3184171088 D60FF379-4E7F-E211-8BA4-002590593878.root
    2381001693

  7. Data_Sheet_1_Clustering for Automated Exploratory Pattern Discovery in...

    • frontiersin.figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Tom Menaker; Joke Monteny; Lin Op de Beeck; Anna Zamansky (2023). Data_Sheet_1_Clustering for Automated Exploratory Pattern Discovery in Animal Behavioral Data.pdf [Dataset]. http://doi.org/10.3389/fvets.2022.884437.s001
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Tom Menaker; Joke Monteny; Lin Op de Beeck; Anna Zamansky
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traditional methods of data analysis in animal behavior research are usually based on measuring behavior by manually coding a set of chosen behavioral parameters, which is naturally prone to human bias and error, and is also a tedious, labor-intensive task. Machine learning techniques are increasingly applied to support researchers in this field, mostly in a supervised manner: for tracking animals, detecting landmarks, or recognizing actions. Unsupervised methods are increasingly used, but remain under-explored in the context of behavior studies and applied contexts such as behavioral testing of dogs. This study explores the potential of unsupervised approaches, such as clustering, for the automated discovery of patterns in data which have potential behavioral meaning. We aim to demonstrate that such patterns can be useful at exploratory stages of data analysis, before forming specific hypotheses. To this end, we propose a concrete method for grouping video trials of behavioral testing of animal individuals into clusters using a set of potentially relevant features. Using an example testing protocol, the “Stranger Test”, we compare the discovered clusters against the C-BARQ owner-based questionnaire, which is commonly used for dog behavioral trait assessment, showing that our method separates well between dogs with higher C-BARQ scores for stranger fear and those with lower scores. This demonstrates the potential of such a clustering approach for exploration prior to hypothesis forming and testing in behavioral research.

  8. Data_Sheet_10_A mathematical and exploratory data analysis of malaria...

    • frontiersin.figshare.com
    pdf
    Updated Jun 20, 2023
    Cite
    Michael O. Adeniyi; Oluwaseun R. Aderele; Olajumoke Y. Oludoun; Matthew I. Ekum; Maba B. Matadi; Segun I. Oke; Daniel Ntiamoah (2023). Data_Sheet_10_A mathematical and exploratory data analysis of malaria disease transmission through blood transfusion.PDF [Dataset]. http://doi.org/10.3389/fams.2023.1105543.s002
    Available download formats: pdf
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Michael O. Adeniyi; Oluwaseun R. Aderele; Olajumoke Y. Oludoun; Matthew I. Ekum; Maba B. Matadi; Segun I. Oke; Daniel Ntiamoah
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Malaria is a mosquito-borne disease spread by an infected vector (an infected female Anopheles mosquito) or through transfusion of Plasmodium-infected blood to susceptible individuals. The disease burden has resulted in high global mortality, particularly among children under the age of five. Many intervention responses have been implemented to control malaria transmission, including blood screening, Long-Lasting Insecticide Bed Nets (LLIN), treatment with anti-malaria drugs, spraying chemicals/pesticides on mosquito breeding sites, and indoor residual spray, among others.

    As a result, an SIR (Susceptible—Infected—Recovered) model was developed to study the impact of various malaria control and mitigation strategies. The associated basic reproduction number and stability theory are used to investigate the stability of the model's equilibrium points. By constructing an appropriate Lyapunov function, the global stability of the malaria-free equilibrium is investigated. By determining the direction of bifurcation, the implicit function theorem is used to investigate the stability of the model's endemic equilibrium. The model is fitted to malaria data from Benue State, Nigeria, using R and MATLAB, and parameter estimates were made. Following that, an optimal control model is developed and analyzed using Pontryagin's Maximum Principle.

    The malaria-free equilibrium point is locally and globally stable if the basic reproduction number (R0) and the blood transfusion reproduction number (Rα) are both less than or equal to unity. The study of the model's sensitive parameters revealed that the mosquito-to-human transmission rate (βmh), the human-to-mosquito transmission rate (βhm), the blood transfusion reproduction number (Rα), and the mosquito recruitment rate (bm) are all sensitive parameters capable of increasing the basic reproduction number (R0), thereby increasing the risk of spreading malaria.

    The results of the optimal control show that five possible controls are effective in reducing malaria transmission. The study recommends the combination of five controls, followed by combinations of four and three controls, as effective in mitigating malaria transmission. The optimal simulation also revealed that for communities or areas where resources are scarce, the combination of Long-Lasting Insecticide-Treated Bednets (u2), treatment (u3), and indoor insecticide spray (u5) is recommended. Numerical simulations are performed to validate the model's analytical results.
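    The coupled human-vector SIR structure described above can be sketched as follows. This is a generic illustration, not the paper's exact formulation: βmh, βhm, and bm follow the description, while the recruitment, death, and recovery rates (Λ_h, μ_h, μ_m, γ) and the blood-transfusion term α (which gives rise to Rα) are placeholder symbols.

```latex
\begin{aligned}
\frac{dS_h}{dt} &= \Lambda_h - \beta_{mh}\, S_h I_m - \alpha\, S_h I_h - \mu_h S_h,\\
\frac{dI_h}{dt} &= \beta_{mh}\, S_h I_m + \alpha\, S_h I_h - (\gamma + \mu_h)\, I_h,\\
\frac{dR_h}{dt} &= \gamma I_h - \mu_h R_h,\\
\frac{dS_m}{dt} &= b_m - \beta_{hm}\, S_m I_h - \mu_m S_m,\\
\frac{dI_m}{dt} &= \beta_{hm}\, S_m I_h - \mu_m I_m.
\end{aligned}
```

    The basic reproduction number R0 of such a system is obtained from the next-generation matrix at the disease-free equilibrium, which is the quantity whose threshold at unity the abstract refers to.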

  9. New 1000 Sales Records Data 2

    • kaggle.com
    zip
    Updated Jan 12, 2023
    Cite
    Calvin Oko Mensah (2023). New 1000 Sales Records Data 2 [Dataset]. https://www.kaggle.com/datasets/calvinokomensah/new-1000-sales-records-data-2
    Available download formats: zip (49305 bytes)
    Dataset updated
    Jan 12, 2023
    Authors
    Calvin Oko Mensah
    Description

    This is a dataset downloaded from excelbianalytics.com, created from random VBA logic. I recently performed an extensive exploratory data analysis on it and added new columns, namely Unit margin, Order year, Order month, Order weekday, and Order_Ship_Days, which I think can help with analysis of the data. I shared it because I thought it was a great dataset for newbies like myself to practice analytical processes on.
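    As an illustration of the derived columns listed above, a minimal pandas sketch (the two inline rows are made up, and the base column names `Unit Price`, `Unit Cost`, `Order Date`, and `Ship Date` are assumptions in the style of the excelbianalytics.com sales records):

```python
import pandas as pd

# Hypothetical rows standing in for the sales records file.
df = pd.DataFrame({
    "Unit Price": [9.33, 154.06],
    "Unit Cost": [6.92, 90.93],
    "Order Date": pd.to_datetime(["2022-03-05", "2022-11-20"]),
    "Ship Date": pd.to_datetime(["2022-03-12", "2022-11-23"]),
})

# The derived columns described in the dataset description:
df["Unit margin"] = df["Unit Price"] - df["Unit Cost"]
df["Order year"] = df["Order Date"].dt.year
df["Order month"] = df["Order Date"].dt.month
df["Order weekday"] = df["Order Date"].dt.day_name()
df["Order_Ship_Days"] = (df["Ship Date"] - df["Order Date"]).dt.days
print(df[["Unit margin", "Order weekday", "Order_Ship_Days"]])
```

    Columns like these turn date arithmetic into plain groupby keys, which is what makes them handy for the EDA the description mentions.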

  10. Bitcoin Dataset for analysis

    • kaggle.com
    zip
    Updated Mar 29, 2022
    Cite
    Chrysanthi Kosyfaki (2022). Bitcoin Dataset for analysis [Dataset]. https://www.kaggle.com/datasets/chrysanthikosyfaki/bitcoin-dataset-for-analysis
    Available download formats: zip (451108569 bytes)
    Dataset updated
    Mar 29, 2022
    Authors
    Chrysanthi Kosyfaki
    Description

    This dataset includes 45,588,785 transactions between 12,094,228 Bitcoin addresses in the Bitcoin network up to 2013-12-28. We preprocessed the original data by adding synthetic identifiers to addresses and merging addresses that appeared to belong to the same user. Each interaction records the sender address, the destination address, a timestamp, and the transferred quantity in BTC.
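    A minimal sketch of working with an edge list in the schema just described, i.e. (sender, destination, timestamp, BTC amount) tuples. The rows and identifiers here are hypothetical, and only the standard library is used:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical interactions in the dataset's schema:
# (sender_id, receiver_id, timestamp, amount_btc).
transactions = [
    (1, 2, datetime(2013, 12, 1, 9, 30), 0.50),
    (1, 3, datetime(2013, 12, 2, 14, 0), 1.25),
    (2, 3, datetime(2013, 12, 5, 8, 15), 0.75),
]

# Aggregate flow per address, a typical first step in temporal network analysis.
sent_total = defaultdict(float)
received_total = defaultdict(float)
for sender, receiver, ts, btc in transactions:
    sent_total[sender] += btc
    received_total[receiver] += btc

print(dict(sent_total), dict(received_total))
```

    At the dataset's real scale (~45M edges), the same aggregation would be done in a streaming pass or with a columnar library rather than an in-memory list.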

    If you're going to use this dataset, please cite our paper: Chrysanthi Kosyfaki, Nikos Mamoulis, "Provenance in Temporal Interaction Networks", 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, May 2022. https://www.cs.uoi.gr/~nikos/icde22.pdf

  11. glue-ing together the Milky Way

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Goodman, Alyssa (2024). glue-ing together the Milky Way [Dataset]. http://doi.org/10.7910/DVN/7B28MS
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Goodman, Alyssa
    Description

    Presentation Date: Wednesday, June 28, 2023
    Location: Center for Astrophysics, Cambridge, MA
    Abstract: A demonstration, given at the 2023 New England Star and Planet Formation Workshop, of how the glue exploratory data analysis software has helped with recent discoveries about the structure of the local Milky Way. Files included are Keynote slides (in .key and .pdf formats).

  12. Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in...

    • data.wu.ac.at
    zip
    Updated Mar 6, 2018
    Cite
    HarvestMaster (2018). Appalachian Basin Play Fairway Analysis: Thermal Quality Analysis in Low-Temperature Geothermal Play Fairway Analysis (GPFA-AB) ThermalQualityAnalysisThermalResourceInterpolationResultsArcGISToolbox.zip [Dataset]. https://data.wu.ac.at/schema/geothermaldata_org/ODcxNmYzNDgtMTM2Zi00MGMxLWJiOTUtMzJhY2U1MTkzMDMz
    Available download formats: zip
    Dataset updated
    Mar 6, 2018
    Dataset provided by
    HarvestMaster
    Description

    This collection of files is part of a larger dataset uploaded in support of Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin (GPFA-AB, DOE Project DE-EE0006726). Phase 1 of the GPFA-AB project identified potential Geothermal Play Fairways within the Appalachian basin of Pennsylvania, West Virginia and New York. This was accomplished through analysis of four key criteria: thermal quality, natural reservoir productivity, risk of seismicity, and heat utilization. Each of these analyses represents a distinct project task, with the fifth task encompassing the combination of the four risk factors. Supporting data for all five tasks has been uploaded into the Geothermal Data Repository node of the National Geothermal Data System (NGDS).

    This submission comprises the data for Thermal Quality Analysis (project task 1) and includes all of the necessary shapefiles, rasters, datasets, code, and references to code repositories that were used to create the thermal resource and risk factor maps as part of the GPFA-AB project. The identified Geothermal Play Fairways are also provided with the larger dataset. Figures (.png) are provided as examples of the shapefiles and rasters. The regional standardized 1 square km grid used in the project is also provided as points (cell centers), polygons, and as a raster. Two ArcGIS toolboxes are available: 1) RegionalGridModels.tbx for creating resource and risk factor maps on the standardized grid, and 2) ThermalRiskFactorModels.tbx for use in making the thermal resource maps and cross sections. These toolboxes contain item description documentation for each model within the toolbox, and for the toolbox itself. This submission also contains three R scripts: 1) AddNewSeisFields.R to add seismic risk data to attribute tables of seismic risk, 2) StratifiedKrigingInterpolation.R for the interpolations used in the thermal resource analysis, and 3) LeaveOneOutCrossValidation.R for the cross validations used in the thermal interpolations.

    Some file descriptions make reference to various 'memos'. These are contained within the final report submitted October 16, 2015.

    Each zipped file in the submission contains an 'about' document describing the full Thermal Quality Analysis content available, along with key sources, authors, citation, use guidelines, and assumptions, with the specific file(s) contained within the .zip file highlighted.

    UPDATE: Newer version of the Thermal Quality Analysis has been added here: https://gdr.openei.org/submissions/879 (Also linked below) Newer version of the Combined Risk Factor Analysis has been added here: https://gdr.openei.org/submissions/880 (Also linked below) This is one of sixteen associated .zip files relating to thermal resource interpolation results within the Thermal Quality Analysis task of the Low Temperature Geothermal Play Fairway Analysis for the Appalachian Basin. This file contains an ArcGIS Toolbox with 6 ArcGIS Models: WellClipsToWormsSections, BufferedRasterToClippedRaster, ExtractThermalPropertiesToCrossSection, AddExtraInfoToCrossSection, and CrossSectionExtraction.

    The sixteen files contain the results of the thermal resource interpolation as binary grid (raster) files, images (.png) of the rasters, and a toolbox of the ArcGIS Models used. Note that raster files ending in “pred” are the predicted mean for that resource, and files ending in “err” are the standard error of the predicted mean for that resource. Leave-one-out cross validation results are provided for each thermal resource.

    Several models were built in order to process the well database with outliers removed. The ArcGIS toolbox ThermalRiskFactorModels contains the ArcGIS processing tools used. First, the WellClipsToWormSections model was used to clip the wells to the worm sections (interpolation regions). Then, the 1 square km gridded regions (see the series of 14 Worm Based Interpolation Boundaries .zip files), along with the wells in those regions, were loaded into R using the rgdal package. A stratified kriging algorithm implemented in the R gstat package was then used to create rasters of the predicted mean and the standard error of the predicted mean. The code used to make these rasters is StratifiedKrigingInterpolation.R. Details about the interpolation and exploratory data analysis on the well data are provided in 9_GPFA-AB_InterpolationThermalFieldEstimation.pdf (Smith, 2015), contained within the final report.

    The output rasters from R are brought into ArcGIS for further spatial processing. First, the BufferedRasterToClippedRaster tool is used to clip the interpolations back to the Worm Sections. Then, the Mosaic tool in ArcGIS is used to merge all predicted mean rasters into a single raster, and all error rasters into a single raster for each thermal resource.

    A leave one out cross validation was performed on each of the thermal resources. The code used to implement the cross validation is provided in the R script LeaveOneOutCrossValidation.R. The results of the cross validation are given for each thermal resource.
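The leave-one-out procedure itself is simple to sketch. Below is a minimal Python analogue — the project's actual implementation is the R script LeaveOneOutCrossValidation.R with stratified kriging; here, inverse-distance weighting stands in for the kriging interpolator, and the well coordinates and temperatures are made up for illustration:

```python
import math

def idw_predict(train, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (xi, yi, value) samples."""
    num = den = 0.0
    for xi, yi, v in train:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            return v  # exact hit: return the observed value
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

def leave_one_out_errors(samples):
    """For each well, predict its value from all the others and record the error."""
    errors = []
    for i, (x, y, v) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        errors.append(idw_predict(train, x, y) - v)
    return errors

# Hypothetical well temperatures (x_km, y_km, degC) on a small grid
wells = [(0, 0, 30.0), (1, 0, 32.0), (0, 1, 31.0), (1, 1, 33.0), (0.5, 0.5, 31.5)]
errs = leave_one_out_errors(wells)
rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
```

The root-mean-square of the held-out errors gives a single summary of interpolation quality per thermal resource.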

    Other tools provided in this toolbox are useful for creating cross sections of the thermal resource. The ExtractThermalPropertiesToCrossSection model extracts the predicted mean and the standard error of the predicted mean to the attribute table of a line of cross section. The AddExtraInfoToCrossSection model is then used to add any other desired information, such as state and county boundaries, to the cross section attribute table. These two steps can also be run as a single function, as provided by the CrossSectionExtraction model.
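The cross-section extraction step amounts to sampling raster values along a line. A hedged Python sketch of that idea (nearest-cell sampling on a toy grid; the actual models operate on ArcGIS rasters and attribute tables, not Python lists):

```python
def sample_along_line(raster, x0, y0, x1, y1, n, cell=1.0):
    """Sample a row-major grid (list of rows) at n evenly spaced points on the
    segment (x0, y0)-(x1, y1) using nearest-cell lookup; coordinates are in
    the same units as the cell size."""
    samples = []
    for k in range(n):
        t = k / (n - 1)
        x = x0 + t * (x1 - x0)
        y = y0 + t * (y1 - y0)
        row = min(len(raster) - 1, max(0, round(y / cell)))
        col = min(len(raster[0]) - 1, max(0, round(x / cell)))
        samples.append(raster[row][col])
    return samples

# Hypothetical 3x4 predicted-mean raster (values are illustrative only)
pred = [[2.0, 2.1, 2.2, 2.3],
        [2.1, 2.2, 2.3, 2.4],
        [2.2, 2.3, 2.4, 2.5]]
profile = sample_along_line(pred, 0, 0, 3, 2, 4)
```

The resulting profile is what would populate the cross-section attribute table, one value per sampled station along the line.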

  13. Code and training dataset for the publication entitled: "A combined microfluidic deep learning approach for lung cancer cell high throughput screening toward automatic cancer screening applications"

    • researchdata.tuwien.at
    bin, zip
    Updated Mar 13, 2025
    Hadi Hashemzadeh; Seyedehsamaneh Shojaeilangari; Abdollah Allahverdi; Mario Rothbauer; Peter Ertl; Hossein Naderi-Manesh (2025). Code and training dataset for the publication entitled: "A combined microfluidic deep learning approach for lung cancer cell high throughput screening toward automatic cancer screening applications" [Dataset]. http://doi.org/10.48436/shgf6-h1h78
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Mar 13, 2025
    Dataset provided by
    TU Wien
    Authors
    Hadi Hashemzadeh; Seyedehsamaneh Shojaeilangari; Abdollah Allahverdi; Mario Rothbauer; Peter Ertl; Hossein Naderi-Manesh
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Experiment Data & Analysis

    Overview

    This repository contains raw data, code, and analysis scripts related to the experiments performed in ‘A combined microfluidic deep learning approach for lung cancer cell high throughput screening toward automatic cancer screening applications’. The data, code, and documentation are provided here to facilitate reproducible research and to enable further exploration and analysis of the experimental results.

    Repository Contents

    Analysis Code:

    Language: MATLAB 2020a or later with Deep Learning Toolbox

    Description: This repository contains MATLAB scripts for data preprocessing, deep learning-based classification, and visualization of lung cancer cell images. The scripts train convolutional neural networks (CNNs) to classify six lung cell lines, including normal and five cancer subtypes.

    Documentation:

    File: LungCancer_CellLine_Code.zip

    Description: This file provides exemplary code and sample images used for the machine learning approach.

    File: Supplementary information and instructions.pdf

    Description: This file provides instructions and a description of the individual steps from raw data to image analysis.

    File: Original Image data and Metadata Example - pc9.zip

    Description: This .zip container provides an example of raw data in a native .vsi file format with folders containing the .ets file, with metadata documentation of the imaging parameters for a microfluidic channel imaged with the IX83 microscope.

    File: Data augmentation documentation.docx (and Data augmentation documentation.pdf)

    Description: This document provides descriptions of how data augmentation was performed.

    File: Raw data.zip

    Description: This file contains image raw data.

    File: GrayCellData.rar

    Description: This file contains image data converted to grayscale images.

    File: CellData_Full.rar

    Description: This file contains RGB image data.
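The exact augmentations are described in the documentation files above. Purely as an illustration, here is a minimal Python sketch of two common image augmentations (horizontal flip and 90-degree rotation) on a nested-list image — the project's actual pipeline is in MATLAB and may use different transforms:

```python
def hflip(img):
    """Horizontal flip: reverse each row of a row-major image."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the image 90 degrees counterclockwise."""
    return [list(row) for row in zip(*img)][::-1]

def augment(img):
    """Return the original image plus two augmented variants."""
    return [img, hflip(img), rot90(img)]

# Toy 2x2 "image" of pixel intensities
img = [[1, 2],
       [3, 4]]
variants = augment(img)
```

Each labelled training image yields several variants this way, which is how augmentation enlarges a small microscopy dataset without new experiments.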

    Microfluidic cultivation protocol prior to imaging:

    Cell Lines: Normal lung cells and non-small-cell lung cancer cells (PC-9, SK-LU-1, H-1975, A-427, and A-549)

    Plate Format: Plasma-bonded and coated microfluidic chip platform fabricated from silicone sheets and sterile glass slides.

    Surface Coating

    Prior to cell seeding, the surface of the polydimethylsiloxane (PDMS) microfluidic chip was treated with collagen to enhance cell adhesion. A 0.1% (w/v) collagen solution was prepared using Type I collagen (derived from rat tail) dissolved in a 0.02 M acetic acid buffer. The PDMS surfaces were incubated with the collagen solution for 2 hours at room temperature to allow for proper coating. Following this, the chips were rinsed with phosphate-buffered saline (PBS) to remove any unbound collagen. Collagen, being a key extracellular matrix component, provides a conducive environment for cell attachment and proliferation. This surface modification was crucial for ensuring that the cells would adhere effectively to the microfluidic architecture, promoting optimal growth conditions. The collagen coating facilitated stronger cell-matrix interactions, thereby improving the overall experimental reliability and enabling accurate analysis of cell behavior in the microfluidic system.

    Seeding Density

    In this study, various cell types (lung normal cells and non-small cell lung cancer cells: PC-9, SK-LU-1, H-1975, A-427, and A-549) were cultured within a microfluidic chip designed with a total length of 75 mm and a width of 25 mm, featuring three separate chambers, each with a diameter of 900 μm. The seeding density was calculated to be approximately 5,000 cells/mL. Given the chamber dimensions, this density was optimized to ensure that the cells could achieve ~70% confluency within a reasonable timeframe while maintaining their viability and functionality. The initial seeding in a 25 cm² culture flask allowed for efficient expansion and preparation of the cells prior to their transfer to the microfluidic environment (the cell culture medium was DMEM or RPMI supplemented with 10% FBS and 1% PS).

    Cultivation Duration

    After trypsin treatment of cells cultured in a flask, the cells were allowed to adhere to the microfluidic chip for a duration of 48-72 hours post-injection. This incubation period was essential for the cells to establish stable adhesion to the collagen-coated surfaces, enabling them to regain their morphology and functionality. It ensured that the cellular environment within the microfluidic chambers mimicked in vivo conditions, allowing for proper cell spreading and growth.

    Medium Composition

    The medium utilized for cell cultivation consisted of DMEM (Dulbecco's Modified Eagle Medium) or RPMI-1640, supplemented with 10% fetal bovine serum (FBS) and 1% penicillin-streptomycin (PS), tailored to the specific cell types used. This composition was chosen to provide the necessary nutrients, growth factors, and antibiotics to support cell proliferation and prevent contamination. DMEM and RPMI are known to support a wide range of mammalian cell types, thereby enhancing the versatility of the experimental setup. The medium was pre-warmed to 37°C before use, and the cells were maintained in a humidified incubator at 37°C with 5% CO₂ during cultivation.

    Imaging Setup

    The imaging data was acquired using an automated IX83 microscope (Olympus, Japan), featuring a Merzhäuser motorized stage, a Hamamatsu ORCA-Flash4.0 camera, and a Lumencolor Spectra X fluorescent light source. This setup ensures high-resolution fluorescence imaging with precise stage control and sensitive image capture. Data was recorded automatically after adjustment of the z-axis using a multi-region area of interest on each microfluidic channel with the focus map function (medium density setting) with cellSens Dimension software (Version 2.1-2.3, Olympus). The DAPI staining of the blue fluorescence channel was used to facilitate large-area adjustment of the focus map prior to automated imaging. The green fluorescence channel representing the phalloidin staining of f-actin was used as a single channel exported images for the deep learning procedure outlined in the paper.

    Setup and Installation

    1. Extract the Raw Data:

    Unzip the Raw data.zip file into your working directory.

    2. Environment Setup:

    Read the documentation Supplementary information and instructions.pdf and the readme.txt in the code for more details on the setup.

    3. Running the Analysis:

    Open the file Supplementary information and instructions.pdf for a detailed description.

    Usage Instructions

    Data Exploration: The analysis scripts include functions for exploratory data analysis (EDA). You can modify these scripts to investigate specific experimental conditions.

    Reproducibility

    Follow the code comments and documentation to replicate the analyses. Ensure that the environment and dependencies are correctly configured as described in the setup section.

    Licensing

    This repository is licensed as follows: Code is accessible under BSD 2-Clause "Simplified" license and data under a Creative Commons Attribution 4.0 International license.

    Acknowledgement:

    This work was supported by the Iran National Science Foundation (INSF) Grant No. 96006759.

    Contact persons:

    For data acquisition:

    Abdollah Allahverdi, a-allahverdi@modares.ac.ir;

    Hadi Hashemzadeh, Hashemzadeh.hadi@gmail.com;

    Mario Rothbauer, mario.rothbauer@tuwien.ac.at

    For data processing and augmentation:

    Seyedehsamaneh Shojaei, s.shojaie@irost.ir, samane.shojaie@gmail.com

  14. India_Bihar_state_Agriculturedetails_dataset

    • kaggle.com
    zip
    Updated Sep 14, 2020
    aman2000jaiswal (2020). India_Bihar_state_Agriculturedetails_dataset [Dataset]. https://www.kaggle.com/aman2000jaiswal/india-bihar-state-agriculturedetails-dataset
    Explore at:
    Available download formats: zip (1413741 bytes)
    Dataset updated
    Sep 14, 2020
    Authors
    aman2000jaiswal
    Area covered
    Bihar, India
    Description

    Context

    This is an agriculture dataset for the Bihar state of India.

    Content

    The dataset contains 6 CSV files, 1 PDF, and 1 image file.

    Acknowledgements

    Inspiration

    column descriptors

  15. CDC BRFSS Survey 2021

    • kaggle.com
    Updated May 7, 2023
    d4riush (2023). CDC BRFSS Survey 2021 [Dataset]. https://www.kaggle.com/datasets/dariushbahrami/cdc-brfss-survey-2021/data
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 7, 2023
    Dataset provided by
    Kaggle
    Authors
    d4riush
    Description

    According to the CDC, BRFSS is:

    BRFSS is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. BRFSS collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

    To learn more about the data see the official page.

    Complete description about each column of the CSV file can be found in the codebook.

  16. Questionnaire data for assessing what determines urban residents' engagement in activities for the protection of urban green spaces (UGS)

    • zenodo.org
    • data.niaid.nih.gov
    bin, pdf
    Updated Jul 11, 2024
    Constantina-Alina Hossu; Martina Artmann; Tomomi Saito; Martina van Lierop; Ioan-Cristian Ioja; Stephan Pauleit (2024). Questionnaire data for assessing what determines urban residents' engagement in activities for the protection of urban green spaces (UGS) [Dataset]. http://doi.org/10.5281/zenodo.8318495
    Explore at:
    Available download formats: pdf, bin
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Constantina-Alina Hossu; Martina Artmann; Tomomi Saito; Martina van Lierop; Ioan-Cristian Ioja; Stephan Pauleit
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the core data used to investigate what determines Munich’s residents’ engagement in activities for the protection of urban green spaces (UGS). We conducted an exploratory factor analysis and a structural equation modelling based on the data from an online and in-person questionnaire.

    List of data and content

    1. Questionnaire_English: The content of the questionnaire in English (*.pdf)
    2. Questionnaire_RawData_ Munchen_German: Core data from the questionnaire in German (*.xlsx)
    3. InputEFA1: Core data for the first exploratory factor analysis with the input for analysis in Mplus (*.xlsx)
    4. InputEFA2: Core data for the second exploratory factor analysis with the input for analysis in Mplus (*.xlsx)
    5. InputSEM: Core data for the structural equation modelling with the input for analysis in Mplus (*.xlsx)
    6. ResultsSEM: Results of the structural equation modelling (*.xlsx)
    7. InputCFA1: Core data for the first confirmatory factor analysis with the input for analysis in Mplus (*.xlsx)
    8. ResultsCFA1: Results of the confirmatory factor analysis (*.xlsx)
    9. InputCFA2: Core data for the second confirmatory factor analysis with the input for analysis in Mplus (*.xlsx)
    10. ResultsCFA2: Results of the confirmatory factor analysis (*.xlsx)
    11. Acronyms_Shortings: Acronyms and shortings used in the uploaded files (*.pdf)

    Data processing

    The software used for the exploratory factor analysis and structural equation modelling was Mplus 8.8 (Muthén and Muthén, 1998). Details on the methodological steps are available in the associated publication. To quickly understand the EFA, CFA, and SEM models and settings, details are provided in the hyperlinks.
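Mplus performs the actual estimation; purely as an illustration of what the first step of an exploratory factor analysis extracts, here is a small self-contained Python sketch that computes the leading eigenpair of a hypothetical correlation matrix by power iteration (first-factor loadings are proportional to this eigenvector; Mplus estimation differs in detail):

```python
def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def leading_factor(corr, iters=200):
    """Power iteration: dominant eigenvalue and eigenvector of a correlation
    matrix, with the vector scaled so its largest component is 1."""
    v = [1.0] * len(corr)
    lam = 0.0
    for _ in range(iters):
        w = matvec(corr, v)
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam, v

# Hypothetical correlation matrix for three engagement questionnaire items
corr = [[1.0, 0.6, 0.5],
        [0.6, 1.0, 0.4],
        [0.5, 0.4, 1.0]]
lam, loadings = leading_factor(corr)
```

A dominant eigenvalue well above 1 (here around 2) is the usual signal that the items share a common factor worth retaining.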

    Acknowledgments

    The authors thank all participants for taking part in the online and in-person questionnaires. Data processing and analysis would not have been possible without the help of Tomomi Saito.

    This work was supported by a grant from the Alexander von Humboldt Foundation and by the Leibniz Best Minds Competition, Leibniz-Junior Research Group, Grant J76/2019.

  17. Data from: Supplemental Material: "2D, 2.5D, or 3D? An Exploratory Study on Multilayer Network Visualizations in Virtual Reality"

    • darus.uni-stuttgart.de
    Updated Aug 1, 2023
    Stefan Paul Feyer; Bruno Pinaud; Stephen Kobourov; Nicolas Brich; Michael Krone; Andreas Kerren; Falk Schreiber; Karsten Klein (2023). Supplemental Material: "2D, 2.5D, or 3D? An Exploratory Study on Multilayer Network Visualizations in Virtual Reality" [Dataset]. http://doi.org/10.18419/DARUS-3387
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2023
    Dataset provided by
    DaRUS
    Authors
    Stefan Paul Feyer; Bruno Pinaud; Stephen Kobourov; Nicolas Brich; Michael Krone; Andreas Kerren; Falk Schreiber; Karsten Klein
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-3387

    Dataset funded by
    DFG
    NSF
    ELLIIT
    Description

    Dataset containing supplemental material for the publication "2D, 2.5D, or 3D? An Exploratory Study on Multilayer Network Visualizations in Virtual Reality". This dataset contains:

    1) archive containing all raw quantitative results,
    2) archive containing all raw qualitative data,
    3) archive containing the graphs used for the experiment (.graphml file format),
    4) the code to generate the graph library (C++ files using OGDF),
    5) a PDF document containing detailed results (with p-values and more charts),
    6) a video showing the experimentation from a participant's point of view,
    7) complete graph library generated by our graph generator for the experiment.

  18. Steam Video Game and Bundle Data

    • kaggle.com
    zip
    Updated Oct 29, 2023
    Ahmad (2023). Steam Video Game and Bundle Data [Dataset]. https://www.kaggle.com/datasets/pypiahmad/steam-video-game-and-bundle-data/data
    Explore at:
    Available download formats: zip (1464410453 bytes)
    Dataset updated
    Oct 29, 2023
    Authors
    Ahmad
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset encompasses reviews from the Steam video game platform along with information about bundled games. It entails user reviews, purchases, plays, recommendations, product bundles, and pricing information.

    Basic Statistics:
    - Reviews: 7,793,069
    - Users: 2,567,538
    - Items: 15,474
    - Bundles: 615

    Metadata:
    - Reviews
    - Purchases, Plays, Recommends ("likes")
    - Product Bundles
    - Pricing Information

    Example (Bundle):

        {
          'bundle_id': '1482',
          'bundle_name': 'Two Tribes Complete Pack!',
          'bundle_url': 'http://store.steampowered.com/bundle/1482/?utm_source=SteamDB...',
          'bundle_price': '$32.96',
          'bundle_discount': '10%',
          'bundle_final_price': '$29.66',
          'items': [
            {'item_id': '38700', 'item_name': 'Toki Tori', 'genre': 'Casual, Indie',
             'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38700'},
            {'item_id': '201420', 'item_name': 'Toki Tori 2+', 'genre': 'Adventure, Casual, Indie',
             'discounted_price': '$14.99', 'item_url': 'http://store.steampowered.com/app/201420'},
            {'item_id': '38720', 'item_name': 'RUSH', 'genre': 'Strategy, Indie, Casual',
             'discounted_price': '$4.99', 'item_url': 'http://store.steampowered.com/app/38720'},
            {'item_id': '38740', 'item_name': 'EDGE', 'genre': 'Action, Indie',
             'discounted_price': '$7.99', 'item_url': 'http://store.steampowered.com/app/38740'}
          ]
        }
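As a quick sanity check on the pricing fields in the bundle example, a small Python sketch (helper names are hypothetical, not part of the dataset) verifying that the items' discounted prices sum to the bundle price and that applying the bundle discount yields the final price:

```python
def parse_usd(s):
    """Parse a price string like '$29.66' into a float."""
    return float(s.lstrip("$"))

def bundle_final_price(item_prices, discount_pct):
    """Sum the items' discounted prices and apply the bundle-level discount."""
    total = sum(parse_usd(p) for p in item_prices)
    return round(total * (1 - discount_pct / 100), 2)

# Discounted prices of Toki Tori, Toki Tori 2+, RUSH, and EDGE from the example
items = ["$4.99", "$14.99", "$4.99", "$7.99"]
final = bundle_final_price(items, 10)  # 10% bundle_discount
```

For the Two Tribes bundle the item prices sum to $32.96 (the listed bundle_price), and the 10% discount reproduces the listed bundle_final_price of $29.66.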

    Citation:
    - Self-attentive sequential recommendation, Wang-Cheng Kang, Julian McAuley, ICDM, 2018 [pdf]
    - Item recommendation on monotonic behavior chains, Mengting Wan, Julian McAuley, RecSys, 2018 [pdf]
    - Generating and personalizing bundle recommendations on Steam, Apurva Pathak, Kshitiz Gupta, Julian McAuley, SIGIR, 2017 [pdf]

  19. Programming Assignment: Linear Regression

    • kaggle.com
    zip
    Updated Mar 12, 2021
    Abdelrahman Rezk (2021). Programming Assignment: Linear Regression [Dataset]. https://www.kaggle.com/abdelrahmanhrezk/programming-assignment-linear-regression
    Explore at:
    Available download formats: zip (2590 bytes)
    Dataset updated
    Mar 12, 2021
    Authors
    Abdelrahman Rezk
    Description

    This task is related to the Coursera Machine Learning course by Andrew Ng, but implemented in Python.

    Most of the text used in this notebook is from Coursera's ex1.pdf.

    Look at ex1.pdf to get more intuition about the task.

    The task is implemented in three ways across three notebooks, and it is all about linear regression:

    • As manual code, i.e. pure Python.
    • Using the scikit-learn library.
    • Using TensorFlow & Keras.

    linear regression with one variable

    In this part of this exercise, we will implement linear regression with one variable to predict profits for a food truck.

    Most of the code is written to be clean and is organized into functions.
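The manual variant can be sketched as batch gradient descent on a single feature. This is an illustrative sketch with made-up numbers, not the Coursera dataset:

```python
def fit_linear(xs, ys, lr=0.01, epochs=10000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of (1/2n) * sum((w*x + b - y)^2) w.r.t. w and b
        dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw
        b -= lr * db
    return w, b

# Hypothetical data: city population (10,000s) vs food-truck profit ($10,000s),
# generated exactly from y = 2x - 0.5 so the fit can be checked
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5, 3.5, 5.5, 7.5]
w, b = fit_linear(xs, ys)
```

The scikit-learn and Keras notebooks solve the same problem with `LinearRegression` and a single dense layer, respectively, which is why comparing the three is instructive.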

  20. NCAA 100 Freestyle 2015-2024

    • kaggle.com
    zip
    Updated Feb 16, 2025
    Justin R111 (2025). NCAA 100 Freestyle 2015-2024 [Dataset]. https://www.kaggle.com/datasets/justinr111/ncaa-100-freestyle-2015-2024
    Explore at:
    Available download formats: zip (12017 bytes)
    Dataset updated
    Feb 16, 2025
    Authors
    Justin R111
    Description

    This dataset contains race data from the past ten years of the NCAA men's 100 freestyle event. I collected this data using my own Python script, in which you follow along with a race by pressing the "Enter" button with each stroke. Upon completion, the script generates CSV and PDF files containing data from the race. I aggregated this data for the completion of my first project.

    In order to aggregate, organize, and visualize the data, I used a variety of software, including BigQuery (SQL), Python, Tableau, and Google Sheets. This project demonstrates my ability to use a range of data analysis tools.
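The author's collection script is not included in the dataset, but the core idea — timestamp each Enter press, derive a stroke rate, and write a CSV — can be sketched in Python (the column names here are hypothetical):

```python
import csv
import io

def stroke_rate(times):
    """Strokes per minute from a list of stroke timestamps in seconds."""
    if len(times) < 2:
        return 0.0
    return 60.0 * (len(times) - 1) / (times[-1] - times[0])

def to_csv(times):
    """Write one row per stroke: index, timestamp, gap to the previous stroke."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["stroke", "t_sec", "gap_sec"])
    for i, t in enumerate(times):
        gap = 0.0 if i == 0 else round(t - times[i - 1], 2)
        writer.writerow([i + 1, t, gap])
    return buf.getvalue()

# Hypothetical timestamps: 5 strokes over 4 seconds -> 60 strokes per minute
times = [10.0, 11.0, 12.0, 13.0, 14.0]
rate = stroke_rate(times)
table = to_csv(times)
```

A live version would capture `time.monotonic()` on each `input()` call instead of using a fixed list.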

Ecommerce Dataset for Data Analysis

Exploratory Data Analysis, Data Visualisation and Machine Learning


This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
