100+ datasets found
  1. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong, 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support arguments that inform the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the Deep Learning literature reported in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad et al., 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes that can be found in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to recover information missing after the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance. In other words, it helped us identify the number of clusters to be used when tuning the explainable models.
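
    For illustration, a minimal sketch of this step (scikit-learn and pandas are assumptions here; the published pipelines were built in RapidMiner, and the file name is a placeholder):

    ```python
    # Illustrative sketch only: project the 35 nominal features onto 2 principal
    # components and scan candidate cluster counts via the drop in within-cluster variance.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    papers = pd.read_csv("dl4se_papers.csv")       # hypothetical export of the 35 features
    X = pd.get_dummies(papers.astype(str))         # one-hot encode the nominal attributes

    coords = PCA(n_components=2).fit_transform(X)  # 2-D view for visualization

    for k in range(2, 10):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(f"k={k}: within-cluster variance={inertia:.1f}")
    ```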

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships in the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state of the art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used the knowledge discovery outcomes to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of records in which the whole rule (premise and conclusion together) holds, divided by the total number of records. Confidence = the support of the rule divided by the support of the premise, i.e., the fraction of records containing the premise that also contain the conclusion.
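
    As a small illustration of how these two quantities can be computed from the extracted feature table (a sketch only: the actual rules were mined in RapidMiner, and the boolean columns below are hypothetical):

    ```python
    # Sketch: compute Support and Confidence for one rule over a boolean feature table.
    # The column names ("supervised_learning", "irreproducible") are hypothetical.
    import pandas as pd

    papers = pd.DataFrame({
        "supervised_learning": [True, True, True, False, True, False],
        "irreproducible":      [True, True, False, False, True, True],
    })

    rule = papers["supervised_learning"] & papers["irreproducible"]

    support = rule.mean()                                          # fraction of papers where the whole rule holds
    confidence = rule.sum() / papers["supervised_learning"].sum()  # fraction of premise papers satisfying the conclusion

    print(f"Support = {support:.2f}, Confidence = {confidence:.2f}")
    ```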

  2. Exploratory data analysis of a clinical study group: Development of a...

    • plos.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański (2023). Exploratory data analysis of a clinical study group: Development of a procedure for exploring multidimensional data [Dataset]. http://doi.org/10.1371/journal.pone.0201950
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Thorough knowledge of the structure of analyzed data allows one to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Due to the multitude of available methods, selecting those which will work together well and facilitate data interpretation is not an easy task. In this work we present a well-fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward’s algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients that participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex hormone attributes. Further analysis was carried out separately for male and female patients. The optimal partitioning in the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male dataset. No evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD not only identifies outliers but can also assess the heterogeneity of a dataset. The case study proved that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.
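
    A minimal sketch of the generic techniques in this procedure (robust scaling, Mahalanobis-distance outlier screening, Ward clustering, PCA), shown on synthetic data with scipy/scikit-learn; this is not the authors' pipeline:

    ```python
    # Sketch on synthetic data: robust normalization, Mahalanobis outlier detection,
    # Ward hierarchical clustering, and a 2-D PCA projection for a biplot-style view.
    import numpy as np
    from scipy.spatial.distance import mahalanobis
    from scipy.stats import chi2
    from sklearn.preprocessing import RobustScaler
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(515, 40))                    # placeholder: 515 patients x 40 attributes

    Xs = RobustScaler().fit_transform(X)              # robust normalization

    mean = Xs.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(Xs, rowvar=False))
    md = np.array([mahalanobis(row, mean, cov_inv) for row in Xs])
    outliers = md ** 2 > chi2.ppf(0.975, df=Xs.shape[1])   # chi-square cutoff

    clean = Xs[~outliers]
    labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(clean)
    coords = PCA(n_components=2).fit_transform(clean)       # 2-D view for visualization

    print(outliers.sum(), "flagged outliers; cluster sizes:", np.bincount(labels))
    ```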

  3. Exploratory Data Analysis on Automobile Dataset

    • kaggle.com
    zip
    Updated Sep 12, 2022
    Cite
    Monis Ahmad (2022). Exploratory Data Analysis on Automobile Dataset [Dataset]. https://www.kaggle.com/datasets/monisahmad/automobile
    Available download formats: zip (4915 bytes)
    Dataset updated
    Sep 12, 2022
    Authors
    Monis Ahmad
    Description

    Dataset

    This dataset was created by Monis Ahmad

    Contents

  4. Exploratory Data Analysis of Airbnb Data

    • borealisdata.ca
    • dataone.org
    Updated Dec 19, 2022
    Cite
    Imad Ahmad; Ibtassam Rasheed; Yip Chi Man (2022). Exploratory Data Analysis of Airbnb Data [Dataset]. http://doi.org/10.5683/SP3/F2OCZF
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 19, 2022
    Dataset provided by
    Borealis
    Authors
    Imad Ahmad; Ibtassam Rasheed; Yip Chi Man
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Airbnb® is an American company operating an online marketplace for lodging, primarily vacation rentals. The purpose of this study is to perform an exploratory data analysis of two datasets containing Airbnb® listings across 10 major cities. We aim to use various data visualizations to gain valuable insight into the effects of pricing, COVID-19, and more.

  5. Data from: FactExplorer: Fact Embedding-Based Exploratory Data Analysis for...

    • tandf.figshare.com
    pdf
    Updated Sep 23, 2025
    Cite
    Qi Jiang; Guodao Sun; Yue Dong; Lvhan Pan; Baofeng Chang; Li Jiang; Haoran Liang; Ronghua Liang (2025). FactExplorer: Fact Embedding-Based Exploratory Data Analysis for Tabular Data [Dataset]. http://doi.org/10.6084/m9.figshare.28399639.v1
    Available download formats: pdf
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Qi Jiang; Guodao Sun; Yue Dong; Lvhan Pan; Baofeng Chang; Li Jiang; Haoran Liang; Ronghua Liang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although exploratory data analysis (EDA) is a powerful approach for uncovering insights from unfamiliar datasets, existing EDA tools face challenges in helping users assess the progress of exploration and synthesize coherent insights from isolated findings. To address these challenges, we present FactExplorer, a novel fact-based EDA system that shifts the analysis focus from raw data to data facts. FactExplorer employs a hybrid logical-visual representation, providing users with a comprehensive overview of all potential facts at the outset of their exploration. Moreover, FactExplorer introduces fact-mining techniques, including topic-based drill-down and transition path search capabilities. These features facilitate in-depth analysis of facts and enhance the understanding of interconnections between specific facts. Finally, we present a usage scenario and conduct a user study to assess the effectiveness of FactExplorer. The results indicate that FactExplorer facilitates the understanding of isolated findings and enables users to steer a thorough and effective EDA.

  6. EDA perform on Titanic Dataset

    • kaggle.com
    zip
    Updated Mar 9, 2025
    + more versions
    Cite
    Mohammad Osama (2025). EDA perform on Titanic Dataset [Dataset]. https://www.kaggle.com/datasets/osamii840/eda-perform-on-titanic-dataset/versions/1
    Available download formats: zip (1751653 bytes)
    Dataset updated
    Mar 9, 2025
    Authors
    Mohammad Osama
    Description

    Dataset

    This dataset was created by Mohammad Osama

    Contents

  7. diabetes_eda_analysis

    • huggingface.co
    Updated Nov 19, 2025
    Cite
    GUY SHILO (2025). diabetes_eda_analysis [Dataset]. https://huggingface.co/datasets/guyshilo12/diabetes_eda_analysis
    Dataset updated
    Nov 19, 2025
    Authors
    GUY SHILO
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Diabetes Dataset — Exploratory Data Analysis (EDA)

    This repository contains a diabetes-related tabular dataset and a complete Exploratory Data Analysis (EDA). The main objective of this project was to learn how to conduct a structured EDA, apply best practices, and extract meaningful insights from real-world health data.
    The analysis includes correlations, distributions, group comparisons, class balance exploration, and statistical interpretations that illustrate how different… See the full description on the dataset page: https://huggingface.co/datasets/guyshilo12/diabetes_eda_analysis.
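
    A minimal sketch of the kinds of checks such an EDA typically includes (the file name and the "Outcome" target column are assumptions, not the repository's actual files):

    ```python
    # Sketch of typical EDA steps on a diabetes-style table; file and column names are placeholders.
    import pandas as pd

    df = pd.read_csv("diabetes.csv")

    print(df.describe())                                          # distributions of numeric attributes
    print(df["Outcome"].value_counts(normalize=True))             # class balance
    print(df.corr(numeric_only=True)["Outcome"].sort_values())    # correlations with the target
    print(df.groupby("Outcome").mean(numeric_only=True))          # group comparisons
    ```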

  8. What to do in paris

    • kaggle.com
    zip
    Updated Dec 11, 2020
    Cite
    Miloud Belarebia (2020). What to do in paris [Dataset]. https://www.kaggle.com/milobele/what-to-do-in-paris
    Available download formats: zip (1657347 bytes)
    Dataset updated
    Dec 11, 2020
    Authors
    Miloud Belarebia
    Area covered
    Paris
    Description

    Context

    The What to do in Paris site is a participatory events calendar: Parisian venues such as the city's libraries and museums, parks and gardens, entertainment centres, swimming pools, theatres, major venues such as the Gaîté Lyrique, the CENTQUATRE and the Carreau du Temple, concert halls, associations, and even individual Parisians are invited to post their events on the site.

    The source of the data

    Content

    What's inside is more than just rows and columns. Make it easy for others to get started by describing how you acquired the data and what time period it represents, too.

    Acknowledgements

    We wouldn't be here without the help of others. If you owe any attributions or thanks, include them here along with any citations of past research.

    Inspiration

    Your data will be in front of the world's largest data science community. What questions do you want to see answered?

  9. Exploratory Data Analysis of Bank Marketing Data

    • kaggle.com
    zip
    Updated Jun 4, 2024
    Cite
    Mithilesh Kale (2024). Exploratory Data Analysis of Bank Marketing Data [Dataset]. https://www.kaggle.com/datasets/mithilesh9/exploratory-data-analysis-of-bank-marketing-data
    Available download formats: zip (11661136 bytes)
    Dataset updated
    Jun 4, 2024
    Authors
    Mithilesh Kale
    Description

    I did exploratory data analysis using this data.

    This dataset offers a window into the world of bank telemarketing, with the goal of understanding how customers respond to campaigns promoting term deposit subscriptions. It provides a rich collection of information, including:

    Customer Demographics: A snapshot of who your customers are (age, job, marital status, etc.).
    Campaign History: Insights into how customers have reacted to past campaigns (contact method, duration).
    Call Metrics: Data on call duration and conversion rates, both on an individual call level and overall.

    Originally sourced from a public repository, this dataset offers valuable potential for analysis. It's perfect for exploring:

    Customer Behavior: What are the characteristics of customers who do (and don't) sign up for term deposits?
    Campaign Effectiveness: Which types of campaigns or communication strategies are most successful?

    By conducting exploratory data analysis (including univariate, bivariate, and segmented approaches), you can uncover hidden patterns and optimize future marketing efforts. This data is your key to better understanding your customers and driving higher subscription rates.
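
    A brief sketch of the univariate, bivariate, and segmented views mentioned above (the column names "age", "job", "duration", and "y" follow the common UCI bank-marketing layout and are assumptions here):

    ```python
    # Sketch of univariate / bivariate / segmented EDA; file and column names are assumptions.
    import pandas as pd

    df = pd.read_csv("bank_marketing.csv")

    print(df["age"].describe())                                   # univariate: one variable at a time
    print(pd.crosstab(df["job"], df["y"], normalize="index"))     # bivariate: job vs. subscription outcome
    print(df.groupby("y")["duration"].median())                   # segmented: call duration by outcome
    ```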

  10. Marketing Analytics

    • kaggle.com
    zip
    Updated Mar 6, 2022
    Cite
    Jack Daoud (2022). Marketing Analytics [Dataset]. https://www.kaggle.com/datasets/jackdaoud/marketing-data/discussion
    Available download formats: zip (658411 bytes)
    Dataset updated
    Mar 6, 2022
    Authors
    Jack Daoud
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This data is publicly available on GitHub here. It can be utilized for EDA, Statistical Analysis, and Visualizations.

    Content

    The data set ifood_df.csv consists of 2206 customers of XYZ company with data on:
    - Customer profiles
    - Product preferences
    - Campaign successes/failures
    - Channel performance

    Acknowledgement

    I do not own this dataset. I am simply making it accessible on this platform via the public GitHub link.

  11. SERL

    • datacatalogue.ukdataservice.ac.uk
    • beta.ukdataservice.ac.uk
    Updated Aug 3, 2020
    Cite
    UK Data Service (2020). SERL [Dataset]. http://doi.org/10.5255/UKDA-SN-8643-2
    Dataset updated
    Aug 3, 2020
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Time period covered
    Jan 1, 2019 - May 30, 2020
    Area covered
    United Kingdom
    Description

    The Smart Energy Research Lab Exploratory Data, 2019-2020 is an initial study within the SERL project; it is to be accessed by SERL researchers to conduct exploratory analysis ahead of provisioning SERL data to the wider academic research community.

    The goals of the SERL portal are to provide:

    • a consistent, trusted, and sustainable channel for researchers to access large-scale, high-resolution energy data, thereby providing a reliable empirical dataset for research;
    • an effective mechanism for collecting energy data alongside other variables from national surveys (e.g. English Housing Survey) or individual research projects
    • a confidential, ongoing repository of smart meter data enhanced with contextual dwelling, household and neighbourhood attributes for use in primary and secondary data analysis.

    Further information about SERL can be found at https://serl.ac.uk/.

    Besides the Smart Energy Research Lab data (smart meter readings and contextual data), the study also contains the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 data for which users should note that neither the European Commission nor the European Centre for Medium-Range Weather Forecasts will be held responsible for any use that may be made of the Copernicus information or data it contains. The Energy Performance of Buildings Data is also included and users must read and abide by the Copyright Information Notice, provided by the Ministry of Housing, Communities and Local Government, that covers the use of Royal Mail information and non-address data provided under the Open Government Licence v3.0.

    For the second edition (August 2020), revised data and documentation have been deposited. The second edition extends the initial data and now includes smart meter data – from January 2019 to May 2020, contextual data - a short SERL survey completed by participant households, household information data, building characteristics, Energy Performance Certificate (EPC) data and weather data.

  12. Data from: Best Practices for Your Exploratory Factor Analysis: A Factor...

    • scielo.figshare.com
    tiff
    Updated Jun 16, 2023
    Cite
    Pablo Rogers (2023). Best Practices for Your Exploratory Factor Analysis: A Factor Tutorial [Dataset]. http://doi.org/10.6084/m9.figshare.20337249.v1
    Available download formats: tiff
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Pablo Rogers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT Context: exploratory factor analysis (EFA) is one of the statistical methods most widely used in administration; however, its current practice coexists with rules of thumb and heuristics given half a century ago. Objective: the purpose of this article is to present the best practices and recent recommendations for a typical EFA in administration through a practical solution accessible to researchers. Methods: in this sense, in addition to discussing current practices versus recommended practices, a tutorial with real data on Factor is illustrated. The Factor software is still little known in the administration area, but is freeware, easy-to-use (point and click), and powerful. The step-by-step tutorial illustrated in the article, in addition to the discussions raised and an additional example, is also available in the format of tutorial videos. Conclusion: through the proposed didactic methodology (article-tutorial + video-tutorial), we encourage researchers/methodologists who have mastered a particular technique to do the same. Specifically about EFA, we hope that the presentation of the Factor software, as a first solution, can transcend the current outdated rules of thumb and heuristics, by making best practices accessible to administration researchers.

  13. Data from: Superheat: An R Package for Creating Beautiful and Extendable...

    • tandf.figshare.com
    bin
    Updated Mar 4, 2024
    Cite
    Rebecca L. Barter; Bin Yu (2024). Superheat: An R Package for Creating Beautiful and Extendable Heatmaps for Visualizing Complex Data [Dataset]. http://doi.org/10.6084/m9.figshare.6287693.v1
    Available download formats: bin
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Rebecca L. Barter; Bin Yu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics, they remain a severely underutilized visualization tool in modern data analysis. This article introduces superheat, a new R package that provides an extremely flexible and customizable platform for visualizing complex datasets. Superheat produces attractive and extendable heatmaps to which the user can add a response variable as a scatterplot, model results as boxplots, correlation information as barplots, and more. The goal of this article is two-fold: (1) to demonstrate the potential of the heatmap as a core visualization method for a range of data types, and (2) to highlight the customizability and ease of implementation of the superheat R package for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three reproducible case studies, each based on publicly available data sources.

  14. Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum...

    • zenodo.org
    zip
    Updated Sep 29, 2025
    Cite
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio (2025). Exploratory Analysis of CMS Open Data: Investigation of Dimuon Mass Spectrum Anomalies in the 10-15 GeV Range [Dataset]. http://doi.org/10.5281/zenodo.17220766
    Available download formats: zip
    Dataset updated
    Sep 29, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andre Luis Tomaz Dionísio; Andre Luis Tomaz Dionísio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains the results of an exploratory analysis of CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), focusing on the dimuon invariant mass spectrum in the 10-15 GeV range. The analysis investigates potential anomalies at 11.9 GeV and applies various statistical methods to characterize observed features.

    Methodology:

    • Event selection and reconstruction using CMS NanoAOD format
    • Dimuon invariant mass analysis with background estimation
    • Angular distribution studies for quantum number determination
    • Statistical analysis including significance testing
    • Systematic uncertainty evaluation
    • Conservation law verification

    Key Analysis Components:

    • Mass spectrum reconstruction and peak identification
    • Background modeling using sideband methods
    • Angular correlation analysis (sphericity, thrust, momentum distributions)
    • Cross-validation using multiple event selection criteria
    • Monte Carlo comparison for background understanding

    Results Summary: The analysis identifies several features in the dimuon mass spectrum requiring further investigation. Preliminary observations suggest potential anomalies around 11.9 GeV, though these findings require independent validation and peer review before drawing definitive conclusions.
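
    As a toy illustration of the sideband background estimate mentioned under Key Analysis Components (all windows, numbers, and file names below are placeholders, not the analysis' actual selections):

    ```python
    # Toy sketch of a sideband background estimate around the 11.9 GeV candidate window.
    import numpy as np

    masses = np.load("dimuon_masses.npy")            # hypothetical array of dimuon masses in GeV

    signal = (masses > 11.7) & (masses < 12.1)       # 0.4 GeV window around the candidate
    left   = (masses > 11.3) & (masses < 11.7)       # lower sideband, 0.4 GeV wide
    right  = (masses > 12.1) & (masses < 12.5)       # upper sideband, 0.4 GeV wide

    # Expected background in the signal window, assuming a locally flat spectrum.
    background = (left.sum() + right.sum()) / 2.0
    excess = signal.sum() - background
    print(f"observed={signal.sum()}, expected background={background:.1f}, excess={excess:.1f}")
    ```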

    Data Products:

    • Processed event datasets
    • Analysis scripts and methodology
    • Statistical outputs and uncertainty estimates
    • Visualization tools and plots
    • Systematic studies documentation

    Limitations: This work represents preliminary exploratory analysis. Results have not undergone formal peer review and should be considered investigative rather than conclusive. Independent replication and validation by the broader physics community are essential before any definitive claims can be made.

    Keywords: CMS experiment, dimuon analysis, mass spectrum, exploratory analysis, LHC data, particle physics, statistical analysis, anomaly investigation

    # Dark Photon Search at 11.9 GeV

    ## Executive Summary

    **Historic Search for First Evidence of a Massive Dark Photon**

    We report a search for a new vector gauge boson at 11.9 GeV, interpreted as a dark photon (A') candidate that would constitute a portal between the Standard Model and a hidden sector. This search, based on CMS Open Data from LHC Run 1 (2010-2012) and Run 2 (2015-2018), would, if confirmed, provide direct experimental evidence for physics beyond the Standard Model.

    ## Search Highlights

    ### Anomaly Properties
    - **Mass**: 11.9 ± 0.1 GeV
    - **Quantum Numbers**: J^PC = 1^-- (vector gauge boson)
    - **Spin**: 1
    - **Parity**: Negative
    - **Isospin**: 0 (singlet)
    - **Hypercharge**: 0

    ### Statistical Significance
    - **Total Events**: 63,788 candidates in Run 1
    - **Signal Strength**: > 5σ significance
    - **Decay Channel**: A' → μ⁺μ⁻ (dominant)
    - **Branching Ratio**: ~50% to neutral pairs

    ### Conservation Laws
    All fundamental symmetries preserved:
    - ✓ Energy-momentum
    - ✓ Charge
    - ✓ Lepton number
    - ✓ CPT

    ## Project Structure

    ```
    search/
    ├── README.md # This file
    ├── docs/
    │ ├── paper/ # Main search paper
    │ │ ├── manuscript.tex # LaTeX source
    │ │ ├── abstract.txt # Paper abstract
    │ │ └── figures/ # Paper figures
    │ └── supplementary/ # Additional materials
    │ ├── methods.pdf # Detailed methodology
    │ ├── systematics.pdf # Systematic uncertainties
    │ └── theory.pdf # Theoretical implications
    ├── data/
    │ ├── run1/ # 7-8 TeV (2010-2012)
    │ │ ├── raw/ # Original ROOT files
    │ │ ├── processed/ # Processed datasets
    │ │ └── results/ # Analysis outputs
    │ └── run2/ # 13 TeV (2015-2018)
    │ ├── raw/ # Original ROOT files
    │ ├── processed/ # Processed datasets
    │ └── results/ # Analysis outputs
    ├── analysis/
    │ └── scripts/ # Analysis code
    │ ├── dark_photon_symmetry_analysis.py
    │ ├── hidden_sector_10_150_search.py
    │ ├── hidden_10_15_gev_analysis.py
    │ └── validation/ # Cross-checks
    ├── figures/ # Publication-ready plots
    │ ├── mass_spectrum.png # Invariant mass distribution
    │ ├── angular_dist.png # Angular distributions
    │ ├── symmetry_plots.png # Symmetry analysis
    │ └── cascade_spectrum.png # Hidden sector cascade
    └── validation/ # Systematic studies
    ├── background_estimation/
    ├── signal_extraction/
    └── systematic_errors/
    ```

    ## Key Evidence

    ### 1. Quantum Number Determination
    - **Angular Distribution**: ⟨|P₁|⟩ = 0.805 (strong anisotropy)
    - **Quadrupole Moment**: ⟨P₂⟩ = 0.573 (non-zero)
    - **Anomaly Type Score**: Vector = 90/100 (Preliminary)

    ### 2. Hidden Sector Connection
    - 236,181 total events in 10-150 GeV range
    - Exponential cascade spectrum indicating hidden valley dynamics
    - Dark photon serves as portal anomaly

    ### 3. Decay Topology
    - **Sphericity**: 0.161 (jet-like)
    - **Thrust**: 0.686 (moderate collimation)
    - Consistent with two-body decay A' → μ⁺μ⁻

    ## Physical Interpretation

    If confirmed, the anomaly would represent:
    1. **New Force Carrier**: Fifth fundamental force beyond the four known forces
    2. **Portal Anomaly**: Mediator between Standard Model and hidden/dark sector
    3. **Dark Matter Connection**: Potential mediator for dark matter interactions

    ## Theoretical Framework

    ### Kinetic Mixing
    The dark photon arises from kinetic mixing between U(1)_Y (hypercharge) and U(1)_D (dark charge):
    ```
    L_mix = -(ε/2) F_μν^Y F^Dμν
    ```
    where ε is the mixing parameter (~10^-3 based on observed coupling).

    ### Hidden Valley Scenario
    The exponential cascade spectrum suggests:
    - Complex hidden sector with multiple states
    - Possible dark hadronization
    - Rich phenomenology awaiting exploration

    ## Collaborators and Credits

    **Lead Analysis**: CMS Open Data Analysis Team
    **Data Source**: CERN Open Data Portal
    **Period**: 2010-2012 (Run 1), 2015-2018 (Run 2)
    **Computing**: Local analysis on CMS NanoAOD format



    ## How to Reproduce

    ### Requirements
    ```bash
    pip install uproot awkward numpy matplotlib
    ```

    ### Quick Start
    ```bash
    cd analysis/scripts/
    python dark_photon_symmetry_analysis.py
    python hidden_10_15_gev_analysis.py
    ```
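
    A minimal, illustrative sketch (not the authors' published scripts) of how a dimuon invariant-mass histogram can be built from a CMS NanoAOD file with uproot and awkward; the input file name is a placeholder:

    ```python
    # Sketch: dimuon invariant mass from NanoAOD muon branches (massless-muon approximation).
    import uproot
    import awkward as ak
    import numpy as np
    import matplotlib.pyplot as plt

    tree = uproot.open("nanoaod_sample.root")["Events"]          # hypothetical input file
    mu = tree.arrays(["Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge"])

    # Keep events with at least two muons and use the two leading ones.
    mask = ak.num(mu["Muon_pt"]) >= 2
    pt, eta, phi, q = (mu[f][mask] for f in ("Muon_pt", "Muon_eta", "Muon_phi", "Muon_charge"))

    # Require an opposite-charge leading pair.
    opp = q[:, 0] * q[:, 1] < 0
    pt, eta, phi = pt[opp], eta[opp], phi[opp]

    # m^2 ≈ 2 pT1 pT2 (cosh(Δη) - cos(Δφ)), neglecting the muon mass.
    m = np.sqrt(2 * pt[:, 0] * pt[:, 1]
                * (np.cosh(eta[:, 0] - eta[:, 1]) - np.cos(phi[:, 0] - phi[:, 1])))

    plt.hist(ak.to_numpy(m), bins=200, range=(10, 15), histtype="step")
    plt.xlabel("Dimuon invariant mass [GeV]")
    plt.ylabel("Events / bin")
    plt.savefig("dimuon_mass_10_15.png")
    ```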

    ## Significance Statement

    If confirmed, this search would provide the first evidence of a portal anomaly connecting the Standard Model to a hidden sector. A 11.9 GeV dark photon would open an entirely new frontier in anomaly physics, providing experimental access to previously invisible physics and potentially explaining dark matter interactions.

    ## Contact

    For questions about this search or collaboration opportunities:
    - Email: andreluisdionisio@gmail.com

    ---

    "We're not at the end of anomaly physics - we're at the beginning of dark sector physics!"

    3665778186 00382C40-4D7F-E211-AD6F-003048FFCBFC.root
    2581315530 0E5F189B-5D7F-E211-9423-002354EF3BE1.root
    2149825126 1AE176AC-5A7F-E211-8E63-00261894397D.root
    1792851725 2044D46B-DE7F-E211-9C82-003048FFD76E.root
    3186214416 4CAE8D51-4A7F-E211-9937-0025905964A2.root
    3220923349 72FDEF89-497F-E211-9CFA-002618943958.root
    2555255008 7A35A5A2-547F-E211-940B-003048678DA2.root
    3875410897 7E942EED-457F-E211-938E-002618FDA28E.root
    2409745919 8406DE2F-407F-E211-A6A5-00261894395F.root
    2421251748 8A61DAA8-3C7F-E211-94A6-002618943940.root
    2315643699 98909097-417F-E211-9009-002618943838.root
    2614932091 A0963AD9-567F-E211-A8AF-002618943901.root
    2438057881 ACE2DF9A-477F-E211-9C29-003048679266.root
    2206652387 B6AA897F-467F-E211-8381-002618943854.root
    2365666837 C09519C8-4B7F-E211-9BCE-003048678B34.root
    2477336101 C68AE3A5-447F-E211-928E-00261894388B.root
    2556444022 C6CEC369-437F-E211-81B0-0026189438BD.root
    3184171088 D60FF379-4E7F-E211-8BA4-002590593878.root
    2381001693

  15. EDA-US-Bankruptcy-Prediction

    • huggingface.co
    Cite
    reef zehavi, EDA-US-Bankruptcy-Prediction [Dataset]. https://huggingface.co/datasets/reefzehavi/EDA-US-Bankruptcy-Prediction
    Authors
    reef zehavi
    License

    https://choosealicense.com/licenses/other/

    Area covered
    United States
    Description

    Assignment 1: EDA - US Company Bankruptcy Prediction

    Student Name: Reef Zehavi Date: November 10, 2025

      📹 Project Presentation Video
    

    https://www.loom.com/share/6920e493e8654ef3bb4f67a10eb9b03d

      1. Overview and Project Goal
    

    The goal of this project is to perform Exploratory Data Analysis (EDA) on a fundamental dataset of American companies. The analysis focuses on understanding the financial characteristics that differentiate between companies that survived… See the full description on the dataset page: https://huggingface.co/datasets/reefzehavi/EDA-US-Bankruptcy-Prediction.

  16. Customer Purchase Behavior Dataset

    • kaggle.com
    zip
    Updated Mar 12, 2025
    Cite
    Amisha Chaudhari (2025). Customer Purchase Behavior Dataset [Dataset]. https://www.kaggle.com/datasets/amishachaudhary/customer-purchase-behavior-dataset
    Available download formats: zip (1043712 bytes)
    Dataset updated
    Mar 12, 2025
    Authors
    Amisha Chaudhari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Overview: This dataset contains transactional data of customers from an e-commerce platform, to analyze and understand their purchasing behavior. The dataset includes customer ID, product purchased, purchase amount, purchase date, and product category.

    Purpose of the Dataset: The primary objective of this dataset is to provide an opportunity to perform data exploration and preprocessing, allowing users to practice and enhance their data cleaning and analysis skills. The dataset has been intentionally modified to simulate a "messy" scenario, where some values have been removed, and inconsistencies have been introduced, which provides a real-world challenge for users to handle during data preparation.

    Key Features:
    CustomerID: Unique identifier for each customer.
    ProductID: Unique identifier for each product purchased.
    PurchaseAmount: Amount spent by the customer on a particular transaction.
    PurchaseDate: Date when the transaction took place.
    ProductCategory: Category of the purchased product.

    Analysis Opportunities:

    Perform data cleaning and preprocessing to handle missing values, duplicates, and outliers. Conduct exploratory data analysis (EDA) to uncover trends and patterns in customer behavior. Apply machine learning models like clustering and association rule mining for segmenting customers and understanding purchasing patterns.
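
    A minimal sketch of the cleaning and EDA steps described above (the column names come from the Key Features list; the file name is a placeholder):

    ```python
    # Sketch of the cleaning + EDA steps; the CSV file name is a placeholder.
    import pandas as pd

    df = pd.read_csv("customer_purchases.csv")

    # Cleaning: drop duplicates, parse dates, fill missing amounts, trim extreme outliers.
    df = df.drop_duplicates()
    df["PurchaseDate"] = pd.to_datetime(df["PurchaseDate"], errors="coerce")
    df["PurchaseAmount"] = df["PurchaseAmount"].fillna(df["PurchaseAmount"].median())
    df = df[df["PurchaseAmount"] < df["PurchaseAmount"].quantile(0.99)]

    # EDA: spend per category, spend per customer, and a monthly trend.
    print(df.groupby("ProductCategory")["PurchaseAmount"].agg(["count", "mean"]))
    print(df.groupby("CustomerID")["PurchaseAmount"].sum().describe())
    print(df.set_index("PurchaseDate").resample("M")["PurchaseAmount"].sum())
    ```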

  17. Data from: Handling of Personal Data by Smart Home Equipment: an Exploratory...

    • data.niaid.nih.gov
    Updated Dec 13, 2023
    Cite
    Coleti, Thiago Adriano; Balancieri, Renato; Menolli, André; Coleti, Thiago Adriano; Balancieri, Renato; Menolli, André; Morandini, Marcelo; Mahmoud, Omar Ali; Sotti, Victor Hugo (2023). Handling of Personal Data by Smart Home Equipment: an Exploratory Analysis in the Context of LGPD [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10372205
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Universidade Estadual de Maringá
    Universidade de São Paulo
    Universidade Estadual do Paraná (UNESPAR)
    Universidade Estadual do Maringá
    Universidade Estadual do Norte do Paraná
    Authors
    Coleti, Thiago Adriano; Balancieri, Renato; Menolli, André; Coleti, Thiago Adriano; Balancieri, Renato; Menolli, André; Morandini, Marcelo; Mahmoud, Omar Ali; Sotti, Victor Hugo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides data from an exploratory study that analyzed the Privacy and Security Policies and the Instruction Manuals of 59 home automation devices for Smart Homes, in order to verify which personal data were handled and how these documents provided information about the processing performed on personal data. The analysis was conducted with a quantitative approach followed by a qualitative analysis, using content analysis.

  18. Human Resources & Organizational Behaviour

    • kaggle.com
    zip
    Updated Sep 23, 2024
    Cite
    Aditya Jha-Mishra (2024). Human Resources & Organizational Behaviour [Dataset]. https://www.kaggle.com/datasets/adityajhamishra/human-resources-and-organizational-behaviour/data
    Available download formats: zip (18428 bytes)
    Dataset updated
    Sep 23, 2024
    Authors
    Aditya Jha-Mishra
    Description

    This dataset consists of two categories of data: primary data obtained through survey responses, and secondary data obtained through websites and APIs.

    The dataset covers several categories of evaluation, including: 1) Employment Status, 2) Job Satisfaction, and 3) Retention of Key Employees.

  19. Animal Shelter Analytics

    • kaggle.com
    zip
    Updated Mar 4, 2021
    Cite
    Jack Daoud (2021). Animal Shelter Analytics [Dataset]. https://www.kaggle.com/jackdaoud/animal-shelter-analytics
    Available download formats: zip (8043946 bytes)
    Dataset updated
    Mar 4, 2021
    Authors
    Jack Daoud
    License

    https://www.usa.gov/government-works/

    Description

    Context

    I was reading Every Nose Counts: Using Metrics in Animal Shelters when I got inspired to conduct an EDA on animal shelter data. I looked online for data and found this dataset which is curated by Austin Animal Center. The data can be found on https://data.austintexas.gov.

    This data can be utilized for EDA practice. So go ahead and help animal shelters with your EDA powers by completing this task!

    Content

    The data set contains three CSVs:
    1. Austin_Animal_Center_Intakes.csv
    2. Austin_Animal_Center_Outcomes.csv
    3. Austin_Animal_Center_Stray_Map.csv

    More TBD!

    Acknowledgement

    Thank you Austin Animal Center for all the animal protection you provide to stray & owned animals. Also, thank you for making your data accessible to the public.

  20. DQLab Telco Final

    • kaggle.com
    zip
    Updated Mar 9, 2025
    Cite
    Samuel Robert Ardi Nugraha (2025). DQLab Telco Final [Dataset]. https://www.kaggle.com/samran98/customer-churn-telco-final
    Available download formats: zip (113195 bytes)
    Dataset updated
    Mar 9, 2025
    Authors
    Samuel Robert Ardi Nugraha
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PLEASE UPVOTE THIS DATASET IF IT HELPS YOU... FORKS ARE WELCOME

    BACKGROUND

    DQLab Telco is a telecommunications company with numerous locations all over the world. In order to ensure that customers are not left behind, DQLab Telco has consistently paid attention to the customer experience since its establishment in 2019.

    Even though DQLab Telco is only a little over a year old, many of its customers have already changed their subscriptions to rival companies. By using machine learning, management hopes to lower the number of customers who leave.

    After cleaning the data yesterday, it is now time for us to build the best model to forecast customer churn.

    TASKS & STEPS

    Yesterday, we completed "Cleansing Data" as part of project part 1. You are now expected to develop the appropriate model as a data scientist.

    You will perform "Machine Learning Modeling" in this assignment using data from the previous month, specifically June 2020.

    The actions that must be taken are:
    1. Perform exploratory data analysis first.
    2. Carry out pre-processing of the data.
    3. Apply machine learning modeling.
    4. Pick the best model.
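
    A compact sketch of these four steps (scikit-learn is an assumption; the file name and the "Churn" target column are placeholders for the cleansed June 2020 data):

    ```python
    # Sketch of EDA, preprocessing, modeling, and model selection; file/column names are placeholders.
    import pandas as pd
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    df = pd.read_csv("dqlab_telco_clean_june2020.csv")

    # 1. Exploratory data analysis: check the churn class balance.
    print(df["Churn"].value_counts(normalize=True))

    # 2. Pre-processing: encode categoricals and split the data.
    X = pd.get_dummies(df.drop(columns=["Churn"]))
    y = df["Churn"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

    # 3-4. Fit candidate models and pick the one with the best cross-validated ROC AUC.
    models = {
        "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }
    for name, model in models.items():
        score = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
        print(name, round(score, 3))
    ```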
