100+ datasets found
  1. Exploratory Data Analysis of Airbnb Data

    • dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Ahmad, Imad; Rasheed, Ibtassam; Man, Yip Chi (2023). Exploratory Data Analysis of Airbnb Data [Dataset]. http://doi.org/10.5683/SP3/F2OCZF
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Ahmad, Imad; Rasheed, Ibtassam; Man, Yip Chi
    Description

    Airbnb® is an American company operating an online marketplace for lodging, primarily vacation rentals. The purpose of this study is to perform an exploratory data analysis of two datasets containing Airbnb® listings across 10 major cities. We aim to use various data visualizations to gain valuable insight into the effects of pricing, COVID-19, and more.
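    A first pass at such an analysis can be sketched with pandas; the frame and the column names (city, price) below are hypothetical stand-ins, since the actual listing files define their own schema:

```python
import pandas as pd

# Hypothetical toy frame; the real listing files define their own schema.
listings = pd.DataFrame({
    "city": ["Toronto", "Toronto", "Vancouver"],
    "price": [120.0, 80.0, 150.0],
})

# Median nightly price per city: a typical starting point for the EDA.
median_price = listings.groupby("city")["price"].median()
```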

  2. Data from: FactExplorer: Fact Embedding-Based Exploratory Data Analysis for...

    • tandf.figshare.com
    pdf
    Updated Jun 23, 2025
    Cite
    Qi Jiang; Guodao Sun; Yue Dong; Lvhan Pan; Baofeng Chang; Li Jiang; Haoran Liang; Ronghua Liang (2025). FactExplorer: Fact Embedding-Based Exploratory Data Analysis for Tabular Data [Dataset]. http://doi.org/10.6084/m9.figshare.28399639.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 23, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Qi Jiang; Guodao Sun; Yue Dong; Lvhan Pan; Baofeng Chang; Li Jiang; Haoran Liang; Ronghua Liang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although exploratory data analysis (EDA) is a powerful approach for uncovering insights from unfamiliar datasets, existing EDA tools face challenges in helping users assess the progress of exploration and synthesize coherent insights from isolated findings. To address these challenges, we present FactExplorer, a novel fact-based EDA system that shifts the analysis focus from raw data to data facts. FactExplorer employs a hybrid logical-visual representation, providing users with a comprehensive overview of all potential facts at the outset of their exploration. Moreover, FactExplorer introduces fact-mining techniques, including topic-based drill-down and transition path search capabilities. These features facilitate in-depth analysis of facts and enhance the understanding of interconnections between specific facts. Finally, we present a usage scenario and conduct a user study to assess the effectiveness of FactExplorer. The results indicate that FactExplorer facilitates the understanding of isolated findings and enables users to steer a thorough and effective EDA.

  3. Exploratory Data Analysis (EDA) Tools Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Exploratory Data Analysis (EDA) Tools Report [Dataset]. https://www.marketreportanalytics.com/reports/exploratory-data-analysis-eda-tools-54369
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The rising need for data-driven decision-making, coupled with the expanding adoption of cloud-based analytics solutions, is fueling market expansion. While precise figures for market size and CAGR are not provided, a reasonable estimation, based on the prevalent growth in the broader analytics market and the crucial role of EDA in the data science workflow, would place the 2025 market size at approximately $3 billion, with a projected Compound Annual Growth Rate (CAGR) of 15% through 2033. This growth is segmented across various applications, with large enterprises leading the adoption due to their higher investment capacity and complex data needs. However, SMEs are witnessing rapid growth in EDA tool adoption, driven by the increasing availability of user-friendly and cost-effective solutions. Further segmentation by tool type reveals a strong preference for graphical EDA tools, which offer intuitive visualizations facilitating better data understanding and communication of findings. Geographic regions, such as North America and Europe, currently hold a significant market share, but the Asia-Pacific region shows promising potential for future growth owing to increasing digitalization and data generation. Key restraints to market growth include the need for specialized skills to effectively utilize these tools and the potential for data bias if not handled appropriately. The competitive landscape is dynamic, with both established players like IBM and emerging companies specializing in niche areas vying for market share. Established players benefit from brand recognition and comprehensive enterprise solutions, while specialized vendors provide innovative features and agile development cycles. 
Open-source options like KNIME, the R package Rattle, and Python's Pandas Profiling offer cost-effective alternatives, particularly attracting academic institutions and smaller businesses. The ongoing development of advanced analytics functionalities, such as automated machine learning integration within EDA platforms, will be a significant driver of future market growth. Further, the integration of EDA tools within broader data science platforms is streamlining the overall analytical workflow, contributing to increased adoption and reduced complexity. The market's evolution hinges on enhanced user experience, more robust automation features, and seamless integration with other data management and analytics tools.

  4. YouTube Trending Videos of the Day

    • opendatabay.com
    Updated Jun 20, 2025
    Cite
    Datasimple (2025). YouTube Trending Videos of the Day [Dataset]. https://www.opendatabay.com/data/ai-ml/34cfa60b-afac-4753-9409-bc00f9e8fbec
    Explore at:
    Available download formats: (unspecified)
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    YouTube, Data Science and Analytics
    Description

    The dataset includes YouTube trending video statistics for Mediterranean countries on 2022-11-07. It contains 15 columns and covers 20 countries:

    IT - Italy, ES - Spain, GR - Greece, HR - Croatia, TR - Turkey, AL - Albania, DZ - Algeria, EG - Egypt, LY - Libya, TN - Tunisia, MA - Morocco, IL - Israel, ME - Montenegro, LB - Lebanon, FR - France, BA - Bosnia and Herzegovina, MT - Malta, SI - Slovenia, CY - Cyprus, SY - Syria

    The columns are the following:

    • country: the country in which the video was published.
    • video_id: video identification number. You can find it by right-clicking a video and selecting 'Stats for nerds'.
    • title: title of the video.
    • publishedAt: publication date of the video.
    • channelId: identification number of the channel that published the video.
    • channelTitle: name of the channel that published the video.
    • categoryId: identification number of the video's category. Each number corresponds to a certain category; for example, 10 corresponds to the 'Music' category. Check here for the complete list.
    • trending_date: trending date of the video.
    • tags: tags present in the video.
    • view_count: view count of the video.
    • comment_count: number of comments on the video.
    • thumbnail_link: link of the image that appears before clicking the video.
    • comments_disabled: whether comments are disabled for the video.
    • ratings_disabled: whether ratings are disabled for the video.
    • description: description below the video.

    Inspiration

    You can perform an exploratory data analysis of the dataset working with Pandas or NumPy (if you use Python) or other data analysis libraries, and you can practice running queries using SQL or the Pandas functions. It is also possible to analyze the titles, tags, and descriptions of the videos to search for relevant information. Remember to upvote if you found the dataset useful :).
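    As a sketch of the Pandas route suggested above (the rows here are made up; in practice you would read the dataset file with read_csv):

```python
import pandas as pd

# Made-up rows mirroring the columns described above.
df = pd.DataFrame({
    "country": ["IT", "IT", "FR"],
    "categoryId": [10, 10, 24],
    "view_count": [1000, 2000, 1500],
})

# Trending videos per country, and mean views per category.
per_country = df["country"].value_counts()
views_by_category = df.groupby("categoryId")["view_count"].mean()
```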

    License

    CC0

    Original Data Source: YouTube Trending Videos of the Day

  5. Data and Code for Exploratory Factor Analysis in Sample 1

    • osf.io
    Updated Apr 6, 2020
    Cite
    Mathias Nielsen (2020). Data and Code for Exploratory Factor Analysis in Sample 1 [Dataset]. https://osf.io/z2hr3
    Explore at:
    Dataset updated
    Apr 6, 2020
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Mathias Nielsen
    Description

    This component contains the data and syntax code used to conduct the exploratory factor analysis and to compute Velicer’s minimum average partial test in Sample 1.
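    The original syntax files are not reproduced here, but Velicer's minimum average partial (MAP) test itself is compact enough to sketch in NumPy. This is a reimplementation under our own assumptions (the standard partial-out-k-components formulation), not the author's code:

```python
import numpy as np

def velicer_map(R):
    """Velicer's MAP test: return the number of components that minimises
    the average squared partial correlation of the residual matrix."""
    p = R.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0, None))
    off = ~np.eye(p, dtype=bool)
    avg_sq = [(R[off] ** 2).mean()]                    # k = 0: raw correlations
    for k in range(1, p - 1):
        A = loadings[:, :k]
        resid = R - A @ A.T                            # partial out first k components
        d = np.sqrt(np.outer(np.diag(resid), np.diag(resid)))
        avg_sq.append(((resid / d)[off] ** 2).mean())
    return int(np.argmin(avg_sq))

# Synthetic check: six items driven by a single common factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=500)
X = np.outer(factor, np.full(6, 0.8)) + 0.6 * rng.normal(size=(500, 6))
n_factors = velicer_map(np.corrcoef(X, rowvar=False))
```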

  6. COVID 19 Dataset

    • kaggle.com
    zip
    Updated Aug 24, 2020
    + more versions
    Cite
    Rahul Gupta (2020). COVID 19 Dataset [Dataset]. https://www.kaggle.com/rahulgupta21/datahub-covid19
    Explore at:
    Available download formats: zip (915971 bytes)
    Dataset updated
    Aug 24, 2020
    Authors
    Rahul Gupta
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Coronavirus disease 2019 (COVID-19) time series listing confirmed cases, reported deaths, and reported recoveries. Data is disaggregated by country (and sometimes subregion). Coronavirus disease (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has had a worldwide effect. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, pointing to the over 118,000 cases of the Coronavirus illness in over 110 countries and territories around the world at the time.

    This dataset includes time series data tracking the number of people affected by COVID-19 worldwide, including:

    • confirmed tested cases of Coronavirus infection
    • the number of people who have reportedly died while sick with Coronavirus
    • the number of people who have reportedly recovered from it

    Content

    Data is in CSV format and updated daily. It is sourced from this upstream repository maintained by the amazing team at Johns Hopkins University Center for Systems Science and Engineering (CSSE) who have been doing a great public service from an early point by collating data from around the world.

    We have cleaned and normalized that data, for example tidying dates and consolidating several files into normalized time series. We have also added some metadata, such as column descriptions, and packaged the data.
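    The consolidation step described above, turning the upstream wide files (one column per date) into a normalized long time series, can be sketched with pandas; the two-row frame is a made-up miniature of the JHU CSSE layout:

```python
import pandas as pd

# Made-up miniature of the upstream wide layout: one column per date.
wide = pd.DataFrame({
    "Country/Region": ["Italy", "Spain"],
    "1/22/20": [0, 0],
    "1/23/20": [2, 1],
})

# Melt to a long time series and tidy the dates.
tidy = wide.melt(id_vars="Country/Region", var_name="Date", value_name="Confirmed")
tidy["Date"] = pd.to_datetime(tidy["Date"], format="%m/%d/%y")
```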

  7. Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Yi-Hui Zhou; Ehsan Saghapour (2023). Data_Sheet_1_ImputEHR: A Visualization Tool of Imputation for the Prediction of Biomedical Data.PDF [Dataset]. http://doi.org/10.3389/fgene.2021.691274.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers
    Authors
    Yi-Hui Zhou; Ehsan Saghapour
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electronic health records (EHRs) have been widely adopted in recent years, but often include a high proportion of missing data, which can create difficulties in implementing machine learning and other tools of personalized medicine. Completed datasets are preferred for a number of analysis methods, and successful imputation of missing EHR data can improve interpretation and increase our power to predict health outcomes. However, use of the most popular imputation methods mainly requires scripting skills, and they are implemented using various packages and syntaxes. Thus, the implementation of a full suite of methods is generally out of reach to all except experienced data scientists. Moreover, imputation is often treated as an exercise separate from exploratory data analysis, but it should be considered part of the data exploration process. We have created a new graphical tool, ImputEHR, which is Python-based and allows implementation of a range of simple and sophisticated (e.g., gradient-boosted tree-based and neural network) data imputation approaches. In addition to imputation, the tool enables data exploration for informed decision-making, as well as implementing machine learning prediction tools for response data selected by the user. Although the approach works for any missing-data problem, the tool is primarily motivated by problems encountered for EHR and other biomedical data. We illustrate the tool using multiple real datasets, providing performance measures of imputation and downstream predictive analysis.
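    As an illustration of the scripting gap the tool addresses (this is generic scikit-learn code, not ImputEHR's interface): even a simple mean fill versus a model-based fill takes different APIs.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Toy matrix with a missing entry, standing in for an EHR extract.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
model_filled = IterativeImputer(random_state=0).fit_transform(X)  # regression-based
```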

  8. Guns incident data

    • kaggle.com
    Updated Sep 7, 2020
    Cite
    Aman Miglani (2020). Guns incident data [Dataset]. https://www.kaggle.com/datasets/datatattle/guns-incident-data/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 7, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aman Miglani
    Description

    This data consists of incidents involving guns. Perform EDA to find hidden patterns. Columns:
    1) Race: race of the individual
    2) Date: date of the incident
    3) Education
    4) Police involvement
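    With only categorical columns like these, a cross-tabulation is a natural first EDA step; the rows and values below are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical rows matching the described columns.
incidents = pd.DataFrame({
    "Race": ["White", "Black", "White", "Hispanic"],
    "Police_involvement": ["Yes", "No", "No", "Yes"],
})

# Counts of incidents by race and police involvement.
table = pd.crosstab(incidents["Race"], incidents["Police_involvement"])
```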

    Please leave an upvote if you find this relevant. P.S. I am new and it will help immensely. :)

  9. Apple IPhone Customer Reviews

    • opendatabay.com
    Updated Jun 10, 2025
    Cite
    Datasimple (2025). Apple IPhone Customer Reviews [Dataset]. https://www.opendatabay.com/data/consumer/42533232-0299-4752-8408-4579f2251a34
    Explore at:
    Available download formats: (unspecified)
    Dataset updated
    Jun 10, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Reviews & Ratings
    Description

    Based on this dataset of iPhone reviews from Amazon, here are some project areas we can explore:

    -> Sentiment analysis: Determine overall sentiment and identify trends.

    -> Feature analysis: Analyze user satisfaction with specific features.

    -> Topic modeling: Discover underlying themes and discussion points.
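    A minimal, purely illustrative sketch of the sentiment-analysis idea; the tiny word lists are our own invention, and real work would use a trained model or a full lexicon such as VADER:

```python
# Tiny hand-rolled lexicons; purely illustrative.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"poor", "broken", "disappointing"}

def polarity(review: str) -> int:
    """Positive-word count minus negative-word count."""
    words = review.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

scores = [polarity(r) for r in ["Great battery, love it", "Screen arrived broken"]]
```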

    Original Data Source: Apple IPhone Customer Reviews

  10. 🇸🇬 Shopee App Reviews from Google Store

    • opendatabay.com
    Updated Jun 14, 2025
    Cite
    Datasimple (2025). 🇸🇬 Shopee App Reviews from Google Store [Dataset]. https://www.opendatabay.com/data/consumer/d5fa3d0d-8802-40cd-9e29-d477075f54e2
    Explore at:
    Available download formats: (unspecified)
    Dataset updated
    Jun 14, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Reviews & Ratings
    Description

    Context From the Shopee Wikipedia page

    Shopee Pte. Ltd., under the trade name "Shopee," is a Singaporean multinational technology company specializing in e-commerce. It is a subsidiary company of Sea Limited. It was launched in 2015 in Singapore, before its global expansion. As of 2021, Shopee is considered the largest e-commerce platform in Southeast Asia with 343 million monthly visitors. It also serves consumers and sellers across countries in East Asia and Latin America who wish to purchase and sell their goods online.

    (Personally, I use Shopee regularly.)

    Usage

    This dataset should paint a good picture of the public's perception of the app over the years. Using this dataset, we can:

    • Extract sentiments and trends.
    • Identify which versions of the app had the most positive feedback, and which the worst.
    • Use topic modelling to identify the pain points of the application.
    • And many more!
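    The version-feedback question reduces to a group-by; the column names here (reviewCreatedVersion, score) follow the usual Google Play review dumps, which is an assumption about this file's schema:

```python
import pandas as pd

# Hypothetical rows; real dumps carry thousands of reviews.
reviews = pd.DataFrame({
    "reviewCreatedVersion": ["2.80", "2.80", "2.81"],
    "score": [5, 3, 1],
})

mean_by_version = reviews.groupby("reviewCreatedVersion")["score"].mean()
best_version = mean_by_version.idxmax()
worst_version = mean_by_version.idxmin()
```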

    Note: Images generated using Bing Image Generator.

    Original Data Source: 🇸🇬 Shopee App Reviews from Google Store

  11. Health Insurance Lead Prediction

    • kaggle.com
    zip
    Updated Mar 2, 2021
    + more versions
    Cite
    Sathishkumar (2021). Health Insurance Lead Prediction [Dataset]. https://www.kaggle.com/klmsathishkumar/health-insurance-lead-prediction
    Explore at:
    Available download formats: zip (1177806 bytes)
    Dataset updated
    Mar 2, 2021
    Authors
    Sathishkumar
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context

    Your client, FinMan, is a financial services company that provides various financial services like loans, investment funds, and insurance to its customers. FinMan wishes to cross-sell health insurance to existing customers who may or may not hold insurance policies with the company. The company recommends health insurance to its customers based on their profile once these customers land on the website. Customers might browse the recommended health insurance policy and consequently fill out a form to apply. When a customer fills out the form, their response to the policy is considered positive and they are classified as a lead.

    Once these leads are acquired, the sales advisors approach them to convert and thus the company can sell proposed health insurance to these leads in a more efficient manner.

    Content

    • Demographics (city, age, region, etc.)
    • Information regarding the customer's holding policies
    • Recommended policy information

    Acknowledgements

    This dataset is released as part of a hackathon conducted by Analytics Vidhya. Visit https://datahack.analyticsvidhya.com/contest/job-a-thon/#ProblemStatement for more information.
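    Since a lead is a binary response, the task reduces to scoring customers with a classifier. A hedged sketch on synthetic features (the real competition columns differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the demographic/policy features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # 1 = became a lead

clf = LogisticRegression().fit(X, y)
lead_scores = clf.predict_proba(X)[:, 1]  # rank customers for outreach
```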

  12. Data from: An Exploratory Analysis of Barriers to Usage of the USDA Dietary...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    Updated Apr 21, 2025
    + more versions
    Cite
    Agricultural Research Service (2025). Data from: An Exploratory Analysis of Barriers to Usage of the USDA Dietary Guidelines for Americans [Dataset]. https://catalog.data.gov/dataset/data-from-an-exploratory-analysis-of-barriers-to-usage-of-the-usda-dietary-guidelines-for--bb6c7
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service (https://www.ars.usda.gov/)
    Description

    The average American’s diet does not align with the Dietary Guidelines for Americans (DGA) provided by the U.S. Department of Agriculture and the U.S. Department of Health and Human Services (2020). The present study aimed to compare fruit and vegetable consumption among those who had and had not heard of the DGA, identify characteristics of DGA users, and identify barriers to DGA use. A nationwide survey of 943 Americans revealed that those who had heard of the DGA ate more fruits and vegetables than those who had not. Men, African Americans, and those who have more education had greater odds of using the DGA as a guide when preparing meals relative to their respective counterparts. Disinterest, effort, and time were among the most cited reasons for not using the DGA. Future research should examine how to increase DGA adherence among those unaware of or who do not use the DGA. Comparative analyses of fruit and vegetable consumption among those who were aware/unaware and use/do not use the DGA were completed using independent samples t tests. Fruit and vegetable consumption variables were log-transformed for analysis. Binary logistic regression was used to examine whether demographic features (race, gender, and age) predict DGA awareness and usage. Data were analyzed using SPSS version 28.1 and SAS/STAT® version 9.4 TS1M7 (2023 SAS Institute Inc).
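    The study's core comparison (log-transform the skewed consumption variables, then an independent-samples t test) is easy to mirror with SciPy. The numbers below are invented, not the study's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Invented daily servings; log-normal to mimic the skew that motivates the transform.
aware = rng.lognormal(mean=0.6, sigma=0.6, size=300)    # heard of the DGA
unaware = rng.lognormal(mean=0.2, sigma=0.6, size=300)  # had not

# Log-transform, then the independent-samples t test used in the study.
t_stat, p_value = stats.ttest_ind(np.log(aware), np.log(unaware))
```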

  13. Real Estate Sales 730 Days

    • kaggle.com
    Updated Dec 7, 2022
    + more versions
    Cite
    The Devastator (2022). Real Estate Sales 730 Days [Dataset]. https://www.kaggle.com/datasets/thedevastator/analyzing-hartford-real-estate-sales-over-730-da/code
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 7, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real Estate Sales 730 Days

    City of Hartford real estate sales for the past 2 years

    By [source]

    About this dataset

    This dataset contains data on City of Hartford real estate sales for the last two years, with comprehensive records including property ID, parcel ID, sale date, sale price, and more. It is continuously updated each night and sourced from an official, reliable source. The columns in this dataset include LocationStartNumber, ApartmentUnitNumber, StreetNameAndWay, LandSF, TotalFinishedArea, LivingUnits, OwnerLastName, OwnerFirstName, PrimaryGrantor, SaleDate, SalePrice, TotalAppraisedValue, and LegalReference - all valuable information for anyone wishing to understand recent market trends and developments in the City of Hartford real estate industry. With this data providing detailed insights into what properties are selling, when, and for how much, let’s see what secrets we can learn from examining City of Hartford real estate activity!


    How to use the dataset

    This dataset contains helpful information about homes sold in the Hartford area over the past two years. This data can be used to analyze trends in real estate markets, as well as monitor sales activity for various areas.

    In order to use this dataset, you will need knowledge of EDA (exploratory data analysis) techniques such as data cleaning and data visualization. You will also need a basic understanding of SQL queries and the Python scripting language.

    The first step is to familiarize yourself with the columns and the information contained within the dataset by analyzing descriptive statistics like mean, min, and max. Next, you can filter or “slice” the data based on criteria or variables that interest you, such as sale-date range, location (by street name or zip code), sale-price range, or type of dwelling unit. After applying filters, it is important to take an error-checking step, looking for outliers or discrepancies; this will ensure more accurate results when plotting graphs and visualizing trends in tools like Tableau or Power BI.

    Next, you can conduct exploratory analysis through plots of the relationships between buyer characteristics (first and last name) and prices over time, living units versus square footage, or average price-per-bedroom/bathroom ratios, while taking into account external factors, such as seasonal changeovers, that could affect pricing fluctuations during given intervals across multiple neighborhoods; use interactive maps if available. At this point it is easy to compile insightful reports on commonalities among buyers and to generalize your findings, giving a better understanding of current market conditions across the demographic spectrums being compared (e.g., traditional vs. luxury properties) - all made possible through dedicated research with datasets like these!
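    The slicing and error-checking steps above can be sketched with pandas; the frame is a four-row stand-in that borrows the dataset's column names:

```python
import pandas as pd

# Four-row stand-in using the dataset's column names.
sales = pd.DataFrame({
    "SaleDate": pd.to_datetime(["2021-03-01", "2021-06-15", "2022-01-10", "2022-02-01"]),
    "SalePrice": [150000, 9000000, 220000, 180000],
})

# Slice by sale-date range.
recent = sales[sales["SaleDate"] >= "2022-01-01"]

# Error-check: flag price outliers with the 1.5 * IQR rule.
q1, q3 = sales["SalePrice"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sales[(sales["SalePrice"] < q1 - 1.5 * iqr) |
                 (sales["SalePrice"] > q3 + 1.5 * iqr)]
```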

    Research Ideas

    • Analyzing market trends in the City of Hartford's real estate industry by tracking sale prices and appraised values over time to identify regions that are under- or over-valued.
    • Conducting a predictive analysis project to predict future sale prices, annual appreciation rates, and key features associated with residential properties, such as total finished area and living units, for investment purposes.
    • Studying the impact of local zoning laws on property ownership and development by comparing sale dates, primary grantors, legal references, and street names and ways in a given area over time.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: real-estate-sales-730-days-1.csv

    | Column name         | Description                                                    |
    |:--------------------|:---------------------------------------------------------------|
    | LocationStartNumber | The starting number of the location of the property. (Integer) |
    | ApartmentUnitNumber | The apartment unit number of the property. (Integer)           |
    | StreetNameAndWay    | The st...                                                      |

  14. Data from: Drastic changes before the 2011 Tohoku earthquake, revealed by...

    • figshare.com
    zip
    Updated Feb 4, 2023
    Cite
    Tomokazu Konishi (2023). Drastic changes before the 2011 Tohoku earthquake, revealed by exploratory data analysis [Dataset]. http://doi.org/10.6084/m9.figshare.22010279.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tomokazu Konishi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Tohoku Region
    Description

    Predicting earthquakes is of the utmost importance, especially for countries at high risk, and although much effort has been made, it has yet to be realised. Nevertheless, there is a paucity of statistical approaches in seismic studies, to the extent that an old theory is believed without verification. Seismic records of time and magnitude in Japan were analysed by exploratory data analysis (EDA). EDA is a parametric statistical approach based on the characteristics of data and is suitable for data-driven investigations. The distribution style of each dataset was determined, and the important parameters were found. This enabled us to identify and evaluate anomalies in the data. Before the huge 2011 Tohoku earthquake, swarm earthquakes occurred at improbable frequencies, and the frequency and magnitude of all earthquakes increased. Both changes made larger earthquakes more likely to occur: even an M9 earthquake was expected every two years. From these simple measurements, the EDA succeeded in extracting useful information. Detecting and evaluating anomalies using this approach for every set of data would lead to more accurate prediction of earthquakes.
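    One of the "simple measurements" such an analysis rests on is the magnitude-frequency relation. Assuming the classical Gutenberg-Richter form (our reading; the abstract does not name it), the check is a straight line in log counts. A sketch on a synthetic catalogue:

```python
import numpy as np

# Synthetic catalogue: exponentially distributed magnitudes give b ≈ 1.
rng = np.random.default_rng(0)
mags = 4.0 + rng.exponential(scale=1 / np.log(10), size=5000)

# log10(count of events >= M) should be linear in M; the slope estimates -b.
thresholds = np.arange(4.0, 6.5, 0.5)
log_counts = np.log10([(mags >= m).sum() for m in thresholds])
slope = np.polyfit(thresholds, log_counts, 1)[0]
```

A real analysis would read magnitudes from the seismic catalogue instead of simulating them; departures of the observed counts from this line are the kind of anomaly EDA can surface.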

  15. Steam Action Game's Dataset

    • kaggle.com
    Updated Sep 20, 2024
    Cite
    Wajih ul Hassan (2024). Steam Action Game's Dataset [Dataset]. https://www.kaggle.com/datasets/wajihulhassan369/steam-games-dataset/suggestions
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wajih ul Hassan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a large amount of information on action games available on the Steam platform.

    It contains game titles, tags, release dates, prices, and more.

    The information is useful for studying game patterns, price modeling, and investigating correlations between game tags and pricing methods. This dataset is useful for both gamers and data scientists who want to conduct exploratory data analysis, construct machine learning models, or investigate the gaming industry.

    • Name: the name of the game
    • Price: price of the game in $
    • Release_date: when the game was released
    • Review_no: how many reviews the game received
    • Review_type: the overall review rating ('Very Positive', 'Mostly Positive', 'Mixed', 'Positive', 'Overwhelmingly Positive', 'Mostly Negative', 'Very Negative', 'Overwhelmingly Negative')
    • Tags: the tags given to the game, e.g., Adventure, Fantasy
    • Description: the description of the game
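
    The tag-versus-price correlation mentioned above starts with exploding the comma-separated Tags column; the three rows below are invented:

```python
import pandas as pd

# Invented rows; Tags is a comma-separated string as described above.
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C"],
    "Price": [9.99, 19.99, 9.99],
    "Tags": ["Adventure,Fantasy", "Adventure", "Fantasy"],
})

# One row per (game, tag), then average price per tag.
tagged = games.assign(Tag=games["Tags"].str.split(",")).explode("Tag")
price_by_tag = tagged.groupby("Tag")["Price"].mean()
```
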
  16. Exploratory Factor Analysis split - Insomnia negative affect and paranoia

    • orda.shef.ac.uk
    bin
    Updated May 31, 2023
    + more versions
    Cite
    Alexander Scott (2023). Exploratory Factor Analysis split - Insomnia negative affect and paranoia [Dataset]. http://doi.org/10.15131/shef.data.5331739
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    The University of Sheffield
    Authors
    Alexander Scott
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our full dataset was randomly split in half; we conducted an exploratory factor analysis (EFA) on one half of the dataset and a confirmatory factor analysis (CFA) on the other. This dataset represents the exploratory factor analysis half and forms the basis of the EFA presented in the PLoS paper: Scott, A.J., Rowse, G. and Webb, T.L. (2017). A structural equation model of the relationship between insomnia, negative affect, and paranoid thinking. PLoS One, 12(10): e0186233. DOI:10.1371/journal.pone.0186233
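    The random half-split itself is a one-liner worth knowing; a pandas sketch with a made-up item frame:

```python
import pandas as pd

# Made-up questionnaire responses: one row per participant, one column per item.
df = pd.DataFrame({f"item_{i}": range(100) for i in range(4)})

# Random half for the EFA; the untouched remainder for the CFA.
efa_half = df.sample(frac=0.5, random_state=42)
cfa_half = df.drop(efa_half.index)
```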

  17. m

    Data for "Best Practices for Your Exploratory Factor Analysis: a Factor...

    • data.mendeley.com
    Updated Aug 17, 2021
    Pablo Rogers (2021). Data for "Best Practices for Your Exploratory Factor Analysis: a Factor Tutorial" published by RAC-Revista de Administração Contemporânea [Dataset]. http://doi.org/10.17632/rdky78bk8r.2
    Explore at:
    Dataset updated
    Aug 17, 2021
    Authors
    Pablo Rogers
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains material related to the analysis performed in the article "Best Practices for Your Exploratory Factor Analysis: a Factor Tutorial". The material includes the data used in the analyses in .dat format, the labels (.txt) of the variables used in the Factor software, the outputs (.txt) evaluated in the article, and videos (.mp4 with English subtitles) recorded for the purpose of explaining the article. The videos can also be accessed in the following playlist: https://youtube.com/playlist?list=PLln41V0OsLHbSlYcDszn2PoTSiAwV5Oda. Below is a summary of the article:

    "Exploratory Factor Analysis (EFA) is one of the statistical methods most widely used in Administration, however, its current practice coexists with rules of thumb and heuristics given half a century ago. The purpose of this article is to present the best practices and recent recommendations for a typical EFA in Administration through a practical solution accessible to researchers. In this sense, in addition to discussing current practices versus recommended practices, a tutorial with real data on Factor is illustrated, a software that is still little known in the Administration area, but freeware, easy to use (point and click) and powerful. The step-by-step illustrated in the article, in addition to the discussions raised and an additional example, is also available in the format of tutorial videos. Through the proposed didactic methodology (article-tutorial + video-tutorial), we encourage researchers/methodologists who have mastered a particular technique to do the same. Specifically, about EFA, we hope that the presentation of the Factor software, as a first solution, can transcend the current outdated rules of thumb and heuristics, by making best practices accessible to Administration researchers".

  18. A

    ‘Hr Analytics Job Prediction’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Hr Analytics Job Prediction’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-hr-analytics-job-prediction-4c7a/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Hr Analytics Job Prediction’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/mfaisalqureshi/hr-analytics-and-job-prediction on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    HR Data Analytics: this dataset contains information about employees who worked at a company.

    Content

    This dataset contains columns: Satisfactory Level, Number of Project, Average Monthly Hours, Time Spend Company, Promotion Last 5 Years, Department, Salary

    Acknowledgements

    You can download, copy, and share this dataset to analyze and predict employee behaviour.

    Inspiration

    Answering the following questions would be worthwhile:
    1. Do exploratory data analysis to figure out which variables have a direct and clear impact on employee retention (i.e., whether they leave the company or continue to work).
    2. Plot bar charts showing the impact of employee salaries on retention.
    3. Plot bar charts showing the correlation between department and employee retention.
    4. Build a logistic regression model using the variables narrowed down in step 1.
    5. Measure the accuracy of the model.
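    The modeling and evaluation steps can be sketched with scikit-learn. The snippet below runs on synthetic data; the column names (satisfaction_level, salary_low, left) are assumptions based on the description, not the dataset's actual schema.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the HR data; column names are assumptions
# based on the description above, not the real dataset's schema.
idx = pd.Series(range(200))
df = pd.DataFrame({
    "satisfaction_level": (idx % 100) / 100.0,
    "salary_low": (idx % 3 == 0).astype(int),  # stand-in one-hot salary band
})
# Assume, for illustration, that low-satisfaction employees tend to leave.
df["left"] = (df["satisfaction_level"] < 0.4).astype(int)

# Fit a logistic regression on the candidate variables and measure accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="left"), df["left"], test_size=0.25, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

    On the real data, the features would instead be whichever variables step 1's exploratory analysis narrowed down.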

    --- Original source retains full ownership of the source dataset ---

  19. n

    Data from: Research and exploratory analysis driven - time-data...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jan 30, 2022
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko (2022). Research and exploratory analysis driven - time-data visualization (read-tv) software [Dataset]. http://doi.org/10.5061/dryad.d51c5b02g
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 30, 2022
    Dataset provided by
    Medical University of South Carolina
    Authors
    John Del Gaizo; Kenneth Catchpole; Alexander Alekseyenko
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    read-tv

    The main paper is about read-tv, open-source software for longitudinal data visualization. We uploaded sample surgical flow disruption data as a use case to highlight read-tv's capabilities. We scrubbed the data of protected health information and uploaded it as a single CSV file. The original data are described below.

    Data source

    Surgical workflow disruptions, defined as “deviations from the natural progression of an operation thereby potentially compromising the efficiency or safety of care”, provide a window on the systems of work through which it is possible to analyze mismatches between the work demands and the ability of the people to deliver the work. They have been shown to be sensitive to different intraoperative technologies, surgical errors, surgical experience, room layout, checklist implementation and the effectiveness of the supporting team. The significance of flow disruptions lies in their ability to provide a hitherto unavailable perspective on the quality and efficiency of the system. This allows for a systematic, quantitative and replicable assessment of risks in surgical systems, evaluation of interventions to address them, and assessment of the role that technology plays in exacerbation or mitigation.

    In 2014, Drs. Catchpole and Anger were awarded NIBIB R03 EB017447 to investigate flow disruptions in robotic surgery, which has resulted in a detailed, multi-level analysis of over 4,000 flow disruptions. Direct observation of 89 RAS (robot-assisted surgery) cases found a mean of 9.62 flow disruptions per hour, which varies across surgical phases, predominantly caused by coordination, communication, equipment, and training problems.

    Methods This section does not describe the methods of read-tv software development, which can be found in the associated manuscript from JAMIA Open (JAMIO-2020-0121.R1). This section describes the methods involved in the surgical work flow disruption data collection. A curated, PHI-free (protected health information) version of this dataset was used as a use case for this manuscript.

    Observer training

    Trained human factors researchers conducted each observation following the completion of observer training. The researchers were two full-time research assistants based in the department of surgery at site 3 who visited the other two sites to collect data. Human Factors experts guided and trained each observer in the identification and standardized collection of FDs. The observers were also trained in the basic components of robotic surgery in order to be able to tangibly isolate and describe such disruptive events.

    Comprehensive observer training was ensured with both classroom and floor training. Observers were required to review relevant literature, understand general practice guidelines for observing in the OR (e.g., where to stand, what to avoid, who to speak to), and conduct practice observations. The practice observations were broken down into three phases, all performed under the direct supervision of an experienced observer. During phase one, the trainees oriented themselves to the real-time events of both the OR and the general steps in RAS. The trainee was also introduced to the OR staff and any other involved key personnel. During phase two, the trainer and trainee observed three RAS procedures together to practice collecting FDs and become familiar with the data collection tool. Phase three was dedicated to determining inter-rater reliability by having the trainer and trainee simultaneously, yet independently, conduct observations for at least three full RAS procedures. Observers were considered fully trained if, after three full case observations, intra-class correlation coefficients (based on number of observed disruptions per phase) were greater than 0.80, indicating good reliability.

    Data collection

    Following the completion of training, observers individually conducted observations in the OR. All relevant RAS cases were pre-identified on a monthly basis by scanning the surgical schedule and recording a list of procedures. All procedures observed were conducted with the Da Vinci Xi surgical robot, with the exception of one procedure at Site 2, which was performed with the Si robot. Observers attended those cases that fit within their allotted work hours and schedule. Observers used Microsoft Surface Pro tablets configured with a customized data collection tool developed using Microsoft Excel to collect data. The data collection tool divided procedures into five phases, as opposed to the four phases previously used in similar research, to more clearly distinguish between task demands throughout the procedure. Phases consisted of phase 1 - patient in the room to insufflation, phase 2 -insufflation to surgeon on console (including docking), phase 3 - surgeon on console to surgeon off console, phase 4 - surgeon off console to patient closure, and phase 5 - patient closure to patient leaves the operating room. During each procedure, FDs were recorded into the appropriate phase, and a narrative, time-stamp, and classification (based off of a robot-specific FD taxonomy) were also recorded.

    Each FD was categorized into one of ten categories: communication, coordination, environment, equipment, external factors, other, patient factors, surgical task considerations, training, or unsure. The categorization system is modeled after previous studies, as well as the examples provided for each FD category.

    Once in the OR, observers remained as unobtrusive as possible. They stood at an appropriate vantage point in the room without getting in the way of team members. Once an appropriate time presented itself, observers introduced themselves to the circulating nurse and informed them of the reason for their presence. Observers did not directly engage in conversations with operating room staff, however, if a staff member approached them with any questions/comments they would respond.

    Data Reduction and PHI (Protected Health Information) Removal

    This dataset uses 41 of the aforementioned surgeries. All columns have been removed except disruption type, a numeric timestamp (minutes into the day), and surgical phase. In addition, each surgical case had its initial disruption set to 12 noon (720 minutes).
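    The alignment described above can be sketched in pandas: shift each case so that its first disruption lands at 720 minutes. The column names (case_id, minutes, phase) and the sample rows are assumptions for illustration, not the file's actual schema.

```python
import pandas as pd

# Toy stand-in for the flow-disruption CSV; column names and values
# are invented for illustration.
df = pd.DataFrame({
    "case_id": [1, 1, 1, 2, 2],
    "minutes": [505, 530, 610, 802, 845],
    "phase":   [1, 2, 3, 1, 2],
})

# Shift each case so its earliest disruption sits at 12 noon (720 minutes),
# preserving the relative timing within the case.
first = df.groupby("case_id")["minutes"].transform("min")
df["minutes_aligned"] = df["minutes"] - first + 720
print(df)
```

    Aligning every case to a common origin makes the timelines directly comparable when plotted together.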

  20. o

    Fake-Real News

    • opendatabay.com
    Updated Jun 17, 2025
    Datasimple (2025). Fake-Real News [Dataset]. https://www.opendatabay.com/data/ai-ml/3d64e244-a70c-4dec-9a82-b550be89e373
    Explore at:
    Dataset updated
    Jun 17, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Entertainment & Media Consumption
    Description

    Context

    As we all know, fake news has become a centre of attention worldwide because of its hazardous impact on our society. One recent example is the spread of fake news about Covid-19 cures, precautions, and symptoms, and by now you must understand how dangerous such bogus information can be. Distorted information propagated at election time to achieve a political agenda is hidden from no one.

    Fake news is quickly becoming an epidemic, and it alarms and angers me how often and how rapidly totally fabricated stories circulate. Why? In the first place, the deceptive effect: if a lie is repeated enough times, you begin to believe it is true.

    You understand by now that fake news and other types of false information can take on various appearances. They can likewise have significant effects, because information shapes our world view: we make important decisions based on information. We form an idea about people or a situation by obtaining information. So if the information we saw on the Web is invented, false, exaggerated or distorted, we won’t make good decisions.

    Hence, there is a dire need to do something about it. It is a big-data problem, to which data scientists can contribute in the fight against fake news.

    Content

    Although fighting fake news is a big-data problem, I have created this small dataset of approximately 10,000 news articles and metadata, scraped from approximately 600 web pages of the Politifact website, to analyse with data science skills and gain insight into how we can stop the spread of misinformation more broadly and which approach gives better accuracy.

    This dataset has 6 attributes, among which News_Headline is the most important for classifying news as FALSE or TRUE. Looking at the Label attribute, there are 6 classes specified. It is entirely up to you whether to use the dataset for multi-class classification or to convert the class labels into FALSE or TRUE and perform binary classification. For convenience, I will write a notebook on how to convert this dataset from multi-class to binary-class. To work with the text data, you need good hands-on practice with NLP and data-mining concepts.

    • News_Headline - the piece of information to be analysed.
    • Link_Of_News - the URL of the news headline in the first column.
    • Source - the author who posted the information on Facebook, Instagram, Twitter, or another social media platform.
    • Stated_On - the date when the information was posted by the author on social media.
    • Date - the date when the information was analysed by Politifact's team of fact-checkers in order to label it as FAKE or REAL.
    • Label - one of 6 class labels: True, Mostly-True, Half-True, Barely-True, False, Pants on Fire. You can either perform multi-class classification on it, or convert Mostly-True, Half-True, and Barely-True to True, drop Pants on Fire, and perform binary classification.
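    The multi-class to binary conversion mentioned for the Label column can be sketched in a few lines of pandas. The label values come from the description above; the small sample Series is an invented stand-in for the real column.

```python
import pandas as pd

# Stand-in for the dataset's Label column; the values match the six
# classes described above, the rows themselves are invented.
labels = pd.Series([
    "True", "Mostly-True", "Half-True", "Barely-True", "False", "Pants on Fire",
], name="Label")

# Map the intermediate truth labels to "True" and drop "Pants on Fire",
# as suggested in the description, leaving a binary True/False target.
binary = (
    labels.replace({"Mostly-True": "True", "Half-True": "True", "Barely-True": "True"})
          .loc[lambda s: s != "Pants on Fire"]
)
print(binary.value_counts())
```

    The resulting Series can serve directly as the target for a binary classifier.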

    Acknowledgements

    A very big thanks to the fact-checking team of the Politifact.com website, who provide correct labels through painstaking manual work, so that data scientists can train models on those labels and build better ones. The following research papers will help you get started with the project and clarify your fundamentals.

    Big Data and quality data for fake news and misinformation detection by Fatemeh Torabi Asr, Maite Taboada

    Automatic deception detection: Methods for finding fake news by Nadia K. Conroy Victoria L. Rubin Yimin Chen

    Inspiration

    I want to see which approach can solve the problem of combating fake news with greater accuracy.

    License

    CC BY-SA

    Original Data Source: Fake-Real News
