12 datasets found
  1. Data for "To Pre-Filter, or Not to Pre-Filter, That Is the Query: A...

    • figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Heather Cribbs; Gabriel Gardner (2023). Data for "To Pre-Filter, or Not to Pre-Filter, That Is the Query: A Multi-Campus Big Data Study" [Dataset]. http://doi.org/10.6084/m9.figshare.19071578.v1
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Heather Cribbs; Gabriel Gardner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Five files, one of which is a ZIP archive, containing data that support the findings of this study. PDF file "IA screenshots CSU Libraries search config" contains screenshots captured from the Internet Archive's Wayback Machine for all 24 CalState libraries' homepages for the years 2017-2019. Excel file "CCIHE2018-PublicDataFile" contains 2018 Carnegie Classifications data from the Indiana University Center for Postsecondary Research for all of the CalState campuses. CSV file "2017-2019_RAW" contains the raw data exported from Ex Libris Primo Analytics (OBIEE) for all 24 CalState libraries for calendar years 2017-2019. CSV file "clean_data" contains the cleaned data from Primo Analytics, which was used for all subsequent analyses such as charting and import into SPSS for statistical testing. ZIP archive file "NonparametricStatisticalTestsFromSPSS" contains 23 SPSS files [.spv format] reporting the results of testing conducted in SPSS, including normality checks, descriptives, and Kruskal-Wallis H-test results.
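
    The archive reports the SPSS output; for readers working in R, the same kind of test could be run along these lines (a minimal sketch: the file name matches the CSV described above, but the column names, searches and campus, are hypothetical placeholders):

    # Minimal sketch of a Kruskal-Wallis H-test in R, analogous to the SPSS
    # output in the ZIP archive. "clean_data.csv" is the cleaned file named
    # above; the column names (searches, campus) are hypothetical.
    clean <- read.csv("clean_data.csv")
    kruskal.test(searches ~ campus, data = clean)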

  2. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    zip (23875170 bytes), available download formats
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow the business and to make itemset suggestions to customers, allowing it to increase customer engagement, improve the customer experience, and identify customer behaviour. I solve this problem using association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rule mining is most often used when you want to discover associations between different objects in a set, and it works by finding frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows a retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat": - support = P(mouse & mat) = 8/100 = 0.08 - confidence = support / P(mouse) = 0.08/0.10 = 0.80 - lift = confidence / P(mat) = 0.80/0.09 ≈ 8.9. This is just a simple example. In practice, a rule typically needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
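
    The arithmetic above is easy to verify; a quick sketch in R:

    # Toy numbers from the example: 100 customers, 10 bought a mouse,
    # 9 bought a mouse mat, 8 bought both.
    n <- 100; mouse <- 10; mat <- 9; both <- 8
    support    <- both / n               # P(mouse & mat) = 0.08
    confidence <- support / (mouse / n)  # P(mat | mouse) = 0.80
    lift       <- confidence / (mat / n) # 0.80 / 0.09, approx. 8.9
    c(support = support, confidence = confidence, lift = lift)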

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: . xlsx
    • Number of Row: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Each library is briefly described below, and a minimal loading sketch follows the list.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - An opinionated collection of R packages designed for data science; the tidyverse package makes it easy to install and load them in a single step.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
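
    A minimal loading step for the packages listed above (install.packages() installs any that are missing; the screenshot below shows the original notebook's version):

    # Load the libraries described above.
    library(arules)     # transactions, apriori()
    library(arulesViz)  # rule visualisations
    library(readxl)     # read_excel()
    library(plyr)       # splitting, applying and combining data
    library(tidyverse)  # ggplot2, dplyr and friends
    library(knitr)      # dynamic reports
    library(magrittr)   # the %>% pipe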

    [Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    [Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png] [Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    Next, we clean the data frame by removing missing values.

    [Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together on one invoice will be in ...
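
    Since the listing is truncated here, the following is a hedged sketch of that conversion and the apriori run that follows, using the column names from the dataset description (BillNo, Itemname); the support and confidence thresholds are illustrative, not the notebook's actual settings:

    # Read the raw file and drop rows with missing invoice or item fields.
    retail <- readxl::read_excel("Assignment-1_Data.xlsx")
    retail <- retail[!is.na(retail$BillNo) & !is.na(retail$Itemname), ]

    # One basket per invoice: collect each BillNo's items, de-duplicated,
    # then coerce the list of baskets to an arules 'transactions' object.
    baskets <- lapply(split(retail$Itemname, retail$BillNo), unique)
    trans   <- as(baskets, "transactions")

    # Mine association rules and inspect the strongest ones by lift.
    rules <- apriori(trans,
                     parameter = list(supp = 0.001, conf = 0.8, minlen = 2))
    inspect(head(sort(rules, by = "lift"), 5))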

  3. Data from: A Worldwide Historical Dam Failure's Database

    • search.dataone.org
    • borealisdata.ca
    Updated Nov 6, 2024
    Cite
    Bernard-Garcia, Mayari; Mahdi, Tew-Fik (2024). A Worldwide Historical Dam Failure's Database [Dataset]. http://doi.org/10.5683/SP2/E7Z09B
    Explore at:
    Dataset updated
    Nov 6, 2024
    Dataset provided by
    Borealis
    Authors
    Bernard-Garcia, Mayari; Mahdi, Tew-Fik
    Time period covered
    Jan 1, 575 - May 21, 2019
    Description

    Assembled from 196 references, this database records a total of 3,861 cases of historical dam failures around the world and represents the largest compilation of dam failures recorded to date (17-02-2020). Failures are recorded regardless of the kind of dam (e.g. man-made dam, tailings dam, temporary dam, natural dam), the type of structure (e.g. concrete dam, embankment dam), the type of failure (e.g. piping failure, overtopping failure) or the properties of the dam (e.g. dam height, reservoir capacity). A total of 45 variables (which together compose the dataset) have been used, where possible, available and relevant, to record information about each failure (e.g. dam descriptions, dam properties, breach dimensions). Coupled with Excel's functionality (e.g., in Excel 2016: customizable screen visualization, individual search of specific cases, data filters, pivot tables), the database file can easily be adapted to the needs of the user (i.e. research field, dam type, dam failure type, etc.) and opens doors in various fields of research, such as hydrology, hydraulics and dam safety. The dataset also allows any user to optimize the verification process, to identify duplicates and to put the recorded historical dam failures back into context. Overall, this investigation aims to standardize the collection of historical dam-failure data and to facilitate international collection by setting guidelines. The sharing method provided through this link not only represents a considerable asset for a wide audience (e.g. researchers, dam owners) but also paves the way for the field of dam safety in the current era of "Big Data". Updated versions will be deposited at this DOI at undetermined frequencies in order to keep the recorded data up to date over the years.

  4. High-Throughput Comp. Screening of MOFs

    • kaggle.com
    Updated Jan 29, 2023
    Cite
    The Devastator (2023). High-Throughput Comp. Screening of MOFs [Dataset]. https://www.kaggle.com/datasets/thedevastator/high-throughput-comp-screening-of-mofs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 29, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    High-Throughput Comp. Screening of MOFs

    Open Metal Sites, Cavity Diameters and Free Paths

    By [source]

    About this dataset

    This dataset provides atomic coordinates for metal-organic frameworks (MOFs), enabling high-throughput computational screening of MOFs in a broad range of scenarios. The dataset is derived from the Cambridge Structural Database (CSD) and from sources across the internet, and offers an array of useful parameters, like accessible surface area (ASA), non-accessible surface area (NASA), largest cavity diameter (LCD), pore limiting diameter (PLD) and more. The results yielded by this dataset may prove very helpful in assessing the potential of MOFs as prospective materials for chemical separations, transformations and functional nanoporous materials. This can bring about improvements to many industries and help devise better products for consumers worldwide. If errors are found in this data, there is a feedback form available which can be used to report your findings. We appreciate your interest in our project and hope you will make good use of this data!


    How to use the dataset

    This guide will introduce you to the CoRE MOF 2019 dataset and explain how to properly use it for high-throughput computational screenings. It will provide you with the necessary background information and knowledge for successful use of this dataset.

    The CoRE MOF 2019 Dataset contains atomic coordinates for metal-organic frameworks (MOFs) which can be used as inputs for simulation software packages, enabling high-throughput computational screening of these MOFs. This dataset is derived from both the Cambridge Structural Database (CSD) and World Wide Web sources, providing powerful data on which MOF systems are suitable for potential applications in chemical separations, transformations, and functional nanoporous materials.

    In order to make efficient use of this dataset, it is important to familiarize yourself with all available columns. The columns contain information about a given MOF system, such as LCD (largest cavity diameter), PLD (pore limiting diameter), LFPD (largest sphere along the free path), ASA (accessible surface area), NASA (non-accessible surface area) and void fraction (AV_VF). There is also useful metadata such as public availability status, CSD overlap references in the CoRE or CCDC databases, and DOI details where available. For a full list of these features, refer to the provided documentation or codebook on the Kaggle website, or to your own research.

    Once you are familiar with the column specifications, download the database file from the Kaggle servers. The downloaded file can be opened in MS Excel or as CSV; each row represents a single distinct MOF, and each column holds the corresponding parameter value or range, depending on its type (integer/float/boolean). A specific row therefore shows every piece of information related to a particular framework system, such as the surface area accessible by molecules outside the pores (in m^2). Using this information, two different framework systems can be compared directly, without the preprocessing or manual calculations typically required when comparing values across different datasets holding the same kind of information. After ensuring that minimal data loss occurred during formatting, it is usually better to analyze the entire set at once rather than looping over individual rows and performing pairwise comparisons, even though the latter might appear simpler at first. A short sketch of such a direct comparison follows.
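
    As an illustration, a short R sketch of such a comparison (the file name is hypothetical; the column names PLD, LCD, ASA and AV_VF are those listed above):

    # Hypothetical file name; columns follow the description above.
    mofs <- read.csv("core_mof_2019.csv")

    # Keep frameworks whose pore-limiting diameter exceeds 6 (angstroms),
    # then rank the candidates by accessible surface area.
    candidates <- mofs[mofs$PLD > 6, ]
    candidates <- candidates[order(-candidates$ASA), ]
    head(candidates[, c("PLD", "LCD", "ASA", "AV_VF")])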

    Research Ideas

    • Create an open source library of automated SIM simulations for MOFs, which can be used to generate results quickly and accurately.
    • Update the existing Porous Materials Database (PMD) software with additional data fields that leverage insights from this dataset, allowing users to easily search and filter MOFs by specific structural characteristics.
    • Develop a web-based interface that allows researchers to visualize different MOF structures using realistic 3D images derived from the atomic data provided in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    ...

  5. Input data for Artificial Intelligence research for the 2019 Measuring...

    • catalogue.eatlas.org.au
    • researchdata.edu.au
    Updated Mar 1, 2021
    Cite
    Griffith Institute for Tourism, Griffith University (2021). Input data for Artificial Intelligence research for the 2019 Measuring aesthetics project (NESP TWQ 5.5, Griffith Institute for Tourism Research) [Dataset]. https://catalogue.eatlas.org.au/geonetwork/srv/api/records/34e73f87-f322-4477-ba00-5e2d9c17cde5
    Explore at:
    www:link-1.0-http--related, www:link-1.0-http--downloaddata (available download formats)
    Dataset updated
    Mar 1, 2021
    Dataset provided by
    Griffith Institute for Tourism, Griffith University
    Time period covered
    Jan 1, 2019 - Sep 30, 2020
    Description

    The last stream within the NESP 5.5 project involved an online survey to collect aesthetic ratings of an additional 3,500 images downloaded from Flickr, in order to improve the Artificial Intelligence (AI)-based system for recognising and assessing the beauty of natural scenes developed in the earlier NESP 3.2.3 project. Despite some earlier investment in this research area, there is still a need to improve the tools we use to measure the aesthetic beauty of marine landscapes. This research drew on images publicly available on the internet (in particular through the photo-sharing site Flickr) to build a large dataset of GBR images for the assessment of aesthetic value. Building on earlier work in NESP TWQ Hub Project 3.2.3, we conducted a survey focused on collecting beauty scores for an additional large number of GBR images (n = 3500). This dataset consists of one dataset report, two Word files and one Excel file documenting the aesthetic ratings collected and used to improve the accuracy of the aesthetic monitoring AI system.

    Methods: The third research stream was based on an online survey that collected aesthetic ratings from 1,585 Australians, who rated the aesthetic beauty of 3,500 GBR underwater pictures downloaded and selected from Flickr. Flickr is an image hosting service and one of the main sources of images for our project. We downloaded all images and their metadata (including coordinates where available) based on keyword filters such as "Great Barrier Reef". The Flickr API is available for non-commercial use by outside developers (commercial use is possible by prior arrangement). To ensure a much larger and more diverse supply of photographs, we developed a Python-based application using the Flickr API that allowed us to download Flickr images by keyword (e.g. "Great Barrier Reef"; available at https://www.flickr.com). The focus of this research was on underwater images, which had to be filtered from the downloaded Flickr photos. From the collected images we identified an additional 3,020 relevant images with coral and fish content out of a total of approximately 55,000 downloaded images. Matt Curnock, a CSIRO expert, also provided 100 images taken at the GBR from his private collection and consented to their use in our research. In total, 3,120 images were selected and renamed to be rated in a survey by Australian participants (see the two files "Image modification" and "Matt image rename" in the AI folder for further details).
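
    The project's downloader was written in Python; as a rough illustration of the same idea, here is a hedged R sketch of a keyword query against Flickr's public REST API (flickr.photos.search is the documented keyword-search method; the API key is a placeholder):

    # Keyword search against the Flickr REST API; "YOUR_API_KEY" is a
    # placeholder and must be replaced with a real (non-commercial) key.
    library(httr)
    library(jsonlite)

    resp <- GET("https://api.flickr.com/services/rest/",
                query = list(method  = "flickr.photos.search",
                             api_key = "YOUR_API_KEY",
                             text    = "Great Barrier Reef",
                             extras  = "geo,url_o",
                             format  = "json",
                             nojsoncallback = 1))
    photos <- fromJSON(content(resp, as = "text"))$photos$photo
    head(photos)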

    The survey was created and launched in April 2020 using the Qualtrics survey service. After consenting to participate in the online survey, each respondent was randomly shown 50 images of the GBR and rated the aesthetics of the GBR scenery on a 10-point scale (1 = very ugly/unpleasant to 10 = very beautiful/pleasant). In total, 1,585 complete and valid questionnaires were recorded. The aesthetic rating results were exported to an Excel file and used to improve the accuracy of the computer algorithm for recognising and assessing the beauty of natural scenes developed in the earlier NESP 3.2.3 project.

    Further information can be found here: Stantic, B. and Mandal, R. (2020) Aesthetic Assessment of the Great Barrier Reef using Deep Learning. Report to the National Environmental Science Program. Reef and Rainforest Research Centre Limited, Cairns (30pp.). Available at https://nesptropical.edu.au/wp-content/uploads/2020/11/NESP-TWQ-Project-5.5-Technical-Report-3.pdf

    Format: The AI dataset comprises one dataset report, one Excel file showing the aesthetic ratings of all images, and two Word files showing how the images downloaded from Flickr and provided by Matt Curnock (CSIRO) were renamed and used for aesthetic ratings and AI development. The aesthetic rating results were later used to improve the accuracy of the AI aesthetic monitoring system for the GBR.


    References: Murray, N., Marchesotti, L. & Perronnin, F. (2012). AVA: A Large-Scale Database for Aesthetic Visual Analysis. Available (09/10/17): http://refbase.cvc.uab.es/files/MMP2012a.pdf

    Data Location: This dataset is filed in the eAtlas enduring data repository at: data\custodian\2019-2022-NESP-TWQ-5\5.5_Measuring-aesthetics

  6. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes), available download formats
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of content added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. The dataset was cleaned with PostgreSQL and visualized with Tableau. Its purpose is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:
    1. Treat the nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns
    Extra steps and more explanation of the process are given in the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                      
    SELECT show_id, COUNT(*)                                                                                      
    FROM netflix 
    GROUP BY show_id                                                                                              
    ORDER BY show_id DESC;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLS. 
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3 
    

    The director column's nulls amount to about 30% of the whole column, so I will not delete them; instead I will use another column to populate them. To populate the director column, we want to find out whether there is a relationship between the movie_cast column and the director column.

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    --With this, we can now populate NULL director rows
    --using their record with movie_cast
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL ;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --When I was doing this, I found a less complex and faster way to populate a column which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column using the director column.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;
    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Check whether any director rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULLs in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    The date_added column has just 10 nulls out of over 8,000 rows; deleting them will not affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix_clean
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  7. SYNTHESEAS VIEW: SYstem for iNTegrating Human dimensions, Ecosystem Services...

    • data.csiro.au
    Updated Aug 6, 2024
    + more versions
    Cite
    Petina Pert; Lauren Stevens; Akshat Sehgal; Anthea Coggan; Jeremy De Valck; Diane Jarvis; Victoria Graham (2024). SYNTHESEAS VIEW: SYstem for iNTegrating Human dimensions, Ecosystem Services and Economic Assessment for Sustainability [Dataset]. https://data.csiro.au/collection/csiro:62964
    Explore at:
    Dataset updated
    Aug 6, 2024
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Petina Pert; Lauren Stevens; Akshat Sehgal; Anthea Coggan; Jeremy De Valck; Diane Jarvis; Victoria Graham
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2024 - Jul 31, 2024
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    SYstem for iNTegrating Human dimensions, Ecosystem Services and Economic Assessment for Sustainability. CSIRO developed this Shiny application to store metadata about datasets from the Great Barrier Reef region that are primarily ecosystem-service focused. The Shiny app allows users to filter the catalog of dataset records by ES category (Provisioning, Regulating & Cultural) and by ecosystem service, as well as by GBR user type (First Nations, Government, Household and Industry). Users can also filter by the various components of the Ecosystem Service Value Chain (ESVC) (e.g. use, measures and derived values) to understand which datasets have been used to construct an ESVC.
    Lineage: The SEABORNE (Sustainable UsE And Benefits fOR mariNE) project has consolidated and synthesised existing information about who is using the Reef, how it is being used, and what the benefits of this use are. CSIRO's research on the Great Barrier Reef (GBR) has been identified as a Category 4 Mission for the organisation, with well-established investors and collaborators, an internal coordination architecture, and impact delivered from a large portfolio of research across several Business Units. This project is one of several key strategic outcomes of the Great Barrier Reef Platform. SEABORNE began in November 2021, with the project team initially developing and sourcing a list of potential datasets relevant to the research question. An Excel spreadsheet was trialled as a data entry form for users; however, we encountered problems with dropdown fields not allowing multiple selections of values, due to the version of Excel and VBA programming. As part of the SCCPs (ERRFP-1322) we developed a Shiny (R) dashboard that allows the database to be filtered and searched in a user-friendly manner. We exported the data from the MS Access database as a CSV and used this in the Shiny app. The app also reads spatial extent data from the CSIRO DAP and the GBRMPA catalogue (an online JSON file) and displays it on a Leaflet map. Each record in the metadata database has a UniqueID and pertains to a dataset that has been used or considered in the SEABORNE project. The tool gives researchers a summary of what is available, particularly in the GBR in relation to ecosystem services: where to get the data, what the data is about, the quality of the data, and who to contact to acquire it. It is NOT a data warehouse, nor is it a data portal from which to download data.
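
    A minimal sketch of the filtering pattern described above (the CSV name and its columns, ES_Category and Title, are hypothetical placeholders rather than the app's actual schema):

    # Hedged sketch: filter a metadata CSV by ES category in Shiny.
    library(shiny)

    meta <- read.csv("seaborne_metadata.csv")  # placeholder file name

    ui <- fluidPage(
      selectInput("cat", "ES category",
                  choices = c("Provisioning", "Regulating", "Cultural")),
      tableOutput("records")
    )

    server <- function(input, output, session) {
      output$records <- renderTable({
        meta[meta$ES_Category == input$cat, c("Title", "ES_Category")]
      })
    }

    shinyApp(ui, server)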

  8. Religious Populations Worldwide

    • kaggle.com
    zip
    Updated Dec 8, 2023
    Cite
    The Devastator (2023). Religious Populations Worldwide [Dataset]. https://www.kaggle.com/datasets/thedevastator/religious-populations-worldwide
    Explore at:
    zip(481071 bytes)Available download formats
    Dataset updated
    Dec 8, 2023
    Authors
    The Devastator
    Description

    Religious Populations Worldwide

    Religious Populations Worldwide by Year and Category

    By Throwback Thursday [source]

    About this dataset

    The dataset includes data on Christianity, Islam, Judaism, Buddhism, Hinduism, Sikhism, Shintoism, Baha'i Faith, Taoism, Confucianism, Jainism and various other syncretic and animist religions. For each religion or denomination category, it provides both the total population count and the percentage representation in relation to the overall population.

    Additionally:
    • Columns labeled with Population provide numeric values representing the total number of individuals belonging to a particular religion or denomination.
    • Columns labeled with Percent represent the percentage of individuals belonging to a specific religion or denomination within a given population.
    • Columns that begin with ** indicate primary categories (e.g., Christianity), while columns without this prefix refer to subcategories (e.g., Christianity - Roman Catholics).

    In addition to providing precise data about specific religions or denominations globally across multiple years, this dataset also records geographical information by including state or country names under StateNme.

    This comprehensive dataset is valuable for researchers seeking information on global religious trends and can be used for analysis in fields such as sociology, anthropology and cultural studies, among others.

    How to use the dataset

    Introduction:

    • Understanding the Columns:

    • Year: Represents the year in which the data was recorded.

    • StateNme: Represents the name of the state or country for which data is recorded.

    • Population: Represents the total population of individuals.

    • Total Religious: Represents the total percentage and population of individuals who identify as religious, regardless of specific religion.

    • Non Religious: Represents the percentage and population of individuals who identify as non-religious or atheists.

    • Identifying Specific Religions: The dataset includes columns for different religions such as Christianity, Judaism, Islam, Buddhism, Hinduism, etc. Each religion is further categorized into specific denominations or types within that religion (e.g., Roman Catholics within Christianity). You can find relevant information about these religions by focusing on specific columns related to each one.

    • Analyzing Percentages vs. Population: Some columns provide percentages while others provide actual population numbers for each category. Depending on your analysis requirement, you can choose either column type for your calculations and comparisons.

    • Accessing Historical Data: The dataset includes records from multiple years allowing you to analyze trends in religious populations over time. You can filter data based on specific years using Excel filters or programming languages like Python.

    • Filtering Data by State/Country: If you are interested in understanding religious populations in a particular state or country, use filters to focus on that region's data only.

    Example - Extracting Information:

    Let's say you want to analyze Hinduism's growth globally from 2000 onwards:

    • Identify Relevant Columns:
    • Year: to filter data from 2000 onwards.
    • Hindu - Total (Percent): to analyze the percentage of individuals identifying as Hindus globally.

    • Filter Data:

    • Set a filter on the Year column and select values greater than or equal to 2000.

    • Look for rows where Hindu - Total (Percent) has values.

    • Analyze Results: You can now visualize and calculate the growth of Hinduism worldwide after filtering out irrelevant data. Use statistical methods or graphical representations like line charts to understand trends over time.
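
    The same extraction can be scripted; here is a hedged R sketch using the column names from the glossary above (the file name is a placeholder, and check.names = FALSE preserves the spaced header):

    # Filter to Year >= 2000 and rows where the Hindu percentage is present.
    rel <- read.csv("religious_populations.csv", check.names = FALSE)
    hindu <- rel[rel$Year >= 2000 & !is.na(rel$`Hindu - Total (Percent)`),
                 c("Year", "StateNme", "Hindu - Total (Percent)")]

    # Mean percentage per year, then a simple trend plot.
    trend <- aggregate(hindu$`Hindu - Total (Percent)`,
                       by = list(Year = hindu$Year), FUN = mean)
    plot(trend$Year, trend$x, type = "l",
         xlab = "Year", ylab = "Hindu share (%)")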

    Conclusion: This guide has provided you with an overview of how to use the Rel

    Research Ideas

    • Comparing religious populations across different countries: With data available for different states and countries, this dataset allows for comparisons of religious populations across regions. Researchers can analyze how different religions are distributed geographically and compare their percentages or total populations across various locations.
    • Studying the impact of historical events on religious demographics: Since the dataset includes records categorized by year, it can be used to study how historical events such as wars, migration, or political changes have influenced religious demographics over time. By comparing population numbers before and after specific events, resea...
  9. 350,000+ Jeopardy Questions

    • kaggle.com
    zip
    Updated Feb 4, 2020
    Cite
    Paul (2020). 350,000+ Jeopardy Questions [Dataset]. https://www.kaggle.com/prondeau/350000-jeopardy-questions
    Explore at:
    zip (19977471 bytes), available download formats
    Dataset updated
    Feb 4, 2020
    Authors
    Paul
    Description

    Context

    This dataset was obtained from Reddit user u/jwolle1 on https://www.reddit.com/r/datasets/comments/cj3ipd/jeopardy_dataset_with_349000_clues/

    Content

    Notes: - 349,641 clues in TSV format. Source: they prefer not to be named; DM for info. - I made one large complete dataset and also individual datasets for each season. The season files are small enough to open with Excel. - I tried to clean up all the formatting and encoding issues, so there are minimal stray characters (\u201c and the like). - I tried to filter out all the impossible audio and video clues. - I included Alex's comments from when he reads the categories at the beginning of each round. - I included a column that specifies whether a clue was a Daily Double (yes or no). - I noted when clues come from special episodes (Teen Tournament, Celebrity Jeopardy, etc.); I was on the fence about including this, but I decided it was the best way to find relatively easy or difficult clues. - I organized the data chronologically from 1984 to the present (July 2019, end of Season 35), and each category is grouped together so you can read it from top to bottom.
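
    A minimal read-in sketch for the TSV files described above (the file and column names are placeholders, since the exact headers are not listed here):

    # Read one of the TSV files; quote = "" avoids trouble with stray quotes.
    jeopardy <- read.delim("jeopardy_season35.tsv", quote = "",
                           stringsAsFactors = FALSE)
    str(jeopardy)           # inspect the columns actually present
    # e.g. table(jeopardy$daily_double), if that is the flag column's name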

  10. Covid-19 Food Insecurity Data

    • kaggle.com
    zip
    Updated Sep 13, 2021
    Cite
    Jack Ogozaly (2021). Covid-19 Food Insecurity Data [Dataset]. https://www.kaggle.com/datasets/jackogozaly/pulse-survey-food-insecurity-data
    Explore at:
    zip (6230854 bytes), available download formats
    Dataset updated
    Sep 13, 2021
    Authors
    Jack Ogozaly
    License

    Database Contents License (DbCL) 1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    What's in the Data?

    This dataset tracks food insecurity across different demographics from 4/23/2020 to 8/23/2021. It contains fields such as Race, Education, Sex, State, Income, etc. If you're looking for a dataset to examine Covid-19's impact on food insecurity across different demographics, then here you are!

    Data Source

    This data is from the United States Census Bureau's Pulse Survey. The Pulse Survey is a frequently updated survey designed to collect data on how people's lives have been impacted by the coronavirus. Specifically, this dataset is a cleaned-up version of the "Food Sufficiency for Households, in the Last 7 Days, by Select Characteristics" tables.

    The original form of this data can be found at: https://www.census.gov/programs-surveys/household-pulse-survey/data.html

    What was done to this data?

    The original form of this data was split across 36 Excel files containing ~67 sheets each. The data was in a non-tidy format, and the questions were not entirely standardized. This dataset is my attempt to combine all these different files, tidy the data, and merge slightly different questions together.

    Why are there so many NA's?

    The large number of NAs is a consequence of how awkward the data was originally and of forcing it into a tidy format. Just filter out the NAs for the question you want to analyze, as sketched below, and you'll be fine.
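
    That filtering step in R (the file name and the Food_Insecure column are placeholders; Race is one of the fields named above):

    # Keep only rows that answer the question of interest.
    pulse <- read.csv("pulse_food_insecurity.csv")
    by_race <- pulse[!is.na(pulse$Race) & !is.na(pulse$Food_Insecure), ]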

  11. Customer Shopping Trends Dataset

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
    Explore at:
    zip (149846 bytes), available download formats
    Dataset updated
    Oct 5, 2023
    Authors
    Sourav Banerjee
    Description

    Context

    The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

    Content

    This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

    Dataset Glossary (Column-wise)

    • Customer ID - Unique identifier for each customer
    • Age - Age of the customer
    • Gender - Gender of the customer (Male/Female)
    • Item Purchased - The item purchased by the customer
    • Category - Category of the item purchased
    • Purchase Amount (USD) - The amount of the purchase in USD
    • Location - Location where the purchase was made
    • Size - Size of the purchased item
    • Color - Color of the purchased item
    • Season - Season during which the purchase was made
    • Review Rating - Rating given by the customer for the purchased item
    • Subscription Status - Indicates if the customer has a subscription (Yes/No)
    • Shipping Type - Type of shipping chosen by the customer
    • Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)
    • Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)
    • Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction
    • Payment Method - Customer's most preferred payment method
    • Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)
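
    A small first-pass summary over these columns, as an R sketch (the file name is a placeholder; check.names = FALSE keeps headers like "Purchase Amount (USD)" intact):

    # Average spend per category and gender.
    shop <- read.csv("shopping_trends.csv", check.names = FALSE)
    aggregate(shop$`Purchase Amount (USD)`,
              by = list(Category = shop$Category, Gender = shop$Gender),
              FUN = mean)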

    Structure of the Dataset

    [Image: https://i.imgur.com/6UEqejq.png]

    Acknowledgement

    This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

    Cover Photo by: Freepik

    Thumbnail by: Clothing icons created by Flat Icons - Flaticon

  12. Data from: Car sales

    • kaggle.com
    zip
    Updated Oct 26, 2017
    Cite
    GaganBhatia (2017). Car sales [Dataset]. https://www.kaggle.com/datasets/gagandeep16/car-sales
    Explore at:
    zip (6987 bytes), available download formats
    Dataset updated
    Oct 26, 2017
    Authors
    GaganBhatia
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the Car Sales data set, which includes information about different cars. It was taken from AnalytixLabs for the purpose of prediction, and there are two goals.

    First, to determine which features have the most impact on car sales and report the results.

    Secondly, to train a model to predict car sales and check the accuracy of the predictions, as sketched below.
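
    One hedged way to approach both questions in R (the target column Sales and the model choice are placeholders, since the description does not list the file's columns):

    # Rank feature importance for a numeric sales target with a random
    # forest; identifier-like columns may need dropping first.
    library(randomForest)
    cars <- na.omit(read.csv("Car_sales.csv", stringsAsFactors = TRUE))
    fit  <- randomForest(Sales ~ ., data = cars, importance = TRUE)
    importance(fit)   # which features matter most
    # Prediction accuracy could then be checked with a train/test split.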

