37 datasets found
  1. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of content added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. The dataset is cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:
    1. Treat the nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns

    Extra steps and further explanation of the process are given in the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                      
    SELECT show_id, COUNT(*)
    FROM netflix
    GROUP BY show_id
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLs:
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3

    Nulls in the director column are about 30% of the column, so I will not delete them; instead, I will populate them from another column. To populate the director column, we first find out whether there is a relationship between the movie_cast and director columns.

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
    FROM netflix
    )
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    With this, we can now populate the NULL director rows using their movie_cast records.
    
    UPDATE netflix
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL;
    
    --Repeat this step to populate the rest of the director nulls
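
    --Alternatively, a set-based sketch (my addition, not part of the original
    --walkthrough): populate every remaining NULL director in one statement from
    --another title with the same movie_cast, assuming each such cast maps to a
    --single director; this mirrors the self-join used for country below

    UPDATE netflix
    SET director = nt2.director
    FROM netflix AS nt2
    WHERE netflix.movie_cast = nt2.movie_cast
    AND netflix.show_id <> nt2.show_id
    AND netflix.director IS NULL
    AND nt2.director IS NOT NULL;
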
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --While doing this, I found a less complex and faster way to populate a column, which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate it using the director column.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country, nt2.country)
    FROM netflix AS nt
    JOIN netflix AS nt2
    ON nt.director = nt2.director
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;

    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director
    AND netflix.show_id <> nt2.show_id
    AND netflix.country IS NULL;
    
    
    --Confirm whether any rows still have a NULL country (directors with no other record to copy from)
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the remaining NULLs in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix
    WHERE date_added IS NULL;

    --DELETE nulls

    DELETE FROM netflix
    WHERE date_added IS NULL;
    
  2. Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

    • plos.figshare.com
    docx
    Updated Jun 1, 2023
    Cite
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements (2023). Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data [Dataset]. http://doi.org/10.1371/journal.pone.0228154
    Explore at:
    docx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlotte S. C. Woolley; Ian G. Handel; B. Mark Bronsvoort; Jeffrey J. Schoenebeck; Dylan N. Clements
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard; returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42% respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them and removes duplications intelligently. This algorithm is broadly applicable to data cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.

  3. Data from: Data Cleaning and AutoML: Would an Optimizer Choose to Clean?

    • resodate.org
    Updated Aug 5, 2022
    Cite
    Felix Neutatz; Binger Chen; Yazan Alkhatib; Jingwen Ye; Ziawasch Abedjan (2022). Data Cleaning and AutoML: Would an Optimizer Choose to Clean? [Dataset]. http://doi.org/10.14279/depositonce-15981
    Explore at:
    Dataset updated
    Aug 5, 2022
    Dataset provided by
    DepositOnce
    Technische Universität Berlin
    Authors
    Felix Neutatz; Binger Chen; Yazan Alkhatib; Jingwen Ye; Ziawasch Abedjan
    Description

    Data cleaning is widely acknowledged as an important yet tedious task when dealing with large amounts of data. Thus, there is always a cost-benefit trade-off to consider. In particular, it is important to assess this trade-off when not every data point and data error is equally important for a task. This is often the case when statistical analysis or machine learning (ML) models derive knowledge about data. If we only care about maximizing the utility score of the applications, such as accuracy or F1 scores, many tasks can afford some degree of data quality problems. Recent studies analyzed the impact of various data error types on vanilla ML tasks, showing that missing values and outliers significantly impact the outcome of such models. In this paper, we expand the setting to one where data cleaning is not considered in isolation but as an equal parameter among many other hyper-parameters that influence feature selection, regularization, and model selection. In particular, we use state-of-the-art AutoML frameworks to automatically learn the parameters that benefit a particular ML binary classification task. In our study, we see that specific cleaning routines still play a significant role but can also be entirely avoided if the choice of a specific model or the filtering of specific features diminishes the overall impact.

  4. Data for: The Potential Impact of a Clean Energy Society On Air Quality

    • databank.illinois.edu
    • aws-databank-alb.library.illinois.edu
    Updated Feb 1, 2021
    Cite
    Swarnali Sanyal (2021). Data for: The Potential Impact of a Clean Energy Society On Air Quality [Dataset]. http://doi.org/10.13012/B2IDB-0060601_V1
    Explore at:
    Dataset updated
    Feb 1, 2021
    Authors
    Swarnali Sanyal
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    These datasets provide the basis of our analysis in the paper - The Potential Impact of a Clean Energy Society On Air Quality. All datasets here are from the model output (CAM4-chem). All the simulations were run to steady-state and only the outputs used in the analysis are archived here.

  5. Air Quality & Health data: Longitudinal impact of a clean air zone on asthma

    • healthdatagateway.org
    unknown
    Updated Oct 8, 2024
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). Air Quality & Health data: Longitudinal impact of a clean air zone on asthma [Dataset]. https://healthdatagateway.org/en/dataset/184
    Explore at:
    unknown
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    This dataset, curated by PIONEER, encompasses a detailed collection of 181,207 asthma admissions from 1st June 2016 to 31st May 2022, offering a comprehensive analysis tool for researchers examining the effects of air quality on respiratory health. It includes extensive patient demographics, serial physiological measurements, assessments, diagnostic codes (ICD-10 and SNOMED-CT), initial presentations, symptoms, and outcomes. Additionally, it integrates DEFRA air pollution data, geographically linked to individual health data, allowing for a nuanced exploration of environmental impacts on asthma incidence and severity. The dataset includes 4 years of data prior to and currently 1 year post introduction of the clean air zone.

    The dataset invites longitudinal studies to evaluate the Clean Air Zones' effectiveness. Timelines post-introduction of the clean air zone can be expanded to include data up to 2024. Its granular detail provides invaluable insights into emergency medicine, public health policy, and environmental science, supporting targeted interventions and policy formulations aimed at reducing asthma exacerbations and improving air quality standards.

    Geography: The West Midlands (WM) has a population of 6 million & includes a diverse ethnic & socio-economic mix. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & > 120 ITU bed capacity. UHB runs a fully electronic healthcare record (EHR) (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.

    Data set availability: Data access is available via the PIONEER Hub for projects which will benefit the public or patients. This can be by developing a new understanding of disease, by providing insights into how to improve care, or by developing new models, tools, treatments, or care processes. Data access can be provided to NHS, academic, commercial, policy and third sector organisations. Applications from SMEs are welcome. There is a single data access process, with public oversight provided by our public review committee, the Data Trust Committee. Contact pioneer@uhb.nhs.uk or visit www.pioneerdatahub.co.uk for more details.

    Available supplementary data: Matched controls; ambulance and community data. Unstructured data (images). We can provide the dataset in OMOP and other common data models and can build synthetic data to meet bespoke requirements.

    Available supplementary support: Analytics, model build, validation & refinement; A.I. support. Data partner support for ETL (extract, transform & load) processes. Bespoke and “off the shelf” Trusted Research Environment (TRE) build and run. Consultancy with clinical, patient & end-user and purchaser access/ support. Support for regulatory requirements. Cohort discovery. Data-driven trials and “fast screen” services to assess population size.

  6. Employment Of India CLeaned and Messy Data

    • kaggle.com
    zip
    Updated Apr 7, 2025
    Cite
    MANSI SHINDE (2025). Employment Of India CLeaned and Messy Data [Dataset]. https://www.kaggle.com/datasets/soniaaaaaaaa/employment-of-india-cleaned-and-messy-data/code
    Explore at:
    zip (29791 bytes)
    Dataset updated
    Apr 7, 2025
    Authors
    MANSI SHINDE
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    India
    Description

    This dataset presents a dual-version representation of employment-related data from India, crafted to highlight the importance of data cleaning and transformation in any real-world data science or analytics project.

    🔹 Dataset Composition:

    It includes two parallel datasets:
    1. Messy Dataset (Raw) – Represents a typical unprocessed dataset often encountered in data collection from surveys, databases, or manual entries.
    2. Cleaned Dataset – Demonstrates how proper data preprocessing can significantly enhance the quality and usability of data for analytical and visualization purposes.

    Each record captures multiple attributes related to individuals in the Indian job market, including:
    - Age Group
    - Employment Status (Employed/Unemployed)
    - Monthly Salary (INR)
    - Education Level
    - Industry Sector
    - Years of Experience
    - Location
    - Perceived AI Risk
    - Date of Data Recording

    Transformations & Cleaning Applied:

    The raw dataset underwent comprehensive transformations to convert it into its clean, analysis-ready form:
    - Missing Values: Identified and handled using either row elimination (where critical data was missing) or imputation techniques.
    - Duplicate Records: Identified using row comparison and removed to prevent analytical skew.
    - Inconsistent Formatting: Unified inconsistent column naming (e.g., 'monthly_salary_(inr)' → 'Monthly Salary (INR)'), capitalization, and string spacing.
    - Incorrect Data Types: Converted columns like salary from string/object to float for numerical analysis.
    - Outliers: Detected and handled based on domain logic and distribution analysis.
    - Categorization: Converted numeric ages into grouped age categories for comparative analysis.
    - Standardization: Applied uniform labels for employment status, industry names, education, and AI risk levels for visualization clarity.
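
    Two of these steps, duplicate removal and salary type conversion, could look roughly like this in PostgreSQL (a sketch only; raw_employment and its column names are hypothetical, since the dataset's own pipeline is not shown here):

    --Keep one physical copy of each fully duplicated row (ctid identifies physical rows)
    DELETE FROM raw_employment a
    USING raw_employment b
    WHERE a.ctid > b.ctid
    AND a.age_group = b.age_group
    AND a.employment_status = b.employment_status
    AND a.monthly_salary = b.monthly_salary;

    --Convert the salary column from text to a numeric type, treating blanks as NULL
    ALTER TABLE raw_employment
    ALTER COLUMN monthly_salary TYPE numeric
    USING NULLIF(TRIM(monthly_salary), '')::numeric;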

    Purpose & Utility:

    This dataset is ideal for learners and professionals who want to understand:
    - The impact of messy data on visualization and insights
    - How transformation steps can dramatically improve data interpretation
    - Practical examples of preprocessing techniques before feeding into ML models or BI tools

    It's also useful for: - Training ML models with clean inputs
    - Data storytelling with visual clarity
    - Demonstrating reproducibility in data cleaning pipelines

    By examining both the messy and clean datasets, users gain a deeper appreciation for why “garbage in, garbage out” rings true in the world of data science.

  7. Data from: Urbanev: An open benchmark dataset for urban electric vehicle charging demand prediction

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Apr 25, 2025
    Cite
    Han Li; Haohao Qu; Xiaojun Tan; Linlin You; Rui Zhu; Wenqi Fan (2025). Urbanev: An open benchmark dataset for urban electric vehicle charging demand prediction [Dataset]. http://doi.org/10.5061/dryad.np5hqc04z
    Explore at:
    zip
    Dataset updated
    Apr 25, 2025
    Dataset provided by
    Hong Kong Polytechnic University
    Institute of High Performance Computing
    Sun Yat-sen University
    Authors
    Han Li; Haohao Qu; Xiaojun Tan; Linlin You; Rui Zhu; Wenqi Fan
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    The recent surge in electric vehicles (EVs), driven by a collective push to enhance global environmental sustainability, has underscored the significance of exploring EV charging prediction. To catalyze further research in this domain, we introduce UrbanEV—an open dataset showcasing EV charging space availability and electricity consumption in a pioneering city for vehicle electrification, namely Shenzhen, China. UrbanEV offers a rich repository of charging data (i.e., charging occupancy, duration, volume, and price) captured at hourly intervals across an extensive six-month span for over 20,000 individual charging stations. Beyond these core attributes, the dataset also encompasses diverse influencing factors like weather conditions and spatial proximity. These factors are thoroughly analyzed qualitatively and quantitatively to reveal their correlations and causal impacts on charging behaviors. Furthermore, comprehensive experiments have been conducted to showcase the predictive capabilities of various models, including statistical, deep learning, and transformer-based approaches, using the UrbanEV dataset. This dataset is poised to propel advancements in EV charging prediction and management, positioning itself as a benchmark resource within this burgeoning field.

    Methods

    To build a comprehensive and reliable benchmark dataset, we conduct a series of rigorous processes from data collection to dataset evaluation. The overall workflow sequentially includes data acquisition, data processing, statistical analysis, and prediction assessment. Detailed descriptions follow.

    Study area and data acquisition

    Shenzhen, a pioneering city in global vehicle electrification, has been selected for this study with the objective of offering valuable insights into electric vehicle (EV) development that can serve as a reference for other urban centers. This study encompasses the entire expanse of Shenzhen, where data on public EV charging stations distributed around the city have been meticulously gathered. Specifically, EV charging data was automatically collected from a mobile platform used by EV drivers to locate public charging stations. Through this platform, users could access real-time information on each charging pile, including its availability (e.g., busy or idle), charging price, and geographic coordinates. Accordingly, we recorded the charging-related data at five-minute intervals from September 1, 2022, to February 28, 2023. This data collection process was fully digital and did not require manual readings. Furthermore, to delve into the correlation between EV charging patterns and environmental elements, weather data for Shenzhen city were acquired from two meteorological observatories situated in the airport and central regions, respectively. These meteorological data are publicly available on the Shenzhen Government Data Open Platform. Thirdly, point of interest (POI) data was extracted through the Application Programming Interface Platform of AMap.com, along with three primary types: food and beverage services, business and residential, and lifestyle services. Lastly, the spatial and static data were organized based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. The collected data contains detailed spatiotemporal information that can be analyzed to provide valuable insights about urban EV charging patterns and their correlations with meteorological conditions.

    Processing raw information into well-structured data

    To streamline the utilization of the UrbanEV dataset, we harmonize heterogeneous data from various sources into well-structured data with aligned temporal and spatial resolutions. This process can be segmented into two parts: the reorganization of EV charging data and the preparation of other influential factors.

    EV charging data

    The raw charging data, obtained from publicly available EV charging services, pertains to charging stations and predominantly comprises string-type records at a 5-minute interval. To transform this raw data into a structured time series tailored for prediction tasks, we implement the following three key measures:

    Initial Extraction. From the string-type records, we extract vital information for each charging pile, such as availability (designated as "busy" or "idle"), rated power, and the corresponding charging and service fees applicable during the observed time periods. First, a charging pile is categorized as "active charging" if its states at two consecutive timestamps are both "busy". Consequently, the occupancy within a charging station can be defined as the count of in-use charging piles, while the charging duration is calculated as the product of the count of in-use piles and the time between the two timestamps (in our case, 5 minutes). Moreover, the charging volume in a station can correspondingly be estimated by multiplying the duration by the piles' rated power. Finally, the average electricity price and service price are calculated for each station in alignment with the same temporal resolution as the three charging variables.
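
    To make these definitions concrete, here is a minimal SQL sketch of the occupancy, duration, and volume calculations (my illustration; pile_status(station_id, pile_id, ts, state, rated_power_kw) is an assumed table, not the authors' schema):

    --A pile is "active charging" when it is busy at two consecutive 5-minute timestamps
    SELECT s1.station_id,
        s1.ts,
        COUNT(*) AS occupancy,                             --count of in-use piles
        COUNT(*) * (5.0 / 60) AS duration_hours,           --in-use piles x 5 minutes
        SUM(s1.rated_power_kw) * (5.0 / 60) AS volume_kwh  --duration x rated power
    FROM pile_status AS s1
    JOIN pile_status AS s2
    ON s2.pile_id = s1.pile_id
    AND s2.ts = s1.ts + INTERVAL '5 minutes'
    WHERE s1.state = 'busy'
    AND s2.state = 'busy'
    GROUP BY s1.station_id, s1.ts;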

    Error Detection and Imputation. Ensuring data quality is paramount when utilizing charging data for decision-making, advanced analytics, and machine-learning applications. It is crucial to address concerns around data cleanliness, as the presence of inaccuracies and inconsistencies, often referred to as dirty data, can significantly compromise the reliability and validity of any subsequent analysis or modeling efforts. To improve the quality of our charging data, several errors were identified, particularly negative values for charging fees and inconsistencies between the counts of occupied, idle, and total charging piles. We removed the records containing these anomalies and treated them as missing data. In addition, a two-step imputation process was implemented to address missing values: first, forward filling replaced missing values using data from preceding timestamps; then, backward filling was applied to fill gaps at the start of each time series. Moreover, a certain number of outliers were identified in the dataset, which could significantly impact prediction performance. To address this, the interquartile range (IQR) method was used to detect outliers for metrics including charging volume (v), charging duration (d), and the rate of active charging piles at the charging station (o). To retain more original data and minimize the impact of outlier correction on the overall data distribution, we set the coefficient to 4 instead of the default 1.5. Finally, each outlier was replaced by the mean of its adjacent valid values. This preprocessing pipeline transformed the raw data into a structured and analyzable dataset.
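
    The IQR rule with coefficient 4 might be written as follows (again a sketch; charging(region_id, ts, v) is a hypothetical table holding one of the metrics):

    WITH q AS
    (
    SELECT region_id,
        percentile_cont(0.25) WITHIN GROUP (ORDER BY v) AS q1,
        percentile_cont(0.75) WITHIN GROUP (ORDER BY v) AS q3
    FROM charging
    GROUP BY region_id
    )
    SELECT c.region_id, c.ts, c.v
    FROM charging AS c
    JOIN q ON q.region_id = c.region_id
    WHERE c.v < q.q1 - 4 * (q.q3 - q.q1)  --below Q1 minus 4 * IQR
    OR c.v > q.q3 + 4 * (q.q3 - q.q1);    --above Q3 plus 4 * IQR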

    Aggregation and Filtration. Building upon the station-level charging data that has been extracted and cleansed, we further organize the data into a region-level dataset with an hourly interval, providing a new perspective for EV charging behavior analysis. This is achieved by two major processes: aggregation and filtration. First, we aggregate all the charging data from both temporal and spatial views: a. Temporally, we standardize all time-series data to a common resolution of one hour, as it serves as the least common denominator among the various resolutions. This establishes a unified temporal resolution for all time-series data, including pricing schemes, weather records, and charging data, thereby creating a well-structured dataset. Aggregation rules specify that the five-minute charging volume v and duration d are summed within each interval (i.e., one hour), whereas the occupancy o, electricity price pe, and service price ps are assigned their instantaneous values at specific hours for each charging pile. This distinction arises from the inherent nature of these data types: volume v and duration d are cumulative, while o, pe, and ps are instantaneous variables. Compared to using the mean or median values within each interval, selecting the instantaneous values of o, pe, and ps as representatives preserves the original data patterns more effectively and minimizes the influence of human interpretation. b. Spatially, stations are aggregated based on the traffic zones delineated by the sixth Residential Travel Survey of Shenzhen. After aggregation, the dataset comprises 331 regions (also called traffic zones) with 4344 timestamps. Second, variance tests and zero-value filtering were employed to filter out traffic zones with zero or no change in charging data. Specifically, it means that
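
    The hourly aggregation rules (sum the cumulative v and d; take the on-the-hour snapshot for the instantaneous o, pe, and ps) might look like this, with station_5min(zone_id, ts, v, d, o, pe, ps) as a hypothetical five-minute table:

    SELECT zone_id,
        date_trunc('hour', ts) AS hour,
        SUM(v) AS v,  --cumulative variables are summed over the hour
        SUM(d) AS d,
        MAX(o) FILTER (WHERE ts = date_trunc('hour', ts)) AS o,  --instantaneous: on-the-hour value
        MAX(pe) FILTER (WHERE ts = date_trunc('hour', ts)) AS pe,
        MAX(ps) FILTER (WHERE ts = date_trunc('hour', ts)) AS ps
    FROM station_5min
    GROUP BY zone_id, date_trunc('hour', ts);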

  8. Data Clean Room For Financial Services Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Data Clean Room For Financial Services Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-clean-room-for-financial-services-market
    Explore at:
    pdf, pptx, csv
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Clean Room for Financial Services Market Outlook



    According to our latest research, the Data Clean Room for Financial Services market size reached USD 1.42 billion globally in 2024, driven by increasing concerns over privacy, data security, and regulatory compliance in the financial sector. The market is expected to expand at a robust CAGR of 23.6% from 2025 to 2033, with a projected value of USD 11.31 billion by 2033. This growth is primarily fueled by the rising demand for secure data collaboration, stringent data privacy regulations, and the need for advanced analytics in financial institutions. As per our latest research, financial organizations are accelerating adoption of data clean room solutions to enable privacy-preserving data sharing and advanced analytics, while ensuring compliance with evolving global regulatory frameworks.




    The rapid digitalization of the financial services industry is a key growth driver for the Data Clean Room for Financial Services market. Financial institutions such as banks, insurance companies, and investment firms are increasingly leveraging big data and advanced analytics to drive business growth, improve customer experiences, and optimize risk management. However, the sensitive nature of financial data and the growing threat of cyberattacks have heightened the need for secure environments where multiple parties can collaborate and analyze data without exposing personally identifiable information (PII). Data clean rooms provide a solution by allowing privacy-compliant data collaboration, enabling financial organizations to unlock valuable insights while maintaining strict data governance and security standards. This necessity is further amplified by the increasing adoption of cloud-based technologies and the proliferation of data sources, which require robust solutions to ensure data integrity and privacy.




    Regulatory compliance is another major catalyst for the expansion of the Data Clean Room for Financial Services market. The introduction and enforcement of stringent data privacy laws such as the General Data Protection Regulation (GDPR) in Europe, the California Consumer Privacy Act (CCPA), and various other region-specific regulations have compelled financial institutions to rethink their data handling and processing strategies. Non-compliance with these regulations can result in significant financial penalties, reputational damage, and loss of customer trust. Data clean rooms offer a compliant framework for data analysis and sharing by enabling organizations to enforce granular access controls, audit trails, and data anonymization techniques. As regulatory scrutiny intensifies and cross-border data flows become more complex, the demand for advanced data clean room solutions is expected to surge, making compliance a central pillar of market growth.




    In addition to regulatory and security drivers, the growing need for collaborative data-driven innovation is propelling the adoption of data clean rooms in the financial services sector. Financial institutions are increasingly partnering with fintech companies, technology providers, and even competitors to co-develop new products, enhance fraud detection capabilities, and improve marketing effectiveness. However, traditional data sharing methods pose significant risks to data privacy and intellectual property. Data clean rooms enable secure, privacy-preserving collaboration by allowing parties to jointly analyze datasets without exposing raw data to each other. This fosters innovation while minimizing risk, enabling organizations to extract actionable insights from combined datasets and accelerate digital transformation initiatives. The rise of artificial intelligence (AI) and machine learning (ML) in financial services further underscores the need for secure data environments, as these technologies rely on large, diverse datasets to deliver accurate results.




    From a regional perspective, North America currently dominates the Data Clean Room for Financial Services market, accounting for the largest share in 2024. This leadership is attributed to the early adoption of advanced analytics, a highly regulated financial ecosystem, and the presence of leading technology vendors. Europe follows closely, driven by stringent data privacy regulations and a mature financial sector. The Asia Pacific region is poised for the fastest growth over the forecast period, fueled by rapid digitalization, increasing investments in fintech, and evolving regulatory landscapes. Latin America and the Middle East & Africa are also witne

  9. DATASET Analytical Data for the Cost-Benefit Impact Assessment of Clean Production Agreements

    • data.mendeley.com
    Updated Jul 9, 2025
    Cite
    Francisco Catalan Meyer (2025). DATASET Analytical Data for the Cost-Benefit Impact Assessment of Clean Production Agreements [Dataset]. http://doi.org/10.17632/8bbjkrtwh5.1
    Explore at:
    Dataset updated
    Jul 9, 2025
    Authors
    Francisco Catalan Meyer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was generated for the analysis presented in the article Application of distributional weights for the cost-benefit impact assessment of sustainable production programs. A case study of Clean Production Agreements. It contains the consolidated analytical data for the Clean Production Agreements (CPAs) certified in Chile between 2012 and 2024. The file merges data from multiple public and institutional sources and includes calculated variables used for the distributional weighting model proposed in the study.

    Funding Information: This research did not receive any specific external grant from funding agencies in the public, commercial, or not-for-profit sectors. The work was conducted as part of the author's professional responsibilities at the Agency for Sustainability and Climate Change (ASCC) and was supported by its institutional resources. The ASCC is a committee of the Corporación de Fomento de la Producción (CORFO).

    COLUMN DESCRIPTIONS
    1. (Unnamed): Row index.
    2. CODIGO_APL: Unique identifier for the Clean Production Agreement (CPA).
    3. BS: Total Social Benefit, monetized in Chilean Unidad de Fomento (UF).
    4. Año.Certificación: Year of final certification of the CPA.
    5. REGION_APL: Primary region of implementation for the CPA. NACIONAL indicates a national scope.
    6. GRANDE: Number of participating facilities classified as 'Large Enterprise'.
    7. MEDIANA: Number of participating facilities classified as 'Medium Enterprise'.
    8. MICRO: Number of participating facilities classified as 'Micro Enterprise'.
    9. NC: Number of participating facilities classified as 'Unclassified', primarily consisting of civil society organizations.
    10. PEQUENA: Number of participating facilities classified as 'Small Enterprise'.
    11. SSPP: Number of participating facilities classified as 'Public Services' (e.g., government agencies, municipalities).
    12. Total.general: Total number of participating facilities in the CPA.
    13. Tasa_MIPYME: SME (Micro, Small, and Medium Enterprise) participation rate.
    14. Gasto.21: Pro-rated share of the agency's personnel and human resource costs (Chilean public budget sub-item 21, Personnel Expenses), allocated to the CPA based on its activity duration, in UF.
    15. Gasto.22: Pro-rated share of the agency's operational overhead costs (Chilean public budget sub-item 22, Goods and Services), allocated to the CPA based on its activity duration, in UF.
    16. Gasto.24: Total direct funding allocated to the CPA under Chilean public budget sub-item 24 (Current Transfers), in UF.
    17. Gasto.Total: Total public expenditure allocated to the CPA (sum of Gasto.21, Gasto.22, and Gasto.24), in UF.
    18. IDH_2022: Human Development Index (2022 estimate) for the corresponding region.
    19. Ponderador_MIPYME: The SME weighting component, calculated as (1 + Tasa_MIPYME).
    20. Ponderador_IDH: The territorial weighting component, calculated as (HDI_reference IDH_region).

  10. Database of Major Clean Air Act Regulations, 1997-2019

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Jul 21, 2020
    Cite
    Joseph E. Aldy; Matthew Kotchen; Mary Evans; Meredith Fowlie; Arik Levinson; Karen Palmer (2020). Database of Major Clean Air Act Regulations, 1997-2019 [Dataset]. http://doi.org/10.7910/DVN/J2HWDA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2020
    Dataset provided by
    Harvard Dataverse
    Authors
    Joseph E. Aldy; Matthew Kotchen; Mary Evans; Meredith Fowlie; Arik Levinson; Karen Palmer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This database includes data and information on the methods, benefits, and costs presented in the regulatory impact analyses of all major Clean Air Act regulations promulgated by the EPA over 1997-2019. The database includes pollution-specific benefits estimates by rule, and distinguishes between targeted and ancillary monetized benefits. The authors recommend downloading the original Excel file for accessing the database.

  11. Health Data Clean Room Liability Insurance Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 1, 2025
    Cite
    Dataintelo (2025). Health Data Clean Room Liability Insurance Market Research Report 2033 [Dataset]. https://dataintelo.com/report/health-data-clean-room-liability-insurance-market
    Explore at:
    pdf, pptx, csv
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Health Data Clean Room Liability Insurance Market Outlook



    As per our latest research, the global Health Data Clean Room Liability Insurance market size reached USD 1.32 billion in 2024, reflecting a robust response to the increasing complexity of healthcare data privacy and compliance regulations. The market is projected to expand at a CAGR of 18.5% from 2025 to 2033, forecasting a value of approximately USD 6.23 billion by 2033. This remarkable growth is primarily driven by the rising adoption of data clean rooms in healthcare, growing cyber threats, and the intensifying need for specialized liability coverage to mitigate emerging risks associated with sensitive health data handling.




    The primary growth factor for the Health Data Clean Room Liability Insurance market is the accelerating digital transformation within the healthcare sector. As healthcare providers and related organizations increasingly rely on data clean rooms to facilitate secure and compliant data collaboration, the risks associated with privacy breaches and regulatory non-compliance have surged. The implementation of stringent data protection regulations such as HIPAA, GDPR, and other regional mandates has compelled organizations to seek robust liability insurance solutions tailored to the unique risks of health data clean rooms. This heightened regulatory scrutiny is pushing organizations to invest in insurance products that not only cover traditional liabilities but also address the complexities of data anonymization, third-party access, and cross-border data sharing.




    Another significant driver is the proliferation of cyber threats targeting healthcare organizations, which are among the most lucrative targets for cybercriminals due to the sensitive nature of the data they handle. The increasing frequency and sophistication of cyberattacks, including ransomware, phishing, and insider threats, have underscored the need for comprehensive liability insurance. Health data clean rooms, while designed to enhance privacy and security, present novel vulnerabilities that insurers are now addressing through specialized products. These products offer coverage for financial losses, legal expenses, and reputational damage arising from data breaches or misuse within clean room environments, further fueling market demand.




    Additionally, the growing collaboration between healthcare providers, insurers, pharmaceutical companies, and research organizations is amplifying the need for shared data environments that maintain privacy and compliance. Health data clean rooms enable these collaborations by providing secure spaces for data analysis without exposing raw data. However, this also introduces shared liability and complex risk profiles that traditional insurance policies may not adequately cover. As a result, there is a rising trend among organizations to seek bespoke liability insurance solutions that can be customized to their specific operational models, data governance frameworks, and regulatory obligations, thereby driving the market's expansion.




    From a regional perspective, North America continues to dominate the Health Data Clean Room Liability Insurance market, accounting for over 41% of the global market share in 2024. This leadership is attributed to the advanced healthcare infrastructure, early adoption of digital health technologies, and a highly regulated environment that prioritizes data privacy and security. Europe follows closely, supported by the comprehensive data protection framework under GDPR and increased investments in healthcare IT. Meanwhile, the Asia Pacific region is experiencing the fastest growth, driven by rapid digitalization, expanding healthcare sectors, and evolving regulatory landscapes, making it a key area of focus for market participants in the coming years.



    Coverage Type Analysis



    The Coverage Type segment in the Health Data Clean Room Liability Insurance market is categorized into General Liability, Professional Liability, Cyber Liability, and Others. Among these, Cyber Liability has emerged as the most dynamic sub-segment, primarily due to the escalating cyber risks associated with health data clean room environments. As healthcare organizations increasingly rely on interconnected digital platforms, the exposure to data breaches, unauthorized access, and cyberattacks has expanded significantly. Cyber Liability insurance products are specifically designed to address these r

  12. EO Data Clean Room Collaboration Tools Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). EO Data Clean Room Collaboration Tools Market Research Report 2033 [Dataset]. https://dataintelo.com/report/eo-data-clean-room-collaboration-tools-market
    Explore at:
    csv, pptx, pdf
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    EO Data Clean Room Collaboration Tools Market Outlook




    According to our latest research, the EO Data Clean Room Collaboration Tools market size reached USD 1.42 billion globally in 2024, and it is anticipated to grow at a robust CAGR of 17.8% during the forecast period. By 2033, the market is forecasted to attain a value of USD 6.18 billion. The primary growth driver for this market is the increasing demand for privacy-compliant data collaboration solutions across various industries, propelled by stricter data privacy regulations and the need for secure, scalable data sharing environments.




    The EO Data Clean Room Collaboration Tools market is experiencing significant momentum due to the rising emphasis on data privacy and regulatory compliance. Organizations are increasingly seeking solutions that allow them to collaborate on sensitive data without compromising privacy or breaching regulations such as GDPR, CCPA, and HIPAA. The proliferation of digital data and the growing use of third-party data for analytics, marketing, and operational improvements have made data clean rooms indispensable. These tools enable organizations to extract insights from combined datasets while maintaining strict access controls, encryption, and anonymization, helping them mitigate the risks associated with data breaches and non-compliance. As a result, enterprises across industries are ramping up investments in EO data clean room collaboration technologies to future-proof their data strategies.




    Another key growth factor is the rapid digital transformation across sectors such as healthcare, financial services, retail, and government. The integration of advanced analytics, artificial intelligence, and machine learning into business operations has increased the need for collaborative data environments that are both secure and scalable. EO Data Clean Room Collaboration Tools are uniquely positioned to address these needs by offering robust capabilities for data integration, analytics, and privacy management. The surge in cloud adoption, remote work, and cross-border data collaborations has further amplified the demand for these tools, as organizations strive to enable seamless data sharing while adhering to local and international privacy laws. This trend is expected to accelerate as more organizations recognize the strategic value of secure data collaboration in driving innovation and competitive advantage.




    Furthermore, the market is benefiting from technological advancements and the emergence of new business models that rely heavily on data-driven decision-making. The ability to securely collaborate on data with external partners, suppliers, and customers is becoming a critical differentiator for organizations aiming to enhance customer experiences, optimize supply chains, and drive targeted marketing campaigns. EO Data Clean Room Collaboration Tools facilitate these collaborations by providing a secure environment for joint data analysis, reducing the risk of data leakage and ensuring that sensitive information remains protected. The growing awareness of the potential financial and reputational damage caused by data breaches is prompting organizations to adopt these tools proactively, fueling market growth.




    Regionally, North America continues to dominate the EO Data Clean Room Collaboration Tools market, driven by the presence of leading technology providers, early adoption of privacy regulations, and a strong focus on data-driven innovation. However, Asia Pacific is emerging as a high-growth region, supported by rapid digitalization, increasing regulatory scrutiny, and the expansion of cloud infrastructure. Europe also holds a significant market share, owing to stringent data privacy laws and a mature technology ecosystem. Latin America and the Middle East & Africa are witnessing steady growth, albeit from a smaller base, as organizations in these regions begin to prioritize data privacy and secure collaboration in their digital transformation journeys.



    Component Analysis




    The EO Data Clean Room Collaboration Tools market is segmented by component into software and services, each playing a pivotal role in facilitating secure, privacy-compliant data collaboration. The software segment comprises platforms and solutions that enable organizations to manage, analyze, and share data securely within a controlled environment. These platforms typically offer features such as data encryption, a

  13. Global Clean Locker Market Economic and Social Impact 2025-2032

    • statsndata.org
    excel, pdf
    Updated Oct 2025
    Cite
    Stats N Data (2025). Global Clean Locker Market Economic and Social Impact 2025-2032 [Dataset]. https://www.statsndata.org/report/clean-locker-market-303040
    Explore at:
    pdf, excel
    Dataset updated
    Oct 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Clean Locker market has emerged as a vital segment within various industries, responding to the increasing demand for hygiene, security, and efficiency in the handling of personal belongings and sensitive items. Clean lockers are specifically designed to minimize contamination, making them essential in sectors s

  14. Data Clean Rooms for Travel Brands Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 4, 2025
    Cite
    Growth Market Reports (2025). Data Clean Rooms for Travel Brands Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-clean-rooms-for-travel-brands-market
    Explore at:
    csv, pptx, pdf
    Dataset updated
    Oct 4, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Clean Rooms for Travel Brands Market Outlook



    As per our latest research, the global Data Clean Rooms for Travel Brands market size reached USD 587.4 million in 2024, reflecting the sector’s rapid adoption of privacy-centric data collaboration solutions. The market is projected to expand at a robust CAGR of 18.2% from 2025 to 2033, with the market size expected to reach USD 2,561.7 million by 2033. This impressive growth trajectory is primarily driven by the increasing demand for secure, privacy-compliant data sharing and analytics among travel brands, as well as the mounting pressure from evolving regulatory frameworks worldwide.




    One of the key growth factors propelling the Data Clean Rooms for Travel Brands market is the intensifying focus on data privacy and compliance. With stringent regulations such as GDPR in Europe, CCPA in California, and similar frameworks emerging across the globe, travel brands are under immense pressure to ensure that customer data is handled with the utmost care. Data clean rooms provide a secure environment where multiple parties—such as airlines, hotels, and online travel agencies—can collaborate on data-driven campaigns without exposing raw, personally identifiable information. This capability not only enables brands to unlock richer insights but also ensures adherence to regulatory mandates, driving widespread adoption across the travel sector.




    Another critical driver is the escalating need for advanced customer insights and hyper-personalization. As travel brands seek to differentiate themselves in a highly competitive market, the ability to deliver tailored experiences has become paramount. Data clean rooms empower organizations to aggregate and analyze customer data from disparate sources—ranging from booking engines to loyalty programs—while maintaining strict privacy controls. This enables travel brands to refine their marketing strategies, optimize audience targeting, and enhance customer journeys, ultimately boosting conversion rates and brand loyalty. The growing sophistication of analytics and machine learning tools integrated within clean room environments further amplifies the value proposition for travel industry stakeholders.




    The surge in strategic partnerships and alliances among travel brands and technology providers is further catalyzing market expansion. Leading airlines, hotel chains, and online travel agencies are increasingly collaborating with data clean room vendors to co-develop innovative solutions tailored to the unique needs of the travel industry. These partnerships facilitate seamless data integration, audience activation, and attribution measurement, allowing brands to maximize the impact of their marketing spend. Moreover, the proliferation of cloud-based deployment models is lowering barriers to entry for small and medium-sized enterprises, democratizing access to advanced data collaboration tools and accelerating overall market growth.




    Regionally, North America continues to dominate the Data Clean Rooms for Travel Brands market, accounting for the largest revenue share in 2024. This leadership position is attributed to the region’s mature digital infrastructure, high concentration of leading travel brands, and proactive adoption of privacy-enhancing technologies. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid digitalization, burgeoning travel and tourism sectors, and increasing regulatory scrutiny. Europe also remains a critical market, underpinned by robust data protection laws and a strong emphasis on consumer privacy. As travel brands across all regions grapple with the dual imperatives of data-driven growth and regulatory compliance, the adoption of data clean rooms is poised to accelerate further in the years ahead.





    Component Analysis



    The Component segment of the Data Clean Rooms for Travel Brands market is bifurcated into Software and Services, each playing a pivotal role in enabling secure data collaboration for travel brands. The Software s

  15. Energy Consumption of United States Over Time

    • kaggle.com
    zip
    Updated Dec 14, 2022
    Cite
    The Devastator (2022). Energy Consumption of United States Over Time [Dataset]. https://www.kaggle.com/datasets/thedevastator/unlocking-the-energy-consumption-of-united-state
    Explore at:
    zip(222388 bytes)Available download formats
    Dataset updated
    Dec 14, 2022
    Authors
    The Devastator
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    Energy Consumption of United States Over Time

    Building Energy Data Book

    By Department of Energy [source]

    About this dataset

    The Building Energy Data Book (2011) is an invaluable resource for gaining insight into the current state of energy consumption in the buildings sector. This dataset provides comprehensive data on residential, commercial, and industrial building energy consumption, construction techniques, building technologies, and characteristics. With this resource, you can get an in-depth understanding of how energy is used in various types of buildings - from single-family homes to large office complexes - as well as its impact on the environment. The BTO within the U.S. Department of Energy's Office of Energy Efficiency and Renewable Energy developed this dataset to provide a wealth of knowledge for researchers, policy makers, engineers, and everyday observers who are interested in learning more about our built environment and its energy usage patterns.


    How to use the dataset

    This dataset provides comprehensive information about energy consumption in the buildings sector of the United States. It contains a number of key variables that can be used to analyze and explore the relationships between energy consumption and building characteristics, technologies, and construction. The data are provided in CSV as well as tabular format, which makes them convenient for programs like Excel or other statistical modeling software.

    To get started with this dataset, we've developed a guide outlining how to use it effectively for your research or project needs.

    • Understand what's included: Before you start analyzing the data, you should read through the provided documentation so that you fully understand what is included in the datasets. You'll want to be aware of any potential limitations or requirements associated with each type of data point so that your results are valid and reliable when drawing conclusions from them.

    • Clean up any outliers: You may need to take some time upfront investigating suspicious outliers in your dataset before using it in further analyses; otherwise they can skew results down the road. They can also make statistical modeling more difficult, since extreme values exert outsized influence on a fitted model. Missing values should be accounted for too: they are not always obvious at first glance in a table or plot, yet they must be handled before reliable statistics can be obtained. (A minimal outlier check is sketched after this list.)

    • Exploratory data analysis: After cleaning up your dataset, do some basic exploring by visualizing different summaries such as boxplots, histograms, and scatter plots. This gives an initial sense of what trends might exist across regions and variables, which can then inform future predictive models. It will also highlight any clear discontinuities over time, helping ensure that predictors contribute meaningful signal rather than noise.

    • Analyze key metrics & observations: Once exploratory analysis has been carried out, post-processing steps come next, such as computing correlations among explanatory variables, performing significance tests on regression models, and imputing missing or outlier values, depending on the specific needs of the project. Additionally, interpretation efforts based...
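
    For illustration, here is a minimal sketch of the outlier check described above, written in the same PostgreSQL style as the code elsewhere on this page; the table and column names (energy_consumption, consumption) are hypothetical stand-ins for whatever table the CSV is loaded into:

    --Flag rows outside 1.5 * IQR on the consumption column (hypothetical names)

    WITH bounds AS
    (
    SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY consumption) AS q1,
        percentile_cont(0.75) WITHIN GROUP (ORDER BY consumption) AS q3
    FROM energy_consumption
    )

    SELECT e.*
    FROM energy_consumption AS e
    CROSS JOIN bounds AS b
    WHERE e.consumption < b.q1 - 1.5 * (b.q3 - b.q1)
    OR e.consumption > b.q3 + 1.5 * (b.q3 - b.q1);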

    Research Ideas

    • Creating an energy efficiency rating system for buildings - Using the dataset, an organization can develop a metric to rate the energy efficiency of commercial and residential buildings in a standardized way.
    • Developing targeted campaigns to raise awareness about energy conservation - Analyzing data from this dataset can help organizations identify areas of high energy consumption and create targeted campaigns and incentives to encourage people to conserve energy in those areas.
    • Estimating costs associated with upgrading building technologies - By evaluating various trends in building technologies and their associated costs, decision-makers can determine the most cost-effective option when it comes time to upgrade their structures' energy efficiency...
  16. Global Clean Environment Manufacturing And Assembly Service Market Economic...

    • statsndata.org
    excel, pdf
    Updated Oct 2025
    Stats N Data (2025). Global Clean Environment Manufacturing And Assembly Service Market Economic and Social Impact 2025-2032 [Dataset]. https://www.statsndata.org/report/clean-environment-manufacturing-and-assembly-service-market-269505
    Explore at:
    pdf, excelAvailable download formats
    Dataset updated
    Oct 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Clean Environment Manufacturing and Assembly Service market is a critical segment of the industrial landscape, primarily focused on providing controlled environments for manufacturing processes in industries such as pharmaceuticals, biotechnology, electronics, and food processing. With the increasing need for cl

  17. Data sources.

    • plos.figshare.com
    xlsx
    Updated Apr 17, 2025
    + more versions
    Yuxuan You; Xia Xu; Guanqiu Yin (2025). Data sources. [Dataset]. http://doi.org/10.1371/journal.pone.0321936.s001
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Yuxuan You; Xia Xu; Guanqiu Yin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Implementing energy transition in rural areas is crucial for China to achieve its low-carbon transition in energy consumption and dual-carbon goals. This study aimed to elucidate policy effects and further analyze the mediating effect of value perception to provide a reference for building a long-term rural energy transition mechanism. We constructed a “policy incentives–value perception–behavior” theoretical analysis framework and used survey data collected from residents of northern China. A logit model was employed to empirically test the effects of advocacy, demonstration, and subsidy policies on residents’ clean heating behavior. We used a mediation effect model to examine the mediating effects of economic, functional, social, and emotional value perceptions. The results showed that all three policies significantly positively impacted residents’ clean heating choices, with subsidy policies exerting the best effect. These findings suggest that implementing policy incentives can influence residents’ behavior by enhancing their value perceptions. However, different types of policies may act through distinct pathways. Compared with previous studies that focused solely on the impact of policy or value perception on clean heating behavior, this study explored their interactive relationship and found that external policy incentives can be transformed into internal driving forces. Therefore, value perception should be considered during policy formulation to build a long-term mechanism for promoting energy transition in rural areas.

  18. S1 Data -

    • figshare.com
    • plos.figshare.com
    xlsx
    Updated Apr 5, 2024
    Gong Caijuan; Muhammad Javeed Akhtar; Hafeez ur Rehman; Khatib Ahmad Khan (2024). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0297529.s001
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Apr 5, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Gong Caijuan; Muhammad Javeed Akhtar; Hafeez ur Rehman; Khatib Ahmad Khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Currently, the world faces an existential threat of climate change, and every government across the globe is trying to devise strategies to tackle its severity in every way possible. To this end, the use of clean energy rather than fossil fuel energy sources is critical, as it can reduce greenhouse gas emissions and pave the way for carbon neutrality. This study examines the impact of the energy cleanability gap on four different climate vulnerabilities (ecosystem, food, health, and housing) across 47 European and non-European high-income countries. The study considers samples from 2002 to 2019 and frames the empirical analysis around a quadratic relationship between the energy cleanability gap and climate vulnerability. It uses the system generalized method of moments as the main technique, with panel quantile regression as a robustness analysis; fixed-effect and random-effect models have also been incorporated. The study finds that the energy cleanability gap and all four climate vulnerabilities demonstrate a U-shaped relationship in both European and non-European countries, implying that as the energy cleanability gap increases, climate vulnerability decreases, but after reaching a certain threshold it starts to increase. Development expenditure is found to negatively affect food and health vulnerabilities in European nations, while it increases food vulnerability and decreases health vulnerability in non-European nations. Regarding industrialization's impact on climate vulnerabilities, the study finds opposite effects for the European and non-European economies. On the other hand, for both groups, trade openness decreases climate vulnerabilities. Based on these results, the study recommends speeding up the energy transition from fossil fuel resources towards clean energy resources to attain carbon neutrality in both European and non-European groups.

  19. Data from: Spatiotemporal-social association predicts immunological...

    • search.dataone.org
    • datadryad.org
    Updated Jul 12, 2025
    Alexander Downie; Oyebola Oyesola; Ramya Smithaveni Barre; Quentin Caudron; Ying-Han Chen; Emily Dennis; Romain Garnier; Kasalina Kiwanuka; Arthur Menezes; Daniel Navarrete; Octavio Mondragón-Palomino; Jesse Saunders; Christopher Tokita; Kimberly Zaldana; Ken Cadwell; P'ng Loke; Andrea Graham (2025). Spatiotemporal-social association predicts immunological similarity in rewilded mice [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zjc
    Explore at:
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Alexander Downie; Oyebola Oyesola; Ramya Smithaveni Barre; Quentin Caudron; Ying-Han Chen; Emily Dennis; Romain Garnier; Kasalina Kiwanuka; Arthur Menezes; Daniel Navarrete; Octavio Mondragón-Palomino; Jesse Saunders; Christopher Tokita; Kimberly Zaldana; Ken Cadwell; P'ng Loke; Andrea Graham
    Time period covered
    Jan 1, 2023
    Description

    Environmental influences on immune phenotypes are well-documented, but our understanding of which elements of the environment affect immune systems, and how, remains vague. Behaviors, including socializing with others, are central to an individual's interaction with its environment. We therefore tracked behavior of rewilded laboratory mice of three inbred strains in outdoor enclosures and examined contributions of behavior, including associations measured from spatiotemporal co-occurrences, to immune phenotypes. We found extensive variation in individual and social behavior among and within mouse strains upon rewilding. And we found that the more associated two individuals were, the more similar their immune phenotypes were. Spatiotemporal association was particularly predictive of similar memory T and B cell profiles and was more influential than sibling relationships or shared infection status. These results highlight the importance of shared spatiotemporal activity patterns and/or so...

    This dataset includes the data and analysis code for Downie et al. (2023 preprint, 202? publication). It is a mixture of immune cell phenotypes, serum cytokines, MLN cytokine production, microbiome (sequenced via 16S), and behavioral data from RFID check-ins. Please see the preprint (or eventual manuscript) for details about the methodology and degree of processing. The behavioral data is largely unprocessed, while the flow cytometry data and ELISA data are post-processing.

    Data from: Spatiotemporal-social association predicts immunological similarity in rewilded mice

    https://doi.org/10.5061/dryad.rjdfn2zjc

    This dataset contains immune data, behavior data, and related R code for processing, analyzing, and plotting the data and model results. The immune data include flow cytometry data from lymphocytes, serum concentrations of cytokines, concentrations of cytokines produced by MLN cells following antigenic challenge, and complete blood count profiles. The data also include gut microbiome data, as measured via 16S sequencing.

    This dataset goes with a preprint, Downie et al. (2023) (linked here), and the associated manuscript, provisionally accepted by Science Advances.

    Description of the data and file structure

    The data are presented in comma-separated-value (CSV) files, except the microbiome data; these are in a tab-delimited file. The full ...

  20. Social Insurance Programs in Richest Quintile

    • kaggle.com
    Updated Jan 7, 2023
    The Devastator (2023). Social Insurance Programs in Richest Quintile [Dataset]. https://www.kaggle.com/datasets/thedevastator/coverage-of-social-insurance-programs-in-richest
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 7, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Coverage of Social Insurance Programs in Richest Quintile

    Percent of Population Eligible

    By data.world's Admin [source]

    About this dataset

    This dataset offers a unique insight into the coverage of social insurance programs for the wealthiest quintile of populations around the world. It reveals how many individuals in each country receive support from old-age contributory pensions, disability benefits, and social security and health insurance benefits such as occupational injury benefits, paid sick leave, maternity leave, and more. This data provides an invaluable resource for understanding the health and well-being of the most financially privileged in society, a group that often has greater impact on decision making than others. With figures current as of 2019-05-11, this dataset is invaluable for uncovering where there is work to be done to improve healthcare provision in countries across the world.


    How to use the dataset

    • Understand the context: Before you begin analyzing this dataset, it is important to understand the information that it provides. Take some time to read the description of what is included in the dataset, including a clear understanding of the definitions and scope of coverage provided with each data point.

    • Examine the data: Once you have a general understanding of this dataset's contents, take some time to explore its contents in more depth. What specific questions does this dataset help answer? What kind of insights does it provide? Are there any missing pieces?

    • Clean & Prepare Data: After you've preliminarily examined its content, start preparing your data for further analysis and visualization. Clean up any formatting issues or irregularities present in your data set by correcting typos and eliminating unnecessary rows or columns before working with your chosen programming language (I prefer R for data manipulation tasks). Additionally, consider performing necessary transformations such as sorting or averaging values if appropriate for the findings you wish to draw from your analysis.

    • Visualize Results: Once you've cleaned and prepared your data, use visualizations such as charts, graphs, or tables to reveal patterns that support specific conclusions about how insurance coverage under social programs varies among different groups within society's quintiles, based on age groups and similar breakdowns. This type of visualization allows those who aren't familiar with programming to process complex information more quickly and accurately than when it is displayed only in numeric, tabular form!

    • Final Analysis & Export Results: Finally, export your visuals into presentation-ready formats (e.g., PDFs) that can be shared with colleagues! Additionally, use these results as part of a narrative conclusion report providing an accurate assessment and meaningful interpretation of how social insurance programs vary between different members within society's quintiles (i.e., richest vs poorest), along with potential policy implications relevant to implementing effective strategies that improve access accordingly! (A minimal query sketch follows below.)
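
    As a concrete starting point for the preparation and visualization steps above, here is a minimal sketch in the same PostgreSQL style used elsewhere on this page; the table and column names (social_insurance, country_name, coverage_pct) are hypothetical:

    --Average coverage per country, highest first (hypothetical names)

    SELECT country_name, AVG(coverage_pct) AS avg_coverage
    FROM social_insurance
    GROUP BY country_name
    ORDER BY avg_coverage DESC;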

    Research Ideas

    • Analyzing the effectiveness of social insurance programs by comparing the coverage levels across different geographic areas or socio-economic groups;
    • Estimating the economic impact of social insurance programs on local and national economies by tracking spending levels and revenues generated;
    • Identifying potential problems with access to social insurance benefits, such as racial or gender disparities in benefit coverage

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data source: data.world.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: coverage-of-social-insurance-programs-in-richest-quintile-of-population-1.csv

    Acknowledgements

    If you use this dataset in your research, please credit the original authors, data.world's Admin.

Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization

Netflix Data: Cleaning, Analysis and Visualization

Cleaning and Visualization with Pgsql and Tableau

Explore at:
zip(276607 bytes)Available download formats
Dataset updated
Aug 26, 2022
Authors
Abdulrasaq Ariyo
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consist of titles added to Netflix from 2008 to 2021; the oldest title was released in 1925 and the newest in 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.

Data Cleaning

We are going to:

1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns

Extra steps and more explanation of the process are given in the code comments.

--View dataset

SELECT * 
FROM netflix;

--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                  
SELECT show_id, COUNT(*)                                                                                      
FROM netflix 
GROUP BY show_id                                                                                              
ORDER BY show_id DESC;

--No duplicates
--Check null values across columns

SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
    COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
    COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
    COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
    COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
    COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
    COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
    COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
    COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
    COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
    COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
    COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLs:
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3

The NULLs in the director column are about 30% of the column, so I will not delete them; instead, I will populate them from another column. To do that, we first find out whether there is a relationship between the movie_cast and director columns.

-- Below, we find out if some directors are likely to work with particular cast

WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
FROM netflix
)

SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
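
For example, a quick helper query (not part of the original walkthrough) can show which directors co-occur with a particular cast string before writing an update:

--Directors that appear alongside a given cast (illustrative)

SELECT director, COUNT(*) AS count
FROM netflix
WHERE movie_cast = 'David Attenborough'
AND director IS NOT NULL
GROUP BY director
ORDER BY count DESC;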

With this, we can now populate the NULL director rows using their associated movie_cast records:

UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL;

--Repeat this step to populate the rest of the director nulls
--Populate the remaining NULL values in director as 'Not Given'

UPDATE netflix 
SET director = 'Not Given'
WHERE director IS NULL;

--While doing this, I found a simpler and faster way to populate a column, which I will use next
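
As an aside, the same self-join pattern shown next for country could also have filled the director gaps in one statement; a sketch, assuming that titles with an identical movie_cast string share a director:

--One-statement alternative to the manual director updates (illustrative)

UPDATE netflix
SET director = nt2.director
FROM netflix AS nt2
WHERE netflix.movie_cast = nt2.movie_cast
AND netflix.show_id <> nt2.show_id
AND netflix.director IS NULL
AND nt2.director IS NOT NULL;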

Just like the director column, I will not delete the NULLs in country. Since the country column is related to the director and movie columns, we are going to populate it using the director column.

--Populate the country using the director column

SELECT COALESCE(nt.country,nt2.country) 
FROM netflix AS nt
JOIN netflix AS nt2 
ON nt.director = nt2.director 
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;

UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director
AND netflix.show_id <> nt2.show_id
AND netflix.country IS NULL
--added check so we only copy from rows that actually have a country
AND nt2.country IS NOT NULL;


--Confirm whether any rows still have a NULL country after the update

SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;

--Populate the rest of the NULL values in country as 'Not Given'

UPDATE netflix 
SET country = 'Not Given'
WHERE country IS NULL;
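
At this point, a compact version of the earlier null check confirms that director and country are fully populated:

SELECT COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
    COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls
FROM netflix;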

Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.

--Show date_added nulls

SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;

--DELETE nulls

DELETE FROM netflix
WHERE date_added IS NULL;