Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). This dataset, covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, making it a hopefully robust and reliable resource for understanding climatic dynamics in Northern Cameroon.
Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.
Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset.
Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.
Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. As such, to ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.
Authors Contributions:
Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne.
Documentation: Jérémy Lavarenne.
Funding: This project was funded by the DESIRA INNOVACC project.
Changelog:
v1.0.2: corrected swapped column names in the coordinates dataset
v1.0.1: dataset specification file updated with complementary information regarding station locations
v1.0.0: initial submission
This dataset contains 40 laptops scraped from the Flipkart website using Python code. I want you to clean this dataset and analyze it the way a data analyst would. I know this dataset is quite short; I plan to make it messier and bigger later on as the need increases. If you have analyzed the data, please make sure to upload your Jupyter notebook. All the best!
ABOUT THE UPDATED DATA: I have updated the data by adding more content to make it more interesting and messier. You can clean the dataset and do further work on it. Note that the prices are in Indian Rupees (INR).
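As an illustration of the kind of cleaning this dataset invites (a minimal sketch; the file name and the Price column format are assumptions, not documented fields):

```python
import pandas as pd

# Load the scraped laptop listings (file name assumed)
df = pd.read_csv("flipkart_laptops.csv")

# Drop exact duplicate listings
df = df.drop_duplicates()

# Example cleanup of an assumed price column such as "₹45,990":
# strip currency symbols and thousands separators, then convert to numeric
df["Price"] = pd.to_numeric(
    df["Price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

# Quick sanity check of missing values per column
print(df.isna().sum())
```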
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.
Included Files:
sp500_cleaned.csv – Cleaned dataset used for analysis
sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)
dashboard_screenshot.png – Screenshot of Power BI dashboard
README.md – Summary of the project and key takeaways
This project demonstrates practical data cleaning, querying, and visualization skills.
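For instance, a minimal sketch of the kind of query the notebook covers (the sp500_cleaned.csv file name comes from the file list above; the Sector column name is an assumption):

```python
import pandas as pd

# Load the cleaned S&P 500 companies dataset
sp500 = pd.read_csv("sp500_cleaned.csv")

# Count companies per sector (column name assumed)
sector_counts = sp500["Sector"].value_counts()
print(sector_counts.head(10))
```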
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.
The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.
The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].
The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.
If you use this dataset in your work, please cite it as follows:
Feel free to reach out for any additional information or clarification. Happy analyzing!
Completed Jupyter Notebook project: conducted data cleaning and exploratory data analysis with pandas, matplotlib, and seaborn. Enhanced customer experience by identifying potential customers based on demographics and improved sales by optimizing inventory planning through product analysis.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The analysis is based on the Hope_Park_original.csv file.

## Contents
- sample park analysis.ipynb — The main analysis notebook (Colab/Jupyter format)
- Hope_Park_original.csv — Source dataset containing park information
- README.md — Documentation for the contents and usage

## Usage
1. Open the notebook in Google Colab or Jupyter.
2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).
3. Run each cell sequentially to reproduce the analysis.

## Requirements
The notebook uses standard Python data science libraries:
```python
pandas
numpy
matplotlib
seaborn
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This S&M-HSTPM2d5 dataset contains high spatial and temporal resolution particulate matter (PM2.5) measurements, with corresponding timestamps and GPS locations, from mobile and static devices in three Chinese cities: Foshan, Cangzhou, and Tianjin. Different numbers of static and mobile devices were set up in each city. The sampling rate was one minute in Cangzhou and three seconds in Foshan and Tianjin. For specific details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.
After data collection, a data cleaning process was performed to remove and adjust abnormal and drifting data. The script implementing the data cleaning algorithm is provided in this repository. The algorithm only adjusts or removes individual data points; removal of an entire device's data was done after the data cleaning algorithm, based on empirical judgment and graphic visualization. For specific details of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.
The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.
The data are stored as CSV files. Each CSV file, named by its device ID, contains the data collected by the corresponding device. Each CSV file has three types of data: the timestamp in China Standard Time (GMT+8), the geographic location as latitude and longitude, and the PM2.5 concentration in micrograms per cubic meter. The CSV files are stored in either a Static or a Mobile folder, reflecting the device type, and these folders are stored within the corresponding city's folder.
To access the dataset, any programming language that can read CSV files is appropriate; users can also open the CSV files directly. The get_dataset.ipynb file in this repository also provides an option for accessing the dataset. To execute the .ipynb files, Jupyter Notebook with Python 3 is required, along with the following Python libraries:
get_dataset.ipynb: the os and pandas libraries
Data_cleaning_algorithm.ipynb: the os, pandas, datetime, and math libraries
Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the .ipynb files with Jupyter Notebook and follow the instructions inside.
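As an illustration only (not the repository's get_dataset.ipynb; the folder layout and file name below are assumptions based on the structure described above), a minimal sketch of loading one device's CSV with pandas:

```python
import os
import pandas as pd

# Assumed layout: <city>/<Static|Mobile>/<device_id>.csv
city, device_type, device_id = "Cangzhou", "Static", "device_001"
path = os.path.join(city, device_type, f"{device_id}.csv")

# Each file holds timestamps (China Standard Time, GMT+8),
# latitude/longitude, and PM2.5 concentration (µg/m³)
df = pd.read_csv(path)
print(df.head())
```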
For questions or suggestions, please e-mail Xinlei Chen.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
This dataset accompanies the empirical analysis in Legality Without Justice, a study examining the relationship between public trust in institutions and perceived governance legitimacy using data from the World Values Survey Wave 7 (2017–2022). It includes:
WVS_Cross-National_Wave_7_csv_v6_0.csv — World Values Survey Wave 7 core data.
GDP.csv — World Bank GDP per capita (current US$) for 2022 by country.
denial.ipynb — Fully documented Jupyter notebook with code for data merging, exploratory statistics, and ordinal logistic regression using OrderedModel. Includes GDP as a control for institutional trust and perceived governance.
All data processing and analysis were conducted in Python using FAIR reproducibility principles and can be replicated or extended on Google Colab.
DOI: 10.5281/zenodo.16361108
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Authors: Anon Annotator
Publication date: 2025-07-23
Language: English
Version: 1.0.0
Publisher: Zenodo
Programming language: Python
Go to https://colab.research.google.com
Click File > Upload notebook, and upload the denial.ipynb file.
Also upload the CSVs (WVS_Cross-National_Wave_7_csv_v6_0.csv and GDP.csv) using the file browser on the left sidebar.
In denial.ipynb, ensure file paths match:
wvs = pd.read_csv('/content/WVS_Cross-National_Wave_7_csv_v6_0.csv')
gdp = pd.read_csv('/content/GDP.csv')
Execute the notebook cells from top to bottom. You may need to install required libraries:
!pip install statsmodels pandas numpy
The notebook performs:
Data cleaning
Merging WVS and GDP datasets
Summary statistics
Ordered logistic regression to test if confidence in courts/police (Q57, Q58) predicts belief that the country is governed in the interest of the people (Q183), controlling for GDP.
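For orientation, a minimal sketch of this kind of ordered logistic regression with statsmodels' OrderedModel (a hedged illustration, not the code in denial.ipynb; the merged intermediate file and the column names Q57, Q58, Q183, and gdp_per_capita follow the description above but are assumptions about the notebook's internals):

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Assumed: `merged` is the WVS + GDP dataframe prepared earlier in the notebook
merged = pd.read_csv("merged_wvs_gdp.csv")  # hypothetical intermediate file

# Outcome: Q183 (country governed in the interest of the people), treated as ordinal.
# Predictors: confidence in courts/police (Q57, Q58), controlling for GDP per capita.
endog = merged["Q183"]
exog = merged[["Q57", "Q58", "gdp_per_capita"]]

model = OrderedModel(endog, exog, distr="logit")
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```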
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This Zenodo repository contains data and code associated with the publication:
Chen R, Duffy Á, Petrazzini BO, Vy HM, Stein D, Mort M, Park JK, Schlessinger A, Itan Y, Cooper DN, Jordan DM, Rocheleau G, Do R. Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score. Nat Commun. 2024 Oct 15;15(1):8891. doi: 10.1038/s41467-024-53333-y.
Files needed to train ML-GPS and ML-GPS DOE:
| Model | Open Targets AUPRC | SIDER AUPRC |
| --- | --- | --- |
| ML-GPS (non-DOE) | 0.074 | 0.080 |
| ML-GPS DOE (activator predictions) | 0.029 | 0.042 |
| ML-GPS DOE (inhibitor predictions) | 0.067 | 0.064 |
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: The dataset represents a significant effort to compile and clean a comprehensive set of seasonal yield data for sub-Saharan West Africa (Benin, Burkina Faso, Mali, Niger). This dataset, covering more than 22,000 survey answers scattered across more than 2,500 unique locations of smallholder producers' household groups, is instrumental for researchers and policymakers working in agricultural planning and food security in the region. It integrates data from two sources: the LSMS-ISA program (link to the World Bank's site) and the RHoMIS dataset (link to RHoMIS files, RHoMIS' DOI).
The construction of the dataset involved meticulous processes, including converting production figures into standardized units, calculating yields for each dataset, standardizing column names, assembling the data, and extensive data cleaning, making it a hopefully robust and reliable resource for understanding spatial yield distribution in the region.
Data Sources: The dataset comprises seven spatialized yield data sources, six of which are from the LSMS-ISA program (Mali 2014, Mali 2017, Mali 2018, Benin 2018, Burkina Faso 2018, Niger 2018) and one from the RHoMIS study (only Mali 2017 and Burkina Faso 2018 data selected).
Dataset Preparation Methods: The preparation involved integration of machine-readable files, data cleaning, and finalization using Python/Jupyter Notebook. This process should ensure the accuracy and consistency of the dataset. Yields were calculated from declared production quantities and GPS-measured plot areas; each yield value corresponds to a single plot.
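A minimal sketch of the yield calculation described above (illustrative only; the column names production_kg and plot_area_ha are assumptions, and the actual notebook may use different units and conversions):

```python
import pandas as pd

# Assumed columns: declared production (kg) and GPS-measured plot area (ha)
plots = pd.DataFrame({
    "production_kg": [850.0, 1200.0, 300.0],
    "plot_area_ha": [0.9, 1.5, 0.25],
})

# One yield value per plot, in kg/ha
plots["yield_kg_per_ha"] = plots["production_kg"] / plots["plot_area_ha"]
print(plots)
```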
Discussion: This dataset, with its extensive data compilation, presents an invaluable resource for agricultural productivity studies in West Africa. However, users must navigate its complexities, including potential biases arising from the survey design and from UML units, as well as data inconsistencies. The dataset's comprehensive nature requires careful handling and validation in research applications.
Authors Contributions:
Data treatment: Eliott Baboz, Jérémy Lavarenne.
Documentation: Jérémy Lavarenne.
Funding: This project was funded by the INTEN-SAHEL TOSCA project (Centre national d’études spatiales). The award number "123456789" was chosen arbitrarily because there is no actual award number; Zenodo requires one to be entered here.
Changelog:
v1.0.0 : initial submission
Introduction:
About the Company:
Cyclistic is a bike-sharing company in Chicago that has since expanded to a fleet of 5,824 geotracked bicycles stationed at 692 locations across the city. The bikes can be unlocked at one station and returned to any other station in the network at any time. Individuals buying single-ride or full-day passes fall into the category of casual riders, while those acquiring annual memberships are recognized as Cyclistic members.
Tools and Technologies:
⦁ Tableau/Power BI for dashboard development
⦁ Python for data analysis
Phase 1: About the Dataset
The data is publicly available on an AWS server. We were tasked with working with an entire year of data, so I downloaded zipped files (CSV format) containing data from January 2023 to December 2023, one file for each month.
Data Structure: Each .csv file contains a table with 13 columns of varying data types, as described below. Each column is a field that describes how people use Cyclistic's bike-sharing service, and each row represents an observation with the details of a single ride.
⦁ ride_id: A unique identifier assigned to each bike ride, like a reference number for the trip.
⦁ rideable_type: The type of bike used in the ride, either "electric_bike" or "classic_bike".
⦁ started_at: The date and time when the ride began, in YYYY-MM-DD HH:MM:SS format.
⦁ ended_at: The date and time when the ride ended, in the same format as started_at.
⦁ start_station_name: The name of the docking station where the ride started.
⦁ start_station_id: A unique identifier for the starting docking station, complementing start_station_name.
⦁ start_lat: The latitude coordinate of the starting docking station.
⦁ start_lng: The longitude coordinate of the starting docking station. These coordinates can be useful for mapping the station's location.
⦁ end_station_name: The name of the docking station where the ride ended.
⦁ end_station_id: A unique identifier for the ending docking station, complementing end_station_name.
⦁ end_lat: The latitude coordinate of the ending docking station.
⦁ end_lng: The longitude coordinate of the ending docking station. These coordinates can be useful for mapping the station's location.
⦁ member_casual: Whether the rider was a member (member) or a casual user (casual) of the bike-sharing service.
Phase 2: Process
I used Python for data cleaning. You can view the Jupyter Notebook for the Process phase here. These are the steps I performed during this phase (see the sketch after this list):
⦁ Check for nulls and duplicates
⦁ Add columns and transform data (change data types, remove trailing or leading spaces, etc.)
⦁ Extract data for analysis
Data Cleaning Result
Total row count before data cleaning: 5,745,324
Total row count after data cleaning: 4,268,747
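A minimal sketch of the kind of cleaning steps listed above (illustrative, not the author's notebook; the monthly file name pattern is assumed):

```python
import glob
import pandas as pd

# Combine the twelve monthly CSV files (file name pattern assumed)
files = sorted(glob.glob("2023*-tripdata.csv"))
trips = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Check for nulls and duplicates
print(trips.isna().sum())
trips = trips.drop_duplicates(subset="ride_id")

# Transform data types and strip stray whitespace
trips["started_at"] = pd.to_datetime(trips["started_at"])
trips["ended_at"] = pd.to_datetime(trips["ended_at"])
trips["start_station_name"] = trips["start_station_name"].str.strip()

# Derive a ride-length column and drop impossible rides
trips["ride_length_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips = trips[trips["ride_length_min"] > 0]
```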
Phase 3: Analyze
I used Python in my Jupyter notebook to explore the large dataset we cleaned earlier. I came up with questions to figure out how casual riders differ from annual members, then wrote queries to answer them, helping us understand the data and make data-driven decisions (a sketch of one such query follows this list).
Questions: Here are the questions we will answer in this phase:
⦁ What is the percentage of each user type among total users?
⦁ Is there a bike type preferred by different user types?
⦁ Which bike type has the longest trip duration per user type?
⦁ What is the average trip duration per user type?
⦁ What is the average distance traveled per user type?
⦁ On which days are most users active?
⦁ In which months or seasons do users tend to use the bike-sharing service?
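For example, a hedged sketch of one such query, assuming the cleaned trips dataframe and the derived ride_length_min column from the cleaning sketch above:

```python
# Average trip duration (minutes) per user type
avg_duration = trips.groupby("member_casual")["ride_length_min"].mean().round(1)
print(avg_duration)

# Share of each user type among all rides
user_share = trips["member_casual"].value_counts(normalize=True).mul(100).round(1)
print(user_share)
```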
I used Tableau Public for the visualization. You can view the data visualization for the Share phase here: https://public.tableau.com/app/profile/katabathina.jyoshnavi/viz/divvytripvisualisation/Dashboard7.
Findings
⦁ 63% of total Cyclistic users are annual members, while 36% are casual riders.
⦁ Both annual members and casual riders prefer classic bikes; only casual riders use docked bikes.
⦁ Overall, casual riders have a longer average ride duration (23 minutes) than annual members (18 minutes).
⦁ Annual members and casual riders have almost the same average distance traveled.
⦁ Docked bikes, used only by casual riders, have the longest average ride duration; among annual members, classic bikes have the longest average ride duration.
⦁ Most trips are recorded on Saturdays.
⦁ Trips are most frequent during spring and least frequent during winter.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Access to continuous, quality assessed meteorological data is critical for understanding the climatology and atmospheric dynamics of a region. Research facilities like Oak Ridge National Laboratory (ORNL) rely on such data to assess site-specific climatology, model potential emissions, establish safety baselines, and prepare for emergency scenarios. To meet these needs, on-site towers at ORNL collect meteorological data at 15-minute and hourly intervals. However, data measurements from meteorological towers are affected by sensor sensitivity, degradation, lightning strikes, power fluctuations, glitching, and sensor failures, all of which can affect data quality. To address these challenges, we conducted a comprehensive quality assessment and processing of five years of meteorological data collected from ORNL at 15-minute intervals, including measurements of temperature, pressure, humidity, wind, and solar radiation. The time series of each variable was pre-processed and gap-filled using established meteorological data collection and cleaning techniques, i.e., the time series were subjected to structural standardization, data integrity testing, automated and manual outlier detection, and gap-filling. The data product and highly generalizable processing workflow developed in Python Jupyter notebooks are publicly accessible online. As a key contribution of this study, the evaluated 5-year data will be used to train atmospheric dispersion models that simulate dispersion dynamics across the complex ridge-and-valley topography of the Oak Ridge Reservation in East Tennessee.
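As a rough illustration of the gap-filling step described above (a minimal sketch with synthetic data, not the published workflow; the 15-minute temperature series, plausibility limits, and interpolation choice are assumptions):

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute temperature series with a short gap
idx = pd.date_range("2021-06-01", periods=96, freq="15min")
temp = pd.Series(20 + 5 * np.sin(np.linspace(0, 2 * np.pi, 96)), index=idx)
temp.iloc[40:44] = np.nan  # simulate a sensor dropout

# Flag values outside a plausible physical range, then gap-fill
temp[(temp < -40) | (temp > 60)] = np.nan
filled = temp.interpolate(method="time", limit=8)  # fill gaps up to 2 hours

print(f"Remaining missing values: {filled.isna().sum()}")
```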
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Demographic Analysis of Shopping Behavior: Insights and Recommendations
Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.
Cleaned Data Details: The data were cleaned and standardized into 15,079 unique entries with attributes including Customer ID, age, gender, annual income, and spending score. They can be used by marketing analysts to produce a better strategy for mall-specific marketing.
Challenges Faced: 1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention. 2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort. 3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.
Research Topics: 1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions. 2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.
Suggestions for Project Expansion: 1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights. 2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis. 3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking. This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.
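As an illustrative sketch of how this cleaned data could feed a simple segmentation (the file name and the column names Age, Annual Income, and Spending Score are assumptions about the cleaned file, and k-means with three clusters is just one possible choice):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the cleaned customer data (file and column names assumed)
customers = pd.read_csv("shopping_mall_customers_clean.csv")
features = customers[["Age", "Annual Income", "Spending Score"]]

# Standardize features, then cluster shoppers into three segments
X = StandardScaler().fit_transform(features)
customers["segment"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Profile each segment for marketing planning
print(customers.groupby("segment")[["Age", "Annual Income", "Spending Score"]].mean())
```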
References:
- OpenAI. (2022). ChatGPT [Computer software]. https://openai.com/chatgpt
- Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
- Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
- pandas-datareader. (n.d.). https://pypi.org/project/pandas-datareader/
This is the social media data of an organization. You have been hired by the organization and given their social media data to analyze, visualize, and prepare a report on.
You are required to prepare a neat notebook using Jupyter Notebook/Jupyter Lab or Google Colab. Then zip everything, including the notebook file (.ipynb) and the dataset, and upload it through the Google Forms link stated below. The notebook should be neat, containing code with explanatory details, visualizations, and a description of your purpose for each task.
You are encouraged, but not limited, to go through general steps such as data cleaning, data preparation, exploratory data analysis (EDA), finding correlations, feature extraction, and more. (There is no limit to your skills and ideas.)
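A minimal sketch of that kind of starting point (purely illustrative; the file name and the posted_at and engagement columns are assumptions about the provided data):

```python
import pandas as pd

# Load the provided social media dataset (file name assumed)
posts = pd.read_csv("social_media_data.csv")

# Basic cleaning and inspection
posts = posts.drop_duplicates()
print(posts.info())
print(posts.describe(include="all"))

# Example EDA question: does engagement differ by day of the week?
# (assumes a timestamp column and a numeric engagement column)
posts["posted_at"] = pd.to_datetime(posts["posted_at"], errors="coerce")
posts["weekday"] = posts["posted_at"].dt.day_name()
print(posts.groupby("weekday")["engagement"].mean())
```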
After doing what needs to be done, you are to give your organization insights and facts. For example, are they reaching more audiences on weekends? Does posting content on weekdays turn out to be more effective? Does posting many pieces of content on the same day make more sense? Or should they post content regularly and keep day-to-day consistency? Did you find any trend patterns in the data? What is your advice after completing the analysis? Mention these clearly at the end of the notebook. (These are just a few examples; your findings may be entirely different, and that is totally acceptable.)
Note that we will value clear documentation stating clear insights from the analysis of the data and visualizations more than anything else. It will not matter how complex the methods you apply are if they ultimately do not find anything useful.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
A Python module using Jupyter Notebooks to take an existing dataset available on Kaggle and undertake some data cleansing, data hard-coding, and data science management so it can be more useful for machine learning models. Source of the original dataset: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data
Introduction
The problem we are trying to solve is that there are very limited datasets on Kaggle if you wish to apply ML models to the problem of individual stock share price prediction using financial statement ratios as your input data. This is a problem worth addressing, as there is a multi-billion-dollar global fundamental financial ratio investment analysis industry that is ripe for performance enhancement by machine learning. We believe the dataset above was the best dataset for this purpose on Kaggle. The problems with this dataset for ML model use, and how we addressed them (a sketch follows this list), were as follows:
• A number of data attributes were not populated across every annual period. We removed data attributes that were not populated across all the annual periods.
• We filled in missing data and replaced NaNs and infs with logical and reasonable fill values.
• We attached label data, namely 12-month-ahead share price returns for each stock and each annual period, provided both as discrete percentage returns and as binary outperform/underperform-the-Nasdaq-100-index labels.
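A minimal sketch of the kind of cleaning described above (illustrative only; the file name, the 90% population threshold, and the median fill strategy are assumptions, not the notebook's exact choices):

```python
import numpy as np
import pandas as pd

# Load the raw fundamentals table (file name assumed)
fundamentals = pd.read_csv("nasdaq100_fundamentals.csv")

# Replace infinities with NaN so they are treated as missing values
fundamentals = fundamentals.replace([np.inf, -np.inf], np.nan)

# Drop attributes that are mostly unpopulated across the annual periods
fundamentals = fundamentals.dropna(axis=1, thresh=int(0.9 * len(fundamentals)))

# Fill the remaining gaps with a reasonable value (median used here as an example)
numeric_cols = fundamentals.select_dtypes("number").columns
fundamentals[numeric_cols] = fundamentals[numeric_cols].fillna(
    fundamentals[numeric_cols].median()
)
```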
Resulting Datasets
The resulting datasets cover 102 stocks using 39 financial ratios across both 4- and 5-year periods, using two different types of labels.
In summary, this repository provides a Jupyter Notebook that shows the steps undertaken to generate:
Two datasets for 2017 to 2021 with the Y labels attached in the end column:
• labels 1 or 0: for binary outperformance against the index.
• perfs labels: for the actual performance of the stock for that calendar year.
And two more datasets for 2017 to 2020 with the same Y label data as above:
• labels 1 or 0: for binary outperformance against the index.
• perfs labels: for the actual performance of the stock for that calendar year.
Usage & Contributing
At the moment the project is in development. You can use the repository and play with the Jupyter Notebook to generate your own datasets with assumptions that differ from ours. We will then load up some ML models that we think can be the most effective at predicting 12-month-forward share price outcomes based on the 39 financial ratios provided. We would welcome your thoughts on our models. Even better, we would welcome YOUR ideas on the best models to use to solve such a prediction problem with these datasets. You can always help to get this problem solved. It's an open-source project, after all!
Resources
• Kaggle: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data
• Jupyter Notebooks: https://jupyter.org/
• yfinance: https://pypi.org/project/yfinance/
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
A data-driven end-to-end analysis of Electric Vehicle adoption, performance, and policy alignment across Washington State. This project covers everything from data cleaning and exploration to visualization and presentation — using SQL, Python, and Power BI.
This repository includes Python scripts and input/output data associated with the following publication:
[1] Brown, P.R.; O'Sullivan, F. "Spatial and temporal variation in the value of solar power across United States Electricity Markets". Renewable & Sustainable Energy Reviews 2019. https://doi.org/10.1016/j.rser.2019.109594
Please cite reference [1] for full documentation if the contents of this repository are used for subsequent work.
Many of the scripts, data, and descriptive text in this repository are shared with the following publication:
[2] Brown, P.R.; O'Sullivan, F. "Shaping photovoltaic array output to align with changing wholesale electricity price profiles". Applied Energy 2019, 256, 113734. https://doi.org/10.1016/j.apenergy.2019.113734
All code is in python 3 and relies on a number of dependencies that can be installed using pip or conda.
Contents
pvvm/*.py : Python module with functions for modeling PV generation and calculating PV energy revenue, capacity value, and emissions offset.
notebooks/*.ipynb : Jupyter notebooks, including:
pvvm-vos-data.ipynb: Example scripts used to download and clean input LMP data, determine LMP node locations, assign nodes to capacity zones, download NSRDB input data, and reproduce some figures in [1]
pvvm-example-generation.ipynb: Example scripts demonstrating the use of the PV generation model and a sensitivity analysis of PV generator assumptions
pvvm-example-plots.ipynb: Example scripts demonstrating different plotting functions
validate-pv-monthly-eia.ipynb: Scripts and plots for comparing modeled PV generation with monthly generation reported in EIA forms 860 and 923, as discussed in SI Note 3 of [1]
validate-pv-hourly-pvdaq.ipynb: Scripts and plots for comparing modeled PV generation with hourly generation reported in NREL PVDAQ database, as discussed in SI Note 3 of [1]
pvvm-energyvalue.ipynb: Scripts for calculating the wholesale energy market revenues of PV and reproducing some figures in [1]
pvvm-capacityvalue.ipynb: Scripts for calculating the capacity credit and capacity revenues of PV and reproducing some figures in [1]
pvvm-emissionsvalue.ipynb: Scripts for calculating the emissions offset of PV and reproducing some figures in [1]
pvvm-breakeven.ipynb: Scripts for calculating the breakeven upfront cost and carbon price for PV and reproducing some figures in [1]
html/*.html : Static images of the above Jupyter notebooks for viewing without a python kernel
data/lmp/*.gz : Day-ahead nodal locational marginal prices (LMPs) and marginal costs of energy (MCE), congestion (MCC), and losses (MCL) for CAISO, ERCOT, MISO, NYISO, and ISONE.
At the time of publication of this repository, permission had not been received from PJM to republish their LMP data. If permission is received in the future, a new version of this repository will be linked here with the complete dataset.
results/*.csv.gz : Simulation results associated with [1], including modeled energy revenue, capacity credit and revenue, emissions offsets, and breakeven costs for PV systems at all LMP nodes
Data notes
ISO LMP data are used with permission from the different ISOs. Adapting the MIT License (https://opensource.org/licenses/MIT), "The data are provided 'as is', without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or sources be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the data or other dealings with the data." Copyright and usage permissions for the LMP data are available on the ISO websites, linked below.
ISO-specific notes on LMP data:
CAISO data from http://oasis.caiso.com/mrioasis/logon.do are used pursuant to the terms at http://www.caiso.com/Pages/PrivacyPolicy.aspx#TermsOfUse.
ERCOT data are from http://www.ercot.com/mktinfo/prices.
MISO data are from https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/ and https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/market-report-archives/.
PJM data were originally downloaded from https://www.pjm.com/markets-and-operations/energy/day-ahead/lmpda.aspx and https://www.pjm.com/markets-and-operations/energy/real-time/lmp.aspx. At the time of this writing these data are currently hosted at https://dataminer2.pjm.com/feed/da_hrl_lmps and https://dataminer2.pjm.com/feed/rt_hrl_lmps.
NYISO data from http://mis.nyiso.com/public/ are used subject to the disclaimer at https://www.nyiso.com/legal-notice.
ISONE data are from https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-da-hourly and https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-rt-hourly-final. The Material is provided on an "as is" basis. ISO New England Inc., to the fullest extent permitted by law, disclaims all warranties, either express or implied, statutory or otherwise, including but not limited to the implied warranties of merchantability, non-infringement of third parties' rights, and fitness for particular purpose. Without limiting the foregoing, ISO New England Inc. makes no representations or warranties about the accuracy, reliability, completeness, date, or timeliness of the Material. ISO New England Inc. shall have no liability to you, your employer or any other third party based on your use of or reliance on the Material.
Data workup: LMP data were downloaded directly from the ISOs using scripts similar to the pvvm.data.download_lmps() function (see below for caveats), then repackaged into single-node single-year files using the pvvm.data.nodalize() function. These single-node single-year files were then combined into the dataframes included in this repository, using the procedure shown in the pvvm-vos-data.ipynb notebook for MISO. We provide these yearly dataframes, rather than the long-form data, to minimize file size and number. These dataframes can be unpacked into the single-node files used in the analysis using the pvvm.data.copylmps() function.
Usage notes
Code is provided under the MIT License, as specified in the pvvm/LICENSE file and at the top of each *.py file.
Updates to the code, if any, will be posted in the non-static repository at https://github.com/patrickbrown4/pvvm_vos. The code in the present repository has the following version-specific dependencies:
matplotlib: 3.0.3
numpy: 1.16.2
pandas: 0.24.2
pvlib: 0.6.1
scipy: 1.2.1
tqdm: 4.31.1
To use the NSRDB download functions, you will need to modify the "settings.py" file to insert a valid NSRDB API key, which can be requested from https://developer.nrel.gov/signup/. Locations can be specified by passing (latitude, longitude) floats to pvvm.data.downloadNSRDBfile(), or by passing a string googlemaps query to pvvm.io.queryNSRDBfile(). To use the googlemaps functionality, you will need to request a googlemaps API key (https://developers.google.com/maps/documentation/javascript/get-api-key) and insert it in the "settings.py" file.
Note that many of the ISO websites have changed in the time since the functions in the pvvm.data module were written and the LMP data used in the above papers were downloaded. As such, the pvvm.data.download_lmps() function no longer works for all ISOs and years. We provide this function to illustrate the general procedure used, and do not intend to maintain it or keep it up to date with the changing ISO websites. For up-to-date functions for accessing ISO data, the following repository (no connection to the present work) may be helpful: https://github.com/catalyst-cooperative/pudl.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
🎵 Unveiling Spotify Trends: A Deep Dive into Streaming Data:
Introduction:
This Jupyter Notebook explores data manipulation, aggregation, and visualization techniques using Python’s Pandas, Matplotlib, and Seaborn libraries. The key objectives of this analysis include:
📌 Data Cleaning and Preparation ✔ Handling missing values in key columns. ✔ Standardizing and transforming categorical features (e.g., mode, release_day_name). ✔ Creating new derived features, such as decade classification and energy levels.
📌 Feature Engineering & Data Transformation ✔ Extracting release trends from date-based columns. ✔ Categorizing song durations and popularity levels dynamically. ✔ Applying lambda functions, apply(), map(), and filter() for efficient data transformations. ✔ Using groupby() and aggregation functions to analyze trends in song streams. ✔ Ranking artists based on total streams using rank().
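A small hedged sketch of the groupby/rank step described above (illustrative only; it assumes the streams and artist(s)_name columns listed in the dataset description below, a hypothetical file name, and that streams is cast to a numeric type):

```python
import pandas as pd

# Load the Spotify streaming dataset (file name assumed)
spotify = pd.read_csv("spotify_2023.csv")
spotify["streams"] = pd.to_numeric(spotify["streams"], errors="coerce")

# Total streams per artist, then rank artists by that total
artist_streams = spotify.groupby("artist(s)_name")["streams"].sum()
artist_rank = artist_streams.rank(method="dense", ascending=False)

print(artist_streams.sort_values(ascending=False).head(10))
print(artist_rank.sort_values().head(10))
```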
📌 Data Aggregation and Trend Analysis ✔ Identifying the most common musical keys used in songs. ✔ Tracking song releases over time with rolling averages. ✔ Comparing Major vs. Minor key distributions in song compositions.
📌 Data Visualization ✔ Bar plots for ranking top artists and stream counts. ✔ Box plots to analyze stream distribution per release year. ✔ Heatmaps to examine feature correlations. ✔ Pie charts to understand song popularity distribution.
📌 Dataset Description The dataset consists of Spotify streaming statistics and includes features such as:
🎵 track_name – Song title. 🎤 artist(s)_name – Name(s) of performing artists. 🔢 streams – Number of times the song was streamed. 📅 released_year, released_month, released_day – Date of song release. 🎼 energy_%, danceability_%, valence_% – Audio feature metrics. 📊 in_spotify_playlists – Number of Spotify playlists featuring the song. 🎹 mode – Musical mode (Major or Minor). 🎯 Purpose This analysis is designed for: ✔ Exploring real-world datasets to develop data analyst skills. ✔ Practicing data transformation, aggregation, and visualization techniques. ✔ Preparing for data analyst interviews by working with structured workflows.
📌 Table of Contents 1️⃣ Data Cleaning & Preparation 2️⃣ Feature Engineering & Transformations (apply(), map(), filter(), groupby(), rank()) 3️⃣ Data Aggregation & Trend Analysis 4️⃣ Data Visualization & Insights 5️⃣ Conclusion and Key Takeaways
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically