Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1. Python code. The notebook highlights core components of the code applied in the study.
This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.
About the Dataset:
- CID (Customer ID): A unique identifier for each customer.
- TID (Transaction ID): A unique identifier for each transaction.
- Gender: The gender of the customer, categorized as Male or Female.
- Age Group: Age group of the customer, divided into several ranges.
- Purchase Date: The timestamp of when the transaction took place.
- Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
- Discount Availed: Indicates whether the customer availed any discount (Yes/No).
- Discount Name: Name of the discount applied (e.g., FESTIVE50).
- Discount Amount (INR): The amount of discount availed by the customer.
- Gross Amount: The total amount before applying any discount.
- Net Amount: The final amount after applying the discount.
- Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
- Location: The city where the purchase took place.
Use Cases:
1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
3. Data Visualization: Use tools like Python's Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.
This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.
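For readers curious how such a table can be produced, here is a minimal, hypothetical sketch with Faker; the column names follow the schema above, but the locale, category sets, value ranges, and ID scheme are assumptions, not the exact generator behind this dataset.

import random
from faker import Faker

fake = Faker("en_IN")  # assumption: Indian locale, since discount amounts are in INR

def make_transaction():
    gross = round(random.uniform(100, 50_000), 2)  # assumed price range
    availed = random.choice(["Yes", "No"])
    discount = round(random.uniform(0, 0.5) * gross, 2) if availed == "Yes" else 0.0
    return {
        "CID": fake.uuid4(),                       # sketch only: real data reuses CIDs per customer
        "TID": fake.uuid4(),
        "Gender": random.choice(["Male", "Female"]),
        "Age Group": random.choice(["18-25", "26-35", "36-45", "46-60", "60+"]),  # assumed bins
        "Purchase Date": fake.date_time_between(start_date="-2y", end_date="now"),
        "Product Category": random.choice(["Electronics", "Apparel", "Groceries"]),  # assumed set
        "Discount Availed": availed,
        "Discount Name": "FESTIVE50" if availed == "Yes" else None,
        "Discount Amount (INR)": discount,
        "Gross Amount": gross,
        "Net Amount": round(gross - discount, 2),
        "Purchase Method": random.choice(["Credit Card", "Debit Card", "UPI"]),  # assumed set
        "Location": fake.city(),
    }

rows = [make_transaction() for _ in range(55_000)]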
Overview
This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in Python. The data covers three synchronous areas of the European power grid, corresponding to the TSOs TransnetBW (Continental Europe), Nationalgrid (Great Britain), and Fingrid (Nordic area).
This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data or the code. For a detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.
Data sources
We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).
Content of the repository
A) Scripts
The Python scripts run with Python 3.7 and with the packages found in "requirements.txt".
B) Data_converted and Data_cleansed
The folder "Data_converted" contains the output of "convert_data_format.py" and "Data_cleansed" contains the output of "clean_corrupted_data.py".
Use cases
We point out that this repository can be used in two different ways:
from helper_functions import *
import numpy as np           # needed for np.argmax below
import pandas as pd

# Load the cleansed frequency time series (zipped CSV with a datetime index)
cleansed_data = pd.read_csv('/Path_to_cleansed_data/data.zip',
                            index_col=0, header=None, squeeze=True,
                            parse_dates=[0])

# Find the longest contiguous interval without NaNs and extract it
valid_bounds, valid_sizes = true_intervals(~cleansed_data.isnull())
start, end = valid_bounds[np.argmax(valid_sizes)]
data_without_nan = cleansed_data.iloc[start:end]
License
We release the code in the folder "Scripts" under the MIT license [8]. In the case of Nationalgrid and Fingrid, we further release the pre-processed data in the folders "Data_converted" and "Data_cleansed" under the CC-BY 4.0 license [7]. TransnetBW originally did not publish their data under an open license. We have explicitly received permission from TransnetBW to publish the pre-processed version; however, we cannot release it under an open license because the original TransnetBW data carries no open license.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborate models.
Data usage
The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494
Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.
Sample code
As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.
Units
All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.
Missing data
The string "NAN" indicates missing data.
File formats
All time series data files are in CSV (comma-separated values) format. Images are given in tar.bz2 archives.
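As an illustration (not part of the original release), the CSV files can be loaded with pandas so that the "NAN" sentinel becomes a proper missing value; treating the first column as the UTC timestamp index is our assumption:

import pandas as pd

# "NAN" strings become NaN; timestamps are UTC (YYYY-MM-DD HH:MM:SS)
irr = pd.read_csv("Folsom_irradiance.csv",
                  na_values=["NAN"],
                  parse_dates=[0],   # assumes the first column holds the timestamp
                  index_col=0)
print(irr.isna().mean())             # fraction of missing data per column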
Files
| File | Type | Description |
|---|---|---|
| Folsom_irradiance.csv | Primary | One-minute GHI, DNI, and DHI data. |
| Folsom_weather.csv | Primary | One-minute weather data. |
| Folsom_sky_images_{YEAR}.tar.bz2 | Primary | Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2. |
| Folsom_NAM_lat{LAT}_lon{LON}.csv | Primary | NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node's coordinates listed in Table I in the paper. |
| Folsom_sky_image_features.csv | Secondary | Features derived from the sky images. |
| Folsom_satellite.csv | Secondary | 10 pixel by 10 pixel GOES-15 images centered on the target location. |
| Irradiance_features_{horizon}.csv | Secondary | Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}). |
| Sky_image_features_intra-hour.csv | Secondary | Sky image features for the intra-hour forecasting issuing times. |
| Sat_image_features_intra-day.csv | Secondary | Satellite image features for the intra-day forecasting issuing times. |
| NAM_nearest_node_day-ahead.csv | Secondary | NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location, prepared for day-ahead forecasting. |
| Target_{horizon}.csv | Secondary | Target data for the different forecasting horizons. |
| Forecast_{horizon}.py | Code | Python script used to create the forecasts for the different horizons. |
| Postprocess.py | Code | Python script used to compute the error metric for all the forecasts. |
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
This dataset is built on data from Overbuff, scraped with Python and Selenium. Development environment: Jupyter Notebook.
The tables contain data for competitive seasons 1-4 and for quick play, for each hero and rank, along with the standard statistics (those common to all heroes as well as information specific to a hero).
Note: data for some columns are missing on the Overbuff site (there is '—' instead of a specific value), so they were dropped: Scoped Crits for Ashe and Widowmaker, Rip Tire Kills for Junkrat, and Minefield Kills for Wrecking Ball. The 'Self Healing' column for Bastion was dropped too, as Bastion no longer has this property in OW2. There are also no values for "Javelin Spin Kills / 10min" for Orisa in season 1 (the column was dropped). Overall, all missing values were cleaned.
Attention: Overbuff doesn't contain info about OW 1 competitive seasons (when you change the skill tier on the site, the data doesn't change). If you know a site where it's possible to get this data, please leave a comment. Thank you!
The code is on GitHub.
The whole procedure is done in 5 stages:
1. Data is retrieved directly from HTML elements on the page with the Selenium tool in Python.
2. After scraping, the data was cleansed (see the sketch after this list): 1) the comma thousands separator was deleted (e.g. 1,009 => 1009); 2) time representations (e.g. '01:23') were translated to seconds (1*60 + 23 => 83); 3) Lúcio became Lucio, Torbjörn became Torbjorn.
3. Data were arranged into a table and saved to CSV.
4. Columns which are supposed to have only numeric values are checked, and all non-numeric values are dropped. This stage helps to find missing values which contain '—' and delete them.
5. Additional missing values are searched for and dealt with, either by a column rename (when the program cannot infer the correct column name for missing values) or by a column drop. This stage ensures all wrong data are truly fixed.
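A minimal sketch of the stage-2 cleansing transforms, assuming raw string values as scraped; the helper names are ours, not the project's:

import unicodedata

def clean_number(raw: str) -> float:
    """Drop thousands separators, e.g. '1,009' -> 1009.0."""
    return float(raw.replace(",", ""))

def time_to_seconds(raw: str) -> int:
    """Translate 'MM:SS' to seconds, e.g. '01:23' -> 1*60 + 23 = 83."""
    minutes, seconds = raw.split(":")
    return int(minutes) * 60 + int(seconds)

def strip_accents(name: str) -> str:
    """'Lúcio' -> 'Lucio', 'Torbjörn' -> 'Torbjorn'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))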
The procedure to fetch the data takes 7 minutes on average.
This project and code were born from this GitHub code.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of hidden nodes of SCNs and the corresponding training error.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A Case Study
In this case study we are going to use the automobile dataset, which lists plenty of car manufacturers' models with their specifications, in order to build a predictive model that estimates the approximate car price. This dataset has 26 columns, including categorical and quantitative attributes.
The given_automobile.csv contains records from the above-mentioned dataset.
You need to write descriptive answers to the questions under each task, and also write a proper program in Python and execute the code (a sketch for the first three tasks follows the list).
1. The missing values are presented as '?' in the dataset. Apply data wrangling techniques in Python to resolve the missing values in all the attributes.
2. Check the data types of the columns with missing values, and convert the data type if needed.
3. Find all the features correlated with 'Price'.
4. Build a predictive model to predict the car price using one of the independent correlated variables.
5. Continue with the same model built in No. 4, but choose different independent variables and discuss the result.
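A minimal sketch for tasks 1-3, under the assumption that the price column is named 'Price' as in the task description:

import numpy as np
import pandas as pd

df = pd.read_csv("given_automobile.csv")

# Task 1: '?' marks missing values; turn them into NaN so pandas treats them uniformly
df = df.replace("?", np.nan)

# Task 2: columns that contained '?' were read as strings; convert them back to numbers
for col in ["Price"]:                                   # extend with the other affected columns
    df[col] = pd.to_numeric(df[col], errors="coerce")
df["Price"] = df["Price"].fillna(df["Price"].mean())    # one possible wrangling choice

# Task 3: correlation of the numeric features with 'Price'
print(df.corr(numeric_only=True)["Price"].sort_values(ascending=False))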
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of SCNs-based learners and corresponding error.
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Hosted by: Walsoft Computer Institute.
Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.
As part of an internal review, the leadership team has hired you, a Data Science Consultant, to analyze this dataset and provide clear, evidence-based recommendations on how to improve student success, resource allocation, and overall program effectiveness.
Answer this central question:
“Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”
You are required to analyze and provide actionable insights for the following three areas:
1. Should entry exams remain the primary admissions filter?
Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.
✅ Deliverables:
2. Are there at-risk student groups who need extra support?
Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.
✅ Deliverables:
3. How can we allocate resources for maximum student success?
Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.
✅ Deliverables:
| Column | Description |
|---|---|
| fNAME, lNAME | Student first and last name |
| Age | Student age (21–71 years) |
| gender | Gender (standardized as "Male"/"Female") |
| country | Student's country of origin |
| residence | Student housing/residence type |
| entryEXAM | Entry test score (28–98) |
| prevEducation | Prior education (High School, Diploma, etc.) |
| studyHOURS | Total study hours logged |
| Python | Final Python exam score |
| DB | Final Database exam score |
You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.
Download: bi.csv
This dataset includes common data quality challenges:
- Country name inconsistencies, e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom
- Residence type variations, e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence
- Education level typos and casing issues, e.g. Barrrchelors → Bachelor; DIPLOMA, Diplomaaa → Diploma
- Gender value noise, e.g. M, F, female → standardize to Male / Female
- Missing scores in the Python subject: fill NaN values using the column mean or another suitable imputation strategy
Participants using this dataset are expected to apply data cleaning techniques such as the following (a sketch appears after the list):
- String standardization
- Null value imputation
- Type correction (e.g., scores as float)
- Validation and visual verification
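A minimal sketch of these cleaning steps, using the column names from the data dictionary above; the mapping tables cover only the documented examples and would need extending for a full clean:

import pandas as pd

df = pd.read_csv("bi.csv")

# String standardization (mappings cover only the documented examples)
df["country"] = df["country"].replace(
    {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
df["residence"] = df["residence"].replace(
    {"BI-Residence": "BI Residence", "BIResidence": "BI Residence", "BI_Residence": "BI Residence"})
df["prevEducation"] = df["prevEducation"].replace(
    {"Barrrchelors": "Bachelor", "DIPLOMA": "Diploma", "Diplomaaa": "Diploma"})
df["gender"] = df["gender"].replace({"M": "Male", "F": "Female", "female": "Female"})

# Type correction and null value imputation (mean imputation, as suggested above)
df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
df["Python"] = df["Python"].fillna(df["Python"].mean())

# Validation
assert df["gender"].isin(["Male", "Female"]).all()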
✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
Download: cleaned_bi.csv
This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
We recently introduced transformato, an open-source Python package for the automated setup of large-scale calculations of relative solvation and binding free energy differences. Here, we extend the capabilities of transformato to the calculation of absolute solvation free energy differences. After careful validation against the literature results and reference calculations with the PERT module of CHARMM, we used transformato to compute absolute solvation free energies for most molecules in the FreeSolv database (621 out of 642). The force field parameters were obtained with the program cgenff (v2.5.1), which derives missing parameters from the CHARMM general force field (CGenFF v4.6). A long-range correction for the Lennard-Jones interactions was added to all computed solvation free energies. The mean absolute error compared to the experimental data is 1.12 kcal/mol. Our results allow a detailed comparison between the AMBER and CHARMM general force fields and provide a more in-depth understanding of the capabilities and limitations of the CGenFF small molecule parameters.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains groundwater level trends and time series data across a discretized grid of California's Central Valley, modeled with well data using a hierarchical Gaussian process and neural network regression methodology. The spatial grid consists of 400 cells spanning latitudes 34.91 to 40.895 degrees and 220 cells spanning longitudes -122.6 to -118.658 degrees. The temporal axis spans March 2015 to August 2020, discretized at biweekly intervals, for a total of 132 cells. The spatiotemporal grid details are present in the relevant files.
The first dataset is contained in the Python pickle file 'CV_water_level_trends_Mar2015_Aug2020.pkl'. This file contains a nested Python dictionary with the following pairs:
1.1. 'latitude': Numpy array of shape 400 x 220
1.2. 'longitude': Numpy array of shape 400 x 220
1.3. 'mean': Python dictionary with mean long-term and seasonal water level trends
1.4. 'P10': Python dictionary with P10 long-term and seasonal water level trends
1.5. 'P90': Python dictionary with P90 long-term and seasonal water level trends
Each of the dictionaries in 1.3, 1.4, and 1.5 contains the following keys and values:
- 'initial_water_level_ft': Mean/P10/P90 of March 2015 water levels in feet, stored as a Numpy array of shape 400 x 220
- 'water_level_decline_rate_ft/biweek': Mean/P10/P90 of March 2015 - August 2020 water level decline rates in ft/biweek, stored as a Numpy array of shape 400 x 220
- 'water_level_amplitude_ft': Mean/P10/P90 of the seasonal water level oscillation amplitude, stored as a Numpy array of shape 400 x 220
- 'water_level_phase_deg': Mean/P10/P90 of the time to peak seasonal signal in degrees, stored as a Numpy array of shape 400 x 220
The second dataset is contained in the Python pickle file 'CV_water_level_time_series_Mar2015_Aug2020.pkl'. This file contains a Python dictionary with the following pairs:
2.1. 'latitude': Numpy array of shape 400 x 220
2.2. 'longitude': Numpy array of shape 400 x 220
2.3. 'time_axis': Python list of length 132 containing strings for the biweekly periods from March 2015 to August 2020
2.4. 'water_level_well_ft': Processed water level observations in feet from 1744 wells, irregularly sampled across time, stored as a Numpy array of shape 400 x 220 x 132 with missing values as NaNs
2.5. 'water_level_modeled_mean_ft': Modeled mean water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
2.6. 'water_level_modeled_P10_ft': Modeled P10 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
2.7. 'water_level_modeled_P90_ft': Modeled P90 water level time series in feet, stored as a Numpy array of shape 400 x 220 x 132
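For orientation, a minimal sketch for reading the trend file and extracting, e.g., the mean decline-rate grid (key names as documented above, with the duplicated 'longitude' key in the original listing read as 'latitude'):

import pickle

with open("CV_water_level_trends_Mar2015_Aug2020.pkl", "rb") as f:
    trends = pickle.load(f)

lat = trends["latitude"]     # 400 x 220 grid of latitudes
lon = trends["longitude"]    # 400 x 220 grid of longitudes
decline = trends["mean"]["water_level_decline_rate_ft/biweek"]  # 400 x 220, ft/biweek
print(decline.shape)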
These datasets come from Google Earth Engine and are used in the ACEA challenge.
The first dataset is a daily time series from the Copernicus ECMWF ERA5 Daily aggregates, extracted using weather station geolocations.
The time series range from 1998 to 2020 and cover 48 different stations located in Italy.
The extraction was done with this script:
import pandas as pd
import numpy as np
from datetime import datetime as dt
import ee


def extract_time_series(lat, lon, start, end, product_name, sf):
    # Set up point geometry
    point = ee.Geometry.Point(lon, lat)

    # Obtain image collection for all images within query dates
    coll = ee.ImageCollection(product_name) \
        .filterDate(start, end)

    # Copy the band values at the point into each image's properties
    def setProperty(image):
        dic = image.reduceRegion(ee.Reducer.first(), point)
        return image.set(dic)

    data = coll.map(setProperty)
    data = data.getInfo()
    liste = list(map(lambda x: pd.DataFrame(x['properties']), data['features']))
    df = pd.concat(liste)
    return df


if __name__ == "__main__":
    ee.Initialize()
    # locations is a dictionary mapping station names to latitude/longitude;
    # PATH is the output directory. Both must be defined beforehand.
    for i in locations.keys():
        print(i)
        latitude = locations[i]['lat']
        longitude = locations[i]['lon']
        # Retry until the request succeeds (Earth Engine calls can fail transiently)
        while True:
            try:
                output = extract_time_series(latitude,
                                             longitude,
                                             '1998-01-01',
                                             '2020-01-01',
                                             'ECMWF/ERA5/DAILY',
                                             1)
                break
            except Exception:
                print(i + " 1 fail")
                continue
        name = PATH + i + "_1.csv"
        output.to_csv(name, index=True)
The second dataset is forecasted weather from the Global Forecast System.
The purpose of this dataset is to provide forecasted rainfall and temperature for the 16 coming days. The creation_time column is the release date, while forecast_hours gives the forecast horizon: the forecast is valid at creation_time + forecast_hours. The time series are daily and range from 2015 to 2020. Unfortunately, there are missing values.
Python script:
import pandas as pd
import numpy as np
from datetime import datetime as dt
import ee


def extract_time_series_gfs(lat, lon, start, end, product_name, sf, h):
    # Set up point geometry
    point = ee.Geometry.Point(lon, lat)

    # Obtain image collection for all images within query dates,
    # restricted to the requested forecast horizon h (in hours)
    coll = ee.ImageCollection(product_name) \
        .select(['total_precipitation_surface', 'temperature_2m_above_ground']) \
        .filterDate(start, end) \
        .filterMetadata('forecast_hours', 'equals', h)

    # Copy the band values at the point into each image's properties
    def setProperty(image):
        dic = image.reduceRegion(ee.Reducer.first(), point)
        return image.set(dic)

    data = coll.map(setProperty)
    data = data.getInfo()
    liste = list(map(lambda x: pd.DataFrame(x['properties']), data['features']))
    df = pd.concat(liste)
    df = df[df["system:footprint"] == "LinearRing"]
    return df


if __name__ == "__main__":
    ee.Initialize()
    # Horizons of 24 h to 384 h (16 days) in daily steps
    horizon = [i * 24 for i in range(1, 17)]
    for i in locations.keys():
        print(i)
        latitude = locations[i]['lat']
        longitude = locations[i]['lon']
        for j in horizon:
            # Retry until the request succeeds
            while True:
                try:
                    output = extract_time_series_gfs(latitude,
                                                     longitude,
                                                     '2015-07-01',
                                                     '2020-08-01',
                                                     'NOAA/GFS0P25',
                                                     1,
                                                     j)
                    break
                except Exception:
                    print(i + " " + str(j) + " 1 fail")
                    continue
            name = PATH + i + "_" + str(j) + "_1.csv"
            output.to_csv(name, index=True)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Objective
The primary objective of this study was to analyze CpG dinucleotide dynamics in coronaviruses by comparing Wuhan-Hu-1 with its closest and most distant relatives. Heatmaps were generated to visualize CpG counts and O/E ratios across intergenic regions, providing a clear depiction of conserved and divergent CpG patterns.
Methods
1. Data Collection
Source: The dataset includes CpG counts and O/E ratios for various coronaviruses, extracted from publicly available genomic sequences.
Format: Data was compiled into a CSV file containing columns for intergenic regions, CpG counts, and O/E ratios for each virus.
2. Preprocessing
Data Cleaning: Missing values (NaN), infinite values (inf, -inf), and blank entries were handled using Python's pandas library. Missing values were replaced with column means, and infinite values were capped at a large finite value (1e9).
Reshaping: The data was reshaped into matrices for CpG counts and O/E ratios using pandas' melt() and pivot() functions.
3. Distance Calculation
Euclidean Distance: Pairwise Euclidean distances were calculated between Wuhan-Hu-1 and the other viruses using the scipy.spatial.distance.euclidean function. Distances were computed separately for CpG counts and O/E ratios, and the total distance was derived as the sum of both metrics.
4. Identification of Closest and Most Distant Relatives
The virus with the smallest total distance was identified as the closest relative; the virus with the largest total distance was identified as the most distant relative.
5. Heatmap Generation
Tools: Heatmaps were generated using Python's seaborn library (sns.heatmap) and matplotlib for visualization.
Parameters: Heatmaps were annotated with numerical values for clarity. A color gradient (coolwarm) was used to represent varying CpG counts and O/E ratios. Titles and axis labels were added to describe the comparison between Wuhan-Hu-1 and its relatives.
Results
Closest Relative: The closest relative to Wuhan-Hu-1 was identified based on the smallest Euclidean distance. Heatmaps for CpG counts and O/E ratios show high similarity in specific intergenic regions.
Most Distant Relative: The most distant relative was identified based on the largest Euclidean distance. Heatmaps reveal significant differences in CpG dynamics compared to Wuhan-Hu-1.
Tools and Libraries
Programming language: Python 3.13. Libraries: pandas (data manipulation and cleaning), numpy (numerical operations and handling missing/infinite values), scipy.spatial.distance (Euclidean distances), seaborn (heatmaps), and matplotlib (additional visualization enhancements). File formats: input CSV files containing CpG counts and O/E ratios; output PNG images of heatmaps.
Files Included
- CSV file: the raw data of CpG counts and O/E ratios for all viruses.
- Heatmap images: heatmaps for CpG counts and O/E ratios comparing Wuhan-Hu-1 with its closest and most distant relatives.
- Python script: the full Python code used for data processing, distance calculation, and heatmap generation.
Usage Notes
Researchers can use this dataset to further explore the evolutionary dynamics of CpG dinucleotides in coronaviruses. The Python script can be adapted to analyze other viral genomes or datasets. Heatmaps provide a visual summary of CpG dynamics, aiding in hypothesis generation and experimental design.
Acknowledgments
Special thanks to the open-source community for developing tools like pandas, numpy, seaborn, and matplotlib. This work was conducted as part of an independent research project in molecular biology and bioinformatics.
License
This dataset is shared under the CC BY 4.0 License, allowing others to share and adapt the material as long as proper attribution is given. DOI: 10.6084/m9.figshare.28736501
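A minimal sketch of the distance and heatmap steps described above; the long-format column names and file name are assumptions, and only CpG counts are shown for brevity:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean

df = pd.read_csv("cpg_data.csv")                 # hypothetical file name
df = df.replace([np.inf, -np.inf], 1e9)          # cap infinite values at a large finite value
counts = df.pivot(index="virus", columns="region", values="cpg_count")
counts = counts.fillna(counts.mean())            # replace missing values with column means

# Total Euclidean distance of every virus to Wuhan-Hu-1
ref = counts.loc["Wuhan-Hu-1"]
dist = {v: euclidean(ref, counts.loc[v]) for v in counts.index if v != "Wuhan-Hu-1"}
closest = min(dist, key=dist.get)
farthest = max(dist, key=dist.get)

# Annotated heatmap comparing the reference with its closest and most distant relatives
sns.heatmap(counts.loc[["Wuhan-Hu-1", closest, farthest]], annot=True, cmap="coolwarm")
plt.title("CpG counts: Wuhan-Hu-1 vs. closest and most distant relatives")
plt.tight_layout()
plt.show()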
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains aggregated and sub-metered power consumption data from a two-person apartment in Germany. Data was collected from March 5 to September 4, 2025, spanning 6 months. It includes an aggregate reading from a main smart meter and individual readings from 40 smart plugs, smart relays, and smart power meters monitoring various appliances.
The dataset can be downloaded here: https://doi.org/10.5281/zenodo.17159850
As it contains long off periods with zeros, the CSV file compresses well: the compression leads to a 97% smaller file size (from 4 GB to 90.9 MB).
To extract it, use: xz -d DARCK.csv.xz
To use the dataset in Python, you can, e.g., load the CSV file into a pandas DataFrame:

import pandas as pd
df = pd.read_csv("DARCK.csv", parse_dates=["time"])
The main meter was monitored using an infrared reading head magnetically attached to the infrared interface of the meter. An ESP8266 flashed with Tasmota decodes the binary datagrams and forwards the Watt readings to the MQTT broker. Individual appliances were monitored using a combination of Shelly Plugs (for outlets), Shelly 1PM (for wired-in devices like ceiling lights), and Shelly PM Mini (for each of the three phases of the oven). All devices reported to a central InfluxDB database via Home Assistant running in Docker on a Dell OptiPlex 3020M.
File format (DARCK.csv): The dataset is provided as a single comma-separated values (CSV) file.
| Column Name | Data Type | Unit | Description |
|---|---|---|---|
| time | datetime | - | Timestamp for the reading in YYYY-MM-DD HH:MM:SS |
| main | float | Watt | Total aggregate power consumption for the apartment, measured at the main electrical panel. |
| [appliance_name] | float | Watt | Power consumption of an individual appliance (e.g., lightbathroom, fridge, sherlockpc). See Section 8 for a full list. |
| Aggregate columns: | | | |
| aggr_chargers | float | Watt | The sum of sherlockcharger, sherlocklaptop, watsoncharger, watsonlaptop, watsonipadcharger, kitchencharger. |
| aggr_stoveplates | float | Watt | The sum of stoveplatel1 and stoveplatel2. |
| aggr_lights | float | Watt | The sum of lightbathroom, lighthallway, lightsherlock, lightkitchen, lightlivingroom, lightwatson, lightstoreroom, fcob, sherlockalarmclocklight, sherlockfloorlamphue, sherlockledstrip, livingfloorlamphue, sherlockglobe, watsonfloorlamp, watsondesklamp and watsonledmap. |
| Analysis columns: | | | |
| inaccuracy | float | Watt | As no electrical device bypasses a power meter, the true inaccuracy can be assessed. It is the absolute error between the sum of individual measurements and the mains reading. A 30 W offset is applied to the sum since the measurement devices themselves draw power which is otherwise unaccounted for. |
The final dataset was generated from two raw data sources (meter.csv and shellies.csv) using a comprehensive postprocessing pipeline.
Main meter (main) postprocessing: The aggregate power data required several cleaning steps to ensure accuracy.
Shelly (shellies) postprocessing: The Shelly devices are not prone to the same burst issue as the ESP8266. They push a new reading at every change in power drawn. If no power change is observed, or the observed change is too small (less than a few Watt), the reading is pushed once a minute together with a heartbeat. When a device turns on or off, intermediate power values are published, which leads to sub-second values that need to be handled.
Readings were aligned to a regular time index with .resample('1s').last().ffill(). NaN values (e.g., from before a device was installed) were filled with 0.0, assuming zero consumption. During analysis, two significant unmetered load events were identified and manually corrected to improve the accuracy of the aggregate reading. The error column (inaccuracy) was recalculated after these corrections.
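A minimal sketch of the resampling step and a recomputation of the inaccuracy column under the 30 W self-consumption offset described above; the exclusion list for the aggregate and analysis columns follows the table in this description:

import pandas as pd

df = pd.read_csv("DARCK.csv", parse_dates=["time"]).set_index("time")

# Regularize readings to a 1 s grid: keep the last sub-second value, then forward-fill;
# remaining NaNs (before a device was installed) are treated as zero consumption
df = df.resample("1s").last().ffill().fillna(0.0)

# Recompute inaccuracy: |sum of submeters + 30 W self-consumption - mains reading|
appliances = df.columns.difference(
    ["main", "inaccuracy", "aggr_chargers", "aggr_stoveplates", "aggr_lights"])
df["inaccuracy"] = (df[appliances].sum(axis=1) + 30.0 - df["main"]).abs()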
The following table lists the column names with an explanation where needed. As Watson moved at the beginning of June, some metering plugs changed their appliance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This research data file contains the necessary software and the dataset for estimating the missing prices of house units. This approach combines several machine learning techniques (linear regression, support vector regression, k-nearest neighbors, and a multi-layer perceptron neural network) with several dimensionality reduction techniques (non-negative factorization, recursive feature elimination, and feature selection with a variance threshold). It includes the input dataset formed with the house prices available for the center of Teruel city (Spain) on December 30, 2016, from the Idealista website.
This dataset supports the research of the authors in the improvement of the setup of agent-based simulations about real-estate market. The work about this dataset has been submitted for consideration for publication to a scientific journal.
The open-source Python code is composed of all the files with the ".py" extension. The main program can be executed from the "main.py" file. The "boxplotErrors.eps" file is a chart generated by the code execution, comparing the results of the different combinations of machine learning techniques and dimensionality reduction methods.
The dataset is in the "data" folder. The input raw data of the house prices are in the "dataRaw.csv" file. These were shuffled into the "dataShuffled.csv" file. We used cross-validation to obtain the estimations of house prices. The output estimations alongside the real values are stored in different files of the "data" folder, where each filename combines the abbreviation of the machine learning technique with the abbreviation of the dimensionality reduction method.
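As an illustration of one such combination, a sketch pairing recursive feature elimination with support vector regression under cross-validation; the target column name and hyperparameters are assumptions, not the authors' exact setup:

import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

data = pd.read_csv("data/dataShuffled.csv")          # shuffled input, as described above
X, y = data.drop(columns=["price"]), data["price"]   # 'price' column name is an assumption

# RFE needs estimator coefficients, hence the linear kernel inside the selector
model = make_pipeline(StandardScaler(),
                      RFE(SVR(kernel="linear"), n_features_to_select=5),
                      SVR(kernel="linear"))
estimates = cross_val_predict(model, X, y, cv=10)    # estimated vs. real prices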
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
The amount of data used in phylogenetics has grown explosively in recent years, and many phylogenies are inferred with hundreds or even thousands of loci and many taxa. These modern phylogenomic studies often entail separate analyses of each of the loci in addition to multiple analyses of subsets of genes or concatenated sequences. Computationally efficient tools for handling and computing properties of thousands of single-locus or large concatenated alignments are needed. Here I present AMAS (Alignment Manipulation And Summary), a tool that can be used either as a stand-alone command-line utility or as a Python package. AMAS works on amino acid and nucleotide alignments and combines capabilities of sequence manipulation with a function that calculates basic statistics. The manipulation functions include conversions among popular formats, concatenation, extracting sites and splitting according to a pre-defined partitioning scheme, creation of replicate data sets, and removal of taxa. The statistics calculated include the number of taxa, alignment length, total count of matrix cells, overall number of undetermined characters, percent of missing data, AT and GC contents (for DNA alignments), count and proportion of variable sites, count and proportion of parsimony-informative sites, and counts of all characters relevant for a nucleotide or amino acid alphabet. AMAS is particularly suitable for very large alignments with hundreds of taxa and thousands of loci. It is computationally efficient, utilizes parallel processing, and performs better at concatenation than other popular tools. AMAS is a Python 3 program that relies solely on Python's core modules and needs no additional dependencies. AMAS source code and manual can be downloaded from http://github.com/marekborowiec/AMAS/ under the GNU General Public License.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected data from 36 published articles on PubMed [13–48] to train and validate our machine learning models. Some articles comprised more than one type of cartilage injury model or treatment condition. In total, 15 clinical trial conditions and 29 animal model conditions (1 goat, 6 pigs, 2 dogs, 9 rabbits, 9 rats, and 2 mice) on osteochondral injury or osteoarthritis were included, where MSCs were transplanted to repair the cartilage tissue. We documented each case with a specific treatment condition as an entry by considering the cell- and treatment-target-related factors as input properties, including species, body weight, tissue source, cell number, cell concentration, defect area, defect depth, and type of cartilage damage. The therapeutic outcomes were considered as output properties, which were evaluated using integrated clinical and histological cartilage repair scores, including the International Cartilage Repair Society (ICRS) scoring system, the O'Driscoll score, the Pineda score, the Mankin score, the Osteoarthritis Research Society International (OARSI) scoring system, the International Knee Documentation Committee (IKDC) score, the visual analog score (VAS) for pain, the Knee injury and Osteoarthritis Outcome Score (KOOS), the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and the Lysholm score. In this study, these scores were linearly normalized to a number between 0 and 1, with 0 representing the worst damage or pain and 1 representing completely healthy tissue. The list of entries was combined to form a database.
We have provided the details of the imputation algorithm in the subsection Handling missing data under Methods, and a flowchart in Fig 2. The data imputation algorithm for the vector x was added to the manuscript for illustration. The pseudo-code for the uncertainty calculation is shown in S1 Algorithm: An ensemble model to measure the ANN's prediction uncertainty. The original database gathered from the literature and a 'complete' database with missing information filled in by our neural network are also included, along with a sample neural network architecture file in Python.
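In the spirit of S1 Algorithm, here is a minimal sketch of ensemble-based prediction uncertainty, using scikit-learn's MLPRegressor as a stand-in for the paper's network (architecture and sizes are placeholders, not the authors' exact settings):

import numpy as np
from sklearn.neural_network import MLPRegressor

def predict_with_uncertainty(X_train, y_train, X_new, n_models=10):
    """Train an ensemble of ANNs; return the mean prediction and its spread."""
    preds = []
    for seed in range(n_models):
        net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=seed)
        net.fit(X_train, y_train)
        preds.append(net.predict(X_new))
    preds = np.asarray(preds)
    # Standard deviation across ensemble members serves as the uncertainty estimate
    return preds.mean(axis=0), preds.std(axis=0)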
Here we provide a Python notebook comprising a neural network that delivers the performance and results described in the manuscript. Documentation in the form of comments and installation guide is included in the Python notebook. This Python notebook along with the methods described in the manuscript provides sufficient details for other interested readers to either extend this script or write their own scripts and reproduce the results in the paper.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This synthetic dataset is designed specifically for practicing data visualization and exploratory data analysis (EDA) using popular Python libraries like Seaborn, Matplotlib, and Pandas.
Unlike most public datasets, this one includes a diverse mix of column types:
- 📅 Date columns (for time series and trend plots)
- 🔢 Numerical columns (for histograms, boxplots, scatter plots)
- 🏷️ Categorical columns (for bar charts, group analysis)
Whether you are a beginner learning how to visualize data or an intermediate user testing new charting techniques, this dataset offers a versatile playground.
Feel free to:
- Create EDA notebooks
- Practice plotting techniques
- Experiment with filtering, grouping, and aggregations
🛠️ No missing values, no data cleaning needed: just download and start exploring!
Hope you find this helpful. Looking forward to hearing from you all.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Abstract
This project presents a comprehensive analysis of a company's annual sales, using the classic classicmodels dataset as the database. Python is used as the main programming language, along with the Pandas, NumPy, and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.
The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.
Methodology
1. Data Extraction (see the sketch after this list):
2. Data Cleansing and Transformation:
3. Exploratory Data Analysis (EDA):
4. Modeling and Prediction:
5. Report Generation:
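A minimal sketch of the extraction stage, assuming classicmodels is loaded into a local PostgreSQL instance; the connection string is a placeholder:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; adjust user, password, and host as needed
engine = create_engine("postgresql://user:password@localhost:5432/classicmodels")

# Pull the order lines needed for the sales analysis into a DataFrame
orders = pd.read_sql("SELECT * FROM orderdetails", engine)
print(orders.head())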
Results
- Identification of top products and customers: the best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: metrics such as average profit margin and sales growth rate are calculated.
Conclusions This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.
Technologies Used
- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include our Python 3 program that begins with the original, untampered Seshat database and performs the entire process of turning it into Shiny Seshat (including all error correction, Complexity Characteristic creation, imputation of missing values, etc.). (ZIP)