30 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The python code for splitting a dataset into train, test and validation set.
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The python code for translating the cleaned dataset.
  2. NoCORA - Northern Cameroon Observed Rainfall Archive

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 10, 2024
    Cite
    Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel (2024). NoCORA - Northern Cameroon Observed Rainfall Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10156437
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Center for International Forestry Research
    Centre de Coopération Internationale en Recherche Agronomique pour le Développement
    Authors
    Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel
    Area covered
    Cameroon, North Region
    Description

    Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). The dataset, covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, making it, we hope, a robust and reliable resource for understanding climatic dynamics in Northern Cameroon.

    Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.

    Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process helped ensure the accuracy and consistency of the dataset.

    Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.

    Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. To ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.

    Authors Contributions: Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne. Documentation: Jérémy Lavarenne. Funding: This project was funded by the DESIRA INNOVACC project.

    Changelog:
    v1.0.2: corrected swapped column names in the coordinates dataset
    v1.0.1: dataset specification file updated with complementary information regarding station locations
    v1.0.0: initial submission

  3. Flipkart Laptops Data

    • kaggle.com
    zip
    Updated Jun 6, 2022
    Cite
    Abdul Hannan Ansari (2022). Flipkart Laptops Data [Dataset]. https://www.kaggle.com/datasets/ansariabdulhannan/flipkart-laptops-data
    Explore at:
    Available download formats: zip (18601 bytes)
    Dataset updated
    Jun 6, 2022
    Authors
    Abdul Hannan Ansari
    Description

    This dataset contains 40 laptops scraped from the Flipkart website using Python code. I want you to clean this dataset and analyze it like a data analyst would. I know this dataset is quite small; I plan to make it messier and bigger later on as the need increases. If you have analyzed the data, make sure to upload your Jupyter notebook for it. All the Best!

    ABOUT UPDATED DATA: I have updated the data by adding more content to make it interesting and messier. You can clean the dataset and do further work on it. Note that the prices are in Indian Rupees (INR). A minimal cleaning sketch follows.
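
    The description above invites a first cleaning pass, so here is a hedged sketch of what that might look like with pandas. The file name and the "price" column are assumptions, not part of the dataset's documentation.

```python
# A possible first cleaning pass; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("flipkart_laptops.csv")            # hypothetical file name

df = df.drop_duplicates()                           # drop repeated scrape rows
df.columns = df.columns.str.strip().str.lower()     # normalize header names

# Prices scraped as text like "45,990" -> numeric INR
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

print(df.isna().sum())                              # inspect remaining gaps
```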

  4. S&P 500 Companies Analysis Project

    • kaggle.com
    zip
    Updated Apr 6, 2025
    Cite
    anshadkaggle (2025). S&P 500 Companies Analysis Project [Dataset]. https://www.kaggle.com/datasets/anshadkaggle/s-and-p-500-companies-analysis-project
    Explore at:
    Available download formats: zip (9721576 bytes)
    Dataset updated
    Apr 6, 2025
    Authors
    anshadkaggle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.

    Included Files:

    sp500_cleaned.csv – Cleaned dataset used for analysis

    sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)

    dashboard_screenshot.png – Screenshot of Power BI dashboard

    README.md – Summary of the project and key takeaways

    This project demonstrates practical data cleaning, querying, and visualization skills.
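
    As an illustration of the kind of sector-level breakdown described above, here is a minimal pandas sketch. The file name sp500_cleaned.csv comes from the file list; the "Sector" column name is an assumption and may differ in the actual CSV.

```python
# Sector breakdown sketch; the "Sector" column name is an assumption.
import pandas as pd

sp500 = pd.read_csv("sp500_cleaned.csv")

# Companies per sector, largest first
sector_counts = (
    sp500.groupby("Sector", as_index=False)
         .size()
         .sort_values("size", ascending=False)
)
print(sector_counts.head(10))

# Export for use in a Power BI page
sector_counts.to_csv("sector_summary.csv", index=False)
```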

  5. Surveys of Data Professionals (Alex the Analyst)

    • kaggle.com
    zip
    Updated Nov 27, 2023
    Cite
    Stewie (2023). Surveys of Data Professionals (Alex the Analyst) [Dataset]. https://www.kaggle.com/datasets/alexenderjunior/surveys-of-data-professionals-alex-the-analyst
    Explore at:
    Available download formats: zip (81050 bytes)
    Dataset updated
    Nov 27, 2023
    Authors
    Stewie
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [Dataset Name] - About This Dataset

    Overview

    This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.

    Context

    The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.

    Content

    The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].

    Acknowledgements

    The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.

    Citation

    If you use this dataset in your work, please cite it as follows:

    How to Use

    1. Download the dataset from this link.
    2. Explore the Jupyter Notebook in the associated repository for insights into the data cleaning process.

    Feel free to reach out for any additional information or clarification. Happy analyzing!

  6. Diwali_Sales_Data_Analysis

    • kaggle.com
    Updated Aug 6, 2023
    Cite
    Adimadapala Geetika (2023). Diwali_Sales_Data_Analysis [Dataset]. https://www.kaggle.com/datasets/adimadapalageetika/diwali-sales-data-analysis
    Explore at:
    Croissant: a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Adimadapala Geetika
    Description

    Completed Jupyter Notebook project: conducted data cleaning and exploratory data analysis with pandas, matplotlib, and seaborn. Enhanced customer experience by identifying potential customers based on demographics, and improved sales by optimizing inventory planning through product analysis. A brief EDA sketch follows.
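
    A hedged sketch of the cleaning and grouped-EDA steps mentioned above, using pandas and seaborn; the file name and the "Amount", "Age Group", and "Gender" column names are assumptions about the Diwali sales data and may need adjusting.

```python
# EDA sketch; file name and column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sales = pd.read_csv("Diwali Sales Data.csv", encoding="latin1")  # hypothetical name
sales = sales.dropna(subset=["Amount"])        # drop rows without a sale amount

# Revenue by demographic segment
by_segment = sales.groupby(["Age Group", "Gender"], as_index=False)["Amount"].sum()

sns.barplot(data=by_segment, x="Age Group", y="Amount", hue="Gender")
plt.title("Diwali sales by age group and gender")
plt.tight_layout()
plt.show()
```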

  7. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • dataverse.harvard.edu
    • figshare.com
    Updated Mar 21, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
    Explore at:
    Croissant: a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  8. Sample Park Analysis

    • figshare.com
    zip
    Updated Nov 2, 2025
    Cite
    Eric Delmelle (2025). Sample Park Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.30509021.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 2, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eric Delmelle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README – Sample Park Analysis

    ## Overview
    This repository contains a Google Colab / Jupyter notebook and accompanying dataset used for analyzing park features and associated metrics. The notebook demonstrates data loading, cleaning, and exploratory analysis of the Hope_Park_original.csv file.

    ## Contents
    - sample park analysis.ipynb — The main analysis notebook (Colab/Jupyter format)
    - Hope_Park_original.csv — Source dataset containing park information
    - README.md — Documentation for the contents and usage

    ## Usage
    1. Open the notebook in Google Colab or Jupyter.
    2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).
    3. Run each cell sequentially to reproduce the analysis.

    ## Requirements
    The notebook uses standard Python data science libraries:
    ```python
    pandas
    numpy
    matplotlib
    seaborn
    ```
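
    A minimal quick-start matching the README's usage steps; only the Hope_Park_original.csv file name comes from the README, the rest is a generic first look.

```python
# First look at the park data; only the file name comes from the README.
import pandas as pd

park = pd.read_csv("Hope_Park_original.csv")
print(park.shape)        # rows, columns
print(park.dtypes)       # column types
print(park.head())       # first few records
```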

  9. The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures...

    • data.niaid.nih.gov
    Updated Sep 25, 2020
    Cite
    Chen, Xinlei; Liu, Xinyu; Eng, Kent X.; Liu, Jingxiao; Noh, Hae Young; Zhang, Lin; Zhang, Pei (2020). The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4028129
    Explore at:
    Dataset updated
    Sep 25, 2020
    Dataset provided by
    Carnegie Mellon University
    Tsinghua University
    Stanford University
    Authors
    Chen, Xinlei; Liu, Xinyu; Eng, Kent X.; Liu, Jingxiao; Noh, Hae Young; Zhang, Lin; Zhang, Pei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The S&M-HSTPM2d5 dataset contains high spatial- and temporal-resolution particulate matter (PM2.5) measurements, each with a timestamp and GPS location, collected by mobile and static devices in three Chinese cities: Foshan, Cangzhou, and Tianjin. A different number of static and mobile devices was deployed in each city. The sampling interval was one minute in Cangzhou and three seconds in Foshan and Tianjin. For details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.

    After data collection, a data cleaning process was performed to remove and adjust abnormal and drifting data. The script for the data cleaning algorithm is provided in this repository. The algorithm only adjusts or removes individual data points; removal of an entire device's data was done after the cleaning algorithm, based on empirical judgment and graphical visualization. For specific details of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.

    The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.

    The data is stored as CSV files. Each CSV file, named by device ID, contains the data collected by the corresponding device. Each CSV file has three types of data: a timestamp in China Standard Time (GMT+8), a geographic location as latitude and longitude, and the PM2.5 concentration in micrograms per cubic meter. The CSV files are stored in either a Static or a Mobile folder according to the device type, and these folders are stored in the corresponding city's folder.

    Any programming language that can read CSV files can be used to access the dataset, and users can also open the CSV files directly. The get_dataset.ipynb file in this repository provides another way to access the dataset. To execute the .ipynb files, Jupyter Notebook with Python 3 is required, along with the following Python libraries:

    get_dataset.ipynb: os, pandas

    Data_cleaning_algorithm.ipynb: os, pandas, datetime, math

    Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the .ipynb files in Jupyter Notebook and follow the instructions inside. A minimal loading sketch follows.
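
    As an alternative to get_dataset.ipynb, the sketch below shows one way to read the device CSVs with os and pandas, assuming the folder layout described above (city folder, then Static or Mobile folder, then one CSV per device ID); it is an illustration, not the repository's code.

```python
# Read every device CSV for one city, assuming <city>/<Static|Mobile>/<id>.csv.
import os
import pandas as pd

def load_city(city_dir):
    frames = []
    for device_type in ("Static", "Mobile"):
        folder = os.path.join(city_dir, device_type)
        for name in os.listdir(folder):
            if not name.endswith(".csv"):
                continue
            df = pd.read_csv(os.path.join(folder, name))
            df["device_id"] = os.path.splitext(name)[0]   # file name = device ID
            df["device_type"] = device_type
            frames.append(df)
    return pd.concat(frames, ignore_index=True)

foshan = load_city("Foshan")   # hypothetical folder name for the Foshan data
print(foshan.head())
```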

    For questions or suggestions, please e-mail Xinlei Chen.

  10. Legality Without Justice: Symbolic Governance, Institutional Denial, and the...

    • zenodo.org
    bin, csv
    Updated Nov 6, 2025
    Cite
    Scott Brown; Scott Brown (2025). Legality Without Justice: Symbolic Governance, Institutional Denial, and the Ethical Foundations of Law [Dataset]. http://doi.org/10.5281/zenodo.16361108
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Scott Brown; Scott Brown
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:
    This dataset accompanies the empirical analysis in Legality Without Justice, a study examining the relationship between public trust in institutions and perceived governance legitimacy using data from the World Values Survey Wave 7 (2017–2022). It includes:

    • WVS_Cross-National_Wave_7_csv_v6_0.csv — World Values Survey Wave 7 core data.

    • GDP.csv — World Bank GDP per capita (current US$) for 2022 by country.

    • denial.ipynb — Fully documented Jupyter notebook with code for data merging, exploratory statistics, and ordinal logistic regression using OrderedModel. Includes GDP as a control for institutional trust and perceived governance.

    All data processing and analysis were conducted in Python using FAIR reproducibility principles and can be replicated or extended on Google Colab.

    DOI: 10.5281/zenodo.16361108
    License: Creative Commons Attribution 4.0 International (CC BY 4.0)
    Authors: Anon Annotator
    Publication date: 2025-07-23
    Language: English
    Version: 1.0.0
    Publisher: Zenodo
    Programming language: Python

    🔽 How to Download and Run on Google Colab

    Step 1: Open Google Colab

    Go to https://colab.research.google.com

    Step 2: Upload Files

    Click File > Upload notebook, and upload the denial.ipynb file.
    Also upload the CSVs (WVS_Cross-National_Wave_7_csv_v6_0.csv and GDP.csv) using the file browser on the left sidebar.

    Step 3: Adjust File Paths (if needed)

    In denial.ipynb, ensure file paths match:

```python
import pandas as pd

wvs = pd.read_csv('/content/WVS_Cross-National_Wave_7_csv_v6_0.csv')
gdp = pd.read_csv('/content/GDP.csv')
```

    Step 4: Run the Code

    Execute the notebook cells from top to bottom. You may need to install required libraries:

```python
!pip install statsmodels pandas numpy
```

    The notebook performs:

    • Data cleaning

    • Merging WVS and GDP datasets

    • Summary statistics

    • Ordered logistic regression to test if confidence in courts/police (Q57, Q58) predicts belief that the country is governed in the interest of the people (Q183), controlling for GDP.
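
    For orientation, here is a hedged sketch of the kind of ordered logit described above, fit with statsmodels' OrderedModel. The question codes (Q57, Q58, Q183) come from the description; the merge key and the GDP column name are assumptions, and denial.ipynb remains the authoritative implementation.

```python
# Ordered logit sketch: Q57/Q58 and GDP predicting Q183. The merge key and
# the GDP column name are assumptions; denial.ipynb is the authoritative code.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

wvs = pd.read_csv("WVS_Cross-National_Wave_7_csv_v6_0.csv")
gdp = pd.read_csv("GDP.csv")

# Assumed: both files share an ISO country-code column named "country_code"
data = wvs.merge(gdp, on="country_code", how="left")
data = data[["Q57", "Q58", "Q183", "gdp_per_capita"]].dropna()

model = OrderedModel(
    data["Q183"].astype(int),                     # ordered outcome
    data[["Q57", "Q58", "gdp_per_capita"]],       # predictors + GDP control
    distr="logit",
)
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```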

  11. Data from: Expanding drug targets for 112 chronic diseases using a machine...

    • zenodo.org
    Updated Feb 21, 2025
    Cite
    Robert Chen; Robert Chen; Ron Do; Ron Do (2025). Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score [Dataset]. http://doi.org/10.5281/zenodo.14905752
    Explore at:
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robert Chen; Robert Chen; Ron Do; Ron Do
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 21, 2025
    Description

    ML-GPS: Machine Learning-Assisted Genetic Priority Score

    This Zenodo repository contains data and code associated with the publication:

    Chen R, Duffy Á, Petrazzini BO, Vy HM, Stein D, Mort M, Park JK, Schlessinger A, Itan Y, Cooper DN, Jordan DM, Rocheleau G, Do R. Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score. Nat Commun. 2024 Oct 15;15(1):8891. doi: 10.1038/s41467-024-53333-y.

    Important notes

    Repository contents

    Files needed to train ML-GPS and ML-GPS DOE:

    • Files needed for Jupyter notebooks.zip: Data files required for preprocessing and training.
    • Jupyter notebooks.zip: Notebooks for cleaning data, training models, and generating predictions.

    Other files:

    • Predictions for all gene-phecode pairs.zip: ML-GPS and ML-GPS DOE scores for all analyzed gene-phecode pairs.
    • Summary statistics.zip: Genetic association summary statistics for all tested gene-phecode pairs.

    Updated performance metrics

    Model | Open Targets AUPRC | SIDER AUPRC
    ML-GPS (non-DOE) | 0.074 | 0.080
    ML-GPS DOE (activator predictions) | 0.029 | 0.042
    ML-GPS DOE (inhibitor predictions) | 0.067 | 0.064

    Zenodo versions

    • Version 4: Updated notebooks and external data to use Open Targets 2024.9; summary statistics are unchanged
    • Version 3: Corrected error where DOE for rare and ultrarare variants was incorrectly incorporated
    • Version 2: Original release accompanying the publication
  12. Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 7, 2024
    Cite
    Baboz, Eliott; Lavarenne, Jérémy (2024). Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA and RHoMIS datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10556265
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Centre de Coopération Internationale en Recherche Agronomique pour le Développement
    Authors
    Baboz, Eliott; Lavarenne, Jérémy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Africa, West Africa
    Description

    Description: The dataset represents a significant effort to compile and clean a comprehensive set of seasonal yield data for sub-Saharan West Africa (Benin, Burkina Faso, Mali, Niger). The dataset, covering more than 22,000 survey answers scattered across more than 2,500 unique locations of smallholder producers' household groups, is instrumental for researchers and policymakers working in agricultural planning and food security in the region. It integrates data from two sources, the LSMS-ISA program (link to the World Bank's site) and the RHoMIS dataset (link to RHoMIS files, RHoMIS' DOI).

    The construction of the dataset involved meticulous processes, including converting production into standardized units, calculating yields for each dataset, standardizing column names, assembling the data, and extensive data cleaning, making it, we hope, a robust and reliable resource for understanding spatial yield distribution in the region.

    Data Sources: The dataset comprises seven spatialized yield data sources, six of which are from the LSMS-ISA program (Mali 2014, Mali 2017, Mali 2018, Benin 2018, Burkina Faso 2018, Niger 2018) and one from the RHoMIS study (only Mali 2017 and Burkina Faso 2018 data selected).

    Dataset Preparation Methods: The preparation involved integration of machine-readable files, data cleaning, and finalization using Python/Jupyter Notebook. This process helped ensure the accuracy and consistency of the dataset. Yields were calculated from declared production quantities and GPS-measured plot areas; each yield value corresponds to a single plot (see the sketch below).
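
    A minimal sketch of the yield calculation just described (declared production divided by GPS-measured plot area); all file and column names here are illustrative assumptions.

```python
# Yield = declared production / GPS-measured plot area; names are illustrative.
import pandas as pd

plots = pd.read_csv("plots.csv")   # hypothetical merged survey file

plots = plots[plots["plot_area_ha"] > 0]                                  # guard against zero areas
plots["yield_kg_per_ha"] = plots["production_kg"] / plots["plot_area_ha"]
print(plots["yield_kg_per_ha"].describe())
```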

    Discussion: This dataset, with its extensive data compilation, presents an invaluable resource for agricultural-productivity studies in West Africa. However, users must navigate its complexities, including potential biases due to the survey design and to UML units, and data inconsistencies. The dataset's comprehensive nature requires careful handling and validation in research applications.

    Authors Contributions:

    Data treatment: Eliott Baboz, Jérémy Lavarenne.

    Documentation: Jérémy Lavarenne.

    Funding: This project was funded by the INTEN-SAHEL TOSCA project (Centre national d’études spatiales). "123456789" was chosen randomly and is not the actual award number because there is none, but it was mandatory to put one here on Zenodo.

    Changelog:

    v1.0.0 : initial submission

  13. divvy's Trip (Cyclist bike share analysis)

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    katabathina jyoshnavi (2024). divvy's Trip (Cyclist bike share analysis) [Dataset]. https://www.kaggle.com/datasets/katabathinajyoshnavi/divvys-trip-cyclist-bike-share-analysis
    Explore at:
    Available download formats: zip (194213174 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    katabathina jyoshnavi
    Description

    Introduction:

    About the Company:

    Cyclistic is a bike-sharing company in Chicago, which has since expanded to include a fleet of 5,824 geotracked bicycles stationed at 692 locations across Chicago. The bikes can be unlocked at one station and returned to any other station within the network at any time. Individuals buying single-ride or full-day passes fall into the category of casual riders, while those acquiring annual memberships become recognized as Cyclistic members. Tools and Technologies: ⦁ Tableau/Power BI for dashboard development. ⦁ Python for data analysis

    Phase 1: About the Dataset: The data is publicly available on an AWS server. We were tasked with working with an entire year of data, so I downloaded zipped files (CSV format) containing data from January 2023 to December 2023, one file for each month.

    Data Structure: Each .csv file contains a table with 13 columns of varying data types. Each column is a field that describes how people use Cyclistic's bike-sharing service, and each row represents one ride with its details. ⦁ ride_id: A unique identifier assigned to each bike ride, like a reference number for the trip. ⦁ rideable_type: The type of bike used in the ride, either "electric_bike" or "classic_bike". ⦁ started_at: The date and time when the ride began, in the format YYYY-MM-DD HH:MM:SS. ⦁ ended_at: The date and time when the ride ended, in the same format as started_at. ⦁ start_station_name: The name of the docking station where the ride started. ⦁ start_station_id: A unique identifier for the starting docking station, complementing start_station_name. ⦁ start_lat: The latitude coordinate of the starting docking station. ⦁ start_lng: The longitude coordinate of the starting docking station; these coordinates are useful for mapping the station's location. ⦁ end_station_name: The name of the docking station where the ride ended. ⦁ end_station_id: A unique identifier for the ending docking station, complementing end_station_name. ⦁ end_lat: The latitude coordinate of the ending docking station. ⦁ end_lng: The longitude coordinate of the ending docking station; these coordinates are useful for mapping the station's location. ⦁ member_casual: Whether the rider was an annual member (member) or a casual user (casual) of the bike-sharing service.

    Phase 2: I used Python for data cleaning. You can view the Jupyter Notebook for the Process phase here. These are the steps I took during this phase (a sketch follows this paragraph): ⦁ Check for nulls and duplicates ⦁ Add columns and transform data (change data types, remove trailing or leading spaces, etc.) ⦁ Extract data for analysis. Data Cleaning Result: Total row count before data cleaning: 5,745,324. Total row count after data cleaning: 4,268,747.
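
    Below is a hedged sketch of the Phase 2 cleaning steps, using the column names from the data structure above; the monthly file name is illustrative and the exact rules in the original notebook may differ.

```python
# Phase 2 cleaning sketch; column names come from the description above,
# the monthly file name is illustrative.
import pandas as pd

rides = pd.read_csv("202301-divvy-tripdata.csv")

rides = rides.drop_duplicates(subset="ride_id")
rides["started_at"] = pd.to_datetime(rides["started_at"])
rides["ended_at"] = pd.to_datetime(rides["ended_at"])

# Strip stray whitespace from text columns
for col in rides.select_dtypes(include="object"):
    rides[col] = rides[col].str.strip()

# Derived fields used in the analysis
rides["ride_minutes"] = (rides["ended_at"] - rides["started_at"]).dt.total_seconds() / 60
rides["day_of_week"] = rides["started_at"].dt.day_name()

# Drop impossible durations and rows missing key fields
rides = rides[rides["ride_minutes"] > 0].dropna(subset=["member_casual"])
print(len(rides))
```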

    Phase 3: Analyze: I used Python in my Jupyter Notebook to look at the large dataset we cleaned earlier. I came up with questions to figure out how casual riders differ from annual members, then wrote queries to get the answers, helping us understand more and make decisions based on the data. Questions: Here are the questions we will answer in this phase: ⦁ What is the percentage of each user type out of total users? ⦁ Is there a bike type preferred by different user types? ⦁ Which bike type has the longest trip duration between users? ⦁ What is the average trip duration per user type? ⦁ What is the average distance traveled per user type? ⦁ On which days are most users active? ⦁ In which months or seasons of the year do users tend to use the bike-sharing service?

    I used Tableau Public to make the visualization. You can view the data visualization for the Share phase here: https://public.tableau.com/app/profile/katabathina.jyoshnavi/viz/divvytripvisualisation/Dashboard7.

    Findings ⦁ 63% of the total Cyclistic users are annual members, while 36% are casual riders. ⦁ Both annual members and casual riders prefer classic bikes; only casual riders use docked bikes. ⦁ Generally, casual riders have the longest average ride duration (23 minutes) compared with annual members (18 minutes). ⦁ Both annual members and casual riders have almost the same average distance traveled. ⦁ Docked bikes, which only casual riders use, have the longest average ride duration; classic bikes have the longest average ride duration for annual members. ⦁ Most trips are recorded on Saturday. ⦁ There are more trips during spring and the fewest during winter.

  14. Five years of quality-controlled meteorological surface data at Oak Ridge...

    • zenodo.org
    bin, zip
    Updated Apr 21, 2025
    Cite
    Morgan Steckler; Morgan Steckler; Xiao-Ying Yu; Xiao-Ying Yu; Kevin Birdwell; Kevin Birdwell; haowen xu; haowen xu (2025). Five years of quality-controlled meteorological surface data at Oak Ridge Reserve in Tennessee [Dataset]. http://doi.org/10.5281/zenodo.14744006
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Morgan Steckler; Morgan Steckler; Xiao-Ying Yu; Xiao-Ying Yu; Kevin Birdwell; Kevin Birdwell; haowen xu; haowen xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Oak Ridge, Tennessee
    Description

    Access to continuous, quality assessed meteorological data is critical for understanding the climatology and atmospheric dynamics of a region. Research facilities like Oak Ridge National Laboratory (ORNL) rely on such data to assess site-specific climatology, model potential emissions, establish safety baselines, and prepare for emergency scenarios. To meet these needs, on-site towers at ORNL collect meteorological data at 15-minute and hourly intervals. However, data measurements from meteorological towers are affected by sensor sensitivity, degradation, lightning strikes, power fluctuations, glitching, and sensor failures, all of which can affect data quality. To address these challenges, we conducted a comprehensive quality assessment and processing of five years of meteorological data collected from ORNL at 15-minute intervals, including measurements of temperature, pressure, humidity, wind, and solar radiation. The time series of each variable was pre-processed and gap-filled using established meteorological data collection and cleaning techniques, i.e., the time series were subjected to structural standardization, data integrity testing, automated and manual outlier detection, and gap-filling. The data product and highly generalizable processing workflow developed in Python Jupyter notebooks are publicly accessible online. As a key contribution of this study, the evaluated 5-year data will be used to train atmospheric dispersion models that simulate dispersion dynamics across the complex ridge-and-valley topography of the Oak Ridge Reservation in East Tennessee.
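
    To illustrate the kind of range checking and short-gap filling described above (not the published workflow itself), here is a minimal pandas sketch for one 15-minute variable; the file name, column name, and plausible-range limits are assumptions.

```python
# Range check + short-gap filling for one 15-minute variable (illustrative only).
import pandas as pd

met = pd.read_csv("ornl_tower_15min.csv", parse_dates=["timestamp"])  # hypothetical
met = met.set_index("timestamp").asfreq("15min")   # enforce a regular time grid

temp = met["air_temperature_c"]                    # assumed column name
temp = temp.where(temp.between(-40, 50))           # flag physically implausible values
met["air_temperature_c_qc"] = temp.interpolate(method="time", limit=8)  # fill gaps up to 2 h

print(temp.isna().sum(), "flagged or missing points before filling")
```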

  15. Shopping Mall Customer Data Segmentation Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis
    Explore at:
    Available download formats: zip (5890828 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    DataZng
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demographic Analysis of Shopping Behavior: Insights and Recommendations

    Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

    Cleaned Data Details: The data was cleaned and standardized into 15,079 unique entries with attributes including Customer ID, age, gender, annual income, and spending score. It can be used by marketing analysts to produce a better strategy for mall-specific marketing. A segmentation sketch follows.
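
    As an illustration of how the cleaned attributes might feed a segmentation, here is a hedged scikit-learn sketch; the file and column names are assumptions, and the choice of five clusters is arbitrary.

```python
# Segmentation sketch over age, income, and spending score; names are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("shopping_mall_customers.csv")   # hypothetical file name
features = customers[["Age", "Annual Income", "Spending Score"]]

scaled = StandardScaler().fit_transform(features)
customers["segment"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(scaled)

print(customers.groupby("segment")[["Age", "Annual Income", "Spending Score"]].mean())
```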

    Challenges Faced: 1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention. 2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort. 3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

    Research Topics: 1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions. 2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

    Suggestions for Project Expansion: 1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights. 2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis. 3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking. This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

    References OpenAI. (2022). ChatGPT [Computer software]. Retrieved from https://openai.com/chatgpt. Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. Retrieved from https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook Pandas-Datareader. (n.d.). Retrieved from https://pypi.org/project/pandas-datareader/

  16. Social Media Customer Analysis

    • kaggle.com
    zip
    Updated Apr 16, 2021
    Cite
    Nafe Muhtasim (2021). Social Media Customer Analysis [Dataset]. https://www.kaggle.com/nafemuhtasim/social-media-customer-analysis
    Explore at:
    Available download formats: zip (108529 bytes)
    Dataset updated
    Apr 16, 2021
    Authors
    Nafe Muhtasim
    Description

    This is the data of a social media platform of an organization. You have been hired by the organization & given their social media data to analyze, visualize and prepare a report on it.

    You are required to prepare a neat notebook using Jupyter Notebook/JupyterLab or Google Colab. Then zip everything, including the notebook file (.ipynb) and the dataset, and upload it through the Google Forms link stated below. The notebook should be neat, containing code with details about what it does, visualizations, and a description of your purpose for each task.

    You are encouraged, but not limited, to go through general steps such as data cleaning, data preparation, exploratory data analysis (EDA), finding correlations, feature extraction, and more. (There is no limit to your skills and ideas.)

    After doing what needs to be done, you are to give your organization insights and facts. For example, are they reaching more audiences on weekends? Does posting content on weekdays turn out to be more effective? Does posting many pieces of content on the same day make more sense? Or should they post content regularly and keep day-to-day consistency? Did you find any trend patterns in the data? What is your advice after completing the analysis? Mention these clearly at the end of the notebook. (These are just a few examples; your findings may be entirely different, and that is totally acceptable.)

    Note that we will value clear documentation stating clear insights from the analysis of the data and the visualizations more than anything else. It will not matter how complex the methods you apply are if they ultimately do not find anything useful.

  17. Financial ratios 4 Nasdaq 100 membrs + 12m returns

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Cite
    SheepBoss (2023). Financial ratios 4 Nasdaq 100 membrs + 12m returns [Dataset]. https://www.kaggle.com/datasets/mlcapital/financial-ratios-4-nasdaq-100-membrs-12m-returns/code
    Explore at:
    Available download formats: zip (212319 bytes)
    Dataset updated
    Jun 7, 2023
    Authors
    SheepBoss
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A Python module using Jupyter Notebooks to take an existing dataset available at Kaggle and undertake some data cleansing, data hard coding and data science management so it can be more useful for Machine Learning models. Source of original dataset: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data

    Introduction: The problem we are trying to solve is that there are very limited datasets on Kaggle if you wish to apply ML models to the problem of individual stock share price prediction using financial statement ratios as your input data. This is a problem that needs addressing, as there is a multi-billion-dollar global fundamental financial ratio investment analysis industry that is ripe for performance enhancement by machine learning. We believe the best dataset for this purpose on Kaggle was the dataset linked above. The problems with this dataset for ML model use were as follows (a minimal sketch of the fill and labeling steps follows this description): • A number of data attributes were not reported across every annual period; we removed attributes that were not populated across all the annual periods. • We filled in missing data and replaced NaNs and INFs with logical and reasonable fill values. • We attached label data, namely 12-month-ahead share price returns for each stock and each annual period, provided both as discrete percentage returns and as binary labels for outperformance or underperformance relative to the Nasdaq 100 index.
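
    A minimal sketch of the two preparation steps just described (replacing NaNs/INFs and attaching a binary outperformance label); the file name, column names, and the median fill rule are illustrative assumptions, not the notebook's exact choices.

```python
# Replace NaN/INF values and attach a binary outperformance label (illustrative).
import numpy as np
import pandas as pd

ratios = pd.read_csv("nasdaq100_fundamentals.csv")   # hypothetical file name

# Replace infinities, then fill remaining gaps with per-column medians
ratios = ratios.replace([np.inf, -np.inf], np.nan)
numeric_cols = ratios.select_dtypes(include="number").columns
ratios[numeric_cols] = ratios[numeric_cols].fillna(ratios[numeric_cols].median())

# Binary label: did the stock's 12-month return beat the Nasdaq 100 index?
ratios["label"] = (ratios["return_12m"] > ratios["index_return_12m"]).astype(int)
```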

    Resulting Datasets The resulting datasets cover 102 stocks using 39 financial ratios across both 4 and 5 year periods using two different types of labels.

    In summary, this repository provides a Jupyter Notebook that shows the steps undertaken to generate:

    Two datasets for 2017 to 2021 with the Y labels attached in the final column: • labels 1 or 0: binary outperformance against the index. • perfs labels: actual performance of the stock for that calendar year. And two more datasets for 2017 to 2020 with the same Y label data as above: • labels 1 or 0: binary outperformance against the index. • perfs labels: actual performance of the stock for that calendar year.

    Usage & Contributing: At the moment the project is in development. You can use the repository and play with the Jupyter Notebook to generate your own datasets with assumptions differing from ours. We will then load up some ML models that we think can be the most effective at predicting 12-month-forward share price outcomes based on the 39 financial ratios provided. We would welcome your thoughts on our models. Even better, we would welcome YOUR ideas on the best models for solving such a prediction problem using these datasets. You can always help get this problem solved. It's an open-source project, after all!

    Resources • Kaggle: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data • Jupyter Notebooks: https://jupyter.org/ • Yfinance: https://pypi.org/project/yfinance/

  18. Electric Vehicle Population Analysis

    • kaggle.com
    zip
    Updated Jun 23, 2025
    Cite
    Nibedita Sahu (2025). Electric Vehicle Population Analysis [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/electric-vehicle-population-analysis
    Explore at:
    Available download formats: zip (10564209 bytes)
    Dataset updated
    Jun 23, 2025
    Authors
    Nibedita Sahu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Electric Vehicle Population Analysis

    A data-driven end-to-end analysis of Electric Vehicle adoption, performance, and policy alignment across Washington State. This project covers everything from data cleaning and exploration to visualization and presentation — using SQL, Python, and Power BI.

    Tools & Technologies

    • SQL (MySQL): Data cleaning, filtering, type conversion, preprocessing
    • Python (Jupyter Notebook): Pandas, SQLAlchemy, NumPy, Matplotlib, Seaborn
    • Pandas Profiling / YData EDA: Automated EDA for in-depth data profiling
    • Power BI: Interactive, multi-page report design and visual analysis
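
    As a quick illustration of the automated EDA step listed above, here is a hedged ydata-profiling sketch; the CSV name is an assumption and the report options are just sensible defaults.

```python
# One-shot automated EDA report; the CSV name is an assumption.
import pandas as pd
from ydata_profiling import ProfileReport

ev = pd.read_csv("Electric_Vehicle_Population_Data.csv")
ProfileReport(ev, title="EV population profile", minimal=True).to_file("ev_profile.html")
```
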
  19. Data from: Spatial and temporal variation in the value of solar power across...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Brown, Patrick R. (2020). Spatial and temporal variation in the value of solar power across United States electricity markets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3562895
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    MIT Energy Initiative
    Authors
    Brown, Patrick R.
    Area covered
    United States
    Description

    This repository includes python scripts and input/output data associated with the following publication:

    [1] Brown, P.R.; O'Sullivan, F. "Spatial and temporal variation in the value of solar power across United States Electricity Markets". Renewable & Sustainable Energy Reviews 2019. https://doi.org/10.1016/j.rser.2019.109594

    Please cite reference [1] for full documentation if the contents of this repository are used for subsequent work.

    Many of the scripts, data, and descriptive text in this repository are shared with the following publication:

    [2] Brown, P.R.; O'Sullivan, F. "Shaping photovoltaic array output to align with changing wholesale electricity price profiles". Applied Energy 2019, 256, 113734. https://doi.org/10.1016/j.apenergy.2019.113734

    All code is in python 3 and relies on a number of dependencies that can be installed using pip or conda.

    Contents

    pvvm/*.py : Python module with functions for modeling PV generation and calculating PV energy revenue, capacity value, and emissions offset.

    notebooks/*.ipynb : Jupyter notebooks, including:

    pvvm-vos-data.ipynb: Example scripts used to download and clean input LMP data, determine LMP node locations, assign nodes to capacity zones, download NSRDB input data, and reproduce some figures in [1]

    pvvm-example-generation.ipynb: Example scripts demonstrating the use of the PV generation model and a sensitivity analysis of PV generator assumptions

    pvvm-example-plots.ipynb: Example scripts demonstrating different plotting functions

    validate-pv-monthly-eia.ipynb: Scripts and plots for comparing modeled PV generation with monthly generation reported in EIA forms 860 and 923, as discussed in SI Note 3 of [1]

    validate-pv-hourly-pvdaq.ipynb: Scripts and plots for comparing modeled PV generation with hourly generation reported in NREL PVDAQ database, as discussed in SI Note 3 of [1]

    pvvm-energyvalue.ipynb: Scripts for calculating the wholesale energy market revenues of PV and reproducing some figures in [1]

    pvvm-capacityvalue.ipynb: Scripts for calculating the capacity credit and capacity revenues of PV and reproducing some figures in [1]

    pvvm-emissionsvalue.ipynb: Scripts for calculating the emissions offset of PV and reproducing some figures in [1]

    pvvm-breakeven.ipynb: Scripts for calculating the breakeven upfront cost and carbon price for PV and reproducing some figures in [1]

    html/*.html : Static images of the above Jupyter notebooks for viewing without a python kernel

    data/lmp/*.gz : Day-ahead nodal locational marginal prices (LMPs) and marginal costs of energy (MCE), congestion (MCC), and losses (MCL) for CAISO, ERCOT, MISO, NYISO, and ISONE.

    At the time of publication of this repository, permission had not been received from PJM to republish their LMP data. If permission is received in the future, a new version of this repository will be linked here with the complete dataset.

    results/*.csv.gz : Simulation results associated with [1], including modeled energy revenue, capacity credit and revenue, emissions offsets, and breakeven costs for PV systems at all LMP nodes

    Data notes

    ISO LMP data are used with permission from the different ISOs. Adapting the MIT License (https://opensource.org/licenses/MIT), "The data are provided 'as is', without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or sources be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the data or other dealings with the data." Copyright and usage permissions for the LMP data are available on the ISO websites, linked below.

    ISO-specific notes on LMP data:

    CAISO data from http://oasis.caiso.com/mrioasis/logon.do are used pursuant to the terms at http://www.caiso.com/Pages/PrivacyPolicy.aspx#TermsOfUse.

    ERCOT data are from http://www.ercot.com/mktinfo/prices.

    MISO data are from https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/ and https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/market-report-archives/.

    PJM data were originally downloaded from https://www.pjm.com/markets-and-operations/energy/day-ahead/lmpda.aspx and https://www.pjm.com/markets-and-operations/energy/real-time/lmp.aspx. At the time of this writing these data are currently hosted at https://dataminer2.pjm.com/feed/da_hrl_lmps and https://dataminer2.pjm.com/feed/rt_hrl_lmps.

    NYISO data from http://mis.nyiso.com/public/ are used subject to the disclaimer at https://www.nyiso.com/legal-notice.

    ISONE data are from https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-da-hourly and https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-rt-hourly-final. The Material is provided on an "as is" basis. ISO New England Inc., to the fullest extent permitted by law, disclaims all warranties, either express or implied, statutory or otherwise, including but not limited to the implied warranties of merchantability, non-infringement of third parties' rights, and fitness for particular purpose. Without limiting the foregoing, ISO New England Inc. makes no representations or warranties about the accuracy, reliability, completeness, date, or timeliness of the Material. ISO New England Inc. shall have no liability to you, your employer or any other third party based on your use of or reliance on the Material.

    Data workup: LMP data were downloaded directly from the ISOs using scripts similar to the pvvm.data.download_lmps() function (see below for caveats), then repackaged into single-node single-year files using the pvvm.data.nodalize() function. These single-node single-year files were then combined into the dataframes included in this repository, using the procedure shown in the pvvm-vos-data.ipynb notebook for MISO. We provide these yearly dataframes, rather than the long-form data, to minimize file size and number. These dataframes can be unpacked into the single-node files used in the analysis using the pvvm.data.copylmps() function.
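
    For a first look at the simulation results, the generic sketch below loads the results/*.csv.gz tables with pandas; the column layout is not documented here, so inspect it before relying on it, and prefer the notebooks above for the supported workflow.

```python
# Peek at the simulation result tables; column layout is not documented here.
import glob

import pandas as pd

for path in glob.glob("results/*.csv.gz"):
    df = pd.read_csv(path)               # pandas infers gzip compression from .gz
    print(path, df.shape)
    print(df.columns.tolist()[:10])      # first few column names
```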

    Usage notes

    Code is provided under the MIT License, as specified in the pvvm/LICENSE file and at the top of each *.py file.

    Updates to the code, if any, will be posted in the non-static repository at https://github.com/patrickbrown4/pvvm_vos. The code in the present repository has the following version-specific dependencies:

    matplotlib: 3.0.3

    numpy: 1.16.2

    pandas: 0.24.2

    pvlib: 0.6.1

    scipy: 1.2.1

    tqdm: 4.31.1

    To use the NSRDB download functions, you will need to modify the "settings.py" file to insert a valid NSRDB API key, which can be requested from https://developer.nrel.gov/signup/. Locations can be specified by passing (latitude, longitude) floats to pvvm.data.downloadNSRDBfile(), or by passing a string googlemaps query to pvvm.io.queryNSRDBfile(). To use the googlemaps functionality, you will need to request a googlemaps API key (https://developers.google.com/maps/documentation/javascript/get-api-key) and insert it in the "settings.py" file.

    Note that many of the ISO websites have changed in the time since the functions in the pvvm.data module were written and the LMP data used in the above papers were downloaded. As such, the pvvm.data.download_lmps() function no longer works for all ISOs and years. We provide this function to illustrate the general procedure used, and do not intend to maintain it or keep it up to date with the changing ISO websites. For up-to-date functions for accessing ISO data, the following repository (no connection to the present work) may be helpful: https://github.com/catalyst-cooperative/pudl.

  20. Spotify-Dataset_for_Self_practise

    • kaggle.com
    zip
    Updated Feb 24, 2025
    Cite
    Sonal Anand (2025). Spotify-Dataset_for_Self_practise [Dataset]. https://www.kaggle.com/datasets/sonalanand/spotify-dataset-for-self-practise/data
    Explore at:
    Available download formats: zip (48187 bytes)
    Dataset updated
    Feb 24, 2025
    Authors
    Sonal Anand
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🎵 Unveiling Spotify Trends: A Deep Dive into Streaming Data:

    Introduction:

    This Jupyter Notebook explores data manipulation, aggregation, and visualization techniques using Python’s Pandas, Matplotlib, and Seaborn libraries. The key objectives of this analysis include:

    📌 Data Cleaning and Preparation ✔ Handling missing values in key columns. ✔ Standardizing and transforming categorical features (e.g., mode, release_day_name). ✔ Creating new derived features, such as decade classification and energy levels.

    📌 Feature Engineering & Data Transformation ✔ Extracting release trends from date-based columns. ✔ Categorizing song durations and popularity levels dynamically. ✔ Applying lambda functions, apply(), map(), and filter() for efficient data transformations. ✔ Using groupby() and aggregation functions to analyze trends in song streams. ✔ Ranking artists based on total streams using rank().

    📌 Data Aggregation and Trend Analysis ✔ Identifying the most common musical keys used in songs. ✔ Tracking song releases over time with rolling averages. ✔ Comparing Major vs. Minor key distributions in song compositions.

    📌 Data Visualization ✔ Bar plots for ranking top artists and stream counts. ✔ Box plots to analyze stream distribution per release year. ✔ Heatmaps to examine feature correlations. ✔ Pie charts to understand song popularity distribution.

    📌 Dataset Description The dataset consists of Spotify streaming statistics and includes features such as:

    🎵 track_name – Song title. 🎤 artist(s)_name – Name(s) of performing artists. 🔢 streams – Number of times the song was streamed. 📅 released_year, released_month, released_day – Date of song release. 🎼 energy_%, danceability_%, valence_% – Audio feature metrics. 📊 in_spotify_playlists – Number of Spotify playlists featuring the song. 🎹 mode – Musical mode (Major or Minor). 🎯 Purpose This analysis is designed for: ✔ Exploring real-world datasets to develop data analyst skills. ✔ Practicing data transformation, aggregation, and visualization techniques. ✔ Preparing for data analyst interviews by working with structured workflows.

    📌 Table of Contents 1️⃣ Data Cleaning & Preparation 2️⃣ Feature Engineering & Transformations (apply(), map(), filter(), groupby(), rank()) 3️⃣ Data Aggregation & Trend Analysis 4️⃣ Data Visualization & Insights 5️⃣ Conclusion and Key Takeaways
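
    A hedged sketch of the groupby()/rank() steps listed above, using column names from the dataset description; the file name is an assumption, and streams is coerced to numeric since it sometimes loads as text.

```python
# groupby()/rank() sketch; file name is an assumption, columns from the description.
import pandas as pd

songs = pd.read_csv("spotify.csv")   # hypothetical file name
songs["streams"] = pd.to_numeric(songs["streams"], errors="coerce")

artist_totals = songs.groupby("artist(s)_name", as_index=False)["streams"].sum()
artist_totals["rank"] = artist_totals["streams"].rank(ascending=False, method="dense")
print(artist_totals.sort_values("rank").head(10))
```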
