30 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler; Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler; Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The python code for splitting a dataset into train, test and validation set.
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The python code for translating the cleaned dataset.
  2. NoCORA - Northern Cameroon Observed Rainfall Archive

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 10, 2024
    Cite
    Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel (2024). NoCORA - Northern Cameroon Observed Rainfall Archive [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10156437
    Explore at:
    Dataset updated
    Jul 10, 2024
    Dataset provided by
    Center for International Forestry Research
    Centre de Coopération Internationale en Recherche Agronomique pour le Développement
    Authors
    Lavarenne, Jérémy; Nenwala, Victor Hugo; Foulna Tcheobe, Carmel
    Area covered
    Cameroon, North Region
    Description

    Description: The NoCORA dataset represents a significant effort to compile and clean a comprehensive set of daily rainfall data for Northern Cameroon (North and Extreme North regions). The dataset, covering more than 1 million observations across 418 rainfall stations over a temporal range from 1927 to 2022, is instrumental for researchers, meteorologists, and policymakers working in climate research, agricultural planning, and water resource management in the region. It integrates data from diverse sources, including Sodecoton rain funnels, the archive of Robert Morel (IRD), Centrale de Lagdo, the GHCN daily service, and the TAHMO network. The construction of NoCORA involved meticulous processes, including manual assembly of data, extensive data cleaning, and standardization of station names and coordinates, making it, we hope, a robust and reliable resource for understanding climatic dynamics in Northern Cameroon.

    Data Sources: The dataset comprises eight primary rainfall data sources and a comprehensive coordinates dataset. The rainfall data sources include extensive historical and contemporary measurements, while the coordinates dataset was developed using reference data and an inference strategy for variant station names or missing coordinates.

    Dataset Preparation Methods: The preparation involved manual compilation, integration of machine-readable files, data cleaning with OpenRefine, and finalization using Python/Jupyter Notebook. This process helped ensure the accuracy and consistency of the dataset.

    Discussion: NoCORA, with its extensive data compilation, presents an invaluable resource for climate-related studies in Northern Cameroon. However, users must navigate its complexities, including missing data interpretations, potential biases, and data inconsistencies. The dataset's comprehensive nature and historical span require careful handling and validation in research applications.

    Access to Dataset: The NoCORA dataset, while a comprehensive resource for climatological and meteorological research in Northern Cameroon, is subject to specific access conditions due to its compilation from various partner sources. The original data sources vary in their openness and accessibility, and not all partners have confirmed the open-access status of their data. To ensure compliance with these varying conditions, access to the NoCORA dataset is granted on a request basis. Interested researchers and users are encouraged to contact us for permission to access the dataset. This process allows us to uphold the data sharing agreements with our partners while facilitating research and analysis within the scientific community.

    Authors Contributions: Data treatment: Victor Hugo Nenwala, Carmel Foulna Tcheobe, Jérémy Lavarenne. Documentation: Jérémy Lavarenne. Funding: This project was funded by the DESIRA INNOVACC project.

    Changelog:
    v1.0.2: corrected swapped column names in the coordinates dataset
    v1.0.1: dataset specification file updated with complementary information regarding station locations
    v1.0.0: initial submission

  3. Flipkart Laptops Data

    • kaggle.com
    zip
    Updated Jun 6, 2022
    Cite
    Abdul Hannan Ansari (2022). Flipkart Laptops Data [Dataset]. https://www.kaggle.com/datasets/ansariabdulhannan/flipkart-laptops-data
    Explore at:
    Available download formats: zip (18601 bytes)
    Dataset updated
    Jun 6, 2022
    Authors
    Abdul Hannan Ansari
    Description

    This dataset contains 40 laptops scraped from the Flipkart website using Python code. I want you to clean this dataset and analyze it like a data analyst would. I know this dataset is quite small; I plan to make it messier and bigger later on as the need increases. If you have analyzed the data, make sure to upload your Jupyter notebook for it. All the Best!

    ABOUT UPDATED DATA: I have updated the data by adding more content to make it interesting and messier. You can clean the dataset and do further work on it. Note that the prices are in Indian Rupees (INR). A minimal cleaning sketch follows.
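
    The description above invites a first cleaning pass, so here is a hedged sketch of what that might look like with pandas. The file name and the "price" column are assumptions, not part of the dataset's documentation.

```python
# A possible first cleaning pass; file and column names are assumptions.
import pandas as pd

df = pd.read_csv("flipkart_laptops.csv")            # hypothetical file name

df = df.drop_duplicates()                           # drop repeated scrape rows
df.columns = df.columns.str.strip().str.lower()     # normalize header names

# Prices scraped as text like "45,990" -> numeric INR
df["price"] = pd.to_numeric(
    df["price"].astype(str).str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

print(df.isna().sum())                              # inspect remaining gaps
```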

  4. S&P 500 Companies Analysis Project

    • kaggle.com
    zip
    Updated Apr 6, 2025
    Cite
    anshadkaggle (2025). S&P 500 Companies Analysis Project [Dataset]. https://www.kaggle.com/datasets/anshadkaggle/s-and-p-500-companies-analysis-project
    Explore at:
    Available download formats: zip (9721576 bytes)
    Dataset updated
    Apr 6, 2025
    Authors
    anshadkaggle
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This project focuses on analyzing the S&P 500 companies using data analysis tools like Python (Pandas), SQL, and Power BI. The goal is to extract insights related to sectors, industries, locations, and more, and visualize them using dashboards.

    Included Files:

    sp500_cleaned.csv – Cleaned dataset used for analysis

    sp500_analysis.ipynb – Jupyter Notebook (Python + SQL code)

    dashboard_screenshot.png – Screenshot of Power BI dashboard

    README.md – Summary of the project and key takeaways

    This project demonstrates practical data cleaning, querying, and visualization skills.
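
    As an illustration of the kind of sector-level breakdown described above, here is a minimal pandas sketch. The file name sp500_cleaned.csv comes from the file list; the "Sector" column name is an assumption and may differ in the actual CSV.

```python
# Sector breakdown sketch; the "Sector" column name is an assumption.
import pandas as pd

sp500 = pd.read_csv("sp500_cleaned.csv")

# Companies per sector, largest first
sector_counts = (
    sp500.groupby("Sector", as_index=False)
         .size()
         .sort_values("size", ascending=False)
)
print(sector_counts.head(10))

# Export for use in a Power BI page
sector_counts.to_csv("sector_summary.csv", index=False)
```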

  5. Surveys of Data Professionals (Alex the Analyst)

    • kaggle.com
    zip
    Updated Nov 27, 2023
    Cite
    Stewie (2023). Surveys of Data Professionals (Alex the Analyst) [Dataset]. https://www.kaggle.com/datasets/alexenderjunior/surveys-of-data-professionals-alex-the-analyst
    Explore at:
    Available download formats: zip (81050 bytes)
    Dataset updated
    Nov 27, 2023
    Authors
    Stewie
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    [Dataset Name] - About This Dataset

    Overview

    This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.

    Context

    The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.

    Content

    The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].

    Acknowledgements

    The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.

    Citation

    If you use this dataset in your work, please cite it as follows:

    How to Use

    1. Download the dataset from this link.
    2. Explore the Jupyter Notebook in the associated repository for insights into the data cleaning process.

    Feel free to reach out for any additional information or clarification. Happy analyzing!

  6. Diwali_Sales_Data_Analysis

    • kaggle.com
    Updated Aug 6, 2023
    Cite
    Adimadapala Geetika (2023). Diwali_Sales_Data_Analysis [Dataset]. https://www.kaggle.com/datasets/adimadapalageetika/diwali-sales-data-analysis
    Explore at:
    Croissant: a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Adimadapala Geetika
    Description

    Completed Jupyter Notebook project: conducted data cleaning and exploratory data analysis with pandas, matplotlib, and seaborn. Enhanced customer experience by identifying potential customers based on demographics, and improved sales by optimizing inventory planning through product analysis. A brief EDA sketch follows.
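
    A hedged sketch of the cleaning and grouped-EDA steps mentioned above, using pandas and seaborn; the file name and the "Amount", "Age Group", and "Gender" column names are assumptions about the Diwali sales data and may need adjusting.

```python
# EDA sketch; file name and column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sales = pd.read_csv("Diwali Sales Data.csv", encoding="latin1")  # hypothetical name
sales = sales.dropna(subset=["Amount"])        # drop rows without a sale amount

# Revenue by demographic segment
by_segment = sales.groupby(["Age Group", "Gender"], as_index=False)["Amount"].sum()

sns.barplot(data=by_segment, x="Age Group", y="Amount", hue="Gender")
plt.title("Diwali sales by age group and gender")
plt.tight_layout()
plt.show()
```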

  7. Python Codes for Data Analysis of The Impact of COVID-19 on Technical...

    • dataverse.harvard.edu
    • figshare.com
    Updated Mar 21, 2022
    Cite
    Elizabeth Szkirpan (2022). Python Codes for Data Analysis of The Impact of COVID-19 on Technical Services Units Survey Results [Dataset]. http://doi.org/10.7910/DVN/SXMSDZ
    Explore at:
    Croissant: a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Elizabeth Szkirpan
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Copies of Anaconda 3 Jupyter Notebooks and Python script for holistic and clustered analysis of "The Impact of COVID-19 on Technical Services Units" survey results. Data was analyzed holistically using cleaned and standardized survey results and by library type clusters. To streamline data analysis in certain locations, an off-shoot CSV file was created so data could be standardized without compromising the integrity of the parent clean file. Three Jupyter Notebooks/Python scripts are available in relation to this project: COVID_Impact_TechnicalServices_HolisticAnalysis (a holistic analysis of all survey data) and COVID_Impact_TechnicalServices_LibraryTypeAnalysis (a clustered analysis of impact by library type, clustered files available as part of the Dataverse for this project).

  8. Sample Park Analysis

    • figshare.com
    zip
    Updated Nov 2, 2025
    Cite
    Eric Delmelle (2025). Sample Park Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.30509021.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 2, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eric Delmelle
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    README – Sample Park Analysis

    ## Overview
    This repository contains a Google Colab / Jupyter notebook and accompanying dataset used for analyzing park features and associated metrics. The notebook demonstrates data loading, cleaning, and exploratory analysis of the Hope_Park_original.csv file.

    ## Contents
    - sample park analysis.ipynb — The main analysis notebook (Colab/Jupyter format)
    - Hope_Park_original.csv — Source dataset containing park information
    - README.md — Documentation for the contents and usage

    ## Usage
    1. Open the notebook in Google Colab or Jupyter.
    2. Upload the Hope_Park_original.csv file to the working directory (or adjust the file path in the notebook).
    3. Run each cell sequentially to reproduce the analysis.

    ## Requirements
    The notebook uses standard Python data science libraries:
    ```python
    pandas
    numpy
    matplotlib
    seaborn
    ```
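
    A minimal quick-start matching the README's usage steps; only the Hope_Park_original.csv file name comes from the README, the rest is a generic first look.

```python
# First look at the park data; only the file name comes from the README.
import pandas as pd

park = pd.read_csv("Hope_Park_original.csv")
print(park.shape)        # rows, columns
print(park.dtypes)       # column types
print(park.head())       # first few records
```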

  9. The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures...

    • data.niaid.nih.gov
    Updated Sep 25, 2020
    Cite
    Chen, Xinlei; Liu, Xinyu; Eng, Kent X.; Liu, Jingxiao; Noh, Hae Young; Zhang, Lin; Zhang, Pei (2020). The S&M-HSTPM2d5 dataset: High Spatial-Temporal Resolution PM 2.5 Measures in Multiple Cities Sensed by Static & Mobile Devices [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4028129
    Explore at:
    Dataset updated
    Sep 25, 2020
    Dataset provided by
    Carnegie Mellon University
    Tsinghua University
    Stanford University
    Authors
    Chen, Xinlei; Liu, Xinyu; Eng, Kent X.; Liu, Jingxiao; Noh, Hae Young; Zhang, Lin; Zhang, Pei
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The S&M-HSTPM2d5 dataset contains high spatial- and temporal-resolution particulate matter (PM2.5) measurements, each with a timestamp and GPS location, collected by mobile and static devices in three Chinese cities: Foshan, Cangzhou, and Tianjin. A different number of static and mobile devices was deployed in each city. The sampling interval was one minute in Cangzhou and three seconds in Foshan and Tianjin. For details of the setup, please refer to the Device_Setup_Description.txt file in this repository and the data descriptor paper.

    After data collection, a data cleaning process was performed to remove and adjust abnormal and drifting data. The script for the data cleaning algorithm is provided in this repository. The algorithm only adjusts or removes individual data points; removal of an entire device's data was done after the cleaning algorithm, based on empirical judgment and graphical visualization. For specific details of the data cleaning process, please refer to the script (Data_cleaning_algorithm.ipynb) in this repository and the data descriptor paper.

    The dataset in this repository is the processed version. The raw dataset and removed devices are not included in this repository.

    The data is stored as CSV files. Each CSV file, named by device ID, contains the data collected by the corresponding device. Each CSV file has three types of data: a timestamp in China Standard Time (GMT+8), a geographic location as latitude and longitude, and the PM2.5 concentration in micrograms per cubic meter. The CSV files are stored in either a Static or a Mobile folder according to the device type, and these folders are stored in the corresponding city's folder.

    Any programming language that can read CSV files can be used to access the dataset, and users can also open the CSV files directly. The get_dataset.ipynb file in this repository provides another way to access the dataset. To execute the .ipynb files, Jupyter Notebook with Python 3 is required, along with the following Python libraries:

    get_dataset.ipynb: os, pandas

    Data_cleaning_algorithm.ipynb: os, pandas, datetime, math

    Instructions for installing the libraries above can be found online. After installing Jupyter Notebook with Python 3 and the required libraries, users can open the .ipynb files in Jupyter Notebook and follow the instructions inside. A minimal loading sketch follows.
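
    As an alternative to get_dataset.ipynb, the sketch below shows one way to read the device CSVs with os and pandas, assuming the folder layout described above (city folder, then Static or Mobile folder, then one CSV per device ID); it is an illustration, not the repository's code.

```python
# Read every device CSV for one city, assuming <city>/<Static|Mobile>/<id>.csv.
import os
import pandas as pd

def load_city(city_dir):
    frames = []
    for device_type in ("Static", "Mobile"):
        folder = os.path.join(city_dir, device_type)
        for name in os.listdir(folder):
            if not name.endswith(".csv"):
                continue
            df = pd.read_csv(os.path.join(folder, name))
            df["device_id"] = os.path.splitext(name)[0]   # file name = device ID
            df["device_type"] = device_type
            frames.append(df)
    return pd.concat(frames, ignore_index=True)

foshan = load_city("Foshan")   # hypothetical folder name for the Foshan data
print(foshan.head())
```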

    For questions or suggestions, please e-mail Xinlei Chen.

  10. Legality Without Justice: Symbolic Governance, Institutional Denial, and the...

    • zenodo.org
    bin, csv
    Updated Nov 6, 2025
    Cite
    Scott Brown; Scott Brown (2025). Legality Without Justice: Symbolic Governance, Institutional Denial, and the Ethical Foundations of Law [Dataset]. http://doi.org/10.5281/zenodo.16361108
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Nov 6, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Scott Brown; Scott Brown
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description:
    This dataset accompanies the empirical analysis in Legality Without Justice, a study examining the relationship between public trust in institutions and perceived governance legitimacy using data from the World Values Survey Wave 7 (2017–2022). It includes:

    • WVS_Cross-National_Wave_7_csv_v6_0.csv — World Values Survey Wave 7 core data.

    • GDP.csv — World Bank GDP per capita (current US$) for 2022 by country.

    • denial.ipynb — Fully documented Jupyter notebook with code for data merging, exploratory statistics, and ordinal logistic regression using OrderedModel. Includes GDP as a control for institutional trust and perceived governance.

    All data processing and analysis were conducted in Python using FAIR reproducibility principles and can be replicated or extended on Google Colab.

    DOI: 10.5281/zenodo.16361108
    License: Creative Commons Attribution 4.0 International (CC BY 4.0)
    Authors: Anon Annotator
    Publication date: 2025-07-23
    Language: English
    Version: 1.0.0
    Publisher: Zenodo
    Programming language: Python

    🔽 How to Download and Run on Google Colab

    Step 1: Open Google Colab

    Go to https://colab.research.google.com

    Step 2: Upload Files

    Click File > Upload notebook, and upload the denial.ipynb file.
    Also upload the CSVs (WVS_Cross-National_Wave_7_csv_v6_0.csv and GDP.csv) using the file browser on the left sidebar.

    Step 3: Adjust File Paths (if needed)

    In denial.ipynb, ensure file paths match:

```python
import pandas as pd

wvs = pd.read_csv('/content/WVS_Cross-National_Wave_7_csv_v6_0.csv')
gdp = pd.read_csv('/content/GDP.csv')
```

    Step 4: Run the Code

    Execute the notebook cells from top to bottom. You may need to install required libraries:

```python
!pip install statsmodels pandas numpy
```

    The notebook performs:

    • Data cleaning

    • Merging WVS and GDP datasets

    • Summary statistics

    • Ordered logistic regression to test if confidence in courts/police (Q57, Q58) predicts belief that the country is governed in the interest of the people (Q183), controlling for GDP.
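
    For orientation, here is a hedged sketch of the kind of ordered logit described above, fit with statsmodels' OrderedModel. The question codes (Q57, Q58, Q183) come from the description; the merge key and the GDP column name are assumptions, and denial.ipynb remains the authoritative implementation.

```python
# Ordered logit sketch: Q57/Q58 and GDP predicting Q183. The merge key and
# the GDP column name are assumptions; denial.ipynb is the authoritative code.
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

wvs = pd.read_csv("WVS_Cross-National_Wave_7_csv_v6_0.csv")
gdp = pd.read_csv("GDP.csv")

# Assumed: both files share an ISO country-code column named "country_code"
data = wvs.merge(gdp, on="country_code", how="left")
data = data[["Q57", "Q58", "Q183", "gdp_per_capita"]].dropna()

model = OrderedModel(
    data["Q183"].astype(int),                     # ordered outcome
    data[["Q57", "Q58", "gdp_per_capita"]],       # predictors + GDP control
    distr="logit",
)
result = model.fit(method="bfgs", disp=False)
print(result.summary())
```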

  11. Data from: Expanding drug targets for 112 chronic diseases using a machine...

    • zenodo.org
    Updated Feb 21, 2025
    Cite
    Robert Chen; Robert Chen; Ron Do; Ron Do (2025). Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score [Dataset]. http://doi.org/10.5281/zenodo.14905752
    Explore at:
    Dataset updated
    Feb 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robert Chen; Robert Chen; Ron Do; Ron Do
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Feb 21, 2025
    Description

    ML-GPS: Machine Learning-Assisted Genetic Priority Score

    This Zenodo repository contains data and code associated with the publication:

    Chen R, Duffy Á, Petrazzini BO, Vy HM, Stein D, Mort M, Park JK, Schlessinger A, Itan Y, Cooper DN, Jordan DM, Rocheleau G, Do R. Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score. Nat Commun. 2024 Oct 15;15(1):8891. doi: 10.1038/s41467-024-53333-y.

    Important notes

    Repository contents

    Files needed to train ML-GPS and ML-GPS DOE:

    • Files needed for Jupyter notebooks.zip: Data files required for preprocessing and training.
    • Jupyter notebooks.zip: Notebooks for cleaning data, training models, and generating predictions.

    Other files:

    • Predictions for all gene-phecode pairs.zip: ML-GPS and ML-GPS DOE scores for all analyzed gene-phecode pairs.
    • Summary statistics.zip: Genetic association summary statistics for all tested gene-phecode pairs.

    Updated performance metrics

    Model | Open Targets AUPRC | SIDER AUPRC
    ML-GPS (non-DOE) | 0.074 | 0.080
    ML-GPS DOE (activator predictions) | 0.029 | 0.042
    ML-GPS DOE (inhibitor predictions) | 0.067 | 0.064

    Zenodo versions

    • Version 4: Updated notebooks and external data to use Open Targets 2024.9; summary statistics are unchanged
    • Version 3: Corrected error where DOE for rare and ultrarare variants was incorrectly incorporated
    • Version 2: Original release accompanying the publication
  12. Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 7, 2024
    Cite
    Baboz, Eliott; Lavarenne, Jérémy (2024). Spatialized sorghum & millet yields in West Africa, derived from LSMS-ISA and RHoMIS datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10556265
    Explore at:
    Dataset updated
    Jul 7, 2024
    Dataset provided by
    Centre de Coopération Internationale en Recherche Agronomique pour le Développement
    Authors
    Baboz, Eliott; Lavarenne, Jérémy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Africa, West Africa
    Description

    Description: The dataset represents a significant effort to compile and clean a comprehensive set of seasonal yield data for sub-Saharan West Africa (Benin, Burkina Faso, Mali, Niger). The dataset, covering more than 22,000 survey answers scattered across more than 2,500 unique locations of smallholder producers' household groups, is instrumental for researchers and policymakers working in agricultural planning and food security in the region. It integrates data from two sources, the LSMS-ISA program (link to the World Bank's site) and the RHoMIS dataset (link to RHoMIS files, RHoMIS' DOI).

    The construction of the dataset involved meticulous processes, including converting production into standardized units, calculating yields for each dataset, standardizing column names, assembling the data, and extensive data cleaning, making it, we hope, a robust and reliable resource for understanding spatial yield distribution in the region.

    Data Sources: The dataset comprises seven spatialized yield data sources, six of which are from the LSMS-ISA program (Mali 2014, Mali 2017, Mali 2018, Benin 2018, Burkina Faso 2018, Niger 2018) and one from the RHoMIS study (only Mali 2017 and Burkina Faso 2018 data selected).

    Dataset Preparation Methods: The preparation involved integration of machine-readable files, data cleaning, and finalization using Python/Jupyter Notebook. This process helped ensure the accuracy and consistency of the dataset. Yields were calculated from declared production quantities and GPS-measured plot areas; each yield value corresponds to a single plot (see the sketch below).
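
    A minimal sketch of the yield calculation just described (declared production divided by GPS-measured plot area); all file and column names here are illustrative assumptions.

```python
# Yield = declared production / GPS-measured plot area; names are illustrative.
import pandas as pd

plots = pd.read_csv("plots.csv")   # hypothetical merged survey file

plots = plots[plots["plot_area_ha"] > 0]                                  # guard against zero areas
plots["yield_kg_per_ha"] = plots["production_kg"] / plots["plot_area_ha"]
print(plots["yield_kg_per_ha"].describe())
```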

    Discussion: This dataset, with its extensive data compilation, presents an invaluable resource for agricultural-productivity studies in West Africa. However, users must navigate its complexities, including potential biases due to the survey design and to UML units, and data inconsistencies. The dataset's comprehensive nature requires careful handling and validation in research applications.

    Authors Contributions:

    Data treatment: Eliott Baboz, Jérémy Lavarenne.

    Documentation: Jérémy Lavarenne.

    Funding: This project was funded by the INTEN-SAHEL TOSCA project (Centre national d’études spatiales). "123456789" was chosen randomly and is not the actual award number because there is none, but it was mandatory to put one here on Zenodo.

    Changelog:

    v1.0.0 : initial submission

  13. divvy's Trip (Cyclist bike share analysis)

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    katabathina jyoshnavi (2024). divvy's Trip (Cyclist bike share analysis) [Dataset]. https://www.kaggle.com/datasets/katabathinajyoshnavi/divvys-trip-cyclist-bike-share-analysis
    Explore at:
    Available download formats: zip (194213174 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    katabathina jyoshnavi
    Description

    Introduction:

    About the Company:

    Cyclistic is a bike-sharing company in Chicago, which has since expanded to include a fleet of 5,824 geotracked bicycles stationed at 692 locations across Chicago. The bikes can be unlocked at one station and returned to any other station within the network at any time. Individuals buying single-ride or full-day passes fall into the category of casual riders, while those acquiring annual memberships become recognized as Cyclistic members. Tools and Technologies: ⦁ Tableau/Power BI for dashboard development. ⦁ Python for data analysis

    Phase 1: About the Dataset: The data is publicly available on an AWS server. We were tasked with working with an entire year of data, so I downloaded zipped files (CSV format) containing data from January 2023 to December 2023, one file for each month.

    Data Structure: Each .csv file contains a table with 13 columns of varying data types. Each column is a field that describes how people use Cyclistic's bike-sharing service, and each row represents one ride with its details. ⦁ ride_id: A unique identifier assigned to each bike ride, like a reference number for the trip. ⦁ rideable_type: The type of bike used in the ride, either "electric_bike" or "classic_bike". ⦁ started_at: The date and time when the ride began, in the format YYYY-MM-DD HH:MM:SS. ⦁ ended_at: The date and time when the ride ended, in the same format as started_at. ⦁ start_station_name: The name of the docking station where the ride started. ⦁ start_station_id: A unique identifier for the starting docking station, complementing start_station_name. ⦁ start_lat: The latitude coordinate of the starting docking station. ⦁ start_lng: The longitude coordinate of the starting docking station; these coordinates are useful for mapping the station's location. ⦁ end_station_name: The name of the docking station where the ride ended. ⦁ end_station_id: A unique identifier for the ending docking station, complementing end_station_name. ⦁ end_lat: The latitude coordinate of the ending docking station. ⦁ end_lng: The longitude coordinate of the ending docking station; these coordinates are useful for mapping the station's location. ⦁ member_casual: Whether the rider was an annual member (member) or a casual user (casual) of the bike-sharing service.

    Phase 2: I used Python for data cleaning. You can view the Jupyter Notebook for the Process phase here. These are the steps I took during this phase (a sketch follows this paragraph): ⦁ Check for nulls and duplicates ⦁ Add columns and transform data (change data types, remove trailing or leading spaces, etc.) ⦁ Extract data for analysis. Data Cleaning Result: Total row count before data cleaning: 5,745,324. Total row count after data cleaning: 4,268,747.
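
    Below is a hedged sketch of the Phase 2 cleaning steps, using the column names from the data structure above; the monthly file name is illustrative and the exact rules in the original notebook may differ.

```python
# Phase 2 cleaning sketch; column names come from the description above,
# the monthly file name is illustrative.
import pandas as pd

rides = pd.read_csv("202301-divvy-tripdata.csv")

rides = rides.drop_duplicates(subset="ride_id")
rides["started_at"] = pd.to_datetime(rides["started_at"])
rides["ended_at"] = pd.to_datetime(rides["ended_at"])

# Strip stray whitespace from text columns
for col in rides.select_dtypes(include="object"):
    rides[col] = rides[col].str.strip()

# Derived fields used in the analysis
rides["ride_minutes"] = (rides["ended_at"] - rides["started_at"]).dt.total_seconds() / 60
rides["day_of_week"] = rides["started_at"].dt.day_name()

# Drop impossible durations and rows missing key fields
rides = rides[rides["ride_minutes"] > 0].dropna(subset=["member_casual"])
print(len(rides))
```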

    Phase 3: Analyze: I used Python in my Jupyter Notebook to look at the large dataset we cleaned earlier. I came up with questions to figure out how casual riders differ from annual members, then wrote queries to get the answers, helping us understand more and make decisions based on the data. Questions: Here are the questions we will answer in this phase: ⦁ What is the percentage of each user type out of total users? ⦁ Is there a bike type preferred by different user types? ⦁ Which bike type has the longest trip duration between users? ⦁ What is the average trip duration per user type? ⦁ What is the average distance traveled per user type? ⦁ On which days are most users active? ⦁ In which months or seasons of the year do users tend to use the bike-sharing service?

    I used Tableau Public to make the visualization. You can view the data visualization for the Share phase here: https://public.tableau.com/app/profile/katabathina.jyoshnavi/viz/divvytripvisualisation/Dashboard7.

    Findings ⦁ 63% of the total Cyclistic users are annual members, while 36% are casual riders. ⦁ Both annual members and casual riders prefer classic bikes; only casual riders use docked bikes. ⦁ Generally, casual riders have the longest average ride duration (23 minutes) compared with annual members (18 minutes). ⦁ Both annual members and casual riders have almost the same average distance traveled. ⦁ Docked bikes, which only casual riders use, have the longest average ride duration; classic bikes have the longest average ride duration for annual members. ⦁ Most trips are recorded on Saturday. ⦁ There are more trips during spring and the fewest during winter.

  14. Five years of quality-controlled meteorological surface data at Oak Ridge...

    • zenodo.org
    bin, zip
    Updated Apr 21, 2025
    Cite
    Morgan Steckler; Morgan Steckler; Xiao-Ying Yu; Xiao-Ying Yu; Kevin Birdwell; Kevin Birdwell; haowen xu; haowen xu (2025). Five years of quality-controlled meteorological surface data at Oak Ridge Reserve in Tennessee [Dataset]. http://doi.org/10.5281/zenodo.14744006
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Morgan Steckler; Morgan Steckler; Xiao-Ying Yu; Xiao-Ying Yu; Kevin Birdwell; Kevin Birdwell; haowen xu; haowen xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Oak Ridge, Tennessee
    Description

    Access to continuous, quality assessed meteorological data is critical for understanding the climatology and atmospheric dynamics of a region. Research facilities like Oak Ridge National Laboratory (ORNL) rely on such data to assess site-specific climatology, model potential emissions, establish safety baselines, and prepare for emergency scenarios. To meet these needs, on-site towers at ORNL collect meteorological data at 15-minute and hourly intervals. However, data measurements from meteorological towers are affected by sensor sensitivity, degradation, lightning strikes, power fluctuations, glitching, and sensor failures, all of which can affect data quality. To address these challenges, we conducted a comprehensive quality assessment and processing of five years of meteorological data collected from ORNL at 15-minute intervals, including measurements of temperature, pressure, humidity, wind, and solar radiation. The time series of each variable was pre-processed and gap-filled using established meteorological data collection and cleaning techniques, i.e., the time series were subjected to structural standardization, data integrity testing, automated and manual outlier detection, and gap-filling. The data product and highly generalizable processing workflow developed in Python Jupyter notebooks are publicly accessible online. As a key contribution of this study, the evaluated 5-year data will be used to train atmospheric dispersion models that simulate dispersion dynamics across the complex ridge-and-valley topography of the Oak Ridge Reservation in East Tennessee.
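
    To illustrate the kind of range checking and short-gap filling described above (not the published workflow itself), here is a minimal pandas sketch for one 15-minute variable; the file name, column name, and plausible-range limits are assumptions.

```python
# Range check + short-gap filling for one 15-minute variable (illustrative only).
import pandas as pd

met = pd.read_csv("ornl_tower_15min.csv", parse_dates=["timestamp"])  # hypothetical
met = met.set_index("timestamp").asfreq("15min")   # enforce a regular time grid

temp = met["air_temperature_c"]                    # assumed column name
temp = temp.where(temp.between(-40, 50))           # flag physically implausible values
met["air_temperature_c_qc"] = temp.interpolate(method="time", limit=8)  # fill gaps up to 2 h

print(temp.isna().sum(), "flagged or missing points before filling")
```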

  15. Shopping Mall Customer Data Segmentation Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis
    Explore at:
    Available download formats: zip (5890828 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    DataZng
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demographic Analysis of Shopping Behavior: Insights and Recommendations

    Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

    Cleaned Data Details: The data was cleaned and standardized into 15,079 unique entries with attributes including Customer ID, age, gender, annual income, and spending score. It can be used by marketing analysts to produce a better strategy for mall-specific marketing. A segmentation sketch follows.
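
    As an illustration of how the cleaned attributes might feed a segmentation, here is a hedged scikit-learn sketch; the file and column names are assumptions, and the choice of five clusters is arbitrary.

```python
# Segmentation sketch over age, income, and spending score; names are assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("shopping_mall_customers.csv")   # hypothetical file name
features = customers[["Age", "Annual Income", "Spending Score"]]

scaled = StandardScaler().fit_transform(features)
customers["segment"] = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(scaled)

print(customers.groupby("segment")[["Age", "Annual Income", "Spending Score"]].mean())
```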

    Challenges Faced: 1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention. 2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort. 3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

    Research Topics: 1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions. 2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

    Suggestions for Project Expansion: 1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights. 2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis. 3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking. This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

    References OpenAI. (2022). ChatGPT [Computer software]. Retrieved from https://openai.com/chatgpt. Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. Retrieved from https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook Pandas-Datareader. (n.d.). Retrieved from https://pypi.org/project/pandas-datareader/

  16. Social Media Customer Analysis

    • kaggle.com
    zip
    Updated Apr 16, 2021
    Cite
    Nafe Muhtasim (2021). Social Media Customer Analysis [Dataset]. https://www.kaggle.com/nafemuhtasim/social-media-customer-analysis
    Explore at:
    Available download formats: zip (108529 bytes)
    Dataset updated
    Apr 16, 2021
    Authors
    Nafe Muhtasim
    Description

    This is the data of a social media platform of an organization. You have been hired by the organization & given their social media data to analyze, visualize and prepare a report on it.

    You are required to prepare a neat notebook using Jupyter Notebook/JupyterLab or Google Colab. Then zip everything, including the notebook file (.ipynb) and the dataset, and upload it through the Google Forms link stated below. The notebook should be neat, containing code with details about what it does, visualizations, and a description of your purpose for each task.

    You are encouraged, but not limited, to go through general steps such as data cleaning, data preparation, exploratory data analysis (EDA), finding correlations, feature extraction, and more. (There is no limit to your skills and ideas.)

    After doing what needs to be done, you are to give your organization insights and facts. For example, are they reaching more audiences on weekends? Does posting content on weekdays turn out to be more effective? Does posting many pieces of content on the same day make more sense? Or should they post content regularly and keep day-to-day consistency? Did you find any trend patterns in the data? What is your advice after completing the analysis? Mention these clearly at the end of the notebook. (These are just a few examples; your findings may be entirely different, and that is totally acceptable.)

    Note that we will value clear documentation stating clear insights from the analysis of the data and the visualizations more than anything else. It will not matter how complex the methods you apply are if they ultimately do not find anything useful.

  17. Financial ratios 4 Nasdaq 100 membrs + 12m returns

    • kaggle.com
    zip
    Updated Jun 7, 2023
    Cite
    SheepBoss (2023). Financial ratios 4 Nasdaq 100 membrs + 12m returns [Dataset]. https://www.kaggle.com/datasets/mlcapital/financial-ratios-4-nasdaq-100-membrs-12m-returns/code
    Explore at:
    Available download formats: zip (212319 bytes)
    Dataset updated
    Jun 7, 2023
    Authors
    SheepBoss
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A Python module using Jupyter Notebooks to take an existing dataset available at Kaggle and undertake some data cleansing, data hard coding and data science management so it can be more useful for Machine Learning models. Source of original dataset: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data

    Introduction: The problem we are trying to solve is that there are very limited datasets on Kaggle if you wish to apply ML models to the problem of individual stock share price prediction using financial statement ratios as your input data. This is a problem that needs addressing, as there is a multi-billion-dollar global fundamental financial ratio investment analysis industry that is ripe for performance enhancement by machine learning. We believe the best dataset for this purpose on Kaggle was the dataset linked above. The problems with this dataset for ML model use were as follows (a minimal sketch of the fill and labeling steps follows this description): • A number of data attributes were not reported across every annual period; we removed attributes that were not populated across all the annual periods. • We filled in missing data and replaced NaNs and INFs with logical and reasonable fill values. • We attached label data, namely 12-month-ahead share price returns for each stock and each annual period, provided both as discrete percentage returns and as binary labels for outperformance or underperformance relative to the Nasdaq 100 index.
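
    A minimal sketch of the two preparation steps just described (replacing NaNs/INFs and attaching a binary outperformance label); the file name, column names, and the median fill rule are illustrative assumptions, not the notebook's exact choices.

```python
# Replace NaN/INF values and attach a binary outperformance label (illustrative).
import numpy as np
import pandas as pd

ratios = pd.read_csv("nasdaq100_fundamentals.csv")   # hypothetical file name

# Replace infinities, then fill remaining gaps with per-column medians
ratios = ratios.replace([np.inf, -np.inf], np.nan)
numeric_cols = ratios.select_dtypes(include="number").columns
ratios[numeric_cols] = ratios[numeric_cols].fillna(ratios[numeric_cols].median())

# Binary label: did the stock's 12-month return beat the Nasdaq 100 index?
ratios["label"] = (ratios["return_12m"] > ratios["index_return_12m"]).astype(int)
```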

    Resulting Datasets The resulting datasets cover 102 stocks using 39 financial ratios across both 4 and 5 year periods using two different types of labels.

    In summary, this repository provides a Jupyter Notebook that shows the steps undertaken to generate:

    Two datasets for 2017 to 2021 with the Y labels attached in the final column: • labels 1 or 0: binary outperformance against the index. • perfs labels: actual performance of the stock for that calendar year. And two more datasets for 2017 to 2020 with the same Y label data as above: • labels 1 or 0: binary outperformance against the index. • perfs labels: actual performance of the stock for that calendar year.

    Usage & Contributing: At the moment the project is in development. You can use the repository and play with the Jupyter Notebook to generate your own datasets with assumptions differing from ours. We will then load up some ML models that we think can be the most effective at predicting 12-month-forward share price outcomes based on the 39 financial ratios provided. We would welcome your thoughts on our models. Even better, we would welcome YOUR ideas on the best models for solving such a prediction problem using these datasets. You can always help get this problem solved. It's an open-source project, after all!

    Resources • Kaggle: https://www.kaggle.com/datasets/ifuurh/nasdaq100-fundamental-data • Jupyter Notebooks: https://jupyter.org/ • Yfinance: https://pypi.org/project/yfinance/

  18. Electric Vehicle Population Analysis

    • kaggle.com
    zip
    Updated Jun 23, 2025
    Cite
    Nibedita Sahu (2025). Electric Vehicle Population Analysis [Dataset]. https://www.kaggle.com/datasets/nibeditasahu/electric-vehicle-population-analysis
    Explore at:
    Available download formats: zip (10564209 bytes)
    Dataset updated
    Jun 23, 2025
    Authors
    Nibedita Sahu
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Electric Vehicle Population Analysis

    A data-driven end-to-end analysis of Electric Vehicle adoption, performance, and policy alignment across Washington State. This project covers everything from data cleaning and exploration to visualization and presentation — using SQL, Python, and Power BI.

    Tools & Technologies

    • SQL (MySQL): Data cleaning, filtering, type conversion, preprocessing
    • Python (Jupyter Notebook): Pandas, SQLAlchemy, NumPy, Matplotlib, Seaborn
    • Pandas Profiling / YData EDA: Automated EDA for in-depth data profiling
    • Power BI: Interactive, multi-page report design and visual analysis
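
    As a quick illustration of the automated EDA step listed above, here is a hedged ydata-profiling sketch; the CSV name is an assumption and the report options are just sensible defaults.

```python
# One-shot automated EDA report; the CSV name is an assumption.
import pandas as pd
from ydata_profiling import ProfileReport

ev = pd.read_csv("Electric_Vehicle_Population_Data.csv")
ProfileReport(ev, title="EV population profile", minimal=True).to_file("ev_profile.html")
```
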
  19. Data from: Spatial and temporal variation in the value of solar power across...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Brown, Patrick R. (2020). Spatial and temporal variation in the value of solar power across United States electricity markets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3562895
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    MIT Energy Initiative
    Authors
    Brown, Patrick R.
    Area covered
    United States
    Description

    This repository includes python scripts and input/output data associated with the following publication:

    [1] Brown, P.R.; O'Sullivan, F. "Spatial and temporal variation in the value of solar power across United States Electricity Markets". Renewable & Sustainable Energy Reviews 2019. https://doi.org/10.1016/j.rser.2019.109594

    Please cite reference [1] for full documentation if the contents of this repository are used for subsequent work.

    Many of the scripts, data, and descriptive text in this repository are shared with the following publication:

    [2] Brown, P.R.; O'Sullivan, F. "Shaping photovoltaic array output to align with changing wholesale electricity price profiles". Applied Energy 2019, 256, 113734. https://doi.org/10.1016/j.apenergy.2019.113734

    All code is in python 3 and relies on a number of dependencies that can be installed using pip or conda.

    Contents

    pvvm/*.py : Python module with functions for modeling PV generation and calculating PV energy revenue, capacity value, and emissions offset.

    notebooks/*.ipynb : Jupyter notebooks, including:

    pvvm-vos-data.ipynb: Example scripts used to download and clean input LMP data, determine LMP node locations, assign nodes to capacity zones, download NSRDB input data, and reproduce some figures in [1]

    pvvm-example-generation.ipynb: Example scripts demonstrating the use of the PV generation model and a sensitivity analysis of PV generator assumptions

    pvvm-example-plots.ipynb: Example scripts demonstrating different plotting functions

    validate-pv-monthly-eia.ipynb: Scripts and plots for comparing modeled PV generation with monthly generation reported in EIA forms 860 and 923, as discussed in SI Note 3 of [1]

    validate-pv-hourly-pvdaq.ipynb: Scripts and plots for comparing modeled PV generation with hourly generation reported in NREL PVDAQ database, as discussed in SI Note 3 of [1]

    pvvm-energyvalue.ipynb: Scripts for calculating the wholesale energy market revenues of PV and reproducing some figures in [1]

    pvvm-capacityvalue.ipynb: Scripts for calculating the capacity credit and capacity revenues of PV and reproducing some figures in [1]

    pvvm-emissionsvalue.ipynb: Scripts for calculating the emissions offset of PV and reproducing some figures in [1]

    pvvm-breakeven.ipynb: Scripts for calculating the breakeven upfront cost and carbon price for PV and reproducing some figures in [1]

    html/*.html : Static images of the above Jupyter notebooks for viewing without a python kernel

    data/lmp/*.gz : Day-ahead nodal locational marginal prices (LMPs) and marginal costs of energy (MCE), congestion (MCC), and losses (MCL) for CAISO, ERCOT, MISO, NYISO, and ISONE.

    At the time of publication of this repository, permission had not been received from PJM to republish their LMP data. If permission is received in the future, a new version of this repository will be linked here with the complete dataset.

    results/*.csv.gz : Simulation results associated with [1], including modeled energy revenue, capacity credit and revenue, emissions offsets, and breakeven costs for PV systems at all LMP nodes

    Data notes

    ISO LMP data are used with permission from the different ISOs. Adapting the MIT License (https://opensource.org/licenses/MIT), "The data are provided 'as is', without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or sources be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the data or other dealings with the data." Copyright and usage permissions for the LMP data are available on the ISO websites, linked below.

    ISO-specific notes on LMP data:

    CAISO data from http://oasis.caiso.com/mrioasis/logon.do are used pursuant to the terms at http://www.caiso.com/Pages/PrivacyPolicy.aspx#TermsOfUse.

    ERCOT data are from http://www.ercot.com/mktinfo/prices.

    MISO data are from https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/ and https://www.misoenergy.org/markets-and-operations/real-time--market-data/market-reports/market-report-archives/.

    PJM data were originally downloaded from https://www.pjm.com/markets-and-operations/energy/day-ahead/lmpda.aspx and https://www.pjm.com/markets-and-operations/energy/real-time/lmp.aspx. At the time of this writing these data are currently hosted at https://dataminer2.pjm.com/feed/da_hrl_lmps and https://dataminer2.pjm.com/feed/rt_hrl_lmps.

    NYISO data from http://mis.nyiso.com/public/ are used subject to the disclaimer at https://www.nyiso.com/legal-notice.

    ISONE data are from https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-da-hourly and https://www.iso-ne.com/isoexpress/web/reports/pricing/-/tree/lmps-rt-hourly-final. The Material is provided on an "as is" basis. ISO New England Inc., to the fullest extent permitted by law, disclaims all warranties, either express or implied, statutory or otherwise, including but not limited to the implied warranties of merchantability, non-infringement of third parties' rights, and fitness for particular purpose. Without limiting the foregoing, ISO New England Inc. makes no representations or warranties about the accuracy, reliability, completeness, date, or timeliness of the Material. ISO New England Inc. shall have no liability to you, your employer or any other third party based on your use of or reliance on the Material.

    Data workup: LMP data were downloaded directly from the ISOs using scripts similar to the pvvm.data.download_lmps() function (see below for caveats), then repackaged into single-node single-year files using the pvvm.data.nodalize() function. These single-node single-year files were then combined into the dataframes included in this repository, using the procedure shown in the pvvm-vos-data.ipynb notebook for MISO. We provide these yearly dataframes, rather than the long-form data, to minimize file size and number. These dataframes can be unpacked into the single-node files used in the analysis using the pvvm.data.copylmps() function.
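
    For a first look at the simulation results, the generic sketch below loads the results/*.csv.gz tables with pandas; the column layout is not documented here, so inspect it before relying on it, and prefer the notebooks above for the supported workflow.

```python
# Peek at the simulation result tables; column layout is not documented here.
import glob

import pandas as pd

for path in glob.glob("results/*.csv.gz"):
    df = pd.read_csv(path)               # pandas infers gzip compression from .gz
    print(path, df.shape)
    print(df.columns.tolist()[:10])      # first few column names
```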

    Usage notes

    Code is provided under the MIT License, as specified in the pvvm/LICENSE file and at the top of each *.py file.

    Updates to the code, if any, will be posted in the non-static repository at https://github.com/patrickbrown4/pvvm_vos. The code in the present repository has the following version-specific dependencies:

    matplotlib: 3.0.3

    numpy: 1.16.2

    pandas: 0.24.2

    pvlib: 0.6.1

    scipy: 1.2.1

    tqdm: 4.31.1

    To use the NSRDB download functions, you will need to modify the "settings.py" file to insert a valid NSRDB API key, which can be requested from https://developer.nrel.gov/signup/. Locations can be specified by passing (latitude, longitude) floats to pvvm.data.downloadNSRDBfile(), or by passing a string googlemaps query to pvvm.io.queryNSRDBfile(). To use the googlemaps functionality, you will need to request a googlemaps API key (https://developers.google.com/maps/documentation/javascript/get-api-key) and insert it in the "settings.py" file.

    Note that many of the ISO websites have changed in the time since the functions in the pvvm.data module were written and the LMP data used in the above papers were downloaded. As such, the pvvm.data.download_lmps() function no longer works for all ISOs and years. We provide this function to illustrate the general procedure used, and do not intend to maintain it or keep it up to date with the changing ISO websites. For up-to-date functions for accessing ISO data, the following repository (no connection to the present work) may be helpful: https://github.com/catalyst-cooperative/pudl.

  20. Spotify-Dataset_for_Self_practise

    • kaggle.com
    zip
    Updated Feb 24, 2025
    Cite
    Sonal Anand (2025). Spotify-Dataset_for_Self_practise [Dataset]. https://www.kaggle.com/datasets/sonalanand/spotify-dataset-for-self-practise/data
    Explore at:
    Available download formats: zip (48187 bytes)
    Dataset updated
    Feb 24, 2025
    Authors
    Sonal Anand
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🎵 Unveiling Spotify Trends: A Deep Dive into Streaming Data:

    Introduction:

    This Jupyter Notebook explores data manipulation, aggregation, and visualization techniques using Python’s Pandas, Matplotlib, and Seaborn libraries. The key objectives of this analysis include:

    📌 Data Cleaning and Preparation ✔ Handling missing values in key columns. ✔ Standardizing and transforming categorical features (e.g., mode, release_day_name). ✔ Creating new derived features, such as decade classification and energy levels.

    📌 Feature Engineering & Data Transformation ✔ Extracting release trends from date-based columns. ✔ Categorizing song durations and popularity levels dynamically. ✔ Applying lambda functions, apply(), map(), and filter() for efficient data transformations. ✔ Using groupby() and aggregation functions to analyze trends in song streams. ✔ Ranking artists based on total streams using rank().

    📌 Data Aggregation and Trend Analysis ✔ Identifying the most common musical keys used in songs. ✔ Tracking song releases over time with rolling averages. ✔ Comparing Major vs. Minor key distributions in song compositions.

    📌 Data Visualization ✔ Bar plots for ranking top artists and stream counts. ✔ Box plots to analyze stream distribution per release year. ✔ Heatmaps to examine feature correlations. ✔ Pie charts to understand song popularity distribution.

    📌 Dataset Description The dataset consists of Spotify streaming statistics and includes features such as:

    🎵 track_name – Song title. 🎤 artist(s)_name – Name(s) of performing artists. 🔢 streams – Number of times the song was streamed. 📅 released_year, released_month, released_day – Date of song release. 🎼 energy_%, danceability_%, valence_% – Audio feature metrics. 📊 in_spotify_playlists – Number of Spotify playlists featuring the song. 🎹 mode – Musical mode (Major or Minor). 🎯 Purpose This analysis is designed for: ✔ Exploring real-world datasets to develop data analyst skills. ✔ Practicing data transformation, aggregation, and visualization techniques. ✔ Preparing for data analyst interviews by working with structured workflows.

    📌 Table of Contents 1️⃣ Data Cleaning & Preparation 2️⃣ Feature Engineering & Transformations (apply(), map(), filter(), groupby(), rank()) 3️⃣ Data Aggregation & Trend Analysis 4️⃣ Data Visualization & Insights 5️⃣ Conclusion and Key Takeaways
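
    A hedged sketch of the groupby()/rank() steps listed above, using column names from the dataset description; the file name is an assumption, and streams is coerced to numeric since it sometimes loads as text.

```python
# groupby()/rank() sketch; file name is an assumption, columns from the description.
import pandas as pd

songs = pd.read_csv("spotify.csv")   # hypothetical file name
songs["streams"] = pd.to_numeric(songs["streams"], errors="coerce")

artist_totals = songs.groupby("artist(s)_name", as_index=False)["streams"].sum()
artist_totals["rank"] = artist_totals["streams"].rank(ascending=False, method="dense")
print(artist_totals.sort_values("rank").head(10))
```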
