License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Reservoir and Lake Surface Area Timeseries (ReaLSAT) dataset provides an unprecedented reconstruction of surface area variations of lakes and reservoirs at a global scale using Earth Observation (EO) data and novel machine learning techniques. The dataset provides monthly surface area variations (1984 to 2020) for 681,137 water bodies located below 50°N and larger than 0.1 square kilometers.
The dataset contains the following files:
1) ReaLSAT.zip: A shapefile that contains the reference shape of waterbodies in the dataset.
2) monthly_timeseries.zip: contains one CSV file for each water body, giving its monthly surface area variation values. The CSV files are stored in a subfolder corresponding to each 10 degree by 10 degree cell. For example, the monthly_timeseries_60_-50 folder contains CSV files for lakes that lie between 60°E and 70°E longitude and between 50°S and 40°S latitude.
3) monthly_shapes_.zip: contains a GeoTIFF for each water body that lies within the corresponding 10 degree by 10 degree cell. Please refer to the visualization notebook for how to use these GeoTIFFs.
4) evaluation_data.zip: contains the random subsets of the dataset used for evaluation. The zip file contains a README file that describes the evaluation data.
6) generate_realsat_timeseries.ipynb: a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any water body.
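Given the folder layout described above, a minimal loading sketch might look like the following. The file name and the column names are assumptions for illustration only; inspect the actual CSV headers (or use the provided notebook) before relying on them.

import pandas as pd

# Hypothetical example: a water body whose CSV sits in the 60E-70E / 50S-40S cell.
ts = pd.read_csv('monthly_timeseries_60_-50/123456.csv')
print(ts.head())

# If the file exposes date-like and area-like columns, a quick monthly plot could be:
# ts.plot(x='date', y='area')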
Please refer to the following papers to learn more about the processing pipeline used to create the ReaLSAT dataset:
[1] Khandelwal, Ankush, Anuj Karpatne, Praveen Ravirathinam, Rahul Ghosh, Zhihao Wei, Hilary A. Dugan, Paul C. Hanson, and Vipin Kumar. "ReaLSAT, a global dataset of reservoir and lake surface area variations." Scientific data 9, no. 1 (2022): 1-12.
[2] Khandelwal, Ankush. "ORBIT (Ordering Based Information Transfer): A Physics Guided Machine Learning Framework to Monitor the Dynamics of Water Bodies at a Global Scale." (2019).
Version Updates
Version 2.0:
extends the dataset to 2020.
provides geotiffs instead of shapefiles for individual lakes to reduce dataset size.
provides a notebook to visualize the updated dataset.
Version 1.4: added 1120 large lakes to the dataset and removed partial lakes that overlapped with these large lakes.
Version 1.3: fixed a visualization-related bug in generate_realsat_timeseries.ipynb.
Version 1.2: added a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any water body in the ReaLSAT database.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides information about top-rated TV shows, collected from The Movie Database (TMDb) API. It can be used for data analysis, recommendation systems, and insights on popular television content.
Key Stats:
- Total Pages: 109
- Total Results: 2,098 TV shows
- Data Source: TMDb API
- Sorting Criteria: highest-rated by vote_average (average rating), with a minimum vote count of 200

Data Fields (Columns):
- id: Unique identifier for the TV show
- name: Title of the TV show
- vote_average: Average rating given by users
- vote_count: Total number of votes received
- first_air_date: The date when the show was first aired
- original_language: Language in which the show was originally produced
- genre_ids: Genre IDs linked to the show's genres
- overview: A brief summary of the show
- popularity: Popularity score based on audience engagement
- poster_path: URL path for the show's poster image

Accessing the Dataset via API (Python Example):
import requests

api_key = 'YOUR_API_KEY_HERE'
url = "https://api.themoviedb.org/3/discover/tv"
params = {
    'api_key': api_key,
    'include_adult': 'false',
    'language': 'en-US',
    'page': 1,
    'sort_by': 'vote_average.desc',
    'vote_count.gte': 200
}

response = requests.get(url, params=params)
data = response.json()

print(data['results'][0])
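Since the results span about 109 pages, the full set can be collected with a simple pagination loop. This is a sketch reusing the url and params defined above; total_pages and results are fields of the TMDb discover response.

all_results = []
page = 1
while True:
    params['page'] = page
    data = requests.get(url, params=params).json()
    all_results.extend(data.get('results', []))
    if page >= data.get('total_pages', 1):
        break
    page += 1

print(len(all_results))  # roughly 2,098 shows at the time of collection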
Dataset Use Cases:
- Data Analysis: Explore trends in highly-rated TV shows.
- Recommendation Systems: Build personalized TV show suggestions.
- Visualization: Create charts to showcase ratings or genre distribution.
- Machine Learning: Predict show popularity using historical data.

Exporting and Sharing the Dataset (Google Colab Example):
import pandas as pd

df = pd.DataFrame(data['results'])

from google.colab import drive
drive.mount('/content/drive')
df.to_csv('/content/drive/MyDrive/top_rated_tv_shows.csv', index=False)

Ways to Share the Dataset:
- Google Drive: Upload and share a public link.
- Kaggle: Create a public dataset for collaboration.
- GitHub: Host the CSV file in a repository for easy sharing.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset is a Neo4j knowledge graph based on TMF Business Process Framework v22.0 data.
CSV files contain data about the model entities, and the JSON file contains the knowledge graph mapping.
The script used to generate the CSV files from the XML model can be found here.
To import the dataset, download the zip archive and upload it to Neo4j.
You can also check out this dataset here.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the study The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries. It contains country-year panel data for 2000–2023 covering both OECD economies and the ten largest Latin American countries by land area. Variables include GDP per capita (constant PPP, USD), trade openness, internet penetration, education indicators, cultural exports per capita, and executive constraints from the Polity V dataset.
The dataset supports a comparative analysis of how economic structure, institutional quality, and infrastructure shape cultural export performance across development contexts. Within-country fixed effects models show that trade openness constrains cultural exports in OECD economies but has no measurable effect in resource-dependent Latin America. In contrast, strong executive constraints benefit cultural industries in advanced economies while constraining them in extraction-oriented systems. The results provide empirical evidence for a two-stage development framework in which colonial extraction legacies create distinct constraints on creative industry growth.
All variables are harmonized to ISO3 country codes and aligned on a common panel structure. The dataset is fully reproducible using the included Jupyter notebooks (OECD.ipynb, LATAM+OECD.ipynb, cervantes.ipynb).
Contents:
GDPPC.csv — GDP per capita series from the World Bank.
explanatory.csv — Trade openness, internet penetration, and education indicators.
culture_exports.csv — UNESCO cultural export data.
p5v2018.csv — Polity V institutional indicators.
Jupyter notebooks for data processing and replication.
Potential uses: Comparative political economy, cultural economics, institutional development, and resource curse research.
These steps reproduce the OECD vs. Latin America analyses from the paper using the provided CSVs and notebooks.
Open Google Colab and click File → New notebook.
(Optional) If your files are in Google Drive, mount it:
from google.colab import drive
drive.mount('/content/drive')
You have two easy options:
A. Upload the 4 CSVs + notebooks directly
In the left sidebar, click the folder icon → Upload.
Upload: GDPPC.csv, explanatory.csv, culture_exports.csv, p5v2018.csv, and any .ipynb you want to run.
B. Use Google Drive
Put those files in a Drive folder.
After mounting Drive, refer to them with paths like /content/drive/MyDrive/your_folder/GDPPC.csv.
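Once the CSVs are available, a minimal sketch for assembling the country-year panel might look like the following. The path prefix and the key column names ('iso3', 'year') are assumptions for illustration; the included notebooks define the actual processing and harmonization steps.

import pandas as pd

base = '/content/'  # or '/content/drive/MyDrive/your_folder/' if using Drive

# Column names below are assumed; check the actual headers in each file before merging.
gdp = pd.read_csv(base + 'GDPPC.csv')
expl = pd.read_csv(base + 'explanatory.csv')
culture = pd.read_csv(base + 'culture_exports.csv')
polity = pd.read_csv(base + 'p5v2018.csv')

panel = (gdp
         .merge(expl, on=['iso3', 'year'], how='inner')
         .merge(culture, on=['iso3', 'year'], how='inner')
         .merge(polity, on=['iso3', 'year'], how='left'))
print(panel.shape)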
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
In the competition LLM - Detect AI Generated Text, there is a problem: the data is imbalanced, with far fewer LLM-generated essays than essays written by students. You can see an EDA of the competition data in this notebook: AI or Not AI? Delving Into Essays with EDA
To fix this imbalance, I made my own dataset of LLM-generated essays, generated by PaLM. You can see how the data was generated in this Google Colaboratory notebook:
Generate LLM dataset using PaLM.ipynb
The reason I could not generate the data in a Kaggle Notebook is that Kaggle Notebooks run on Kaggle's Docker image, which could not use my own PaLM API key (my opinion).
Column in data:
- id: index
- prompt_id: the prompt that was used to generate the data. You can check the prompts here!
- text: LLM-generated essay by PaLM based on prompt
- generated: additional information showing that the essay was LLM-generated. All values in this column are 1.
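A minimal sketch of how this dataset could be combined with the competition's training data to reduce the class imbalance. The file names are assumptions for illustration; adjust them to your local copies.

import pandas as pd

train = pd.read_csv('train_essays.csv')          # competition essays (mostly generated == 0)
palm = pd.read_csv('llm_generated_essays.csv')   # this dataset (generated == 1 for all rows)

combined = pd.concat(
    [train[['text', 'generated']], palm[['text', 'generated']]],
    ignore_index=True,
)
print(combined['generated'].value_counts())  # check how balanced the classes are now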
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial
The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing
Once the CSV file is uploaded to Google Colab, use these commands to process the file.
import pandas as pd

# load the file and create a pandas dataframe
df = pd.read_csv('/content/NYC_Jobs.csv')

# keep only these columns
df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
         'Job Category', 'Salary Range From', 'Salary Range To']]

# save the csv file without the index column
df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
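Since this dataset accompanies a GroupBy tutorial, here is a short GroupBy example on the filtered file. The aggregation shown is illustrative and not taken from the tutorial itself; column names come from the filtering step above.

import pandas as pd

df = pd.read_csv('/content/NYC_Jobs_filtered_cols.csv')

# Average salary range per agency, sorted by the upper bound.
salary_by_agency = (
    df.groupby('Agency')[['Salary Range From', 'Salary Range To']]
      .mean()
      .sort_values('Salary Range To', ascending=False)
)
print(salary_by_agency.head())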
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWristAR is a small three-subject dataset recorded using an E4 wristband. Each subject performed six scripted activities: upstairs/downstairs, walk/jog, and sit/stand. Each activity except stairs was performed for one minute, a total of three times, alternating between the pairs. Subjects 1 and 2 also completed a walking sequence of approximately 10 minutes. The dataset contains motion (accelerometer) data, temperature, electrodermal activity, and heart rate data. The .csv file associated with each data file contains timing and labeling information and was built using the provided Excel files.
Each two-activity session was recorded using a downward-facing action camera. This video was used to generate the labels and is provided to investigate any data anomalies, especially for the free-form long walk. For privacy reasons, only the sub1_stairs video contains audio.
The Jupyter notebook processes the acceleration data and performs hold-one-subject-out evaluation of a 1D-CNN. Example results from a run performed on a Google Colab GPU instance (without a GPU, training time increases to about 90 seconds per pass):
Hold-one-subject-out results:

Train Sub   Test Sub   Accuracy   Training Time (HH:MM:SS)
[1,2]       [3]        0.757      00:00:12
[2,3]       [1]        0.849      00:00:14
[1,3]       [2]        0.800      00:00:11
This notebook can also be run in Colab here. This video describes the processing: https://mediaflo.txstate.edu/Watch/e4_data_processing.
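For orientation, here is a minimal sketch of a 1D-CNN for windowed 3-axis accelerometer data. The window length, layer sizes, and class count are assumptions for illustration and not the architecture used in the provided notebook.

from tensorflow import keras

WINDOW_LEN, N_CHANNELS, N_CLASSES = 96, 3, 6  # hypothetical shapes

model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW_LEN, N_CHANNELS)),
    keras.layers.Conv1D(64, kernel_size=5, activation='relu'),
    keras.layers.MaxPooling1D(2),
    keras.layers.Conv1D(64, kernel_size=5, activation='relu'),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(N_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Hold-one-subject-out: fit on windows from two subjects, evaluate on the held-out subject.
# X_train, y_train, X_test, y_test would come from the dataset's windowed accelerometer CSVs.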
We hope you find this a useful dataset with end-to-end code. We have several papers in process and would appreciate your citation of the dataset if you use it in your work.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains all classifications made by the Gravity Spy machine learning model for LIGO glitches from the first three observing runs (O1, O2 and O3, where O3 is split into O3a and O3b). Gravity Spy classified all noise events identified by the Omicron trigger pipeline for which Omicron determined that the signal-to-noise ratio was above 7.5 and the peak frequency was between 10 Hz and 2048 Hz. To classify noise events, Gravity Spy made Omega scans of every glitch at 4 different durations, which helps capture the morphology of noise events that are both short and long in duration.
There are 22 classes used for O1 and O2 data (including No_Glitch and None_of_the_Above), while there are two additional classes used to classify O3 data (while None_of_the_Above was removed).
For O1 and O2, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Chirp, Extremely_Loud, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
For O3, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Blip_Low_Frequency, Chirp, Extremely_Loud, Fast_Scattering, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
The data set is described in Glanzer et al. (2023), which we ask to be cited in any publications using this data release. Example code using the data can be found in this Colab notebook.
If you would like to download the Omega scans associated with each glitch, then you can use the gravitational-wave data-analysis tool GWpy. If you would like to use this tool, please install Anaconda if you have not already and create a virtual environment using the following command:
conda create --name gravityspy-py38 -c conda-forge python=3.8 gwpy pandas psycopg2 sqlalchemy
After downloading one of the CSV files for a specific era and interferometer, please run the following Python script if you would like to download the data associated with the metadata in the CSV file. We recommend not trying to download too many images at one time. For example, the script below will read data on Hanford glitches from O2 that were classified by Gravity Spy and filter for only glitches that were labelled as Blips with 90% confidence or higher, and then download the first 4 rows of the filtered table.
from gwpy.table import GravitySpyTable

# Read the metadata for Hanford glitches from O2.
H1_O2 = GravitySpyTable.read('H1_O2.csv')

# Keep only glitches labelled as Blip with at least 90% confidence.
blips = H1_O2[(H1_O2["ml_label"] == "Blip") & (H1_O2["ml_confidence"] > 0.9)]

# Download the Omega scans for the first 4 rows of the filtered table.
blips[0:4].download(nproc=1)
The columns in the CSV files are taken from several different inputs:
[‘event_time’, ‘ifo’, ‘peak_time’, ‘peak_time_ns’, ‘start_time’, ‘start_time_ns’, ‘duration’, ‘peak_frequency’, ‘central_freq’, ‘bandwidth’, ‘channel’, ‘amplitude’, ‘snr’, ‘q_value’] contain metadata about the signal from the Omicron pipeline.
[‘gravityspy_id’] is the unique identifier for each glitch in the dataset.
[‘1400Ripples’, ‘1080Lines’, ‘Air_Compressor’, ‘Blip’, ‘Chirp’, ‘Extremely_Loud’, ‘Helix’, ‘Koi_Fish’, ‘Light_Modulation’, ‘Low_Frequency_Burst’, ‘Low_Frequency_Lines’, ‘No_Glitch’, ‘None_of_the_Above’, ‘Paired_Doves’, ‘Power_Line’, ‘Repeating_Blips’, ‘Scattered_Light’, ‘Scratchy’, ‘Tomte’, ‘Violin_Mode’, ‘Wandering_Line’, ‘Whistle’] contain the machine learning confidence for a glitch being in a particular Gravity Spy class (the confidence in all these columns should sum to unity). These use the original 22 classes in all cases.
[‘ml_label’, ‘ml_confidence’] provide the machine-learning predicted label for each glitch, and the machine learning confidence in its classification.
[‘url1’, ‘url2’, ‘url3’, ‘url4’] are the links to the publicly available Omega scans for each glitch. ‘url1’ shows the glitch for a duration of 0.5 seconds, ‘url2’ for 1 second, ‘url3’ for 2 seconds, and ‘url4’ for 4 seconds.
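As a quick sanity check on the column layout described above, the per-class confidence columns should sum to roughly 1 for every glitch. This is a sketch (not part of the data release) using plain pandas on one of the CSV files.

import pandas as pd

class_cols = ['1400Ripples', '1080Lines', 'Air_Compressor', 'Blip', 'Chirp',
              'Extremely_Loud', 'Helix', 'Koi_Fish', 'Light_Modulation',
              'Low_Frequency_Burst', 'Low_Frequency_Lines', 'No_Glitch',
              'None_of_the_Above', 'Paired_Doves', 'Power_Line', 'Repeating_Blips',
              'Scattered_Light', 'Scratchy', 'Tomte', 'Violin_Mode',
              'Wandering_Line', 'Whistle']

h1_o2 = pd.read_csv('H1_O2.csv')
print(h1_o2[class_cols].sum(axis=1).describe())   # should be close to 1.0 throughout
print(h1_o2['ml_label'].value_counts().head())    # most common predicted classes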
For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.
For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.
oslo-city-bike
License: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).
The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because similar data also exists for Trondheim and Bergen.
Variables (from oslobysykkel.no):

Variable                    Format                      Description
started_at                  Timestamp                   Timestamp of when the trip started
ended_at                    Timestamp                   Timestamp of when the trip ended
duration                    Integer                     Duration of trip in seconds
start_station_id            String                      Unique ID for start station
start_station_name          String                      Name of start station
start_station_description   String                      Description of where start station is located
start_station_latitude      Decimal degrees in WGS84    Latitude of start station
start_station_longitude     Decimal degrees in WGS84    Longitude of start station
end_station_id              String                      Unique ID for end station
end_station_name            String                      Name of end station
end_station_description     String                      Description of where end station is located
end_station_latitude        Decimal degrees in WGS84    Latitude of end station
end_station_longitude       Decimal degrees in WGS84    Longitude of end station
Please note: this data and my analysis focus on the new data format; historical data for the period April 2016 - December 2018 (Legacy Trip Data) follows a different format.
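A minimal loading sketch for files in the format described above, assuming the folder sits next to your notebook; the daily-count aggregation is just one illustrative starting point for the forecasting phase.

import glob
import pandas as pd

files = sorted(glob.glob('oslobysykkel/oslobysykkel-*.csv'))
trips = pd.concat(
    (pd.read_csv(f, parse_dates=['started_at', 'ended_at']) for f in files),
    ignore_index=True,
)

# Daily trip counts, a natural input for demand forecasting.
daily = trips.set_index('started_at').resample('D').size().rename('trips')
print(daily.head())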
I was extremely fascinated by this open data from Oslo City Bike, and in the process of deep analysis I saw broad prospects. This interest turned into an idea to create a data-analytics problem book, or even a platform, an 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (Clustering, Forecasting), and so that anyone can participate in analysis and modeling based on this exciting data.
**Autumn's remake of Oslo bike sharing data analysis**: https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing
https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view
Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project; stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.
Index of my notebooks:
- Phase 1: Cleaned Data & Core Insights
- Time-Space Dynamics Exploratory
- Clustering and Segmentation
- Demand Forecasting (Time Series)
- Geospatial Analysis (Network Analysis)
Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis
Links to works I have found or that have inspired me:
Exploring Open Data from Oslo City Bike (Jon Olave): visualization of popular routes and seasonality analysis.
Oslo City Bike Data Wrangling (Karl Tryggvason): predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).
Helsinki City Bikes: Exploratory Data Analysis: analysis of a similar system in Helsinki, useful for comparative studies and methodological ideas.
The idea is to connect this with other data. For example, I did this with weather data, integrating temperature, precipitation, and wind speed to explain variations in daily demand: https://meteostat.net/en/place/no/oslo
I also used data from Airbnb (that is where I took the division into neighbourhoods): https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv
Tags: oslo, bike-sharing, eda, feature-engineering, geospatial, time-series
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Name: Cellpose models for Brightfield and Digital Phase Contrast images
Data type: Cellpose models trained via transfer learning from the ‘nuclei’ and ‘cyto2’ pretrained models with an additional training dataset. Includes corresponding CSV files with 'Quality Control' metrics (§) (model.zip).
Training Dataset: Light microscopy (Digital Phase Contrast or Brightfield) and automatic annotations (nuclei or cyto) (https://doi.org/10.5281/zenodo.6140064)
Training Procedure: The Cellpose models were trained using Cellpose version 1.0.0 with GPU support (NVIDIA GeForce K40) using default settings as per the Cellpose documentation. Training was done using a Renku environment (renku template).
Command Line Execution for the different trained models
nuclei_from_bf:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _bf --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose
cyto_from_bf:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _bf --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose
nuclei_from_dpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _dpc --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose
cyto_from_dpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _dpc --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose
nuclei_from_sqrdpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _sqrdpc --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose
cyto_from_sqrdpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _sqrdpc --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose
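For inference with one of these trained models, a minimal sketch using the Cellpose Python API is shown below. The model path and image file name are assumptions for illustration; the channel settings mirror the --chan 0 --chan2 0 options used for training.

from cellpose import models, io

# Hypothetical paths: an unzipped model from model.zip and a brightfield test image.
model = models.CellposeModel(gpu=True, pretrained_model='models/cyto_from_bf')
img = io.imread('example_bf.tif')

masks, flows, styles = model.eval(img, channels=[0, 0], diameter=None)
print(int(masks.max()), 'objects segmented')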
NOTE (§): We provide a notebook for Quality Control, which is an adaptation of the "Cellpose (2D and 3D)" notebook from ZeroCostDL4Mic.
NOTE: This dataset used a training dataset from the Zenodo entry (https://doi.org/10.5281/zenodo.6140064), which was generated from the “HeLa ‘Kyoto’ cells under the scope” dataset Zenodo entry (https://doi.org/10.5281/zenodo.6139958) in order to automatically generate the label images.
NOTE: Make sure that you delete the “_flow” images that are auto-computed when running the training. If you do not, then the flows from previous runs will be used for the new training, which might yield confusing results.
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was originally collected for a data science and machine learning project aimed at investigating the potential correlation between the amount of time an individual spends on social media and its impact on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done on Jupyter Notebook -
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub Repository of the project -
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the Project -
Pandas
Numpy
Matplotlib
Seaborn
Scikit-learn
This is one of the case studies in the Capstone project of the Google Data Analytics Certificate.
Case Study 1: How does a bike-share navigate speedy success?
A bike-share company, Cyclistic, is trying to increase the profits in the coming years and looking for realistic business strategies. The director of marketing believes maximizing the number of annual memberships would be the most efficient way. Therefore, the analyst team would like to understand how casual riders and annual members use Cyclistic bikes differently so that they can design a new marketing strategy to convert casual riders into annual members.
Cyclistic launched a successful bike-share offering in 2016. After five years of growth, Cyclistic now has more than 5,000 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time. Apart from classic bicycles, Cyclistic also offers bikes adapted for riders with disabilities as well as electric bikes. Most users ride the shared bikes for leisure, but about 30% use them to commute.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.
The data are Cyclistic's historical trip data from the past 12 months (202004 to 202103). The data has been made available by Motivate International Inc. under this license (https://www.divvybikes.com/data-license-agreement). The original data was divided into several CSV files by month, and the tables shown in this notebook were joined in BigQuery before being uploaded.
The union_tables includes 'ride_id', 'rideable_type', 'started_at', 'ended_at', 'start_station_name', and 'end_station_name'; the union_tables_geo includes 'start_station_name', 'end_station_name', 'start_lat', 'start_lng', 'end_lat', and 'end_lng'.
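A small sketch of the ride-length calculation behind the duration comparison reported below. The file name and the 'member_casual' rider-type column are assumptions for illustration, since they are not listed among the joined columns above.

import pandas as pd

trips = pd.read_csv('union_tables.csv', parse_dates=['started_at', 'ended_at'])
trips['ride_length_min'] = (trips['ended_at'] - trips['started_at']).dt.total_seconds() / 60

# Average ride length per rider type, if such a column is present in your export.
if 'member_casual' in trips.columns:
    print(trips.groupby('member_casual')['ride_length_min'].mean())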
Tools: Google Sheets, BigQuery, and Python (Colab link: https://colab.research.google.com/drive/1_G0bh_Anbl-i41HnssK9qDeANhFAiaON#scrollTo=lz2H9aCSyzdV)
Please check out the dashboard here: https://public.tableau.com/profile/jing.ai#!/vizhome/shared_bike_202105/Dashboard1
Based on the trip record data from the past 12 months: 1. The number of casual users increased markedly in the summer (within the year), in the afternoon (within the day), and on weekends (within the week); 2. The busiest stations have more casual users; 3. Casual users tend to ride for longer than annual members; 4. Classic bikes were replacing the docked bikes.
If time and budget allow, a survey would help us understand what stops casual users from upgrading to annual memberships: price, bike quality, infrequent use, etc. Additionally, per-user ride records could help identify what annual members or casual users have in common. We could then think about how to improve the business strategies based on that analysis.
Let me know what you think about my capstone project (it is literally my first data analysis project). Any comments would be very much appreciated. Thanks! :)
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
AI-Pastiche is a carefully curated dataset comprising 953 AI-generated paintings in well-known artistic styles. It was created using 73 manually crafted text prompts, used to test 12 modern image generators, with one or more images generated for each of the selected models. The dataset includes comprehensive metadata describing the details of the generation process. Full information about the models can be found in the article A Critical Assessment of Modern Generative Models' Ability to Replicate Artistic Styles
You may use this notebook for a fast access to the dataset: AI-Pastiche_demo.
Metadata Description
Metadata comprise the following columns:
- generative model: the model used to generate the image
- prompt: the prompt passed as input to the generator
- subject: a collection of comma-separated tags describing the content (as described in the prompt)
- style: the style to be imitated
- period: the period the generated image should belong to
- generated image: the name of the generated image in the generated_images dataset
Three additional columns provide human metrics collected through extensive surveys. All values are in the range [0,1].
- defects: presence of notable defects and artifacts (0 = no evident defects, 1 = major problems)
- authenticity: perceived authenticity of the sample (please check the article for details about this metric)
- adherence: adherence of the sample to the prompt request
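A small sketch of filtering the metadata by these human metrics. The metadata file name and the thresholds are assumptions for illustration; column names follow the description above.

import pandas as pd

meta = pd.read_csv('metadata.csv')
good = meta[(meta['defects'] < 0.2) & (meta['authenticity'] > 0.7)]
print(good[['generative model', 'style', 'generated image']].head())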
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!
This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.
Getting Started
The first step is to download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage space. Once you have downloaded the data set, launch the Jupyter Notebook or Google Colab environment you want to work with.
Exploring & Preprocessing Data
To get a better understanding of the features in this dataset, import them into a Pandas DataFrame as shown below. You can use other libraries as needed:
import pandas as pd  # library used for importing datasets into Python

df = pd.read_csv('train.csv')  # imports the train CSV file into a Pandas DataFrame
df[['system_prompt', 'question', 'response']].head()  # views the top 5 rows of the columns 'system_prompt', 'question', 'response'

After importing, check each feature using basic descriptive statistics such as a Pandas groupby or value_counts statement, which gives greater clarity over the values present in each feature. The command below shows the count of each element in the system_prompt column of the train CSV file:
df['system_prompt'].value_counts().head()  # shows the count of each element present in the 'system_prompt' column

Illustrative output:
User says hello guys: 587
System asks How are you?: 555
User says I am doing good: 487
... and so on

Data Transformation: After inspecting and exploring the different features, you may want or need certain changes that best suit your needs before training modeling algorithms on this dataset.
Common transformation steps include removing punctuation marks: since punctuation marks may not add any value to computation operations, we can remove them with a regex replacement such as .str.replace('[^A-Za-z ]+', '', regex=True).
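A minimal sketch of that cleaning step applied to the 'question' column of the DataFrame loaded above; the exact regex and the new column name are illustrative assumptions.

# Keep letters and spaces only; everything else is stripped out.
df['question_clean'] = (
    df['question']
    .astype(str)
    .str.replace('[^A-Za-z ]+', '', regex=True)
)
print(df[['question', 'question_clean']].head())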
- Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
- Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
- Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...