14 datasets found
  1. Data from: ReaLSAT, a global dataset of reservoir and lake surface area variations

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    • +1 more
    Updated Feb 8, 2023
    + more versions
    Cite
    Ankush Khandelwal; Anuj Karpatne; Zhihao Wei; Rahul Ghosh; Hilary Dugan; Paul Hanson; Vipin Kumar (2023). ReaLSAT, a global dataset of reservoir and lake surface area variations [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4118463
    Explore at:
    Dataset updated
    Feb 8, 2023
    Dataset provided by
    Virginia Tech
    Beijing University of Technology
    University of Wisconsin
    University of Minnesota
    Authors
    Ankush Khandelwal; Anuj Karpatne; Zhihao Wei; Rahul Ghosh; Hilary Dugan; Paul Hanson; Vipin Kumar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Reservoir and Lake Surface Area Timeseries (ReaLSAT) dataset provides an unprecedented reconstruction of surface area variations of lakes and reservoirs at a global scale using Earth Observation (EO) data and novel machine learning techniques. The dataset provides monthly-scale surface area variations (1984 to 2020) for 681,137 water bodies located below 50°N with surface areas greater than 0.1 square kilometers.

    The dataset contains the following files:

    1) ReaLSAT.zip: A shapefile that contains the reference shape of waterbodies in the dataset.

    2) monthly_timeseries.zip: contains one CSV file for each water body, giving its monthly surface area variation values. The CSV files are stored in subfolders corresponding to 10 degree by 10 degree cells. For example, the monthly_timeseries_60_-50 folder contains CSV files for lakes that lie between 60°E and 70°E longitude and between 50°S and 40°S latitude (a short loading sketch follows this file list).

    3) monthly_shapes_.zip: contains a GeoTIFF for each water body that lies within the corresponding 10 degree by 10 degree cell. Please refer to the visualization notebook for how to use these GeoTIFFs.

    4) evaluation_data.zip: contains the random subsets of the dataset used for evaluation. The zip file contains a README file that describes the evaluation data.

    6) generate_realsat_timeseries.ipynb: a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any waterbody.
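    As a minimal sketch of working with the per-lake CSV files described above: the pandas calls are standard, but the file path and column names are assumptions, so inspect the actual headers of an extracted CSV first (the provided Colab notebook remains the authoritative reference).

    import pandas as pd

    # Hypothetical path: one per-lake CSV inside a 10x10 degree cell folder
    # extracted from monthly_timeseries.zip; the file name below is an example.
    csv_path = "monthly_timeseries_60_-50/123456.csv"

    df = pd.read_csv(csv_path)
    print(df.columns.tolist())   # verify the real column names before plotting

    # Assuming a date-like column and a surface-area column exist:
    # df["date"] = pd.to_datetime(df["date"])
    # df.plot(x="date", y="area", title="Monthly surface area variation")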

    Please refer to the following papers to learn more about the processing pipeline used to create the ReaLSAT dataset:

    [1] Khandelwal, Ankush, Anuj Karpatne, Praveen Ravirathinam, Rahul Ghosh, Zhihao Wei, Hilary A. Dugan, Paul C. Hanson, and Vipin Kumar. "ReaLSAT, a global dataset of reservoir and lake surface area variations." Scientific data 9, no. 1 (2022): 1-12.

    [2] Khandelwal, Ankush. "ORBIT (Ordering Based Information Transfer): A Physics Guided Machine Learning Framework to Monitor the Dynamics of Water Bodies at a Global Scale." (2019).

    Version Updates

    Version 2.0:

    • extends the dataset to 2020.

    • provides geotiffs instead of shapefiles for individual lakes to reduce dataset size.

    • provides a notebook to visualize the updated dataset.

    Version 1.4: added 1120 large lakes to the dataset and removed partial lakes that overlapped with these large lakes.

    Version 1.3: fixed a visualization-related bug in generate_realsat_timeseries.ipynb

    Version 1.2: added a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any waterbody in the ReaLSAT database.

  2. Top Rated TV Shows

    • kaggle.com
    zip
    Updated Jan 5, 2025
    Cite
    Shreya Gupta (2025). Top Rated TV Shows [Dataset]. https://www.kaggle.com/datasets/shreyajii/top-rated-tv-shows
    Explore at:
    zip (314571 bytes)
    Dataset updated
    Jan 5, 2025
    Authors
    Shreya Gupta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides information about top-rated TV shows, collected from The Movie Database (TMDb) API. It can be used for data analysis, recommendation systems, and insights on popular television content.

    Key Stats:

    Total Pages: 109
    Total Results: 2098 TV shows
    Data Source: TMDb API
    Sorting Criteria: Highest-rated by vote_average (average rating) with a minimum vote count of 200

    Data Fields (Columns):

    id: Unique identifier for the TV show
    name: Title of the TV show
    vote_average: Average rating given by users
    vote_count: Total number of votes received
    first_air_date: The date when the show was first aired
    original_language: Language in which the show was originally produced
    genre_ids: Genre IDs linked to the show's genres
    overview: A brief summary of the show
    popularity: Popularity score based on audience engagement
    poster_path: URL path for the show's poster image

    Accessing the Dataset via API (Python Example):

    import requests

    api_key = 'YOUR_API_KEY_HERE'
    url = "https://api.themoviedb.org/3/discover/tv"
    params = {
        'api_key': api_key,
        'include_adult': 'false',
        'language': 'en-US',
        'page': 1,
        'sort_by': 'vote_average.desc',
        'vote_count.gte': 200
    }

    response = requests.get(url, params=params)
    data = response.json()

    # Display the first show
    print(data['results'][0])
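    The example above fetches only the first page; since the Key Stats report 109 pages, collecting the full set of roughly 2098 shows means looping over the page parameter. A minimal sketch, relying on the total_pages field that TMDb returns with each response (add rate limiting as needed):

    import requests

    api_key = 'YOUR_API_KEY_HERE'
    url = "https://api.themoviedb.org/3/discover/tv"
    all_results = []

    page = 1
    total_pages = 1
    while page <= total_pages:
        params = {
            'api_key': api_key,
            'include_adult': 'false',
            'language': 'en-US',
            'page': page,
            'sort_by': 'vote_average.desc',
            'vote_count.gte': 200,
        }
        payload = requests.get(url, params=params).json()
        total_pages = payload.get('total_pages', 1)  # 109 at the time this dataset was built
        all_results.extend(payload.get('results', []))
        page += 1

    print(len(all_results))  # roughly 2098 shows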

    Dataset Use Cases:

    Data Analysis: Explore trends in highly-rated TV shows.
    Recommendation Systems: Build personalized TV show suggestions.
    Visualization: Create charts to showcase ratings or genre distribution.
    Machine Learning: Predict show popularity using historical data.

    Exporting and Sharing the Dataset (Google Colab Example):

    import pandas as pd

    # Convert the API data to a DataFrame
    df = pd.DataFrame(data['results'])

    # Save to CSV and upload to Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    df.to_csv('/content/drive/MyDrive/top_rated_tv_shows.csv', index=False)

    Ways to Share the Dataset:

    Google Drive: Upload and share a public link.
    Kaggle: Create a public dataset for collaboration.
    GitHub: Host the CSV file in a repository for easy sharing.

  3. TMF Business Process Framework Dataset for Neo4j

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    Aleksei Golovin (2023). TMF Business Process Framework Dataset for Neo4j [Dataset]. https://www.kaggle.com/datasets/algord/tmf-business-process-framework-dataset-for-neo4j
    Explore at:
    zip (13261206 bytes)
    Dataset updated
    Dec 4, 2023
    Authors
    Aleksei Golovin
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    TMF Business Process Framework Dataset for Neo4j

    The dataset is a Neo4j knowledge graph based on TMF Business Process Framework v22.0 data.
    CSV files contain data about the model entities, and the JSON file contains knowledge graph mapping.
    The script used to generate CSV files based on the XML model can be found here.

    To import the dataset, download the zip archive and upload it to Neo4j.

    You can also check this dataset here.
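    Once the archive has been imported into a Neo4j instance, a quick sanity check of the knowledge graph might look like the sketch below, using the official Python driver. The connection URI and credentials are placeholders; the actual node labels and relationship types depend on the JSON mapping shipped with the dataset.

    from neo4j import GraphDatabase

    # Placeholder connection details for your own Neo4j instance
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Count nodes per label to confirm the knowledge graph was imported
        result = session.run(
            "MATCH (n) RETURN labels(n) AS labels, count(*) AS cnt ORDER BY cnt DESC"
        )
        for record in result:
            print(record["labels"], record["cnt"])

    driver.close()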

  4. The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries

    • zenodo.org
    bin, csv
    Updated Aug 9, 2025
    Cite
    Anon Anon; Anon Anon (2025). The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries [Dataset]. http://doi.org/10.5281/zenodo.16784974
    Explore at:
    csv, bin
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    Zenodo
    Authors
    Anon Anon; Anon Anon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the study The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries. It contains country-year panel data for 2000–2023 covering both OECD economies and the ten largest Latin American countries by land area. Variables include GDP per capita (constant PPP, USD), trade openness, internet penetration, education indicators, cultural exports per capita, and executive constraints from the Polity V dataset.

    The dataset supports a comparative analysis of how economic structure, institutional quality, and infrastructure shape cultural export performance across development contexts. Within-country fixed effects models show that trade openness constrains cultural exports in OECD economies but has no measurable effect in resource-dependent Latin America. In contrast, strong executive constraints benefit cultural industries in advanced economies while constraining them in extraction-oriented systems. The results provide empirical evidence for a two-stage development framework in which colonial extraction legacies create distinct constraints on creative industry growth.

    All variables are harmonized to ISO3 country codes and aligned on a common panel structure. The dataset is fully reproducible using the included Jupyter notebooks (OECD.ipynb, LATAM+OECD.ipynb, cervantes.ipynb).

    Contents:

    • GDPPC.csv — GDP per capita series from the World Bank.

    • explanatory.csv — Trade openness, internet penetration, and education indicators.

    • culture_exports.csv — UNESCO cultural export data.

    • p5v2018.csv — Polity V institutional indicators.

    • Jupyter notebooks for data processing and replication.

    Potential uses: Comparative political economy, cultural economics, institutional development, and resource curse research.

    How to Run This Dataset and Code in Google Colab

    These steps reproduce the OECD vs. Latin America analyses from the paper using the provided CSVs and notebooks.

    1) Open Colab and set up

    1. Go to https://colab.research.google.com

    2. Click File → New notebook.

    3. (Optional) If your files are in Google Drive, mount it:

    from google.colab import drive
    drive.mount('/content/drive')

    2) Get the data files into Colab

    You have two easy options:

    A. Upload the 4 CSVs + notebooks directly

    • In the left sidebar, click the folder icon → Upload.

    • Upload: GDPPC.csv, explanatory.csv, culture_exports.csv, p5v2018.csv, and any .ipynb you want to run.

    B. Use Google Drive

    • Put those files in a Drive folder.

    • After mounting Drive, refer to them with paths like /content/drive/MyDrive/your_folder/GDPPC.csv.
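    With the files in place, a minimal loading sketch might look like the following. The file names come from the Contents list above, but the panel keys (iso3, year) and the merge strategy are assumptions; the shipped notebooks (OECD.ipynb, LATAM+OECD.ipynb) are the authoritative reference.

    import pandas as pd

    base = '/content'  # or '/content/drive/MyDrive/your_folder'

    gdp = pd.read_csv(f'{base}/GDPPC.csv')
    expl = pd.read_csv(f'{base}/explanatory.csv')
    culture = pd.read_csv(f'{base}/culture_exports.csv')
    polity = pd.read_csv(f'{base}/p5v2018.csv')

    # Assumed panel keys: an ISO3 country code and a year column.
    # Inspect the real headers first, e.g. print(gdp.columns.tolist()).
    keys = ['iso3', 'year']

    panel = (
        gdp.merge(expl, on=keys, how='inner')
           .merge(culture, on=keys, how='inner')
           .merge(polity, on=keys, how='inner')
    )
    print(panel.shape)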

  5. LLM-generated essay using PaLM from Google Gen-AI

    • kaggle.com
    zip
    Updated Nov 8, 2023
    Cite
    Kingki19 / Muhammad Rizqi (2023). LLM-generated essay using PaLM from Google Gen-AI [Dataset]. https://www.kaggle.com/datasets/kingki19/llm-generated-essay-using-palm-from-google-gen-ai/code
    Explore at:
    zip (519291 bytes)
    Dataset updated
    Nov 8, 2023
    Authors
    Kingki19 / Muhammad Rizqi
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The competition LLM - Detect AI Generated Text has a problem: the data is imbalanced, with far fewer LLM-generated essays than essays written by students. You can see an EDA of the competition data in this notebook: AI or Not AI? Delving Into Essays with EDA

    To address this, I created my own dataset of LLM-generated essays, produced with PaLM, to reduce the imbalance. You can see how the data was generated in this Google Colaboratory notebook:
    Generate LLM dataset using PaLM.ipynb

    I could not generate the data in a Kaggle Notebook because Kaggle Notebooks run on Kaggle's Docker environment, which would not work with my own PaLM API key (as far as I can tell).

    Columns in the data:

    - id: index
    - prompt_id: the prompt used to generate the data. You can check the prompts here!
    - text: the LLM-generated essay produced by PaLM from the prompt
    - generated: additional information showing the essay is LLM-generated; every value in this column is 1.
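    A minimal sketch of how these essays could be appended to the competition training data to rebalance the classes; the file names and the competition file layout are assumptions, so check the actual headers before concatenating.

    import pandas as pd

    # Competition training essays (assumed layout: id, prompt_id, text, generated)
    train = pd.read_csv('train_essays.csv')

    # PaLM-generated essays from this dataset; all rows have generated == 1
    palm = pd.read_csv('LLM_generated_essay_PaLM.csv')  # hypothetical file name

    combined = pd.concat(
        [train[['prompt_id', 'text', 'generated']],
         palm[['prompt_id', 'text', 'generated']]],
        ignore_index=True,
    )
    print(combined['generated'].value_counts())  # check the new class balance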

  6. NYC Jobs Dataset (Filtered Columns)

    • kaggle.com
    zip
    Updated Oct 5, 2022
    Cite
    Jeffery Mandrake (2022). NYC Jobs Dataset (Filtered Columns) [Dataset]. https://www.kaggle.com/datasets/jefferymandrake/nyc-jobs-filtered-cols
    Explore at:
    zip (93408 bytes)
    Dataset updated
    Oct 5, 2022
    Authors
    Jeffery Mandrake
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    New York
    Description

    Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial

    The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data

    I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing

    Once the csv file is uploaded to Google Colab, use these commands to process the file.

    import pandas as pd

    # load the file and create a pandas dataframe
    df = pd.read_csv('/content/NYC_Jobs.csv')

    # keep only these columns
    df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
             'Job Category', 'Salary Range From', 'Salary Range To']]

    # save the csv file without the index column
    df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
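    Since the dataset is meant to accompany a GroupBy tutorial, here is a small sketch of the kind of aggregation it supports, using the columns kept in the filtering step above:

    import pandas as pd

    df = pd.read_csv('/content/NYC_Jobs_filtered_cols.csv')

    # Average posted salary range per agency
    salary_by_agency = (
        df.groupby('Agency')[['Salary Range From', 'Salary Range To']]
          .mean()
          .sort_values('Salary Range To', ascending=False)
    )
    print(salary_by_agency.head())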

  7. TWristAR - wristband activity recognition

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 7, 2022
    Cite
    Lee B. Hinkle; Gentry Atkinson; Vangelis Metsis (2022). TWristAR - wristband activity recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5911807
    Explore at:
    Dataset updated
    Feb 7, 2022
    Dataset provided by
    Texas State University
    Authors
    Lee B. Hinkle; Gentry Atkinson; Vangelis Metsis
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWristAR is a small, three-subject dataset recorded using an e4 wristband. Each subject performed six scripted activities: upstairs/downstairs, walk/jog, and sit/stand. Each activity except stairs was performed for one minute, a total of three times, alternating between the pairs. Subjects 1 and 2 also completed a walking sequence of approximately 10 minutes. The dataset contains motion (accelerometer) data, temperature, electrodermal activity, and heart rate data. The .csv file associated with each data file contains timing and labeling information and was built using the provided Excel files.

    Each two-activity session was recorded using a downward-facing action camera. This video was used to generate the labels and is provided to investigate any data anomalies, especially for the free-form long walk. For privacy reasons, only the sub1_stairs video contains audio.

    The Jupyter notebook processes the acceleration data and performs hold-one-subject-out evaluation of a 1D-CNN. Example results from a run performed on a Google Colab GPU instance (without a GPU, training time increases to about 90 seconds per pass):

    Hold-one-subject-out results

        Train Sub   Test Sub   Accuracy   Training Time (HH:MM:SS)
        [1,2]       [3]        0.757      00:00:12
        [2,3]       [1]        0.849      00:00:14
        [1,3]       [2]        0.800      00:00:11
    

    This notebook can also be run in Colab here. This video describes the processing: https://mediaflo.txstate.edu/Watch/e4_data_processing.
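    The evaluation pattern itself is easy to reproduce outside the notebook. Below is a minimal sketch of a hold-one-subject-out loop; the windowed arrays X, y and the per-window subject ids are stand-ins for the notebook's actual preprocessing, and build_model is a hypothetical constructor for a compiled 1D-CNN.

    import numpy as np

    def hold_one_subject_out(X, y, subjects, build_model, epochs=20):
        """Train on two subjects and test on the held-out third, in turn.

        X: windowed sensor data, e.g. shape (n_windows, window_len, n_channels)
        y: integer activity label per window
        subjects: subject id (1, 2 or 3) per window
        build_model: callable returning a fresh Keras 1D-CNN compiled with an accuracy metric
        """
        results = {}
        for test_sub in np.unique(subjects):
            train_idx = subjects != test_sub
            test_idx = subjects == test_sub
            model = build_model()
            model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
            _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
            results[int(test_sub)] = float(acc)
        return results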

    We hope you find this a useful dataset with end-to-end code. We have several papers in process and would appreciate your citation of the dataset if you use it in your work.

  8. Data from: Gravity Spy Machine Learning Classifications of LIGO Glitches from Observing Runs O1, O2, O3a, and O3b

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 30, 2023
    Cite
    Jane Glanzer; Sharan Banagari; Scott Coughlin; Michael Zevin; Sara Bahaadini; Neda Rohani; Sara Allen; Christopher Berry; Kevin Crowston; Mabi Harandi; Corey Jackson; Vicky Kalogera; Aggelos Katsaggelos; Vahid Noroozi; Carsten Osterlund; Oli Patane; Joshua Smith; Siddharth Soni; Laura Trouille (2023). Gravity Spy Machine Learning Classifications of LIGO Glitches from Observing Runs O1, O2, O3a, and O3b [Dataset]. http://doi.org/10.5281/zenodo.5649212
    Explore at:
    csv
    Dataset updated
    Jan 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jane Glanzer; Sharan Banagari; Scott Coughlin; Michael Zevin; Sara Bahaadini; Neda Rohani; Sara Allen; Christopher Berry; Kevin Crowston; Mabi Harandi; Corey Jackson; Vicky Kalogera; Aggelos Katsaggelos; Vahid Noroozi; Carsten Osterlund; Oli Patane; Joshua Smith; Siddharth Soni; Laura Trouille
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains all classifications produced by the Gravity Spy machine learning model for LIGO glitches from the first three observing runs (O1, O2 and O3, where O3 is split into O3a and O3b). Gravity Spy classified all noise events identified by the Omicron trigger pipeline for which Omicron determined that the signal-to-noise ratio was above 7.5 and the peak frequency of the noise event was between 10 Hz and 2048 Hz. To classify noise events, Gravity Spy made Omega scans of every glitch at 4 different durations, which helps capture the morphology of noise events that are both short and long in duration.

    There are 22 classes used for the O1 and O2 data (including No_Glitch and None_of_the_Above), while two additional classes are used to classify the O3 data (and None_of_the_Above was removed).

    For O1 and O2, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Chirp, Extremely_Loud, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle

    For O3, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Blip_Low_Frequency, Chirp, Extremely_Loud, Fast_Scattering, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle

    The data set is described in Glanzer et al. (2023), which we ask to be cited in any publications using this data release. Example code using the data can be found in this Colab notebook.

    If you would like to download the Omega scans associated with each glitch, then you can use the gravitational-wave data-analysis tool GWpy. If you would like to use this tool, please install anaconda if you have not already and create a virtual environment using the following command

    conda create --name gravityspy-py38 -c conda-forge python=3.8 gwpy pandas psycopg2 sqlalchemy

    After downloading one of the CSV files for a specific era and interferometer, please run the following Python script if you would like to download the data associated with the metadata in the CSV file. We recommend not trying to download too many images at one time. For example, the script below will read data on Hanford glitches from O2 that were classified by Gravity Spy and filter for only glitches that were labelled as Blips with 90% confidence or higher, and then download the first 4 rows of the filtered table.

    from gwpy.table import GravitySpyTable

    # Read the metadata for Hanford glitches from O2
    H1_O2 = GravitySpyTable.read('H1_O2.csv')

    # Keep only glitches labelled as Blip with confidence above 0.9
    blips = H1_O2[(H1_O2["ml_label"] == "Blip") & (H1_O2["ml_confidence"] > 0.9)]

    # Download the Omega scans for the first 4 rows of the filtered table
    blips[0:4].download(nproc=1)

    Each of the columns in the CSV files are taken from various different inputs:

    [‘event_time’, ‘ifo’, ‘peak_time’, ‘peak_time_ns’, ‘start_time’, ‘start_time_ns’, ‘duration’, ‘peak_frequency’, ‘central_freq’, ‘bandwidth’, ‘channel’, ‘amplitude’, ‘snr’, ‘q_value’] contain metadata about the signal from the Omicron pipeline.

    [‘gravityspy_id’] is the unique identifier for each glitch in the dataset.

    [‘1400Ripples’, ‘1080Lines’, ‘Air_Compressor’, ‘Blip’, ‘Chirp’, ‘Extremely_Loud’, ‘Helix’, ‘Koi_Fish’, ‘Light_Modulation’, ‘Low_Frequency_Burst’, ‘Low_Frequency_Lines’, ‘No_Glitch’, ‘None_of_the_Above’, ‘Paired_Doves’, ‘Power_Line’, ‘Repeating_Blips’, ‘Scattered_Light’, ‘Scratchy’, ‘Tomte’, ‘Violin_Mode’, ‘Wandering_Line’, ‘Whistle’] contain the machine learning confidence for a glitch being in a particular Gravity Spy class (the confidence in all these columns should sum to unity). These use the original 22 classes in all cases.

    [‘ml_label’, ‘ml_confidence’] provide the machine-learning predicted label for each glitch, and the machine learning confidence in its classification.

    [‘url1’, ‘url2’, ‘url3’, ‘url4’] are the links to the publicly available Omega scans for each glitch. ‘url1’ shows the glitch for a duration of 0.5 seconds, ‘url2’ for 1 second, ‘url3’ for 2 seconds, and ‘url4’ for 4 seconds.

    For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.


    For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.

  9. Oslo City Bike Open Data

    • kaggle.com
    zip
    Updated Nov 8, 2025
    Cite
    stanislav_o27 (2025). Oslo City Bike Open Data [Dataset]. https://www.kaggle.com/datasets/stanislavo27/oslo-city-bike-open-data
    Explore at:
    zip (251012812 bytes)
    Dataset updated
    Nov 8, 2025
    Authors
    stanislav_o27
    Area covered
    Oslo
    Description

    Source: https://oslobysykkel.no/en/open-data/historical

    I am not the author of the data; I only compiled and structured it from the source above using a Python script.

    oslo-city-bike license: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, anyone has full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).

    Dataset structure

    The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because similar data also exists for Trondheim and Bergen.

    Variables

    From oslobysykkel.no:

        Variable                     Format                       Description
        started_at                   Timestamp                    Timestamp of when the trip started
        ended_at                     Timestamp                    Timestamp of when the trip ended
        duration                     Integer                      Duration of trip in seconds
        start_station_id             String                       Unique ID for start station
        start_station_name           String                       Name of start station
        start_station_description    String                       Description of where start station is located
        start_station_latitude       Decimal degrees in WGS84     Latitude of start station
        start_station_longitude      Decimal degrees in WGS84     Longitude of start station
        end_station_id               String                       Unique ID for end station
        end_station_name             String                       Name of end station
        end_station_description      String                       Description of where end station is located
        end_station_latitude         Decimal degrees in WGS84     Latitude of end station
        end_station_longitude        Decimal degrees in WGS84     Longitude of end station

    Please note: this data and my analysis focus on the new data format; historical data for April 2016 - December 2018 (Legacy Trip Data) follows a different schema.
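    A minimal loading sketch based on the schema above (the month in the file name is only an example; pick any file from the oslobysykkel folder):

    import pandas as pd

    # Example month; files in the oslobysykkel folder follow oslobysykkel-YYYY-MM.csv
    df = pd.read_csv('oslobysykkel/oslobysykkel-2024-06.csv',
                     parse_dates=['started_at', 'ended_at'])

    # Trips per day and median trip length in minutes
    daily = (
        df.groupby(df['started_at'].dt.date)
          .agg(trips=('duration', 'size'),
               median_minutes=('duration', lambda s: s.median() / 60))
    )
    print(daily.head())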

    Motivation

    I was extremely fascinated by this open data from Oslo City Bike, and in the course of a deep analysis I saw broad prospects. This interest turned into an idea to create a data-analytics problem book, or even a platform, an 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (clustering, forecasting), and so that anyone can take part in analysis and modeling based on this exciting data.

    Autumn's remake of Oslo bike sharing data analysis: https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing

    https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view

    Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project. Stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.

    Index of my notebooks:

    Phase 1: Cleaned Data & Core Insights
    Time-Space Dynamics Exploratory

    Challenge Ideas

    Clustering and Segmentation
    Demand Forecasting (Time Series)
    Geospatial Analysis (Network Analysis)

    Resources & Related Work

    Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis

    Links to works I have found or that have inspired me:

    Exploring Open Data from Oslo City Bike, by Jon Olave: visualization of popular routes and seasonality analysis.

    Oslo City Bike Data Wrangling, by Karl Tryggvason: predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).

    Helsinki City Bikes: Exploratory Data Analysis. An analysis of a similar system in Helsinki, useful for comparative studies and methodological ideas.

    External Data Sources

    The idea is to connect this data with other sources. For example, I did this with weather data, integrating temperature, precipitation, and wind speed to explain variations in daily demand: https://meteostat.net/en/place/no/oslo

    I also used data from Airbnb (that is where I took the division into neighbourhoods): https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv

    Tags: oslo, bike-sharing, eda, feature-engineering, geospatial, time-series

  10. Cellpose models for Label Prediction from Brightfield and Digital Phase Contrast images

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Feb 28, 2022
    Cite
    Romain Guiet; Olivier Burri (2022). Cellpose models for Label Prediction from Brightfield and Digital Phase Contrast images [Dataset]. http://doi.org/10.5281/zenodo.6140111
    Explore at:
    zip, bin
    Dataset updated
    Feb 28, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Romain Guiet; Olivier Burri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Name: Cellpose models for Brightfield and Digital Phase Contrast images

    Data type: Cellpose models trained via transfer learning from the 'nuclei' and 'cyto2' pretrained models with an additional training dataset. Includes corresponding CSV files with 'Quality Control' metrics(§) (model.zip).

    Training Dataset: Light microscopy (Digital Phase Contrast or Brightfield) and automatic annotations (nuclei or cyto) (https://doi.org/10.5281/zenodo.6140064)

    Training Procedure: The Cellpose models were trained using Cellpose version 1.0.0 with GPU support (NVIDIA GeForce K40) using default settings as per the Cellpose documentation. Training was done using a Renku environment (renku template).

    Command Line Execution for the different trained models

    nuclei_from_bf:

    cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _bf --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose

    cyto_from_bf:

    cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _bf --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose

    nuclei_from_dpc:

    cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _dpc --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose

    cyto_from_dpc:

    cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _dpc --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose

    nuclei_from_sqrdpc:

    cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _sqrdpc --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose

    cyto_from_sqrdpc:

    cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _sqrdpc --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose
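    To apply one of the trained models to new images, a minimal sketch using the Cellpose Python API could look like the following. The model path and image file are placeholders, and the API has shifted between Cellpose releases, so check the documentation for the version you install.

    from cellpose import io, models

    # Path to one of the trained models unpacked from model.zip (placeholder name)
    model = models.CellposeModel(gpu=False, pretrained_model='models/nuclei_from_bf')

    img = io.imread('example_bf_image.tif')  # a single-channel brightfield image

    # --chan 0 --chan2 0 in the training commands corresponds to channels=[0, 0]
    masks, flows, styles = model.eval(img, channels=[0, 0], diameter=None)
    print(masks.max(), 'objects segmented')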

    NOTE (§): We provide a notebook for Quality Control, which is an adaptation of the "Cellpose (2D and 3D)" notebook from ZeroCostDL4Mic.

    NOTE: This dataset used a training dataset from the Zenodo entry (https://doi.org/10.5281/zenodo.6140064), which was generated from the "HeLa 'Kyoto' cells under the scope" Zenodo entry (https://doi.org/10.5281/zenodo.6139958), in order to automatically generate the label images.

    NOTE: Make sure that you delete the “_flow” images that are auto-computed when running the training. If you do not, then the flows from previous runs will be used for the new training, which might yield confusing results.

  11. Social Media and Mental Health

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    SouvikAhmed071 (2023). Social Media and Mental Health [Dataset]. https://www.kaggle.com/datasets/souvikahmed071/social-media-and-mental-health
    Explore at:
    zip (10944 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    SouvikAhmed071
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    This dataset was originally collected for a data science and machine learning project that aimed at investigating the potential correlation between the amount of time an individual spends on social media and the impact it has on their mental health.

    The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.

    This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.

    The following is the Google Colab link to the project, done in a Jupyter notebook:

    https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN

    The following is the GitHub Repository of the project -

    https://github.com/daerkns/social-media-and-mental-health

    Libraries used for the Project -

    Pandas
    Numpy
    Matplotlib
    Seaborn
    Scikit-learn
    
  12. A Capstone Project - Cyclistic shared bikes

    • kaggle.com
    zip
    Updated May 1, 2021
    Cite
    Jenny_Ai (2021). A Capstone Project - Cyclistic shared bikes [Dataset]. https://www.kaggle.com/datasets/jennyai/share-bikes/discussion
    Explore at:
    zip (158167874 bytes)
    Dataset updated
    May 1, 2021
    Authors
    Jenny_Ai
    Description

    Context

    This is one of the case studies in the Capstone project of the Google Data Analytics Certificate.

    Case Study 1: How does a bike-share navigate speedy success?

    Scenario

    A bike-share company, Cyclistic, is trying to increase profits in the coming years and is looking for realistic business strategies. The director of marketing believes maximizing the number of annual memberships would be the most efficient way. Therefore, the analyst team wants to understand how casual riders and annual members use Cyclistic bikes differently, so that they can design a new marketing strategy to convert casual riders into annual members.

    About the company

    Cyclistic launched a successful bike-share offering in 2016. After five years of growth, Cyclistic now has more than 5,000 geotracked bicycles locked into a network of 692 stations across Chicago. The bikes can be unlocked at one station and returned to any other station in the system at any time. Apart from classic bicycles, Cyclistic also offers bikes for riders with disabilities and electric bikes. Most users ride the shared bikes for leisure, but 30% use them to commute.

    Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.

    Business tasks

    1. Maximize the number of members by converting casual riders into annual members.
    2. Make recommendations on how digital media could affect their marketing tactics.

    Content

    The data are Cyclistic's historical trip data for the past 12 months (202004 to 202103). The data has been made available by Motivate International Inc. under this license (https://www.divvybikes.com/data-license-agreement). The original data was divided into several CSV files by month, and the tables shown in this notebook were joined in BigQuery before being uploaded.

    The union_tables includes 'ride_id', 'rideable_type', 'started_at', 'ended_at', 'start_station_name', and 'end_station_name'; the union_tables_geo includes 'start_station_name', 'end_station_name', 'start_lat', 'start_lng', 'end_lat', and 'end_lng'.

    Tools for data preparation, process and cleaning

    Google Sheet, BigQuery, Python (Colab link: https://colab.research.google.com/drive/1_G0bh_Anbl-i41HnssK9qDeANhFAiaON#scrollTo=lz2H9aCSyzdV)

    Data Visualization

    Please check up the dashboard here: https://public.tableau.com/profile/jing.ai#!/vizhome/shared_bike_202105/Dashboard1

    Conclusions

    Based on the trip record data from the past 12 months: 1. The number of casual users increased remarkably in the summer months, in the afternoons, and on weekends; 2. The busiest stations have more casual users; 3. Casual users tend to ride for longer than annual members; 4. Classic bikes were replacing the docked bikes.

    Recommendations

    1. Run promotions on digital media in the early summer, particularly in the afternoons and on weekends, to let more people know about Cyclistic bikes;
    2. Mention reasonable discounts in the promotion for users who upgrade to the membership;
    3. Highlight that classic bikes have been replacing the docked bikes, which may bring more flexibility and convenience for users.

    Other thoughts

    If time and budget allow, a survey would help to understand what stops casual users from upgrading to annual membership: price, bike quality, infrequent use, etc. Additionally, ride records for individual users could also help identify what annual members or casual users have in common. We could then think about how to improve the business strategies based on the above analysis.

    Let me know what you think about my capstone project (it is literally my first data analysis project). I would very much appreciate any comments. Thanks! :)

  13. AI-Pastiche

    • kaggle.com
    zip
    Updated Jul 1, 2025
    Cite
    asperti@cs.unibo.it (2025). AI-Pastiche [Dataset]. https://www.kaggle.com/datasets/asperticsuniboit/deepfakedatabase
    Explore at:
    zip (1385996918 bytes)
    Dataset updated
    Jul 1, 2025
    Authors
    asperti@cs.unibo.it
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    AI-Pastiche is a carefully curated dataset comprising 953 AI-generated paintings in well-known artistic styles. It was created using 73 manually crafted text prompts, used to test 12 modern image generators, with one or more images generated for each of the selected models. The dataset includes comprehensive metadata describing the details of the generation process. Full information about the models can be found in the article A Critical Assessment of Modern Generative Models' Ability to Replicate Artistic Styles

    You may use this notebook for a fast access to the dataset: AI-Pastiche_demo.

    Metadata Description

    Metadata comprise the following columns:

    - generative model: the model used to generate the image
    - prompt: the prompt passed as input to the generator
    - subject: a collection of comma-separated tags describing the content (as described in the prompt)
    - style: the style to be imitated
    - period: the period the generated image should belong to
    - generated image: the name of the generated image in the generated_images dataset

    Three additional columns provide human metrics collected through extensive surveys. All values are in the range [0,1].

    - defects: presence of notable defects and artifacts (0 = no evident defects, 1 = major problems)
    - authenticity: perceived authenticity of the sample (please check the article for details about this metric)
    - adherence: adherence of the sample to the prompt request
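    For a quick pass over the metadata, a minimal sketch; the metadata file name is a placeholder (use the CSV shipped with the dataset), and note that the column names contain spaces as listed above.

    import pandas as pd

    # Placeholder file name; substitute the metadata CSV included in the dataset
    meta = pd.read_csv('metadata.csv')

    # Keep samples with few visible defects and high adherence to the prompt
    clean = meta[(meta['defects'] < 0.2) & (meta['adherence'] > 0.8)]
    print(clean[['generative model', 'style', 'generated image']].head())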

  14. OpenOrca

    • kaggle.com
    • opendatalab.com
    • +1 more
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). OpenOrca [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset/versions/2
    Explore at:
    zip (2548102631 bytes)
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open-Orca Augmented FLAN Dataset

    Unlocking Advanced Language Understanding and ML Model Performance

    By Huggingface Hub [source]

    About this dataset

    The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.

    Getting Started

    The first step is to download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage. Once you have downloaded the data set, launch the Jupyter Notebook or Google Colab environment in which you want to work with this data set.

    Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, import them into a Pandas DataFrame as shown below. You can use other libraries as needed:

    import pandas as pd   # library used for importing datasets into Python

    df = pd.read_csv('train.csv')   # imports the train CSV file into a Pandas DataFrame

    df[['system_prompt', 'question', 'response']].head()   # views the top 5 rows of the columns 'system_prompt', 'question', 'response'
    

    After importing, check each feature using basic descriptive statistics such as Pandas groupby or value_counts statements; these give greater clarity over the variables (elements) present in each feature. The command below shows the count of each element in the system_prompt column of the train CSV file:

     df['system_prompt'].value_counts().head()   # shows the count of each element in the 'system_prompt' column

     # Example output (illustrative):
     # User says hello guys         587
     # System asks How are you?     555
     # User says I am doing good    487
     # ...
    

    Data Transformation: After inspecting and exploring the different features, you may want or need certain changes that best suit your needs before training modeling algorithms on this dataset.
    Common transformation steps include removing punctuation marks: since punctuation may not add any value to computation, it can be removed with a regex, e.g. .str.replace('[^A-Za-z ]+', '', regex=True).

    Research Ideas

    • Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
    • Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
    • Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...
