License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Reservoir and Lake Surface Area Timeseries (ReaLSAT) dataset provides an unprecedented reconstruction of surface area variations of lakes and reservoirs at a global scale using Earth Observation (EO) data and novel machine learning techniques. The dataset provides monthly surface area variations (1984 to 2020) for 681,137 water bodies located below 50°N and larger than 0.1 square kilometers.
The dataset contains the following files:
1) ReaLSAT.zip: A shapefile that contains the reference shape of waterbodies in the dataset.
2) monthly_timeseries.zip: contains one CSV file for each water body, giving its monthly surface area variation values. The CSV files are stored in a subfolder corresponding to each 10 degree by 10 degree cell. For example, the monthly_timeseries_60_-50 folder contains CSV files for lakes that lie between 60°E and 70°E longitude and between 50°S and 40°S latitude.
3) monthly_shapes_.zip: contains a GeoTIFF for each water body that lies within the corresponding 10 degree by 10 degree cell. Please refer to the visualization notebook for how to use these GeoTIFFs.
4) evaluation_data.zip: contains the random subsets of the dataset used for evaluation. The zip file contains a README file that describes the evaluation data.
6) generate_realsat_timeseries.ipynb: a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any water body.
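Given the folder layout described above, a minimal loading sketch might look like the following. The file name and the column names are assumptions for illustration only; inspect the actual CSV headers (or use the provided notebook) before relying on them.

import pandas as pd

# Hypothetical example: a water body whose CSV sits in the 60E-70E / 50S-40S cell.
ts = pd.read_csv('monthly_timeseries_60_-50/123456.csv')
print(ts.head())

# If the file exposes date-like and area-like columns, a quick monthly plot could be:
# ts.plot(x='date', y='area')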
Please refer to the following papers to learn more about the processing pipeline used to create the ReaLSAT dataset:
[1] Khandelwal, Ankush, Anuj Karpatne, Praveen Ravirathinam, Rahul Ghosh, Zhihao Wei, Hilary A. Dugan, Paul C. Hanson, and Vipin Kumar. "ReaLSAT, a global dataset of reservoir and lake surface area variations." Scientific data 9, no. 1 (2022): 1-12.
[2] Khandelwal, Ankush. "ORBIT (Ordering Based Information Transfer): A Physics Guided Machine Learning Framework to Monitor the Dynamics of Water Bodies at a Global Scale." (2019).
Version Updates
Version 2.0:
extends the dataset to 2020.
provides geotiffs instead of shapefiles for individual lakes to reduce dataset size.
provides a notebook to visualize the updated dataset.
Version 1.4: added 1120 large lakes to the dataset and removed partial lakes that overlapped with these large lakes.
Version 1.3: fixed a visualization-related bug in generate_realsat_timeseries.ipynb.
Version 1.2: added a Google Colab notebook that provides the code to generate timeseries and surface extent maps for any water body in the ReaLSAT database.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides information about top-rated TV shows, collected from The Movie Database (TMDb) API. It can be used for data analysis, recommendation systems, and insights on popular television content.
Key Stats:
- Total Pages: 109
- Total Results: 2,098 TV shows
- Data Source: TMDb API
- Sorting Criteria: highest-rated by vote_average (average rating), with a minimum vote count of 200

Data Fields (Columns):
- id: Unique identifier for the TV show
- name: Title of the TV show
- vote_average: Average rating given by users
- vote_count: Total number of votes received
- first_air_date: The date when the show was first aired
- original_language: Language in which the show was originally produced
- genre_ids: Genre IDs linked to the show's genres
- overview: A brief summary of the show
- popularity: Popularity score based on audience engagement
- poster_path: URL path for the show's poster image

Accessing the Dataset via API (Python Example):
import requests

api_key = 'YOUR_API_KEY_HERE'
url = "https://api.themoviedb.org/3/discover/tv"
params = {
    'api_key': api_key,
    'include_adult': 'false',
    'language': 'en-US',
    'page': 1,
    'sort_by': 'vote_average.desc',
    'vote_count.gte': 200
}

response = requests.get(url, params=params)
data = response.json()

print(data['results'][0])
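Since the results span about 109 pages, the full set can be collected with a simple pagination loop. This is a sketch reusing the url and params defined above; total_pages and results are fields of the TMDb discover response.

all_results = []
page = 1
while True:
    params['page'] = page
    data = requests.get(url, params=params).json()
    all_results.extend(data.get('results', []))
    if page >= data.get('total_pages', 1):
        break
    page += 1

print(len(all_results))  # roughly 2,098 shows at the time of collection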
Dataset Use Cases:
- Data Analysis: Explore trends in highly-rated TV shows.
- Recommendation Systems: Build personalized TV show suggestions.
- Visualization: Create charts to showcase ratings or genre distribution.
- Machine Learning: Predict show popularity using historical data.

Exporting and Sharing the Dataset (Google Colab Example):
import pandas as pd

df = pd.DataFrame(data['results'])

from google.colab import drive
drive.mount('/content/drive')
df.to_csv('/content/drive/MyDrive/top_rated_tv_shows.csv', index=False)

Ways to Share the Dataset:
- Google Drive: Upload and share a public link.
- Kaggle: Create a public dataset for collaboration.
- GitHub: Host the CSV file in a repository for easy sharing.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset is a Neo4j knowledge graph based on TMF Business Process Framework v22.0 data.
CSV files contain data about the model entities, and the JSON file contains the knowledge graph mapping.
The script used to generate the CSV files from the XML model can be found here.
To import the dataset, download the zip archive and upload it to Neo4j.
You can also check out this dataset here.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the study The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries. It contains country-year panel data for 2000–2023 covering both OECD economies and the ten largest Latin American countries by land area. Variables include GDP per capita (constant PPP, USD), trade openness, internet penetration, education indicators, cultural exports per capita, and executive constraints from the Polity V dataset.
The dataset supports a comparative analysis of how economic structure, institutional quality, and infrastructure shape cultural export performance across development contexts. Within-country fixed effects models show that trade openness constrains cultural exports in OECD economies but has no measurable effect in resource-dependent Latin America. In contrast, strong executive constraints benefit cultural industries in advanced economies while constraining them in extraction-oriented systems. The results provide empirical evidence for a two-stage development framework in which colonial extraction legacies create distinct constraints on creative industry growth.
All variables are harmonized to ISO3 country codes and aligned on a common panel structure. The dataset is fully reproducible using the included Jupyter notebooks (OECD.ipynb, LATAM+OECD.ipynb, cervantes.ipynb).
Contents:
GDPPC.csv — GDP per capita series from the World Bank.
explanatory.csv — Trade openness, internet penetration, and education indicators.
culture_exports.csv — UNESCO cultural export data.
p5v2018.csv — Polity V institutional indicators.
Jupyter notebooks for data processing and replication.
Potential uses: Comparative political economy, cultural economics, institutional development, and resource curse research.
These steps reproduce the OECD vs. Latin America analyses from the paper using the provided CSVs and notebooks.
Open Google Colab and click File → New notebook.
(Optional) If your files are in Google Drive, mount it:
from google.colab import drive
drive.mount('/content/drive')
You have two easy options:
A. Upload the 4 CSVs + notebooks directly
In the left sidebar, click the folder icon → Upload.
Upload: GDPPC.csv, explanatory.csv, culture_exports.csv, p5v2018.csv, and any .ipynb you want to run.
B. Use Google Drive
Put those files in a Drive folder.
After mounting Drive, refer to them with paths like /content/drive/MyDrive/your_folder/GDPPC.csv.
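Once the CSVs are available, a minimal sketch for assembling the country-year panel might look like the following. The path prefix and the key column names ('iso3', 'year') are assumptions for illustration; the included notebooks define the actual processing and harmonization steps.

import pandas as pd

base = '/content/'  # or '/content/drive/MyDrive/your_folder/' if using Drive

# Column names below are assumed; check the actual headers in each file before merging.
gdp = pd.read_csv(base + 'GDPPC.csv')
expl = pd.read_csv(base + 'explanatory.csv')
culture = pd.read_csv(base + 'culture_exports.csv')
polity = pd.read_csv(base + 'p5v2018.csv')

panel = (gdp
         .merge(expl, on=['iso3', 'year'], how='inner')
         .merge(culture, on=['iso3', 'year'], how='inner')
         .merge(polity, on=['iso3', 'year'], how='left'))
print(panel.shape)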
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
In the competition LLM - Detect AI Generated Text, there is a problem: the data is imbalanced, with far fewer LLM-generated essays than essays written by students. You can see an EDA of the competition data in this notebook: AI or Not AI? Delving Into Essays with EDA
To fix this imbalance, I made my own dataset of LLM-generated essays, generated by PaLM. You can see how the data was generated in this Google Colaboratory notebook:
Generate LLM dataset using PaLM.ipynb
The reason I could not generate the data in a Kaggle Notebook is that Kaggle Notebooks run on Kaggle's Docker image, which could not use my own PaLM API key (my opinion).
Column in data:
- id: index
- prompt_id: the prompt that was used to generate the data. You can check the prompts here!
- text: LLM-generated essay by PaLM based on prompt
- generated: additional information showing that the essay was LLM-generated. All values in this column are 1.
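A minimal sketch of how this dataset could be combined with the competition's training data to reduce the class imbalance. The file names are assumptions for illustration; adjust them to your local copies.

import pandas as pd

train = pd.read_csv('train_essays.csv')          # competition essays (mostly generated == 0)
palm = pd.read_csv('llm_generated_essays.csv')   # this dataset (generated == 1 for all rows)

combined = pd.concat(
    [train[['text', 'generated']], palm[['text', 'generated']]],
    ignore_index=True,
)
print(combined['generated'].value_counts())  # check how balanced the classes are now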
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Use this dataset with Misra's Pandas tutorial: How to use the Pandas GroupBy function | Pandas tutorial
The original dataset came from this site: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
I used Google Colab to filter the columns with the following Pandas commands. Here's a Colab Notebook you can use with the commands listed below: https://colab.research.google.com/drive/17Jpgeytc075CpqDnbQvVMfh9j-f4jM5l?usp=sharing
Once the CSV file is uploaded to Google Colab, use these commands to process the file.
import pandas as pd

# load the file and create a pandas dataframe
df = pd.read_csv('/content/NYC_Jobs.csv')

# keep only these columns
df = df[['Job ID', 'Civil Service Title', 'Agency', 'Posting Type',
         'Job Category', 'Salary Range From', 'Salary Range To']]

# save the csv file without the index column
df.to_csv('/content/NYC_Jobs_filtered_cols.csv', index=False)
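Since this dataset accompanies a GroupBy tutorial, here is a short GroupBy example on the filtered file. The aggregation shown is illustrative and not taken from the tutorial itself; column names come from the filtering step above.

import pandas as pd

df = pd.read_csv('/content/NYC_Jobs_filtered_cols.csv')

# Average salary range per agency, sorted by the upper bound.
salary_by_agency = (
    df.groupby('Agency')[['Salary Range From', 'Salary Range To']]
      .mean()
      .sort_values('Salary Range To', ascending=False)
)
print(salary_by_agency.head())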
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TWristAR is a small three-subject dataset recorded using an E4 wristband. Each subject performed six scripted activities: upstairs/downstairs, walk/jog, and sit/stand. Each activity except stairs was performed for one minute, a total of three times, alternating between the pairs. Subjects 1 and 2 also completed a walking sequence of approximately 10 minutes. The dataset contains motion (accelerometer) data, temperature, electrodermal activity, and heart rate data. The .csv file associated with each data file contains timing and labeling information and was built using the provided Excel files.
Each two-activity session was recorded using a downward-facing action camera. This video was used to generate the labels and is provided to investigate any data anomalies, especially for the free-form long walk. For privacy reasons, only the sub1_stairs video contains audio.
The Jupyter notebook processes the acceleration data and performs hold-one-subject-out evaluation of a 1D-CNN. Example results from a run performed on a Google Colab GPU instance (without a GPU, training time increases to about 90 seconds per pass):
Hold-one-subject-out results:

Train Sub   Test Sub   Accuracy   Training Time (HH:MM:SS)
[1,2]       [3]        0.757      00:00:12
[2,3]       [1]        0.849      00:00:14
[1,3]       [2]        0.800      00:00:11
This notebook can also be run in Colab here. This video describes the processing: https://mediaflo.txstate.edu/Watch/e4_data_processing.
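For orientation, here is a minimal sketch of a 1D-CNN for windowed 3-axis accelerometer data. The window length, layer sizes, and class count are assumptions for illustration and not the architecture used in the provided notebook.

from tensorflow import keras

WINDOW_LEN, N_CHANNELS, N_CLASSES = 96, 3, 6  # hypothetical shapes

model = keras.Sequential([
    keras.layers.Input(shape=(WINDOW_LEN, N_CHANNELS)),
    keras.layers.Conv1D(64, kernel_size=5, activation='relu'),
    keras.layers.MaxPooling1D(2),
    keras.layers.Conv1D(64, kernel_size=5, activation='relu'),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(N_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Hold-one-subject-out: fit on windows from two subjects, evaluate on the held-out subject.
# X_train, y_train, X_test, y_test would come from the dataset's windowed accelerometer CSVs.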
We hope you find this a useful dataset with end-to-end code. We have several papers in process and would appreciate your citation of the dataset if you use it in your work.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains all classifications made by the Gravity Spy machine learning model for LIGO glitches from the first three observing runs (O1, O2 and O3, where O3 is split into O3a and O3b). Gravity Spy classified all noise events identified by the Omicron trigger pipeline for which Omicron determined that the signal-to-noise ratio was above 7.5 and the peak frequency was between 10 Hz and 2048 Hz. To classify noise events, Gravity Spy made Omega scans of every glitch at 4 different durations, which helps capture the morphology of noise events that are both short and long in duration.
There are 22 classes used for O1 and O2 data (including No_Glitch and None_of_the_Above), while there are two additional classes used to classify O3 data (while None_of_the_Above was removed).
For O1 and O2, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Chirp, Extremely_Loud, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
For O3, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Blip_Low_Frequency, Chirp, Extremely_Loud, Fast_Scattering, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle
The data set is described in Glanzer et al. (2023), which we ask to be cited in any publications using this data release. Example code using the data can be found in this Colab notebook.
If you would like to download the Omega scans associated with each glitch, then you can use the gravitational-wave data-analysis tool GWpy. If you would like to use this tool, please install Anaconda if you have not already and create a virtual environment using the following command:
conda create --name gravityspy-py38 -c conda-forge python=3.8 gwpy pandas psycopg2 sqlalchemy
After downloading one of the CSV files for a specific era and interferometer, please run the following Python script if you would like to download the data associated with the metadata in the CSV file. We recommend not trying to download too many images at one time. For example, the script below will read data on Hanford glitches from O2 that were classified by Gravity Spy and filter for only glitches that were labelled as Blips with 90% confidence or higher, and then download the first 4 rows of the filtered table.
from gwpy.table import GravitySpyTable

# Read the metadata for Hanford glitches from O2.
H1_O2 = GravitySpyTable.read('H1_O2.csv')

# Keep only glitches labelled as Blip with at least 90% confidence.
blips = H1_O2[(H1_O2["ml_label"] == "Blip") & (H1_O2["ml_confidence"] > 0.9)]

# Download the Omega scans for the first 4 rows of the filtered table.
blips[0:4].download(nproc=1)
The columns in the CSV files are taken from several different inputs:
[‘event_time’, ‘ifo’, ‘peak_time’, ‘peak_time_ns’, ‘start_time’, ‘start_time_ns’, ‘duration’, ‘peak_frequency’, ‘central_freq’, ‘bandwidth’, ‘channel’, ‘amplitude’, ‘snr’, ‘q_value’] contain metadata about the signal from the Omicron pipeline.
[‘gravityspy_id’] is the unique identifier for each glitch in the dataset.
[‘1400Ripples’, ‘1080Lines’, ‘Air_Compressor’, ‘Blip’, ‘Chirp’, ‘Extremely_Loud’, ‘Helix’, ‘Koi_Fish’, ‘Light_Modulation’, ‘Low_Frequency_Burst’, ‘Low_Frequency_Lines’, ‘No_Glitch’, ‘None_of_the_Above’, ‘Paired_Doves’, ‘Power_Line’, ‘Repeating_Blips’, ‘Scattered_Light’, ‘Scratchy’, ‘Tomte’, ‘Violin_Mode’, ‘Wandering_Line’, ‘Whistle’] contain the machine learning confidence for a glitch being in a particular Gravity Spy class (the confidence in all these columns should sum to unity). These use the original 22 classes in all cases.
[‘ml_label’, ‘ml_confidence’] provide the machine-learning predicted label for each glitch, and the machine learning confidence in its classification.
[‘url1’, ‘url2’, ‘url3’, ‘url4’] are the links to the publicly available Omega scans for each glitch. ‘url1’ shows the glitch for a duration of 0.5 seconds, ‘url2’ for 1 second, ‘url3’ for 2 seconds, and ‘url4’ for 4 seconds.
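As a quick sanity check on the column layout described above, the per-class confidence columns should sum to roughly 1 for every glitch. This is a sketch (not part of the data release) using plain pandas on one of the CSV files.

import pandas as pd

class_cols = ['1400Ripples', '1080Lines', 'Air_Compressor', 'Blip', 'Chirp',
              'Extremely_Loud', 'Helix', 'Koi_Fish', 'Light_Modulation',
              'Low_Frequency_Burst', 'Low_Frequency_Lines', 'No_Glitch',
              'None_of_the_Above', 'Paired_Doves', 'Power_Line', 'Repeating_Blips',
              'Scattered_Light', 'Scratchy', 'Tomte', 'Violin_Mode',
              'Wandering_Line', 'Whistle']

h1_o2 = pd.read_csv('H1_O2.csv')
print(h1_o2[class_cols].sum(axis=1).describe())   # should be close to 1.0 throughout
print(h1_o2['ml_label'].value_counts().head())    # most common predicted classes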
For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.
For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.
oslo-city-bike
License: Norwegian Licence for Open Government Data (NLOD) 2.0. According to the license, we have full rights to collect, use, modify, and distribute this data, provided the source is clearly indicated (which I do).
The folder oslobysykkel contains all available data from 2019 to 2025, in files named oslobysykkel-YYYY-MM.csv. Why does "oslo" still appear in the file names? Because similar data also exists for Trondheim and Bergen.
Variables (from oslobysykkel.no):

Variable                    Format                      Description
started_at                  Timestamp                   Timestamp of when the trip started
ended_at                    Timestamp                   Timestamp of when the trip ended
duration                    Integer                     Duration of trip in seconds
start_station_id            String                      Unique ID for start station
start_station_name          String                      Name of start station
start_station_description   String                      Description of where start station is located
start_station_latitude      Decimal degrees in WGS84    Latitude of start station
start_station_longitude     Decimal degrees in WGS84    Longitude of start station
end_station_id              String                      Unique ID for end station
end_station_name            String                      Name of end station
end_station_description     String                      Description of where end station is located
end_station_latitude        Decimal degrees in WGS84    Latitude of end station
end_station_longitude       Decimal degrees in WGS84    Longitude of end station
Please note: this data and my analysis focus on the new data format; historical data for the period April 2016 - December 2018 (Legacy Trip Data) follows a different format.
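A minimal loading sketch for files in the format described above, assuming the folder sits next to your notebook; the daily-count aggregation is just one illustrative starting point for the forecasting phase.

import glob
import pandas as pd

files = sorted(glob.glob('oslobysykkel/oslobysykkel-*.csv'))
trips = pd.concat(
    (pd.read_csv(f, parse_dates=['started_at', 'ended_at']) for f in files),
    ignore_index=True,
)

# Daily trip counts, a natural input for demand forecasting.
daily = trips.set_index('started_at').resample('D').size().rename('trips')
print(daily.head())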
I was extremely fascinated by this open data from Oslo City Bike, and in the process of deep analysis I saw broad prospects. This interest turned into an idea to create a data-analytics problem book, or even a platform, an 'exercise bike'. I am publishing this dataset to make it convenient for my own further use in the next phases of the project (Clustering, Forecasting), and so that anyone can participate in analysis and modeling based on this exciting data.
**Autumn's remake of Oslo bike sharing data analysis**: https://colab.research.google.com/drive/1tAxrIWVK5V-ptKLJBdODjy10zHlsppFv?usp=sharing
https://drive.google.com/file/d/17FP9Bd5opoZlw40LRxWtycgJJyXSAdC6/view
Full notebooks with code, visualizations, and commentary will be published soon! This dataset is the backbone of an ongoing project; stay tuned for deeper dives into anomaly detection, station clustering, and interactive learning challenges.
Index of my notebooks:
- Phase 1: Cleaned Data & Core Insights
- Time-Space Dynamics Exploratory
- Clustering and Segmentation
- Demand Forecasting (Time Series)
- Geospatial Analysis (Network Analysis)
Similar dataset https://www.kaggle.com/code/florestancharlaix/oslo-city-bikes-analysis
Links to works I have found or that have inspired me:
Exploring Open Data from Oslo City Bike (Jon Olave): visualization of popular routes and seasonality analysis.
Oslo City Bike Data Wrangling (Karl Tryggvason): predicting bicycle availability at stations, focusing on everyday use (e.g., trips to kindergarten).
Helsinki City Bikes: Exploratory Data Analysis: analysis of a similar system in Helsinki, useful for comparative studies and methodological ideas.
The idea is to connect this with other data. For example, I did this with weather data, integrating temperature, precipitation, and wind speed to explain variations in daily demand: https://meteostat.net/en/place/no/oslo
I also used data from Airbnb (that is where I took the division into neighbourhoods): https://data.insideairbnb.com/norway/oslo/oslo/2025-06-27/visualisations/neighbourhoods.csv
Tags: oslo, bike-sharing, eda, feature-engineering, geospatial, time-series
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Name: Cellpose models for Brightfield and Digital Phase Contrast images
Data type: Cellpose models trained via transfer learning from the ‘nuclei’ and ‘cyto2’ pretrained models with an additional training dataset. Includes corresponding CSV files with 'Quality Control' metrics (§) (model.zip).
Training Dataset: Light microscopy (Digital Phase Contrast or Brightfield) and automatic annotations (nuclei or cyto) (https://doi.org/10.5281/zenodo.6140064)
Training Procedure: The Cellpose models were trained using Cellpose version 1.0.0 with GPU support (NVIDIA GeForce K40) using default settings as per the Cellpose documentation. Training was done using a Renku environment (renku template).
Command Line Execution for the different trained models
nuclei_from_bf:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _bf --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose
cyto_from_bf:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _bf --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose
nuclei_from_dpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _dpc --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose
cyto_from_dpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _dpc --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose
nuclei_from_sqrdpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model nuclei --img_filter _sqrdpc --mask_filter _nuclei --chan 0 --chan2 0 --use_gpu --verbose
cyto_from_sqrdpc:
cellpose --train --dir 'data/train/' --test_dir 'data/test/' --pretrained_model cyto2 --img_filter _sqrdpc --mask_filter _cyto --chan 0 --chan2 0 --use_gpu --verbose
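For inference with one of these trained models, a minimal sketch using the Cellpose Python API is shown below. The model path and image file name are assumptions for illustration; the channel settings mirror the --chan 0 --chan2 0 options used for training.

from cellpose import models, io

# Hypothetical paths: an unzipped model from model.zip and a brightfield test image.
model = models.CellposeModel(gpu=True, pretrained_model='models/cyto_from_bf')
img = io.imread('example_bf.tif')

masks, flows, styles = model.eval(img, channels=[0, 0], diameter=None)
print(int(masks.max()), 'objects segmented')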
NOTE (§): We provide a notebook for Quality Control, which is an adaptation of the "Cellpose (2D and 3D)" notebook from ZeroCostDL4Mic.
NOTE: This dataset used a training dataset from the Zenodo entry (https://doi.org/10.5281/zenodo.6140064), which was generated from the “HeLa ‘Kyoto’ cells under the scope” dataset Zenodo entry (https://doi.org/10.5281/zenodo.6139958) in order to automatically generate the label images.
NOTE: Make sure that you delete the “_flow” images that are auto-computed when running the training. If you do not, then the flows from previous runs will be used for the new training, which might yield confusing results.
License: Open Database License (ODbL) v1.0, https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
This dataset was originally collected for a data science and machine learning project aimed at investigating the potential correlation between the amount of time an individual spends on social media and its impact on their mental health.
The project involves conducting a survey to collect data, organizing the data, and using machine learning techniques to create a predictive model that can determine whether a person should seek professional help based on their answers to the survey questions.
This project was completed as part of a Statistics course at a university, and the team is currently in the process of writing a report and completing a paper that summarizes and discusses the findings in relation to other research on the topic.
The following is the Google Colab link to the project, done on Jupyter Notebook -
https://colab.research.google.com/drive/1p7P6lL1QUw1TtyUD1odNR4M6TVJK7IYN
The following is the GitHub Repository of the project -
https://github.com/daerkns/social-media-and-mental-health
Libraries used for the Project -
Pandas
Numpy
Matplotlib
Seaborn
Scikit-learn
This is one of the case studies in the Capstone project of the Google Data Analytics Certificate.
Case Study 1: How does a bike-share navigate speedy success?
A bike-share company, Cyclistic, is trying to increase the profits in the coming years and looking for realistic business strategies. The director of marketing believes maximizing the number of annual memberships would be the most efficient way. Therefore, the analyst team would like to understand how casual riders and annual members use Cyclistic bikes differently so that they can design a new marketing strategy to convert casual riders into annual members.
Cyclistic launched a successful bike-share offering in 2016. After five years of growth, Cyclistic now has more than 5,000 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time. Apart from classic bicycles, Cyclistic also offers bikes adapted for riders with disabilities as well as electric bikes. Most users ride the shared bikes for leisure, but about 30% use them to commute.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders.
The data are Cyclistic's historical trip data from the past 12 months (202004 to 202103). The data has been made available by Motivate International Inc. under this license (https://www.divvybikes.com/data-license-agreement). The original data was divided into several CSV files by month, and the tables shown in this notebook were joined in BigQuery before being uploaded.
The union_tables includes 'ride_id', 'rideable_type', 'started_at', 'ended_at', 'start_station_name', and 'end_station_name'; the union_tables_geo includes 'start_station_name', 'end_station_name', 'start_lat', 'start_lng', 'end_lat', and 'end_lng'.
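A small sketch of the ride-length calculation behind the duration comparison reported below. The file name and the 'member_casual' rider-type column are assumptions for illustration, since they are not listed among the joined columns above.

import pandas as pd

trips = pd.read_csv('union_tables.csv', parse_dates=['started_at', 'ended_at'])
trips['ride_length_min'] = (trips['ended_at'] - trips['started_at']).dt.total_seconds() / 60

# Average ride length per rider type, if such a column is present in your export.
if 'member_casual' in trips.columns:
    print(trips.groupby('member_casual')['ride_length_min'].mean())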
Tools: Google Sheets, BigQuery, and Python (Colab link: https://colab.research.google.com/drive/1_G0bh_Anbl-i41HnssK9qDeANhFAiaON#scrollTo=lz2H9aCSyzdV)
Please check out the dashboard here: https://public.tableau.com/profile/jing.ai#!/vizhome/shared_bike_202105/Dashboard1
Based on the trip record data from the past 12 months: 1. The number of casual users increased markedly in the summer (within the year), in the afternoon (within the day), and on weekends (within the week); 2. The busiest stations have more casual users; 3. Casual users tend to ride for longer than annual members; 4. Classic bikes were replacing the docked bikes.
If time and budget allow, a survey would help us understand what stops casual users from upgrading to annual memberships: price, bike quality, infrequent use, etc. Additionally, per-user ride records could help identify what annual members or casual users have in common. We could then think about how to improve the business strategies based on that analysis.
Let me know what you think about my capstone project (it is literally my first data analysis project). Any comments would be very much appreciated. Thanks! :)
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
AI-Pastiche is a carefully curated dataset comprising 953 AI-generated paintings in well-known artistic styles. It was created using 73 manually crafted text prompts, used to test 12 modern image generators, with one or more images generated for each of the selected models. The dataset includes comprehensive metadata describing the details of the generation process. Full information about the models can be found in the article A Critical Assessment of Modern Generative Models' Ability to Replicate Artistic Styles
You may use this notebook for a fast access to the dataset: AI-Pastiche_demo.
Metadata Description
Metadata comprise the following columns:
- generative model: the model used to generate the image
- prompt: the prompt passed as input to the generator
- subject: a collection of comma-separated tags describing the content (as described in the prompt)
- style: the style to be imitated
- period: the period the generated image should belong to
- generated image: the name of the generated image in the generated_images dataset
Three additional columns provide human metrics collected through extensive surveys. All values are in the range [0,1].
- defects: presence of notable defects and artifacts (0 = no evident defects, 1 = major problems)
- authenticity: perceived authenticity of the sample (please check the article for details about this metric)
- adherence: adherence of the sample to the prompt request
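A small sketch of filtering the metadata by these human metrics. The metadata file name and the thresholds are assumptions for illustration; column names follow the description above.

import pandas as pd

meta = pd.read_csv('metadata.csv')
good = meta[(meta['defects'] < 0.2) & (meta['authenticity'] > 0.7)]
print(good[['generative model', 'style', 'generated image']].head())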
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!
This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.
Getting Started
The first step is to download the data set from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or cloud storage space. Once you have downloaded the data set, launch the Jupyter Notebook or Google Colab environment you want to work with.
Exploring & Preprocessing Data
To get a better understanding of the features in this dataset, import them into a Pandas DataFrame as shown below. You can use other libraries as needed:
import pandas as pd  # library used for importing datasets into Python

df = pd.read_csv('train.csv')  # imports the train CSV file into a Pandas DataFrame
df[['system_prompt', 'question', 'response']].head()  # views the top 5 rows of the columns 'system_prompt', 'question', 'response'

After importing, check each feature using basic descriptive statistics such as a Pandas groupby or value_counts statement, which gives greater clarity over the values present in each feature. The command below shows the count of each element in the system_prompt column of the train CSV file:
df['system_prompt'].value_counts().head()  # shows the count of each element present in the 'system_prompt' column

Illustrative output:
User says hello guys: 587
System asks How are you?: 555
User says I am doing good: 487
... and so on

Data Transformation: After inspecting and exploring the different features, you may want or need certain changes that best suit your needs before training modeling algorithms on this dataset.
Common transformation steps include removing punctuation marks: since punctuation marks may not add any value to computation operations, we can remove them with a regex replacement such as .str.replace('[^A-Za-z ]+', '', regex=True).
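A minimal sketch of that cleaning step applied to the 'question' column of the DataFrame loaded above; the exact regex and the new column name are illustrative assumptions.

# Keep letters and spaces only; everything else is stripped out.
df['question_clean'] = (
    df['question']
    .astype(str)
    .str.replace('[^A-Za-z ]+', '', regex=True)
)
print(df[['question', 'question_clean']].head())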
- Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
- Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
- Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. [See Other Information](ht...