60 datasets found
  1. watches

    • huggingface.co
    Updated Nov 17, 2025
    Cite
    gil (2025). watches [Dataset]. https://huggingface.co/datasets/yotam22/watches
    Explore at:
    Dataset updated
    Nov 17, 2025
    Authors
    gil
    Description

    🕰️ Exploratory Data Analysis of Luxury Watch Prices

      Overview
    

    This project analyzes a large dataset of luxury watches to understand which factors influence price. We focus on brand, movement type, case material, size, gender, and production year. All work was done in Python (Pandas, NumPy, Matplotlib/Seaborn) on Google Colab.

      Dataset
    

    Rows: ~172,000
    Columns: 14
    Unit of observation: one watch listing

    Main columns

    name – watch/listing title
    price – listed… See the full description on the dataset page: https://huggingface.co/datasets/yotam22/watches.
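
    A minimal loading sketch for the kind of pandas EDA described above, assuming the repository works with the standard Hugging Face datasets loader and exposes a train split (both assumptions, not stated on the dataset page):

    # Sketch only: assumes the repo loads via the standard `datasets` loader
    # and that a "train" split and a `price` column exist as described above.
    from datasets import load_dataset

    ds = load_dataset("yotam22/watches", split="train")
    df = ds.to_pandas()

    print(df.shape)
    print(df["price"].describe())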

  2. Top Rated TV Shows

    • kaggle.com
    zip
    Updated Jan 5, 2025
    Cite
    Shreya Gupta (2025). Top Rated TV Shows [Dataset]. https://www.kaggle.com/datasets/shreyajii/top-rated-tv-shows
    Explore at:
    zip (314571 bytes). Available download formats
    Dataset updated
    Jan 5, 2025
    Authors
    Shreya Gupta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides information about top-rated TV shows, collected from The Movie Database (TMDb) API. It can be used for data analysis, recommendation systems, and insights on popular television content.

    Key Stats:

    Total Pages: 109
    Total Results: 2098 TV shows
    Data Source: TMDb API
    Sorting Criteria: Highest-rated by vote_average (average rating) with a minimum vote count of 200

    Data Fields (Columns):

    id: Unique identifier for the TV show
    name: Title of the TV show
    vote_average: Average rating given by users
    vote_count: Total number of votes received
    first_air_date: The date when the show was first aired
    original_language: Language in which the show was originally produced
    genre_ids: Genre IDs linked to the show's genres
    overview: A brief summary of the show
    popularity: Popularity score based on audience engagement
    poster_path: URL path for the show's poster image

    Accessing the Dataset via API (Python Example):

    import requests

    api_key = 'YOUR_API_KEY_HERE'
    url = "https://api.themoviedb.org/3/discover/tv"
    params = {
        'api_key': api_key,
        'include_adult': 'false',
        'language': 'en-US',
        'page': 1,
        'sort_by': 'vote_average.desc',
        'vote_count.gte': 200
    }

    response = requests.get(url, params=params)
    data = response.json()

    # Display the first show
    print(data['results'][0])

    Dataset Use Cases:

    Data Analysis: Explore trends in highly-rated TV shows.
    Recommendation Systems: Build personalized TV show suggestions.
    Visualization: Create charts to showcase ratings or genre distribution.
    Machine Learning: Predict show popularity using historical data.

    Exporting and Sharing the Dataset (Google Colab Example):

    import pandas as pd

    # Convert the API data to a DataFrame
    df = pd.DataFrame(data['results'])

    # Save to CSV and upload to Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    df.to_csv('/content/drive/MyDrive/top_rated_tv_shows.csv', index=False)

    Ways to Share the Dataset:

    Google Drive: Upload and share a public link.
    Kaggle: Create a public dataset for collaboration.
    GitHub: Host the CSV file in a repository for easy sharing.

  3. Robot@Home2, a robotic dataset of home environments

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Apr 4, 2024
    + more versions
    Cite
    Ambrosio-Cestero, Gregorio; Ruiz-Sarmiento, José Raul; González-Jiménez, Javier (2024). Robot@Home2, a robotic dataset of home environments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3901563
    Explore at:
    Dataset updated
    Apr 4, 2024
    Dataset provided by
    University of Málaga
    Authors
    Ambrosio-Cestero, Gregorio; Ruiz-Sarmiento, José Raul; González-Jiménez, Javier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Robot-at-Home dataset (Robot@Home, paper here) is a collection of raw and processed data from five domestic settings compiled by a mobile robot equipped with 4 RGB-D cameras and a 2D laser scanner. Its main purpose is to serve as a testbed for semantic mapping algorithms through the categorization of objects and/or rooms.

    This dataset is unique in three aspects:

    The provided data were captured with a rig of 4 RGB-D sensors with an overall field of view of 180°H. and 58°V., and with a 2D laser scanner.

    It comprises diverse and numerous data: sequences of RGB-D images and laser scans from the rooms of five apartments (87,000+ observations were collected), topological information about the connectivity of these rooms, and 3D reconstructions and 2D geometric maps of the visited rooms.

    The provided ground truth is dense, including per-point annotations of the categories of the objects and rooms appearing in the reconstructed scenarios, and per-pixel annotations of each RGB-D image within the recorded sequences

    During the data collection, a total of 36 rooms were completely inspected, so the dataset is rich in contextual information of objects and rooms. This is a valuable feature, missing in most of the state-of-the-art datasets, which can be exploited by, for instance, semantic mapping systems that leverage relationships like pillows are usually on beds or ovens are not in bathrooms.

    Robot@Home2

    Robot@Home2 is an enhanced version aimed at improving usability and functionality for developing and testing mobile robotics and computer vision algorithms. It consists of three main components. Firstly, a relational database that states the contextual information and data links, compatible with SQL (Structured Query Language). Secondly, a Python package for managing the database, including downloading, querying, and interfacing functions. Finally, learning resources in the form of Jupyter notebooks, runnable locally or on the Google Colab platform, enabling users to explore the dataset without local installations. These freely available tools are expected to enhance the ease of exploiting the Robot@Home dataset and accelerate research in computer vision and robotics.
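
    As a rough illustration of the relational layout, the SQLite file can be inspected directly with Python's built-in sqlite3 module. This is a sketch only: the local file name rh.db is an assumption, and the dedicated Python package mentioned above remains the supported route.

    import sqlite3

    # Assumption: the Robot@Home2 SQLite database has been downloaded locally as rh.db
    con = sqlite3.connect("rh.db")
    cur = con.cursor()

    # List the tables that encode the contextual information and data links
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    print([row[0] for row in cur.fetchall()])
    con.close()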

    If you use Robot@Home2, please cite the following paper:

    Gregorio Ambrosio-Cestero, Jose-Raul Ruiz-Sarmiento, Javier Gonzalez-Jimenez, The Robot@Home2 dataset: A new release with improved usability tools, in SoftwareX, Volume 23, 2023, 101490, ISSN 2352-7110, https://doi.org/10.1016/j.softx.2023.101490.

    @article{ambrosio2023robotathome2,
      title = {The Robot@Home2 dataset: A new release with improved usability tools},
      author = {Gregorio Ambrosio-Cestero and Jose-Raul Ruiz-Sarmiento and Javier Gonzalez-Jimenez},
      journal = {SoftwareX},
      volume = {23},
      pages = {101490},
      year = {2023},
      issn = {2352-7110},
      doi = {https://doi.org/10.1016/j.softx.2023.101490},
      url = {https://www.sciencedirect.com/science/article/pii/S2352711023001863},
      keywords = {Dataset, Mobile robotics, Relational database, Python, Jupyter, Google Colab}
    }

    Version history

    v1.0.1 Fixed minor bugs.
    v1.0.2 Fixed some inconsistencies in some directory names. Fixes were necessary to automate the generation of the next version.
    v2.0.0 SQL-based dataset. Robot@Home v1.0.2 has been packed into a SQLite database along with RGB-D and scene files, which have been assembled into a hierarchical structured directory free of redundancies. Path tables are also provided to reference files in both v1.0.2 and v2.0.0 directory hierarchies. This version has been automatically generated from version 1.0.2 through the toolbox.
    v2.0.1 A forgotten foreign key pair has been added.
    v2.0.2 The views have been consolidated as tables, which allows a considerable improvement in access time.
    v2.0.3 The previous version did not include the database. In this version the database has been uploaded.
    v2.1.0 Depth images have been updated to 16-bit. Additionally, both the RGB images and the depth images are oriented in the original camera format, i.e. landscape.

  4. NFL Game Data: Scores & Plays (2017-2025)

    • kaggle.com
    zip
    Updated Feb 10, 2025
    Cite
    Keoni Mortensen (2025). NFL Game Data: Scores & Plays (2017-2025) [Dataset]. https://www.kaggle.com/datasets/keonim/nfl-game-scores-dataset-2017-2023/versions/31
    Explore at:
    zip (24653407 bytes). Available download formats
    Dataset updated
    Feb 10, 2025
    Authors
    Keoni Mortensen
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Summary

    This dataset contains detailed information from every game listed on the NFL's official website, https://www.nfl.com/. It aims to provide a complete record of scores along with play-by-play data across all available seasons. This dataset was created with the hope of being a valuable resource for sports analysts and data scientists interested in American football statistics. The dataset was last updated on 02/10/2025.

    Data Collection

    The data was collected using a custom web scraper, which is openly available for review and further development. You can access the scraper code and documentation at the following GitHub repository: https://github.com/KeoniM/NFL_Scraper.git

    Dataset Features For Scores

    - Season: The NFL season the game belongs to.
    - Week: Specific week of the NFL season.
    - GameStatus: Current state or final status of the game.
    - Day: Day of the week the game was played.
    - Date: Exact date (month and day) of the game.
    - AwayTeam: Name of the visiting team.
    - AwayRecord: Season record of the away team at the time of the game.
    - AwayScore: Total points scored by the away team.
    - AwayWin: Boolean indicator if the away team won the game.
    - HomeTeam: Name of the home team.
    - HomeRecord: Season record of the home team at the time of the game.
    - HomeScore: Total points scored by the home team.
    - HomeWin: Boolean indicator if the home team won the game.
    - AwaySeeding: Playoff seeding of the away team, if applicable.
    - HomeSeeding: Playoff seeding of the home team, if applicable.
    - PostSeason: Boolean indicating whether the game is a postseason match.

    Dataset Features For Plays

    - Season: The NFL season the play belongs to.
    - Week: Specific week of the NFL season.
    - Day: Day of the week the play was attempted.
    - Date: Exact date (month and day) the play was attempted.
    - AwayTeam: Name of the visiting team.
    - HomeTeam: Name of the home team.
    - Quarter: The quarter of the game in which the play was attempted.
    - DriveNumber: The drive number within the quarter in which the play was attempted.
    - TeamWithPossession: Team with possession that attempted the play.
    - IsScoringDrive: Whether the drive resulted in a score.
    - PlayNumberInDrive: Play number within the drive.
    - IsScoringPlay: Whether the play resulted in a score.
    - PlayOutcome: Short summary of the attempted play.
    - PlayDescription: In-depth summary of the attempted play.
    - PlayStart: Starting point on the field of the attempted play.
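
    As a quick usage illustration of the Scores features listed above, a hedged pandas sketch that computes the share of home-team wins per season. The CSV file name here is hypothetical; substitute the scores file shipped in the Kaggle download.

    import pandas as pd

    # Hypothetical file name for the scores table from the Kaggle download
    scores = pd.read_csv("nfl_scores.csv")

    # HomeWin is described above as a Boolean indicator of a home-team win
    home_win_rate = scores.groupby("Season")["HomeWin"].mean()
    print(home_win_rate)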

    Follow My Data Cleaning Journey

    If you're interested in following my process of refining and cleaning this dataset, check out my Google Colab notebook on GitHub, where I share ongoing updates and insights: https://github.com/KeoniM/NFL_Data_Cleaning.git. The notebook includes data wrangling techniques, code snippets, and continuous improvements, making this dataset even more valuable for analysis.

    Usage Notes

    This dataset is intended for academic and research purposes. Users are encouraged to attribute data to the source https://www.nfl.com/ when employing this dataset in their projects or publications.

  5. pale

    • huggingface.co
    Updated Oct 31, 2023
    + more versions
    Cite
    zeionara (2023). pale [Dataset]. https://huggingface.co/datasets/zeio/pale
    Explore at:
    Dataset updated
    Oct 31, 2023
    Authors
    zeionara
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset card for pale

      Dataset summary
    

    This dataset contains League of Legends champions' quotes parsed from Fandom. See the dataset viewer at the derivative repo. See a dataset usage example on Google Colab. The dataset is available in the following configurations:

    vanilla - all data pulled from the website without significant modifications apart from the web page structure parsing; quotes - truncated version of the corpus, which doesn't contain sound effects; annotated - an… See the full description on the dataset page: https://huggingface.co/datasets/zeio/pale.
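
    A minimal loading sketch, assuming the standard Hugging Face datasets loader accepts the configuration names listed above (the split name is an assumption):

    from datasets import load_dataset

    # 'quotes' is one of the configurations named in the description above;
    # the "train" split name is an assumption.
    pale_quotes = load_dataset("zeio/pale", "quotes", split="train")
    print(pale_quotes[0])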

  6. Chess Pieces (old) Dataset

    • universe.roboflow.com
    zip
    Updated Aug 18, 2020
    + more versions
    Cite
    Joseph Nelson (2020). Chess Pieces (old) Dataset [Dataset]. https://universe.roboflow.com/joseph-nelson/chess-full-old/dataset/1
    Explore at:
    zip. Available download formats
    Dataset updated
    Aug 18, 2020
    Dataset authored and provided by
    Joseph Nelson
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Variables measured
    Pieces
    Description

    Overview

    This is a dataset of Chess board photos and various pieces. All photos were captured from a constant angle, a tripod to the left of the board. The bounding boxes of all pieces are annotated as follows: white-king, white-queen, white-bishop, white-knight, white-rook, white-pawn, black-king, black-queen, black-bishop, black-knight, black-rook, black-pawn. There are 2894 labels across 292 images.

    Chess example image: https://i.imgur.com/nkjobw1.png

    Follow this tutorial to see an example of training an object detection model using this dataset or jump straight to the Colab notebook.

    Use Cases

    At Roboflow, we built a chess piece object detection model using this dataset.

    Demo: https://blog.roboflow.ai/content/images/2020/01/chess-detection-longer.gif

    You can see a video demo of that here. (We did struggle with pieces that were occluded, i.e. the state of the board at the very beginning of a game has many pieces obscured - let us know how your results fare!)

    Using this Dataset

    We're releasing the data free on a public license.

    About Roboflow

    Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.

    Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.


  7. Human alterations of the global floodplains 1965-2019

    • catalog.data.gov
    • s.cnmilf.com
    Updated Aug 28, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Human alterations of the global floodplains 1965-2019 [Dataset]. https://catalog.data.gov/dataset/human-alterations-of-the-global-floodplains-1965-2019
    Explore at:
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    We developed the first publicly available spatially explicit estimates of the human alterations along the global floodplains during the recent 27 years (1992-2019) at 250-m resolution. To maximize the reuse of our datasets and advance the open science of human floodplain alteration, we developed three web-based programming tools: (1) Floodplain Mapping Tool, (2) Land Use Change Tool, and (3) Human Alteration Tool, supported with tutorials and step-by-step audiovisual instructions. Our data reveal a significant loss of natural floodplains worldwide, with 460,000 km2 of new agricultural and 140,000 km2 of new developed areas between 1992 and 2019. This dataset offers critical new insights into how floodplains are being destroyed, which will help decision-makers to reinforce strategies to conserve and restore floodplain functions and habitat.

    This dataset is not publicly accessible because: EPA scientists provided context and commentary but did not do any of the analyses or handle any of the data. It can be accessed through the following means: the entire data record can be downloaded as a single zip file from this web link: http://www.hydroshare.org/resource/cdb5fd97e0644a14b22e58d05299f69b.

    The global floodplain alteration dataset is derived entirely through the ArcGIS 10.5 and ENVI 5.1 geospatial analysis platforms. To assist in reuse and application of the dataset, we developed additional Python codes aggregated as three web-based tools:

    • Floodplain Mapping Tool: https://colab.research.google.com/drive/1xQlARZXKPexmDInYV-EMoJ-HZxmFL-eW?usp=sharing
    • Land Use Change Tool: https://colab.research.google.com/drive/1vmIaUCkL66CoTv4rNRIWpJXYXp4TlAKd?usp=sharing
    • Human Alteration Tool: https://colab.research.google.com/drive/1r2zNJNpd3aWSuDV2Kc792qSEjvDbFtBy?usp=share_link

    See the Usage Notes section in the journal article for details. Format: The global floodplain alteration dataset is available through the HydroShare open geospatial data platform. Our data record also includes all corresponding input data, intermediate calculations, and supporting information. This dataset is associated with the following publication: Rajib, A., Q. Zheng, C. Lane, H. Golden, J. Christensen, I. Isibor, and K. Johnson. Human alterations of the global floodplains 1992–2019. Scientific Data. Springer Nature, New York, NY, USA, 10: 499, (2023).

  8. ADBench_rep

    • kaggle.com
    zip
    Updated Jul 31, 2025
    Cite
    Will Kenason (2025). ADBench_rep [Dataset]. https://www.kaggle.com/datasets/willkenason/adbench-rep
    Explore at:
    zip (607143879 bytes). Available download formats
    Dataset updated
    Jul 31, 2025
    Authors
    Will Kenason
    Description

    ADBench includes 57 datasets, as shown in the following Table.

    Among them, 47 widely-used real-world datasets are gathered for model evaluation, which cover many application domains, including healthcare (e.g., disease diagnosis), audio and language processing (e.g., speech recognition), image processing (e.g., object identification), finance (e.g., financial fraud detection), etc.

    We introduce 10 more complex datasets from CV and NLP domains with more samples and richer features in ADBench. Pretrained models are applied to extract data embeddings from the NLP and CV datasets to access more complex representations. Please see the datasets folder and our paper for detailed information.

    We organize the above 57 datasets into a user-friendly format. All the datasets are named "number_data.npz" in the datasets folder. For example, one can evaluate AD algorithms on the cardio dataset with the following code. For multi-class datasets like CIFAR10, the class number should additionally be specified, as in "number_data_class.npz". Please see the folder for more details.

    We provide the data processing code for NLP datasets and for CV datasets in Google Colab, where one can quickly reproduce our procedures via the free GPUs. We hope this could be helpful for the AD community.

    We have unified all the datasets in .npz format, and you can directly access a dataset by the following script

    """ import numpy as np data = np.load('adbench/datasets/Classical/6_cardio.npz', allow_pickle=True) X, y = data['X'], data['y'] """

  9. Data from: Gravity Spy Machine Learning Classifications of LIGO Glitches...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Jan 30, 2023
    Cite
    Jane Glanzer; Sharan Banagari; Scott Coughlin; Michael Zevin; Sara Bahaadini; Neda Rohani; Sara Allen; Christopher Berry; Kevin Crowston; Mabi Harandi; Corey Jackson; Vicky Kalogera; Aggelos Katsaggelos; Vahid Noroozi; Carsten Osterlund; Oli Patane; Joshua Smith; Siddharth Soni; Laura Trouille (2023). Gravity Spy Machine Learning Classifications of LIGO Glitches from Observing Runs O1, O2, O3a, and O3b [Dataset]. http://doi.org/10.5281/zenodo.5649212
    Explore at:
    csv. Available download formats
    Dataset updated
    Jan 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jane Glanzer; Sharan Banagari; Scott Coughlin; Michael Zevin; Sara Bahaadini; Neda Rohani; Sara Allen; Christopher Berry; Kevin Crowston; Mabi Harandi; Corey Jackson; Vicky Kalogera; Aggelos Katsaggelos; Vahid Noroozi; Carsten Osterlund; Oli Patane; Joshua Smith; Siddharth Soni; Laura Trouille
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This data set contains all classifications made by the Gravity Spy machine learning model of LIGO glitches from the first three observing runs (O1, O2 and O3, where O3 is split into O3a and O3b). Gravity Spy classified all noise events identified by the Omicron trigger pipeline for which Omicron determined that the signal-to-noise ratio was above 7.5 and the peak frequency of the noise event was between 10 Hz and 2048 Hz. To classify noise events, Gravity Spy made Omega scans of every glitch consisting of 4 different durations, which helps capture the morphology of noise events that are both short and long in duration.

    There are 22 classes used for O1 and O2 data (including No_Glitch and None_of_the_Above), while two additional classes are used to classify O3 data (and None_of_the_Above was removed).

    For O1 and O2, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Chirp, Extremely_Loud, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle

    For O3, the glitch classes were: 1080Lines, 1400Ripples, Air_Compressor, Blip, Blip_Low_Frequency, Chirp, Extremely_Loud, Fast_Scattering, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line, Whistle

    The data set is described in Glanzer et al. (2023), which we ask to be cited in any publications using this data release. Example code using the data can be found in this Colab notebook.

    If you would like to download the Omega scans associated with each glitch, then you can use the gravitational-wave data-analysis tool GWpy. If you would like to use this tool, please install anaconda if you have not already and create a virtual environment using the following command

    conda create --name gravityspy-py38 -c conda-forge python=3.8 gwpy pandas psycopg2 sqlalchemy

    After downloading one of the CSV files for a specific era and interferometer, please run the following Python script if you would like to download the data associated with the metadata in the CSV file. We recommend not trying to download too many images at one time. For example, the script below will read data on Hanford glitches from O2 that were classified by Gravity Spy and filter for only glitches that were labelled as Blips with 90% confidence or higher, and then download the first 4 rows of the filtered table.

    from gwpy.table import GravitySpyTable

    # Read the Gravity Spy metadata for Hanford (H1) glitches from O2
    H1_O2 = GravitySpyTable.read('H1_O2.csv')

    # Keep only glitches labelled as Blip with at least 90% confidence
    H1_O2 = H1_O2[(H1_O2["ml_label"] == "Blip") & (H1_O2["ml_confidence"] > 0.9)]

    # Download the Omega scans for the first 4 rows of the filtered table
    H1_O2[0:4].download(nproc=1)

    Each of the columns in the CSV files are taken from various different inputs:

    [‘event_time’, ‘ifo’, ‘peak_time’, ‘peak_time_ns’, ‘start_time’, ‘start_time_ns’, ‘duration’, ‘peak_frequency’, ‘central_freq’, ‘bandwidth’, ‘channel’, ‘amplitude’, ‘snr’, ‘q_value’] contain metadata about the signal from the Omicron pipeline.

    [‘gravityspy_id’] is the unique identifier for each glitch in the dataset.

    [‘1400Ripples’, ‘1080Lines’, ‘Air_Compressor’, ‘Blip’, ‘Chirp’, ‘Extremely_Loud’, ‘Helix’, ‘Koi_Fish’, ‘Light_Modulation’, ‘Low_Frequency_Burst’, ‘Low_Frequency_Lines’, ‘No_Glitch’, ‘None_of_the_Above’, ‘Paired_Doves’, ‘Power_Line’, ‘Repeating_Blips’, ‘Scattered_Light’, ‘Scratchy’, ‘Tomte’, ‘Violin_Mode’, ‘Wandering_Line’, ‘Whistle’] contain the machine learning confidence for a glitch being in a particular Gravity Spy class (the confidence in all these columns should sum to unity). These use the original 22 classes in all cases.

    [‘ml_label’, ‘ml_confidence’] provide the machine-learning predicted label for each glitch, and the machine learning confidence in its classification.

    [‘url1’, ‘url2’, ‘url3’, ‘url4’] are the links to the publicly-available Omega scans for each glitch. ‘url1’ shows the glitch for a duration of 0.5 seconds, ‘url2’ for 1 seconds, ‘url3’ for 2 seconds, and ‘url4’ for 4 seconds.

    For the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.


    For detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.

  10. Insect Detect - insect classification dataset v2

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 10, 2023
    + more versions
    Cite
    Sittinger, Maximilian; Uhler, Johannes; Pink, Maximilian (2023). Insect Detect - insect classification dataset v2 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8325383
    Explore at:
    Dataset updated
    Dec 10, 2023
    Authors
    Sittinger, Maximilian; Uhler, Johannes; Pink, Maximilian
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Insect Detect - insect classification dataset v2 contains mainly images of various insects sitting on or flying above an artificial flower platform. All images were automatically recorded with the Insect Detect DIY camera trap, a hardware combination of the Luxonis OAK-1, Raspberry Pi Zero 2 W and PiJuice Zero pHAT for automated insect monitoring (bioRxiv preprint). Most of the images were captured by camera traps deployed at different sites in 2023. For some classes (e.g. ant, bee_bombus, beetle_cocci, bug, bug_grapho, hfly_eristal, hfly_myathr, hfly_syrphus) additional images were captured with a lab setup of the camera trap. For some classes (e.g. bee_apis, fly, hfly_episyr, wasp) images from the first dataset version were transferred to this dataset. This dataset is also available on Roboflow Universe. The images in the dataset from Roboflow are automatically compressed, which decreases model accuracy when used for training. Therefore it is recommended to use this uncompressed Zenodo version and split the dataset into train/val/test subsets in the provided training notebook.
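
    A minimal sketch of the recommended train/val/test split for the uncompressed image folders. The directory layout, file extension, and the 80/10/10 ratio are assumptions here; the provided training notebook is the supported route.

    import random, shutil
    from pathlib import Path

    # Assumptions: images are organised as dataset/<class>/<image>.jpg and an
    # 80/10/10 split is wanted; the official training notebook is authoritative.
    random.seed(0)
    src, dst = Path("dataset"), Path("dataset_split")
    for class_dir in src.iterdir():
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.jpg"))
        random.shuffle(images)
        n = len(images)
        splits = {"train": images[: int(0.8 * n)],
                  "val": images[int(0.8 * n): int(0.9 * n)],
                  "test": images[int(0.9 * n):]}
        for split, files in splits.items():
            out = dst / split / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy(f, out / f.name)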

    Classes

    This dataset contains the following 27 classes:

    ant (Formicidae)
    bee (Anthophila excluding Apis mellifera and Bombus sp.)
    bee_apis (Apis mellifera)
    bee_bombus (Bombus sp.)
    beetle (Coleoptera excluding Coccinellidae and some Oedemeridae)
    beetle_cocci (Coccinellidae)
    beetle_oedem (visually distinct Oedemeridae)
    bug (Heteroptera excluding Graphosoma italicum)
    bug_grapho (Graphosoma italicum)
    fly (Brachycera excluding Empididae, Sarcophagidae, Syrphidae and small Brachycera)
    fly_empi (Empididae)
    fly_sarco (visually distinct Sarcophagidae)
    fly_small (small Brachycera)
    hfly_episyr (hoverfly Episyrphus balteatus)
    hfly_eristal (hoverfly Eristalis sp., mainly Eristalis tenax)
    hfly_eupeo (mainly hoverfly Eupeodes corollae and Scaeva pyrastri)
    hfly_myathr (hoverfly Myathropa florea)
    hfly_sphaero (hoverfly Sphaerophoria sp., mainly Sphaerophoria scripta)
    hfly_syrphus (mainly hoverfly Syrphus sp.)
    lepi (Lepidoptera)
    none_bg (images with no insect - background (platform))
    none_bird (images with no insect - bird sitting on platform)
    none_dirt (images with no insect - leaves and other plant material, bird droppings)
    none_shadow (images with no insect - shadows of insects or surrounding plants)
    other (other Arthropods, including various Hymenoptera and Symphyta, Diptera, Orthoptera, Auchenorrhyncha, Neuroptera, Araneae)
    scorpionfly (Panorpa sp.)
    wasp (mainly Vespula sp. and Polistes dominula)

    For the classes hfly_eupeo and hfly_syrphus a precise taxonomic distinction is not possible with images only, due to a potentially high variability in the appearance of the respective species. While most specimens will show the visual features that are important for a classification into one of these classes, some specimens of Syrphus sp. might look more like Eupeodes sp. and vice versa. The images were sorted to the respective class by considering taxonomic and visual distinctions. However, this dataset is still rather small regarding the visually extremely diverse Insecta. Insects that are not included in this dataset can therefore be classified to the wrong class. All results should always be manually validated, and false classifications can be used to extend this basic dataset and retrain your custom classification model.

    Deployment

    You can use this dataset as a starting point to train your own insect classification models with the provided Google Colab training notebook. Read the model training instructions for more information. An insect classification model trained on this dataset is available in the insect-detect-ml GitHub repo. To deploy the model on your PC (ONNX format for fast CPU inference), follow the provided step-by-step instructions.

    License

    This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).

  11. Accident Detection Model Dataset

    • universe.roboflow.com
    zip
    Updated Apr 8, 2024
    Cite
    Accident detection model (2024). Accident Detection Model Dataset [Dataset]. https://universe.roboflow.com/accident-detection-model/accident-detection-model/model/1
    Explore at:
    zip. Available download formats
    Dataset updated
    Apr 8, 2024
    Dataset authored and provided by
    Accident detection model
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Accident Bounding Boxes
    Description

    Accident-Detection-Model

    The Accident Detection Model is built with YOLOv8, Google Colab, Python, Roboflow, deep learning, OpenCV, machine learning, and artificial intelligence. It can detect an accident from a live camera feed, an image, or a video. The model is trained on a dataset of 3200+ images, all annotated in Roboflow.

    Problem Statement

    • Road accidents are a major problem in India, with thousands of people losing their lives and many more suffering serious injuries every year.
    • According to the Ministry of Road Transport and Highways, India witnessed around 4.5 lakh road accidents in 2019, which resulted in the deaths of more than 1.5 lakh people.
    • The age range that is most severely hit by road accidents is 18 to 45 years old, which accounts for almost 67 percent of all accidental deaths.

    Accidents survey

    Survey chart: https://user-images.githubusercontent.com/78155393/233774342-287492bb-26c1-4acf-bc2c-9462e97a03ca.png

    Literature Survey

    • Sreyan Ghosh (Mar 2019): the goal is to develop a system using a deep learning convolutional neural network trained to identify video frames as accident or non-accident.
    • Deeksha Gour (Sep 2019): uses computer vision technology, neural networks, deep learning, and various approaches and algorithms to detect objects.

    Research Gap

    • Lack of real-world data - we trained the model on more than 3200 images.
    • Large training time and compute needed - we use Google Colab to reduce the time and resources required.
    • Outdated versions in previous works - we are using the latest version of YOLOv8.

    Proposed methodology

    • We are using YOLOv8 to train on our custom dataset of 3200+ images, collected from different platforms.
    • After training for 25 epochs, the model is ready to detect an accident with a significant probability.

    Model Set-up

    Preparing Custom dataset

    • We collected 1200+ images from different sources like YouTube, Google Images, Kaggle.com, etc.
    • We then annotated all of them individually with a tool called Roboflow.
    • During annotation we marked the images with no accident as NULL and drew a box around the site of the accident on the images containing one.
    • We then divided the dataset into train, val, and test splits in the ratio 8:1:1.
    • As the final step we downloaded the dataset in YOLOv8 format.
      #### Using Google Colab
    • We use Google Colaboratory to code this model because Colab provides a GPU, which is faster than most local environments.
    • Google Colab lets you write and run Python code in Jupyter notebooks, which blend code, text, and visualisations in a single document.
    • Users can run individual code cells and quickly view the results, which is helpful for experimenting and debugging. Notebooks also support visualisations built with well-known frameworks like Matplotlib, Seaborn, and Plotly.
    • In Google Colab, we first changed the runtime from TPU to GPU.
    • We cross-checked it by running the command ‘!nvidia-smi’
      #### Coding
    • First, we installed YOLOv8 with the command ‘!pip install ultralytics==8.0.20’
    • We then imported YOLOv8 with ‘from ultralytics import YOLO from IPython.display import display, Image’
    • Then we connected and mounted our Google Drive account with ‘from google.colab import drive drive.mount('/content/drive')’
    • Then we ran the main command to start the training process: ‘%cd /content/drive/MyDrive/Accident Detection model !yolo task=detect mode=train model=yolov8s.pt data= data.yaml epochs=1 imgsz=640 plots=True’
    • After training we ran commands to test and validate the model: ‘!yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml’ ‘!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images’
    • To get results from any video or image we ran: ‘!yolo task=detect mode=predict model=runs/detect/train/weights/best.pt source="/content/drive/MyDrive/Accident-Detection-model/data/testing1.jpg/mp4"’
    • The results are stored in the runs/detect/predict folder.
      Hence our model is trained, validated, and tested, and is able to detect accidents in any video or image. The full command sequence is collected in the Colab cell below.
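
    For convenience, the commands quoted in the steps above are collected here as a single Colab cell sketch. Paths are reproduced as given in this description and will need to be adapted to your own Drive layout; the path quoting and the trailing save flag are small adjustments, not part of the original commands.

    # Colab cell collecting the commands quoted above (paths as given in the description)
    !pip install ultralytics==8.0.20

    from ultralytics import YOLO
    from IPython.display import display, Image
    from google.colab import drive

    drive.mount('/content/drive')

    # Quoted here because the path as given contains spaces
    %cd "/content/drive/MyDrive/Accident Detection model"

    !yolo task=detect mode=train model=yolov8s.pt data=data.yaml epochs=1 imgsz=640 plots=True
    !yolo task=detect mode=val model=runs/detect/train/weights/best.pt data=data.yaml
    # The Challenges section below notes that save=True is needed to write predictions to runs/detect/predict
    !yolo task=detect mode=predict model=runs/detect/train/weights/best.pt conf=0.25 source=data/test/images save=True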

    Challenges I ran into

    I majorly ran into 3 problems while making this model

    • I had difficulty saving the results to a folder; since YOLOv8 is the latest version, it is still under development. I read some blogs and referred to Stack Overflow, and learned that the new v8 needs an extra argument, ''save=true'', which made it save my results to a folder.
    • I was facing a problem on the CVAT website because I was not sure what
  12. A year-long deep-sea soundscape dataset off Minamitorishima Island

    • data.depositar.io
    html, ipynb, mat, wav
    Updated Oct 18, 2023
    Cite
    Ocean Biodiversity Listening Project (2023). A year-long deep-sea soundscape dataset off Minamitorishima Island [Dataset]. https://data.depositar.io/en/dataset/a-year-long-deepsea-soundscape-dataset-off-minamitorishima-island
    Explore at:
    wav(11520512), html(21009246), wav, html(23843691), mat(185428686), ipynb, html(23418232), mat(102385464), wav(11520508), html(24552182). Available download formats
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Ocean Biodiversity Listening Project
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Minami-Tori-shima
    Description

    This dataset includes acoustic data associated with the deep-sea soundscape off Minamitorishima Island, along with Python codes for using the audio data in a Google Colab notebook.

    Recording Location

    The audio recordings were collected from the seafloor east of Takuyo-Daigo Seamount (22°59.9' N, 154°24.5' E, depth 5552 m), approximately 150 km south of Minamitorishima Island.

    Recording System

    The data collection used a 6000 m rated icListen HF hydrophone (model SB60L-ETH, Ocean Sonics, Canada) and a Gordon Smart Recorder (Ocean Sonics, Canada). The hydrophone was mounted atop an Edokko Mark I Type 365 (Okamoto Glass, Japan) and connected to two pressure-resistant glass spheres housing the Gordon Smart Recorder and battery packs. The recording system has a hydrophone sensitivity of -170 dB re 1V μPa-1, with a frequency response from 10 Hz to 200 kHz (±6 dB bandwidth). See details in Onishi et al. (2023).

    Configuration of Audio Recording

    (1) Duty Cycle: 2-min recording every 4 hours (2 AM, 6 AM, 10 AM, 2 PM, 6 PM, and 10 PM, UTC+0). (2) Sampling Rate: 512 kHz. (3) File Format: WAV. (4) Audio Gain: 0 dB. (5) High Pass Filter: Off.

    Field Deployment

    The Edokko Mark I was deployed on March 13, 2020, during the R/V KAIREI cruise KR20-E01C and retrieved on April 9, 2021, during the R/V KAIREI cruise KR21-04C. However, effective recordings only covered the period between March 13, 2020 and March 5, 2021.

    Data Structure

    • Audio: Folders of audio recordings, with folder names indicating the date of recording (yyyymmdd). Each folder contains 24 wav files recorded in one day, with file names in the format BPW33333_yyyymmdd_HHMMSS. All timestamps on the folders and wav files are in UTC+0 (local time is UTC+10). A minimal loading sketch follows this list.
    • LTS: A mat file containing the long-term spectrogram of the year-round seafloor recordings. There are two types of LTS: median-based and mean-based. The median-based LTS was obtained by measuring the median of power spectral densities in each frequency bin, while the mean-based LTS was obtained by measuring the mean of power spectral densities in each frequency bin. Both types have a frequency resolution of 10 Hz and a time resolution of 1 minute. Use the soundscape_IR Python package or MATLAB to open the mat file.
    • Transient Sound: A mat file containing the spectral features associated with high-intensity transient sounds, learned using semi-supervised source separation. Use the soundscape_IR Python package or MATLAB to open the mat file.
    • Model: Mat files containing source separation models for fish choruses and ambient noise. Use the soundscape_IR Python package to load these models for performing source separation.
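
    As referenced above, a minimal loading sketch for one audio file and the LTS mat file, using generic soundfile/SciPy calls rather than the soundscape_IR package. The example file names follow the stated naming pattern but are hypothetical, and opening the mat file with scipy.io assumes it is not stored in the MATLAB v7.3 (HDF5) format.

    import soundfile as sf
    from scipy import signal, io

    # Hypothetical file name following the BPW33333_yyyymmdd_HHMMSS pattern described above
    audio, fs = sf.read("Audio/20200314/BPW33333_20200314_020000.wav")
    print(f"{len(audio) / fs:.1f} s at {fs} Hz")

    # Quick spectrogram of the 2-minute recording
    f, t, sxx = signal.spectrogram(audio, fs, nperseg=4096)
    print(sxx.shape)

    # Long-term spectrogram; scipy.io.loadmat assumes a pre-v7.3 mat file, and "LTS.mat" is a placeholder name
    lts = io.loadmat("LTS.mat")
    print(lts.keys())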

    Funding

    This work was conducted under the following ocean programs supported by the Cross-ministerial Strategic Innovation Promotion Program: Innovative Technologies for Exploration of Deep-Sea Resources from 2018 to 2022 (Lead agency: Japan Agency for Marine-Earth Science and Technology).

    Associated Publication

    Lin, T.-H., Kawagucci, S. (2023) Acoustic twilight: A year-long seafloor monitoring unveils phenological patterns in the abyssal soundscape. Limnology and Oceanography Letters, early view.

  13. Turkish_Basketball_Super_League_Dataset

    • huggingface.co
    Updated Aug 10, 2025
    Cite
    Muhammed Onur ulu (2025). Turkish_Basketball_Super_League_Dataset [Dataset]. https://huggingface.co/datasets/onurulu17/Turkish_Basketball_Super_League_Dataset
    Explore at:
    Dataset updated
    Aug 10, 2025
    Authors
    Muhammed Onur ulu
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    📥 Load Dataset in Python

    To load this dataset in Google Colab or any Python environment:
    !pip install huggingface_hub pandas openpyxl

    from huggingface_hub import hf_hub_download
    import pandas as pd

    repo_id = "onurulu17/Turkish_Basketball_Super_League_Dataset"

    files = [ "leaderboard.xlsx", "player_data.xlsx", "team_data.xlsx", "team_matches.xlsx", "player_statistics.xlsx", "technic_roster.xlsx" ]

    datasets = {}

    for f in files: path =… See the full description on the dataset page: https://huggingface.co/datasets/onurulu17/Turkish_Basketball_Super_League_Dataset.
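
    The loop above is truncated on the dataset page; the following is a hedged completion sketch using hf_hub_download from the snippet above (the repo_type argument and the pandas read step are assumptions, not the author's original code):

    # One possible way to finish the truncated loop above; repo_type and the
    # read_excel step are assumptions, not the author's original code.
    for f in files:
        path = hf_hub_download(repo_id=repo_id, filename=f, repo_type="dataset")
        datasets[f] = pd.read_excel(path)

    print(datasets["leaderboard.xlsx"].head())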

  14. Data from: Efficient imaging and computer vision detection of two cell...

    • agdatacommons.nal.usda.gov
    • datasets.ai
    • +1more
    zip
    Updated Nov 21, 2025
    + more versions
    Cite
    Benjamin P. Graham; Jeremy Park; Grant Billings; Amanda M. Hulse-Kemp; Candace H. Haigler; Edgar Lobaton (2025). Data from: Efficient imaging and computer vision detection of two cell shapes in young cotton fibers [Dataset]. http://doi.org/10.15482/USDA.ADC/1528324
    Explore at:
    zip. Available download formats
    Dataset updated
    Nov 21, 2025
    Dataset provided by
    Ag Data Commons
    Authors
    Benjamin P. Graham; Jeremy Park; Grant Billings; Amanda M. Hulse-Kemp; Candace H. Haigler; Edgar Lobaton
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Methods

    Cotton plants were grown in a well-controlled greenhouse in the NC State Phytotron as described previously (Pierce et al., 2019). Flowers were tagged on the day of anthesis and harvested three days post anthesis (3 DPA). The distinct fiber shapes had already formed by 2 DPA (Stiff and Haigler, 2016; Graham and Haigler, 2021), and fibers were still relatively short at 3 DPA, which facilitated the visualization of multiple fiber tips in one image.

    Cotton fiber sample preparation, digital image collection, and image analysis: Ovules with attached fiber were fixed in the greenhouse. The fixative previously used (Histochoice) (Stiff and Haigler, 2016; Pierce et al., 2019; Graham and Haigler, 2021) is obsolete, which led to testing and validation of another low-toxicity, formalin-free fixative (#A5472; Sigma-Aldrich, St. Louis, MO; Fig. S1). The boll wall was removed without damaging the ovules. (Using a razor blade, cut away the top 3 mm of the boll. Make about 1 mm deep longitudinal incisions between the locule walls, and finally cut around the base of the boll.) All of the ovules with attached fiber were lifted out of the locules and fixed (1 h, RT, 1:10 tissue:fixative ratio) prior to optional storage at 4°C. Immediately before imaging, ovules were examined under a stereo microscope (incident light, black background, 31X) to select three vigorous ovules from each boll while avoiding drying. Ovules were rinsed (3 x 5 min) in buffer [0.05 M PIPES, 12 mM EGTA, 5 mM EDTA and 0.1% (w/v) Tween 80, pH 6.8], which had lower osmolarity than a microtubule-stabilizing buffer used previously for aldehyde-fixed fibers (Seagull, 1990; Graham and Haigler, 2021). While steadying an ovule with forceps, one to three small pieces of its chalazal end with attached fibers were dissected away using a small knife (#10055-12; Fine Science Tools, Foster City, CA). Each ovule piece was placed in a single well of a 24-well slide (#63430-04; Electron Microscopy Sciences, Hatfield, PA) containing a single drop of buffer prior to applying and sealing a 24 x 60 mm coverslip with vaseline.

    Samples were imaged with brightfield optics and default settings for the 2.83 mega-pixel, color, CCD camera of the Keyence BZ-X810 imaging system (www.keyence.com; housed in the Cellular and Molecular Imaging Facility of NC State). The location of each sample in the 24-well slides was identified visually using a 2X objective and mapped using the navigation function of the integrated Keyence software. Using the 10X objective lens (plan-apochromatic; NA 0.45) and 60% closed condenser aperture setting, a region with many fiber apices was selected for imaging using the multi-point and z-stack capture functions. The precise location was recorded by the software prior to visual setting of the limits of the z-plane range (1.2 µm step size). Typically, three 24-sample slides (representing three accessions) were set up in parallel prior to automatic image capture. The captured z-stacks for each sample were processed into one two-dimensional image using the full-focus function of the software. (Occasional samples contained too much debris for computer vision to be effective, and these were reimaged.)

    Resources in this dataset:

    - Deltapine 90 - Manually Annotated Training Set. File Name: GH3 DP90 Keyence 1_45 JPEG.zip. These images were manually annotated in Labelbox.
    - Deltapine 90 - AI-Assisted Annotated Training Set. File Name: GH3 DP90 Keyence 46_101 JPEG.zip. These images were AI-labeled in Roboflow and then manually reviewed in Roboflow.
    - Deltapine 90 - Manually Annotated Training-Validation Set. File Name: GH3 DP90 Keyence 102_125 JPEG.zip. These images were manually labeled in Labelbox and then used for training-validation of the machine learning model.
    - Phytogen 800 - Evaluation Test Images. File Name: Gb cv Phytogen 800.zip. Used to validate the machine learning model; manually annotated in ImageJ.
    - Pima 3-79 - Evaluation Test Images. File Name: Gb cv Pima 379.zip. Used to validate the machine learning model; manually annotated in ImageJ.
    - Pima S-7 - Evaluation Test Images. File Name: Gb cv Pima S7.zip. Used to validate the machine learning model; manually annotated in ImageJ.
    - Coker 312 - Evaluation Test Images. File Name: Gh cv Coker 312.zip. Used to validate the machine learning model; manually annotated in ImageJ.
    - Deltapine 90 - Evaluation Test Images. File Name: Gh cv Deltapine 90.zip. Used to validate the machine learning model; manually annotated in ImageJ.
    - Half and Half - Evaluation Test Images. File Name: Gh cv Half and Half.zip. Used to validate the machine learning model; manually annotated in ImageJ.
    - Fiber Tip Annotations - Manual. File Name: manual_annotations.coco_.json. Annotations in COCO .json format for fibers, manually annotated in Labelbox.
    - Fiber Tip Annotations - AI-Assisted. File Name: ai_assisted_annotations.coco_.json. Annotations in COCO .json format for fibers, AI-annotated with human review in Roboflow.
    - Model Weights (iteration 600). File Name: model_weights.zip. The final model, provided as a zipped PyTorch .pth file, chosen at training iteration 600. The model weights can be imported for use of the fiber tip type detection neural network in Python. Recommended software: Google Colab (https://research.google.com/colaboratory/).
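
    A hedged sketch for inspecting the provided PyTorch weights after unzipping model_weights.zip. The .pth file name inside the archive and the checkpoint layout are assumptions; the detection network architecture itself comes from the associated project code, not from this snippet.

    import torch

    # Assumption: the unzipped archive contains a .pth file holding a state dict
    # (or a checkpoint dict wrapping one); the file name here is a placeholder.
    checkpoint = torch.load("model_weights.pth", map_location="cpu")
    state_dict = checkpoint.get("state_dict", checkpoint) if isinstance(checkpoint, dict) else checkpoint

    # Print the first few parameter names and shapes as a sanity check
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape))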

  15. Nou Pa Bèt: Civic Substitution and Expressive Freedoms in Post-State...

    • zenodo.org
    bin
    Updated Aug 13, 2025
    Cite
    Brown Scott; Brown Scott (2025). Nou Pa Bèt: Civic Substitution and Expressive Freedoms in Post-State Governance [Dataset]. http://doi.org/10.5281/zenodo.16858858
    Explore at:
    bin. Available download formats
    Dataset updated
    Aug 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Brown Scott; Brown Scott
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    This dataset supports the research paper "Nou Pa Bèt: Civic Substitution and Expressive Freedoms in Post-State Governance" which examines how civic participation functions as institutional substitution in fragile states, with Haiti as the primary case study. The dataset combines governance indicators from the World Bank's Worldwide Governance Indicators (WGI) with civic engagement measures from the Varieties of Democracy (V-Dem) project.

    Files Included:

    1. wgidataset.xlsx (2.57 MB) - Complete World Bank Worldwide Governance Indicators dataset covering multiple governance dimensions across countries and years
    2. CivicEngagement_SelectedCountries_Last10Years.xlsx (25.03 KB) - Processed V-Dem civic engagement indicators for fragile states sample (2015-2024) including variables for participatory governance, civil society participation, freedom of expression, freedom of assembly, anti-system movements, and direct democracy
    3. civic.ipynb (10.35 KB) - Complete Python analysis notebook containing all data processing, regression analysis, and visualization code used in the study

    How to Use in Google Colab:

    Step 1: Upload Files

    python
    from google.colab import files
    import pandas as pd
    import numpy as np
    
    # Upload the files to your Colab environment
    uploaded = files.upload()
    # Select and upload: CivicEngagement_SelectedCountries_Last10Years.xlsx and wgidataset.xlsx

    Step 2: Load the Datasets

    python
    # Load the civic engagement data (main analysis dataset)
    civic_data = pd.read_excel('CivicEngagement_SelectedCountries_Last10Years.xlsx')
    
    # Load the WGI data (if needed for extended analysis)
    wgi_data = pd.read_excel('wgidataset.xlsx')
    
    # Display basic information
    print("Civic Engagement Dataset Shape:", civic_data.shape)
    print("
    Columns:", civic_data.columns.tolist())
    print("
    First few rows:")
    civic_data.head()

    Step 3: Run the Analysis Notebook

    python
    # Download and run the complete analysis notebook
    !wget https://zenodo.org/record/[RECORD_ID]/files/civic.ipynb
    # Then open civic.ipynb in Colab or copy/paste the code cells

    Key Variables:

    Dependent Variables (WGI):

    • Control_of_Corruption - Extent to which public power is exercised for private gain
    • Government_Effectiveness - Quality of public services and policy implementation

    Independent Variables (V-Dem):

    • v2x_partip - Participatory Component Index
    • v2x_cspart - Civil Society Participation Index
    • v2cademmob - Freedom of Peaceful Assembly
    • v2cafres - Freedom of Expression
    • v2csantimv - Anti-System Movements
    • v2xdd_dd - Direct Popular Vote Index

    Sample Countries: 21 fragile states including Haiti, Sierra Leone, Liberia, DRC, CAR, Guinea-Bissau, Chad, Niger, Burundi, Yemen, South Sudan, Mozambique, Sudan, Eritrea, Somalia, Mali, Afghanistan, Papua New Guinea, Togo, Cambodia, and Timor-Leste.

    Quick Start Analysis:

    python
    # Install required packages
    !pip install statsmodels scipy
    
    # Basic regression replication
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    
    # Prepare variables for regression (drop missing values jointly so X and y stay aligned)
    model_cols = ['v2x_partip', 'v2x_cspart', 'v2cademmob', 'v2cafres', 'v2csantimv', 'v2xdd_dd',
                  'Control_of_Corruption', 'Government_Effectiveness']
    reg_data = civic_data[model_cols].dropna()
    X = reg_data[['v2x_partip', 'v2x_cspart', 'v2cademmob', 'v2cafres', 'v2csantimv', 'v2xdd_dd']]
    y_corruption = reg_data['Control_of_Corruption']
    y_effectiveness = reg_data['Government_Effectiveness']
    
    # Run regression (example for Control of Corruption)
    X_const = sm.add_constant(X)
    model = sm.OLS(y_corruption, X_const).fit(cov_type='HC3')
    print(model.summary())

    Citation: Brown, Scott M., Fils-Aime, Jempsy, & LaTortue, Paul. (2025). Nou Pa Bèt: Civic Substitution and Expressive Freedoms in Post-State Governance [Dataset]. Zenodo. https://doi.org/10.5281/zenodo.15058161

    License: Creative Commons Attribution 4.0 International (CC BY 4.0)

    Contact: For questions about data usage or methodology, please contact the corresponding author through the institutional affiliations provided in the paper.


  16. OpenOrca

    • kaggle.com
    • opendatalab.com
    • +1more
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). OpenOrca [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-orca-augmented-flan-dataset/versions/2
    Explore at:
    zip (2548102631 bytes). Available download formats
    Dataset updated
    Nov 22, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open-Orca Augmented FLAN Dataset

    Unlocking Advanced Language Understanding and ML Model Performance

    By Huggingface Hub [source]

    About this dataset

    The Open-Orca Augmented FLAN Collection is a revolutionary dataset that unlocks new levels of language understanding and machine learning model performance. This dataset was created to support research on natural language processing, machine learning models, and language understanding through leveraging the power of reasoning trace-enhancement techniques. By enabling models to understand complex relationships between words, phrases, and even entire sentences in a more robust way than ever before, this dataset provides researchers expanded opportunities for furthering the progress of linguistics research. With its unique combination of features including system prompts, questions from users and responses from systems, this dataset opens up exciting possibilities for deeper exploration into the cutting edge concepts underlying advanced linguistics applications. Experience a new level of accuracy and performance - explore Open-Orca Augmented FLAN Collection today!

    How to use the dataset

    This guide provides an introduction to the Open-Orca Augmented FLAN Collection dataset and outlines how researchers can utilize it for their language understanding and natural language processing (NLP) work. The Open-Orca dataset includes system prompts, questions posed by users, and responses from the system.

    Getting Started: First, download the dataset from Kaggle at https://www.kaggle.com/openai/open-orca-augmented-flan and save it in a project directory of your choice on your computer or in cloud storage. Once the download is complete, open the dataset in Jupyter Notebook or Google Colab, whichever environment you prefer to work in.

    Exploring & Preprocessing Data: To get a better understanding of the features in this dataset, import it into a Pandas DataFrame as shown below. You can use other libraries as needed:

    import pandas as pd   # Library used for loading datasets into Python
    
    df = pd.read_csv('train.csv')  # Imports the train.csv file into a Pandas DataFrame
    
    df[['system_prompt', 'question', 'response']].head()  # Views the top 5 rows of the 'system_prompt', 'question', and 'response' columns
    

    After importing, inspect each feature with basic descriptive statistics such as Pandas value_counts() or groupby(); these give greater clarity over the values present in each feature. The command below shows the count of each element in the system_prompt column of the train.csv file:

    df['system_prompt'].value_counts().head()  # Shows the count of each distinct value in the 'system_prompt' column
    # Example output (illustrative): 'User says hello guys': 587, 'System asks How are you?': 555, 'User says I am doing good': 487, and so on
    

    Data Transformation: After inspecting and exploring the features, you may want to apply certain transformations before training models on this dataset.
    A common step is removing punctuation: since punctuation marks may not add value to downstream computations, they can be stripped with a regex-based .str.replace(), as sketched below.
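
    A minimal sketch of that step; it reloads train.csv for self-containment, and the regex and column names are illustrative and can be adapted as needed:

    import pandas as pd
    
    df = pd.read_csv('train.csv')
    
    # Strip everything that is not a letter or a space from the free-text columns
    for col in ['question', 'response']:
        df[col] = df[col].astype(str).str.replace('[^A-Za-z ]+', '', regex=True)
    
    df[['question', 'response']].head()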

    Research Ideas

    • Automated Question Answering: Leverage the dataset to train and develop question answering models that can provide tailored answers to specific user queries while retaining language understanding abilities.
    • Natural Language Understanding: Use the dataset as an exploratory tool for fine-tuning natural language processing applications, such as sentiment analysis, document categorization, parts-of-speech tagging and more.
    • Machine Learning Optimizations: The dataset can be used to build highly customized machine learning pipelines that allow users to harness the power of conditioning data with pre-existing rules or models for improved accuracy and performance in automated tasks

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and the data source listed above.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication - No Copyright. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission.

  17. z

    The Cultural Resource Curse: How Trade Dependence Undermines Creative...

    • zenodo.org
    bin, csv
    Updated Aug 9, 2025
    Cite
    Anon Anon; Anon Anon (2025). The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries [Dataset]. http://doi.org/10.5281/zenodo.16784974
    Explore at:
    csv, binAvailable download formats
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    Zenodo
    Authors
    Anon Anon; Anon Anon
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies the study The Cultural Resource Curse: How Trade Dependence Undermines Creative Industries. It contains country-year panel data for 2000–2023 covering both OECD economies and the ten largest Latin American countries by land area. Variables include GDP per capita (constant PPP, USD), trade openness, internet penetration, education indicators, cultural exports per capita, and executive constraints from the Polity V dataset.

    The dataset supports a comparative analysis of how economic structure, institutional quality, and infrastructure shape cultural export performance across development contexts. Within-country fixed effects models show that trade openness constrains cultural exports in OECD economies but has no measurable effect in resource-dependent Latin America. In contrast, strong executive constraints benefit cultural industries in advanced economies while constraining them in extraction-oriented systems. The results provide empirical evidence for a two-stage development framework in which colonial extraction legacies create distinct constraints on creative industry growth.

    All variables are harmonized to ISO3 country codes and aligned on a common panel structure. The dataset is fully reproducible using the included Jupyter notebooks (OECD.ipynb, LATAM+OECD.ipynb, cervantes.ipynb).
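
    For orientation, a minimal sketch of a within-country fixed effects specification in the spirit of the analysis described above, assuming a merged country-year panel with illustrative column names (the actual variable names and merge steps live in the notebooks):

    python
    import pandas as pd
    import statsmodels.formula.api as smf
    
    # Hypothetical merged panel: one row per country-year; file and column names are illustrative
    panel = pd.read_csv('merged_panel.csv').dropna()
    
    # Country and year fixed effects via dummies; standard errors clustered by country
    model = smf.ols(
        'cultural_exports_pc ~ trade_openness + gdp_pc + internet_penetration'
        ' + exec_constraints + C(iso3) + C(year)',
        data=panel,
    ).fit(cov_type='cluster', cov_kwds={'groups': panel['iso3']})
    print(model.summary())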

    Contents:

    • GDPPC.csv — GDP per capita series from the World Bank.

    • explanatory.csv — Trade openness, internet penetration, and education indicators.

    • culture_exports.csv — UNESCO cultural export data.

    • p5v2018.csv — Polity V institutional indicators.

    • Jupyter notebooks for data processing and replication.

    Potential uses: Comparative political economy, cultural economics, institutional development, and resource curse research.

    How to Run This Dataset and Code in Google Colab

    These steps reproduce the OECD vs. Latin America analyses from the paper using the provided CSVs and notebooks.

    1) Open Colab and set up

    1. Go to https://colab.research.google.com

    2. Click File → New notebook.

    3. (Optional) If your files are in Google Drive, mount it:

    python
    from google.colab import drive
    drive.mount('/content/drive')

    2) Get the data files into Colab

    You have two easy options:

    A. Upload the 4 CSVs + notebooks directly

    • In the left sidebar, click the folder icon → Upload.

    • Upload: GDPPC.csv, explanatory.csv, culture_exports.csv, p5v2018.csv, and any .ipynb you want to run.

    B. Use Google Drive

    • Put those files in a Drive folder.

    • After mounting Drive, refer to them with paths like /content/drive/MyDrive/your_folder/GDPPC.csv.
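
    Once the files are available, a minimal sketch for loading them (paths assume option A; for option B, prefix each filename with your Drive folder path):

    python
    import pandas as pd
    
    # Load the four CSVs shipped with the dataset
    gdp = pd.read_csv('GDPPC.csv')
    explanatory = pd.read_csv('explanatory.csv')
    culture = pd.read_csv('culture_exports.csv')
    polity = pd.read_csv('p5v2018.csv')
    
    # Quick sanity check on shapes before aligning on ISO3 codes and year
    for name, frame in [('GDPPC', gdp), ('explanatory', explanatory),
                        ('culture_exports', culture), ('p5v2018', polity)]:
        print(name, frame.shape)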

  18. Z

    TWristAR - wristband activity recognition

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 7, 2022
    Cite
    Lee B. Hinkle; Gentry Atkinson; Vangelis Metsis (2022). TWristAR - wristband activity recognition [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5911807
    Explore at:
    Dataset updated
    Feb 7, 2022
    Dataset provided by
    Texas State University
    Authors
    Lee B. Hinkle; Gentry Atkinson; Vangelis Metsis
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TWristAR is a small three-subject dataset recorded using an E4 wristband. Each subject performed six scripted activities: upstairs/downstairs, walk/jog, and sit/stand. Each activity except stairs was performed for one minute, three times in total, alternating between the paired activities. Subjects 1 & 2 also completed a walking sequence of approximately 10 minutes. The dataset contains motion (accelerometer) data, temperature, electrodermal activity, and heart rate data. The .csv file associated with each data file contains timing and labeling information and was built using the provided Excel files.

    Each two-activity session was recorded using a downward-facing action camera. This video was used to generate the labels and is provided to help investigate any data anomalies, especially for the free-form long walk. For privacy reasons, only the sub1_stairs video contains audio.

    The Jupyter notebook processes the acceleration data and performs a hold-one-subject-out evaluation of a 1D-CNN. Example results from a run performed on a Google Colab GPU instance (without a GPU, training time increases to about 90 seconds per pass):

    Hold-one-subject-out results

        Train Sub    Test Sub    Accuracy    Training Time (HH:MM:SS)
        [1,2]        [3]         0.757       00:00:12
        [2,3]        [1]         0.849       00:00:14
        [1,3]        [2]         0.800       00:00:11

    This notebook can also be run in Colab. This video describes the processing: https://mediaflo.txstate.edu/Watch/e4_data_processing.
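
    For orientation, a minimal sketch of the hold-one-subject-out protocol, using random placeholder data and a simple classifier standing in for the notebook's 1D-CNN:

    python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    
    # Placeholder arrays: X holds windowed accelerometer features, y the activity labels,
    # and subjects the subject ID (1-3) for each window
    rng = np.random.default_rng(0)
    X = rng.random((600, 32))
    y = rng.integers(0, 6, 600)
    subjects = np.repeat([1, 2, 3], 200)
    
    # Train on two subjects, test on the held-out one
    for test_sub in [1, 2, 3]:
        train = subjects != test_sub
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        acc = accuracy_score(y[~train], clf.predict(X[~train]))
        print(f"Held-out subject {test_sub}: accuracy {acc:.3f}")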

    We hope you find this a useful dataset with end-to-end code. We have several papers in process and would appreciate your citation of the dataset if you use it in your work.

  19. h

    EnglishDatasets

    • huggingface.co
    Updated Dec 30, 2024
    Cite
    Whitzscott (2024). EnglishDatasets [Dataset]. https://huggingface.co/datasets/Whitzz/EnglishDatasets
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 30, 2024
    Authors
    Whitzscott
    Description

    Dataset Overview

    Dataset Name: Whitzz/EnglishDatasets
    License: MIT
    Description: This dataset consists of 100,000+ English words, provided in several versions ranging from small to huge and scraped from multiple sources. It can be used for fine-tuning language models, performing text processing tasks, or for applications like spell-checking, word categorization, and more.

      How to Use
    

    To use this dataset in Google Colab or any Python environment, follow these steps:… See the full description on the dataset page: https://huggingface.co/datasets/Whitzz/EnglishDatasets.
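
    The steps are truncated here, but as a minimal sketch, assuming the dataset loads directly through the Hugging Face datasets library under the repository ID above (the available splits and columns may differ):

    python
    # Install and load via the Hugging Face datasets library
    !pip install datasets
    
    from datasets import load_dataset
    
    ds = load_dataset("Whitzz/EnglishDatasets")
    print(ds)  # Shows the available splits and column names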

  20. R

    Anki Vector Robot Dataset

    • universe.roboflow.com
    zip
    Updated Aug 29, 2025
    + more versions
    Cite
    vectorstuff (2025). Anki Vector Robot Dataset [Dataset]. https://universe.roboflow.com/vectorstuff/vectorcompletedataset/model/4
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 29, 2025
    Dataset authored and provided by
    vectorstuff
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    VectorCompleteDataset Bounding Boxes
    Description

    Background

    The Anki Vector robot (assets currently owned by Digital Dream Labs LLC, which bought Anki's assets in 2019) was first introduced in 2018. In my opinion, the Vector robot is the cheapest fully functional autonomous robot ever built. The Vector robot can be trained to recognize people; however, Vector does not have the ability to recognize another Vector. This dataset was designed to allow one to train a model that can detect a Vector robot in the camera feed of another Vector robot.

    Details Pictures were taken with Vector's camera while another Vector faced it and was free to move, which allowed pictures to be captured from different angles. These pictures were then labeled by marking rectangular regions around Vector in all the images with the help of labelImg, a free Linux utility. Different backgrounds and lighting conditions were used, and there is also a collection of pictures without Vector.

    Example An example use case is available in my Google Colab notebook, a version of which can be found in my Git repository.

    More More details are available in this article on my blog. If you are new to Computer Vision, Deep Learning, or AI, you can consider my course 'Learn AI with a Robot', which teaches AI based on the AI4K12.org curriculum. There are more details available in this post.
