58 datasets found
  1. Data from: Efficient Model-Free Subsampling Method for Massive Data

    • tandf.figshare.com
    Updated Feb 14, 2024
    Cite
    Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou (2024). Efficient Model-Free Subsampling Method for Massive Data [Dataset]. http://doi.org/10.6084/m9.figshare.24347102.v2
    Available download formats: txt
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, which means their performances can drop significantly when the underlying model is misspecified. Such an issue calls for model-free subsampling methods that are robust under diverse model specifications. Recently, several model-free subsampling methods have been developed. However, the computing time of these methods grows explosively with the sample size, making them impractical for handling massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into some regular data blocks and obtains subsamples from each data block by the data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly for datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 10^7. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.
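
    The article's exact data-driven rule is not reproduced here, but the block-then-subsample structure it describes can be sketched. A minimal Python illustration, using greedy farthest-point selection as a stand-in for the paper's subsampling criterion (an assumption for illustration, not the authors' method):

    ```python
    import numpy as np

    def farthest_point_subsample(block: np.ndarray, k: int) -> np.ndarray:
        # Greedy farthest-point selection: a simple space-filling stand-in
        # for the paper's data-driven rule (illustrative only).
        chosen = [0]
        dist = np.linalg.norm(block - block[0], axis=1)
        for _ in range(k - 1):
            nxt = int(np.argmax(dist))
            chosen.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(block - block[nxt], axis=1))
        return block[chosen]

    def blockwise_subsample(data: np.ndarray, n_blocks: int, k_per_block: int) -> np.ndarray:
        # Segment the data into regular blocks, then subsample each block
        # independently; per-block cost is bounded by the block size.
        blocks = np.array_split(data, n_blocks)
        return np.vstack([farthest_point_subsample(b, k_per_block) for b in blocks])

    rng = np.random.default_rng(0)
    data = rng.standard_normal((100_000, 2))
    subsample = blockwise_subsample(data, n_blocks=100, k_per_block=10)  # 1,000 rows
    ```

    Because each block is processed independently, the per-block cost stays bounded, which is what keeps the overall cost from growing explosively with the full sample size.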

  2. Census of Population, 1880: Public Use Sample (1 in 1000 Preliminary Subsample)

    • archive.ciser.cornell.edu
    Updated Feb 25, 2020
    Cite
    Russell Menard; Steven Ruggles (2020). Census of Population, 1880: Public Use Sample (1 in 1000 Preliminary Subsample) [Dataset]. http://doi.org/10.6077/j5/wrvf3n
    Dataset updated
    Feb 25, 2020
    Authors
    Russell Menard; Steven Ruggles
    Variables measured
    Individual, Family, Household
    Description

    This collection is a nationally representative--although clustered--1 in 1000 preliminary subsample of the United States population in 1880. The subsample is based on every tenth microfilm reel of enumeration forms (there are a total of 1,454 reels) and, within each reel, on the census page itself. In terms of the Public Use Sample as a whole, a sample density of 1 person per 100 was chosen so that a single sample point was randomly generated for every two census pages. Sample points were chosen for inclusion in the collection only if the individual selected was the first person listed in the dwelling. Under this procedure each dwelling, family, and individual in the population had a 1 in 100 probability of inclusion in the Public Use Sample.
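
    The 1-in-100 inclusion claim can be checked with a toy Monte Carlo. A hedged sketch, assuming one sample point drawn uniformly per 100 census lines (two 50-line pages) and dwellings of random size laid end to end; the numbers are illustrative, not the actual 1880 form layout:

    ```python
    import random

    random.seed(42)
    LINES_PER_POINT = 100  # one sample point per two 50-line census pages (assumed layout)

    def inclusion_rate(num_dwellings: int, trials: int) -> float:
        # Lay dwellings of random size end to end as census lines; the first
        # line of each dwelling is the "first person listed in the dwelling".
        sizes = [random.randint(1, 8) for _ in range(num_dwellings)]
        first_lines, pos = set(), 0
        for s in sizes:
            first_lines.add(pos)
            pos += s
        hits = 0
        for _ in range(trials):
            for start in range(0, pos, LINES_PER_POINT):
                point = random.randrange(start, min(start + LINES_PER_POINT, pos))
                if point in first_lines:
                    hits += 1
        return hits / (trials * num_dwellings)

    print(inclusion_rate(num_dwellings=10_000, trials=200))  # close to 0.01
    ```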

    Please Note: This dataset is part of the historical CISER Data Archive Collection and is also available at ICPSR at https://doi.org/10.3886/ICPSR09474.v1. We highly recommend using the ICPSR version as they may make this dataset available in multiple data formats in the future.

  3. STEAD subsample 4 CDiffSD

    • zenodo.org
    Updated Apr 30, 2024
    + more versions
    Cite
    Daniele Trappolini (2024). STEAD subsample 4 CDiffSD [Dataset]. http://doi.org/10.5281/zenodo.11094536
    Available download formats: bin
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniele Trappolini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    STEAD Subsample Dataset for CDiffSD Training

    Overview

    This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each of which can be opened with Python's `h5py` library.

    Dataset Files

    The dataset includes the following files:

    • train: Used for both training and validation phases (a validation split is taken from this set). Contains earthquake ground truth traces.
    • noise_train: Used for both training and validation phases. Contains noise used to contaminate the traces.
    • test: Used for the testing phase, structured similarly to train.
    • noise_test: Used for the testing phase, contains noise data for testing.

    Each file is structured to support the training and evaluation of seismic denoising models.

    Data

    The noise HDF5 files (noise_train and noise_test) contain two main datasets:

    • traces: This dataset includes N events, each 6000 samples long (the length of the traces). Each trace is organized into three channels in the following order: E (East-West), N (North-South), Z (Vertical).
    • metadata: This dataset contains the names of the traces for each event.

    Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:

    • p_arrival: Contains the arrival indices of P-waves, expressed in counts.
    • s_arrival: Contains the arrival indices of S-waves, also expressed in counts.


    Usage

    To load these files in a Python environment, use the following approach:

    ```python
    import h5py
    import numpy as np

    # Open the HDF5 file in read mode
    with h5py.File('train_noise.hdf5', 'r') as file:
        # Print all the main keys in the file
        print("Keys in the HDF5 file:", list(file.keys()))

        if 'traces' in file:
            # Access the dataset
            data = file['traces'][:10]  # Load the first 10 traces

        if 'metadata' in file:
            # Access the dataset
            trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
    ```

    Ensure that the path to the file is correctly specified relative to your Python script.
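
    Since the noise files exist to contaminate the clean traces, a common pattern is to add scaled noise at a target signal-to-noise ratio. A minimal sketch of that pairing; the SNR scaling rule, file names, and array layout are assumptions, not part of the dataset specification:

    ```python
    import h5py
    import numpy as np

    with h5py.File('train.hdf5', 'r') as eq, h5py.File('train_noise.hdf5', 'r') as nz:
        clean = eq['traces'][0]          # one event; assumed layout (6000, 3) -- check .shape
        noise = nz['traces'][0]
        p_idx = int(eq['p_arrival'][0])  # P-wave arrival index, in counts

        # Scale the noise so the mixture has a chosen SNR (in dB), then contaminate.
        target_snr_db = 5.0
        sig_pow = np.mean(clean.astype(np.float64) ** 2)
        noise_pow = np.mean(noise.astype(np.float64) ** 2)
        scale = np.sqrt(sig_pow / (noise_pow * 10 ** (target_snr_db / 10)))
        noisy = clean + scale * noise    # model input; `clean` is the denoising target
    ```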

    Requirements

    To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:

    ```bash
    pip install numpy
    pip install h5py
    ```

  4. NExtLong-512K-dataset-subset

    • huggingface.co
    Updated May 23, 2025
    + more versions
    Cite
    KCSG Knowledge Computing and Service Group, IIE, CAS (2025). NExtLong-512K-dataset-subset [Dataset]. https://huggingface.co/datasets/caskcsg/NExtLong-512K-dataset-subset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 23, 2025
    Dataset authored and provided by
    KCSG Knowledge Computing and Service Group, IIE, CAS
    Description

    NExtLong: Toward Effective Long-Context Training without Long Documents

    This repository contains the code, models, and datasets for our paper NExtLong: Toward Effective Long-Context Training without Long Documents. [Github]

      Quick Links
    

    • Overview
    • NExtLong Models
    • NExtLong Datasets
    • Datasets list
    • How to use NExtLong datasets
    • Bugs or Questions?

      Overview
    

    Large language models (LLMs) with extended context windows have made significant strides yet remain a… See the full description on the dataset page: https://huggingface.co/datasets/caskcsg/NExtLong-512K-dataset-subset.

  5. Subset Generate Dataset

    • universe.roboflow.com
    Updated Jul 23, 2023
    Cite
    HLCV2023finalproject (2023). Subset Generate Dataset [Dataset]. https://universe.roboflow.com/hlcv2023finalproject/subset-generate/dataset/2
    Available download formats: zip
    Dataset updated
    Jul 23, 2023
    Dataset authored and provided by
    HLCV2023finalproject
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Euro Coins Bounding Boxes
    Description

    Subset Generate

    ## Overview
    
    Subset Generate is a dataset for object detection tasks - it contains Euro Coins annotations for 2,026 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  6. Subset Dataset

    • universe.roboflow.com
    Updated May 23, 2023
    Cite
    BH (2023). Subset Dataset [Dataset]. https://universe.roboflow.com/bh-zza8w/subset/dataset/1
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset authored and provided by
    BH
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Toys Bounding Boxes
    Description

    Subset

    ## Overview
    
    Subset is a dataset for object detection tasks - it contains Toys annotations for 201 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  7. Dior Subset Dataset

    • universe.roboflow.com
    Updated Mar 2, 2023
    Cite
    Master Thesis (2023). Dior Subset Dataset [Dataset]. https://universe.roboflow.com/master-thesis-it8vi/dior-subset
    Available download formats: zip
    Dataset updated
    Mar 2, 2023
    Dataset authored and provided by
    Master Thesis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Airplanes, Vehicles, Ships (Bounding Boxes)
    Description

    DIOR Subset

    ## Overview
    
    DIOR Subset is a dataset for object detection tasks - it contains Airplanes Vehicles Ships annotations for 927 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. Keras video classification example with a subset of UCF101 - Action Recognition Data Set (top 10 videos)

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 11, 2023
    + more versions
    Cite
    Mikolaj Buchwald (2023). Keras video classification example with a subset of UCF101 - Action Recognition Data Set (top 10 videos) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882860
    Dataset updated
    May 11, 2023
    Dataset authored and provided by
    Mikolaj Buchwald
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classify video clips with natural scenes of actions performed by people visible in the videos.

    See the UCF101 Dataset web page: https://www.crcv.ucf.edu/data/UCF101.php#Results_on_UCF101

    This example dataset consists of the 10 most numerous video classes from the UCF101 dataset. For the top 5 version, see: https://doi.org/10.5281/zenodo.7924745.

    Based on this code: https://keras.io/examples/vision/video_classification/ (which needs to be updated, if it has not been already; see the issue: https://github.com/keras-team/keras-io/issues/1342).

    For testing whether the data can be downloaded from figshare with wget, see: https://github.com/mojaveazure/angsd-wrapper/issues/10

    For generating the subset, see this notebook: https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb -- however, it also needs to be adjusted (if it has not been already; once corrected, I will post a link to the notebook here or elsewhere, e.g., in the corrected notebook with the Keras example).

    I would like to thank Sayak Paul for contacting me about his example at Keras documentation being out of date.

    Cite this dataset as:

    Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. https://doi.org/10.48550/arXiv.1212.0402

    To download the dataset via the command line, please use:

    ```bash
    wget -q https://zenodo.org/record/7882861/files/ucf101_top10.tar.gz -O ucf101_top10.tar.gz
    tar xf ucf101_top10.tar.gz
    ```

  9. Ip102 Subset Dataset

    • universe.roboflow.com
    Updated Feb 10, 2025
    Cite
    pest (2025). Ip102 Subset Dataset [Dataset]. https://universe.roboflow.com/pest-jyqit/ip102-subset/model/2
    Available download formats: zip
    Dataset updated
    Feb 10, 2025
    Dataset authored and provided by
    pest
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    16 Wireworm 50 Legume Blis Bounding Boxes
    Description

    IP102 Subset

    ## Overview
    
    IP102 Subset is a dataset for object detection tasks - it contains 16 Wireworm 50 Legume Blis annotations for 2,955 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  10. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Loist, Skadi
    Samoilova, Evgenia (Zhenya)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review at NECSUS_European Journal of Media Studies, an open-access journal that aims to enhance data transparency and reusability; once published, it will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
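
    As an illustration of the long/wide relationship described above, a short pandas sketch (a sketch only: `film_id` and `year` are stand-ins for the actual variable names documented in the codebook):

    ```python
    import pandas as pd

    long_df = pd.read_csv("1_film-dataset_festival-program_long.csv")

    # Long format: one row per (film, festival) appearance.
    # Wide format: one row per unique film, keeping the first sample festival,
    # which is how the wide file resolves the ~6% festival overlap.
    wide_df = (long_df
               .sort_values(["film_id", "year"])
               .drop_duplicates(subset="film_id", keep="first"))
    ```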

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, falling back to an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching uses data on directors, production year (+/- one year), and title, with a fuzzy matching approach combining two methods, “cosine” and “osa”: cosine similarity matches titles with a high degree of similarity, while the OSA algorithm matches titles that may have typos or minor variations.
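
    The scripts themselves are in R, but the two string measures are easy to state. A self-contained Python sketch of cosine similarity over character bigrams and the OSA (optimal string alignment) distance, for intuition only:

    ```python
    import math
    from collections import Counter

    def cosine_sim(a: str, b: str, q: int = 2) -> float:
        # Cosine similarity between character q-gram count vectors.
        ga = Counter(a[i:i + q] for i in range(len(a) - q + 1))
        gb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
        dot = sum(ga[g] * gb[g] for g in ga)
        norm = math.sqrt(sum(v * v for v in ga.values())) * \
               math.sqrt(sum(v * v for v in gb.values()))
        return dot / norm if norm else 0.0

    def osa_distance(a: str, b: str) -> int:
        # Levenshtein distance extended with adjacent transpositions
        # (each substring edited at most once), catching swapped-letter typos.
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[m][n]

    print(osa_distance("casablanca", "casablanac"))         # 1: one transposition
    print(round(cosine_sim("the godfather", "godfather, the"), 2))
    ```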

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This

  11. NLM LitArch Open Access Subset

    • catalog.data.gov
    • datadiscovery.nlm.nih.gov
    • +3more
    Updated Jun 19, 2025
    + more versions
    Cite
    National Library of Medicine (2025). NLM LitArch Open Access Subset [Dataset]. https://catalog.data.gov/dataset/nlm-litarch-open-access-subset
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    National Library of Medicine
    Description

    A subset of the total collection of books and documents in the NLM Literature Archive (NLM LitArch), accessible through the Bookshelf website, is available through the NLM LitArch Open Access subset. Contents in the NLM LitArch Open Access subset generally include works which are in the public domain, works which are available under a Creative Commons or similar license, and works whose authors or publishers have explicitly agreed to the terms of the NLM LitArch Open Access subset. Except for public domain works, works in the NLM LitArch Open Access subset are still protected by copyright, but are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. The license terms are not the same for each work. Read the license text, which is available with each downloadable file, to determine the terms of use.

  12. Mot 17 20 Subset Dataset

    • universe.roboflow.com
    Updated Apr 11, 2024
    Cite
    subsetMOT1720 (2024). Mot 17 20 Subset Dataset [Dataset]. https://universe.roboflow.com/subsetmot1720/mot-17-20-subset
    Available download formats: zip
    Dataset updated
    Apr 11, 2024
    Dataset authored and provided by
    subsetMOT1720
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Variables measured
    Mot Bounding Boxes
    Description

    MOT 17 20 Subset

    ## Overview
    
    MOT 17 20 Subset is a dataset for object detection tasks - it contains Mot annotations for 3,849 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY-NC-SA 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).
    
  13. Crddc Custom Subset Dataset

    • universe.roboflow.com
    Updated Oct 7, 2022
    Cite
    crddccustomsubset (2022). Crddc Custom Subset Dataset [Dataset]. https://universe.roboflow.com/crddccustomsubset/crddc-custom-subset
    Available download formats: zip
    Dataset updated
    Oct 7, 2022
    Dataset authored and provided by
    crddccustomsubset
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Type Of Road Damage Bounding Boxes
    Description

    Crddc Custom Subset

    ## Overview
    
    Crddc Custom Subset is a dataset for object detection tasks - it contains Type Of Road Damage annotations for 4,235 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  14. AORC Subset

    • hydroshare.org
    • beta.hydroshare.org
    • +1more
    Updated Dec 6, 2023
    Cite
    Ayman Nassar; David Tarboton; Anthony M. Castronova (2023). AORC Subset [Dataset]. https://www.hydroshare.org/resource/c1bce473fff641d7a678565af9785c31
    Available download formats: zip (28.3 KB)
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    HydroShare
    Authors
    Ayman Nassar; David Tarboton; Anthony M. Castronova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2010 - Dec 31, 2019
    Description

    The objective of this HydroShare resource is to query AORC v1.0 Forcing data stored on HydroShare's Thredds server and create a subset of this dataset for a designated watershed and timeframe. The user is prompted to define their temporal and spatial frames of interest, which specifies the start and end dates for the data subset. Additionally, the user is prompted to define a spatial frame of interest, which could be a bounding box or a shapefile, to subset the data spatially.

    Before the subsetting is performed, data is queried, and geospatial metadata is added to ensure that the data is correctly aligned with its corresponding location on the Earth's surface. To achieve this, two separate notebooks were created - this notebook and this notebook - which explain how to query the dataset and add geospatial metadata to AORC v1.0 data in detail, respectively. In this notebook, we call functions from the AORC.py script to perform these preprocessing steps, resulting in a cleaner notebook that focuses solely on the subsetting process.
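
    A minimal xarray sketch of the time-and-space subsetting this resource performs; the OPeNDAP URL and dimension names below are assumptions (the actual values come from this resource's notebooks and the AORC.py script):

    ```python
    import xarray as xr

    # Hypothetical OPeNDAP endpoint on HydroShare's THREDDS server.
    url = "https://thredds.hydroshare.org/thredds/dodsC/aorc/AORC_v1.0.nc"

    ds = xr.open_dataset(url)

    # Temporal frame, then a bounding-box spatial frame (assumed dim names;
    # flip the latitude slice if the coordinate is stored descending).
    subset = ds.sel(
        time=slice("2010-01-01", "2019-12-31"),
        latitude=slice(41.0, 42.0),
        longitude=slice(-112.0, -111.0),
    )
    subset.to_netcdf("aorc_subset.nc")
    ```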

  15. Data from: Time for a rethink: time sub-sampling methods in disparity-through-time analyses

    • search.dataone.org
    • data.niaid.nih.gov
    • +2more
    Updated Apr 12, 2025
    Cite
    Thomas Guillerme; Natalie Cooper (2025). Time for a rethink: time sub-sampling methods in disparity-through-time analyses [Dataset]. http://doi.org/10.5061/dryad.vp4q518
    Dataset updated
    Apr 12, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Thomas Guillerme; Natalie Cooper
    Time period covered
    Apr 5, 2019
    Description

    Disparity-through-time analyses can be used to determine how morphological diversity changes in response to mass extinctions, and to investigate the drivers of morphological change. These analyses are routinely applied to palaeobiological datasets, yet although there is much discussion about how to best calculate disparity, there has been little consideration of how taxa should be sub-sampled through time. Standard practice is to group taxa into discrete time bins, often based on stratigraphic periods. However, this can introduce biases when bins are of unequal size, and implicitly assumes a punctuated model of evolution. In addition, many time bins may have few or no taxa, meaning that disparity cannot be calculated for the bin and making it harder to complete downstream analyses. Here we describe a different method to complement the disparity-through-time tool-kit: time-slicing. This method uses a time-calibrated phylogenetic tree to sample disparity-through-time at any fixed point in...

  16. NLCD 2011 Land Cover California Subset

    • catalog.data.gov
    • data.cnra.ca.gov
    • +6more
    Updated Nov 27, 2024
    + more versions
    Cite
    California Department of Fish and Wildlife (2024). NLCD 2011 Land Cover California Subset [Dataset]. https://catalog.data.gov/dataset/nlcd-2011-land-cover-california-subset
    Dataset updated
    Nov 27, 2024
    Dataset provided by
    California Department of Fish and Wildlife
    Area covered
    California
    Description

    The U.S. Geological Survey (USGS), in partnership with several federal agencies, has developed and released five National Land Cover Database (NLCD) products over the past two decades: NLCD 1992, 2001, 2006, 2011, and 2016. The 2016 release saw land cover created for the additional years of 2003, 2008, and 2013. These products provide spatially explicit and reliable information on the Nation’s land cover and land cover change. To continue the legacy of NLCD and further establish a long-term monitoring capability for the Nation’s land resources, the USGS has designed a new generation of NLCD products named NLCD 2019. The NLCD 2019 design aims to provide innovative, consistent, and robust methodologies for production of a multi-temporal land cover and land cover change database from 2001 to 2019 at 2–3-year intervals. Comprehensive research was conducted and resulted in developed strategies for NLCD 2019: continued integration between impervious surface and all landcover products with impervious surface being directly mapped as developed classes in the landcover, a streamlined compositing process for assembling and preprocessing based on Landsat imagery and geospatial ancillary datasets; a multi-source integrated training data development and decision-tree based land cover classifications; a temporally, spectrally, and spatially integrated land cover change analysis strategy; a hierarchical theme-based post-classification and integration protocol for generating land cover and change products; a continuous fields biophysical parameters modeling method; and an automated scripted operational system for the NLCD 2019 production. The performance of the developed strategies and methods was tested in twenty composite referenced areas throughout the conterminous U.S. An overall accuracy assessment from the 2016 publication gives a 91% overall landcover accuracy, with the developed classes also showing 91% overall accuracy. Results from this study confirm the robustness of this comprehensive and highly automated procedure for NLCD 2019 operational mapping. Questions about the NLCD 2019 land cover product can be directed to the NLCD 2019 land cover mapping team at USGS EROS, Sioux Falls, SD (605) 594-6151 or mrlc@usgs.gov. See included spatial metadata for more details.

  17. Brazil - XII Recenseamento Geral do Brasil. Censo Demográfico 2010 - IPUMS Subset

    • wbwaterdata.org
    Updated Mar 16, 2020
    + more versions
    Cite
    (2020). Brazil - XII Recenseamento Geral do Brasil. Censo Demográfico 2010 - IPUMS Subset - Dataset - waterdata [Dataset]. https://wbwaterdata.org/dataset/brazil-xii-recenseamento-geral-do-brasil-censo-demogrfico-2010-ipums-subset
    Dataset updated
    Mar 16, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Brazil
    Description

    IPUMS-International is an effort to inventory, preserve, harmonize, and disseminate census microdata from around the world. The project has collected the world's largest archive of publicly available census samples. The data are coded and documented consistently across countries and over time to facilitate comparative research. IPUMS-International makes these data available to qualified researchers free of charge through a web dissemination system. The IPUMS project is a collaboration of the Minnesota Population Center, National Statistical Offices, and international data archives. Major funding is provided by the U.S. National Science Foundation and the Demographic and Behavioral Sciences Branch of the National Institute of Child Health and Human Development. Additional support is provided by the University of Minnesota Office of the Vice President for Research, the Minnesota Population Center, and Sun Microsystems.

  18. Sony-Total-Dark Dataset

    • paperswithcode.com
    Updated Nov 26, 2023
    Cite
    Qingsen Yan; Yixu Feng; Cheng Zhang; Pei Wang; Peng Wu; Wei Dong; Jinqiu Sun; Yanning Zhang (2023). Sony-Total-Dark Dataset [Dataset]. https://paperswithcode.com/dataset/sony-total-dark
    Dataset updated
    Nov 26, 2023
    Authors
    Qingsen Yan; Yixu Feng; Cheng Zhang; Pei Wang; Peng Wu; Wei Dong; Jinqiu Sun; Yanning Zhang
    Description

    The original SID dataset was introduced in “Learning to See in the Dark”. The subset of the SID dataset captured by the Sony α7S II camera is adopted for evaluation; it contains 2,697 short-/long-exposure RAW image pairs. To make this dataset more challenging, we converted the RAW images to sRGB images with no gamma correction, which makes the resulting images extremely dark.
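
    A hedged sketch of the gamma-free RAW-to-sRGB conversion described above, using the rawpy library; the file name and parameter choices are assumptions about the pipeline, not the authors' released code:

    ```python
    import rawpy
    import imageio

    with rawpy.imread("sony_a7s2_short_exposure.ARW") as raw:  # hypothetical file name
        # Demosaic to sRGB primaries but keep a linear tone curve (no gamma)
        # and disable auto-brightening, so the result stays extremely dark.
        rgb = raw.postprocess(gamma=(1, 1), no_auto_bright=True, output_bps=16)

    imageio.imwrite("sony_total_dark.png", rgb)
    ```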

  19. Fabric Defect Detection Subset Dataset

    • universe.roboflow.com
    Updated Jan 5, 2025
    Cite
    First Workspace Gan (2025). Fabric Defect Detection Subset Dataset [Dataset]. https://universe.roboflow.com/first-workspace-gan/fabric-defect-detection-subset
    Available download formats: zip
    Dataset updated
    Jan 5, 2025
    Dataset authored and provided by
    First Workspace Gan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Defect JGX9 Bounding Boxes
    Description

    Fabric Defect Detection Subset

    ## Overview
    
    Fabric Defect Detection Subset is a dataset for object detection tasks - it contains Defect JGX9 annotations for 594 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
      ## License
    
      This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  20. Chicago Narcotics Crime Jan 2016 - Jul 2020

    • kaggle.com
    Updated Aug 2, 2020
    Cite
    Anugerah Erlaut (2020). Chicago Narcotics Crime Jan 2016 - Jul 2020 [Dataset]. https://www.kaggle.com/aerlaut/chicago-narcotics-jan-2016-jul-2020
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 2, 2020
    Dataset provided by
    Kaggle
    Authors
    Anugerah Erlaut
    License

    https://www.usa.gov/government-works/

    Area covered
    Chicago
    Description

    Introduction

    Chicago is one of America's most iconic cities, with a colorful and storied history. Recently, Chicago was also a setting for one of Netflix's popular series: Ozark. The story has it that Chicago is the center of drug distribution for the Navarro cartel.

    So, how true is the series? A quick search on the internet turns up a recently released DEA report. The report shows that drug crime does exist in Chicago, although the drugs are distributed by the Cartel de Jalisco Nueva Generacion, the Sinaloa Cartel, and the Guerreros Unidos, to name a few.

    Content

    The government of the City of Chicago provides a publicly available crime database accessible via Google BigQuery. I have downloaded a subset of the data with crime_type narcotics and year > 2015; a query sketch follows the column list below. The data contains records between 1 Jan 2016 UTC and 23 Jul 2020 UTC.

    The dataset contains these columns:

    - case_number: ID of the record
    - date: Date of the incident
    - iucr: Category of the crime, per the Illinois Uniform Crime Reporting (IUCR) code. [more](https://data.cityofchicago.org/widgets/c7ck-438e)
    - description: More detailed description of the crime
    - location_description: Location of the crime
    - arrest: Whether an arrest was made
    - domestic: Was the crime domestic?
    - district: The district code where the crime happened. [more](https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Districts-current-/fthy-xz3r)
    - ward: The ward code where the crime happened. [more](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Wards-2015-/sp34-6z76)
    - community_area: The community area code where the crime happened. more
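
    The same subset can be pulled directly with the BigQuery client. A minimal sketch, assuming the public `bigquery-public-data.chicago_crime.crime` table and its standard column names:

    ```python
    from google.cloud import bigquery

    client = bigquery.Client()  # requires Google Cloud credentials

    # Pull the same subset: narcotics crimes after 2015.
    query = """
        SELECT case_number, date, iucr, description, location_description,
               arrest, domestic, district, ward, community_area
        FROM `bigquery-public-data.chicago_crime.crime`
        WHERE primary_type = 'NARCOTICS' AND year > 2015
    """
    df = client.query(query).to_dataframe()
    print(df.shape)
    ```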

    Acknowledgements

    The data is owned and kindly provided by the City of Chicago.

    Inspiration

    Some questions to get you started:

    1. Is there a trend? Is crime increasing or decreasing?
    2. Is there seasonality? Are dealers more likely to be out and about in summer? Do they deal inside in winter?
    3. Are some activities more likely to happen at certain locations?
    4. We tend to think that more deals happen at night, especially as people wind down, and the surroundings get dark. Does the data reflect that?
    5. Are the incidents clustered to a certain district? Certain type of location?

    Lastly, if you are:

    - a newly recruited analyst at the DEA / police, what would you recommend?
    - asked by el jefe del cartel (the boss of the cartel) how to expand or better run the operation, what would you say?

    Happy wrangling!
