61 datasets found

f
Data from: Efficient Model-Free Subsampling Method for Massive Data
tandf.figshare.com
txt
Updated Feb 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou (2024). Efficient Model-Free Subsampling Method for Massive Data [Dataset]. http://doi.org/10.6084/m9.figshare.24347102.v2
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.24347102.v2
Dataset updated
Feb 14, 2024
Dataset provided by
Taylor & Francis
Authors
Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, which means their performances can drop significantly when the underlying model is misspecified. Such an issue calls for model-free subsampling methods that are robust under diverse model specifications. Recently, several model-free subsampling methods have been developed. However, the computing time of these methods grows explosively with the sample size, making them impractical for handling massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into some regular data blocks and obtains subsamples from each data block by the data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly for datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 107. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.
c
Census of Population, 1880: Public Use Sample (1 in 1000 Preliminary...
archive.ciser.cornell.edu
Updated Feb 25, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Russell Menard; Steven Ruggles (2020). Census of Population, 1880: Public Use Sample (1 in 1000 Preliminary Subsample) [Dataset]. http://doi.org/10.6077/j5/wrvf3n
Explore at:
Unique identifier
https://doi.org/10.6077/j5/wrvf3n
Dataset updated
Feb 25, 2020
Authors
Russell Menard; Steven Ruggles
Variables measured
Individual, Family.HouseholdFamily
Description
This collection is a nationally representative--although clustered--1 in 1000 preliminary subsample of the United States population in 1880. The subsample is based on every tenth microfilm reel of enumeration forms (there are a total of 1,454 reels) and, within each reel, on the census page itself. In terms of the Public Use Sample as a whole, a sample density of 1 person per 100 was chosen so that a single sample point was randomly generated for every two census pages. Sample points were chosen for inclusion in the collection only if the individual selected was the first person listed in the dwelling. Under this procedure each dwelling, family, and individual in the population had a 1 in 100 probability of inclusion in the Public Use Sample.

Please Note: This dataset is part of the historical CISER Data Archive Collection and is also available at ICPSR at https://doi.org/10.3886/ICPSR09474.v1. We highly recommend using the ICPSR version as they may make this dataset available in multiple data formats in the future.
h
NExtLong-512K-dataset-subset
huggingface.co
Updated May 23, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KCSG Knowledge Computing and Service Group, IIE, CAS (2025). NExtLong-512K-dataset-subset [Dataset]. https://huggingface.co/datasets/caskcsg/NExtLong-512K-dataset-subset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 23, 2025
Dataset authored and provided by
KCSG Knowledge Computing and Service Group, IIE, CAS
Description
NExtLong: Toward Effective Long-Context Training without Long Documents

This repository contains the code ,models and datasets for our paper NExtLong: Toward Effective Long-Context Training without Long Documents. [Github]

Quick Links

Overview NExtLong Models NExtLong Datasets Datasets list How to use NExtLong datasets

Bugs or Questions?

Overview

Large language models (LLMs) with extended context windows have made significant strides yet remain a… See the full description on the dataset page: https://huggingface.co/datasets/caskcsg/NExtLong-512K-dataset-subset.
STEAD subsample 4 CDiffSD
zenodo.org
bin
Updated Apr 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Daniele Trappolini; Daniele Trappolini (2024). STEAD subsample 4 CDiffSD [Dataset]. http://doi.org/10.5281/zenodo.11094536
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11094536
Dataset updated
Apr 30, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Daniele Trappolini; Daniele Trappolini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 15, 2024
Description
STEAD Subsample Dataset for CDiffSD Training

Overview

This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each saved in a format that requires Python's `h5py` method for opening.

Dataset Files

The dataset includes the following files:

train: Used for both training and validation phases (with validation train split). Contains earthquake ground truth traces.

noise_train: Used for both training and validation phases. Contains noise used to contaminate the traces.

test: Used for the testing phase, structured similarly to train.

noise_test: Used for the testing phase, contains noise data for testing.

Each file is structured to support the training and evaluation of seismic denoising models.

Data

The HDF5 files named noise contain two main datasets:

traces: This dataset includes N number of events, with each event being 6000 in size, representing the length of the traces. Each trace is organized into three channels in the following order: E (East-West), N (North-South), Z (Vertical).

metadata: This dataset contains the names of the traces for each event.

Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:

p_arrival: Contains the arrival indices of P-waves, expressed in counts.

s_arrival: Contains the arrival indices of S-waves, also expressed in counts.

Usage

To load these files in a Python environment, use the following approach:

```python

import h5py
import numpy as np

# Open the HDF5 file in read mode
with h5py.File('train_noise.hdf5', 'r') as file:
# Print all the main keys in the file
print("Keys in the HDF5 file:", list(file.keys()))

if 'traces' in file:
# Access the dataset
data = file['traces'][:10] # Load the first 10 traces

if 'metadata' in file:
# Access the dataset
trace_name = file['metadata'][:10] # Load the first 10 metadata entries```

Ensure that the path to the file is correctly specified relative to your Python script.

Requirements

To use this dataset, ensure you have Python installed along with the Pandas library, which can be installed via pip if not already available:

```bash
pip install numpy
pip install h5py
```
R
Subset Generate Dataset
universe.roboflow.com
zip
Updated Jul 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HLCV2023finalproject (2023). Subset Generate Dataset [Dataset]. https://universe.roboflow.com/hlcv2023finalproject/subset-generate/dataset/2
Explore at:
zipAvailable download formats
Dataset updated
Jul 23, 2023
Dataset authored and provided by
HLCV2023finalproject
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Euro Coins Bounding Boxes
Description
Subset Generate

## Overview Subset Generate is a dataset for object detection tasks - it contains Euro Coins annotations for 2,026 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
R
Dior Subset Dataset
universe.roboflow.com
zip
Updated Mar 2, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Master Thesis (2023). Dior Subset Dataset [Dataset]. https://universe.roboflow.com/master-thesis-it8vi/dior-subset
Explore at:
zipAvailable download formats
Dataset updated
Mar 2, 2023
Dataset authored and provided by
Master Thesis
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Airplanes Vehicles Ships Bounding Boxes
Description
DIOR Subset

## Overview DIOR Subset is a dataset for object detection tasks - it contains Airplanes Vehicles Ships annotations for 927 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
Z
Keras video classification example with a subset of UCF101 - Action...
data.niaid.nih.gov
zenodo.org
Updated May 11, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mikolaj Buchwald (2023). Keras video classification example with a subset of UCF101 - Action Recognition Data Set (top 10 videos) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882860
Explore at:
Dataset updated
May 11, 2023
Dataset authored and provided by
Mikolaj Buchwald
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Classify video clips with natural scenes of actions performed by people visible in the videos.

See the UCF101 Dataset web page: https://www.crcv.ucf.edu/data/UCF101.php#Results_on_UCF101

This example datasets consists of the 10 most numerous video from the UCF101 dataset. For the top 5 version, see: https://doi.org/10.5281/zenodo.7924745 .

Based on this code: https://keras.io/examples/vision/video_classification/ (needs to be updated, if has not yet been already; see the issue: https://github.com/keras-team/keras-io/issues/1342).

Testing if data can be downloaded from figshare with wget, see: https://github.com/mojaveazure/angsd-wrapper/issues/10

For generating the subset, see this notebook: https://colab.research.google.com/github/sayakpaul/Action-Recognition-in-TensorFlow/blob/main/Data_Preparation_UCF101.ipynb -- however, it also needs to be adjusted (if has not yet been already - then I will post a link to the notebook here or elsewhere, e.g., in the corrected notebook with Keras example).

I would like to thank Sayak Paul for contacting me about his example at Keras documentation being out of date.

Cite this dataset as:

Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. https://doi.org/10.48550/arXiv.1212.0402

To download the dataset via the command line, please use:

wget -q https://zenodo.org/record/7882861/files/ucf101_top10.tar.gz -O ucf101_top10.tar.gz tar xf ucf101_top10.tar.gz
Z
Film Circulation dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
Explore at:
Dataset updated
Jul 12, 2024
Dataset provided by
Loist, Skadi
Samoilova, Evgenia (Zhenya)
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This
H
AORC Subset
hydroshare.org
beta.hydroshare.org
+1more
zip
Updated Dec 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayman Nassar; David Tarboton; Anthony M. Castronova (2023). AORC Subset [Dataset]. https://www.hydroshare.org/resource/c1bce473fff641d7a678565af9785c31
Explore at:
zip(28.3 KB)Available download formats
Dataset updated
Dec 6, 2023
Dataset provided by
HydroShare
Authors
Ayman Nassar; David Tarboton; Anthony M. Castronova
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2010 - Dec 31, 2019
Area covered

Description
The objective of this HydroShare resource is to query AORC v1.0 Forcing data stored on HydroShare's Thredds server and create a subset of this dataset for a designated watershed and timeframe. The user is prompted to define their temporal and spatial frames of interest, which specifies the start and end dates for the data subset. Additionally, the user is prompted to define a spatial frame of interest, which could be a bounding box or a shapefile, to subset the data spatially.

Before the subsetting is performed, data is queried, and geospatial metadata is added to ensure that the data is correctly aligned with its corresponding location on the Earth's surface. To achieve this, two separate notebooks were created - this notebook and this notebook - which explain how to query the dataset and add geospatial metadata to AORC v1.0 data in detail, respectively. In this notebook, we call functions from the AORC.py script to perform these preprocessing steps, resulting in a cleaner notebook that focuses solely on the subsetting process.
R
Subset Dataset
universe.roboflow.com
zip
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BH (2023). Subset Dataset [Dataset]. https://universe.roboflow.com/bh-zza8w/subset/dataset/1
Explore at:
zipAvailable download formats
Dataset updated
May 23, 2023
Dataset authored and provided by
BH
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Toys Bounding Boxes
Description
Subset

## Overview Subset is a dataset for object detection tasks - it contains Toys annotations for 201 images. ## Getting Started You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. ## License This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/CC BY 4.0).
d
Data from: Time for a rethink: time sub-sampling methods in...
search.dataone.org
data.niaid.nih.gov
+2more
Updated Apr 12, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thomas Guillerme; Natalie Cooper (2025). Time for a rethink: time sub-sampling methods in disparity-through-time analyses [Dataset]. http://doi.org/10.5061/dryad.vp4q518
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.vp4q518
Dataset updated
Apr 12, 2025
Dataset provided by
Dryad Digital Repository
Authors
Thomas Guillerme; Natalie Cooper
Time period covered
Apr 5, 2019
Description
Disparity-through-time analyses can be used to determine how morphological diversity changes in response to mass extinctions, and to investigate the drivers of morphological change. These analyses are routinely applied to palaeobiological datasets, yet although there is much discussion about how to best calculate disparity, there has been little consideration of how taxa should be sub-sampled through time. Standard practice is to group taxa into discrete time bins, often based on stratigraphic periods. However, this can introduce biases when bins are of unequal size, and implicitly assumes a punctuated model of evolution. In addition, many time bins may have few or no taxa, meaning that disparity cannot be calculated for the bin and making it harder to complete downstream analyses. Here we describe a different method to complement the disparity-through-time tool-kit: time-slicing. This method uses a time-calibrated phylogenetic tree to sample disparity-through-time at any fixed point in...
d
NLM LitArch Open Access Subset
catalog.data.gov
datadiscovery.nlm.nih.gov
+3more
Updated Jun 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Library of Medicine (2025). NLM LitArch Open Access Subset [Dataset]. https://catalog.data.gov/dataset/nlm-litarch-open-access-subset
Explore at:
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
A subset of the total collection of books and documents in the NLM Literature Archive (NLM LitArch), accessible through the Bookshelf website, are available through the NLM LitArch Open Access subset. Contents in the NLM LitArch Open Access subset generally include works which are in the public domain, works which are available under a Creative Commons or similar license, and works whose authors or publishers have explicitly agreed to the terms of the NLM LitArch Open Access subset. Except for public domain works, works in the NLM LitArch Open Access subset are still protected by copyright, but are made available under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work. The license terms are not the same for each work. Read the license text which is available with each downloadable file to determine terms of use.
w
Brazil - XII Recenseamento Geral do Brasil. Censo Demográfico 2010 - IPUMS...
wbwaterdata.org
Updated Mar 16, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2020). Brazil - XII Recenseamento Geral do Brasil. Censo Demográfico 2010 - IPUMS Subset - Dataset - waterdata [Dataset]. https://wbwaterdata.org/dataset/brazil-xii-recenseamento-geral-do-brasil-censo-demogrfico-2010-ipums-subset
Explore at:
Dataset updated
Mar 16, 2020
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Brazil
Description
IPUMS-International is an effort to inventory, preserve, harmonize, and disseminate census microdata from around the world. The project has collected the world's largest archive of publicly available census samples. The data are coded and documented consistently across countries and over time to facillitate comparative research. IPUMS-International makes these data available to qualified researchers free of charge through a web dissemination system. The IPUMS project is a collaboration of the Minnesota Population Center, National Statistical Offices, and international data archives. Major funding is provided by the U.S. National Science Foundation and the Demographic and Behavioral Sciences Branch of the National Institute of Child Health and Human Development. Additional support is provided by the University of Minnesota Office of the Vice President for Research, the Minnesota Population Center, and Sun Microsystems.
P
Sony-Total-Dark Dataset
paperswithcode.com
Updated Nov 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Qingsen Yan; Yixu Feng; Cheng Zhang; Pei Wang; Peng Wu; Wei Dong; Jinqiu Sun; Yanning Zhang (2023). Sony-Total-Dark Dataset [Dataset]. https://paperswithcode.com/dataset/sony-total-dark
Explore at:
Dataset updated
Nov 26, 2023
Authors
Qingsen Yan; Yixu Feng; Cheng Zhang; Pei Wang; Peng Wu; Wei Dong; Jinqiu Sun; Yanning Zhang
Description
Original SID dataset is introduced in "Learning to See in the Dark". The subset of SID dataset captured by Sony α7S II camera is adopted for evaluation. There are 2697 short-long-exposure RAW image pairs. To make this dataset more challenging, we converted the RAW format images to sRGB images with no gamma correction, which resulted in images becoming extremely dark.
Student oriented subset of the Open University Learning Analytics dataset
zenodo.org
data.niaid.nih.gov
csv
Updated Sep 30, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriella Casalino; Gabriella Casalino; Giovanna Castellano; Giovanna Castellano; Gennaro Vessio; Gennaro Vessio (2021). Student oriented subset of the Open University Learning Analytics dataset [Dataset]. http://doi.org/10.5281/zenodo.4264397
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.4264397
Dataset updated
Sep 30, 2021
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Gabriella Casalino; Gabriella Casalino; Giovanna Castellano; Giovanna Castellano; Gennaro Vessio; Gennaro Vessio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Open University (OU) dataset is an open database containing student demographic and click-stream interaction with the virtual learning platform. The available data are structured in different CSV files. You can find more information about the original dataset at the following link: https://analyse.kmi.open.ac.uk/open_dataset.

We extracted a subset of the original dataset that focuses on student information. 25,819 records were collected referring to a specific student, course and semester. Each record is described by the following 20 attributes: code_module, code_presentation, gender, highest_education, imd_band, age_band, num_of_prev_attempts, studies_credits, disability, resource, homepage, forum, glossary, outcontent, subpage, url, outcollaborate, quiz, AvgScore, count.

Two target classes were considered, namely Fail and Pass, combining the original four classes (Fail and Withdrawn and Pass and Distinction, respectively). The final_result attribute contains the target values.

All features have been converted to numbers for automatic processing.

Below is the mapping used to convert categorical values to numeric:

code_module: 'AAA'=0, 'BBB'=1, 'CCC'=2, 'DDD'=3, 'EEE'=4, 'FFF'=5, 'GGG'=6

code_presentation: '2013B'=0, '2013J'=1, '2014B'=2, '2014J'=3

gender: 'F'=0, 'M'=1

highest_education: 'No_Formal_quals'=0, 'Post_Graduate_Qualification'=1, 'HE_Qualification'=2, 'Lower_Than_A_Level'=3, 'A_level_or_Equivalent'=4

IMBD_band: 'unknown'=0, 'between_0_and_10_percent'=1, 'between_10_and_20_percent'=2, 'between_20_and_30_percent'=3, 'between_30_and_40_percent'=4, 'between_40_and_50_percent'=5, 'between_50_and_60_percent'=6, 'between_60_and_70_percent'=7, 'between_70_and_80_percent'=8, 'between_80_and_90_percent'=9, 'between_90_and_100_percent'=10

age_band: 'between_0_and_35'=0, 'between_35_and_55'=1, 'higher_than_55'=2

disability: 'N'=0, 'Y'=1

student's outcome: 'Fail'=0, 'Pass'=1

For more detailed information, please refer to:

Casalino G., Castellano G., Vessio G. (2021) Exploiting Time in Adaptive Learning from Educational Data. In: Agrati L.S. et al. (eds) Bridges and Mediation in Higher Distance Education. HELMeTO 2020. Communications in Computer and Information Science, vol 1344. Springer, Cham. https://doi.org/10.1007/978-3-030-67435-9_1
Chicago Narcotics Crime Jan 2016 - Jul 2020
kaggle.com
Updated Aug 2, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anugerah Erlaut (2020). Chicago Narcotics Crime Jan 2016 - Jul 2020 [Dataset]. https://www.kaggle.com/aerlaut/chicago-narcotics-jan-2016-jul-2020
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 2, 2020
Dataset provided by
Kaggle
Authors
Anugerah Erlaut
License
https://www.usa.gov/government-works/https://www.usa.gov/government-works/
Area covered
Chicago
Description
Introduction

Chicago is one of America's most iconic cities. It has a colorful history, which rich histories such. Recently, Chicago was also a setting for one of Netflix's popular series : Ozark. The story has it that Chicago is the center for drug distribution for the Navarro cartel.

So, how true is the series? A quick search on the internet reveals a recently released DEA report on the. The report shows that drug crime exists in Chicago, although they are distributed by the Cartel de Jalisco Nueva Generacion, the Sinaloa Cartel and the Guerros Unidos, to name a few.

Content

The government of the City of Chicago has provided a publicly available crime database accessible via Google BigQuery. I have downloaded a subset of the data with crime_type narcotics and year > 2015. The data contains records between 1 Jan 2016 UTC until 23 Jul 2020 UTC.

The dataset contains these columns : - case_number : ID of the record - date : Date of incident - iucr : Category of the crime, per Illinois Unified Crime Reporting (IUCR) code. [more](https://data.cityofchicago.org/widgets/c7ck-438e) -description: More detailed description of the crime -location_description: Location of the crime -arrest: Whether an arrest was made -domestic: Was the crime domestic? -district: Which district code where the crime happened. [more](https://data.cityofchicago.org/Public-Safety/Boundaries-Police-Districts-current-/fthy-xz3r) -ward: The ward code where the crime happened. [more](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Wards-2015-/sp34-6z76) -community_area` : The community area code where the crime happened. more

Acknowledgements

The data is owned and kindly provided by the City of Chicago.

Inspiration

Some questions to get you started:

Is there a trend? Is the crime increasing? or decreasing?

Is there seasonality? Are dealers more like to be out and about in summer? Do they deal inside in winter?

Are some activities more like to happen at certain locations?

We tend to think that more deals happen at night, especially as people wind down, and the surroundings get dark. Does the data reflect that?

Are the incidents clustered to a certain district? Certain type of location?

Lastly, if you are : - a newly recruited analyst at the DEA / police, what would you recommend? - asked by el jefe del cartel (boss of the cartel) on how to expand operation / operate better, what would you say?

Happy wrangling!
a
Million Song Dataset Subset
academictorrents.com
bittorrent
Updated Oct 12, 2015
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thierry Bertin-Mahieux and Daniel P.W. Ellis and Brian Whitman and Paul Lamere (2015). Million Song Dataset Subset [Dataset]. https://academictorrents.com/details/e0b6b5ff012fcda7c4a14e4991d8848a6a2bf52b
Explore at:
bittorrent(1994614463)Available download formats
Dataset updated
Oct 12, 2015
Dataset authored and provided by
Thierry Bertin-Mahieux and Daniel P.W. Ellis and Brian Whitman and Paul Lamere
License
https://academictorrents.com/nolicensespecifiedhttps://academictorrents.com/nolicensespecified
Description
To let you get a feel for the dataset without committing to a full download, we also provide a subset consisting of 10,000 songs (1%, 1.8 gb) selected at random. It contains "additional files" (SQLite databases) in the same format as those for the full set, but referring only to the 10K song subset. Therefore, you can develop code on the subset, then port it to the full dataset. The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: To encourage research on algorithms that scale to commercial sizes To provide a reference dataset for evaluating research As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest s) To help new researchers get started in the MIR field The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. Note, howeve
f
Data from: A Subsampling Strategy for AIC-based Model Averaging with...
tandf.figshare.com
zip
Updated Nov 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jun Yu; HaiYing Wang; Mingyao Ai (2024). A Subsampling Strategy for AIC-based Model Averaging with Generalized Linear Models [Dataset]. http://doi.org/10.6084/m9.figshare.27089534.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27089534.v1
Dataset updated
Nov 13, 2024
Dataset provided by
Taylor & Francis
Authors
Jun Yu; HaiYing Wang; Mingyao Ai
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Subsampling is an effective approach to address computational challenges associated with massive datasets. However, existing subsampling methods do not consider model uncertainty. In this article, we investigate the subsampling technique for the Akaike information criterion (AIC) and extend the subsampling method to the smoothed AIC model-averaging framework in the context of generalized linear models. By correcting the asymptotic bias of the maximized subsample objective function used to approximate the Kullback–Leibler divergence, we derive the form of the AIC based on the subsample. We then provide a subsampling strategy for the smoothed AIC model-averaging estimator and study the corresponding asymptotic properties of the loss and the resulting estimator. A practically implementable algorithm is developed, and its performance is evaluated through numerical experiments on both real and simulated datasets.
d
NLCD 2021 Land Cover California Subset
catalog.data.gov
data.cnra.ca.gov
+6more
Updated Nov 27, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California Department of Fish and Wildlife (2024). NLCD 2021 Land Cover California Subset [Dataset]. https://catalog.data.gov/dataset/nlcd-2021-land-cover-california-subset
Explore at:
Dataset updated
Nov 27, 2024
Dataset provided by
California Department of Fish and Wildlife
Area covered
California
Description
The U.S. Geological Survey (USGS), in partnership with several federal agencies, has now developed and released seven National Land Cover Database (NLCD) products: NLCD 1992, 2001, 2006, 2011, 2016, 2019, and 2021. Beginning with the 2016 release, land cover products were created for two-to-three-year intervals between 2001 and the most recent year. These products provide spatially explicit and reliable information on the Nation’s land cover and land cover change. NLCD continues to provide innovative, consistent, and robust methodologies for production of a multi-temporal land cover and land cover change database. NLCD 2021 adds an additional year to the map products produced for NLCD 2019, with a streamlined compositing process for assembling and preprocessing Landsat imagery and geospatial ancillary datasets; a temporally, spectrally, and spatially integrated land cover change analysis strategy; a theme-based post-classification protocol for generating land cover and change products; a continuous fields biophysical parameters modeling method; and a scripted operational system. The overall accuracy of the 2019 Level I land cover was 91%. Results from this study confirm the robustness of this comprehensive and highly automated procedure for NLCD 2021 operational mapping (see https://doi.org/10.1080/15481603.2023.2181143 for the latest accuracy assessment publication). Questions about the NLCD 2021 land cover product can be directed to the NLCD 2021 land cover mapping team at USGS EROS, Sioux Falls, SD (605) 594-6151 or mrlc@usgs.gov. See included spatial metadata for more details.
d
PubMed Central Open Access Subset (PMC OA)
catalog.data.gov
data.virginia.gov
+2more
Updated Jun 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Library of Medicine (2025). PubMed Central Open Access Subset (PMC OA) [Dataset]. https://catalog.data.gov/dataset/pubmed-central-open-access-subset-pmc-oa
Explore at:
Dataset updated
Jun 19, 2025
Dataset provided by
National Library of Medicine
Description
Not all articles in PMC are available for text mining and other reuse, many have copyright protection, however articles in the PMC Open Access Subset are made available for download under a Creative Commons or similar license that generally allows more liberal redistribution and reuse than a traditional copyrighted work.

Facebook

Twitter

Click to copy link

Link copied

Cite

Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou (2024). Efficient Model-Free Subsampling Method for Massive Data [Dataset]. http://doi.org/10.6084/m9.figshare.24347102.v2

Data from: Efficient Model-Free Subsampling Method for Massive Data

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.24347102.v2

Dataset updated

Feb 14, 2024

Dataset provided by

Taylor & Francis

Authors

Zheng Zhou; Zebin Yang; Aijun Zhang; Yongdao Zhou

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, which means their performances can drop significantly when the underlying model is misspecified. Such an issue calls for model-free subsampling methods that are robust under diverse model specifications. Recently, several model-free subsampling methods have been developed. However, the computing time of these methods grows explosively with the sample size, making them impractical for handling massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into some regular data blocks and obtains subsamples from each data block by the data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly for datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 107. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.

Clear search

Close search

Google apps

Main menu

Data from: Efficient Model-Free Subsampling Method for Massive Data

Census of Population, 1880: Public Use Sample (1 in 1000 Preliminary...

NExtLong-512K-dataset-subset

STEAD subsample 4 CDiffSD

STEAD Subsample Dataset for CDiffSD Training

Overview

Dataset Files

Data

Usage

Requirements

Subset Generate Dataset

Subset Generate

Dior Subset Dataset

DIOR Subset

Keras video classification example with a subset of UCF101 - Action...

Film Circulation dataset

AORC Subset

Subset Dataset

Subset

Data from: Time for a rethink: time sub-sampling methods in...

NLM LitArch Open Access Subset

Brazil - XII Recenseamento Geral do Brasil. Censo Demográfico 2010 - IPUMS...

Sony-Total-Dark Dataset

Student oriented subset of the Open University Learning Analytics dataset

Chicago Narcotics Crime Jan 2016 - Jul 2020

Introduction

Content

Acknowledgements

Inspiration

Million Song Dataset Subset

Data from: A Subsampling Strategy for AIC-based Model Averaging with...

NLCD 2021 Land Cover California Subset

PubMed Central Open Access Subset (PMC OA)

Data from: Efficient Model-Free Subsampling Method for Massive Data