71 datasets found
  1. Source Code - Characterizing Variability and Uncertainty for Parameter...

    • catalog.data.gov
    • s.cnmilf.com
    Updated May 1, 2025
    + more versions
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). Source Code - Characterizing Variability and Uncertainty for Parameter Subset Selection in PBPK Models [Dataset]. https://catalog.data.gov/dataset/source-code-characterizing-variability-and-uncertainty-for-parameter-subset-selection-in-p
    Explore at:
    Dataset updated
    May 1, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Source Code for the manuscript "Characterizing Variability and Uncertainty for Parameter Subset Selection in PBPK Models" -- This R code generates the results presented in this manuscript; the zip folder contains PBPK model files (for chloroform and DCM) and corresponding scripts to compile the models, generate human equivalent doses, and run sensitivity analysis.

  2. Data release for solar-sensor angle analysis subset associated with the...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Data release for solar-sensor angle analysis subset associated with the journal article "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States" [Dataset]. https://catalog.data.gov/dataset/data-release-for-solar-sensor-angle-analysis-subset-associated-with-the-journal-article-so
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Western United States, United States
    Description

    This dataset provides geospatial location data and scripts used to analyze the relationship between MODIS-derived NDVI and solar and sensor angles in a pinyon-juniper ecosystem in Grand Canyon National Park. The data are provided in support of the following publication: "Solar and sensor geometry, not vegetation response, drive satellite NDVI phenology in widespread ecosystems of the western United States". The data and scripts allow users to replicate, test, or further explore the results.

    The file GrcaScpnModisCellCenters.csv contains locations (latitude-longitude) of all the 250-m MODIS (MOD09GQ) cell centers associated with the Grand Canyon pinyon-juniper ecosystem that the Southern Colorado Plateau Network (SCPN) is monitoring through its land surface phenology and integrated upland monitoring programs. The file SolarSensorAngles.csv contains MODIS angle measurements for the pixel at the phenocam location plus a random 100-point subset of pixels within the GRCA-PJ ecosystem. The script files (folder: 'Code') consist of 1) a Google Earth Engine (GEE) script used to download MODIS data through the GEE JavaScript interface, and 2) a script used to calculate derived variables and to test relationships between solar and sensor angles and NDVI using the statistical software package 'R'.

    The file Fig_8_NdviSolarSensor.JPG shows NDVI dependence on solar and sensor geometry, demonstrated for both a single pixel/year and for multiple pixels over time. (Left) MODIS NDVI versus solar-to-sensor angle for the Grand Canyon phenocam location in 2018, the year for which there is corresponding phenocam data. (Right) Modeled r-squared values by year for 100 randomly selected MODIS pixels in the SCPN-monitored Grand Canyon pinyon-juniper ecosystem. The model for forward-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle. The model for back-scatter MODIS-NDVI is log(NDVI) ~ solar-to-sensor angle + sensor zenith angle. Boxplots show interquartile ranges; whiskers extend to 10th and 90th percentiles. The horizontal line marking the average median value for forward-scatter r-squared (0.835) is nearly indistinguishable from the back-scatter line (0.833).

    The dataset folder also includes supplemental R-project and packrat files that allow the user to apply the workflow by opening a project that will use the same package versions used in this study (e.g., the folders Rproj.user and packrat, and the files .RData and PhenocamPR.Rproj). The empty folder GEE_DataAngles is included so that the user can save the data files from the Google Earth Engine scripts to this location, where they can then be incorporated into the R-processing scripts without needing to change folder names. To successfully use the packrat information to replicate the exact processing steps that were used, the user should refer to the packrat documentation available at https://cran.r-project.org/web/packages/packrat/index.html and at https://www.rdocumentation.org/packages/packrat/versions/0.5.0. Alternatively, the user may use the phenopix package documentation and the description/references provided in the associated journal article to process the data and achieve the same results using newer packages or other software programs.
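    The two models above translate directly into R. The following is a minimal sketch, assuming SolarSensorAngles.csv carries columns named ndvi, solar_to_sensor_angle, sensor_zenith_angle, and scatter; the actual column names are an assumption and should be checked against the file itself:

    # Minimal sketch of the two models described above; column names are
    # assumptions, not taken from the data release.
    angles <- read.csv("SolarSensorAngles.csv")

    # Forward-scatter model: log(NDVI) ~ solar-to-sensor angle
    fwd <- subset(angles, scatter == "forward")
    m_fwd <- lm(log(ndvi) ~ solar_to_sensor_angle, data = fwd)

    # Back-scatter model adds the sensor zenith angle
    bck <- subset(angles, scatter == "back")
    m_bck <- lm(log(ndvi) ~ solar_to_sensor_angle + sensor_zenith_angle, data = bck)

    # r-squared values comparable to those summarized in Fig_8_NdviSolarSensor.JPG
    summary(m_fwd)$r.squared
    summary(m_bck)$r.squared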

  3. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
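    As an illustration of how the long and wide tables relate, the sketch below collapses the long table to one row per film; the ID column name ("film_id") is an assumption, so consult the codebook for the actual variable name:

    # Illustrative only: recover one row per film from the long table and
    # compare with the wide table. Column names are assumptions.
    long <- read.csv("1_film-dataset_festival-program_long.csv")
    wide <- read.csv("1_film-dataset_festival-program_wide.csv")

    # Keep the first festival appearance per unique film, mirroring how the
    # wide table assigns the sample festival variable (fest).
    first_per_film <- long[!duplicated(long$film_id), ]

    nrow(first_per_film)  # should equal the number of unique films (n = 9,348)
    nrow(wide)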


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to one crew member of a given film. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records, first through an advanced search based on the movie title and year, and then, if no matches are found, through an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA (optimal string alignment) algorithm is used to match titles that may have typos or minor variations. A sketch of the two methods follows.
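    For readers unfamiliar with the two string-distance methods, the sketch below reproduces the idea with the stringdist package; the package choice, the example titles, and any thresholds are illustrative, not necessarily those used in the authors' scripts:

    library(stringdist)

    core_title <- "The Hours"                                   # hypothetical title
    candidates <- c("The Hours", "The Hour", "Hours, The", "De Uren")

    # Cosine similarity on character q-grams: high values flag near-identical titles
    sim_cos <- stringsim(core_title, candidates, method = "cosine", q = 2)

    # OSA (optimal string alignment) tolerates typos and transpositions
    sim_osa <- stringsim(core_title, candidates, method = "osa")

    data.frame(candidates, sim_cos, sim_osa)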

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match, i.e. perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It reports the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  4. NLM LitArch Open Access Subset - 7xxt-tws6 - Archive Repository

    • healthdata.gov
    application/rdfxml +5
    Updated Jul 11, 2025
    Cite
    (2025). NLM LitArch Open Access Subset - 7xxt-tws6 - Archive Repository [Dataset]. https://healthdata.gov/dataset/NLM-LitArch-Open-Access-Subset-7xxt-tws6-Archive-R/va2w-r76v
    Explore at:
    Available download formats: csv, application/rssxml, tsv, application/rdfxml, xml, json
    Dataset updated
    Jul 11, 2025
    Description

    This dataset tracks the updates made on the dataset "NLM LitArch Open Access Subset" as a repository for previous versions of the data and metadata.

  5. Appendix S1 - parallelMCMCcombine: An R Package for Bayesian Methods for Big...

    • plos.figshare.com
    doc
    Updated May 30, 2023
    Cite
    Alexey Miroshnikov; Erin M. Conlon (2023). Appendix S1 - parallelMCMCcombine: An R Package for Bayesian Methods for Big Data and Analytics [Dataset]. http://doi.org/10.1371/journal.pone.0108425.s001
    Explore at:
    Available download formats: doc
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alexey Miroshnikov; Erin M. Conlon
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Remarks on kernels and bandwidth selection for the semiparametric density product estimator method. (DOC)

  6. 2024-election-subreddit-threads-173k

    • huggingface.co
    Updated Nov 15, 2024
    Cite
    Binghamton University (2024). 2024-election-subreddit-threads-173k [Dataset]. https://huggingface.co/datasets/BinghamtonUniversity/2024-election-subreddit-threads-173k
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2024
    Dataset authored and provided by
    Binghamton University
    Description

    About

    This dataset contains threads from 23 political subreddits from July 2024 to November 2024 (about a week after the US election). Use this dataset as a baseline for subsets pertaining to Reddit's opinion on the 2024 election. We recommend using each thread's metadata as guidance, e.g. (a filtering sketch follows this list):

    • r/politics subset
    • controversial comments subset
    • highly upvoted posts subset
    • leftist/liberal threads subset
    • etc.
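    A hypothetical filtering sketch in R; the column names subreddit, controversiality, and score are assumptions, not the dataset's documented schema, so check the actual fields on the dataset page:

    # Hypothetical local export of the thread metadata
    threads <- read.csv("2024-election-subreddit-threads.csv")

    politics_subset      <- subset(threads, subreddit == "politics")
    controversial_subset <- subset(threads, controversiality > 0)
    highly_upvoted       <- subset(threads, score > quantile(score, 0.9))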

      Subreddits
    

    These are the subreddits scraped. Each conversation's… See the full description on the dataset page: https://huggingface.co/datasets/BinghamtonUniversity/2024-election-subreddit-threads-173k.

  7. fens

    • hub.arcgis.com
    Updated Aug 13, 2018
    Cite
    Minnesota Pollution Control Agency (2018). fens [Dataset]. https://hub.arcgis.com/datasets/fc9bb1f0c4fc41eab0764c34c967af5b
    Explore at:
    Dataset updated
    Aug 13, 2018
    Dataset authored and provided by
    Minnesota Pollution Control Agency
    Area covered
    Description

    These are the Outstanding Resource Value Waters (ORVW) as specified by MPCA rules Minn.R. 7050.0180 and as specifically listed in Minn.R. 7050.0470, effective March 17, 2008. The points are a subset from a database received from the MN DNR that represents calcareous fens as defined in Minnesota Rules, part 8420.0935, subpart 2, with the addition of two locations that were not contained in the original MN DNR dataset. The ORVW fens are a subset of all calcareous fens; only that subset is included in this dataset.

  8. Data from: Tiny ImageNet-R Dataset

    • paperswithcode.com
    Updated Aug 25, 2024
    Cite
    Sven Oehri; Nikolas Ebert; Ahmed Abdullah; Didier Stricker; Oliver Wasenmüller (2024). Tiny ImageNet-R Dataset [Dataset]. https://paperswithcode.com/dataset/tiny-imagenet-r
    Explore at:
    Dataset updated
    Aug 25, 2024
    Authors
    Sven Oehri; Nikolas Ebert; Ahmed Abdullah; Didier Stricker; Oliver Wasenmüller
    Description

    Tiny ImageNet-R is a subset of the ImageNet-R dataset by Hendrycks et al. ("The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization") with 10,456 images spanning 62 of the 200 classes of the Tiny ImageNet dataset. It is a test set built by collecting images of the classes shared by Tiny ImageNet and ImageNet. The resized images of size 64×64 contain art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and video game renditions of ImageNet classes. For further information on ImageNet-R visit the original GitHub repository of ImageNet-R.

  9. SDSS Galaxy Subset

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Sep 6, 2022
    Cite
    Nuno Ramos Carvalho; Nuno Ramos Carvalho (2022). SDSS Galaxy Subset [Dataset]. http://doi.org/10.5281/zenodo.7050898
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Sep 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nuno Ramos Carvalho; Nuno Ramos Carvalho
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. This dataset contains a subset of this survey: 100,077 objects classified as galaxies. It includes a CSV file with a collection of information and a set of files for each object, namely JPG image files, FITS files, and spectra data. This dataset is used to train and explore the astromlp-models collection of deep learning models for galaxy characterisation.

    The dataset includes a CSV data file where each row is an object from the SDSS database, with the following columns (note that some data may not be available for all objects):

    • objid: unique SDSS object identifier
    • mjd: MJD of observation
    • plate: plate identifier
    • tile: tile identifier
    • fiberid: fiber identifier
    • run: run number
    • rerun: rerun number
    • camcol: camera column
    • field: field number
    • ra: right ascension
    • dec: declination
    • class: spectroscopic class (only objects with class GALAXY are included)
    • subclass: spectroscopic subclass
    • modelMag_u: better of DeV/Exp magnitude fit for band u
    • modelMag_g: better of DeV/Exp magnitude fit for band g
    • modelMag_r: better of DeV/Exp magnitude fit for band r
    • modelMag_i: better of DeV/Exp magnitude fit for band i
    • modelMag_z: better of DeV/Exp magnitude fit for band z
    • redshift: final redshift from SDSS data z
    • stellarmass: stellar mass extracted from the eBOSS Firefly catalog
    • w1mag: WISE W1 "standard" aperture magnitude
    • w2mag: WISE W2 "standard" aperture magnitude
    • w3mag: WISE W3 "standard" aperture magnitude
    • w4mag: WISE W4 "standard" aperture magnitude
    • gz2c_f: Galaxy Zoo 2 classification from Willett et al 2013
    • gz2c_s: simplified version of Galaxy Zoo 2 classification (labels set)

    Besides the CSV file, a set of directories is included in the dataset. In each directory you'll find a list of files named after the objid column from the CSV file, with the corresponding data (a loading sketch in R follows the directory list). The following directory tree is available:

    sdss-gs/
    ├── data.csv
    ├── fits
    ├── img
    ├── spectra
    └── ssel

    Each directory contains:

    • img: RGB images from the object in JPEG format, 150x150 pixels, generated using the SkyServer DR16 API
    • fits: FITS data subsets around the object across the u, g, r, i, z bands; cut is done using the ImageCutter library
    • spectra: full best fit spectra data from SDSS between 4000 and 9000 wavelengths
    • ssel: best fit spectra data from SDSS for specific selected intervals of wavelengths discussed by Sánchez Almeida 2010
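    A minimal sketch for navigating this tree in R, assuming the archive has been unpacked locally; the file extensions and the use of the FITSio package are assumptions, not part of the dataset's documentation:

    library(FITSio)

    galaxies <- read.csv("sdss-gs/data.csv")

    # Files in each directory are named after the objid column
    oid <- galaxies$objid[1]
    fits_path <- file.path("sdss-gs", "fits", paste0(oid, ".fits"))  # extension assumed
    img_path  <- file.path("sdss-gs", "img",  paste0(oid, ".jpg"))   # extension assumed

    cutout <- readFITS(fits_path)  # u, g, r, i, z subsets around the object
    str(cutout$imDat)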

    Changelog

    • v0.0.4 - Increase number of objects to ~100k.
    • v0.0.3 - Increase number of objects to ~80k.
    • v0.0.2 - Increase number of objects to ~60k.
    • v0.0.1 - Initial import.
  10. MERRA-2 subset for evaluation of renewables with merra2ools R-package:...

    • datadryad.org
    zip
    Updated Mar 29, 2021
    Cite
    Oleg Lugovoy; Shuo Gao (2021). MERRA-2 subset for evaluation of renewables with merra2ools R-package: 1980-2020 hourly, 0.5° lat x 0.625° lon global grid [Dataset]. http://doi.org/10.5061/dryad.v41ns1rtt
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 29, 2021
    Dataset provided by
    Dryad
    Authors
    Oleg Lugovoy; Shuo Gao
    Time period covered
    2021
    Description

    The merra2ools dataset has been assembled through the following steps:

    The MERRA-2 collections tavg1_2d_flx_Nx (Surface Flux Diagnostics), tavg1_2d_rad_Nx (Radiation Diagnostics), and tavg1_2d_slv_Nx (Single-level atmospheric state variables) were downloaded from the NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) (https://disc.gsfc.nasa.gov/datasets?project=MERRA-2) using the GNU Wget network utility (https://disc.gsfc.nasa.gov/data-access). Each of the three collections consists of daily netCDF-4 files with 3-dimensional variables (lon x lat x hour); see the ncdf4 sketch below.
    The following variables were obtained from the netCDF-4 files and merged into long-term time series:

    • Northward (V) and Eastward (U) wind at 10 and 50 meters (V10M, V50M, U10M, U50M, respectively), and 10-meter air temperature (T10M) from the tavg1_2d_slv_Nx collection;
    • Incident shortwave land (SWGDN) and Surface albedo (ALBEDO) fro...
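    A sketch of the extraction step in R with the ncdf4 package; the file name below is illustrative (actual names follow the MERRA-2 collection conventions):

    library(ncdf4)

    nc <- nc_open("MERRA2_400.tavg1_2d_slv_Nx.20200101.nc4")  # hypothetical file name
    v10m <- ncvar_get(nc, "V10M")  # lon x lat x hour, as described above
    dim(v10m)
    nc_close(nc)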
    
  11. CELEX Dutch lexical database - Frequency Subset

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Oct 5, 2005
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2005). CELEX Dutch lexical database - Frequency Subset [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-L0029_07/
    Explore at:
    Dataset updated
    Oct 5, 2005
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The Dutch CELEX data is derived from R.H. Baayen, R. Piepenbrock & L. Gulikers, The CELEX Lexical Database (CD-ROM), Release 2, Dutch Version 3.1, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 1995. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For the Dutch data, frequencies have been disambiguated on the basis of the 42.4m Dutch Instituut voor Nederlandse Lexicologie text corpora. To make for greater compatibility with other operating systems, the databases have not been tailored to fit any particular database management program. Instead, the information is presented in a series of plain ASCII files, which can be queried with tools such as AWK and ICON. Unique identity numbers allow the linking of information from different files. This database can be divided into different subsets (a query sketch in R follows this list):

    • orthography: with or without diacritics, with or without word division positions, alternative spellings, number of letters/syllables;
    • phonology: phonetic transcriptions with syllable boundaries or primary and secondary stress markers, consonant-vowel patterns, number of phonemes/syllables, alternative pronunciations, frequency per phonetic syllable within words;
    • morphology: division into stems and affixes, flat or hierarchical representations, stems and their inflections;
    • syntax: word class, subcategorisations per word class;
    • frequency of the entries: disambiguated for homographic lemmata.
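    A minimal sketch of querying these plain ASCII files from R rather than AWK or ICON, assuming backslash-delimited fields with the unique identity number in column 1; the file names and column layout are assumptions, so adjust them to the actual subset you licensed:

    # Hypothetical CELEX file names; fields assumed backslash-delimited
    orth <- read.table("dol.cd", sep = "\\", quote = "", as.is = TRUE)
    freq <- read.table("dfl.cd", sep = "\\", quote = "", as.is = TRUE)

    names(orth)[1] <- "IdNum"   # assumed: identity number in column 1
    names(freq)[1] <- "IdNum"

    # Link information from different files via the unique identity numbers
    lemmas <- merge(orth, freq, by = "IdNum")
    head(lemmas)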

  12. Data from: [Dataset:] Data from Tree Censuses and Inventories in Panama

    • smithsonian.figshare.com
    zip
    Updated Apr 18, 2024
    Cite
    Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao (2024). [Dataset:] Data from Tree Censuses and Inventories in Panama [Dataset]. http://doi.org/10.5479/data.stri.2016.0622
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Smithsonian Tropical Research Institute
    Authors
    Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: These are results from a network of 65 tree census plots in Panama. At each, every individual stem in a rectangular area of specified size is given a unique number and identified to species, then stem diameter is measured in one or more censuses. Data from these numerous plots and inventories were collected following the same methods as, and species identity harmonized with, the 50-ha long-term tree census at Barro Colorado Island. The precise location of every site, elevation, and estimated rainfall (for many sites) are also included. These data were gathered over many years, starting in 1994 and continuing to the present, by principal investigators R. Condit, R. Perez, S. Lao, and S. Aguilar. Funding has been provided by many organizations.

    Description (a loading sketch in R follows this list):

    1. marenaRecent.full.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This and all other tables labelled 'full' have one record per individual tree found in that census. Detailed documentation of the 'full' tables is given in RoutputFull.pdf (component 10 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). This one file, 'marenaRecent.full1.rdata', has data from the latest census at 60 different plots. These are the best data to use if only a single plot census is needed.
    2. marena2cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for 44 plots with two censuses: 'marena2cns.full1.rdata' for the first census and 'marena2cns.full2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.full (component 1): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed.
    3. marena3cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for nine plots with three censuses: 'marena3cns.full1.rdata' for the first census through 'marena3cns.full3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.full (component 2): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed.
    4. marena4cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for six plots with four censuses: 'marena4cns.full1.rdata' for the first census through 'marena4cns.full4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.full (component 3): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed.
    5. marenaRecent.stem.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This one file, 'marenaRecent.stem1.rdata', has data from the latest census at 60 different plots. The table has one record per individual stem, necessary because some individual trees have more than one stem. Detailed documentation of the 'stem' tables is given in RoutputStem.pdf (component 11 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). These are the best data to use if only a single plot census is needed and individual stems are desired.
    6. marena2cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for 44 plots with two censuses: 'marena2cns.stem1.rdata' for the first census and 'marena2cns.stem2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.stem (component 5): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed and individual stems are desired.
    7. marena3cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for nine plots with three censuses: 'marena3cns.stem1.rdata' for the first census through 'marena3cns.stem3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.stem (component 6): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed and individual stems are desired.
    8. marena4cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for six plots with four censuses: 'marena4cns.stem1.rdata' for the first census through 'marena4cns.stem4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.stem (component 7): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed and individual stems are desired.
    9. bci.spptable.rdata: A list of the 1414 species found across all tree plots and inventories in Panama, in R format. The column 'sp' in this table is a code identifying the species in the full census tables (marena.full and marena.stem, components 1-4 and 5-8 above).
    10. RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (components 1-4 above).
    11. RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (components 5-8 above).
    12. PanamaPlot.txt: Locations of all tree plots and inventories in Panama.
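    A minimal sketch in R, assuming the archives above have been unpacked and that the loaded objects are data frames named after their files; use ls() after load() to confirm the actual object names:

    load("marenaRecent.full1.rdata")  # assumed object name: marenaRecent.full1
    load("bci.spptable.rdata")        # assumed object name: bci.spptable

    # Attach species names to census records via the 'sp' code (component 9)
    census <- merge(marenaRecent.full1, bci.spptable, by = "sp", all.x = TRUE)
    table(census$plot)  # records per location, using the additional 'plot' column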

  13. HUN AWRA-R calibration nodes v01

    • demo.dev.magda.io
    • cloud.csiss.gmu.edu
    • +2more
    zip
    Updated Dec 4, 2022
    + more versions
    Cite
    Bioregional Assessment Program (2022). HUN AWRA-R calibration nodes v01 [Dataset]. https://demo.dev.magda.io/dataset/ds-dga-997357fb-c08a-4cc2-b3f3-315922a4b59d
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 4, 2022
    Dataset provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. This dataset is a shapefile, a subset for the Hunter subregion containing geographical locations and other characteristics (see below) of streamflow gauging stations. There are 3 files that have been extracted from the Hydstra database to aid in identifying sites in the Hunter subregion and the type of data collected from each one. The 3 files are (see the R sketch after this description):

    • Site: lists all sites available in Hydstra from data providers. The data provider is listed in the #Station as _xxx. For example, sites in NSW are _77, QLD are _66. Some sites do not have locational information and will not be able to be plotted.
    • Period: lists all the variables that are recorded at each site and the period of record.
    • Variable: shows variable codes and names, which can be linked to the period table.

    Purpose: Locations are used as pour points in order to define reach areas for river system modelling.

    Dataset History: Subset of data for the Hunter subregion that was extracted from the Bureau of Meteorology's Hydstra system; includes all gauges where data has been received from the lead water agency of each jurisdiction. The gauges shapefile for all bioregions was intersected with the Hunter subregion boundary to identify and extract gauges within the subregion.

    Dataset Citation: Bioregional Assessment Programme (2016) HUN AWRA-R calibration nodes v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/f2da394a-3d08-4cf4-8c24-bf7751ea06a1.

    Dataset Ancestors:

    • Derived From Gippsland Project boundary
    • Derived From Bioregional Assessment areas v04
    • Derived From Natural Resource Management (NRM) Regions 2010
    • Derived From Bioregional Assessment areas v03
    • Derived From Victoria - Seamless Geology 2014
    • Derived From Bioregional Assessment areas v05
    • Derived From National Surface Water sites Hydstra
    • Derived From Bioregional Assessment areas v01
    • Derived From Bioregional Assessment areas v02
    • Derived From GEODATA TOPO 250K Series 3
    • Derived From NSW Catchment Management Authority Boundaries 20130917
    • Derived From Geological Provinces - Full Extent
    • Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
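    A sketch of linking the Period and Variable tables described above; the file and column names (including variable_code and station) are assumptions based on the description, not the actual Hydstra extract schema:

    period   <- read.csv("period.csv")    # variables recorded per site + period of record
    variable <- read.csv("variable.csv")  # variable codes and names

    # Variable codes link the two tables, as noted above
    period_named <- merge(period, variable, by = "variable_code")
    head(subset(period_named, grepl("_77$", station)))  # e.g., NSW sites tagged _77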

  14. Galilee groundwater data October 2016

    • researchdata.edu.au
    • data.gov.au
    Updated Dec 9, 2018
    + more versions
    Cite
    Bioregional Assessment Program (2018). Galilee groundwater data October 2016 [Dataset]. https://researchdata.edu.au/galilee-groundwater-october-2016/2989378
    Explore at:
    Dataset updated
    Dec 9, 2018
    Dataset provided by
    Data.gov (https://data.gov/)
    Authors
    Bioregional Assessment Program
    License

    Attribution 2.5 (CC BY 2.5), https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Area covered
    Galilee
    Description

    Abstract

    This dataset was derived by the Bioregional Assessment Programme. The parent datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this dataset are described in the History field in this metadata statement. This dataset is a subset of bore data for the Galilee subregion and was extracted from the source dataset shown in the lineage.

    Dataset History

    Data from the following source datasheets within the QLD groundwater database were used: Registrations, Casing, Aquifer, Stratigraphy, Pump test and water levels. Relevant data were extracted from these source datasheets and compiled in a new tab called "SWL Interp". Then the hydrostratigraphic interpretation from the original data compilation was incorporated into the "SWL Interp" datasheet in the column called BA "strat". These interpretations are tagged in the "strat interp source" column as "July14 SWL". For completeness the original interpretation is included in the datasheet called "July14SWL". The focus of these updated interpretations was bores that may be screened in aquifers in the Galilee Basin. Bores that were interpreted utilising the updated data are tagged in the "strat interp source" column as "Mar17 interp".

    Dataset Citation

    Bioregional Assessment Programme (2017) Galilee groundwater data October 2016. Bioregional Assessment Derived Dataset. Viewed 10 December 2018, http://data.bioregionalassessments.gov.au/dataset/27bc422e-f2c0-47b7-b2ff-9c63faf1de49.

    Dataset Ancestors

    • Derived From Gippsland Project boundary
    • Derived From Geological Provinces - Full Extent
    • Derived From Natural Resource Management (NRM) Regions 2010
    • Derived From Bioregional Assessment areas v03
    • Derived From Victoria - Seamless Geology 2014
    • Derived From Bioregional Assessment areas v05
    • Derived From Queensland groundwater data October 2016
    • Derived From Bioregional Assessment areas v01
    • Derived From Bioregional Assessment areas v02
    • Derived From GEODATA TOPO 250K Series 3
    • Derived From Bioregional Assessment areas v06
    • Derived From NSW Catchment Management Authority Boundaries 20130917
    • Derived From Bioregional Assessment areas v04
    • Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)

  15. Meta Kaggle Code

    • kaggle.com
    zip
    Updated Jul 10, 2025
    Cite
    Kaggle (2025). Meta Kaggle Code [Dataset]. https://www.kaggle.com/datasets/kaggle/meta-kaggle-code/code
    Explore at:
    Available download formats: zip (148301844275 bytes)
    Dataset updated
    Jul 10, 2025
    Dataset authored and provided by
    Kaggle (http://kaggle.com/)
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Explore our public notebook content!

    Meta Kaggle Code is an extension to our popular Meta Kaggle dataset. This extension contains all the raw source code from hundreds of thousands of public, Apache 2.0-licensed Python and R notebook versions on Kaggle, used to analyze Datasets, make submissions to Competitions, and more. This represents nearly a decade of data spanning a period of tremendous evolution in the ways ML work is done.

    Why we’re releasing this dataset

    By collecting all of this code created by Kaggle’s community in one dataset, we hope to make it easier for the world to research and share insights about trends in our industry. With the growing significance of AI-assisted development, we expect this data can also be used to fine-tune models for ML-specific code generation tasks.

    Meta Kaggle for Code is also a continuation of our commitment to open data and research. This new dataset is a companion to Meta Kaggle which we originally released in 2016. On top of Meta Kaggle, our community has shared nearly 1,000 public code examples. Research papers written using Meta Kaggle have examined how data scientists collaboratively solve problems, analyzed overfitting in machine learning competitions, compared discussions between Kaggle and Stack Overflow communities, and more.

    The best part is Meta Kaggle enriches Meta Kaggle for Code. By joining the datasets together, you can easily understand which competitions code was run against, the progression tier of the code’s author, how many votes a notebook had, what kinds of comments it received, and much, much more. We hope the new potential for uncovering deep insights into how ML code is written feels just as limitless to you as it does to us!

    Sensitive data

    While we have made an attempt to filter out notebooks containing potentially sensitive information published by Kaggle users, the dataset may still contain such information. Research, publications, applications, etc. relying on this data should only use or report on publicly available, non-sensitive information.

    Joining with Meta Kaggle

    The files contained here are a subset of the KernelVersions in Meta Kaggle. The file names match the ids in the KernelVersions csv file. Whereas Meta Kaggle contains data for all interactive and commit sessions, Meta Kaggle Code contains only data for commit sessions.

    File organization

    The files are organized into a two-level directory structure. Each top level folder contains up to 1 million files, e.g. - folder 123 contains all versions from 123,000,000 to 123,999,999. Each sub folder contains up to 1 thousand files, e.g. - 123/456 contains all versions from 123,456,000 to 123,456,999. In practice, each folder will have many fewer than 1 thousand files due to private and interactive sessions.
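    The id-to-path mapping described above is straightforward to compute; here is a sketch (the zero-padding of folder names and the file extension are assumptions, so check actual folder names before relying on this):

    kernel_version_path <- function(id, ext = ".py") {
      top <- id %/% 1e6           # folder 123 holds ids 123,000,000 to 123,999,999
      sub <- (id %/% 1e3) %% 1e3  # folder 123/456 holds ids 123,456,000 to 123,456,999
      file.path(top, sub, paste0(id, ext))  # extension/padding are assumptions
    }

    kernel_version_path(123456789)  # "123/456/123456789.py"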

    The ipynb files in this dataset hosted on Kaggle do not contain the output cells. If the outputs are required, the full set of ipynbs with the outputs embedded can be obtained from this public GCS bucket: kaggle-meta-kaggle-code-downloads. Note that this is a "requester pays" bucket. This means you will need a GCP account with billing enabled to download. Learn more here: https://cloud.google.com/storage/docs/requester-pays

    Questions / Comments

    We love feedback! Let us know in the Discussion tab.

    Happy Kaggling!

  16. Data from: OpCitance: Citation contexts identified from the PubMed Central...

    • databank.illinois.edu
    Updated Feb 15, 2023
    Cite
    Tzu-Kun Hsiao; Vetle Torvik (2023). OpCitance: Citation contexts identified from the PubMed Central open access articles [Dataset]. http://doi.org/10.13012/B2IDB-4353270_V1
    Explore at:
    Dataset updated
    Feb 15, 2023
    Authors
    Tzu-Kun Hsiao; Vetle Torvik
    Dataset funded by
    U.S. National Institutes of Health (NIH)
    Description

    Sentences and citation contexts identified from the PubMed Central open access articles

    The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019. The dataset is created as described in: Hsiao TK., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.

    Files:

    • A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
    • B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
    • C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
    • D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
    • E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
    • F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
    • G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
    • H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
    • I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
    • J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
    • K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
    • L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
    • M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
    • N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
    • O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
    • P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
    • P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
    • Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
    • R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
    • S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
    • T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
    • UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
    • W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
    • XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.

    Each row in a file is a sentence/citation context and contains the following columns (a reading sketch in R follows this description):

    • pmcid: PMCID of the article.
    • pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
    • location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
    • IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
    • sentence_id: The ID of the citation context/sentence in the article component.
    • total_sentences: The number of sentences in the article component.
    • intxt_id: The ID of the citation.
    • intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
    • intxt_pmid_source: The sources where the intxt_pmid can be identified. "xml" indicates that the PMID is only identified from the XML file; "xml,pmc" indicates that the PMID is not only from the XML file, but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
    • intxt_mark: The citation marker associated with the inline citation.
    • best_id: The best source link ID (e.g., PMID) of the citation.
    • best_source: The sources that confirm the best ID.
    • best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
    • citation: A citation context. If no citation is found in a sentence, the value is the sentence.
    • progression: Text progression of the citation context/sentence.

    Supplementary Files:

    • PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column. Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
      • pmcid: PMCID of the citing article.
      • pos: The citation's position in the reference list.
      • fromPMID: PMID of the citing article.
      • toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
      • SRC: The sources that confirm the toPMID.
      • MatchDB: The origin bibliographic database of the toPMID.
      • Probability: The match probability of the toPMID.
      • toPMID2: PMID of the citation (as tagged in the XML file).
      • SRC2: The sources that confirm the toPMID2.
      • intxt_id: The ID of the citation.
      • journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
      • same_ref_string: Whether the citation string appears in the reference list more than once.
      • DIFF: The comparison result between the toPMID column and the toPMID2 column.
      • bestID: The best source link ID (e.g., PMID) of the citation.
      • bestSRC: The sources that confirm the best ID.
      • Match: Matching result produced by Patci.
    • Supplementary_File_1.zip – This file contains the code for generating the dataset.

    [1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885
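    A minimal sketch for reading one journal file in R; the filter condition is an assumption (the codebook and the manuscript define exactly how non-citation sentences are flagged):

    rj <- read.delim("R_journal_IntxtCit.tsv", quote = "", stringsAsFactors = FALSE)

    # Assumed: sentences without an inline citation carry "-" in intxt_id
    contexts <- subset(rj, intxt_id != "-")
    nrow(contexts) / nrow(rj)  # share of sentences that are citation contexts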

  17. HUN AWRA-R simulation nodes v01

    • demo.dev.magda.io
    • researchdata.edu.au
    • +2more
    zip
    Updated Dec 4, 2022
    + more versions
    Cite
    Bioregional Assessment Program (2022). HUN AWRA-R simulation nodes v01 [Dataset]. https://demo.dev.magda.io/dataset/ds-dga-d9a4fd10-e099-48cb-b7ee-07d4000bb829
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 4, 2022
    Dataset provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The dataset was derived by the Bioregional Assessment Programme from multiple datasets. The source dataset is identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. The dataset consists of an excel spreadsheet and shapefile representing the locations of simulation nodes used in the AWRA-R model. Some of the nodes correspond to gauging station locations or dam locations, whereas other locations represent river confluences or catchment outlets which have no gauging. These are marked as "Dummy".

    Purpose: Locations are used as pour points in order to define reach areas for river system modelling.

    Dataset History: Subset of data for the Hunter that was extracted from the Bureau of Meteorology's Hydstra system; includes all gauges where data has been received from the lead water agency of each jurisdiction. Simulation nodes were added in locations in which the model will provide simulated streamflow. There are 3 files that have been extracted from the Hydstra database to aid in identifying sites in each bioregion and the type of data collected from each one. These data were used to determine the simulation node locations where model outputs were generated. The 3 files contained within the source dataset are:

    • Site: lists all sites available in Hydstra from data providers. The data provider is listed in the #Station as _xxx. For example, sites in NSW are _77, QLD are _66. Some sites do not have locational information and will not be able to be plotted.
    • Period: lists all the variables that are recorded at each site and the period of record.
    • Variable: shows variable codes and names, which can be linked to the period table.

    Relevant location information and other data were extracted to construct the spreadsheet and shapefile within this dataset.

    Dataset Citation: Bioregional Assessment Programme (XXXX) HUN AWRA-R simulation nodes v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/fda20928-d486-49d2-b362-e860c1918b06.

    Dataset Ancestors: Derived From National Surface Water sites Hydstra

  18. Multivariate Time Series Search

    • catalog.data.gov
    • data.wu.ac.at
    Updated Apr 11, 2025
    + more versions
    Cite
    Dashlink (2025). Multivariate Time Series Search [Dataset]. https://catalog.data.gov/dataset/multivariate-time-series-search
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Multivariate Time-Series (MTS) are ubiquitous, and are generated in areas as disparate as sensor recordings in aerospace systems, music and video streams, medical monitoring, and financial systems. Domain experts are often interested in searching for interesting multivariate patterns in these MTS databases, which can contain up to several gigabytes of data. Surprisingly, research on MTS search is very limited. Most existing work supports only queries of the same length as the data, or queries on a fixed set of variables. In this paper, we propose an efficient and flexible subsequence search framework for massive MTS databases that, for the first time, enables querying on any subset of variables with arbitrary time delays between them. We propose two provably correct algorithms to solve this problem: (1) an R-tree Based Search (RBS), which uses Minimum Bounding Rectangles (MBR) to organize the subsequences, and (2) a List Based Search (LBS) algorithm, which uses sorted lists for indexing. We demonstrate the performance of these algorithms using two large MTS databases from the aviation domain, each containing several million observations. Both tests show that our algorithms have very high prune rates (>95%), requiring actual disk access for less than 5% of the observations. To the best of our knowledge, this is the first flexible MTS search algorithm capable of subsequence search on any subset of variables. Moreover, MTS subsequence search has never been attempted on datasets of the size used in this paper.
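
    The MBR pruning at the core of RBS can be illustrated with a minimal one-variable sketch in R. This is not the authors' implementation (RBS organizes the MBRs in an R-tree, and LBS uses sorted lists); it only shows why a bounding rectangle justifies skipping disk access when a match requires every sample to lie within eps of the query:

    # Bounding interval of a series (the 1-D case of an MBR)
    mbr <- function(x) c(lo = min(x), hi = max(x))

    # Smallest possible pointwise difference between values drawn from two intervals
    interval_gap <- function(a, b) max(0, a[["lo"]] - b[["hi"]], b[["lo"]] - a[["hi"]])

    # TRUE means no subsequence inside the candidate's MBR can match: skip the disk read
    can_prune <- function(query, candidate, eps)
      interval_gap(mbr(query), mbr(candidate)) > eps

    q    <- sin(seq(0, 2 * pi, length.out = 50))
    cand <- q + 10                 # candidate far outside the query's value range
    can_prune(q, cand, eps = 0.5)  # TRUE -> pruned without a full distance scan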

  19. Indonesian Family Life Study, merged subset

    • laurabotzet.github.io
    Updated 2016
    Cite
    RAND Corporation (2016). Indonesian Family Life Study, merged subset [Dataset]. https://laurabotzet.github.io/birth_order_ifls/2_codebook.html
    Explore at:
    Dataset updated
    2016
    Authors
    RAND Corporation
    Time period covered
    2014 - 2015
    Area covered
    13 Indonesian provinces. The sample is representative of about 83% of the Indonesian population and contains over 30,000 individuals living in 13 of the 27 provinces in the country. See URL for more.
    Variables measured
    a1, a2, c1, c3, e1, e3, n2, n3, o1, o2, and 138 more
    Description

    Data from the IFLS, merged across waves, with most outcomes taken from wave 5. Includes birth order, family structure, Big Five personality, intelligence tests, and risk lotteries.

    Table of variables

    This table contains variable names, labels, and the number of missing values. See the complete codebook for more.

    [truncated]

    Note

    This dataset was automatically described using the codebook R package (version 0.8.2).
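
    A minimal sketch of producing such an automatic description with the codebook R package; the toy data frame and its label are invented, and the original codebook was built with version 0.8.2:

    library(codebook)

    df <- data.frame(a1 = c(3, 4, NA), c1 = c(1, 2, 2))
    attr(df$a1, "label") <- "Example item a1"  # variable labels feed the codebook

    codebook_table(df)  # variable names, labels, and missingness, as in the table above
    # codebook(df) would render the full codebook document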

  20. Data from: Marine turtle sightings, strandings and captures in French waters 1990-2003

    • obis.org
    • gbif.org
    zip
    Updated Apr 24, 2021
    Cite
    Duke University (2021). Marine turtle sightings, strandings and captures in French waters 1990-2003 [Dataset]. https://obis.org/dataset/fe0652c6-1899-49b5-a013-5aa94b45813f
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 24, 2021
    Dataset authored and provided by
    Duke University
    Time period covered
    1990 - 2003
    Area covered
    French waters
    Description

    Original provider: Matthew Witt, University of Exeter

    Dataset credits: Matthew Witt, University of Exeter

    Abstract: We present data spanning approximately 100 years regarding the spatial and temporal occurrence of marine turtle sightings and strandings in the northeast Atlantic from two public recording schemes and demonstrate potential signals of changing population status. Records of loggerhead (n = 317) and Kemp’s ridley (n = 44) turtles occurring on the European continental shelf were most prevalent during the autumn and winter, when waters were coolest. In contrast, endothermic leatherback turtles (n = 1,668) were most common during the summer. Analysis of the spatial distribution of hard-shell marine turtle sightings and strandings highlights a pattern of decreasing records with increasing latitude. The spatial distribution of sighting and stranding records indicates that arrival in waters of the European continental shelf is most likely driven by North Atlantic current systems. Future patterns of spatial-temporal distribution, gathered from the periphery of juvenile marine turtles’ habitat range, may allow for a broader assessment of the future impacts of global climate change on species range and population size.

    Purpose: We set out to determine the spatial and temporal trends for sightings, strandings and captures of hard-shell marine turtles in the northeast Atlantic from two recording schemes. One recording scheme (presented here) comprised marine turtle sightings, strandings and captures occurring in French waters, drawn from the annual sightings and strandings publications of Duguy and colleagues (Duguy 1990, 1992, 1993, 1994, 1995, 1996, 2004; Duguy et al. 1997a, b, 1999, 2000, 2001, 2002, 2003). Records presented in Duguy publications prior to 2001 contained only descriptive locations, with no geographic coordinates or error estimates; longitude and latitude positions for these events were estimated as the closest coastal point to the described location. Duguy publications from 2001 onwards were accompanied by maps displaying the approximate location of sighting and stranding events; these maps were digitized and georeferenced, and coordinate positions were determined for all appropriate records. The georeferenced hard-shell turtle (Lk and Cc) capture/sighting/stranding records from the papers of Duguy for France 1990-2003 (featured in Witt et al. 2007) include only records whose coordinates could be derived from their locational descriptions. The second recording scheme comprised records of sightings and strandings of marine turtles in the British Isles obtained from the TURTLE database operated by Marine Environmental Monitoring. Data from the TURTLE database were submitted to EurOBIS and can be viewed on OBIS-SEAMAP: Marine Turtles.
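
    The closest-coastal-point assignment described above can be sketched in R as follows; the coastline vertices and record location are invented, and this is not Witt et al.'s actual procedure:

    # Assign a descriptive-location record to the nearest coastline vertex.
    # Uses an equirectangular approximation, adequate over short distances.
    nearest_coastal_point <- function(rec_lon, rec_lat, coast) {
      dx <- (coast$lon - rec_lon) * cos(rec_lat * pi / 180)  # shrink lon by latitude
      dy <- coast$lat - rec_lat
      coast[which.min(dx^2 + dy^2), ]
    }

    coast <- data.frame(lon = c(-1.2, -1.1, -1.0), lat = c(46.2, 46.1, 46.0))
    nearest_coastal_point(-1.05, 46.12, coast)  # returns the closest coastline vertex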

    Supplemental information: Abstract is from Witt et al. 2007; data included in this dataset are a subset of the data presented in Witt et al. 2007.

    References:
    Duguy, R. 1990. Observations de tortues marines en 1990 (Manche et Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 7:1053–1057.
    Duguy, R. 1992. Observations de tortues marines en 1991 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:35–37.
    Duguy, R. 1993. Observations de tortues marines en 1992 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:129–131.
    Duguy, R. 1994. Observations de tortues marines en 1993 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:235–238.
    Duguy, R. 1995. Observations de tortues marines en 1994 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:403–406.
    Duguy, R. 1996. Observations de tortues marines en 1995 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:505–513.
    Duguy, R. 2004. Observations de tortues marines en 2003 (cotes Atlantiques). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9:361–366.
    Duguy, R., P. Moriniere and A. Meunier. 1997a. Observations de tortues marines en 1997. Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:761–779.
    Duguy, R., P. Moriniere and M.A. Spano. 1997b. Observations de tortues marines en 1996 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:625–632.
    Duguy, R., P. Moriniere and A. Meunier. 1999. Observations de tortues marines en 1998 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime:911–924.
    Duguy, R., P. Moriniere and A. Meunier. 2000. Observations de tortues marines en 1999. Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:1025–1034.
    Duguy, R., P. Moriniere and A. Meunier. 2001. Observations de tortues marines en 2000 (Atlantique et Manche). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9:17–25.
    Duguy, R., P. Moriniere and A. Meunier. 2002. Observations de tortues marines en 2001 (Atlantique et Manche). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9.
    Duguy, R., P. Moriniere and A. Meunier. 2003. Observations de tortues marines en 2002 (Atlantique et Manche). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9:265–273.
