This data set illustrates the consumption, imports, and exports of paper and paperboard across the globe. A value of -100 means that no data was available.
http://earthtrends.wri.org/searchable_db/index.php?step=countries&ccID%5B%5D=0&allcountries=checkbox&theme=9&variable_ID=571&action=select_years
http://earthtrends.wri.org/searchable_db/index.php?step=countries&ccID%5B%5D=0&allcountries=checkbox&theme=9&variable_ID=573&action=select_years
http://earthtrends.wri.org/searchable_db/index.php?step=countries&ccID%5B%5D=0&allcountries=checkbox&theme=9&variable_ID=569&action=select_years
http://earthtrends.wri.org/searchable_db/index.php?step=countries&ccID%5B%5D=0&allcountries=checkbox&theme=9&variable_ID=568&action=select_years
September 26, 2007
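When loading the data, the -100 sentinel can be converted to proper missing values. A minimal pandas sketch; the file name "paper_consumption.csv" is a placeholder, not part of the dataset:

import pandas as pd

df = pd.read_csv("paper_consumption.csv")
df = df.replace(-100, pd.NA)   # -100 means "no data available"
print(df.isna().sum())         # missing entries per column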
https://spdx.org/licenses/CC0-1.0.html
Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.
Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.
Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Technical Remarks: This data accompanies the paper "How much demand side flexibility do we need? - Analyzing where to exploit flexibility in industrial processes" [0]. The raw data on which this data set is based is the HIPE dataset [1], which can be found at https://www.energystatusdata.kit.edu/hipe.php. The accompanying publication gives an in-depth description of the data, how it was gathered, what types of machines were covered, etc. This data package contains the instances of the four test sets in [0]. These can be found in the subfolder "instances". The "PS_Nonuniform", "PS_Uniform", "PSG" and "OM" subfolders contain the 450 instances of each set, one instance per file. The file format is explained in the "file_format.{md, html, pdf}" files.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Version: 5
Authors: Carlota Balsa-Sánchez, Vanesa Loureiro
Date of data collection: 2023/09/05
General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: both files have the same information. Two different formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 5th version - Information updated: number of journals, URL, document types associated with a specific journal.
Version: 4
Authors: Carlota Balsa-Sánchez, Vanesa Loureiro
Date of data collection: 2022/12/15
General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: both files have the same information. Two different formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 4th version - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR), Scopus and Web of Science (WOS) Journal Master List.
Version: 3
Authors: Carlota Balsa-Sánchez, Vanesa Loureiro
Date of data collection: 2022/10/28
General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: both files have the same information. Two different formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 3rd version - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Journal Citation Reports (JCR) and/or Scimago Journal and Country Rank (SJR).
Erratum - Data articles in journals Version 3:
Botanical Studies -- ISSN 1999-3110 -- JCR (JIF) Q2
Data -- ISSN 2306-5729 -- JCR (JIF) n/a
Data in Brief -- ISSN 2352-3409 -- JCR (JIF) n/a
Version: 2
Author: Francisco Rubio, Universitat Politècnica de València.
Date of data collection: 2020/06/23
General description: Publishing datasets according to the FAIR principles can be achieved by publishing a data paper (or software paper) in data journals or in standard academic journals. The Excel and CSV files contain a list of academic journals that publish data papers and software papers. File list:
Relationship between files: both files have the same information. Two different formats are offered to improve reuse.
Type of version of the dataset: final processed version
Versions of the files: 2nd version - Information updated: number of journals, URL, document types associated with a specific journal, publisher normalization and simplification of document types - Information added: listed in the Directory of Open Access Journals (DOAJ), indexed in Web of Science (WOS) and quartile in Scimago Journal and Country Rank (SJR).
Total size: 32 KB
Version 1: Description
This dataset contains a list of journals that publish data articles, code, software articles and database articles.
The search strategy in DOAJ and Ulrichsweb was to search for the word "data" in the journal title. Acknowledgements: Xaquín Lores Torres for his invaluable help in preparing this dataset.
https://spdx.org/licenses/CC0-1.0.html
We used this dataset to assess the strength of isolation due to geographic and macroclimatic distance across island and mainland systems, comparing published measurements of phenotypic traits and neutral genetic diversity for populations of plants and animals worldwide. The dataset includes 112 studies of 108 species (72 animals and 36 plants) in 868 island populations and 760 mainland populations, with population-level taxonomic and biogeographic information, totalling 7438 records.
Methods
Description of methods used for collection/generation of data: We searched the ISI Web of Science in March 2017 for comparative studies that included data on phenotypic traits and/or neutral genetic diversity of populations on true islands and on mainland sites in any taxonomic group. Search terms were 'island' and ('mainland' or 'continental') and 'population*' and ('demograph*' or 'fitness' or 'survival' or 'growth' or 'reproduc*' or 'density' or 'abundance' or 'size' or 'genetic diversity' or 'genetic structure' or 'population genetics') and ('plant*' or 'tree*' or 'shrub*' or 'animal*' or 'bird*' or 'amphibian*' or 'mammal*' or 'reptile*' or 'lizard*' or 'snake*' or 'fish'), subsequently refined to the Web of Science categories 'Ecology' or 'Evolutionary Biology' or 'Zoology' or 'Genetics Heredity' or 'Biodiversity Conservation' or 'Marine Freshwater Biology' or 'Plant Sciences' or 'Geography Physical' or 'Ornithology' or 'Biochemistry Molecular Biology' or 'Multidisciplinary Sciences' or 'Environmental Sciences' or 'Fisheries' or 'Oceanography' or 'Biology' or 'Forestry' or 'Reproductive Biology' or 'Behavioral Sciences'. The search included the whole text, including abstract and title, although for older papers only abstracts and titles were searchable, depending on the journal. The search returned 1237 papers, which were distributed among coauthors for further scrutiny.
First paper filter
To be useful, papers had to meet the following criteria: Overall study design criteria: Include at least two separate islands and two mainland populations; Eliminate studies comparing populations on several islands where there were no clear mainland vs. island comparisons; Present primary research data (e.g., meta-analyses were discarded); Include a field study (e.g., experimental studies and ex situ populations were discarded); Can include data from sub-populations pooled within an island or within a mainland population (but not between islands or between mainland sites); Island criteria: Island populations situated on separate islands (papers where all information on island populations originated from a single island were discarded); Can include multiple populations recorded on the same island, if there is more than one island in the study; While we accepted the authors' judgement about island vs. mainland status, in 19 papers we made our own judgement based on the relative size of the island or its position relative to the mainland (e.g. Honshu Island of Japan, with an area of 227,960 km², was interpreted as mainland relative to islands of less than 91 km²); Include islands surrounded by sea water but not islands in a lake or big river; Include islands regardless of origin (continental shelf, volcanic); Taxonomic criteria: Include any taxonomic group; The paper must compare populations within a single species; Do not include marine species (including coastline organisms); Databases used to check species delimitation: Handbook of Birds of the World (www.hbw.com/); International Plant Names Index (https://www.ipni.org/); Plants of the World Online (https://powo.science.kew.org/); Handbook of the Mammals of the World; Global Biodiversity Information Facility (https://www.gbif.org/); Biogeographic criteria: Include all continents, as well as studies on multiple continents; Do not include papers regarding migratory species; Only include old / historical invasions to islands (>50 yrs); do not include recent invasions; Response criteria: Do not include studies which report community-level responses such as species richness; Include genetic diversity measures and/or individual and population-level phenotypic trait responses. The first paper filter resulted in 235 papers, which were randomly reassigned for a second round of filtering.
Second paper filter
In the second filter, we excluded papers that did not provide population geographic coordinates and population-level quantitative data, unless data were provided upon contacting the authors or could be obtained from figures using DataThief (Tummers 2006). We visually inspected maps plotted for each study separately, and we made minor adjustments to the GPS coordinates when the coordinates placed the focal population off the island or mainland. For this study, we included only responses measured at the individual level; therefore we removed papers referring to demographic performance and traits such as immunity, behaviour and diet that are heavily reliant on ecosystem context. We extracted data on population-level means for two broad categories of response: i) broad phenotypic measures, which included traits (size, weight and morphology of entire body or body parts), metabolism products, physiology, vital rates (growth, survival, reproduction) and mean age of sampled mature individuals; and ii) genetic diversity, which included heterozygosity, allelic richness, number of alleles per locus, etc. The final dataset includes 112 studies and 108 species.
Methods for processing the data: We made minor adjustments to the GPS location of some populations upon visual inspection in Google Maps of the correct overlay of the data point with the indicated island body or mainland. For each population we extracted four climate variables reflecting mean and variation in temperature and precipitation, available in CliMond V1.2 (Kriticos et al. 2012) at 10 minutes resolution: mean annual temperature (Bio1), annual precipitation (Bio12), temperature seasonality (CV) (Bio4) and precipitation seasonality (CV) (Bio15), using the "prcomp" function in the stats package in R. For populations where climate variables were not available on the global climate maps, mostly due to small island size not captured in CliMond, we extracted data from the geographically closest grid cell with available climate values, which was available within 3.5 km of the focal grid cell for all localities.
We normalised the four climate variables using the "normalizer" package in R (Vilela 2020), and we performed a Principal Component Analysis (PCA) using the psych package in R (Revelle 2018). We saved the loadings of the axes for further analyses. References:
Vilela, B. (2020). normalizer: Making data normal again. R package version 0.1.0.
Kriticos, D.J., Webber, B.L., Leriche, A., Ota, N., Macadam, I., Bathols, J., et al. (2012). CliMond: global high-resolution historical and future scenario climate surfaces for bioclimatic modelling. Methods Ecol. Evol., 3, 53-64.
Revelle, W. (2018). psych: Procedures for Personality and Psychological Research. Northwestern University, Evanston, Illinois, USA. https://CRAN.R-project.org/package=psych Version = 1.8.12.
Tummers, B. (2006). DataThief III. https://datathief.org/
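The normalisation and PCA steps above were done in R; for readers working in Python, the following is an approximate, illustrative equivalent using scikit-learn. The file name, column selection, and number of components are placeholders, and StandardScaler is not identical to the 'normalizer' transformation used in the original analysis:

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

climate = pd.read_csv("climate.csv")                 # placeholder file name
vars4 = climate[["Bio1", "Bio12", "Bio4", "Bio15"]]  # the four CliMond variables

scaled = StandardScaler().fit_transform(vars4)       # rough stand-in for 'normalizer'
pca = PCA(n_components=2).fit(scaled)                # number of axes is an assumption

loadings = pd.DataFrame(pca.components_.T, index=vars4.columns,
                        columns=["PC1", "PC2"])      # loadings saved for later analyses
print(loadings)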
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays, based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
raw = zarr.open("<path/to/sample.zarr>", mode='r', path="volumes/raw")          # image data
seg = zarr.open("<path/to/sample.zarr>", mode='r', path="volumes/gt_instances")  # segmentation
# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
raw = zarr.open(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.open(sys.argv[1], mode='r', path="volumes/gt_instances")
viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
viewer.add_labels(
gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py <path/to/sample.zarr>
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt, application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable
discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying GitHub repository.
All contributions are welcome!
https://creativecommons.org/publicdomain/zero/1.0/
Every year the CDC releases the country’s most detailed report on death in the United States under the National Vital Statistics Systems. This mortality dataset is a record of every death in the country for 2005 through 2015, including detailed information about causes of death and the demographic background of the deceased.
It's been said that "statistics are human beings with the tears wiped off." This is especially true with this dataset. Each death record represents somebody's loved one, often connected with a lifetime of memories and sometimes tragically too short.
Putting the sensitive nature of the topic aside, analyzing mortality data is essential to understanding the complex circumstances of death across the country. The US Government uses this data to determine life expectancy and understand how death in the U.S. differs from the rest of the world. Whether you’re looking for macro trends or analyzing unique circumstances, we challenge you to use this dataset to find your own answers to one of life’s great mysteries.
This dataset is a collection of CSV files, each containing one year's worth of data, paired JSON files containing the code mappings, plus an ICD-10 code set. The CSVs were reformatted from their original fixed-width file formats using information extracted from the CDC's PDF manuals using this script. Please note that this process may have introduced errors, as the text extracted from the PDFs is not a perfect match. If you have any questions or find errors in the preparation process, please leave a note in the forums. We hope to publish additional years of data using this method soon.
A more detailed overview of the data can be found here. You'll find that the fields are consistent within this time window, but some of the data codes change every few years. For example, the 113_cause_recode entry 069 only covers ICD codes (I10,I12) in 2005, but by 2015 it covers (I10,I12,I15). When I post data from years prior to 2005, expect some of the fields themselves to change as well.
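As a sketch of how the CSVs and JSON code mappings fit together: the file names and the JSON layout ({"column": {"code": "description"}}) below are assumptions for illustration; check the actual files in the dataset before use:

import json
import pandas as pd

deaths = pd.read_csv("2015_data.csv")          # placeholder file name
with open("2015_codes.json") as f:
    codes = json.load(f)

recode_map = codes["113_cause_recode"]         # assumed JSON structure
deaths["cause_113_desc"] = (deaths["113_cause_recode"]
                            .astype(str).str.zfill(3)   # zero-padded codes like "069"
                            .map(recode_map))
print(deaths["cause_113_desc"].value_counts().head())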
All data comes from the CDC's National Vital Statistics Systems, with the exception of the Icd10Code data, which is sourced from the World Health Organization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are working to develop a comprehensive dataset of surgical tools organised by speciality, with a hierarchical structure: speciality, pack, set and tool. We believe that this dataset can be useful for computer vision and deep learning research into surgical tool tracking, management, and surgical training and audit. We have therefore created an initial dataset of surgical tool (instrument and implant) images, captured under different lighting conditions and with different backgrounds. We captured RGB images of surgical tools using a DSLR camera and a webcam on site in a major hospital, under realistic conditions and with the surgical tools currently in use. Image backgrounds in our initial dataset were essentially flat colours, although different colour backgrounds were used. As we further develop our dataset, we will try to include much greater occlusion, illumination changes, and the presence of blood, tissue and smoke in the images, which would be more reflective of crowded, messy, real-world conditions.
Illumination sources included natural light (direct sunlight and shaded light) and LED, halogen and fluorescent lighting, which accurately reflects the illumination conditions within the hospital. Distances from the camera to the object ranged from 60 to 150 cm, and the average class size was 74 images. Captured images included individual objects as well as cluttered, clustered and occluded objects. Our initial focus was on Orthopaedics and General Surgery, two of the 14 surgical specialities. We selected these specialities because general surgery instruments are the most commonly used tools across all surgeries and provide instrument volume, while orthopaedics provides variety and complexity, given the wide range of procedures, instruments and implants used in orthopaedic surgery. We will add other specialities as we develop this dataset, to reflect the complexities inherent in each of the surgical specialities. This dataset was designed to offer a large variety of tools, arranged hierarchically to reflect how surgical tools are organised in real-world conditions.
If you do find our dataset useful, please cite our papers in your work:
Rodrigues, M., Mayo, M., and Patros, P. (2022). OctopusNet: Machine Learning for Intelligent Management of Surgical Tools. Published in "Smart Health", Volume 23, 2022. https://doi.org/10.1016/j.smhl.2021.100244
Rodrigues, M., Mayo, M., and Patros, P. (2021). Evaluation of Deep Learning Techniques on a Novel Hierarchical Surgical Tool Dataset. Accepted paper at The 2021 Australasian Joint Conference on Artificial Intelligence. 2021. To be published in the Lecture Notes in Computer Science series.
Rodrigues, M., Mayo, M., and Patros, P. (2021). Interpretable deep learning for surgical tool management. In M. Reyes, P. Henriques Abreu, J. Cardoso, M. Hajij, G. Zamzmi, P. Rahul, and L. Thakur (Eds.), Proc 4th International Workshop on Interpretability of Machine Intelligence in Medical Image Computing (iMIMIC 2021), LNCS 12929 (pp. 3-12). Cham: Springer.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MGD: Music Genre Dataset
Over recent years, the world has seen a dramatic change in the way people consume music, moving from physical records to streaming services. Since 2017, such services have become the main source of revenue within the global recorded music market. Therefore, this dataset is built using data from Spotify. It provides a weekly chart of the 200 most streamed songs for each country and territory in which Spotify is present, as well as an aggregated global chart.
Considering that countries behave differently when it comes to musical tastes, we use chart data from global and regional markets from January 2017 to December 2019, considering eight of the top 10 music markets according to IFPI: United States (1st), Japan (2nd), United Kingdom (3rd), Germany (4th), France (5th), Canada (8th), Australia (9th), and Brazil (10th).
We also provide information about the hit songs and artists present in the charts, such as all collaborating artists within a song (since the charts only provide the main ones) and their respective genres, which is the core of this work. MGD also provides data about musical collaboration, as we build collaboration networks based on artist partnerships in hit songs. Therefore, this dataset contains:
Genre Networks: Success-based genre collaboration networks
Genre Mapping: Genre mapping from Spotify genres to super-genres
Artist Networks: Success-based artist collaboration networks
Artists: Some artist data
Hit Songs: Hit Song data and features
Charts: Enhanced data from Spotify Weekly Top 200 Charts
This dataset was originally built for a conference paper at ISMIR 2020. If you make use of the dataset, please also cite the following paper:
Gabriel P. Oliveira, Mariana O. Silva, Danilo B. Seufitelli, Anisio Lacerda, and Mirella M. Moro. Detecting Collaboration Profiles in Success-based Music Genre Networks. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR 2020), 2020.
@inproceedings{ismir/OliveiraSSLM20,
  title = {Detecting Collaboration Profiles in Success-based Music Genre Networks},
  author = {Gabriel P. Oliveira and Mariana O. Silva and Danilo B. Seufitelli and Anisio Lacerda and Mirella M. Moro},
  booktitle = {21st International Society for Music Information Retrieval Conference},
  pages = {726--732},
  year = {2020}
}
We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." This repository provides a user-friendly R package for any researchers or practitioners to apply a topic-based segmentation model with unstructured texts (latent class regression with group variable selection) to their datasets. First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example code to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide a set of codes and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp reviews data: see files 3-a, 3-b, 4-a, 4-b. Note: due to the dataset terms of use by Yelp and the restriction of data size, we provide the link to download the same Yelp datasets (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide a set of codes and datasets to replicate the empirical study with professor ratings reviews data: see file 5. Please see more details in the description text and comments of each file.
[A guide on how to use the code to reproduce each study in the paper]
1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: This is the R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get the dendrograms of selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.
3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.
3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.
4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.
4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.
[Guidelines for running benchmark models in Table 6]
Unsupervised topic model: 'topicmodels' package in R -- after determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), which can be used as predictors. Then conduct prediction with regression.
Hierarchical topic model (HDP): 'gensimr' R package -- 'model_hdp' function for identifying topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).
Supervised topic model: 'lda' R package -- 'slda.em' function for training and 'slda.predict' for prediction.
Aggregate regression: 'lm' default function in R.
Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable per segment.
Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in Kim, Fong and DeSarbo (2012)'s package. Run the Kim et al. (2012) model with a certain number of segments (e.g., 3 segments in this study). Then, with estimated coefficients and memberships, conduct prediction of the dependent variable per segment. The same R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home
5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings reviews study. Computing time is approximately 10 hours.
[A list of the versions of R, packages, and computer...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and produces results that can be hard to reproduce. To understand the good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies of the requirements.txt
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks in this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it blank
export JUP_WITH_EXECUTION="1"; # execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout for the notebook extraction
# Frequency of log reports
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts blank, but that is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, one of each per Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
conda create -n raw37 python=3.7 -y
conda activate raw37
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Self-citation analysis data based on PubMed Central subset (2002-2005)
----------------------------------------------------------------------
Created by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik on April 5th, 2018

## Introduction

This is a dataset created as part of the publication titled: Mishra S, Fegley BD, Diesner J, Torvik VI (2018) Self-Citation is the Hallmark of Productive Authors, of Any Gender. PLOS ONE. It contains files for running the self-citation analysis on articles published in PubMed Central between 2002 and 2005, collected in 2015. The dataset is distributed in the form of the following tab-separated text files:

* Training_data_2002_2005_pmc_pair_First.txt (1.2G) - Data for first authors
* Training_data_2002_2005_pmc_pair_Last.txt (1.2G) - Data for last authors
* Training_data_2002_2005_pmc_pair_Middle_2nd.txt (964M) - Data for middle 2nd authors
* Training_data_2002_2005_pmc_pair_txt.header.txt - Header for the data
* COLUMNS_DESC.txt file - Descriptions of all columns
* model_text_files.tar.gz - Text files containing model coefficients and scores for model selection.
* results_all_model.tar.gz - Model coefficient and result files in numpy format used for plotting purposes. v4.reviewer contains models for analysis done after reviewer comments.
* README.txt file

## Dataset creation

Our experiments relied on data from multiple sources, including proprietary data from Thomson Reuters' (now Clarivate Analytics) Web of Science collection of MEDLINE citations. Authors interested in reproducing our experiments should personally request this data from Clarivate Analytics. However, we do make available a similar but open dataset based on citations from PubMed Central, which can be utilized to get results similar to those reported in our analysis. Furthermore, we have also freely shared our datasets, which can be used along with the citation datasets from Clarivate Analytics to re-create the dataset used in our experiments. These datasets are listed below. If you wish to use any of these datasets, please make sure you cite both the dataset and the paper introducing it.

* MEDLINE 2015 baseline: https://www.nlm.nih.gov/bsd/licensee/2015_stats/baseline_doc.html
* Citation data from PubMed Central (the original paper includes additional citations from Web of Science)
* Author-ity 2009 dataset:
  - Dataset citation: Torvik, Vetle I.; Smalheiser, Neil R. (2018): Author-ity 2009 - PubMed author name disambiguated dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4222651_V1
  - Paper citation: Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3), 1-29. https://doi.org/10.1145/1552303.1552304
  - Paper citation: Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2004). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology, 56(2), 140-158. https://doi.org/10.1002/asi.20105
* Genni 2.0 + Ethnea for identifying author gender and ethnicity:
  - Dataset citation: Torvik, Vetle (2018): Genni + Ethnea for the Author-ity 2009 dataset. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-9087546_V1
  - Paper citation: Smith, B. N., Singh, M., & Torvik, V. I. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '13. ACM Press. https://doi.org/10.1145/2467696.2467720
  - Paper citation: Torvik VI, Agarwal S. Ethnea -- an instance-based ethnicity classifier based on geo-coded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington DC, USA. http://hdl.handle.net/2142/88927
* MapAffil for identifying article country of affiliation:
  - Dataset citation: Torvik, Vetle I. (2018): MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4354331_V1
  - Paper citation: Torvik VI. MapAffil: A Bibliographic Tool for Mapping Author Affiliation Strings to Cities and Their Geocodes Worldwide. D-Lib Magazine: The Magazine of the Digital Library Forum. 2015;21(11-12):10.1045/november2015-torvik
* IMPLICIT journal similarity:
  - Dataset citation: Torvik, Vetle (2018): Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-4742014_V1
* Novelty dataset for identifying article-level novelty:
  - Dataset citation: Mishra, Shubhanshu; Torvik, Vetle I. (2018): Conceptual novelty scores for PubMed articles. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-5060298_V1
  - Paper citation: Mishra S, Torvik VI. Quantifying Conceptual Novelty in the Biomedical Literature. D-Lib Magazine: The Magazine of the Digital Library Forum. 2016;22(9-10):10.1045/september2016-mishra
  - Code: https://github.com/napsternxg/Novelty
* Expertise dataset for identifying author expertise on articles:
* Source code provided at: https://github.com/napsternxg/PubMed_SelfCitationAnalysis

Note: The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016. Check here for information on getting PubMed/MEDLINE, and NLM's data Terms and Conditions. Additional data related updates can be found at the Torvik Research Group.

## Acknowledgments

This work was made possible in part with funding to VIT from NIH grant P01AG039347 and NSF grant 1348742. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

## License

Self-citation analysis data based on PubMed Central subset (2002-2005) by Shubhanshu Mishra, Brent D. Fegley, Jana Diesner, and Vetle Torvik is licensed under a Creative Commons Attribution 4.0 International License. Permissions beyond the scope of this license may be available at https://github.com/napsternxg/PubMed_SelfCitationAnalysis.
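To explore the training files listed above, they can be loaded with pandas. A minimal sketch, assuming the header file holds a single tab-separated line of column names:

import pandas as pd

with open("Training_data_2002_2005_pmc_pair_txt.header.txt") as f:
    columns = f.read().strip().split("\t")   # assumption: one tab-separated header line

first_authors = pd.read_csv(
    "Training_data_2002_2005_pmc_pair_First.txt",
    sep="\t",
    names=columns,
    nrows=100_000,   # the full file is ~1.2 GB; sample while exploring
)
print(first_authors.shape)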
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The OSDG Community Dataset (OSDG-CD) is a public dataset of thousands of text excerpts, which were validated by over 1,400 OSDG Community Platform (OSDG-CP) citizen scientists from over 140 countries, with respect to the Sustainable Development Goals (SDGs).
Dataset Information
In support of the global effort to achieve the Sustainable Development Goals (SDGs), OSDG is realising a series of SDG-labelled text datasets. The OSDG Community Dataset (OSDG-CD) is the direct result of the work of more than 1,400 volunteers from over 130 countries who have contributed to our understanding of SDGs via the OSDG Community Platform (OSDG-CP). The dataset contains tens of thousands of text excerpts (henceforth: texts) which were validated by the Community volunteers with respect to SDGs. The data can be used to derive insights into the nature of SDGs using either ontology-based or machine learning approaches.
📘 The file contains 43,021 (+390) text excerpts and a total of 310,328 (+3,733) assigned labels.
To learn more about the project, please visit the OSDG website and the official GitHub page. Explore a detailed overview of the OSDG methodology in our recent paper "OSDG 2.0: a multilingual tool for classifying text data by UN Sustainable Development Goals (SDGs)".
Source Data
The dataset consists of paragraph-length text excerpts derived from publicly available documents, including reports, policy documents and publication abstracts. A significant number of documents (more than 3,000) originate from UN-related sources such as SDG-Pathfinder and SDG Library. These sources often contain documents that already have SDG labels associated with them. Each text comprises 3 to 6 sentences and is about 90 words long on average.
Methodology
All the texts are evaluated by volunteers on the OSDG-CP. The platform is an ambitious attempt to bring together researchers, subject-matter experts and SDG advocates from all around the world to create a large and accurate source of textual information on the SDGs. The Community volunteers use the platform to participate in labelling exercises where they validate each text's relevance to SDGs based on their background knowledge.
In each exercise, the volunteer is shown a text together with an SDG label associated with it (this usually comes from the source) and is asked to either accept or reject the suggested label.
There are 3 types of exercises:
All volunteers start with the mandatory introductory exercise that consists of 10 pre-selected texts. Each volunteer must complete this exercise before they can access 2 other exercise types. Upon completion, the volunteer reviews the exercise by comparing their answers with the answers of the rest of the Community using aggregated statistics we provide, i.e., the share of those who accepted and rejected the suggested SDG label for each of the 10 texts. This helps the volunteer to get a feel for the platform.
SDG-specific exercises where the volunteer validates texts with respect to a single SDG, e.g., SDG 1 No Poverty.
All SDGs exercise where the volunteer validates a random sequence of texts where each text can have any SDG as its associated label.
After finishing the introductory exercise, the volunteer is free to select either SDG-specific or All SDGs exercises. Each exercise, regardless of its type, consists of 100 texts. Once the exercise is finished, the volunteer can either label more texts or exit the platform. Of course, the volunteer can finish the exercise early; all progress is still saved and recorded.
To ensure quality, each text is validated by up to 9 different volunteers and all texts included in the public release of the data have been validated by at least 3 different volunteers.
It is worth keeping in mind that all exercises present the volunteers with a binary decision problem, i.e., either accept or reject a suggested label. The volunteers are never asked to select one or more SDGs that a certain text might relate to. The rationale behind this set-up is that asking a volunteer to select from 17 SDGs is extremely inefficient. Currently, all texts are validated against only one associated SDG label.
Column Description
doi - Digital Object Identifier of the original document
text_id - unique text identifier
text - text excerpt from the document
sdg - the SDG the text is validated against
labels_negative - the number of volunteers who rejected the suggested SDG label
labels_positive - the number of volunteers who accepted the suggested SDG label
agreement - agreement score based on the formula $agreement = \frac{|labels_{positive} - labels_{negative}|}{labels_{positive} + labels_{negative}}$ (a small pandas sketch follows this list)
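For illustration, the agreement score can be recomputed directly from the label counts. A minimal pandas sketch, assuming the public release as a tab-separated file (the file name is a placeholder); the final filter mirrors the at-least-3-volunteers rule described above:

import pandas as pd

df = pd.read_csv("osdg_community_dataset.csv", sep="\t")   # placeholder file name
n = df["labels_positive"] + df["labels_negative"]
df["agreement_check"] = (df["labels_positive"] - df["labels_negative"]).abs() / n
df = df[n >= 3]   # texts validated by at least 3 volunteers
print(df[["sdg", "agreement", "agreement_check"]].head())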
Further Information
Do not hesitate to share with us your outputs, be it a research paper, a machine learning model, a blog post, or just an interesting observation. All queries can be directed to community@osdg.ai.
Various instruments are used to create images of the Earth and other objects in the universe in a diverse set of wavelength bands with the aim of understanding natural phenomena. Sometimes these instruments are built in a phased approach, with additional measurement capabilities added in later phases. In other cases, technology may mature to the point that the instrument offers new measurement capabilities that were not planned in the original design of the instrument. In still other cases, high resolution spectral measurements may be too costly to perform on a large sample and therefore lower resolution spectral instruments are used to take the majority of measurements. Many applied science questions that are relevant to the earth science remote sensing community require analysis of enormous amounts of data that were generated by instruments with disparate measurement capabilities. This paper addresses this problem using Virtual Sensors: a method that uses models trained on spectrally rich (high spectral resolution) data to "fill in" unmeasured spectral channels in spectrally poor (low spectral resolution) data. The models we use in this paper are Multi-Layer Perceptrons (MLPs), Support Vector Machines (SVMs) with Radial Basis Function (RBF) kernels and SVMs with Mixture Density Mercer Kernels (MDMK). We demonstrate this method by using models trained on the high spectral resolution Terra MODIS instrument to estimate what the equivalent of the MODIS 1.6 micron channel would be for the NOAA AVHRR/2 instrument. The scientific motivation for the simulation of the 1.6 micron channel is to improve the ability of the AVHRR/2 sensor to detect clouds over snow and ice.
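A conceptual sketch of the Virtual Sensors idea, using a generic scikit-learn MLP on synthetic arrays rather than the authors' models or data: train on pixels where the spectrally rich instrument provides both the shared channels and the extra channel, then predict that channel for the spectrally poor instrument:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
shared = rng.random((1000, 4))                              # stand-in for channels both instruments measure
target = shared @ rng.random(4) + 0.05 * rng.random(1000)   # stand-in for the 1.6 micron channel

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(shared, target)                                   # train on the rich instrument's pixels

avhrr_shared = rng.random((10, 4))                          # poor instrument's shared channels
virtual_16um = model.predict(avhrr_shared)                  # fill in the unmeasured channel
print(virtual_16um)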
This open dataset is used in the scientific study "Coordination of operational planning and real-time optimization in microgrids", currently under submission to the Power Systems Computation Conference (PSCC) 2020.
The paper abstract is:
Hierarchical microgrid control levels range from distributed device level controllers that run at a high frequency to centralized controllers optimizing market integration that run much less frequently. Centralized controllers are often subdivided in operational planning controllers that optimize decisions over a time horizon of one or several days, and real-time optimization controllers that deal with actions in the current market period. The coordination of these levels is of paramount importance. In this paper we propose a value function based approach as a way to propagate information from operational planning to real-time optimization. We apply this method to an environment where operational planning, using day-ahead forecasts, optimizes at a market period resolution the decisions to minimize the total energy cost and revenues, the peak consumption and injection related costs, and plans for reserve requirements, while real-time optimization copes with the forecast errors and yields implementable actions based on real-time measurements. The approach is compared to a rule-based controller on three use cases, and its sensitivity to forecast error is assessed.
The dataset is composed of:
The weather-based forecasts are multi-output, with a horizon of 24 hours ahead and a resolution of 15 minutes. They are produced every quarter-hour on a rolling basis, with a learning set of one week. Every six hours the model is refreshed and the learning set is moved accordingly. This means that each quarter-hour a PV and a consumption forecast is produced, composed of 96 values (one per quarter-hour of the 24 hours ahead); the rolling scheme is sketched below.
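A minimal sketch of that rolling scheme, assuming a quarter-hourly pandas Series and placeholder fit_model/forecast callables (neither is part of the dataset):

import pandas as pd

HORIZON = 96                        # 24 hours ahead at 15-minute resolution
LEARNING_SET = pd.Timedelta("7D")   # one-week learning set
REFRESH = pd.Timedelta("6h")        # model refresh interval

def rolling_forecasts(series, fit_model, forecast):
    # series: quarter-hourly pd.Series with a DatetimeIndex
    # fit_model(window) -> model; forecast(model, t, steps) -> 96 values
    model, last_fit = None, None
    for t in series.index:
        if last_fit is None or t - last_fit >= REFRESH:
            model, last_fit = fit_model(series[t - LEARNING_SET:t]), t
        yield t, forecast(model, t, HORIZON)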
More information about the MiRIS microgrid located at the John Cockerill Group’s international headquarters in Seraing, Belgium, is available at https://johncockerill.com/fr/energy/stockage-denergie/.
We would like to thank John Cockerill and Nethys for their financial support, and Xavier Fettweis of the Laboratory of Climatology of ULiège who produced the weather forecasts based on the MAR regional climate model.
You can freely use the data to reproduce the numerical results of the study or to produce better weather-based forecasts.
Two "classic" deterministic techniques are implemented, a Recurrent Neural Network(RNN) with the keras python library and a Gradient Boosting Regression (GBR) with the scikit-learn python library.
The RNN is a LSTM with one hidden layer, 5000 epochs, RELU as activation function (hidden layer and output), a batch size of 200 and drop out rate of 0.4.
The GBR is the multi output GBR of sklearn with 200 estimator and 20 as max depth.
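A minimal sketch of how these two models could be constructed with the stated hyperparameters (the LSTM hidden-layer size, input window length, and feature count are not specified above and are assumptions here):

```python
# Sketch of the two forecasters with the hyperparameters stated above.
# The LSTM hidden size (64), input window (96 steps) and feature count (12)
# are NOT given in the description; they are assumptions for illustration.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

n_steps, n_features, horizon = 96, 12, 96   # horizon: 96 quarter-hours = 24 h

# LSTM with one hidden layer, ReLU activations and a dropout rate of 0.4.
rnn = Sequential([
    LSTM(64, activation="relu", input_shape=(n_steps, n_features)),
    Dropout(0.4),
    Dense(horizon, activation="relu"),
])
rnn.compile(optimizer="adam", loss="mse")
# rnn.fit(X_seq, y, epochs=5000, batch_size=200)   # settings from the text

# Multi-output GBR: sklearn's GradientBoostingRegressor is single-output,
# so it is wrapped in MultiOutputRegressor (one model per output).
gbr = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=200, max_depth=20))
# gbr.fit(X_flat, y)
```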
The weather forecasts are made at the MiRIS microgrid location.
The file is composed of forecasts of several weather variables:
- CD = low clouds (0 to 1)
- CM = medium clouds (0 to 1)
- CU = high clouds (0 to 1)
- PREC = precipitation (mm / 15 min)
- RH2m = relative humidity (%)
- SNOW = snow height (mm)
- ST = surface temperature (°C)
- SWD = Global Horizontal Irradiance (W/m2)
- SWDtop = Total Solar Irradiance at the top of the atmosphere (W/m2)
- TT2M = temperature 2 meters above the ground (°C)
- WS100m = wind speed at 100 m above the ground (m/s)
- WS10m = wind speed at 10 m above the ground (m/s)
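A hypothetical way to load and sanity-check such a file with pandas (the file name, delimiter, and index column below are assumptions, not taken from the dataset description):

```python
# Hypothetical loading example for the weather-forecast file.
import pandas as pd

weather = pd.read_csv("weather_forecasts.csv", index_col=0, parse_dates=True)

# Sanity check: cloud-cover fractions are documented as lying in [0, 1].
assert weather[["CD", "CM", "CU"]].min().min() >= 0.0
assert weather[["CD", "CM", "CU"]].max().max() <= 1.0

print(weather[["SWD", "TT2M", "WS10m"]].describe())
```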
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset for the DATA 2022 paper "Dataset: An Indoor Smart Traffic Dataset and Data Collection System".
This archive contains a traffic light dataset that can be used for traffic light detection/classification. The dataset is collected from an indoor smart traffic testbed. In this testbed, we use fences to simulate the road's boundaries and use movable toy traffic signs and traffic lights to simulate those in real-world traffic scenes. An F1TENTH vehicle drives along the fence autonomously. Two cameras are mounted on both sides of the vehicle, which capture images of traffic lights and traffic signs on both sides of the track.
This dataset contains 3507 images captured by the F1TENTH vehicle. Each image comes with ground-truth bounding boxes that enclose the traffic lights and a label indicating the current state of the traffic light: 0 for a green light and 1 for a red light.
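A hypothetical loader for such annotations (the real file layout is described in the paper; a CSV with one row per bounding box and columns image, x1, y1, x2, y2, label is assumed here purely for illustration):

```python
# Hypothetical annotation loader; the column names are assumptions.
import csv

LABELS = {0: "green", 1: "red"}

with open("annotations.csv", newline="") as f:
    for row in csv.DictReader(f):
        x1, y1, x2, y2 = (int(row[k]) for k in ("x1", "y1", "x2", "y2"))
        state = LABELS[int(row["label"])]   # 0 = green light, 1 = red light
        print(f"{row['image']}: {state} light in box ({x1},{y1})-({x2},{y2})")
```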
Please cite our paper if you use this dataset.
http://opensource.org/licenses/BSD-2-Clause
Python code (for Python 3.9 & Pandas 1.3.2) to generate the results used in "Compromised through Compression: Privacy Implications of Smart Meter Traffic Analysis".
Smart metering comes with risks to privacy. One concern is the possibility of an attacker seeing the traffic that reports the energy use of a household and deriving private information from it. Encryption helps to mask the actual energy measurements, but is not sufficient to cover all risks. One aspect that has so far gone unexplored, and where encryption does not help, is traffic analysis, i.e. whether the length of messages communicating energy measurements can leak privacy-sensitive information to an observer. In this paper we examine whether using encodings or compression for smart metering data could leak information about household energy use. Our analysis is based on the real-world energy use data of ±80 Dutch households.
We find that traffic analysis could reveal information about the energy use of individual households if compression is used. As a result, when messages are sent daily, an attacker performing traffic analysis would be able to determine when all the members of a household are away or not using electricity for an entire day. We demonstrate this issue by recognizing when households from our dataset were on holiday. If messages are sent more often, more granular living patterns could likely be determined.
We propose a method of encoding the data that is nearly as effective as compression at reducing message size, but does not leak the information that compression leaks. By not requiring compression to achieve the best possible data savings, the risk of traffic analysis is eliminated.
This code operates on the relative energy measurements from the "Zonnedael dataset" from Liander N.V. This dataset needs to be obtained separately; see the instructions accompanying the code. The code transforms the dataset into absolute measurements such as would be taken by a smart meter. It then generates batch messages covering 24-hour periods starting at midnight, similar to how the Dutch infrastructure batches daily meter readings, in the different possible encodings with and without compression applied. For an explanation of the different encodings, see the paper. The code then provides statistics on the efficiency of encoding and compression for the entire dataset, and attempts to find the periods of multi-day absence for each household. It also generates graphs in the style used in the paper and presentation.
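As a toy illustration of the leak (this is not the paper's code, and the readings below are synthetic): compress one day of 96 quarter-hourly readings with zlib. A day of all-zero readings compresses far better than a normal day, and since encryption preserves message length, an observer can spot such days without decrypting anything.

```python
# Toy demonstration of the traffic-analysis leak described above: the
# compressed size of a daily batch reveals whether the house was in use.
import json
import random
import zlib

random.seed(1)
normal_day = [round(random.uniform(0.05, 1.5), 3) for _ in range(96)]
absent_day = [0.0] * 96   # nobody home: all quarter-hourly readings are zero

for name, day in (("normal", normal_day), ("absent", absent_day)):
    payload = json.dumps(day).encode()
    print(f"{name}: raw {len(payload)} bytes, "
          f"compressed {len(zlib.compress(payload))} bytes")
```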
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains supplementary materials for the following journal paper:
Valdemar Švábenský, Jan Vykopal, Pavel Seda, Pavel Čeleda. Dataset of Shell Commands Used by Participants of Hands-on Cybersecurity Training. In Elsevier Data in Brief. 2021. https://doi.org/10.1016/j.dib.2021.107398
How to cite
If you use or build upon the materials, please use the BibTeX entry below to cite the original paper (not only this web link).
@article{Svabensky2021dataset,
  author    = {\v{S}v\'{a}bensk\'{y}, Valdemar and Vykopal, Jan and Seda, Pavel and \v{C}eleda, Pavel},
  title     = {{Dataset of Shell Commands Used by Participants of Hands-on Cybersecurity Training}},
  journal   = {Data in Brief},
  publisher = {Elsevier},
  volume    = {38},
  year      = {2021},
  issn      = {2352-3409},
  url       = {https://doi.org/10.1016/j.dib.2021.107398},
  doi       = {10.1016/j.dib.2021.107398},
}
The data were collected using a logging toolset referenced here.
Attached content
Dataset (data.zip). The collected data are attached here on Zenodo. A copy is also available in this repository.
Analytical tools (toolset.zip). To analyze the data, you can instantiate the toolset or this project for ELK.
Version history
Version 1 (https://zenodo.org/record/5137355) contains 13446 log records from 175 trainees. These data are precisely those that are described in the associated journal paper. Version 1 provides a snapshot of the state when the article was published.
Version 2 (https://zenodo.org/record/5517479) contains 13446 log records from 175 trainees. The data are unchanged from Version 1, but the analytical toolset includes a minor fix.
Version 3 (https://zenodo.org/record/6670113) contains 21762 log records from 275 trainees. It is a superset of Version 2, with newly collected data added to the dataset.
The current Version 4 (https://zenodo.org/record/8136017) contains 21459 log records from 275 trainees. Compared to Version 3, we cleaned 303 invalid/duplicate command records.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
NeurIPS 2022 Accepted Paper Meta Info Dataset
This dataset is collected from the NeurIPS 2022 (Advances in Neural Information Processing Systems 35) conference accepted papers (https://papers.nips.cc/paper_files/paper/2023) as well as the arxiv website DeepNLP paper arxiv (http://www.deepnlp.org/content/paper/nips2022). For researchers interested in analyzing NeurIPS 2022 accepted papers and potential research trends, you can use the already cleaned-up JSON file in the… See the full description on the dataset page: https://huggingface.co/datasets/DeepNLP/NIPS-2022-Accepted-Papers.
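The dataset should be loadable with the Hugging Face datasets library; the split name below is an assumption, not confirmed by the description.

```python
# Sketch: load the dataset from the Hugging Face Hub.
from datasets import load_dataset

ds = load_dataset("DeepNLP/NIPS-2022-Accepted-Papers", split="train")
print(len(ds), "records")
print(ds[0])   # one accepted-paper record
```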
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
These datasets were generated for calibrating robot-camera systems. In an extension, we also considered the problem of calibrating robots with more than one camera.
These datasets are provided as a companion to the paper "Solving the Robot-World Hand-Eye(s) Calibration Problem with Iterative Methods" by Amy Tabb and Khalil M. Ahmad Yousef.
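As background (this formulation is standard in the literature and is not quoted from the paper itself; exact frame conventions vary between papers), the robot-world hand-eye calibration problem is usually posed as solving simultaneously for two unknown rigid transformations:

```latex
% For each robot pose i = 1..n, with A_i the robot base-to-hand pose and
% B_i the pose of the calibration target in the camera frame, find the
% hand-to-camera transformation X and the base-to-world transformation Z:
\[
  \mathbf{A}_i \,\mathbf{X} \;=\; \mathbf{Z}\,\mathbf{B}_i , \qquad i = 1, \dots, n .
\]
```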
Included are eight datasets in zipped files, numbered DS1.zip, DS2.zip, etc.
An explanation of the format of the datasets is provided in the README resource, in the file "README_input_format.txt". Generally, each zipped folder consists of images and a text file of robot positions recorded when those images were acquired.
Open source code can be found at:
https://github.com/amy-tabb/RWHEC-Tabb-AhmadYousef
We also include the results of running our code on one of the datasets so that you can verify that the code works correctly. This folder is named DS1_write.zip and can be found in the resource titled "Output from running methods on Dataset 1".
Problems/comments/bugs should be addressed to amy.tabb@ars.usda.gov.
Resources in this dataset:
- Resource Title: README. File Name: README_input_format.txt.txt. Resource Description: This file gives an in-depth description of the image and robot position datasets.
- Resource Title: Dataset 1. File Name: DS1.zip
- Resource Title: Dataset 2. File Name: DS2.zip
- Resource Title: Dataset 3. File Name: DS3.zip
- Resource Title: Dataset 4. File Name: DS4.zip
- Resource Title: Dataset 5. File Name: DS5.zip
- Resource Title: Dataset 6. File Name: DS6.zip
- Resource Title: Dataset 7. File Name: DS7.zip
- Resource Title: Dataset 8. File Name: DS8.zip
- Resource Title: Output from running methods on Dataset 1. File Name: DS1_write.zip