48 datasets found

MERGE Dataset
zenodo.org
zip
Updated Feb 7, 2025
Cite
Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva (2025). MERGE Dataset [Dataset]. http://doi.org/10.5281/zenodo.13939205
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.13939205
Dataset updated
Feb 7, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The MERGE dataset is a collection of audio, lyrics, and bimodal datasets for conducting research on Music Emotion Recognition. A complete version is provided for each modality. The audio datasets provide 30-second excerpts for each sample, while full lyrics are provided in the relevant datasets. The number of samples available in each dataset is as follows:

MERGE Audio Complete: 3554

MERGE Audio Balanced: 3232

MERGE Lyrics Complete: 2568

MERGE Lyrics Balanced: 2400

MERGE Bimodal Complete: 2216

MERGE Bimodal Balanced: 2000

Additional Contents

Each dataset contains the following additional files:

av_values: File containing the arousal and valence values for each sample sorted by their identifier;

tvt_dataframes: Train, validate, and test splits for each dataset. Both a 70-15-15 and a 40-30-30 split are provided.
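
As a rough illustration of what a 70-15-15 split involves, the sketch below shuffles sample identifiers within each quadrant label and slices them proportionally. The function name, the seed, and the toy identifiers are hypothetical, and this is not the procedure used to build tvt_dataframes; when comparing against published results, the splits shipped with the dataset should be used instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split sample ids into train/validate/test, stratified by label.

    `labels` maps a sample id to its quadrant label. Illustrative only:
    this is not how the splits in tvt_dataframes were generated.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample_id, label in labels.items():
        by_label[label].append(sample_id)

    train, validate, test = [], [], []
    for label in sorted(by_label):
        ids = sorted(by_label[label])
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        train += ids[:n_train]
        validate += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]  # whatever remains goes to test
    return train, validate, test


# Toy data: 100 hypothetical identifiers per quadrant.
toy_labels = {f"S{q}_{i:03d}": f"Q{q}" for q in range(1, 5) for i in range(100)}
train, validate, test = stratified_split(toy_labels)
print(len(train), len(validate), len(test))
```

Passing `fractions=(0.40, 0.30, 0.30)` gives the other split ratio mentioned above.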

Metadata

A metadata spreadsheet is provided for each dataset with the following information for each sample, if available:

Song (Audio and Lyrics datasets) - Song identifiers. Identifiers starting with MT were extracted from the AllMusic platform, while those starting with A or L were collected from private collections;

Quadrant - Label corresponding to one of the four quadrants from Russell's Circumplex Model;

AllMusic Id - For samples starting with A or L, the matching AllMusic identifier is also provided. This was used to complement the available information for the samples originally obtained from the platform;

Artist - First performing artist or band;

Title - Song title;

Relevance - AllMusic metric representing the relevance of the song in relation to the query used;

Duration - Song length in seconds;

Moods - User-generated mood tags extracted from the AllMusic platform and available in Warriner's affective dictionary;

MoodsAll - User-generated mood tags extracted from the AllMusic platform;

Genres - User-generated genre tags extracted from the AllMusic platform;

Themes - User-generated theme tags extracted from the AllMusic platform;

Styles - User-generated style tags extracted from the AllMusic platform;

AppearancesTrackIDs - All AllMusic identifiers related to a sample;

Sample - Availability of the sample in the AllMusic platform;

SampleURL - URL to the 30-second excerpt in AllMusic;

ActualYear - Year of song release.
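
To make the relation between the arousal and valence values and the Quadrant label concrete, here is a minimal sketch. It assumes the values lie on a scale centred at a `midpoint` (the default of 0.5 is a guess, so check the range actually used in av_values) and it assumes a quadrant numbering that is common in the Music Emotion Recognition literature; neither assumption is confirmed by the dataset description.

```python
def quadrant(arousal, valence, midpoint=0.5):
    """Map an arousal/valence pair to a quadrant label.

    Assumptions (not taken from the dataset description):
    - values are centred at `midpoint`;
    - Q1 = positive valence / high arousal, Q2 = negative / high,
      Q3 = negative / low, Q4 = positive / low;
    - ties at the midpoint go to the high / positive side.
    """
    high_arousal = arousal >= midpoint
    positive_valence = valence >= midpoint
    if positive_valence and high_arousal:
        return "Q1"
    if high_arousal:
        return "Q2"
    if not positive_valence:
        return "Q3"
    return "Q4"


print(quadrant(arousal=0.9, valence=0.2))
```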

Citation

If you use some part of the MERGE dataset in your research, please cite the following article:

Louro, P. L., Redinho, H., Santos, R., Malheiro, R., Panda, R., and Paiva, R. P. (2024). MERGE - A Bimodal Dataset for Static Music Emotion Recognition. arXiv. URL: https://arxiv.org/abs/2407.06060.

BibTeX:

@misc{louro2024mergebimodaldataset,
title={MERGE -- A Bimodal Dataset for Static Music Emotion Recognition},
author={Pedro Lima Louro and Hugo Redinho and Ricardo Santos and Ricardo Malheiro and Renato Panda and Rui Pedro Paiva},
year={2024},
eprint={2407.06060},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2407.06060},
}

Acknowledgements

This work is funded by FCT - Foundation for Science and Technology, I.P., within the scope of the projects: MERGE - DOI: 10.54499/PTDC/CCI-COM/3171/2021 financed with national funds (PIDDAC) via the Portuguese State Budget; and project CISUC - UID/CEC/00326/2020 with funds from the European Social Fund, through the Regional Operational Program Centro 2020.

Renato Panda was supported by Ci2 - FCT UIDP/05567/2020.
NSF/NCAR GV HIAPER 1 Minute Data Merge
data.ucar.edu
ascii
Updated Dec 26, 2024
Cite
Gao Chen; Jennifer R. Olson; Michael Shook (2024). NSF/NCAR GV HIAPER 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/R1RA-JHKZ-W913
Available download formats: ascii
Unique identifier
https://doi.org/10.26023/R1RA-JHKZ-W913
Dataset updated
Dec 26, 2024
Dataset provided by
University Corporation for Atmospheric Research
Authors
Gao Chen; Jennifer R. Olson; Michael Shook
Time period covered
May 18, 2012 - Jun 30, 2012
Description
This data set contains NSF/NCAR GV HIAPER 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 30 June 2012. These are updated merges from the NASA DC3 archive that were made available 13 June 2014. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows this naming convention: "dc3-mrg60-gV_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (with the comment "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.
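
The naming convention can be checked mechanically. The sketch below parses a grand-merge file name with a regular expression. The example file name is made up from the campaign dates quoted in the description, the pattern generalises the literal "R5" to any revision number, and the meaning of the first date field is not stated in the description, so it is only exposed as `date`.

```python
import re
from datetime import datetime

# Pattern built from the stated convention; "R5" is generalised to R<digits>.
GRAND_MERGE = re.compile(
    r"^dc3-mrg60-gV_merge_(?P<date>\d{8})_R(?P<rev>\d+)_thru(?P<last>\d{8})\.ict$"
)


def parse_grand_merge_name(name):
    """Return the fields of a grand-merge file name, or None if it does not match."""
    match = GRAND_MERGE.match(name)
    if match is None:
        return None
    return {
        "date": datetime.strptime(match["date"], "%Y%m%d").date(),
        "revision": int(match["rev"]),
        "last_flight": datetime.strptime(match["last"], "%Y%m%d").date(),
    }


# Hypothetical file name, for illustration only.
info = parse_grand_merge_name("dc3-mrg60-gV_merge_20120518_R5_thru20120630.ict")
print(info)
```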
Data from: KORUS-AQ Aircraft Merge Data Files
catalog.data.gov
s.cnmilf.com
Updated Jul 3, 2025
Cite
NASA/LARC/SD/ASDC (2025). KORUS-AQ Aircraft Merge Data Files [Dataset]. https://catalog.data.gov/dataset/korus-aq-aircraft-merge-data-files
Dataset updated
Jul 3, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea’s National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements, including both in-situ and remote sensing instruments.

Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near the surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.
Merger of BNV-D data (2008 to 2019) and enrichment
data.europa.eu
zip
Cite
Patrick VINCOURT, Merger of BNV-D data (2008 to 2019) and enrichment [Dataset]. https://data.europa.eu/data/datasets/5f1c3eca9d149439e50c740f
Available download formats: zip (18530465)
Dataset authored and provided by
Patrick VINCOURT
Description
Merge (in R) of the data published at https://www.data.gouv.fr/fr/datasets/ventes-de-pesticides-par-departement/, joined with two other sources of information associated with marketing authorisations (MAs):

- uses: https://www.data.gouv.fr/fr/datasets/usages-des-produits-phytosanitaires/

- information on the “Biocontrol” status of the product, from document DGAL/SDQSPV/2020-784, published on 18/12/2020 at https://agriculture.gouv.fr/quest-ce-que-le-biocontrole

All the initial files (.csv converted to .txt), the R code used to merge the data, and the various output files are collected in a zip.

NB:

1) “YASCUB” stands for {year,AMM,Substance_active,Classification,Usage,Statut_“BioConttrol”}; substances not on the DGAL/SDQSPV list are coded NA.

2) The file of biocontrol products shall be cleaned of the duplicates generated by marketing authorisations that lead to several trade names.

3) The BNVD_BioC_DY3 table and the output file BNVD_BioC_DY3.txt contain the fields {Code_Region,Region,Dept,Code_Dept,Anne,Usage,Classification,Type_BioC,Quantite_substance}.
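
The de-duplication and join described in the notes can be sketched as follows. This is a Python analogue under stated assumptions, not the author's R code: the field names `AMM` and `Type_BioC` come from the field lists above, while the row contents, the `Nom` field, and the function name are hypothetical. Python's `None` stands in for R's NA.

```python
def join_biocontrol_status(sales_rows, biocontrol_rows):
    """Left-join a biocontrol status onto sales rows, keyed by AMM.

    Duplicate AMM entries in the biocontrol list (one authorisation,
    several trade names) are collapsed to the first one seen. Sales rows
    with no match get Type_BioC = None (the NA case).
    """
    status_by_amm = {}
    for row in biocontrol_rows:
        status_by_amm.setdefault(row["AMM"], row["Type_BioC"])

    joined = []
    for row in sales_rows:
        merged = dict(row)
        merged["Type_BioC"] = status_by_amm.get(row["AMM"])
        joined.append(merged)
    return joined


# Hypothetical toy rows.
biocontrol = [
    {"AMM": "0001", "Nom": "Trade name A", "Type_BioC": "X"},
    {"AMM": "0001", "Nom": "Trade name B", "Type_BioC": "X"},  # duplicate AMM
]
sales = [
    {"AMM": "0001", "Quantite_substance": 12.5},
    {"AMM": "0002", "Quantite_substance": 3.0},  # not on the biocontrol list
]
result = join_biocontrol_status(sales, biocontrol)
print(result)
```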
NASA DC-8 1 Second Data Merge
data.ucar.edu
ascii
Updated Dec 26, 2024
Cite
Gao Chen; Jennifer R. Olson; Langley Research Center (LaRC), NASA (2024). NASA DC-8 1 Second Data Merge [Dataset]. http://doi.org/10.5065/D6SF2TXB
Available download formats: ascii
Unique identifier
https://doi.org/10.5065/D6SF2TXB
Dataset updated
Dec 26, 2024
Dataset provided by
University Corporation for Atmospheric Research
Authors
Gao Chen; Jennifer R. Olson; Langley Research Center (LaRC), NASA
Time period covered
May 18, 2012 - Jun 22, 2012
Description
This data set contains NASA DC-8 1 Second Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. These merges are an updated version provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. No "grand merge" has been provided for the 1-second data on the DC-8 aircraft due to its prohibitive size (~1.5 GB). In most cases, downloading the individual merge files for each day and simply concatenating them should suffice. This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments. For more information on the updates to this dataset, please see the readme file.
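
Concatenating the daily files means skipping each file's header first. The sketch below assumes the common ICARTT layout in which the first line starts with the number of header lines and the last header line holds the comma-separated column names. The function names are hypothetical, values are kept as strings, and the files are assumed to be UTF-8 text; a different encoding or file format index would need adjustments.

```python
def read_icartt(path):
    """Return (column_names, data_rows) from one ICARTT-style text file."""
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
    # First line: "<number of header lines>, <file format index>".
    n_header = int(lines[0].split(",")[0])
    # The last header line carries the column names.
    columns = [name.strip() for name in lines[n_header - 1].split(",")]
    rows = [
        [value.strip() for value in line.split(",")]
        for line in lines[n_header:]
        if line.strip()
    ]
    return columns, rows


def concatenate_merges(paths):
    """Stack the data rows of several files that share the same columns."""
    all_columns, all_rows = None, []
    for path in sorted(paths):
        columns, rows = read_icartt(path)
        if all_columns is None:
            all_columns = columns
        elif columns != all_columns:
            raise ValueError(f"column mismatch in {path}")
        all_rows.extend(rows)
    return all_columns, all_rows
```

Sorting the paths keeps the flights in date order when the date is part of the file name, and the column check guards against silently stacking files whose variables differ.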
Data from: HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE...
scielo.figshare.com
tiff
Updated Jun 4, 2023
Cite
Diego Ariel de Lima; Camilo Partezani Helito; Lana Lacerda de Lima; Renata Clazzer; Romeu Krause Gonçalves; Olavo Pires de Camargo (2023). HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE USING R SOFTWARE AND RSTUDIO [Dataset]. http://doi.org/10.6084/m9.figshare.19899537.v1
Available download formats: tiff
Unique identifier
https://doi.org/10.6084/m9.figshare.19899537.v1
Dataset updated
Jun 4, 2023
Dataset provided by
SciELO journals
Authors
Diego Ariel de Lima; Camilo Partezani Helito; Lana Lacerda de Lima; Renata Clazzer; Romeu Krause Gonçalves; Olavo Pires de Camargo
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
ABSTRACT Meta-analysis is an adequate statistical technique to combine results from different studies, and its use has been growing in the medical field. Thus, knowing not only how to interpret a meta-analysis but also how to perform one is fundamental today. Therefore, the objective of this article is to present the basic concepts and serve as a guide for conducting a meta-analysis using the R and RStudio software. To this end, the reader is given the basic R and RStudio commands necessary for conducting a meta-analysis. An advantage of R is that it is free software. For a better understanding of the commands, two examples are presented in a practical way, and some basic concepts of this statistical technique are reviewed. It is assumed that the data necessary for the meta-analysis have already been collected; that is, methodologies for systematic review are not discussed. Finally, it is worth remembering that many other techniques used in meta-analyses are not addressed in this work. However, the two examples enable the reader to proceed with good and robust meta-analyses. Level of Evidence V, Expert Opinion.
Cleaned NHANES 1988-2018
figshare.com
txt
Updated Feb 18, 2025
Cite
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.21743372.v9
Dataset updated
Feb 18, 2025
Dataset provided by
figshare
Authors
Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

- demographics (281 variables),

- dietary consumption (324 variables),

- physiological functions (1,040 variables),

- occupation (61 variables),

- questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),

- medications (29 variables),

- mortality information linked from the National Death Index (15 variables),

- survey weights (857 variables),

- environmental exposure biomarker measurements (598 variables), and

- chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file.

- The curated NHANES datasets comprise 20 .csv files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.

- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.

- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.

- “dictionary_drug_codes.csv” contains the dictionary of descriptors for the drug codes.

- “nhanes_inconsistencies_documentation.xlsx” is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries (but not the uncleaned modules) can be downloaded as a .zip file, which includes an .RData file and an .R file.

- “w - nhanes_1988_2018.RData” contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.

- “m - nhanes_1988_2018.R” shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order.

- “example_0 - merge_datasets_together.Rmd” demonstrates how to merge the curated NHANES datasets together.

- “example_1 - account_for_nhanes_design.Rmd” demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.

- “example_2 - calculate_summary_statistics.Rmd” demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design.

- “example_3 - run_multiple_regressions.Rmd” demonstrates how to run multiple regression models with and without adjusting for the sampling design.
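
To give a feel for what "with and without accounting for the sampling design" means for a simple summary statistic, the sketch below compares an unweighted mean with a survey-weighted mean. It is written in Python rather than in the R of the starter code, the numbers are invented, and it covers only the point estimate: design-based standard errors also need the strata and sampling-unit information, which is outside the scope of this sketch.

```python
def weighted_mean(values, weights):
    """Weighted mean that skips observations with a missing value or weight."""
    total, total_weight = 0.0, 0.0
    for value, weight in zip(values, weights):
        if value is None or weight is None:
            continue
        total += value * weight
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no usable observations")
    return total / total_weight


# Invented measurements and survey weights for four participants.
measurements = [1.0, 3.0, None, 3.0]
survey_weights = [1.0, 3.0, 2.0, None]

unweighted = weighted_mean(measurements, [1.0 if w is not None else None for w in survey_weights])
weighted = weighted_mean(measurements, survey_weights)
print(unweighted, weighted)
```

Participants with larger weights represent more of the population, so the weighted estimate moves toward their values.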
Multilevel modeling of time-series cross-sectional data reveals the dynamic...
data.niaid.nih.gov
dataone.org
zip
Updated Mar 6, 2020
Cite
Kodai Kusano (2020). Multilevel modeling of time-series cross-sectional data reveals the dynamic interaction between ecological threats and democratic development [Dataset]. http://doi.org/10.5061/dryad.547d7wm3x
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.547d7wm3x
Dataset updated
Mar 6, 2020
Dataset provided by
University of Nevada, Reno
Authors
Kodai Kusano
License
https://spdx.org/licenses/CC0-1.0.html
Description
What is the relationship between environment and democracy? The framework of cultural evolution suggests that societal development is an adaptation to ecological threats. Pertinent theories assume that democracy emerges as societies adapt to ecological factors such as higher economic wealth, lower pathogen threats, less demanding climates, and fewer natural disasters. However, previous research confused within-country processes with between-country processes and erroneously interpreted between-country findings as if they generalize to within-country mechanisms. In this article, we analyze a time-series cross-sectional dataset to study the dynamic relationship between environment and democracy (1949-2016), accounting for previous misconceptions in levels of analysis. By separating within-country processes from between-country processes, we find that the relationship between environment and democracy not only differs by countries but also depends on the level of analysis. Economic wealth predicts increasing levels of democracy in between-country comparisons, but within-country comparisons show that democracy declines as countries become wealthier over time. This relationship is only prevalent among historically wealthy countries but not among historically poor countries, whose wealth also increased over time. By contrast, pathogen prevalence predicts lower levels of democracy in both between-country and within-country comparisons. Our longitudinal analyses identifying temporal precedence reveal that not only reductions in pathogen prevalence drive future democracy, but also democracy reduces future pathogen prevalence and increases future wealth. These nuanced results contrast with previous analyses using narrow, cross-sectional data. As a whole, our findings illuminate the dynamic process by which environment and democracy shape each other.
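
One common way to separate within-country from between-country variation is to split each time-varying predictor into a country mean and a yearly deviation from that mean. The sketch below shows that decomposition on invented numbers. It is a generic illustration, not necessarily the specification used in the article, and the field and function names are hypothetical.

```python
from collections import defaultdict


def within_between(records, unit_key, value_key):
    """Add 'between' (unit mean) and 'within' (deviation from it) to each record."""
    values_by_unit = defaultdict(list)
    for record in records:
        values_by_unit[record[unit_key]].append(record[value_key])
    unit_means = {unit: sum(vals) / len(vals) for unit, vals in values_by_unit.items()}

    decomposed = []
    for record in records:
        between = unit_means[record[unit_key]]
        decomposed.append(
            {**record, "between": between, "within": record[value_key] - between}
        )
    return decomposed


# Invented country-year observations.
panel = [
    {"country": "A", "year": 1949, "wealth": 1.0},
    {"country": "A", "year": 1950, "wealth": 3.0},
    {"country": "B", "year": 1949, "wealth": 10.0},
]
for row in within_between(panel, "country", "wealth"):
    print(row)
```

In a multilevel model, the two derived columns would enter as separate predictors, which lets their coefficients differ.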

Methods

Our time-series cross-sectional data combine various online databases. Country names were first identified and matched using the R package “countrycode” (Arel-Bundock, Enevoldsen, & Yetman, 2018) before all datasets were merged. Occasionally, we modified unidentified country names to be consistent across datasets. We then transformed “wide” data into “long” data and merged them using R’s Tidyverse framework (Wickham, 2014). Our analysis begins with the year 1949 because one of the key time-variant level-1 variables, pathogen prevalence, was only available from 1949 on. See our Supplemental Material for all data, Stata syntax, R-markdown for visualization, supplemental analyses and detailed results (available at https://osf.io/drt8j/).
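
The wide-to-long transformation and the merge can be sketched in a few lines. The authors worked in R with the Tidyverse; the version below is a plain-Python analogue with invented values and hypothetical function names, and it assumes the wide tables have one row per country and one column per year.

```python
def wide_to_long(wide_rows, id_key, value_name):
    """Turn one row per country (year columns) into one row per country-year."""
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            if key == id_key:
                continue
            long_rows.append({id_key: row[id_key], "year": int(key), value_name: value})
    return long_rows


def merge_long(left, right, keys=("country", "year")):
    """Inner join of two long tables on the given key columns."""
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            merged.append({**row, **match})
    return merged


# Invented wide tables.
wealth_wide = [{"country": "A", "1949": 1.0, "1950": 1.2}]
democracy_wide = [{"country": "A", "1949": 5, "1950": 6}]

merged = merge_long(
    wide_to_long(wealth_wide, "country", "wealth"),
    wide_to_long(democracy_wide, "country", "democracy"),
)
print(merged)
```

The join only works if country names agree across sources, which is why the names are harmonised before merging.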
NCES Academic Library Survey Dataset 1996 - 2020 -- alsMERGE_2020.csv
figshare.com
txt
Updated Jan 16, 2024
Cite
Starr Hoffman (2024). NCES Academic Library Survey Dataset 1996 - 2020 -- alsMERGE_2020.csv [Dataset]. http://doi.org/10.6084/m9.figshare.25007429.v1
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.25007429.v1
Dataset updated
Jan 16, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Starr Hoffman
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains data from the National Center for Education Statistics' Academic Library Survey, which was gathered every two years from 1996 - 2014, and annually in IPEDS starting in 2014 (this dataset has continued to only merge data every two years, following the original schedule). This data was merged, transformed, and used for research by Starr Hoffman and Samantha Godbey.

This data was merged using R; R scripts for this merge can be made available upon request. Some variables changed names or definitions during this time; a view of these variables over time is provided in the related Figshare Project. Carnegie Classification changed several times during this period; all Carnegie classifications were crosswalked to the 2000 classification version; that information is also provided in the related Figshare Project. This data was used for research published in several articles, conference papers, and posters starting in 2018 (some of this research used an older version of the dataset which was deposited in the University of Nevada, Las Vegas's repository).

Sources

All data sources were downloaded from the National Center for Education Statistics website https://nces.ed.gov/. Individual datasets and years accessed are listed below.

[dataset] U.S. Department of Education, National Center for Education Statistics, Academic Libraries component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

[dataset] U.S. Department of Education, National Center for Education Statistics, Academic Libraries Survey (ALS) Public Use Data File, Library Statistics Program, (2012, 2010, 2008, 2006, 2004, 2002, 2000, 1998, 1996), https://nces.ed.gov/surveys/libraries/aca_data.asp

[dataset] U.S. Department of Education, National Center for Education Statistics, Institutional Characteristics component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

[dataset] U.S. Department of Education, National Center for Education Statistics, Fall Enrollment component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014, 2012, 2010, 2008, 2006, 2004, 2002, 2000, 1998, 1996), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

[dataset] U.S. Department of Education, National Center for Education Statistics, Human Resources component, Integrated Postsecondary Education Data System (IPEDS), (2020, 2018, 2016, 2014, 2012, 2010, 2008, 2006), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

[dataset] U.S. Department of Education, National Center for Education Statistics, Employees Assigned by Position component, Integrated Postsecondary Education Data System (IPEDS), (2004, 2002), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7

[dataset] U.S. Department of Education, National Center for Education Statistics, Fall Staff component, Integrated Postsecondary Education Data System (IPEDS), (1999, 1997, 1995), https://nces.ed.gov/ipeds/datacenter/login.aspx?gotoReportId=7
NASA DC-8 SAGAAERO Data Merge
data.ucar.edu
ascii
Updated Dec 26, 2024
Cite
Gao Chen; Jennifer R. Olson; Michael Shook (2024). NASA DC-8 SAGAAERO Data Merge [Dataset]. http://doi.org/10.26023/ANQE-HZRR-P30K
Available download formats: ascii
Unique identifier
https://doi.org/10.26023/ANQE-HZRR-P30K
Dataset updated
Dec 26, 2024
Dataset provided by
University Corporation for Atmospheric Research
Authors
Gao Chen; Jennifer R. Olson; Michael Shook
Time period covered
May 18, 2012 - Jun 22, 2012
Description
This data set contains NASA DC-8 SAGAAERO Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. These merge files were updated by NASA. The data have been merged to the SAGAAero file timeline. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided, which includes data from all the individual merged flights throughout the mission. The grand merge follows this naming convention: "dc3-mrgSAGAAero-dc8_merge_YYYYMMdd_R*_thruYYYYMMdd.ict" (with the comment "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.
Data supporting the Master thesis "Monitoring von Open Data Praktiken -...
zenodo.org
data.niaid.nih.gov
zip
Updated Nov 21, 2024
Cite
Katharina Zinke (2024). Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" [Dataset]. http://doi.org/10.5281/zenodo.14196539
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.14196539
Dataset updated
Nov 21, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Katharina Zinke
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data supporting the Master thesis "Monitoring von Open Data Praktiken - Herausforderungen beim Auffinden von Datenpublikationen am Beispiel der Publikationen von Forschenden der TU Dresden" (Monitoring open data practices - challenges in finding data publications using the example of publications by researchers at TU Dresden) - Katharina Zinke, Institut für Bibliotheks- und Informationswissenschaften, Humboldt-Universität Berlin, 2023

This ZIP-File contains the data the thesis is based on, interim exports of the results and the R script with all pre-processing, data merging and analyses carried out. The documentation of the additional, explorative analysis is also available. The actual PDFs and text files of the scientific papers used are not included as they are published open access.

The folder structure is shown below with the file names and a brief description of the contents of each file. For details concerning the analysis approach, please refer to the master's thesis (publication following soon).

## Data sources

Folder 01_SourceData/

- PLOS-Dataset_v2_Mar23.csv (PLOS-OSI dataset)

- ScopusSearch_ExportResults.csv (export of Scopus search results from Scopus)

- ScopusSearch_ExportResults.ris (export of Scopus search results from Scopus)

- Zotero_Export_ScopusSearch.csv (export of the file names and DOIs of the Scopus search results from Zotero)

## Automatic classification

Folder 02_AutomaticClassification/

- (NOT INCLUDED) PDFs folder (Folder for PDFs of all publications identified by the Scopus search, named AuthorLastName_Year_PublicationTitle_Title)

- (NOT INCLUDED) PDFs_to_text folder (Folder for all texts extracted from the PDFs by ODDPub, named AuthorLastName_Year_PublicationTitle_Title)

- PLOS_ScopusSearch_matched.csv (merge of the Scopus search results with the PLOS_OSI dataset for the files contained in both)

- oddpub_results_wDOIs.csv (results file of the ODDPub classification)

- PLOS_ODDPub.csv (merge of the results file of the ODDPub classification with the PLOS-OSI dataset for the publications contained in both)

## Manual coding

Folder 03_ManualCheck/

- CodeSheet_ManualCheck.txt (Code sheet with descriptions of the variables for manual coding)

- ManualCheck_2023-06-08.csv (Manual coding results file)

- PLOS_ODDPub_Manual.csv (Merge of the results file of the ODDPub and PLOS-OSI classification with the results file of the manual coding)

## Explorative analysis for the discoverability of open data

Folder 04_FurtherAnalyses/

- Proof_of_of_Concept_Open_Data_Monitoring.pdf (Description of the explorative analysis of the discoverability of open data publications using the example of a researcher) - in German

## R-Script

- Analyses_MA_OpenDataMonitoring.R (R-Script for preparing, merging and analyzing the data and for performing the ODDPub algorithm)
Data from: A dataset to model Levantine landcover and land-use change...
data.niaid.nih.gov
zenodo.org
Updated Dec 16, 2023
Cite
Kempf, Michael (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10396147
Dataset updated
Dec 16, 2023
Dataset authored and provided by
Kempf, Michael
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Levant
Description
Overview

This dataset is the repository for the following paper submitted to Data in Brief:

Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

The Data in Brief article contains the supplement information and is the related data paper to:

Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

Description/abstract

The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian Conflict. This turmoil has strained neighbouring countries like Jordan through the influx of Syrian refugees and has increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

Folder structure

The main folder after download contains all data. The following subfolders are stored in it as zipped files:

“code” stores the 9 code chunks described under “Code structure” to read, extract, process, analyse, and visualize the data.

“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

“yield_productivity” contains .csv files of yield information for all countries listed above.

“population” contains two files with the same name but different formats. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.

Code structure

1_MODIS_NDVI_hdf_file_extraction.R

This is the first code chunk, which extracts the MODIS data from the .hdf file format. The terra package must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Download the MODIS data after registration from https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.

2_MERGE_MODIS_tiles.R

In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in consecutive numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
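The numerical ordering matters because a plain alphabetical sort would place file 10 before file 2 and scramble the time series. The following is a minimal Python sketch of that idea only, not the repository's R code; the file names and the helper `natural_key` are hypothetical.

```python
import re

def natural_key(name):
    """Split a file name into text and integer parts so that
    'NDVI_10' sorts after 'NDVI_2' instead of before it."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

# Hypothetical file names; a plain sorted() would give 1, 10, 2.
files = ["NDVI_10.tif", "NDVI_2.tif", "NDVI_1.tif"]
ordered = sorted(files, key=natural_key)
print(ordered)
```

Sorting once before stacking keeps the layer index aligned with the acquisition order, which the later trend analysis relies on.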

3_CROP_MODIS_merged_tiles.R

Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS. The repository provides the already clipped and merged NDVI datasets.

4_TREND_analysis_NDVI.R

Now, we want to perform trend analysis on the derived data. The data we load is tricky, as it contains images at a 16-day interval across each year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and mark all values that are significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3. To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
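The sequence described here (seasonal sums per year, a slope over the years, then z-scores) can be illustrated for a single pixel. This is a hedged Python sketch, not the repository's R script: the values are invented, the slope is an ordinary least-squares fit against the year index, the z-scores use the population standard deviation (the original may use the sample standard deviation), and the p-value step is left out.

```python
from statistics import mean, pstdev

def ols_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def z_scores(values):
    """Deviation of each value from the mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD
    return [(v - mu) / sigma for v in values]

# Hypothetical NDVI values for one pixel: 3 years, 2 images per season.
season_images = [[0.2, 0.3], [0.3, 0.4], [0.4, 0.5]]
season_sums = [sum(year) for year in season_images]  # one sum per year
slope = ols_slope(season_sums)
scores = z_scores(season_sums)
```

In the raster case the same calculation would run per cell across the annual layers.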

5_BUILT_UP_change_raster.R

Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3rd of March 2023, 100 m resolution, global coverage). Here, one can download the desired temporal coverage and, after cropping to the individual study area, reclassify it using the code. I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.

6_POPULATION_numbers_plot.R

For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.

7_YIELD_plot.R

In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.

8_GLDAS_read_extract_trend

The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format, and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to the variable of interest. For the abbreviations, see https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary (last accessed 9th of October 2023) or the respective code chunk when reading a .nc file with the ncdf4 package in R; alternatively, run print(nc) from the code or use names(the spatraster collection). Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.

From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and December-February (DJF, including Jan/Feb of the consecutive year).

From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot.

This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
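The difference between dividing by 12 and dividing by 1 is the difference between an annual mean and an annual sum. A small Python sketch of that switch, assuming complete years of monthly values; the function name `annual_aggregate` and the numbers are hypothetical and only mirror the r.sum=r.sum/12 step named in the description.

```python
def annual_aggregate(monthly, as_sum):
    """Collapse a monthly series into one value per year (12 values per year).

    as_sum=True keeps the annual total (divisor 1, e.g. rainfall);
    as_sum=False divides by 12 to get the annual mean (e.g. temperature).
    """
    if len(monthly) % 12 != 0:
        raise ValueError("expected complete years of 12 monthly values")
    years = [monthly[i:i + 12] for i in range(0, len(monthly), 12)]
    divisor = 1 if as_sum else 12
    return [sum(year) / divisor for year in years]

# Hypothetical two-year series.
rain = [10.0] * 12 + [20.0] * 12
temp = [15.0] * 12 + [16.0] * 12
rain_annual = annual_aggregate(rain, as_sum=True)
temp_annual = annual_aggregate(temp, as_sum=False)
```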

(9_workflow_diagramme) this simple code can be used to plot a workflow diagram and is detached from the actual analysis.

Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data Curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration, and Funding acquisition: Michael
Health and Retirement Study (HRS)
search.dataone.org
Updated Nov 21, 2023
Cite
Damico, Anthony (2023). Health and Retirement Study (HRS) [Dataset]. http://doi.org/10.7910/DVN/ELEKOY
Unique identifier
https://doi.org/10.7910/DVN/ELEKOY
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Damico, Anthony
Description
analyze the health and retirement study (hrs) with r

the hrs is the one and only longitudinal survey of american seniors. with a panel starting its third decade, the current pool of respondents includes older folks who have been interviewed every two years as far back as 1992. unlike cross-sectional or shorter panel surveys, respondents keep responding until, well, death do us part. paid for by the national institute on aging and administered by the university of michigan's institute for social research, if you apply for an interviewer job with them, i hope you like werther's original. figuring out how to analyze this data set might trigger your fight-or-flight synapses if you just start clicking around on michigan's website. instead, read pages numbered 10-17 (pdf pages 12-19) of this introduction pdf and don't touch the data until you understand figure a-3 on that last page. if you start enjoying yourself, here's the whole book. after that, it's time to register for access to the (free) data. keep your username and password handy, you'll need it for the top of the download automation r script. next, look at this data flowchart to get an idea of why the data download page is such a righteous jungle. but wait, good news: umich recently farmed out its data management to the rand corporation, who promptly constructed a giant consolidated file with one record per respondent across the whole panel. oh so beautiful. the rand hrs files make much of the older data and syntax examples obsolete, so when you come across stuff like instructions on how to merge years, you can happily ignore them - rand has done it for you.

the health and retirement study only includes noninstitutionalized adults when new respondents get added to the panel (as they were in 1992, 1993, 1998, 2004, and 2010) but once they're in, they're in - respondents have a weight of zero for interview waves when they were nursing home residents; but they're still responding and will continue to contribute to your statistics so long as you're generalizing about a population from a previous wave (for example: it's possible to compute "among all americans who were 50+ years old in 1998, x% lived in nursing homes by 2010"). my source for that 411? page 13 of the design doc. wicked.

this new github repository contains five scripts:

1992 - 2010 download HRS microdata.R

- loop through every year and every file, download, then unzip everything in one big party

import longitudinal RAND contributed files.R

- create a SQLite database (.db) on the local disk
- load the rand, rand-cams, and both rand-family files into the database (.db) in chunks (to prevent overloading ram)

longitudinal RAND - analysis examples.R

- connect to the sql database created by the 'import longitudinal RAND contributed files' program
- create two database-backed complex sample survey objects, using a taylor-series linearization design
- perform a mountain of analysis examples with wave weights from two different points in the panel

import example HRS file.R

- load a fixed-width file using only the sas importation script directly into ram with SAScii (http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html)
- parse through the IF block at the bottom of the sas importation script, blank out a number of variables
- save the file as an R data file (.rda) for fast loading later

replicate 2002 regression.R

- connect to the sql database created by the 'import longitudinal RAND contributed files' program
- create a database-backed complex sample survey object, using a taylor-series linearization design
- exactly match the final regression shown in this document provided by analysts at RAND as an update of the regression on pdf page B76 of this document

click here to view these five scripts

for more detail about the health and retirement study (hrs), visit:

- michigan's hrs homepage
- rand's hrs homepage
- the hrs wikipedia page
- a running list of publications using hrs

notes:

exemplary work making it this far. as a reward, here's the detailed codebook for the main rand hrs file. note that rand also creates 'flat files' for every survey wave, but really, most every analysis you can think of is possible using just the four files imported with the rand importation script above. if you must work with the non-rand files, there's an example of how to import a single hrs (umich-created) file, but if you wish to import more than one, you'll have to write some for loops yourself.

confidential to sas, spss, stata, and sudaan users: a tidal wave is coming. you can get water up your nose and be dragged out to sea, or you can grab a surf board. time to transition to r. :D
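loading a big file into a SQLite database in chunks, as the import script is described as doing, keeps only a few rows in ram at a time. a minimal sketch of that idea follows, written in python rather than r and not taken from the repository: the table name `rand_demo`, the column names, the sample rows, and the helper `load_in_chunks` are all hypothetical.

```python
import csv
import io
import sqlite3

def load_in_chunks(conn, table, reader, chunk_size):
    """Create `table` from the header row, then insert rows chunk by chunk
    so that at most `chunk_size` rows are held in memory."""
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    insert = f'INSERT INTO "{table}" VALUES ({marks})'
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            conn.executemany(insert, chunk)
            chunk = []
    if chunk:  # leftover rows that did not fill a whole chunk
        conn.executemany(insert, chunk)
    conn.commit()

# Hypothetical stand-in for a large respondent-level file.
sample = io.StringIO("id,wave,weight\n1,1,100\n2,1,0\n3,1,250\n")
conn = sqlite3.connect(":memory:")
load_in_chunks(conn, "rand_demo", csv.reader(sample), chunk_size=2)
row_count = conn.execute('SELECT COUNT(*) FROM "rand_demo"').fetchone()[0]
```

with a file on disk, pass `sqlite3.connect("some_file.db")` and a `csv.reader` over the open file instead of the in-memory stand-ins.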
terraceDL: A geomorphology deep learning dataset of agricultural terraces in...
figshare.com
bin
Updated Mar 22, 2023
Cite
Aaron Maxwell (2023). terraceDL: A geomorphology deep learning dataset of agricultural terraces in Iowa, USA [Dataset]. http://doi.org/10.6084/m9.figshare.22320373.v2
Available download formats: bin
Unique identifier
https://doi.org/10.6084/m9.figshare.22320373.v2
Dataset updated
Mar 22, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Aaron Maxwell
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Iowa
Description
scripts.zip

arcgisTools.atbx:

- terrainDerivatives: make terrain derivatives from digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
- rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

- makeChips.R: R function to break terrain derivatives and chips into image chips of a defined size.
- makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as ArcGIS Pro tool).
- merge_logs.R: R script to merge training logs into a single file.
- predictToExtents.ipynb: Python notebook to use trained model to predict to new data.
- trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
- assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
- graphs_results.R: R code to make graphs with ggplot2 to summarize results.
- makeChipsList.R: R code to generate lists of chips in a directory.
- makeMasks.R: R function to make raster masks from vector data (same as rasterizeFeatures ArcGIS Pro tool).

terraceDL.zip

- dems: LiDAR DTM data partitioned into training, testing, and validation datasets based on HUC8 watershed boundaries. Original DTM data were provided by the Iowa BMP mapping project: https://www.gis.iastate.edu/BMPs.
- extents: extents of the training, testing, and validation areas as defined by HUC 8 watershed boundaries.
- vectors: vector features representing agricultural terraces and partitioned into separate training, testing, and validation datasets. Original digitized features were provided by the Iowa BMP Mapping Project: https://www.gis.iastate.edu/BMPs.
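Breaking a raster into image chips of a defined size comes down to enumerating window corners over the grid. The Python sketch below shows one simple way to do that and is not the makeChips.R implementation: the description does not say how that script handles overlap or partial chips at the raster edge, so the `stride` parameter and the choice to drop partial chips are assumptions, and the raster dimensions are invented.

```python
def chip_windows(height, width, chip_size, stride):
    """Return (row, col) upper-left corners of every full chip that fits.

    Assumption: partial chips at the right and bottom edges are dropped.
    """
    rows = range(0, height - chip_size + 1, stride)
    cols = range(0, width - chip_size + 1, stride)
    return [(r, c) for r in rows for c in cols]

# Hypothetical 512 x 768 raster cut into non-overlapping 256 x 256 chips.
windows = chip_windows(height=512, width=768, chip_size=256, stride=256)
print(len(windows), windows[-1])
```

Applying the same windows to the terrain derivatives and to the raster masks keeps each image chip aligned with its label chip.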
DLR Falcon 1 Second Data Merge
data.ucar.edu
ascii
Updated Dec 26, 2024
Cite
Gao Chen; Jennifer R. Olson; Michael Shook (2024). DLR Falcon 1 Second Data Merge [Dataset]. http://doi.org/10.26023/PYCX-YRVR-AB0W
Available download formats: ascii
Unique identifier
https://doi.org/10.26023/PYCX-YRVR-AB0W
Dataset updated
Dec 26, 2024
Dataset provided by
University Corporation for Atmospheric Research
Authors
Gao Chen; Jennifer R. Olson; Michael Shook
Time period covered
May 29, 2012 - Jun 14, 2012
Description
This data set contains DLR Falcon 1 Second Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 29 May 2012 through 14 June 2012. These merges were created using data in the NASA DC3 archive as of September 25, 2013. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided. This includes data from all the individual merged flights throughout the mission. The grand merge uses the naming convention "dc3-mrg01-falcon_merge_YYYYMMdd_R1_thruYYYYMMdd.ict" (with the comment "_thruYYYYMMdd" indicating the last flight date included). This data set is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and data set comments.
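The naming convention can be built and parsed mechanically. The Python sketch below assumes that the first YYYYMMdd is the first flight date included, which the description does not state (it only defines the "_thru" date); the helper names and the example dates, taken from the stated campaign period, are illustrative.

```python
import re
from datetime import date

# Pattern for the grand-merge file name quoted in the description.
PATTERN = re.compile(r"^dc3-mrg01-falcon_merge_(\d{8})_R1_thru(\d{8})\.ict$")

def grand_merge_name(first, last):
    """Build a file name; `first` as the first flight date is an assumption."""
    return f"dc3-mrg01-falcon_merge_{first:%Y%m%d}_R1_thru{last:%Y%m%d}.ict"

def parse_grand_merge_name(name):
    """Return the two dates encoded in a grand-merge file name."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a grand-merge file name: {name}")
    def to_date(s):
        return date(int(s[:4]), int(s[4:6]), int(s[6:]))
    return to_date(match.group(1)), to_date(match.group(2))

name = grand_merge_name(date(2012, 5, 29), date(2012, 6, 14))
first, last = parse_grand_merge_name(name)
```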
Current Population Survey (CPS)
search.dataone.org
dataverse.harvard.edu
Updated Nov 21, 2023
Cite
Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
Unique identifier
https://doi.org/10.7910/DVN/AK4FDD
Dataset updated
Nov 21, 2023
Dataset provided by
Harvard Dataverse
Authors
Damico, Anthony
Description
analyze the current population survey (cps) annual social and economic supplement (asec) with r

the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population.

the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

this new github repository contains three scripts:

2005-2012 asec - download all microdata.R

- download the fixed-width file containing household, family, and person records
- import by separating this file into three tables, then merge 'em together at the person-level
- download the fixed-width file containing the person-level replicate weights
- merge the rectangular person-level file with the replicate weights, then store it in a sql database
- create a new variable - one - in the data table

2012 asec - analysis examples.R

- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- perform a boatload of analysis examples

replicate census estimates - 2011.R

- connect to the sql database created by the 'download all microdata' program
- create the complex sample survey object, using the replicate weights
- match the sas output shown in the png file below

2011 asec replicate weight sas output.png

- statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

click here to view these three scripts

for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:

- the census bureau's current population survey page
- the bureau of labor statistics' current population survey page
- the current population survey's wikipedia article

notes:

interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name.

as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
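two ideas from the description are easy to show in a few lines: a rectangular file attaches household- and family-level information to each person record, and the file year is one more than the year the income questions refer to. this is a hedged python sketch, not the repository's r code; every identifier, key, and value below is made up for illustration.

```python
# Hypothetical toy tables keyed the way a household/family/person split might be.
households = {1: {"state": "OH"}, 2: {"state": "TX"}}
families = {(1, 1): {"family_income": 52000}, (2, 1): {"family_income": 31000}}
persons = [
    {"household_id": 1, "family_id": 1, "person_id": 1, "age": 44},
    {"household_id": 1, "family_id": 1, "person_id": 2, "age": 9},
    {"household_id": 2, "family_id": 1, "person_id": 1, "age": 67},
]

def rectangularize(persons, families, households):
    """Attach family- and household-level fields to every person record."""
    out = []
    for person in persons:
        row = dict(person)
        row.update(families[(person["household_id"], person["family_id"])])
        row.update(households[person["household_id"]])
        out.append(row)
    return out

def reference_year(file_year):
    """The march file labeled N asks about experiences during N - 1."""
    return file_year - 1

rectangular = rectangularize(persons, families, households)
```

the result has one row per person, so person-level weights can be used directly in later estimates.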
National Health and Nutrition Examination Survey (NHANES)
dataverse.harvard.edu
Updated May 30, 2013
Cite
Anthony Damico (2013). National Health and Nutrition Examination Survey (NHANES) [Dataset]. http://doi.org/10.7910/DVN/IMWQPJ
Unique identifier
https://doi.org/10.7910/DVN/IMWQPJ
Dataset updated
May 30, 2013
Dataset provided by
Harvard Dataverse
Authors
Anthony Damico
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
analyze the national health and nutrition examination survey (nhanes) with r

nhanes is this fascinating survey where doctors and dentists accompany survey interviewers in a little mobile medical center that drives around the country. while the survey folks are interviewing people, the medical professionals administer laboratory tests and conduct a real doctor's examination. the blood work and medical exam allow researchers like you and me to answer tough questions like, "how many people have diabetes but don't know they have diabetes?" conducting the lab tests and the physical isn't cheap, so a new nhanes data set becomes available once every two years and only includes about twelve thousand respondents. since the number of respondents is so small, analysts often pool multiple years of data together. the replication scripts below give a few different examples of how multiple years of data can be pooled with r.

the survey gets conducted by the centers for disease control and prevention (cdc), and generalizes to the united states non-institutional, non-active duty military population. most of the data tables produced by the cdc include only a small number of variables, so importation with the foreign package's read.xport function is pretty straightforward. but that makes merging the appropriate data sets trickier, since it might not be clear what to pull for which variables. for every analysis, start with the table with 'demo' in the name -- this file includes basic demographics, weighting, and complex sample survey design variables. since it's quick to download the files directly from the cdc's ftp site, there's no massive ftp download automation script.

this new github repository contains five scripts:

2009-2010 interview only - download and analyze.R

- download, import, save the demographics and health insurance files onto your local computer
- load both files, limit them to the variables needed for the analysis, merge them together
- perform a few example variable recodes
- create the complex sample survey object, using the interview weights
- run a series of pretty generic analyses on the health insurance questions

2009-2010 interview plus laboratory - download and analyze.R

- download, import, save the demographics and cholesterol files onto your local computer
- load both files, limit them to the variables needed for the analysis, merge them together
- perform a few example variable recodes
- create the complex sample survey object, using the mobile examination component (mec) weights
- perform a direct-method age-adjustment and match figure 1 of this cdc cholesterol brief

replicate 2005-2008 pooled cdc oral examination figure.R

- download, import, save, pool, recode, create a survey object, run some basic analyses
- replicate figure 3 from this cdc oral health databrief - the whole barplot

replicate cdc publications.R

- download, import, save, pool, merge, and recode the demographics file plus cholesterol laboratory, blood pressure questionnaire, and blood pressure laboratory files
- match the cdc's example sas and sudaan syntax file's output for descriptive means
- match the cdc's example sas and sudaan syntax file's output for descriptive proportions
- match the cdc's example sas and sudaan syntax file's output for descriptive percentiles

replicate human exposure to chemicals report.R (user-contributed)

- download, import, save, pool, merge, and recode the demographics file plus urinary bisphenol a (bpa) laboratory files
- log-transform some of the columns to calculate the geometric means and quantiles
- match the 2007-2008 statistics shown on pdf page 21 of the cdc's fourth edition of the report

click here to view these five scripts

for more detail about the national health and nutrition examination survey (nhanes), visit:

- the cdc's nhanes homepage
- the national cancer institute's page of nhanes web tutorials

notes:

nhanes includes interview-only weights and interview + mobile examination component (mec) weights. if you only use questions from the basic interview in your analysis, use the interview-only weights (the sample size is a bit larger). i haven't really figured out a use for the interview-only weights -- nhanes draws most of its power from the combination of the interview and the mobile examination component variables. if you're only using variables from the interview, see if you can use a data set with a larger sample size like the current population survey (cps), national health interview survey (nhis), or medical expenditure panel survey (meps) instead.

confidential to sas, spss, stata, sudaan users: why are you still riding around on a donkey after we've invented the internal combustion engine? time to transition to r. :D
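pooling two-year cycles means stacking the records and rescaling the weights so the pooled file still represents one population rather than several. the python sketch below divides each two-year weight by the number of cycles pooled; that rule is an assumption drawn from general nhanes analytic practice, not from the description above, so check the cdc's guidelines for the cycles you use. the column names `SEQN` and `WTMEC2YR`, the weights, and the helper `pool_cycles` are illustrative.

```python
def pool_cycles(cycles, weight_col):
    """Stack records from several two-year cycles and rescale their weights.

    Assumption: pooled weight = two-year weight / number of cycles pooled.
    """
    n_cycles = len(cycles)
    pooled = []
    for label, rows in cycles.items():
        for row in rows:
            new_row = dict(row)
            new_row["cycle"] = label
            new_row["pooled_weight"] = row[weight_col] / n_cycles
            pooled.append(new_row)
    return pooled

# Hypothetical one-respondent cycles with invented weights.
cycles = {
    "2005-2006": [{"SEQN": 1, "WTMEC2YR": 1000.0}],
    "2007-2008": [{"SEQN": 2, "WTMEC2YR": 3000.0}],
}
pooled = pool_cycles(cycles, "WTMEC2YR")
```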
Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...
datasearch.gesis.org
openicpsr.org
Updated Feb 19, 2020
Cite
Kaplan, Jacob (2020). Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program Data: Property Stolen and Recovered (Supplement to Return A) 1960-2017 [Dataset]. http://doi.org/10.3886/E105403V3
Unique identifier
https://doi.org/10.3886/E105403V3
Dataset updated
Feb 19, 2020
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Kaplan, Jacob
Description
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.

Version 3 release notes:

- Adds data in the following formats: Excel.
- Changes project name to avoid confusing this data for the ones done by NACJD.

Version 2 release notes:

- Adds data for 2017.
- Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.

Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and precious metals, guns), the value of property stolen and the value of property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported).

All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.

There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."

For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.

To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
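The two cleaning rules described above can be sketched for a single cell. This is a Python illustration, not the author's R code: the column names are hypothetical, `None` stands in for NA, and the negative-indicator table contains only the "j" for -1 pair that the description gives, since the remaining codes are not listed here.

```python
# Sentinel values listed in the description: 1000-9000, 10000-100000, 99942.
SENTINELS = ({k * 1000 for k in range(1, 10)}
             | {k * 10000 for k in range(1, 11)}
             | {99942})

# Only the one indicator named in the description; the full table is not given.
NEGATIVE_INDICATORS = {"j": -1}

def clean_value(column, raw):
    """Apply both rules to one cell and return the cleaned value."""
    if isinstance(raw, str) and raw in NEGATIVE_INDICATORS:
        raw = NEGATIVE_INDICATORS[raw]
    # Sentinels are blanked only in "offenses"/"auto" columns, never "value".
    if column.startswith(("offenses", "auto")) and raw in SENTINELS:
        return None
    return raw

# Hypothetical column names for illustration.
print(clean_value("offenses_burglary", 5000))
print(clean_value("value_burglary", 5000))
print(clean_value("auto_stolen", "j"))
```

Keeping the sentinel check restricted by column prefix mirrors the note that the "value" columns were left untouched.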
BRAINTEASER ALS and MS Datasets
data.niaid.nih.gov
Updated Feb 12, 2025
Cite
Aidos, Helena (2025). BRAINTEASER ALS and MS Datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8083180
Explore at:
Dataset updated
Feb 12, 2025
Dataset provided by
Di Camillo, Barbara
Madeira, Sara C.
de Carvalho, Mamede
García Dominguez, Jose Manuel
Gromicho, Marta
Fariselli, Piero
Tavazzi, Eleonora
Aidos, Helena
Dagliati, Arianna
Longato, Enrico
Marchesin, Stefano
Tavazzi, Erica
Menotti, Laura
Manera, Umberto
Birolo, Giovanni
Cavalla, Paola
Trescato, Isotta
Chiò, Adriano
Guazzo, Alessandro
Faggioli, Guglielmo
Silvello, Gianmaria
Di Nunzio, Giorgio Maria
Bergamaschi, Roberto
Bosoni, Pietro
Vettoretti, Martina
Ferro, Nicola
Description
BRAINTEASER (Bringing Artificial Intelligence home for a better care of amyotrophic lateral sclerosis and multiple sclerosis) is a data science project that seeks to exploit the value of big data, including those related to health, lifestyle habits, and environment, to support patients with Amyotrophic Lateral Sclerosis (ALS) and Multiple Sclerosis (MS) and their clinicians. Taking advantage of cost-efficient sensors and apps, BRAINTEASER will integrate large, clinical datasets that host both patient-generated and environmental data.

As part of its activities, BRAINTEASER organized three open evaluation challenges on Intelligent Disease Progression Prediction (iDPP), iDPP@CLEF 2022, iDPP@CLEF 2023, and iDPP@CLEF 2024 co-located with the Conference and Labs of the Evaluation Forum (CLEF).

The goal of iDPP@CLEF is to design and develop an evaluation infrastructure for AI algorithms able to:

better describe disease mechanisms;

stratify patients according to their phenotype, assessed throughout the disease evolution;

predict disease progression in a probabilistic, time-dependent fashion.

The iDPP@CLEF challenges relied on retrospective and prospective ALS and MS patient data made available by the clinical partners of the BRAINTEASER consortium.

Retrospective Dataset

We release three retrospective datasets, one for ALS and two for MS. Of the two retrospective MS datasets, one consists of clinical data only and the other combines clinical data with environmental/pollution data.

The retrospective datasets contain data about 2,204 ALS patients (static variables, ALSFRS-R questionnaires, spirometry tests, environmental/pollution data) and 1,792 MS patients (static variables, EDSS scores, evoked potentials, relapses, MRIs). A subset of 280 MS patients contains environmental and pollution data.

In more detail, the BRAINTEASER project retrospective datasets were derived by merging existing datasets obtained by the clinical centers involved in the BRAINTEASER Project.

The ALS dataset was obtained by merging and homogenising the Piemonte and Valle d’Aosta Registry for Amyotrophic Lateral Sclerosis (PARALS, Chiò et al., 2017) and the Lisbon ALS clinic (CENTRO ACADÉMICO DE MEDICINA DE LISBOA, Centro Hospitalar Universitário de Lisboa-Norte, Hospital de Santa Maria, Lisbon, Portugal) dataset. Both datasets were initiated in 1995 and are currently maintained by researchers of the ALS Regional Expert Centre (CRESLA), University of Turin, and of the CENTRO ACADÉMICO DE MEDICINA DE LISBOA-Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa. They include demographic and clinical data, comprising both static and dynamic variables.

The MS dataset was obtained from the Pavia MS clinical dataset, which was started in 1990 and contains demographic and clinical information that is continuously updated by the researchers of the Institute, and from the Turin MS clinic dataset (Department of Neurosciences and Mental Health, Neurology Unit 1, Città della Salute e della Scienza di Torino).

Retrospective environmental data are accessible at various scales at the individual subject level. Thus, environmental data have been retrieved at different scales:

To gather macroscale air pollution data we’ve leveraged data coming from public monitoring stations that cover the whole extension of the involved countries, namely the European Air Quality Portal;

data from a network of air quality sensors (PurpleAir - Outdoor Air Quality Monitor / PurpleAir PA-II) installed at different points in the city of Pavia (Italy) were extracted as well.

In both cases, environmental data were previously publicly available. To merge environmental data with individual subject locations, we leverage postcodes (postcodes of the pollutant detection stations and postcodes of subject addresses). Data were merged following an anonymization procedure based on hash keys. Environmental exposure trajectories have been pre-processed and aggregated in order to avoid fine temporal and spatial granularities, so that individual exposure information cannot disclose personal addresses.
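As a rough illustration of a postcode-based merge through hash keys, the Python sketch below joins patient records to station readings without carrying the postcode into the output. The description does not say which hash function, salt, or record layout the consortium used, so salted SHA-256, the field names (patient_id, postcode, pm25), and the helper names are all assumptions made for this example.

```python
import hashlib


def hash_key(postcode, salt):
    """Derive an opaque join key from a postcode (salted SHA-256 is an assumption)."""
    return hashlib.sha256((salt + postcode).encode("utf-8")).hexdigest()


def link_exposure(patients, stations, salt):
    """Attach a station reading to each patient via the hashed postcode.

    patients: list of dicts with hypothetical keys "patient_id" and "postcode"
    stations: list of dicts with hypothetical keys "postcode" and "pm25"
    The returned records deliberately omit the postcode itself.
    """
    exposure_by_key = {hash_key(s["postcode"], salt): s["pm25"] for s in stations}
    linked = []
    for p in patients:
        key = hash_key(p["postcode"], salt)
        linked.append({"patient_id": p["patient_id"],
                       "pm25": exposure_by_key.get(key)})
    return linked


# Made-up records for illustration only.
patients = [{"patient_id": "P1", "postcode": "27100"},
            {"patient_id": "P2", "postcode": "00000"}]
stations = [{"postcode": "27100", "pm25": 12.5}]
linked = link_exposure(patients, stations, salt="example-salt")
```

A patient whose postcode has no matching station simply receives no reading, which keeps the join from failing on uncovered areas.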

The retrospective datasets are shared in two formats:

RDF (serialized in Turtle) modeled according to the BRAINTEASER Ontology (BTO);

CSV, as shared during the iDPP@CLEF 2022 and 2023 challenges, split into training and test.

Each format corresponds to a specific folder in the datasets, where a dedicated README file provides further details on the datasets. Note that the ALS dataset is split into multiple ZIP files due to the size of the environmental data.

Prospective Dataset

For the iDPP@CLEF 2024 challenge, the datasets contain prospective data about 86 ALS patients (static variables, ALSFRS-R questionnaires compiled by clinicians or patients using the BRAINTEASER mobile application, sensors data).

The prospective datasets are shared in two formats:

RDF (serialized in Turtle) modeled according to the BRAINTEASER Ontology (BTO);

CSV, as shared during the iDPP@CLEF 2024 challenge, split into training and test.

Each format corresponds to a specific folder in the datasets, where a dedicated README file provides further details on the datasets. Note that the MS dataset is split into multiple ZIP files due to the size of the environmental data.

The BRAINTEASER Data Sharing Policy section below reports the details for requesting access to the datasets.
Data from: A High Statistics Measurement of the Proton Structure Functions...
hepdata.net
A High Statistics Measurement of the Proton Structure Functions F(2) (x, Q**2) and R from Deep Inelastic Muon Scattering at High Q**2 [Dataset]. http://doi.org/10.17182/hepdata.12557.v1
Explore at:
Unique identifier
https://doi.org/10.17182/hepdata.12557.v1
Description
CERN-SPS. NA4/BCDMS collaboration. Plab 100 - 280 GeV/c. These are data from the BCDMS collaboration on F2 and R=SIG(L)/SIG(T) with a hydrogen target. The statistics are very large (1.8 million events). The ranges of X, Q**2 are 0.06<X<0.8 and 7<Q**2<260 GeV**2. The F2 data show a distinct difference from the data on F2 proton taken by the EMC. The publication lists values of F2 corresponding to R=0 and R=R(QCD) at each of the four energies, 100, 120, 200 and 280 GeV. In addition to the statistical errors, 5 factors are given representing the effects of estimated systematic errors on F2 associated with (1) beam momentum calibration, (2) magnetic field calibration, (3) spectrometer resolution, (4) detector and trigger inefficiencies, and (5) relative normalisation uncertainty of data taken from external and internal targets. This record contains our attempt to merge these data at different energies using the statistical errors as weight factors. The final one-sigma systematic errors given here have been calculated using a prescription from the authors involving calculation of new merged F2 values for each of the systematic errors applied individually, and then combining in quadrature the differences between the new merged F2 values and the original F2. The individual F2 values at each energy are given in separate database records (RED = 3021, http://durpdg.dur.ac.uk/scripts/reacsearch.csh/TESTREAC/red+3021).

PLAB=100 GeV/c. These are the data from the BCDMS Collaboration on F2 and R=SIG(L)/SIG(T) with a hydrogen target. The statistics are very large (1.8 million events). The ranges of X, Q**2 are 0.06<X<0.8 and 7<Q**2<260 GeV**2. The F2 data show a distinct difference from the data on F2 proton taken by the EMC. In the preprint are listed values of F2 corresponding to R=0 and R=R(QCD) at each of the four energies, 100, 120, 200 and 280 GeV. Also listed are 5 systematic errors associated with beam momentum calibration, magnetic field calibration, spectrometer resolution, detector and trigger inefficiencies and relative normalisation uncertainty. The systematic error shown in the tables is a result of combining together the 5 individual errors according to a prescription provided by the authors. The method involves taking the quadratic sum of the errors from each source. The record (RED = 3019, http://durpdg.dur.ac.uk/scripts/reacsearch.csh/TESTREAC/red+3019) contains our attempt to merge these data at different energies using the statistical errors as weight factors.

PLAB=120 GeV/c. These are the data from the BCDMS Collaboration on F2 and R=SIG(L)/SIG(T) with a hydrogen target. The statistics are very large (1.8 million events). The ranges of X, Q**2 are 0.06<X<0.8 and 7<Q**2<260 GeV**2. The F2 data show a distinct difference from the data on F2 proton taken by the EMC. In the preprint are listed values of F2 corresponding to R=0 and R=R(QCD) at each of the four energies, 100, 120, 200 and 280 GeV. Also listed are 5 systematic errors associated with beam momentum calibration, magnetic field calibration, spectrometer resolution, detector and trigger inefficiencies and relative normalisation uncertainty. The systematic error shown in the tables is a result of combining together the 5 individual errors according to a prescription provided by the authors. The method involves taking the quadratic sum of the errors from each source. The record (RED = 3019, http://durpdg.dur.ac.uk/scripts/reacsearch.csh/TESTREAC/red+3019) contains our attempt to merge these data at different energies using the statistical errors as weight factors.

PLAB=200 GeV/c. These are the data from the BCDMS Collaboration on F2 and R=SIG(L)/SIG(T) with a hydrogen target. The statistics are very large (1.8 million events). The ranges of X, Q**2 are 0.06<X<0.8 and 7<Q**2<260 GeV**2. The F2 data show a distinct difference from the data on F2 proton taken by the EMC. In the preprint are listed values of F2 corresponding to R=0 and R=R(QCD) at each of the four energies, 100, 120, 200 and 280 GeV. Also listed are 5 systematic errors associated with beam momentum calibration, magnetic field calibration, spectrometer resolution, detector and trigger inefficiencies and relative normalisation uncertainty. The systematic error shown in the tables is a result of combining together the 5 individual errors according to a prescription provided by the authors. The method involves taking the quadratic sum of the errors from each source. The record (RED = 3019, http://durpdg.dur.ac.uk/scripts/reacsearch.csh/TESTREAC/red+3019) contains our attempt to merge these data at different energies using the statistical errors as weight factors.

PLAB=280 GeV/c. These are the data...
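The merging procedure described above can be sketched numerically. The Python example below assumes that "statistical errors as weight factors" means inverse-variance weights (1/sigma^2); the record does not spell out the exact weighting, so treat that as an assumption. The function names and the input numbers are invented for illustration and are not BCDMS values.

```python
import math


def merge_weighted(values, stat_errors):
    """Weighted mean of measurements, with weights 1/sigma^2 (assumed convention)."""
    weights = [1.0 / (e * e) for e in stat_errors]
    total = sum(weights)
    merged = sum(w * v for w, v in zip(weights, values)) / total
    return merged, math.sqrt(1.0 / total)


def merged_systematic(values, stat_errors, shifts_by_source):
    """Apply each systematic source on its own, re-merge, and combine the
    differences from the nominal merged value in quadrature.

    shifts_by_source: one list per systematic source, giving the shift of
    each input measurement under that source.
    """
    nominal, _ = merge_weighted(values, stat_errors)
    squared_diffs = []
    for shifts in shifts_by_source:
        shifted = [v + s for v, s in zip(values, shifts)]
        remerged, _ = merge_weighted(shifted, stat_errors)
        squared_diffs.append((remerged - nominal) ** 2)
    return math.sqrt(sum(squared_diffs))


# Invented numbers: one (x, Q**2) bin measured at two beam energies.
f2, f2_stat = merge_weighted([0.40, 0.42], [0.01, 0.01])
f2_syst = merged_systematic([0.40, 0.42], [0.01, 0.01],
                            [[0.01, 0.01], [0.02, 0.02]])
```

With equal statistical errors the merged value is the plain average, and the merged statistical error shrinks by a factor of sqrt(2).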

Cite

Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva (2025). MERGE Dataset [Dataset]. http://doi.org/10.5281/zenodo.13939205

MERGE Dataset

Explore at:

Available download formats: zip

Unique identifier

https://doi.org/10.5281/zenodo.13939205

Dataset updated

Feb 7, 2025

Dataset provided by

Zenodo (http://zenodo.org/)

Authors

Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva

License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The MERGE dataset is a collection of audio, lyrics, and bimodal datasets for conducting research on Music Emotion Recognition. A complete version is provided for each modality. The audio datasets provide 30-second excerpts for each sample, while full lyrics are provided in the relevant datasets. The number of available samples in each dataset is the following:

MERGE Audio Complete: 3554
MERGE Audio Balanced: 3232
MERGE Lyrics Complete: 2568
MERGE Lyrics Balanced: 2400
MERGE Bimodal Complete: 2216
MERGE Bimodal Balanced: 2000

Additional Contents

Each dataset contains the following additional files:

av_values: File containing the arousal and valence values for each sample sorted by their identifier;
tvt_dataframes: Train, validate, and test splits for each dataset. Both a 70-15-15 and a 40-30-30 split are provided.

Metadata

A metadata spreadsheet is provided for each dataset with the following information for each sample, if available:

Song (Audio and Lyrics datasets) - Song identifiers. Identifiers starting with MT were extracted from the AllMusic platform, while those starting with A or L were collected from private collections;
Quadrant - Label corresponding to one of the four quadrants from Russell's Circumplex Model;
AllMusic Id - For samples starting with A or L, the matching AllMusic identifier is also provided. This was used to complement the available information for the samples originally obtained from the platform;
Artist - First performing artist or band;
Title - Song title;
Relevance - AllMusic metric representing the relevance of the song in relation to the query used;
Duration - Song length in seconds;
Moods - User-generated mood tags extracted from the AllMusic platform and available in Warriner's affective dictionary;
MoodsAll - User-generated mood tags extracted from the AllMusic platform;
Genres - User-generated genre tags extracted from the AllMusic platform;
Themes - User-generated theme tags extracted from the AllMusic platform;
Styles - User-generated style tags extracted from the AllMusic platform;
AppearancesTrackIDs - All AllMusic identifiers related with a sample;
Sample - Availability of the sample in the AllMusic platform;
SampleURL - URL to the 30-second excerpt in AllMusic;
ActualYear - Year of song release.
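To make the Quadrant label concrete, the Python sketch below maps an arousal/valence pair to a quadrant of Russell's Circumplex Model, using the common numbering Q1 (positive valence, high arousal), Q2 (negative valence, high arousal), Q3 (negative valence, low arousal), Q4 (positive valence, low arousal). The scale of the values in av_values and the midpoint separating high from low are not stated here, so the midpoint is left as an explicit parameter; the function name and the treatment of values exactly at the midpoint are assumptions of this example, not the authors' labeling procedure.

```python
def quadrant(arousal, valence, midpoint):
    """Map an (arousal, valence) pair to a Russell quadrant label.

    midpoint: the value separating low from high on both axes; it depends on
    how the annotations are scaled, which this sketch does not know.
    Values exactly at the midpoint are counted as high/positive (an assumption).
    """
    high_arousal = arousal >= midpoint
    positive_valence = valence >= midpoint
    if high_arousal and positive_valence:
        return "Q1"
    if high_arousal:
        return "Q2"
    if not positive_valence:
        return "Q3"
    return "Q4"


# Invented values on a hypothetical 0-to-1 scale with midpoint 0.5.
labels = [quadrant(a, v, 0.5) for a, v in [(0.8, 0.9), (0.8, 0.1),
                                           (0.2, 0.1), (0.2, 0.9)]]
```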

Citation

If you use some part of the MERGE dataset in your research, please cite the following article:

Louro, P. L., Redinho, H., Santos, R., Malheiro, R., Panda, R., and Paiva, R. P. (2024). MERGE - A Bimodal Dataset For Static Music Emotion Recognition. arXiv. URL: https://arxiv.org/abs/2407.06060.

BibTeX:

@misc{louro2024mergebimodaldataset,
title={MERGE -- A Bimodal Dataset for Static Music Emotion Recognition},
author={Pedro Lima Louro and Hugo Redinho and Ricardo Santos and Ricardo Malheiro and Renato Panda and Rui Pedro Paiva},
year={2024},
eprint={2407.06060},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2407.06060},
}

Acknowledgements

This work is funded by FCT - Foundation for Science and Technology, I.P., within the scope of the projects: MERGE - DOI: 10.54499/PTDC/CCI-COM/3171/2021 financed with national funds (PIDDAC) via the Portuguese State Budget; and project CISUC - UID/CEC/00326/2020 with funds from the European Social Fund, through the Regional Operational Program Centro 2020.

Renato Panda was supported by Ci2 - FCT UIDP/05567/2020.

NSF/NCAR GV HIAPER 1 Minute Data Merge

Data from: KORUS-AQ Aircraft Merge Data Files

Merger of BNV-D data (2008 to 2019) and enrichment

NASA DC-8 1 Second Data Merge

Data from: HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE...

Cleaned NHANES 1988-2018

Multilevel modeling of time-series cross-sectional data reveals the dynamic...

NCES Academic Library Survey Dataset 1996 - 2020 -- alsMERGE_2020.csv

NASA DC-8 SAGAAERO Data Merge

Data supporting the Master thesis "Monitoring von Open Data Praktiken -...

Data from: A dataset to model Levantine landcover and land-use change...

Health and Retirement Study (HRS)

terraceDL: A geomorphology deep learning dataset of agricultural terraces in...

DLR Falcon 1 Second Data Merge

Current Population Survey (CPS)

National Health and Nutrition Examination Survey (NHANES)

Jacob Kaplan's Concatenated Files: Uniform Crime Reporting (UCR) Program...

BRAINTEASER ALS and MS Datasets

Data from: A High Statistics Measurement of the Proton Structure Functions...
