27 datasets found
  1. MERGE Dataset

    • zenodo.org
    Cite
    Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva (2025). MERGE Dataset [Dataset]. http://doi.org/10.5281/zenodo.13939205
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The MERGE dataset is a collection of audio, lyrics, and bimodal datasets for conducting research on Music Emotion Recognition. A complete version is provided for each modality. The audio datasets provide 30-second excerpts for each sample, while full lyrics are provided in the relevant datasets. The number of available samples in each dataset is as follows:

    • MERGE Audio Complete: 3554
    • MERGE Audio Balanced: 3232
    • MERGE Lyrics Complete: 2568
    • MERGE Lyrics Balanced: 2400
    • MERGE Bimodal Complete: 2216
    • MERGE Bimodal Balanced: 2000

    Additional Contents

    Each dataset contains the following additional files:

    • av_values: File containing the arousal and valence values for each sample sorted by their identifier;
    • tvt_dataframes: Train, validate, and test splits for each dataset. Both a 70-15-15 and a 40-30-30 split are provided.

    Metadata

    A metadata spreadsheet is provided for each dataset with the following information for each sample, if available:

    • Song (Audio and Lyrics datasets) - Song identifiers. Identifiers starting with MT were extracted from the AllMusic platform, while those starting with A or L were collected from private collections;
    • Quadrant - Label corresponding to one of the four quadrants from Russell's Circumplex Model;
    • AllMusic Id - For samples starting with A or L, the matching AllMusic identifier is also provided. This was used to complement the available information for the samples originally obtained from the platform;
    • Artist - First performing artist or band;
    • Title - Song title;
    • Relevance - AllMusic metric representing the relevance of the song in relation to the query used;
    • Duration - Song length in seconds;
    • Moods - User-generated mood tags extracted from the AllMusic platform and available in Warriner's affective dictionary;
    • MoodsAll - User-generated mood tags extracted from the AllMusic platform;
    • Genres - User-generated genre tags extracted from the AllMusic platform;
    • Themes - User-generated theme tags extracted from the AllMusic platform;
    • Styles - User-generated style tags extracted from the AllMusic platform;
    • AppearancesTrackIDs - All AllMusic identifiers related to a sample;
    • Sample - Availability of the sample in the AllMusic platform;
    • SampleURL - URL to the 30-second excerpt in AllMusic;
    • ActualYear - Year of song release.
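As a sketch of how these pieces fit together, the av_values annotations can be joined to the metadata sheet on the sample identifier. A minimal Python illustration; the identifiers, field values, and file layout below are hypothetical, only the Song/Quadrant/arousal-valence structure is taken from the description above:

```python
# Hypothetical arousal/valence annotations keyed by sample identifier
# ("MT..." = AllMusic, "A..."/"L..." = private collections, per the text).
av_values = {
    "MT0000040632": {"Arousal": 0.61, "Valence": 0.72},
    "A001": {"Arousal": 0.35, "Valence": 0.28},
}

# Hypothetical rows from the per-dataset metadata spreadsheet.
metadata = [
    {"Song": "MT0000040632", "Quadrant": "Q1", "Artist": "Some Artist"},
    {"Song": "A001", "Quadrant": "Q3", "Artist": "Other Artist"},
]

# Attach the annotations to each metadata row via its identifier.
merged = [{**row, **av_values[row["Song"]]} for row in metadata]
```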

    Citation

    If you use some part of the MERGE dataset in your research, please cite the following article:

    Louro, P. L., Redinho, H., Santos, R., Malheiro, R., Panda, R., & Paiva, R. P. (2024). MERGE - A Bimodal Dataset for Static Music Emotion Recognition. arXiv. URL: https://arxiv.org/abs/2407.06060.

    BibTeX:

    @misc{louro2024mergebimodaldataset,
    title={MERGE -- A Bimodal Dataset for Static Music Emotion Recognition},
    author={Pedro Lima Louro and Hugo Redinho and Ricardo Santos and Ricardo Malheiro and Renato Panda and Rui Pedro Paiva},
    year={2024},
    eprint={2407.06060},
    archivePrefix={arXiv},
    primaryClass={cs.SD},
    url={https://arxiv.org/abs/2407.06060},
    }

    Acknowledgements

    This work is funded by FCT - Foundation for Science and Technology, I.P., within the scope of the projects: MERGE - DOI: 10.54499/PTDC/CCI-COM/3171/2021 financed with national funds (PIDDAC) via the Portuguese State Budget; and project CISUC - UID/CEC/00326/2020 with funds from the European Social Fund, through the Regional Operational Program Centro 2020.

    Renato Panda was supported by Ci2 - FCT UIDP/05567/2020.

  2. KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Cite
    nasa.gov (2025). KORUS-AQ Aircraft Merge Data Files - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/korus-aq-aircraft-merge-data-files-9bba5
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    KORUSAQ_Merge_Data are pre-generated merge data files combining various products collected during the KORUS-AQ field campaign. This collection features pre-generated merge files for the DC-8 aircraft. Data collection for this product is complete.

    The KORUS-AQ field study was conducted in South Korea during May-June 2016. The study was jointly sponsored by NASA and Korea's National Institute of Environmental Research (NIER). The primary objectives were to investigate the factors controlling air quality in Korea (e.g., local emissions, chemical processes, and transboundary transport) and to assess future air quality observing strategies incorporating geostationary satellite observations. To achieve these science objectives, KORUS-AQ adopted a highly coordinated sampling strategy involving surface and airborne measurements with both in-situ and remote sensing instruments.

    Surface observations provided details on ground-level air quality conditions, while airborne sampling provided an assessment of conditions aloft relevant to satellite observations and necessary to understand the role of emissions, chemistry, and dynamics in determining air quality outcomes. The sampling region covers the South Korean peninsula and surrounding waters, with a primary focus on the Seoul Metropolitan Area. Airborne sampling was primarily conducted from near surface to about 8 km, with extensive profiling to characterize the vertical distribution of pollutants and their precursors. The airborne observational data were collected from three aircraft platforms: the NASA DC-8, NASA B-200, and Hanseo King Air. Surface measurements were conducted from 16 ground sites and 2 ships: R/V Onnuri and R/V Jang Mok.

    The major data products collected from both the ground and air include in-situ measurements of trace gases (e.g., ozone, reactive nitrogen species, carbon monoxide and dioxide, methane, non-methane and oxygenated hydrocarbon species), aerosols (e.g., microphysical and optical properties and chemical composition), active remote sensing of ozone and aerosols, and passive remote sensing of NO2, CH2O, and O3 column densities. These data products support research focused on examining the impact of photochemistry and transport on ozone and aerosols, evaluating emissions inventories, and assessing the potential use of satellite observations in air quality studies.

  3. Dataset for: Sequential trials in the context of competing risks: concepts and case study, with R and SAS code

    • wiley.figshare.com
    Cite
    Corine Baayen; Christelle Volteau; Cyril Flamant; Paul Blanche (2023). Dataset for: Sequential trials in the context of competing risks: concepts and case study, with R and SAS code [Dataset]. http://doi.org/10.6084/m9.figshare.7991189.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Wiley
    Authors
    Corine Baayen; Christelle Volteau; Cyril Flamant; Paul Blanche
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Sequential designs and competing risks methodology are both well established. Their combined use has recently received some attention from a theoretical perspective, but their joint application in practice has been discussed less. The aim of this paper is to provide the applied statistician with a basic understanding of both sequential design theory and competing risks methodology and how to combine them in practice. Relevant references to more detailed theoretical discussions are provided and all discussions are illustrated using a real case study. Extensive R and SAS code is provided in the online supplementary material.

  4. On-street Parking Bays

    • researchdata.edu.au
    • data.melbourne.vic.gov.au
    Cite
    data.vic.gov.au (2023). On-street Parking Bays [Dataset]. https://researchdata.edu.au/on-street-parking-bays/2296305
    Explore at:
    Dataset updated
    Mar 7, 2023
    Dataset provided by
    data.vic.gov.au
    Description

    Upcoming Changes: Please note that our parking system is being improved and this dataset may be disrupted. See more information here.

    This dataset contains spatial polygons which represent parking bays across the city. Each bay can also link to its parking meter and parking sensor information.

    How the data joins:

    There are three datasets that make up the live parking sensor release: the on-street parking bay sensors, the on-street parking bays, and the on-street car park bay information. The on-street parking bay sensors join to the on-street parking bays by the marker_id attribute, and to the on-street car park bay restrictions by the bay_id attribute. The on-street parking bays and the on-street car park bay information don't currently join.

    Please see City of Melbourne's disclaimer regarding the use of this data: https://data.melbourne.vic.gov.au/stories/s/94s9-uahn
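The join keys above can be sketched in SQL (shown here through Python's sqlite3; table and column names other than marker_id and bay_id are assumptions for illustration, not the dataset's actual schema):

```python
import sqlite3

# Toy tables mirroring the three datasets; only the join keys
# (marker_id, bay_id) come from the description above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sensors (marker_id TEXT, bay_id INTEGER, status TEXT);
    CREATE TABLE bays (marker_id TEXT, geometry TEXT);
    CREATE TABLE restrictions (bay_id INTEGER, max_stay_minutes INTEGER);
    INSERT INTO sensors VALUES ('C123', 7, 'Present');
    INSERT INTO bays VALUES ('C123', 'POLYGON(...)');
    INSERT INTO restrictions VALUES (7, 120);
""")

# Sensors join to the bay polygons on marker_id and to the
# restrictions on bay_id, as the description states.
row = con.execute("""
    SELECT s.status, b.geometry, r.max_stay_minutes
    FROM sensors s
    JOIN bays b ON b.marker_id = s.marker_id
    JOIN restrictions r ON r.bay_id = s.bay_id
""").fetchone()
```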

  5. Cleaned NHANES 1988-2018

    • figshare.com
    Cite
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet (2025). Cleaned NHANES 1988-2018 [Dataset]. http://doi.org/10.6084/m9.figshare.21743372.v9
    Explore at:
    Available download formats: txt
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    figshare
    Authors
    Vy Nguyen; Lauren Y. M. Middleton; Neil Zhao; Lei Huang; Eliseu Verly; Jacob Kvasnicka; Luke Sagers; Chirag Patel; Justin Colacino; Olivier Jolliet
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The National Health and Nutrition Examination Survey (NHANES) provides data that have considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:

    • demographics (281 variables),
    • dietary consumption (324 variables),
    • physiological functions (1,040 variables),
    • occupation (61 variables),
    • questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
    • medications (29 variables),
    • mortality information linked from the National Death Index (15 variables),
    • survey weights (857 variables),
    • environmental exposure biomarker measurements (598 variables), and
    • chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

    csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary for descriptors on the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

    R Data Record: For researchers who want to conduct their analysis in the R programming language, only cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

    Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R markdown files (.Rmd). We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
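The module-merging step (the subject of example_0) amounts to a left join on the shared participant identifier; NHANES uses SEQN for this, but the module columns and values in this Python sketch are hypothetical:

```python
# Toy rows from two cleaned modules, keyed by participant identifier.
demographics = [
    {"SEQN": 1, "age": 42, "sex": "F"},
    {"SEQN": 2, "age": 65, "sex": "M"},
]
chemicals = [
    {"SEQN": 1, "blood_lead": 1.2},
    # participant 2 has no chemical measurements
]

# Left join: keep every demographics row, attach chemicals where present.
chem_by_id = {row["SEQN"]: row for row in chemicals}
merged = [{**demo, **chem_by_id.get(demo["SEQN"], {})} for demo in demographics]
```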

  6. CloudSat/CALIPSO/GOES-16 ABI Joint Dataset

    • dataverse.harvard.edu
    Cite
    John Haynes (2022). CloudSat/CALIPSO/GOES-16 ABI Joint Dataset [Dataset]. http://doi.org/10.7910/DVN/LPXYBL
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    John Haynes
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset consists of matched CloudSat radar and CALIPSO lidar cloud detection in pressure layers, GOES-16 ABI reflectance, and GFS relative humidity data for a time period from October 2018 through June 2019.

  7. NASA DC-8 1 Minute Data Merge

    • data.ucar.edu
    Cite
    Gao Chen; Jennifer R. Olson; Michael Shook (2024). NASA DC-8 1 Minute Data Merge [Dataset]. http://doi.org/10.26023/VM9C-1C16-H003
    Explore at:
    Available download formats: ascii
    Dataset updated
    Dec 26, 2024
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    Gao Chen; Jennifer R. Olson; Michael Shook
    Time period covered
    May 1, 2012 - Jun 30, 2012
    Description

    This dataset contains NASA DC-8 1 Minute Data Merge data collected during the Deep Convective Clouds and Chemistry Experiment (DC3) from 18 May 2012 through 22 June 2012. This dataset contains updated data provided by NASA. In most cases, variable names have been kept identical to those submitted in the raw data files. However, in some cases, names have been changed (e.g., to eliminate duplication). Units have been standardized throughout the merge. In addition, a "grand merge" has been provided. This includes data from all the individual merged flights throughout the mission. This grand merge will follow the following naming convention: "dc3-mrg60-dc8_merge_YYYYMMdd_R5_thruYYYYMMdd.ict" (with the comment "_thruYYYYMMdd" indicating the last flight date included). This dataset is in ICARTT format. Please see the header portion of the data files for details on instruments, parameters, quality assurance, quality control, contact information, and dataset comments. For more information on updates to this dataset, please see the readme file.

  8. Current Population Survey (CPS)

    • search.dataone.org
    • dataverse.harvard.edu
    Cite
    Damico, Anthony (2023). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Dataset updated
    Nov 21, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Damico, Anthony
    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.

    this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R
    • download the fixed-width file containing household, family, and person records
    • import by separating this file into three tables, then merge 'em together at the person-level
    • download the fixed-width file containing the person-level replicate weights
    • merge the rectangular person-level file with the replicate weights, then store it in a sql database
    • create a new variable - one - in the data table

    2012 asec - analysis examples.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • perform a boatload of analysis examples

    replicate census estimates - 2011.R
    • connect to the sql database created by the 'download all microdata' program
    • create the complex sample survey object, using the replicate weights
    • match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts

    for more detail about the current population survey - annual social and economic supplement (cps-asec), visit:
    • the census bureau's current population survey page
    • the bureau of labor statistics' current population survey page
    • the current population survey's wikipedia article

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.

    confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
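the download-then-merge flow above can be sketched like this (python standing in for the r scripts; the three-field record layout here is made up - the real column positions come from nber's sas importation code):

```python
import sqlite3

# toy fixed-width records: 1-char record type, 2-char household id,
# 3-char value (income for household records, age for person records).
raw_lines = [
    "H01100",  # household 01, income 100
    "P01042",  # person in household 01, age 42
]

households, persons = {}, []
for line in raw_lines:
    rectype, hh_id, value = line[0], line[1:3], int(line[3:6])
    if rectype == "H":
        households[hh_id] = {"hh_id": hh_id, "hh_income": value}
    else:
        persons.append({"hh_id": hh_id, "age": value})

# rectangular file: household info attached to each person record
rectangular = [{**p, **households[p["hh_id"]]} for p in persons]

# stash it in a sql database, ready for the survey analysis step
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE asec (hh_id TEXT, age INTEGER, hh_income INTEGER)")
con.executemany("INSERT INTO asec VALUES (:hh_id, :age, :hh_income)", rectangular)
```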

  9. Merger of BNV-D data (2008 to 2019) and enrichment

    • data.europa.eu
    Cite
    Patrick VINCOURT, Merger of BNV-D data (2008 to 2019) and enrichment [Dataset]. https://data.europa.eu/data/datasets/5f1c3eca9d149439e50c740f
    Explore at:
    Available download formats: zip (18530465)
    Dataset authored and provided by
    Patrick VINCOURT
    Description

    Merging (in R tables) the data published on https://www.data.gouv.fr/fr/datasets/ventes-de-pesticides-par-departement/, and joining two other sources of information associated with marketing authorisations (MAs):
    • uses: https://www.data.gouv.fr/fr/datasets/usages-des-produits-phytosanitaires/
    • information on the "Biocontrol" status of the product, from document DGAL/SDQSPV/2020-784 published on 18/12/2020 at https://agriculture.gouv.fr/quest-ce-que-le-biocontrole

    All the initial files (.csv transformed into .txt), the R code used to merge the data, and the different output files are collected in a zip.
    NB:
    1) "YASCUB" stands for {year, AMM, Substance_active, Classification, Usage, Statut_"BioControl"}, substances not on the DGAL/SDQSPV list being coded NA.
    2) The file of biocontrol products has been cleaned of the duplicates generated by marketing authorisations leading to several trade names.
    3) The BNVD_BioC_DY3 table and the output file BNVD_BioC_DY3.txt contain the fields {Code_Region, Region, Dept, Code_Dept, Anne, Usage, Classification, Type_BioC, Quantite_substance}.

  10. Data from: A Joint Dataset of Official COVID-19 Reports and the Governance, Trade and Competitiveness Indicators of World Bank Group Platforms

    • data.mendeley.com
    Cite
    Marcell Tamás Kurbucz (2024). A Joint Dataset of Official COVID-19 Reports and the Governance, Trade and Competitiveness Indicators of World Bank Group Platforms [Dataset]. http://doi.org/10.17632/hzdnxph8vg.7
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Marcell Tamás Kurbucz
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The presented cross-sectional dataset can be employed to analyze the governmental, trade, and competitiveness relationships of official COVID-19 reports. It contains 18 COVID-19 variables generated based on the official reports of 138 countries, as well as an additional 2163 governance, trade, and competitiveness indicators from the World Bank Group GovData360 and TCdata360 platforms in a preprocessed form. The current version was compiled on July 27, 2020. Note that this version uses 20-40-60-80-day time windows and the first test data are based on the first country reports on tests.

    Please cite as: • Kurbucz, M. T. (2020). A Joint Dataset of Official COVID-19 Reports and the Governance, Trade and Competitiveness Indicators of World Bank Group Platforms. Data in Brief, 105881. • Kurbucz, M. T., Katona, A. I., Lantos, Z., & Kosztyán, Z. T. (2021). The role of societal aspects in the formation of official COVID-19 reports: A data-driven analysis. International journal of environmental research and public health, 18(4), 1505. • Kurbucz, M. T. (2022). Modeling the social determinants of official COVID-19 reports in the early stages of the pandemic. Journal of Applied Social Science, 16(1), 356-363.

    Data generation: • Data generation (data_generation.Rmd): Datasets were generated with this R Notebook. It can be used to update datasets and customize the data generation process.

    Datasets: • Country data (country_data.txt): Country data. • Metadata (metadata.txt): The metadata of selected GovData360 and TCdata360 indicators. • Joint dataset (joint_dataset.txt): The joint dataset of COVID-19 variables and preprocessed GovData360 and TCdata360 indicators. • Correlation matrix (correlation_matrix.txt): The Kendall rank correlation matrix of the joint dataset.

    Raw data of figures and tables: • Raw data of Fig. 2 (raw_data_fig2.txt): The raw data of Fig. 2. • Raw data of Fig. 3 (raw_data_fig3.txt): The raw data of Fig. 3. • Raw data of Table 1 (raw_data_table1.txt): The raw data of Table 1. • Raw data of Table 2 (raw_data_table2.txt): The raw data of Table 2. • Raw data of Table 3 (raw_data_table3.txt): The raw data of Table 3.

  11. Video game pricing analytics dataset

    • kaggle.com
    Cite
    Shivi Deveshwar (2023). Video game pricing analytics dataset [Dataset]. https://www.kaggle.com/datasets/shivideveshwar/video-game-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 1, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shivi Deveshwar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The review dataset for three video games - Call of Duty: Black Ops 3, Persona 5 Royal, and Counter-Strike: Global Offensive - was taken through a web scrape of SteamDB (https://steamdb.info/), which is a large repository for game-related data such as release dates, reviews, prices, and more. In the initial scrape, each individual game has two files - customer reviews (count: 100 reviews) and price time series data.

    To obtain data on the reviews of the selected video games, we performed web scraping using R software. The customer reviews dataset contains the date that the review was posted and the review text, while the price dataset contains the date that the price was changed and the price on that date. To clean and prepare the data, we first start by sectioning the data in Excel. After scraping, our csv file fits each review in one row with the date. We split the data, separating date and review, allowing them to have separate columns. Luckily, scraping the price already separated price and date, so after the split we just made sure that every file had similar column names.

    Afterwards, we use R to finish the cleaning. Each game has a separate file for prices and reviews, so each of the prices is converted into a continuous time series by extending the previously available price for each date. Then the price dataset is combined with its respective review dataset in R on the common date column using a left join. The resulting dataset for each game contains four columns - game name, date, reviews, and price. From there, we allow the user to select the game they would like to view.
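The forward-fill and left-join steps described above can be sketched as follows (Python standing in for the R code; the dates, prices, and review text are made up):

```python
from datetime import date, timedelta

# Made-up price changes and reviews for one game.
price_changes = {date(2023, 1, 1): 59.99, date(2023, 1, 4): 29.99}
reviews = [(date(2023, 1, 2), "Great game"), (date(2023, 1, 5), "Worth it on sale")]

# Step 1: extend each price forward into a continuous daily series,
# carrying the previously available price across every date.
daily_price, current = {}, None
day, end = min(price_changes), max(d for d, _ in reviews)
while day <= end:
    current = price_changes.get(day, current)
    daily_price[day] = current
    day += timedelta(days=1)

# Step 2: left-join each review to the price in effect on its date.
merged = [(day, text, daily_price[day]) for day, text in reviews]
```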

  12. Record Linkage using NCSES Datasets

    • explore.openaire.eu
    Cite
    Ekaterina Levitskaya; Maryah Garner; Rukhshan Mian (2022). Record Linkage using NCSES Datasets [Dataset]. http://doi.org/10.5281/zenodo.6463885
    Explore at:
    Dataset updated
    Apr 15, 2022
    Authors
    Ekaterina Levitskaya; Maryah Garner; Rukhshan Mian
    Description

    This is a Jupyter notebook demonstrating how to join different datasets available in the class using SQL and R. This notebook was developed for the Fall 2021 Applied Data Analytics training facilitated by the National Center for Science and Engineering Statistics (NCSES) and Coleridge Initiative.

  13. Panel Data Preparation and Models for Social Equity of Bridge Management

    • kilthub.cmu.edu
    Cite
    Cari Gandy; Daniel Armanios; Constantine Samaras (2023). Panel Data Preparation and Models for Social Equity of Bridge Management [Dataset]. http://doi.org/10.1184/R1/20643327.v4
    Explore at:
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Carnegie Mellon University
    Authors
    Cari Gandy; Daniel Armanios; Constantine Samaras
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository provides code and data used in "Social Equity of Bridge Management" (DOI: 10.1061/JMENEA/MEENG-5265). Both the dataset used in the analysis ("Panel.csv") and the R script to create the dataset ("Panel_Prep.R") are provided. The main results of the paper as well as alternate specifications for the ordered probit with random effects models can be replicated with "Models_OrderedProbit.R". Note that these models take an extensive amount of memory and computational resources. Additionally, we have provided alternate model specifications in the "Robustness" R scripts: binomial probit with random effects, ordered probit without random effects, and Ordinary Least Squares with random effects. An extended version of the supplemental materials is also provided.

  14. Data from: HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE USING R SOFTWARE AND RSTUDIO

    • scielo.figshare.com
    Cite
    Diego Ariel de Lima; Camilo Partezani Helito; Lana Lacerda de Lima; Renata Clazzer; Romeu Krause Gonçalves; Olavo Pires de Camargo (2023). HOW TO PERFORM A META-ANALYSIS: A PRACTICAL STEP-BY-STEP GUIDE USING R SOFTWARE AND RSTUDIO [Dataset]. http://doi.org/10.6084/m9.figshare.19899537.v1
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    SciELO journals
    Authors
    Diego Ariel de Lima; Camilo Partezani Helito; Lana Lacerda de Lima; Renata Clazzer; Romeu Krause Gonçalves; Olavo Pires de Camargo
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    ABSTRACT Meta-analysis is an adequate statistical technique to combine results from different studies, and its use has been growing in the medical field. Thus, not only knowing how to interpret meta-analysis, but also knowing how to perform one, is fundamental today. Therefore, the objective of this article is to present the basic concepts and serve as a guide for conducting a meta-analysis using R and RStudio software. For this, the reader has access to the basic commands in the R and RStudio software, necessary for conducting a meta-analysis. The advantage of R is that it is a free software. For a better understanding of the commands, two examples were presented in a practical way, in addition to revising some basic concepts of this statistical technique. It is assumed that the data necessary for the meta-analysis has already been collected, that is, the description of methodologies for systematic review is not a discussed subject. Finally, it is worth remembering that there are many other techniques used in meta-analyses that were not addressed in this work. However, with the two examples used, the article already enables the reader to proceed with good and robust meta-analyses. Level of Evidence V, Expert Opinion.
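The core of the fixed-effect meta-analysis the guide walks through is inverse-variance weighting, which can be sketched in a few lines (shown here in Python rather than the article's R; the effect sizes and variances below are invented for illustration):

```python
import math

def fixed_effect_meta(effects, variances):
    """Inverse-variance weighted pooled effect and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# Three hypothetical studies: effect sizes with their within-study variances
pooled, se = fixed_effect_meta([0.30, 0.50, 0.40], [0.04, 0.09, 0.01])
print(round(pooled, 3), round(se, 3))  # 0.39 0.086
```

Precise studies (small variance) dominate the pooled estimate; random-effects models, also covered by the guide, add a between-study variance term on top of this weighting.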

  15. Datasets for Muscle : semi-non negative joint decomposition of multiple...

    • zenodo.org
    zip
    Updated Dec 20, 2023
    Cite
    Muscle Authors (2023). Datasets for Muscle : semi-non negative joint decomposition of multiple single cell tensors [Dataset]. http://doi.org/10.5281/zenodo.10408523
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Muscle Authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This link contains the processed single cell Hi-C (scHi-C) and DNA methylation datasets of Li et al. 2019 and Liu et al. 2021 in two separate folders. The datasets are used for the Github page of Muscle, which employs a semi-non negative joint tensor decomposition framework for multi-omics data analysis, incorporating scHi-C tensors. The datasets are uploaded in '.qs' format in R (gzipped for the Liu et al. 2021 data), and the chromosome size files are in text format. Further details about the data can be found on the GitHub page.

  16. Degree of River Regulation (Spatial Dataset)

    • researchdata.edu.au
    • data.nsw.gov.au
    Updated Aug 8, 2024
    + more versions
    Cite
    data.nsw.gov.au (2024). Degree of River Regulation (Spatial Dataset) [Dataset]. https://researchdata.edu.au/degree-river-regulation-spatial-dataset/3381024
    Explore at:
    Dataset updated
    Aug 8, 2024
    Dataset provided by
    Government of New South Wales (http://nsw.gov.au/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This data summarises the results of a spatial analysis to identify significant tributary junctions in rivers across the NSW Murray-Darling Basin (MDB), where inflows from unregulated or less regulated tributaries join heavily regulated rivers. Tributary junctions were characterized in terms of the relative change in the 'Degree of Regulation' (DoR) at individual tributary junctions, where DoR was calculated as the ratio of the storage capacity of all upstream reservoirs to the mean annual runoff. The analysis thereby identifies potential tributary hotspots across the basin.

    Rivers often experience major discontinuities in ecological function due to dams: the timing and volume of flow and water chemistry can be significantly altered from upstream to downstream of a dam, impacting ecosystem productivity and aquatic food webs. Tributary inflows from unregulated catchments can play an important role in mitigating changes in water chemistry below large dams, thereby overcoming the so-called serial discontinuity effect, which describes the impacts of large dams on longitudinal gradients in water chemistry. Because tributary inflows can be rich in nutrients and dissolved carbon, they can lead to 'priming' effects, in which biogeochemical processes and ecosystem productivity are enhanced below confluences with more heavily regulated rivers. Yet there have been few attempts to identify priority tributaries that may play a larger role in driving biogeochemistry and ecosystem function below dams.

    -----------------------------------

    Note: If you would like to ask a question, make any suggestions, or tell us how you are using this dataset, please visit the NSW Water Hub, which has an online forum you can join.
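The DoR calculation described above reduces to a simple ratio. A minimal Python sketch, with invented volumes; the relative-change formula for a junction is our assumption, not necessarily the authors' exact metric:

```python
def degree_of_regulation(upstream_storage, mean_annual_runoff):
    """DoR: total upstream reservoir capacity relative to mean annual runoff.
    Both arguments must be in the same volumetric units (e.g. ML)."""
    return upstream_storage / mean_annual_runoff

def relative_dor_change(dor_mainstem, dor_tributary):
    """Relative drop in DoR contributed by a less regulated tributary."""
    return (dor_mainstem - dor_tributary) / dor_mainstem

main_stem = degree_of_regulation(2_000_000, 1_000_000)  # heavily regulated: 2.0
tributary = degree_of_regulation(50_000, 500_000)       # lightly regulated: 0.1
print(relative_dor_change(main_stem, tributary))        # 0.95
```

A large relative change flags a junction where a comparatively unregulated tributary meets a heavily regulated mainstem, i.e. a candidate hotspot.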

  17. Effect of data source on estimates of regional bird richness in northeastern...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 4, 2021
    Cite
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk (2021). Effect of data source on estimates of regional bird richness in northeastern United States [Dataset]. http://doi.org/10.5061/dryad.m905qfv0h
    Explore at:
    Available download formats: zip
    Dataset updated
    May 4, 2021
    Dataset provided by
    Gettysburg College
    Massachusetts Audubon Society
    University of Vermont
    New York State Department of Environmental Conservation
    University of Michigan
    Columbia University
    Agricultural Research Service
    Hebrew University of Jerusalem
    Authors
    Roi Ankori-Karlinsky; Ronen Kadmon; Michael Kalyuzhny; Katherine F. Barnes; Andrew M. Wilson; Curtis Flather; Rosalind Renfrew; Joan Walsh; Edna Guk
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Northeastern United States, United States
    Description

    Standardized data on large-scale and long-term patterns of species richness are critical for understanding the consequences of natural and anthropogenic changes in the environment. The North American Breeding Bird Survey (BBS) is one of the largest and most widely used sources of such data, but so far, little is known about the degree to which BBS data provide accurate estimates of regional richness. Here we test this question by comparing estimates of regional richness based on BBS data with spatially and temporally matched estimates based on state Breeding Bird Atlases (BBA). We expected that estimates based on BBA data would provide a more complete (and therefore, more accurate) representation of regional richness due to their larger number of observation units and higher sampling effort within the observation units. Our results were only partially consistent with these predictions: while estimates of regional richness based on BBA data were higher than those based on BBS data, estimates of local richness (number of species per observation unit) were higher in BBS data. The latter result is attributed to higher land-cover heterogeneity in BBS units and higher effectiveness of bird detection (more species are detected per unit time). Interestingly, estimates of regional richness based on BBA blocks were higher than those based on BBS data even when differences in the number of observation units were controlled for. Our analysis indicates that this difference was due to higher compositional turnover between BBA units, probably due to larger differences in habitat conditions between BBA units and a larger number of geographically restricted species. Our overall results indicate that estimates of regional richness based on BBS data suffer from incomplete detection of a large number of rare species, and that corrections of these estimates based on standard extrapolation techniques are not sufficient to remove this bias. 
Future applications of BBS data in ecology and conservation, and in particular, applications in which the representation of rare species is important (e.g., those focusing on biodiversity conservation), should be aware of this bias, and should integrate BBA data whenever possible.

    Methods Overview

    This is a compilation of second-generation breeding bird atlas data and corresponding breeding bird survey data. This contains presence-absence breeding bird observations in 5 U.S. states: MA, MI, NY, PA, VT, sampling effort per sampling unit, geographic location of sampling units, and environmental variables per sampling unit: elevation and elevation range (from SRTM), mean annual precipitation & mean summer temperature (from PRISM), and NLCD 2006 land-use data.

    Each row contains all observations per sampling unit, with additional tables containing information on sampling effort impact on richness, a rareness table of species per dataset, and two summary tables for both bird diversity and environmental variables.

    The methods for compilation are contained in the supplementary information of the manuscript but also here:

    Bird data

    For BBA data, shapefiles for blocks and the data on species presences and sampling effort in blocks were received from the atlas coordinators. For BBS data, shapefiles for routes and raw species data were obtained from the Patuxent Wildlife Research Center (https://databasin.org/datasets/02fe0ebbb1b04111b0ba1579b89b7420 and https://www.pwrc.usgs.gov/BBS/RawData).

    Using ArcGIS Pro© 10.0, species observations were joined to respective BBS and BBA observation units shapefiles using the Join Table tool. For both BBA and BBS, a species was coded as either present (1) or absent (0). Presence in a sampling unit was based on codes 2, 3, or 4 in the original volunteer birding checklist codes (possible breeder, probable breeder, and confirmed breeder, respectively), and absence was based on codes 0 or 1 (not observed and observed but not likely breeding). Spelling inconsistencies of species names between BBA and BBS datasets were fixed. Species that needed spelling fixes included Brewer’s Blackbird, Cooper’s Hawk, Henslow’s Sparrow, Kirtland’s Warbler, LeConte’s Sparrow, Lincoln’s Sparrow, Swainson’s Thrush, Wilson’s Snipe, and Wilson’s Warbler. In addition, naming conventions were matched between BBS and BBA data. The Alder and Willow Flycatchers were lumped into Traill’s Flycatcher and regional races were lumped into a single species column: Dark-eyed Junco regional types were lumped together into one Dark-eyed Junco, Yellow-shafted Flicker was lumped into Northern Flicker, Saltmarsh Sparrow and the Saltmarsh Sharp-tailed Sparrow were lumped into Saltmarsh Sparrow, and the Yellow-rumped Myrtle Warbler was lumped into Myrtle Warbler (currently named Yellow-rumped Warbler). Three hybrid species were removed: Brewster's and Lawrence's Warblers and the Mallard x Black Duck hybrid. Established “exotic” species were included in the analysis since we were concerned only with detection of richness and not of specific species.

    The resultant species tables with sampling effort were pivoted horizontally so that every row was a sampling unit and each species observation was a column. This was done for each state using R version 3.6.2 (R© 2019, The R Foundation for Statistical Computing Platform) and all state tables were merged to yield one BBA and one BBS dataset. Following the joining of environmental variables to these datasets (see below), BBS and BBA data were joined using rbind.data.frame in R© to yield a final dataset with all species observations and environmental variables for each observation unit.
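The recoding and pivot steps described above can be illustrated in plain Python (the authors used ArcGIS joins and R; the sampling units, species, and codes below are invented examples):

```python
# Breeding codes 2-4 (possible/probable/confirmed breeder) map to present (1);
# codes 0-1 (not observed / observed but not likely breeding) map to absent (0).
def code_to_presence(code):
    return 1 if code in (2, 3, 4) else 0

# Long-format records: (sampling_unit, species, breeding_code)
records = [
    ("block_A", "Northern Flicker", 4),
    ("block_A", "Myrtle Warbler", 1),
    ("block_B", "Northern Flicker", 2),
]

# Pivot to wide format: one row per sampling unit, one column per species
species = sorted({sp for _, sp, _ in records})
wide = {}
for unit, sp, code in records:
    row = wide.setdefault(unit, {s: 0 for s in species})
    row[sp] = max(row[sp], code_to_presence(code))

print(wide["block_A"])  # {'Myrtle Warbler': 0, 'Northern Flicker': 1}
```

Taking the maximum over repeated observations of a species in a unit mirrors the presence-absence coding: a single qualifying breeding code is enough to mark the species present.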

    Environmental data

    Using ArcGIS Pro© 10.0, all environmental raster layers, BBA and BBS shapefiles, and the species observations were integrated in a common coordinate system (North_America Equidistant_Conic) using the Project tool. For BBS routes, 400m buffers were drawn around each route using the Buffer tool. The observation unit shapefiles for all states were merged (separately for BBA blocks and BBS routes and 400m buffers) using the Merge tool to create a study-wide shapefile for each data source. Whether or not a BBA block was adjacent to a BBS route was determined using the Intersect tool based on a radius of 30m around the route buffer (to fit the NLCD map resolution). Area and length of the BBS route inside the proximate BBA block were also calculated. Mean values for annual precipitation and summer temperature, and mean and range for elevation, were extracted for every BBA block and 400m buffer BBS route using Zonal Statistics as Table tool. The area of each land-cover type in each observation unit (BBA block and BBS buffer) was calculated from the NLCD layer using the Zonal Histogram tool.

  18. Replication Data for: The joint evolution of movement and competition...

    • b2find.eudat.eu
    Updated Aug 8, 2024
    Cite
    (2024). Replication Data for: The joint evolution of movement and competition strategies - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/447fe806-2b8a-551c-8880-5f8ddebe1ce5
    Explore at:
    Dataset updated
    Aug 8, 2024
    Description

    The dataset consists of a single zipped folder, data/, which contains subfolders with the simulation-specific data. These subfolders are named sim_SCENARIO_rep_NNN_gro_RMAX, where SCENARIO refers to the three scenarios of the model described in our manuscript, NNN is the replicate number, and RMAX is the maximum cell productivity in that simulation. Each subfolder contains:

    • The directory 'depends/', which holds the 'extract.exe' program, used to extract specific data from the stored simulation output. The 'sourceMe.R' file allows 'extract.exe' to be linked to R, delivering the required data as a list object that can be handled in R.
    • The stored simulation data on agents, as a series of '.arc' and '.bin' files.
    • The ecological snapshots of the landscape, with prey items, foragers, kleptoparasites, and handlers (depending on the scenario), stored as PNG files named '00NNN.png', where NNN is a three-digit representation of the generation number (e.g. 001 for generation 1).
    • Cumulative sums of the numbers of prey items, foragers, and kleptoparasites, the intake from foraging (searching for prey), and the intake from the kleptoparasitic strategy (searching for handlers), on each cell of the landscape, for each of the last 9 generations of the simulation, as 'NNN.txt', where LAYER may be 'items' (prey items), 'foragers' (foragers), 'klepts' (kleptoparasites), 'foragers_intake' (the intake from the foraging strategy), or 'klepts_intake' (the intake due to the kleptoparasitic strategy). NB: these files are not used in our analyses and may be ignored.

    SCENARIO may be one of "foragers" (scenario 1), "obligate" (scenario 2), or "facultative" (scenario 3). NNN may be one of "001", "002", or "003". RMAX may be one of "0.001", "0.005", "0.01", "0.02", "0.03", "0.04", or "0.05". The manuscript presents results for RMAX = 0.01.
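Given the naming convention above, run metadata can be recovered from a folder name with a small parser; a sketch in Python (the helper name is ours):

```python
import re

# Folder names follow sim_SCENARIO_rep_NNN_gro_RMAX
PATTERN = re.compile(r"^sim_(?P<scenario>foragers|obligate|facultative)"
                     r"_rep_(?P<rep>\d{3})_gro_(?P<rmax>[0-9.]+)$")

def parse_run_folder(name):
    """Return (scenario, replicate number, rmax) from a data/ subfolder name."""
    m = PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a simulation folder: {name}")
    return m["scenario"], int(m["rep"]), float(m["rmax"])

print(parse_run_folder("sim_obligate_rep_002_gro_0.01"))  # ('obligate', 2, 0.01)
```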

  19. Matrix profile analysis of Dansgaard-Oeschger events in palaeoclimate time...

    • rdm.inesctec.pt
    Updated Feb 6, 2024
    Cite
    (2024). Matrix profile analysis of Dansgaard-Oeschger events in palaeoclimate time series - Dataset - CKAN [Dataset]. https://rdm.inesctec.pt/dataset/cs-2024-002
    Explore at:
    Dataset updated
    Feb 6, 2024
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset includes all the datafiles and computational notebooks required to reproduce the work reported in the paper “Characterisation of Dansgaard-Oeschger events in palaeoclimate time series using the Matrix Profile”.

    Input datafiles:
    • time series (20-year resolution) of oxygen isotope ratios (δ18O) from the NGRIP ice core on the GICC05 time scale (source: https://www.iceandclimate.nbi.ku.dk, DOI: 10.1016/j.quascirev.2014.09.007): the 1st column is the time in ka (10³ years) b2k (before A.D. 2000), and the 2nd column the oxygen isotope concentration;
    • time series (20-year resolution) of calcium concentration (Ca2+) from the NGRIP ice core on the GICC05 time scale (source: https://www.iceandclimate.nbi.ku.dk, DOI: 10.1016/j.quascirev.2014.09.007): the 1st column is the time in ka b2k, and the 2nd column the Ca2+ concentration;
    • time series (20-year resolution) of calcium concentration (Ca2+) from the NGRIP ice core on the GICC05 time scale, artificially shifted by 10 ka (500 data points): the 1st column is the time in ka b2k, and the 2nd column the Ca2+ concentration;
    • time series (20-year resolution) of calcium concentration (Ca2+) from the NGRIP ice core on the GICC05 time scale, trimmed by 10 ka (500 data points): the 1st column is the time in ka b2k, and the 2nd column the Ca2+ concentration.

    Code and computational notebooks:
    • R code for visualisation of matrix profile calculations;
    • jupyter notebook (Python) containing the matrix profile analysis of the oxygen isotope time series;
    • jupyter notebook (Python) containing the matrix profile analysis of the calcium time series;
    • jupyter notebook (Python) containing the join matrix profile analysis of the oxygen isotope and calcium time series;
    • jupyter notebook (R) for visualisation of the matrix profile results of the oxygen isotope time series;
    • jupyter notebook (R) for visualisation of the matrix profile results of the calcium time series;
    • jupyter notebook (R) for visualisation of the join matrix profile results.

    Output datafiles:
    • matrix profile of the oxygen isotope time series (sub-sequence length of 2,500 years): the 1st column contains the matrix profile value (distance to the nearest sub-sequence), the 2nd column contains the profile index (the zero-based index location of the nearest sub-sequence);
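For orientation, the matrix profile the notebooks compute can be illustrated with a deliberately naive self-join sketch in Python (production analyses use optimized libraries such as STUMPY; the toy series below is invented):

```python
import math

def znorm(seq):
    """Z-normalize a subsequence (constant subsequences are left centred)."""
    mu = sum(seq) / len(seq)
    sd = math.sqrt(sum((x - mu) ** 2 for x in seq) / len(seq)) or 1.0
    return [(x - mu) / sd for x in seq]

def matrix_profile(ts, m):
    """Naive self-join matrix profile: for each length-m subsequence, the
    z-normalized Euclidean distance to its nearest non-trivial neighbour."""
    n = len(ts) - m + 1
    subs = [znorm(ts[i:i + m]) for i in range(n)]
    profile, index = [], []
    excl = m // 2  # exclusion zone: skip trivial self-matches
    for i in range(n):
        best_d, best_j = float("inf"), -1
        for j in range(n):
            if abs(i - j) <= excl:
                continue
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(subs[i], subs[j])))
            if d < best_d:
                best_d, best_j = d, j
        profile.append(best_d)
        index.append(best_j)
    return profile, index

ts = [0, 1, 2, 1, 0, 1, 2, 1, 0]
prof, idx = matrix_profile(ts, 4)
print(min(prof))  # 0.0 -- the repeated motif matches exactly
```

Low profile values mark repeated motifs (such as recurring Dansgaard-Oeschger-like shapes); high values mark discords. A join (AB) profile compares subsequences of one series against another instead of against itself.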

  20. Global Invasive and Alien Traits and Records (GIATAR) dataset

    • zenodo.org
    zip
    Updated Mar 19, 2025
    Cite
    Ariel Saffer; Thom Worm (2025). Global Invasive and Alien Traits and Records (GIATAR) dataset [Dataset]. http://doi.org/10.5281/zenodo.15042321
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ariel Saffer; Thom Worm
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Time period covered
    Jul 30, 2024
    Description

    Monitoring and managing the global spread of invasive and alien species requires accurate spatiotemporal records of species presence and information about the biological characteristics of species of interest including life cycle information, biotic and abiotic constraints and pathways of spread. The Global Invasive and Alien Traits And Records (GIATAR) dataset provides consolidated dated records of invasive and alien presence at the country-scale combined with a suite of biological information about pests of interest in a standardized, machine-readable format. We provide dated presence records for 46,666 alien taxa in 249 countries constituting 827,300 country-taxon pairs, joined with additional biological information for thousands of taxa. GIATAR is designed to be quickly updateable with future data and easy to integrate into ongoing research on global patterns of alien species movement using scripts provided to query and analyze data.

    This publication includes:

    • GIATAR dataset files (dataset)
    • Functions in Python and R to join tables and query data (query_functions)
    • Tutorials and example queries in Python and R (tutorials)
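A hypothetical sketch of the kind of table join the query functions support, in Python with invented taxon identifiers and columns (the real GIATAR files and the provided query_functions define the actual schema):

```python
# Dated country-scale presence records: (taxon_id, country, first_record_year)
presence = [
    ("tax_1", "USA", 1954),
    ("tax_1", "BRA", 1987),
    ("tax_2", "USA", 2001),
]

# Trait information keyed by taxon -- column names are illustrative only
traits = {
    "tax_1": {"pathway": "horticulture"},
    "tax_2": {"pathway": "ballast water"},
}

# Join each presence record with the traits of its taxon
joined = [
    {"taxon_id": t, "country": c, "year": y, **traits.get(t, {})}
    for t, c, y in presence
]
print(joined[0]["pathway"], joined[0]["year"])  # horticulture 1954
```

Keeping presence records and traits in separate tables keyed by taxon, as GIATAR does, lets each side be updated independently and joined on demand.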

    For more information, please refer to the publication:

    Saffer, Ariel, Thom Worm, Yu Takeuchi, and Ross Meentemeyer. “GIATAR: A Spatio-Temporal Dataset of Global Invasive and Alien Species and Their Traits.” Scientific Data 11, no. 1 (September 11, 2024): 991. https://doi.org/10.1038/s41597-024-03824-w.
    Changes in this version (March 19, 2025):
    • Removed base folder from folder structure
    • Included additional files used to update the database
    • Latest records as of March 9 - 10, 2025
    • Updated species list from EPPO as of February 26, 2025
    For continuous updates to code, please refer to our Github repository: https://github.com/ncsu-landscape-dynamics/GIATAR-dataset
Cite
Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva (2025). MERGE Dataset [Dataset]. http://doi.org/10.5281/zenodo.13939205

MERGE Dataset

Explore at:
Available download formats: zip
Dataset updated
Feb 7, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Pedro Lima Louro; Hugo Redinho; Ricardo Santos; Ricardo Malheiro; Renato Panda; Rui Pedro Paiva
License

Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically

Description

The MERGE dataset is a collection of audio, lyrics, and bimodal datasets for conducting research on Music Emotion Recognition. A complete version is provided for each modality. The audio datasets provide 30-second excerpts for each sample, while full lyrics are provided in the relevant datasets. The amount of available samples in each dataset is the following:

  • MERGE Audio Complete: 3554
  • MERGE Audio Balanced: 3232
  • MERGE Lyrics Complete: 2568
  • MERGE Lyrics Balanced: 2400
  • MERGE Bimodal Complete: 2216
  • MERGE Bimodal Balanced: 2000

Additional Contents

Each dataset contains the following additional files:

  • av_values: File containing the arousal and valence values for each sample sorted by their identifier;
  • tvt_dataframes: Train, validate, and test splits for each dataset. Both a 70-15-15 and a 40-30-30 split are provided.
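The tvt_dataframes ship the splits precomputed; purely to illustrate the 70-15-15 proportions, a hypothetical split function in Python (the identifiers are invented, and this is not the authors' procedure):

```python
import random

def tvt_split(ids, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle sample identifiers and cut train/validate/test partitions."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = tvt_split([f"MT{i:04d}" for i in range(2000)])
print(len(train), len(val), len(test))  # 1400 300 300
```

In practice a split for this dataset would also be stratified by quadrant label so each partition preserves the class balance; using the provided tvt_dataframes keeps results comparable across studies.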

Metadata

A metadata spreadsheet is provided for each dataset with the following information for each sample, if available:

  • Song (Audio and Lyrics datasets) - Song identifiers. Identifiers starting with MT were extracted from the AllMusic platform, while those starting with A or L were collected from private collections;
  • Quadrant - Label corresponding to one of the four quadrants from Russell's Circumplex Model;
  • AllMusic Id - For samples starting with A or L, the matching AllMusic identifier is also provided. This was used to complement the available information for the samples originally obtained from the platform;
  • Artist - First performing artist or band;
  • Title - Song title;
  • Relevance - AllMusic metric representing the relevance of the song in relation to the query used;
  • Duration - Song length in seconds;
  • Moods - User-generated mood tags extracted from the AllMusic platform and available in Warriner's affective dictionary;
  • MoodsAll - User-generated mood tags extracted from the AllMusic platform;
  • Genres - User-generated genre tags extracted from the AllMusic platform;
  • Themes - User-generated theme tags extracted from the AllMusic platform;
  • Styles - User-generated style tags extracted from the AllMusic platform;
  • AppearancesTrackIDs - All AllMusic identifiers related with a sample;
  • Sample - Availability of the sample in the AllMusic platform;
  • SampleURL - URL to the 30-second excerpt in AllMusic;
  • ActualYear - Year of song release.

Citation

If you use some part of the MERGE dataset in your research, please cite the following article:

Louro, P. L. and Redinho, H. and Santos, R. and Malheiro, R. and Panda, R. and Paiva, R. P. (2024). MERGE - A Bimodal Dataset For Static Music Emotion Recognition. arXiv. URL: https://arxiv.org/abs/2407.06060.

BibTeX:

@misc{louro2024mergebimodaldataset,
  title={MERGE -- A Bimodal Dataset for Static Music Emotion Recognition},
  author={Pedro Lima Louro and Hugo Redinho and Ricardo Santos and Ricardo Malheiro and Renato Panda and Rui Pedro Paiva},
  year={2024},
  eprint={2407.06060},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2407.06060},
}

Acknowledgements

This work is funded by FCT - Foundation for Science and Technology, I.P., within the scope of the projects: MERGE - DOI: 10.54499/PTDC/CCI-COM/3171/2021 financed with national funds (PIDDAC) via the Portuguese State Budget; and project CISUC - UID/CEC/00326/2020 with funds from the European Social Fund, through the Regional Operational Program Centro 2020.

Renato Panda was supported by Ci2 - FCT UIDP/05567/2020.
