2 datasets found
  1. f

    RMSE and R2 for different data groupings. The first column contains the...

    • plos.figshare.com
    xls
    Updated Sep 11, 2025
    Cite
    Charles A. Price; Todd A. Schroeder; Benjamin Branoff; Humfredo Marcano-Vega; Nicole Pillot-Torres; Morgan Chaudry; Michael Ross; Monica Papeș; Skip Van Bloem (2025). RMSE and R2 for different data groupings. The first column contains the species composition (inter, intra, or species specific) and the second column the data composition for each statistic reported in the remaining columns. Columns 3-5 contain the RMSE (kg), columns 6-8 the R2, and columns 9-11 the % error relative to the mean biomass in the dataset for BSD, D30, and DBH, respectively (see Methods). The subset of the data for trees that had all three measurements is denoted by the terms “Combined 3” and “Site 3” in the Data column. The lowest RMSE value in a row for each metric is in bold. Similarly, the highest R2 for each row is in bold. Values that represent means are underlined. The mean for each column and each grouping is given in the final two rows. The final row contains means for those trees with all three measures. The row above it contains means for all trees. [Dataset]. http://doi.org/10.1371/journal.pone.0323926.t002
    Available download formats: xls
    Dataset updated
    Sep 11, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Charles A. Price; Todd A. Schroeder; Benjamin Branoff; Humfredo Marcano-Vega; Nicole Pillot-Torres; Morgan Chaudry; Michael Ross; Monica Papeș; Skip Van Bloem
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RMSE and R2 for different data groupings. The first column contains the species composition (inter, intra, or species specific) and the second column the data composition for each statistic reported in the remaining columns. Columns 3-5 contain the RMSE (kg), columns 6-8 the R2, and columns 9-11 the % error relative to the mean biomass in the dataset for BSD, D30, and DBH, respectively (see Methods). The subset of the data for trees that had all three measurements is denoted by the terms “Combined 3” and “Site 3” in the Data column. The lowest RMSE value in a row for each metric is in bold. Similarly, the highest R2 for each row is in bold. Values that represent means are underlined. The mean for each column and each grouping is given in the final two rows. The final row contains means for those trees with all three measures. The row above it contains means for all trees.
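
The three statistics named in the description can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes R2 is the coefficient of determination computed on untransformed values and that "% error" means RMSE divided by mean observed biomass, times 100. The paper's Methods may define either differently, and the function names are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, in the units of the inputs (kg here)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def percent_error(observed, predicted):
    """RMSE as a percentage of mean observed biomass (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    return 100 * rmse(observed, predicted) / mean_obs
```

Called with a list of measured biomass values and a list of model predictions of the same length, each function returns a single number corresponding to one cell in the table.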

  2. Z

    The Dynamics of Collective Action Corpus

    • nde-dev.biothings.io
    • data.niaid.nih.gov
    Updated Oct 7, 2023
    Cite
    Taylor, Marshall A. (2023). The Dynamics of Collective Action Corpus [Dataset]. https://nde-dev.biothings.io/resources?id=zenodo_8414334
    Dataset updated
    Oct 7, 2023
    Dataset provided by
    Stoltz, Dustin S.
    Taylor, Marshall A.
    Dudley, Jennifer S.K.
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository includes two datasets, a Document-Term Matrix and associated metadata, for 17,493 New York Times articles covering protest events, both saved as single R objects.

    These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text of the news articles was not included in the original DoCA data.

    We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.

    We then subset the original DoCA dataset to include only rows that match a recollected article. The "20231006_dca_metadata_subset.Rds" file contains all of the metadata variables from the original DoCA dataset (see Codebook), plus "pdf_file" and "pub_title", the latter being the title of the recollected article (which may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (a row is a distinct protest event, and one article may cover more than one protest event).

    Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: removed headers and footers that were consistent across all digitized stories, along with any web links or HTML; added a single space before an uppercase letter when it was flush against a preceding lowercase letter (e.g., turning "JohnKennedy" into "John Kennedy"); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to "Basic Latin" ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's" into "it is"); removed punctuation; removed capitalization; removed numbers; fixed word kerning; applied a final extra round of whitespace removal.
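
The ordered steps above can be sketched in Python as follows. This is a rough approximation under stated assumptions, not the authors' pipeline: the contraction table is a tiny hypothetical stand-in, accent stripping via Unicode decomposition stands in for the transliteration step, and the corpus-specific header/footer removal and word-kerning fix are omitted because the description does not specify them.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny contraction table; the source does not list the one used.
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "cannot"}

# Curly single and double quotes mapped to their ASCII counterparts.
CURLY_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML tags
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)  # drop web links
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)     # "JohnKennedy" -> "John Kennedy"
    text = re.sub(r"\s+", " ", text).strip()             # collapse excess whitespace
    # Approximate transliteration: decompose, then drop combining accent marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.translate(CURLY_QUOTES)                  # curly quotes -> ASCII quotes
    for contraction, expansion in CONTRACTIONS.items():  # expand known contractions
        pattern = r"\b" + re.escape(contraction) + r"\b"
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # punctuation and leftover non-ASCII
    text = text.lower()                                  # remove capitalization
    text = re.sub(r"[0-9]+", " ", text)                  # remove numbers
    return re.sub(r"\s+", " ", text).strip()             # final whitespace pass
```

Replacing punctuation with a space rather than deleting it is a choice made for this sketch; it avoids gluing neighbouring words together but can leave single-character fragments from contractions missing in the table.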

    We then tokenized them by following the rule that each word is a character string surrounded by a single space. At this step, each document is a list of tokens. We counted each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times that token occurred in that article. Finally, we removed words (i.e., columns in the DTM) that occurred fewer than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens, and 17,493 documents. The "20231006_dca_dtm.Rds" file is a sparse matrix class object from the Matrix R package.
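
A minimal Python sketch of the tokenize, count, and filter procedure just described. It assumes the input documents are already preprocessed strings; `build_dtm` and its `min_count` parameter are hypothetical names, and the dictionary-per-row layout is only a stand-in for the sparse matrix in the released file.

```python
from collections import Counter

def build_dtm(documents, min_count=4):
    """Return (vocabulary, rows); each row maps column index -> token count."""
    # Tokenize on single spaces and count tokens per document.
    doc_counts = [Counter(doc.split(" ")) for doc in documents]

    # Corpus-wide frequency of every token.
    corpus_counts = Counter()
    for counts in doc_counts:
        corpus_counts.update(counts)

    # Keep tokens seen at least min_count times and longer than one character.
    vocab = sorted(
        term for term, count in corpus_counts.items()
        if count >= min_count and len(term) > 1
    )
    column = {term: j for j, term in enumerate(vocab)}

    # Sparse rows: only non-zero cells are stored.
    rows = [
        {column[term]: count for term, count in counts.items() if term in column}
        for counts in doc_counts
    ]
    return vocab, rows
```

Storing only non-zero cells mirrors why a sparse representation is used: with tens of thousands of columns, nearly every cell in a given article's row is zero.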

    In R, use the load() function to load the objects dca_dtm and dca_meta. To associate dca_meta with dca_dtm, match the "pdf_file" variable in dca_meta to the rownames of dca_dtm.
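
The matching logic itself is language-neutral and can be sketched in Python. This is an illustration only: it assumes the DTM row names and the metadata rows have already been exported from R into a plain list and a list of dictionaries, and `align_metadata` is a hypothetical helper. Because one article may cover several protest events, each DTM row can map to more than one metadata row.

```python
from collections import defaultdict

def align_metadata(dtm_rownames, meta_rows):
    """For each DTM row name, collect the metadata rows sharing its pdf_file."""
    by_file = defaultdict(list)
    for row in meta_rows:
        by_file[row["pdf_file"]].append(row)
    # Preserve DTM row order; articles without metadata get an empty list.
    return [by_file.get(name, []) for name in dtm_rownames]
```

The result has one entry per DTM row, in DTM row order, so position i in the returned list describes the events covered by the article in row i of the matrix.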

