10 datasets found
  1. R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication: Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.), Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame with the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); it is available as input_data_raw.txt. The script 2-script-create-motion-chart-input-data.R then processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
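
    The following is a hedged sketch (not the repository's own code) of the kind of per-million-words normalisation step that 2-script-create-motion-chart-input-data.R performs, assuming input_data_raw.txt is tab-separated and that coha_size.txt pairs each decade with a corpus size in a column named corpus_size (a hypothetical column name):

      # Hypothetical sketch: normalise collocate frequencies per million words.
      # The column name `corpus_size` in coha_size.txt is assumed, not documented here.
      library(readr)
      library(dplyr)

      raw  <- read_tsv("input_data_raw.txt")   # columns: decade, coll, `BE going to`, will
      size <- read_tsv("coha_size.txt")        # assumed columns: decade, corpus_size

      normalised <- raw %>%
        left_join(size, by = "decade") %>%
        mutate(across(c(`BE going to`, will),
                      ~ .x / corpus_size * 1e6))   # frequency per million words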

  2. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open-access journal aiming to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a single crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the scripts for web scraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
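
    As an illustration only (this is not the project's r_2_scrape_matches code), the two fuzzy-matching methods named above can be computed with the stringdist package; the example titles and the 0.8 threshold are made up:

      # Hypothetical sketch of fuzzy title matching with the "cosine" and "osa" methods.
      library(stringdist)

      core_title      <- "The Hidden Life of Trees"      # made-up example title
      candidate_title <- "Hidden Life of Trees, The"     # made-up IMDb candidate

      sim_cosine <- stringsim(core_title, candidate_title, method = "cosine")
      sim_osa    <- stringsim(core_title, candidate_title, method = "osa")

      # A simple decision rule: accept the candidate when either score is high.
      max(sim_cosine, sim_osa) > 0.8   # threshold chosen for illustration only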

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  3. Large Datasets in R - Plant Phenology & Temperature Data from NEON

    • qubeshub.org
    Updated May 10, 2018
    Cite
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg (2018). Large Datasets in R - Plant Phenology & Temperature Data from NEON [Dataset]. http://doi.org/10.25334/Q4DQ3F
    Explore at:
    Dataset updated
    May 10, 2018
    Dataset provided by
    QUBES
    Authors
    Megan Jones Patterson; Lee Stanish; Natalie Robinson; Katherine Jones; Cody Flagg
    Description

    This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data, it has also been used in undergraduate classrooms.
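
    A minimal sketch of the kind of workflow the module teaches is shown below; the file and column names here are placeholders, not NEON's actual field names:

      # Hypothetical sketch: read a time-series CSV, parse dates, and plot.
      library(ggplot2)

      temps <- read.csv("air_temperature.csv")   # placeholder file name
      temps$date <- as.Date(temps$date)          # assumed date column

      ggplot(temps, aes(x = date, y = mean_temp)) +   # assumed value column
        geom_line() +
        labs(x = "Date", y = "Mean air temperature")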

  4. Data and code for the manuscript - The hidden biodiversity knowledge split in biological collections

    • zenodo.org
    Updated Apr 19, 2025
    Cite
    Anonymous Anonymous; Anonymous Anonymous (2025). Data and code for the manuscript - The hidden biodiversity knowledge split in biological collections [Dataset]. http://doi.org/10.5281/zenodo.15248066
    Explore at:
    Dataset updated
    Apr 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous; Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 19, 2025
    Description

    # General overview

    This repository contains the data and code used in the analysis of the
    manuscript entitled **"The hidden biodiversity knowledge split in biological collections"**.

    # Context

    Ecological and evolutionary processes generate biodiversity, yet how biodiversity data are organized and shared globally can shape our understanding of these processes. We show that name-bearing type specimens—the primary reference for species identity—of all freshwater and brackish fish species are predominantly housed in Global North museums, disconnected from their countries of origin. This geographical divide creates a ‘knowledge split’ with consequences for biodiversity science, particularly in the Global South, where researchers face barriers in studying native species’ name bearers housed abroad. Meanwhile, Global North collections remain flooded with non-native name bearers. We relate this imbalance to historical and socioeconomic factors, which ultimately restricts access to critical taxonomic reference materials and hinders global species documentation. To address this disparity, we call for international initiatives to promote fairer access to biological knowledge, including specimen repatriation, improved accessibility protocols for researchers in countries where specimens originated, and inclusive research partnerships.

    # Repository structure

    ## data

    This folder stores raw and processed data used to perform all the
    analysis presented in this study

    ### raw

    - `flow_period_region_country.csv` a data frame in the long format
    containing the flow of NBT between regions per time period (50-year time
    frames); a minimal reading sketch appears at the end of this subsection. Variables:

    - `period` numeric variable representing 50-year time intervals

    - `region_type` character representing the name of the World Bank region
    of the country where the NBT was sourced

    - `country_type` character. A three letter code (alpha-3 ISO3166) representing
    the country of the museum where the NBT was sourced

    - `region_museum` character. Name of the World Bank region of the country
    where the NBT is housed

    - `country_museum` character. A three letter code (alpha-3 ISO3166) representing
    the country of the museum where the NBT is housed

    - `n` numeric. The number of NBT flowing from one country to another

    - `spp_native_distribution.csv` data frame in the long format
    containing the native composition at the country level. Variables:

    - `valid_name` character. The name of a species in the format genus_epithet
    according to the Catalog of Fishes

    - `country_distribution` character. Three letter code (alpha-3 ISO3166)
    indicating the name of the country where a species is native to

    - `region_distribution` character. The name of the World Bank region
    where a species is native

    - `spp_type_distribution.csv` data frame in the long format containing
    the composition of NBT by country. Variables:


    - `valid_name` character. The name of a species in the format genus_epithet
    according to the Catalog of Fishes

    - `country_distribution` character. Three letter code (alpha-3 ISO3166)
    indicating the name of the country where a species is housed

    - `region_distribution` character. The name of the World Bank region
    where a species is housed

    - `bio-dem_data.csv` data frame with data downloaded from
    [Bio-Dem](https://bio-dem.surge.sh/#awards) containing biological and
    social information at the country level. Variables:

    - `country` character. A three letter code (alpha-3 ISO3166) representing
    a country

    - `records` numeric. Total number of species occurrence records from the
    Global Biodiversity Information Facility (GBIF)

    - `records_per_area` numeric. Records per area from GBIF

    - `yearsSinceIndependence` numeric. Years since independence for each country

    - `e_migdppc` numeric. GDP per capita

    - `museum_data.csv` data frame with museums' acronyms and the world
    region of each. Variables:

    - `code_museum` character. The acronym (three letter code) of the museum

    - `country_museum` character. A three letter code (alpha-3 ISO3166) representing
    a country

    - `region_museum` character. The name of the World Bank region
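
    A minimal reading sketch for the raw flow table described above (this is not one of the repository's scripts; the file path is assumed from the folder structure):

      # Hypothetical sketch: total NBT flow between World Bank regions.
      library(readr)
      library(dplyr)

      flow <- read_csv("data/raw/flow_period_region_country.csv")

      flow %>%
        group_by(region_type, region_museum) %>%   # source region -> housing region
        summarise(total_nbt = sum(n), .groups = "drop")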

    ### processed

    - `flow_region.csv` a data frame containing the flow of name bearers among world
    regions and the total number of name bearers derived from the source region

    - `flow_period_region.csv` a data frame with the number of name bearers between
    the world regions per 50-year time frame and the total number of name bearers
    in each time frame for each world region

    - `flow_period_region_prop.csv` a data frame with the number of name bearers,
    the Domestic Contribution and Domestic Retention between the world
    regions in a 50-year time frame - this is not used anymore in downstream analyses

    - `flow_region_prop.csv` data with the total number of species flowing
    between world regions, Domestic Contribution and Domestic Retention - this is no longer used in downstream analyses

    - `flow_country.csv` data frame with flow information of name bearers among
    countries

    - `df_country_native.csv` data frame with the number of native species
    at the country level

    - `df_country_type.csv` data frame with the number of name bearers at the
    country level

    - `df_all_beta.csv` data frame with values of endemic deficit and non-endemic
    representation at the country level

    ## R

    The letters `D`, `A` and `V` represent scripts for, respectively, data
    processing (D), data analysis (A) and results visualization (V). The
    script sequence to reproduce the workflow is indicated by the number at
    the beginning of each script file name.

    - [`01_D_data_preparation.qmd`](R/01_D_data_preparation.qmd) initial data preparation

    - [`02_A_beta-endemics-countries.qmd`](R/02_A_beta-endemics-countries.qmd) analysis of endemic deficit and non-endemic representation. This script is used to calculate `native/endemic deficit` and `non-native/non-endemic representation`

    - [`03_D_data_preparation_models.qmd`](R/03_D_data_preparation_models.qmd) script used to build data frames that will be used in statistical models ([`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd))

    - [`04_A_model_NBTs.qmd`](R/04_A_model_NBTs.qmd) statistical models for the total number of name bearers, endemic deficit and non-endemic representation

    - [`05_V_chord_diagram_Fig1.qmd`](R/05_V_chord_diagram_Fig1.qmd) code used to produce circular flow diagram. This is the Figure 1 of the study

    - [`06_V_world_map_Fig1.qmd`](R/06_V_world_map_Fig1.qmd) code used to produce the world map in the Figure 1 of the main text

    - [`08_V_beta_endemics_Fig3.qmd`](R/08_V_beta_endemics_Fig3.qmd) code used to build Figure 2 of the main text

    - [`09_V_model_Fig4.qmd`](R/09_V_model_Fig4.qmd) code used to build the Figure 3 of the main text. This is the representation of the results of the models present in the script [04_A_model_NBTs.qmd](R/04_A_model_NBTs.qmd)

    - [`0010_Supplementary_analysis.qmd`](R/0010_Supplementary_analysis.qmd) code to produce all the tables and figures presented in the Supplementary material of this study

    ## output

    ### Figures

    In this folder you will find all figures used in the main text and supplementary material of this study

    `Fig1_flow_circle_plot.png` Figure with circular plots showing the flux of name bearers among regions of the world in a 50-year time window

    `Fig3_turnover_metrics_endemics.png` Cartogram with 3 maps showing the level of endemic deficit,
    non-endemic representation, and the combination of both metrics in a combined map

    `Fig4_models.png` Figure showing the predictions of the number of name bearers,
    endemic deficit and non-endemic representation for different predictors.
    This is derived from the statistical models

    #### Supp-material

    This folder contains the figures in the Supplementary material

    - `FigS1_native_richness.png` World map with countries coloured by native species richness, following the Catalog of Fishes

    - `FigS3_turnover_metrics.png` Cartogram with 3 maps showing the level of
    native deficit, non-native representation and the combination of both metrics in a combined map

  5. ceilometer normalized backscatter, mixed layer height derived from the backscatter, and radiosondes (t, p, rh, elevation) in Hierarchical Data Format (HDF)

    • catalog.data.gov
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). ceilometer normalized backscatter, mixed layer height derived from the backscatter, and radiosondes (t, p, rh, elevation) in Hierarchical Data Format (HDF) [Dataset]. https://catalog.data.gov/dataset/ceilometer-normalized-backscatter-mixed-layer-height-derived-from-the-backscatter-and-radi
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Ceilometer normalized backscatter, mixed layer height derived from the backscatter, and radiosondes (t, p, rh, elevation) in Hierarchical Data Format (HDF). This dataset is not publicly accessible because it is too large. It can be accessed through the following means: additional data used in this manuscript are available upon request, made via email to James Szykman at szykman.jim@epa.gov. The additional data include ceilometer normalized backscatter, mixed layer height derived from the backscatter, and radiosondes (t, p, rh, elevation) in Hierarchical Data Format (HDF). Format: ceilometer normalized backscatter, mixed layer height derived from the backscatter, and radiosondes (t, p, rh, elevation) in Hierarchical Data Format (HDF). This dataset is associated with the following publication: Knepp, T., J. Szykman, R. Long, R. Duvall, J. Krug, M. Beaver, K. Cavender, K. Kronmiller, M. Wheeler, R. Delgado, R. Hoff, T. Berkoff, E. Olson, R. Clark, D. Wolfe, D. Van Gilst, and D. Neil. Assessment of mixed-layer height estimation from single-wavelength ceilometer profiles. Atmospheric Measurement Techniques. Copernicus Publications, Katlenburg-Lindau, Germany, 10: 3963-3983, (2017).

  6. R object containing study data in Phyloseq format

    • figshare.com
    application/gzip
    Updated Apr 26, 2023
    Cite
    Michael Cox (2023). R object containing study data in Phyloseq format [Dataset]. http://doi.org/10.6084/m9.figshare.22702000.v1
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Apr 26, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Michael Cox
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R object containing OTU tables and metadata from throat swabs of children in Ecuador.

  7. Explore data formats and ingestion methods

    • kaggle.com
    Updated Feb 12, 2021
    Cite
    Gabriel Preda (2021). Explore data formats and ingestion methods [Dataset]. https://www.kaggle.com/datasets/gpreda/iris-dataset/discussion?sort=undefined
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gabriel Preda
    License

    CC0 1.0 Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Why this Dataset

    This dataset brings you the Iris Dataset in several data formats (see more details in the next sections).

    You can use it to test the ingestion of data in all these formats using Python or R libraries. We also prepared a Python Jupyter notebook and an R Markdown report that read in all these formats:

    Iris Dataset

    Iris Dataset was created by R. A. Fisher and donated by Michael Marshall.

    Repository on UCI site: https://archive.ics.uci.edu/ml/datasets/iris

    Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

    The file downloaded is iris.data and is formatted as a comma delimited file.

    This small data collection was created to help you test your skills with ingesting various data formats.

    Content

    This file was processed to convert the data into the following formats:
    * csv - comma separated values format
    * tsv - tab separated values format
    * parquet - parquet format
    * feather - feather format
    * parquet.gzip - compressed parquet format
    * h5 - hdf5 format
    * pickle - Python binary object file - pickle format
    * xlsx - Excel format
    * npy - Numpy (Python library) binary format
    * npz - Numpy (Python library) binary compressed format
    * rds - Rds (R specific data format) binary format
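
    As a quick, hedged illustration of ingesting a few of these formats from R (file names are assumed to follow the pattern iris.<extension>; adjust them to the actual names in the dataset):

      # Hypothetical sketch: read several of the provided formats in R.
      library(arrow)    # parquet / feather
      library(readxl)   # xlsx

      iris_csv     <- read.csv("iris.csv")
      iris_tsv     <- read.delim("iris.tsv")
      iris_parquet <- arrow::read_parquet("iris.parquet")
      iris_feather <- arrow::read_feather("iris.feather")
      iris_xlsx    <- readxl::read_excel("iris.xlsx")
      iris_rds     <- readRDS("iris.rds")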

    Acknowledgements

    I would like to acknowledge the work of the creator of the dataset - R. A. Fisher and of the donor - Michael Marshall.

    Inspiration

    Use these data formats to test your skills in ingesting data in various formats.

  8. Data from Tree Censuses and Inventories in Panama

    • smithsonian.figshare.com
    zip
    Updated Apr 18, 2024
    Cite
    Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao (2024). [Dataset:] Data from Tree Censuses and Inventories in Panama [Dataset]. http://doi.org/10.5479/data.stri.2016.0622
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 18, 2024
    Dataset provided by
    Smithsonian Tropical Research Institute
    Authors
    Richard Condit; Rolando Pérez; Salomón Aguilar; Suzanne Lao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: These are results from a network of 65 tree census plots in Panama. At each, every individual stem in a rectangular area of specified size is given a unique number and identified to species, and its stem diameter is measured in one or more censuses. Data from these numerous plots and inventories were collected following the same methods as, and species identity harmonized with, the 50-ha long-term tree census at Barro Colorado Island. The precise location of every site, elevation, and estimated rainfall (for many sites) are also included. These data were gathered over many years, starting in 1994 and continuing to the present, by principal investigators R. Condit, R. Perez, S. Lao, and S. Aguilar. Funding has been provided by many organizations.

    Description:

    1. marenaRecent.full.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This and all other tables labelled 'full' have one record per individual tree found in that census. Detailed documentation of the 'full' tables is given in RoutputFull.pdf (component 10 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). This one file, 'marenaRecent.full1.rdata', has data from the latest census at 60 different plots. These are the best data to use if only a single plot census is needed.

    2. marena2cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for 44 plots with two censuses: 'marena2cns.full1.rdata' for the first census and 'marena2cns.full2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.full (component 1): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed.

    3. marena3cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for nine plots with three censuses: 'marena3cns.full1.rdata' for the first census through 'marena3cns.full3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.full (component 2): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed.

    4. marena4cns.full.Rdata5Jan2013.zip: R Analytical Tables of the style 'full' for six plots with four censuses: 'marena4cns.full1.rdata' for the first census through 'marena4cns.full4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.full (component 3): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed.

    5. marenaRecent.stem.Rdata5Jan2013.zip: A zip archive holding one R Analytical Table, a version of the Marena plots' census data in R format, designed for data analysis. This one file, 'marenaRecent.stem1.rdata', has data from the latest census at 60 different plots. The table has one record per individual stem, necessary because some individual trees have more than one stem. Detailed documentation of the 'stem' tables is given in RoutputStem.pdf (component 11 below); an additional column 'plot' is included because the table includes records from many different locations. Plot coordinates are given in PanamaPlot.txt (component 12 below). These are the best data to use if only a single plot census is needed and individual stems are desired.

    6. marena2cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for 44 plots with two censuses: 'marena2cns.stem1.rdata' for the first census and 'marena2cns.stem2.rdata' for the second census. These 44 plots are a subset of the 60 found in marenaRecent.stem (component 5): the 44 that have been censused two or more times. These are the best data to use if two plot censuses are needed and individual stems are desired.

    7. marena3cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for nine plots with three censuses: 'marena3cns.stem1.rdata' for the first census through 'marena3cns.stem3.rdata' for the third census. These nine plots are a subset of the 44 found in marena2cns.stem (component 6): the nine that have been censused three or more times. These are the best data to use if three plot censuses are needed and individual stems are desired.

    8. marena4cns.stem.Rdata5Jan2013.zip: R Analytical Tables of the style 'stem' for six plots with four censuses: 'marena4cns.stem1.rdata' for the first census through 'marena4cns.stem4.rdata' for the fourth census. These six plots are a subset of the nine found in marena3cns.stem (component 7): the six that have been censused four or more times. These are the best data to use if four plot censuses are needed and individual stems are desired.

    9. bci.spptable.rdata: A list of the 1414 species found across all tree plots and inventories in Panama, in R format. The column 'sp' in this table is a code identifying the species in the full census tables (marena.full and marena.stem, components 1-4 and 5-8 above).

    10. RoutputFull.pdf: Detailed documentation of the 'full' tables in Rdata format (components 1-4 above).

    11. RoutputStem.pdf: Detailed documentation of the 'stem' tables in Rdata format (components 5-8 above).

    12. PanamaPlot.txt: Locations of all tree plots and inventories in Panama.
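
    A hedged sketch of loading one of the R Analytical Tables together with the species list and attaching species names via the shared 'sp' code; the object names created by load() are assumed to match the file names:

      # Hypothetical sketch: load a 'full' table and the species table, then join.
      load("marenaRecent.full1.rdata")   # assumed to create object marenaRecent.full1
      load("bci.spptable.rdata")         # assumed to create object bci.spptable

      head(marenaRecent.full1)
      census_with_names <- merge(marenaRecent.full1, bci.spptable, by = "sp")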

  9. Smooth numbers in large prime gaps

    • zenodo.org
    application/gzip
    Updated Feb 24, 2023
    Cite
    Robert M. Guralnick; Robert M. Guralnick; John Shareshian; Russ Woodroofe; Russ Woodroofe; John Shareshian (2023). Smooth numbers in large prime gaps [Dataset]. http://doi.org/10.5281/zenodo.5914768
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Feb 24, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robert M. Guralnick; Robert M. Guralnick; John Shareshian; Russ Woodroofe; Russ Woodroofe; John Shareshian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset contains numbers from 25 up to 1 quadrillion (10^15) that are smooth relative to the gap to the preceding prime. More precisely, we list all numbers n so that

    r + p^a ≤ n

    where r is the largest prime smaller than n - 1, and p^a is the largest prime-power divisor of n. The dataset is the result of a 10-day computation using 15 cores on an Intel Xeon system, running code hosted at GitHub (see "Related identifiers"). The GitHub code checks additional conditions when r is n - 2 and n - 1 is a power of 2, but it is easy and quick to check that when (up to 10^15) n = 2^k + 1, the second largest prime r_2 satisfies r_2 + p^a > n. Thus, this additional check makes no difference in the output.
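
    For concreteness, here is a small base-R sketch (not the authors' GitHub code) that checks the condition for a single n; the function names are illustrative only:

      # Hypothetical sketch: is n "smooth relative to the gap to the preceding prime",
      # i.e. does r + p^a <= n hold, with r the largest prime below n - 1 and
      # p^a the largest prime-power divisor of n?
      is_prime <- function(m) {
        if (m < 2) return(FALSE)
        if (m < 4) return(TRUE)
        if (m %% 2 == 0) return(FALSE)
        d <- 3
        while (d * d <= m) {
          if (m %% d == 0) return(FALSE)
          d <- d + 2
        }
        TRUE
      }

      largest_prime_below <- function(m) {   # largest prime r with r < m
        r <- m - 1
        while (!is_prime(r)) r <- r - 1
        r
      }

      largest_prime_power_divisor <- function(n) {
        best <- 1
        m <- n
        p <- 2
        while (p * p <= m) {
          if (m %% p == 0) {
            pp <- 1
            while (m %% p == 0) { m <- m / p; pp <- pp * p }
            best <- max(best, pp)
          }
          p <- p + 1
        }
        max(best, m)   # any remaining factor m > 1 is prime
      }

      is_oversmooth <- function(n) {
        r  <- largest_prime_below(n - 1)
        pa <- largest_prime_power_divisor(n)
        r + pa <= n
      }

      is_oversmooth(30)   # TRUE, since 23 + 5 <= 30 (cf. the first entry of out-0.g.gz below)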

    Our motivations for computing this data are described in our paper On invariable generation of alternating groups by elements of prime and prime power order (arXiv:2201.12371). Any number n in the range which is not of the given form has the associated alternating group A_n generated by any element of order r together with any element having a certain cycle structure (and of order p^a).

    Description / specification

    The data is stored as compressed text-based input to a computer algebra system, specifically in gzipped GAP format. The file out-k.g.gz holds numbers in the range from (k - 1)⋅10^12 to k⋅10^12. The first line of each file sets the variable invgen_oversmooth_range to be the range (thus, [(k - 1)⋅10^12 .. k⋅10^12]). The subsequent lines set invgen_oversmooth to a list of pairs of numbers [n, p^a], where n is a smooth number as described above, and p^a is the largest prime-power divisor of n. The largest prime preceding n - 1 is given in a GAP comment.

    Thus, the first few lines of out-0.g.gz (when uncompressed) appear as

    invgen_oversmooth_range:=[25..1000000000000];
    invgen_oversmooth := [
     [ 30, 5 ], # bp 23 
     [ 60, 5 ], # bp 53 
     [ 126, 9 ], # bp 113 
     [ 210, 7 ], # bp 199 
     [ 252, 9 ], # bp 241 
     [ 308, 11 ], # bp 293 
     [ 330, 11 ], # bp 317 
     [ 420, 7 ], # bp 409 
    ...

    where [25 .. 1000000000000] is the range considered, and for example "[ 30, 5 ], # bp 23" represents that 23 is the largest prime preceding 30 - 1, 5 is the largest prime-power divisor of 30, and 23 + 5 ≤ 30.

    We created the data in GAP files for ease of inputting into a GAP program in our own use of the data. It is easy to convert the GAP files to another format via standard techniques such as regular-expression-based search and replace. For example, on macOS or Linux, the following command will convert the list in out-0.g.gz to a CSV file, which it will display on the terminal.

    zcat out_quadrillion/out-0.g.gz | sed -En 's/ \[ ([0-9]+), ([0-9]+) \], # bp ([0-9]+)/\1,\2,\3/gp' | less

  10. Water columns properties measured by CTD sensors during seasonal cruises in the Gulf of Alaska for the Northern Gulf of Alaska LTER project, 2018 and 2021

    • dataone.org
    • search.dataone.org
    Updated Jun 1, 2023
    Cite
    Seth Danielson; Elizabeth Dobbins (2023). Water columns properties measured by CTD sensors during seasonal cruises in the Gulf of Alaska for the Northern Gulf of Alaska LTER project, 2018 and 2021 [Dataset]. http://doi.org/10.24431/rw1k459
    Explore at:
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Research Workspace
    Authors
    Seth Danielson; Elizabeth Dobbins
    Time period covered
    Jan 1, 2018 - Jan 1, 2021
    Area covered
    Description

    This data set contains measurements of water properties such as temperature, conductivity, chlorophyll fluorescence, PAR, oxygen, beam attenuation, and beam transmission. These measurements were collected by a Seabird 9 CTD and associated sensors on a CTD rosette lowered from a ship at discrete stations during cruises for the Northern Gulf of Alaska Long Term Ecological Research (NGA LTER) project. Three cruises occurred from 2018 to 2021: spring, summer, and fall. Ships conducting the cruises include R/V Tiglax (TGX), R/V Sikuliaq (SKQ), and R/V Wolstad (WSD). R/V Sikuliaq provides her own CTD instrument, but for the other, smaller ships the instrument is provided by the NGA LTER program. The CTDs have dual temperature and conductivity sensors; from these measurements, salinity and density (as sigma-t) were calculated. CTDs are also outfitted with dual SBE 43 oxygen sensors, a WET Labs C-Star to measure light transmission and attenuation, a WET Labs ECO-AFL/FL to measure chlorophyll fluorescence, and a photosynthetically active radiation (PAR) sensor. Additional instruments include a Deep-SUNA to optically measure nitrate and a LISST to measure particle size and concentration. Another instrument on the rosette (an Underwater Vision Profiler, UVP, whose data are not included here) required a long soak at 30 m that may have impacted depiction of the near-surface stratification.

    Data from each cruise are presented in 3 formats: vertical profiles of 1 dbar averages in netCDF and CSV formats, and data corresponding to the times of Niskin water bottle sampling in CSV format. Each cruise generates one file of each of these types, with all files of each type collected in a zip file containing all years of data for each data type. Raw sensor voltages are included. Although preliminary data for the SUNA and LISST are included in these files, more complete, internally recorded versions of these data will be made available via the NGA LTER program.

    These data are part of the Northern Gulf of Alaska Long Term Ecological Research (NGA LTER) program. The LTER program is a National Science Foundation-funded network of 28 sites nationwide that focus on the influence of long-term and large-scale phenomena on ecosystems. Additional funding for sampling is provided by the North Pacific Research Board (NPRB), the Alaska Ocean Observing System (AOOS), and the Exxon Valdez Oil Spill Trustee Council (EVOS) via the Gulf Watch Alaska program.
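
    A hedged sketch of inspecting one of the 1-dbar-averaged profile files with the ncdf4 package; the file name below is a placeholder and no variable names are assumed:

      # Hypothetical sketch: open a profile netCDF file and list its variables.
      library(ncdf4)

      nc <- nc_open("ctd_profiles_placeholder.nc")   # placeholder file name
      names(nc$var)                                   # list the variables stored in the file
      nc_close(nc)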
