21 datasets found
  1. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
Skadi Loist; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
csv, png, bin (available download formats)
    Dataset updated
    Jul 12, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
Skadi Loist; Evgenia (Zhenya) Samoilova
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
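For a quick look at the two tables in R, the files can be read and cross-checked against each other; a minimal sketch, assuming the files carry a .csv extension and that the unique film ID column is called "film_id" (the actual column names are documented in the codebook):

films_long <- read.csv("1_film-dataset_festival-program_long.csv")
films_wide <- read.csv("1_film-dataset_festival-program_wide.csv")

# every film appears exactly once in the wide table ...
length(unique(films_wide$film_id))   # expected: 9348

# ... but can appear several times in the long table (one row per sampled festival appearance)
head(sort(table(films_long$film_id), decreasing = TRUE))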


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. the information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on the number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on individual festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
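The two title-matching methods can be illustrated with the stringdist R package; this is a minimal sketch for intuition only, not the authors' actual matching code, and the example titles are made up:

library(stringdist)

core_title <- "The Hours"
imdb_titles <- c("The Hours", "The Hourss", "Hours, The", "The House")

# cosine similarity on character 2-grams: values near 1 indicate near-identical titles
cosine_sim <- stringsim(core_title, imdb_titles, method = "cosine", q = 2)

# optimal string alignment (osa) distance: small values tolerate typos and minor variations
osa_dist <- stringdist(core_title, imdb_titles, method = "osa")

data.frame(imdb_titles, cosine_sim, osa_dist)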

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates functions for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films only, to check whether everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

2. Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

    • data.wu.ac.at
    application/unknown
    Updated Aug 29, 2017
    Cite
    Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez
    Explore at:
application/unknown (available download formats)
    Dataset updated
    Aug 29, 2017
    Dataset provided by
    Department of Energy
    License

CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

    The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

    This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
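A minimal R sketch of that per-column preparation step (the original scaling and randomization used Python; the feeder file name, the 2009 start timestamp, and the 0.95 power factor below are illustrative assumptions, not values prescribed by the dataset):

# split one feeder CSV into per-load, headerless files in the format described above
feeder <- read.csv("R1-12.47-1_loads.csv")              # hypothetical feeder file name
start_time <- "2009-01-01 00:00:00"                     # first (absolute) timestamp; match the sample file's format
pf <- 0.95                                              # assumed power factor for synthesizing reactive power

for (bus in names(feeder)) {
  p <- feeder[[bus]]                                    # real power in W (8760 hourly values)
  q <- p * tan(acos(pf))                                # reactive power in VAr from the assumed power factor
  time_col <- c(start_time, rep("+1h", length(p) - 1))  # absolute first value, then relative steps
  load_col <- sprintf("%.1f%+.1fj", p, q)               # P+Qj complex format
  write.table(data.frame(time_col, load_col), file = paste0(bus, ".csv"),
              sep = ",", row.names = FALSE, col.names = FALSE, quote = FALSE)
}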

    Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

    For questions about this dataset, contact andy.hoke@nrel.gov.

    If you find this dataset useful, please mention NREL and cite [1] in your work.

    References:

    [1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

    [2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

    [3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.

3. Data from: Russian Financial Statements Database: A firm-level collection of the universe of financial statements

    • data.niaid.nih.gov
    Updated Mar 14, 2025
    + more versions
    Cite
    Skougarevskiy, Dmitriy (2025). Russian Financial Statements Database: A firm-level collection of the universe of financial statements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14622208
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    Bondarkov, Sergey
    Ledenev, Victor
    Skougarevskiy, Dmitriy
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Area covered
    Russia
    Description

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    • 🔓 First open data set with information on every active firm in Russia.

    • 🗂️ First open financial statements data set that includes non-filing firms.

    • 🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    • 📅 Covers 2011-2023 initially, will be continuously updated.

    • 🏗️ Restores as much data as possible through non-invasive data imputation, statement articulation, and harmonization.

The RFSD is hosted on 🤗 Hugging Face and Zenodo and is stored in Apache Parquet, a structured, column-oriented, compressed binary format, with a yearly partitioning scheme, enabling end-users to query only variables of interest at scale.

    The accompanying paper provides internal and external validation of the data: http://arxiv.org/abs/2501.05841.

    Here we present the instructions for importing the data in R or Python environment. Please consult with the project repository for more information: http://github.com/irlcode/RFSD.

    Importing The Data

You have two options to ingest the data: download the .parquet files manually from Hugging Face or Zenodo, or rely on the 🤗 Hugging Face Datasets library.

    Python

    🤗 Hugging Face Datasets

    It is as easy as:

from datasets import load_dataset
import polars as pl

# This line will download 6.6GB+ of all RFSD data and store it in a 🤗 cache folder
RFSD = load_dataset('irlspbru/RFSD')

# Alternatively, this will download ~540MB with all financial statements for 2023
# to a Polars DataFrame (requires about 8GB of RAM)
RFSD_2023 = pl.read_parquet('hf://datasets/irlspbru/RFSD/RFSD/year=2023/*.parquet')

Please note that the data is not shuffled within a year, meaning that streaming the first n rows will not yield a random sample.

    Local File Import

Importing in Python requires the pyarrow package to be installed.

import pyarrow.dataset as ds
import polars as pl

# Read RFSD metadata from local file
RFSD = ds.dataset("local/path/to/RFSD")

# Use RFSD.schema to glimpse the data structure and columns' classes
print(RFSD.schema)

# Load full dataset into memory
RFSD_full = pl.from_arrow(RFSD.to_table())

# Load only 2019 data into memory
RFSD_2019 = pl.from_arrow(RFSD.to_table(filter=ds.field('year') == 2019))

# Load only revenue for firms in 2019, identified by taxpayer id
RFSD_2019_revenue = pl.from_arrow(
    RFSD.to_table(
        filter=ds.field('year') == 2019,
        columns=['inn', 'line_2110']
    )
)

# Give suggested descriptive names to variables
renaming_df = pl.read_csv('local/path/to/descriptive_names_dict.csv')
RFSD_full = RFSD_full.rename({item[0]: item[1] for item in zip(renaming_df['original'], renaming_df['descriptive'])})

    R

    Local File Import

Importing in R requires the arrow package to be installed.

library(arrow)
library(data.table)

# Read RFSD metadata from local file
RFSD <- open_dataset("local/path/to/RFSD")

# Use schema() to glimpse into the data structure and column classes
schema(RFSD)

# Load full dataset into memory
scanner <- Scanner$create(RFSD)
RFSD_full <- as.data.table(scanner$ToTable())

# Load only 2019 data into memory
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scanner <- scan_builder$Finish()
RFSD_2019 <- as.data.table(scanner$ToTable())

# Load only revenue for firms in 2019, identified by taxpayer id
scan_builder <- RFSD$NewScan()
scan_builder$Filter(Expression$field_ref("year") == 2019)
scan_builder$Project(cols = c("inn", "line_2110"))
scanner <- scan_builder$Finish()
RFSD_2019_revenue <- as.data.table(scanner$ToTable())

# Give suggested descriptive names to variables
renaming_dt <- fread("local/path/to/descriptive_names_dict.csv")
setnames(RFSD_full, old = renaming_dt$original, new = renaming_dt$descriptive)

    Use Cases

    🌍 For macroeconomists: Replication of a Bank of Russia study of the cost channel of monetary policy in Russia by Mogiliat et al. (2024) — interest_payments.md

    🏭 For IO: Replication of the total factor productivity estimation by Kaukin and Zhemkova (2023) — tfp.md

    🗺️ For economic geographers: A novel model-less house-level GDP spatialization that capitalizes on geocoding of firm addresses — spatialization.md

    FAQ

Why should I use this data instead of Interfax's SPARK, Moody's Ruslana, or Kontur's Focus?

    To the best of our knowledge, the RFSD is the only open data set with up-to-date financial statements of Russian companies published under a permissive licence. Apart from being free-to-use, the RFSD benefits from data harmonization and error detection procedures unavailable in commercial sources. Finally, the data can be easily ingested in any statistical package with minimal effort.

    What is the data period?

    We provide financials for Russian firms in 2011-2023. We will add the data for 2024 by July, 2025 (see Version and Update Policy below).

    Why are there no data for firm X in year Y?

    Although the RFSD strives to be an all-encompassing database of financial statements, end users will encounter data gaps:

    We do not include financials for firms that we considered ineligible to submit financial statements to the Rosstat/Federal Tax Service by law: financial, religious, or state organizations (state-owned commercial firms are still in the data).

Eligible firms may enjoy the right not to disclose under certain conditions. For instance, Gazprom did not file in 2022 and we had to impute its 2022 data from 2023 filings. Sibur filed only in 2023, Novatek only in 2020 and 2021. Commercial data providers such as Interfax's SPARK enjoy dedicated access to the Federal Tax Service data and are therefore able to source this information elsewhere.

A firm may have submitted its annual statement but, according to the Uniform State Register of Legal Entities (EGRUL), was not active in that year. We remove those filings.

    Why is the geolocation of firm X incorrect?

    We use Nominatim to geocode structured addresses of incorporation of legal entities from the EGRUL. There may be errors in the original addresses that prevent us from geocoding firms to a particular house. Gazprom, for instance, is geocoded up to a house level in 2014 and 2021-2023, but only at street level for 2015-2020 due to improper handling of the house number by Nominatim. In that case we have fallen back to street-level geocoding. Additionally, streets in different districts of one city may share identical names. We have ignored those problems in our geocoding and invite your submissions. Finally, address of incorporation may not correspond with plant locations. For instance, Rosneft has 62 field offices in addition to the central office in Moscow. We ignore the location of such offices in our geocoding, but subsidiaries set up as separate legal entities are still geocoded.

    Why is the data for firm X different from https://bo.nalog.ru/?

    Many firms submit correcting statements after the initial filing. While we have downloaded the data way past the April, 2024 deadline for 2023 filings, firms may have kept submitting the correcting statements. We will capture them in the future releases.

    Why is the data for firm X unrealistic?

    We provide the source data as is, with minimal changes. Consider a relatively unknown LLC Banknota. It reported 3.7 trillion rubles in revenue in 2023, or 2% of Russia's GDP. This is obviously an outlier firm with unrealistic financials. We manually reviewed the data and flagged such firms for user consideration (variable outlier), keeping the source data intact.

    Why is the data for groups of companies different from their IFRS statements?

    We should stress that we provide unconsolidated financial statements filed according to the Russian accounting standards, meaning that it would be wrong to infer financials for corporate groups with this data. Gazprom, for instance, had over 800 affiliated entities and to study this corporate group in its entirety it is not enough to consider financials of the parent company.

    Why is the data not in CSV?

    The data is provided in Apache Parquet format. This is a structured, column-oriented, compressed binary format allowing for conditional subsetting of columns and rows. In other words, you can easily query financials of companies of interest, keeping only variables of interest in memory, greatly reducing data footprint.

    Version and Update Policy

    Version (SemVer): 1.0.0.

We intend to update the RFSD annually as the data becomes available, in other words when most of the firms have their statements filed with the Federal Tax Service. The official deadline for filing of previous year statements is April 1. However, every year a portion of firms either fails to meet the deadline or submits corrections afterwards. Filing continues up to the very end of the year, but after the end of April this stream quickly thins out. Nevertheless, there is obviously a trade-off between data completeness and timely version availability. We find it a reasonable compromise to query new data in early June, since on average by the end of May 96.7% of statements are already filed, including 86.4% of all the correcting filings. We plan to make a new version of the RFSD available by July.

    Licence

    Creative Commons License Attribution 4.0 International (CC BY 4.0).

    Copyright © the respective contributors.

    Citation

    Please cite as:

@unpublished{bondarkov2025rfsd,
  title={{R}ussian {F}inancial {S}tatements {D}atabase},
  author={Bondarkov, Sergey and Ledenev, Victor and Skougarevskiy, Dmitriy},
  note={arXiv preprint arXiv:2501.05841},
  doi={https://doi.org/10.48550/arXiv.2501.05841},
  year={2025}
}

    Acknowledgments and Contacts

    Data collection and processing: Sergey Bondarkov, sbondarkov@eu.spb.ru, Viktor Ledenev, vledenev@eu.spb.ru

    Project conception, data validation, and use cases: Dmitriy Skougarevskiy, Ph.D.,

4. Data and tools for studying isograms

    • figshare.com
    Updated Jul 31, 2017
    Cite
    Florian Breit (2017). Data and tools for studying isograms [Dataset]. http://doi.org/10.6084/m9.figshare.5245810.v1
    Explore at:
application/x-sqlite3 (available download formats)
    Dataset updated
    Jul 31, 2017
    Dataset provided by
    figshare
    Authors
    Florian Breit
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

A collection of datasets and python scripts for extraction and analysis of isograms (and some palindromes and tautonyms) from corpus-based word-lists, specifically Google Ngram and the British National Corpus (BNC). Below follows a brief description, first, of the included datasets and, second, of the included scripts.

1. Datasets

The data from English Google Ngrams and the BNC is available in two formats: as a plain text CSV file and as a SQLite3 database.

1.1 CSV format

The CSV files for each dataset actually come in two parts: one labelled ".csv" and one ".totals". The ".csv" contains the actual extracted data, and the ".totals" file contains some basic summary statistics about the ".csv" dataset with the same name.

The CSV files contain one row per data point, with the columns separated by a single tab stop. There are no labels at the top of the files. Each line has the following columns, in this order (the labels below are what I use in the database, which has an identical structure, see section below):

    Label Data type Description

    isogramy int The order of isogramy, e.g. "2" is a second order isogram

    length int The length of the word in letters

    word text The actual word/isogram in ASCII

    source_pos text The Part of Speech tag from the original corpus

    count int Token count (total number of occurences)

    vol_count int Volume count (number of different sources which contain the word)

    count_per_million int Token count per million words

    vol_count_as_percent int Volume count as percentage of the total number of volumes

    is_palindrome bool Whether the word is a palindrome (1) or not (0)

    is_tautonym bool Whether the word is a tautonym (1) or not (0)
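A minimal R sketch for reading one of the ".csv" files with these column labels (the file name matches the naming expected later by the database script; adjust the path as needed):

cols <- c("isogramy", "length", "word", "source_pos", "count", "vol_count",
          "count_per_million", "vol_count_as_percent", "is_palindrome", "is_tautonym")

# tab-separated, no header row
ngrams <- read.delim("ngrams-isograms.csv", header = FALSE, sep = "\t",
                     col.names = cols, quote = "")
head(ngrams)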

    The ".totals" files have a slightly different format, with one row per data point, where the first column is the label and the second column is the associated value. The ".totals" files contain the following data:

    Label

    Data type

    Description

    !total_1grams

    int

    The total number of words in the corpus

    !total_volumes

    int

    The total number of volumes (individual sources) in the corpus

    !total_isograms

    int

    The total number of isograms found in the corpus (before compacting)

    !total_palindromes

    int

    How many of the isograms found are palindromes

    !total_tautonyms

    int

    How many of the isograms found are tautonyms

The CSV files are mainly useful for further automated data processing. For working with the data set directly (e.g. to do statistics or cross-check entries), I would recommend using the database format described below.

1.2 SQLite database format

On the other hand, the SQLite database combines the data from all four of the plain text files, and adds various useful combinations of the two datasets, namely:

• Compacted versions of each dataset, where identical headwords are combined into a single entry.
• A combined compacted dataset, combining and compacting the data from both Ngrams and the BNC.
• An intersected dataset, which contains only those words which are found in both the Ngrams and the BNC dataset.

The intersected dataset is by far the least noisy, but is missing some real isograms, too. The columns/layout of each of the tables in the database is identical to that described for the CSV/.totals files above. To get an idea of the various ways the database can be queried for various bits of data, see the R script described below, which computes statistics based on the SQLite database.

2. Scripts

There are three scripts: one for tidying Ngram and BNC word lists and extracting isograms, one to create a neat SQLite database from the output, and one to compute some basic statistics from the data. The first script can be run using Python 3, the second script can be run using SQLite 3 from the command line, and the third script can be run in R/RStudio (R version 3).

2.1 Source data

The scripts were written to work with word lists from Google Ngram and the BNC, which can be obtained from http://storage.googleapis.com/books/ngrams/books/datasetsv2.html and https://www.kilgarriff.co.uk/bnc-readme.html (download all.al.gz). For Ngram the script expects the path to the directory containing the various files, for BNC the direct path to the *.gz file.

2.2 Data preparation

Before processing proper, the word lists need to be tidied to exclude superfluous material and some of the most obvious noise. This will also bring them into a uniform format. Tidying and reformatting can be done by running one of the following commands:

python isograms.py --ngrams --indir=INDIR --outfile=OUTFILE
python isograms.py --bnc --indir=INFILE --outfile=OUTFILE

Replace INDIR/INFILE with the input directory or filename and OUTFILE with the filename for the tidied and reformatted output.

2.3 Isogram Extraction

After preparing the data as above, isograms can be extracted by running the following command on the reformatted and tidied files:

python isograms.py --batch --infile=INFILE --outfile=OUTFILE

Here INFILE should refer to the output from the previous data cleaning process. Please note that the script will actually write two output files, one named OUTFILE with a word list of all the isograms and their associated frequency data, and one named "OUTFILE.totals" with very basic summary statistics.

2.4 Creating a SQLite3 database

The output data from the above step can be easily collated into a SQLite3 database which allows for easy querying of the data directly for specific properties. The database can be created by following these steps:

1. Make sure the files with the Ngrams and BNC data are named "ngrams-isograms.csv" and "bnc-isograms.csv" respectively. (The script assumes you have both of them; if you only want to load one, just create an empty file for the other one.)
2. Copy the "create-database.sql" script into the same directory as the two data files.
3. On the command line, go to the directory where the files and the SQL script are.
4. Type: sqlite3 isograms.db < create-database.sql
5. This will create a database called "isograms.db".

See section 1 for a basic description of the output data and how to work with the database.

2.5 Statistical processing

The repository includes an R script (R version 3) named "statistics.r" that computes a number of statistics about the distribution of isograms by length, frequency, contextual diversity, etc. This can be used as a starting point for running your own stats. It uses RSQLite to access the SQLite database version of the data described above.
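Once created, the database can also be queried directly from R via RSQLite, as in the statistics script mentioned above; a minimal sketch (the table name "ngrams" is an assumption; check create-database.sql or dbListTables() for the actual table names):

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "isograms.db")
dbListTables(con)   # inspect the actual table names

# e.g. the ten highest-frequency second-order isograms (table name assumed)
dbGetQuery(con, "SELECT word, count FROM ngrams WHERE isogramy = 2 ORDER BY count DESC LIMIT 10")

dbDisconnect(con)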

5. The Dynamics of Collective Action Corpus

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 7, 2023
    Cite
    Taylor, Marshall A. (2023). The Dynamics of Collective Action Corpus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8414334
    Explore at:
    Dataset updated
    Oct 7, 2023
    Dataset provided by
    Stoltz, Dustin S.
    Dudley, Jennifer S.K.
    Taylor, Marshall A.
    License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This repository includes two datasets, a Document-Term Matrix and associated metadata, for 17,493 New York Times articles covering protest events, both saved as single R objects.

These datasets are based on the original Dynamics of Collective Action (DoCA) dataset (Wang and Soule 2012; Earl, Soule, and McCarthy). The original DoCA dataset contains variables for protest events referenced in roughly 19,676 New York Times articles reporting on collective action events occurring in the US between 1960 and 1995. Data were collected as part of the Dynamics of Collective Action Project at Stanford University. Research assistants read every page of all daily issues of the New York Times to find descriptions of 23,624 distinct protest events. The text for the news articles was not included in the original DoCA data.

    We attempted to recollect the raw text in a semi-supervised fashion by matching article titles to create the Dynamics of Collective Action Corpus. In addition to hand-checking random samples and hand-collecting some articles (specifically, in the case of false positives), we also used some automated matching processes to ensure the recollected article titles matched their respective titles in the DoCA dataset. The final number of recollected and matched articles is 17,493.

We then subset the original DoCA dataset to include only rows that match a recollected article. The "20231006_dca_metadata_subset.Rds" contains all of the metadata variables from the original DoCA dataset (see Codebook), with the addition of "pdf_file" and "pub_title", which is the title of the recollected article (and may differ from the "title" variable in the original dataset), for a total of 106 variables and 21,126 rows (noting that a row is a distinct protest event and one article may cover more than one protest event).

    Once collected, we prepared these texts using typical preprocessing procedures (and some less typical procedures, which were necessary given that these were OCRed texts). We followed these steps in this order: We removed headers and footers that were consistent across all digitized stories and any web links or HTML; added a single space before an uppercase letter when it was flush against a lowercase letter to its right (e.g., turning "JohnKennedy'' into "John Kennedy''); removed excess whitespace; converted all characters to the broadest range of Latin characters and then transliterated to ``Basic Latin'' ASCII characters; replaced curly quotes with their ASCII counterparts; replaced contractions (e.g., turned "it's'' into "it is''); removed punctuation; removed capitalization; removed numbers; fixed word kerning; applied a final extra round of whitespace removal.

We then tokenized them by following the rule that each word is a character string surrounded by a single space. At this step, each document is then a list of tokens. We count each unique token to create a document-term matrix (DTM), where each row is an article, each column is a unique token (occurring at least once in the corpus as a whole), and each cell is the number of times each token occurred in each article. Finally, we removed words (i.e., columns in the DTM) that occurred less than four times in the corpus as a whole or were only a single character in length (likely orphaned characters from the OCRing process). The final DTM has 66,552 unique words, 10,134,304 total tokens, and 17,493 documents. The "20231006_dca_dtm.Rds" is a sparse matrix class object from the Matrix R package.

In R, use the load() function to load the objects dca_dtm and dca_meta. To associate the dca_meta with the dca_dtm, match the "pdf_file" variable in dca_meta to the rownames of dca_dtm.
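A minimal sketch of that workflow, using the file names listed above (the token-count column added at the end is only an illustration):

library(Matrix)   # dca_dtm is a sparse matrix class object from the Matrix package

load("20231006_dca_dtm.Rds")              # loads dca_dtm
load("20231006_dca_metadata_subset.Rds")  # loads dca_meta

# align metadata rows (protest events) with DTM rows (articles) via the pdf file name
row_idx <- match(dca_meta$pdf_file, rownames(dca_dtm))
dca_meta$n_tokens <- rowSums(dca_dtm)[row_idx]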

6. Annual maps of cropland abandonment, land cover, and other derived data for time-series analysis of cropland abandonment

    • repository.soilwise-he.eu
    • data.niaid.nih.gov
    • +1more
    Updated Apr 2, 2022
    Cite
    (2022). Annual maps of cropland abandonment, land cover, and other derived data for time-series analysis of cropland abandonment [Dataset]. http://doi.org/10.5281/zenodo.5348287
    Explore at:
    Dataset updated
    Apr 2, 2022
    Description

Open Access

This archive contains raw annual land cover maps, cropland abandonment maps, and accompanying derived data products to support: Crawford C.L., Yin, H., Radeloff, V.C., and Wilcove, D.S. 2022. Rural land abandonment is too ephemeral to provide major benefits for biodiversity and climate. Science Advances doi.org/10.1126/sciadv.abm8999. An archive of the analysis scripts developed for this project can be found at: https://github.com/chriscra/abandonment_trajectories (https://doi.org/10.5281/zenodo.6383127).

Note that the label '_2022_02_07' in many file names refers to the date of the primary analysis. "dts" or "dt" refer to "data.tables," large .csv files that were manipulated using the data.table package in R (Dowle and Srinivasan 2021, http://r-datatable.com/). "Rasters" refer to ".tif" files that were processed using the raster and terra packages in R (Hijmans, 2022; https://rspatial.org/terra/; https://rspatial.org/raster).

Data files fall into one of four categories of data derived during our analysis of abandonment: observed, potential, maximum, or recultivation. Derived datasets also follow the same naming convention, though are aggregated across sites. These four categories are as follows (using "age_dts" for our site in Shaanxi Province, China as an example):

- observed abandonment identified through our primary analysis, with a threshold of five years. These files do not have a specific label beyond the description of the file and the date of analysis (e.g., shaanxi_age_2022_02_07.csv);
- potential abandonment for a scenario without any recultivation, in which abandoned croplands are left abandoned from the year of initial abandonment through the end of the time series, with the label "_potential" (e.g., shaanxi_potential_age_2022_02_07.csv);
- maximum age of abandonment over the course of the time series, with the label "_max" (e.g., shaanxi_max_age_2022_02_07.csv);
- recultivation periods, corresponding to the lengths of recultivation periods following abandonment, given the label "_recult" (e.g., shaanxi_recult_age_2022_02_07.csv).

This archive includes multiple .zip files, the contents of which are described below:

age_dts.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for, as of that year, also referred to as length, duration, etc.), for each year between 1987-2017 for all 11 sites. These maps are stored as .csv files, where each row is a pixel, the first two columns refer to the x and y coordinates (in terms of longitude and latitude), and subsequent columns contain the abandonment age values for an individual year (where years are labeled with 'y' followed by the year, e.g., 'y1987'). Maps are given with a latitude and longitude coordinate reference system. Folder contains observed age, potential age ("_potential"), maximum age ("_max"), and recultivation lengths ("_recult") for all sites. Maximum age .csv files include only three columns: x, y, and the maximum length (i.e., "max age", in years) for each pixel throughout the entire time series (1987-2017). Files were produced using the custom functions "cc_filter_abn_dt()," "cc_calc_max_age()," "cc_calc_potential_age()," and "cc_calc_recult_age();" see "_util/_util_functions.R."

age_rasters.zip - Maps of abandonment age (i.e., how long each pixel has been abandoned for), for each year between 1987-2017 for all 11 sites. Maps are stored as .tif files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Folder contains observed age, potential age ("_potential"), and maximum age ("_max") rasters for all sites. Maximum age rasters include just one band ("layer"). These rasters match the corresponding .csv files contained in "age_dts.zip."

derived_data.zip - summary datasets created throughout this analysis, listed below.

diff.zip - .csv files for each of our eleven sites containing the year-to-year lagged differences in abandonment age (i.e., length of time abandoned) for each pixel. The rows correspond to a single pixel of land, and the columns refer to the year the difference is in reference to. These rows do not have longitude or latitude values associated with them; however, rows correspond to the same rows in the .csv files in "input_data.tables.zip" and "age_dts.zip." These files were produced using the custom function "cc_diff_dt()" (much like the base R function "diff()"), contained within the custom function "cc_filter_abn_dt()" (see "_util/_util_functions.R"). Folder contains diff files for observed abandonment, potential abandonment ("_potential"), and recultivation lengths ("_recult") for all sites.

input_dts.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment (https://doi.org/10.1016/j.rse.2020.111873). Like "age_dts," these maps are stored as .csv files, where each row is a pixel and the first two columns refer to x and y coordinates (in terms of longitude and latitude). Subsequent columns contain the land cover class for an individual year (e.g., 'y1987'). Note that these maps were recoded from Yin et al. 2020 so that land cover classification was consistent across sites (see below). This contains two files for each site: the raw land cover maps from Yin et al. 2020 (after recoding), and a "clean" version produced by applying 5- and 8-year temporal filters to the raw input (see custom function "cc_temporal_filter_lc()," in "_util/_util_functions.R" and "1_prep_r_to_dt.R"). These files correspond to those in "input_rasters.zip," and serve as the primary inputs for the analysis.

input_rasters.zip - annual land cover maps for eleven sites with four land cover classes (see below), adapted from Yin et al. 2020 Remote Sensing of Environment. Maps are stored as ".tif" files, where each band corresponds to one of the 31 years in our analysis (1987-2017), in ascending order (i.e., the first layer is 1987 and the 31st layer is 2017). Maps are given with a latitude and longitude coordinate reference system. Note that these maps were recoded so that land cover classes matched across sites (see below). Contains two files for each site: the raw land cover maps (after recoding), and a "clean" version that has been processed with 5- and 8-year temporal filters (see above). These files match those in "input_dts.zip."

length.zip - .csv files containing the length (i.e., age or duration, in years) of each distinct individual period of abandonment at each site. This folder contains length files for observed and potential abandonment, as well as recultivation lengths. Produced using the custom functions "cc_filter_abn_dt()" and "cc_extract_length();" see "_util/_util_functions.R."

derived_data.zip contains the following files:

"site_df.csv" - a simple .csv containing descriptive information for each of our eleven sites, along with the original land cover codes used by Yin et al. 2020 (updated so that all eleven sites are consistent in how land cover classes were coded; see below).

Primary derived datasets for both observed abandonment ("area_dat") and potential abandonment ("potential_area_dat"):

area_dat - Shows the area (in ha) in each land cover class at each site in each year (1987-2017), along with the area of cropland abandoned in each year following a five-year abandonment threshold (abandoned for >=5 years) or no threshold (abandoned for >=1 years). Produced using custom functions "cc_calc_area_per_lc_abn()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R."

persistence_dat - A .csv containing the area of cropland abandoned (ha) for a given "cohort" of abandoned cropland (i.e., a group of cropland abandoned in the same year, also called "year_abn") in a specific year. This area is also given as a proportion of the initial area abandoned in each cohort, or the area of each cohort when it was first classified as abandoned at year 5 ("initial_area_abn"). The "age" is given as the number of years since a given cohort of abandoned cropland was last actively cultivated, and "time" is marked relative to the 5th year, when our five-year definition first classifies that land as abandoned (and where the proportion of abandoned land remaining abandoned is 1). Produced using custom functions "cc_calc_persistence()" via "cc_summarize_abn_dts()". See scripts "cluster/2_analyze_abn.R" and "_util/_util_functions.R." This serves as the main input for our linear models of recultivation ("decay") trajectories.

turnover_dat - A .csv showing the annual gross gain, annual gross loss, and annual net change in the area (in ha) of abandoned cropland at each site in each year of the time series. Produced using custom functions "cc_calc_abn_diff()" via "cc_summarize_abn_dts()" (see "_util/_util_functions.R"), implemented in "cluster/2_analyze_abn.R." This file is only produced for observed abandonment.

Area summary files (for observed abandonment only):

area_summary_df - Contains a range of summary values relating to the area of cropland abandonment for each of our eleven sites. All area values are given in hectares (ha) unless stated otherwise. It contains 16 variables as columns, including 1) "site," 2) "total_site_area_ha_2017" - the total site area (ha) in 2017, 3) "cropland_area_1987" - the area in cropland in 1987 (ha), 4) "area_abn_ha_2017" -

7. Australian Public Holidays Dates Machine Readable Dataset

    • researchdata.edu.au
    Updated Feb 24, 2014
    Cite
    Department of the Prime Minister and Cabinet (2014). Australian Public Holidays Dates Machine Readable Dataset [Dataset]. https://researchdata.edu.au/australian-public-holidays-readable-dataset/2995651
    Explore at:
    Dataset updated
    Feb 24, 2014
    Dataset provided by
    data.gov.au
    Authors
    Department of the Prime Minister and Cabinet
    License

Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Area covered
    Australia
    Description

The Department of the Prime Minister and Cabinet is no longer maintaining this dataset. If you would like to take ownership of this dataset for ongoing maintenance please contact us.

PLEASE READ BEFORE USING

The data format has been updated to align with a tidy data style (http://vita.had.co.nz/papers/tidy-data.html).

The data in this dataset is manually collected and combined in a csv format from the following state and territory portals:

- https://www.cmtedd.act.gov.au/communication/holidays
- https://www.nsw.gov.au/about-nsw/public-holidays
- https://nt.gov.au/nt-public-holidays
- https://www.qld.gov.au/recreation/travel/holidays/public
- https://www.safework.sa.gov.au/resources/public-holidays
- https://worksafe.tas.gov.au/topics/laws-and-compliance/public-holidays
- https://business.vic.gov.au/business-information/public-holidays
- https://www.commerce.wa.gov.au/labour-relations/public-holidays-western-australia

The data API by default returns only the first 100 records. The JSON response will contain a key that shows the link for the next page of records. Alternatively you can view all records by updating the limit on the endpoint or using a query to select all records, i.e. /api/3/action/datastore_search_sql?sql=SELECT * from "{{resource_id}}".
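A minimal R sketch of the SQL endpoint described above (the base URL assumes the standard data.gov.au CKAN API host, and the resource id is a placeholder to be replaced with the id of the CSV resource you want to query):

library(jsonlite)

resource_id <- "RESOURCE_ID_HERE"   # placeholder resource id
base_url <- "https://data.gov.au/api/3/action/datastore_search_sql"
sql <- sprintf('SELECT * from "%s"', resource_id)

resp <- fromJSON(paste0(base_url, "?sql=", URLencode(sql, reserved = TRUE)))
holidays <- resp$result$records
head(holidays)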

8. Global Alien Species First Record Database

    • dataportal.senckenberg.de
xlsx
    Updated May 15, 2025
    Cite
    Seebens et al. (2025). Global Alien Species First Record Database [Dataset]. http://doi.org/10.12761/sgn.2016.01.022
    Explore at:
xlsx (available download formats)
    Dataset updated
    May 15, 2025
    Dataset provided by
    Senckenberg - Data Stock (general)
    Authors
    Seebens et al.
    Time period covered
    1500 - 2015
    Description

    The Global Alien Species First Record Database represents a compilation of first records of alien species across taxonomic groups and regions.

    A first record denotes the year of first observation of an alien species in a region. Note that this often differs from the date of first introduction. The database covers all regions (mostly countries and some islands) globally with particularly intense sampling in Europe, North America and Australasia. First records were gathered from various data sources including online databases, scientific publications, reports and personal collections by a team of >45 researchers. A full list of data sources, an analysis of global and continental trends and more details about the data can be found in our open access publication: Seebens et al. (2017) No saturation in the accumulation of alien species worldwide. Nature Communications 8, 14435.

    Note that species names and first records may deviate from the original information, which was necessary to harmonise data files. Original information is provided in the most recent files.

Note that first records are sampled unevenly in space and time and across taxonomic groups, and thus first records are affected by sampling biases. From our experience, analyses on a continental or global scale are rather robust, while analyses on national levels should be interpreted carefully. For national analyses, we strongly recommend consulting the original data sources to check sampling methods, quality, etc. individually.

The first record database will be irregularly updated and the most recent version is indicated by the version number. Newer versions are accessible via Zenodo: https://doi.org/10.5281/zenodo.10039630

    Here, we provide several files: (1) The annual number of first records per taxonomic group and continent in an excel file, which represents the aggregated data used for most of the analyses in our paper (Seebens et al. Nat Comm). (2) The R code for the implementation of the invasion model used in the paper. (3) A more detailed data set with the first records of individual species in a region. This data set represents only a subset (~77%) of the full database as some data were not publicly accessible. This data set will be irregularly updated and may differ from the data set used in our paper. All data are free of use for non-commercial purposes with proper citation of Seebens et al. (2017) Nat Comm 8, 14435. (4) A substantially updated version of the First Record Database (vs 1.2) used in our second publication: Seebens et al. (2018) Global rise in emerging alien species results from increased accessibility of new source pools. PNAS 115(10), E2264-E2273.

    Please, do not ask the contact person for data, but download it at Zenodo: https://doi.org/10.5281/zenodo.10039630 - Thanks!

9. Chapter 3 of the Working Group I Contribution to the IPCC Sixth Assessment Report - data for Figure 3.21 (v20220613)

    • data-search.nerc.ac.uk
    • catalogue.ceda.ac.uk
    Updated Oct 4, 2023
    Cite
    (2023). Chapter 3 of the Working Group I Contribution to the IPCC Sixth Assessment Report - data for Figure 3.21 (v20220613) [Dataset]. https://data-search.nerc.ac.uk/geonetwork/srv/search?keyword=AR6
    Explore at:
    Dataset updated
    Oct 4, 2023
    Description

Data for Figure 3.21 from Chapter 3 of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6). Figure 3.21 shows the seasonal evolution of observed and simulated Arctic and Antarctic sea ice area (SIA) over 1979-2017.

---------------------------------------------------
How to cite this dataset
---------------------------------------------------

When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates: Eyring, V., N.P. Gillett, K.M. Achuta Rao, R. Barimalala, M. Barreiro Parrillo, N. Bellouin, C. Cassou, P.J. Durack, Y. Kosaka, S. McGregor, S. Min, O. Morgenstern, and Y. Sun, 2021: Human Influence on the Climate System. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 423–552, doi:10.1017/9781009157896.005.

---------------------------------------------------
Figure subpanels
---------------------------------------------------

The figure has several subplots, but they are unidentified, so the data is stored in the parent directory.

---------------------------------------------------
List of data provided
---------------------------------------------------

This dataset contains Sea Ice Area anomalies over 1979-2017 relative to the 1979-2000 means from:

- Observations (OSISAF, NASA Team, and Bootstrap)
- Historical simulations from CMIP5 and CMIP6 multi-model means
- Natural only simulations from CMIP5 and CMIP6 multi-model means

---------------------------------------------------
Data provided in relation to figure
---------------------------------------------------

- arctic files are used for the plots on the left side of the figure
- antarctic files are used for the plots on the right side of the figure
- _OBS_NASATeam files are used for the first row of the plot
- _OBS_Bootstrap files are used for the second row of the plot
- _OBS_OSISAF files are used for the third row of the plot
- _ALL_CMIP5 files are used in the fourth row of the plot
- _ALL_CMIP6 files are used in the fifth row of the plot
- _NAT_CMIP5 files are used in the sixth row of the plot
- _NAT_CMIP6 files are used in the seventh row of the plot

---------------------------------------------------
Notes on reproducing the figure from the provided data
---------------------------------------------------

The significance values correspond to the grey dots and contain either NaN or 1; they have to be overplotted onto the coloured squares. Grey dots indicate multi-model mean anomalies stronger than the inter-model spread (beyond ± 1 standard deviation). The coordinates of the data are indices; the global attribute 'comments' of each file relates these indices to months, since months are the y coordinate.

---------------------------------------------------
Sources of additional information
---------------------------------------------------

The following weblinks are provided in the Related Documents section of this catalogue record:

- Link to the report component containing the figure (Chapter 3)
- Link to the Supplementary Material for Chapter 3, which contains details on the input data used in Table 3.SM.1
- Link to the code for the figure, archived on Zenodo.

  10. Current Population Survey (CPS)

    • dataverse.harvard.edu
    • search.dataone.org
    Updated May 30, 2013
    Cite
    Anthony Damico (2013). Current Population Survey (CPS) [Dataset]. http://doi.org/10.7910/DVN/AK4FDD
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 30, 2013
    Dataset provided by
    Harvard Dataverse
    Authors
    Anthony Damico
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    analyze the current population survey (cps) annual social and economic supplement (asec) with r. the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show. this new github repository contains three scripts:

    2005-2012 asec - download all microdata.R: download the fixed-width file containing household, family, and person records; import by separating this file into three tables, then merge 'em together at the person-level; download the fixed-width file containing the person-level replicate weights; merge the rectangular person-level file with the replicate weights, then store it in a sql database; create a new variable - one - in the data table

    2012 asec - analysis examples.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; perform a boatload of analysis examples

    replicate census estimates - 2011.R: connect to the sql database created by the 'download all microdata' program; create the complex sample survey object, using the replicate weights; match the sas output shown in the png file below

    2011 asec replicate weight sas output.png: statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document.

    click here to view these three scripts. for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page, the bureau of labor statistics' current population survey page, or the current population survey's wikipedia article.

    notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research. confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
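    to give a flavor of the import step described above, here is a minimal sketch (not the repository's actual script): it only parses the nber sas layout and reads the fixed-width file, whereas the real 'download all microdata' program also attaches the replicate weights and stores everything in a sql database. both file paths below are hypothetical placeholders.

```r
# a minimal sketch, assuming the nber sas importation code and the asec
# fixed-width microdata have already been downloaded; both paths are hypothetical.
library(SAScii)

sas_instructions <- "cpsmar2012.sas"        # nber sas input statements (hypothetical path)
fixed_width_file <- "asec2012_pubuse.dat"   # asec fixed-width microdata (hypothetical path)

# parse the sas INPUT block to recover column names, widths and decimal divisors
layout <- parse.SAScii( sas_instructions )
head( layout )

# read the fixed-width file into a data frame using that layout
asec <- read.SAScii( fixed_width_file , sas_instructions )
nrow( asec )
```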

  11. Data from: Special Education Indicators

    • educationtocareer.data.mass.gov
    application/rdfxml +5
    Updated Apr 23, 2025
    Cite
    Department of Elementary and Secondary Education (2025). Special Education Indicators [Dataset]. https://educationtocareer.data.mass.gov/w/yamx-769q/default?cur=JpoZeFcQO7_&from=J_bUGOUd86e
    Explore at:
    Available download formats: xml, json, application/rssxml, csv, application/rdfxml, tsv
    Dataset updated
    Apr 23, 2025
    Dataset authored and provided by
    Department of Elementary and Secondary Education
    Description

    This dataset contains special education indicators since 2017. It is a long file that contains multiple rows for each district, with rows for different years, comparing students with disabilities, students without disabilities, and all students on a wide range of indicators. Not all indicators are available for all years. For definitions of each indicator, please visit the RADAR Special Education Dashboard.
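    As an illustration of the long format just described, the sketch below filters one indicator and summarises it by year and student group. The file name and column names are hypothetical placeholders; the actual headers are defined by the published dataset.

```r
# A minimal sketch, assuming the csv export of this dataset has been downloaded;
# the file name and column names are hypothetical placeholders.
library(dplyr)

indicators <- read.csv("special_education_indicators.csv", stringsAsFactors = FALSE)

indicators %>%
  filter(indicator == "4-year cohort graduation rate") %>%   # hypothetical column name
  group_by(year, student_group) %>%                          # hypothetical column names
  summarise(mean_value = mean(value, na.rm = TRUE), .groups = "drop")
```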

    Resource Allocation and District Action Reports (RADAR) enable district leaders to compare their staffing, class size, special education services, school performance, and per-pupil spending data with similar districts. They are intended to support districts in making effective strategic decisions as they develop district plans and budgets.

    This dataset is one of five containing the same data that is also published in the RADAR Special Education Dashboard:
    • Special Education Program Characteristics and Student Demographics
    • Special Education Placement Trajectory
    • Students Moving In and Out of Special Education Services
    • Special Education Indicators
    • Special Education Student Progression from High School through Postsecondary Education

    Below is a list of indicators that are included within the dataset. Note: "Student progression from high school through second year of postsecondary education" and "Student progression from high school through postsecondary degree completion" are available for download in this companion dataset. These two indicators are separate from the main Special Education Indicators download since the data are in a different format.

    List of Indicators

    Context

    • Stability rate (enrolled all year)
    • Student Enrollment
    Student Outcomes
    • 4-year cohort graduation rate
    • 5-year cohort graduation rate
    • 9th to 10th grade promotion rate (first-time 9th graders only)
    • Annual dropout rate
    • Chronically absent rate (% of students absent 10% or more each year)
    • Student attendance rate
    • Students absent 10 or more days each year
    • Students suspended in school at least once
    • Students suspended out-of-school at least once
    Assessments (Next Gen MCAS)
    • Average student growth percentiles (SGP) in ELA (Grades 3-8)
    • Average student growth percentiles (SGP) in ELA (Grade 10)
    • Average student growth percentiles (SGP) in math (Grades 3-8)
    • Average student growth percentiles (SGP) in math (Grade 10)
    • Meeting or exceeding expectations on ELA (Grades 3-8)
    • Meeting or exceeding expectations on ELA (Grade 10)
    • Meeting or exceeding expectations on math (Grades 3-8)
    • Meeting or exceeding expectations on math (Grade 10)
    • Meeting or exceeding expectations on science (Grades 5 and 8)
    • Meeting or exceeding expectations on science (Grade 10)
    Assessments (AP and SAT)
    • Jr / Sr AP test takers scoring 3 or above
    • Jr / Sr enrolled in one or more AP / IB courses
    • Jr / Sr who took AP courses and participated in one or more AP tests
    • SAT average score - Mathematics
    • SAT average score - reading
    Program of Study
    • 12th graders passing a full year of mathematics coursework
    • 12th graders passing a full year of science and technology/engineering coursework
    • 9th graders completing and passing all courses
    • High school graduates who completed MassCore
    Postsecondary Outcomes
    Special Education Staff
    • Special education director FTE
    • Special education teachers per 100 SWD
    • Special education paraprofessionals per 100 SWD
    • Special education support staff per 100 SWD

  12. Data from: Marine turtle sightings, strandings and captures in French waters...

    • obis.org
    • gbif.org
    • +1more
    zip
    Updated Apr 24, 2021
    Cite
    Duke University (2021). Marine turtle sightings, strandings and captures in French waters 1990-2003 [Dataset]. https://obis.org/dataset/fe0652c6-1899-49b5-a013-5aa94b45813f
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 24, 2021
    Dataset authored and provided by
    Duke University
    Time period covered
    1990 - 2003
    Area covered
    French
    Description

    Original provider: Matthew Witt, University of Exeter

    Dataset credits: Matthew Witt, University of Exeter

    Abstract: We present data spanning approximately 100 years regarding the spatial and temporal occurrence of marine turtle sightings and strandings in the northeast Atlantic from two public recording schemes and demonstrate potential signals of changing population status. Records of loggerhead (n = 317) and Kemp’s ridley (n = 44) turtles occurring on the European continental shelf were most prevalent during the autumn and winter, when waters were coolest. In contrast, endothermic leatherback turtles (n = 1,668) were most common during the summer. Analysis of the spatial distribution of hard-shell marine turtle sightings and strandings highlights a pattern of decreasing records with increasing latitude. The spatial distribution of sighting and stranding records indicates that arrival in waters of the European continental shelf is most likely driven by North Atlantic current systems. Future patterns of spatial-temporal distribution, gathered from the periphery of juvenile marine turtles' habitat range, may allow for a broader assessment of the future impacts of global climate change on species range and population size.

    Purpose: We set out to determine the spatial and temporal trends for sightings, strandings and captures of hard-shell marine turtles in the northeast Atlantic from two recording schemes. One recording scheme (presented here) included marine turtle sightings, strandings and captures occurring in French waters that originated from annual sightings and strandings publications of Duguy and colleagues (Duguy 1990, 1992, 1993, 1994, 1995, 1996, 2004; Duguy et al. 1997a, b, 1999, 2000, 2001, 2002, 2003). Records presented in Duguy publications prior to 2001 contained location descriptions, providing no geographic coordinates with error estimates. Longitude and latitude positions for these events were estimated to be the closest coastal point to the descriptive location. Duguy publications, 2001 onwards, were accompanied by maps displaying the approximate location of sightings and strandings events. These maps were digitized and georeferenced and coordinate positions determined for all appropriate records. Georeferenced hard-shell turtle (Lk and Cc) capture/sighting/stranding records from the papers of Duguy for France 1990-2003 (featured in Witt et al. 2007) only include records that could have coordinates derived from their locational descriptions. The second recording scheme comprised records of sightings and strandings of marine turtles in the British Isles obtained from the TURTLE database operated by Marine Environmental Monitoring. Data from the TURTLE database were submitted to EurOBIS and can be viewed on OBIS-SEAMAP: Marine Turtles.

    Supplemental information: Abstract is from Witt et al. 2007; data included in this dataset are a subset of data presented in Witt et al. 2007. References: Duguy, R. 1990. Observations de tortues marines en 1990 (Manche et Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 7:1053–1057. Duguy, R. 1992. Observations de tortues marines en 1991 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:35–37. Duguy, R. 1993. Observations de tortues marines en 1992 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:129–131. Duguy, R. 1994. Observations de tortues marines en 1993 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:235–238. Duguy, R. 1995. Observations de tortues marines en 1994 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:403–406. Duguy, R. 1996. Observations de tortues marines en 1995 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:505–513. Duguy, R. 2004. Observations de tortues marines en 2003 (cotes Atlantiques). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9:361–366. Duguy, R., P. Moriniere and A. Meunier. 1997a. Observations de tortues marines en 1997. Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:761–779. Duguy, R., P. Moriniere and M.A. Spano. 1997b. Observations de tortues marines en 1996 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:625–632. Duguy, R., P. Moriniere and A. Meunier. 1999. Observations de tortues marines en 1998 (Atlantique). Annales de la Societe des Sciences Naturelles de la Charente-Maritime:911–924. Duguy, R., P. Moriniere and A. Meunier. 2000. Observations de tortues marines en 1999. Annales de la Societe des Sciences Naturelles de la Charente-Maritime 8:1025–1034. Duguy R, P. Moriniere and A. Meunier. 2001. Observations tortues marines en 2000 (Atlantique et Manche). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9:17–25. Duguy, R., P. Moriniere and A. Meunier. 2002. Observations de tortues marines en 2001 (Atlantique et Manche). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9. Duguy, R., P. Moriniere and A. Meunier. 2003. Observations de tortues marines en 2002 (Atlantique et Manche). Annales de la Societe des Sciences Naturelles de la Charente-Maritime 9:265–273.

  13. Data from: The Berth Allocation Problem with Channel Restrictions - Datasets...

    • researchdata.edu.au
    • researchdatafinder.qut.edu.au
    Updated 2018
    Cite
    Corry Paul; Bierwirth Christian (2018). The Berth Allocation Problem with Channel Restrictions - Datasets [Dataset]. http://doi.org/10.4225/09/5b306f6511d7c
    Explore at:
    Dataset updated
    2018
    Dataset provided by
    Queensland University of Technology
    Authors
    Corry Paul; Bierwirth Christian
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jul 10, 6 - Dec 9, 27
    Description

    These datasets relate to the computational study presented in the paper "The Berth Allocation Problem with Channel Restrictions", authored by Paul Corry and Christian Bierwirth. They consist of all the randomly generated problem instances along with the computational results presented in the paper.

    Results across all problem instances assume ship separation parameters of [delta_1, delta_2, delta_3] = [0.25, 0, 0.5].

    Excel Workbook Organisation:

    The data is organised into separate Excel files for each table in the paper, as indicated by the file description. Within each file, each row of data presented in the corresponding table (aggregating 10 replications) is captured in two worksheets, one with the problem instance data and the other with the generated solution data obtained from several solution methods (described in the paper). For example, row 3 of Tab. 2 will have data for 10 problem instances on worksheet T2R3, and corresponding solution data on T2R3X.

    Problem Instance Data Format:

    On each problem instance worksheet (e.g. T2R3), each row of data corresponds to a different problem instance, and there are 10 replications on each worksheet.

    The first column provides a replication identifier which is referenced on the corresponding solution worksheet (e.g. T2R3X).

    Following this, there are n*(2c+1) columns (n = number of ships, c = number of channel segments) with headers p(i)_(j).(k)., where i references the operation (channel transit/berth visit) id, j references the ship id, and k references the index of the operation within the ship. All indexing starts at 0. These columns define the transit or dwell times on each segment. A value of -1 indicates a segment on which a berth allocation must be applied, and hence the dwell time is unknown.

    There are then a further n columns with headers r(j), defining the release times of each ship.

    For ChSP problems, there are a final n columns with headers b(j), defining the berth to be visited by each ship. ChSP problems with fixed berth sequencing enforced have an additional n columns with headers toa(j), indicating the order in which ship j sits within its berth sequence. For BAP-CR problems, these columns are not present, but are replaced by n*m columns (m = number of berths) with headers p(j).(b) defining the berth processing time of ship j if allocated to berth b.

    Solution Data Format:

    Each row of data corresponds to a different solution.

    Column A references the replication identifier (from the corresponding instance worksheet) that the solution refers to.

    Column B defines the algorithm that was used to generate the solution.

    Column C shows the objective function value (total waiting and excess handling time) obtained.

    Column D shows the CPU time consumed in generating the solution, rounded to the nearest second.

    Column E shows the optimality gap as a proportion. A value of -1 or an empty value indicates that optimality gap is unknown.

    From column F onwards, there are n*(2c+1) columns with the previously described p(i)_(j).(k). headers. The values in these columns define the entry times at each segment.

    For BAP-CR problems only, following this there are a further 2n columns. For each ship j, there will be columns titled b(j) and p.b(j) defining the berth that was allocated to ship j, and the processing time on that berth respectively.
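    As an illustration of the workbook organisation described above, the sketch below reads one instance worksheet and its solution worksheet and joins them on the replication identifier. The workbook file name is a hypothetical placeholder; the worksheet names follow the T2R3/T2R3X convention described above.

```r
# A minimal sketch, assuming the workbook for Tab. 2 has been downloaded locally;
# the file name is a hypothetical placeholder.
library(readxl)

workbook  <- "berth_allocation_tab2.xlsx"            # hypothetical file name
instances <- read_excel(workbook, sheet = "T2R3")    # 10 problem instances (row 3 of Tab. 2)
solutions <- read_excel(workbook, sheet = "T2R3X")   # corresponding solution data

# join each solution to its problem instance via the replication identifier (column A)
merged <- merge(instances, solutions, by.x = 1, by.y = 1)
```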

  14. Trends in gender homophily in scientific publications (data)

    • data.niaid.nih.gov
    • observatorio-investigacion.unavarra.es
    • +1more
    Updated Apr 12, 2024
    Cite
    Anonymous (2024). Trends in gender homophily in scientific publications (data) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7958033
    Explore at:
    Dataset updated
    Apr 12, 2024
    Dataset authored and provided by
    Anonymous
    Description

    This dataset contains records of research articles extracted from the Web of Science (WoS) from 1980 to 2019---in total, 15,642 journals, 28,241,100 articles and 111,980,858 authorships across 153 research areas.

    The main dataset (author_address_article_gend_v3.parquet), in Parquet format, contains all the authorships, where an authorship is defined as the tuple article-author. There are 14 variables per authorship (row):

    ut: unique article identifier.

    daisng_id: unique author identifier.

    author_no: author number, as listed in the article.

    country: author country (two-letter ISO code).

    date: publication date.

    gender: gender of the author ("male" or "female"), as provided by the Genderize.io API.

    probability: probability of the gender attribute, as provided by the Genderize.io API.

    count: number of entries for the author first name, as provided by the Genderize.io API.

    jsc: journal subject category.

    field: field of research.

    research_area: area of research.

    n_aut: number of authors in this publication.

    journal: journal name.

    alphabetical: whether the author list for this article is in alphabetical order.

    With the previous dataset, a resampler was applied to generate null homophily values for each year. There are 4 datasets in R Data Serialization (RDS) format:

    null_field.rds: null homophily values per country, year and field of research.

    null_field_comp.rds: null homophily values per year and field of research (only for complete authorships).

    null_research.rds: null homophily values per year and area of research.

    null_research_comp.rds: null homophily values per year and area of research (only for complete authorships).

    All these datasets have the same structure:

    country: country (two-letter ISO code).

    year: year.

    variable: either field or research area name.

    m: average homophily.

    s: homophily std. error.

    Finally, some supplementary files used in the descriptive analysis and methods:

    File null_research_l2019.rds is an example of the output from the resampling algorithm for year 2019.

    File wos_category_to_field.csv is a mapping from WoS categories to more general fields.

    File jcr_if_2020.csv contains the percentiles of the journal impact factor for the JCR 2020.
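    As an illustration of the dataset structure described above, the sketch below reads the main Parquet file, computes the yearly share of authorships attributed to women, and loads one of the null-homophily RDS files. Column names follow the variable list above; the date handling assumes the date column parses with as.Date().

```r
# A minimal sketch, assuming the files sit in the working directory; column
# names are taken from the variable list above, and the date column is assumed
# to parse with as.Date().
library(arrow)
library(dplyr)

authorships <- read_parquet("author_address_article_gend_v3.parquet")

# share of authorships attributed to women, per publication year
authorships %>%
  mutate(year = as.integer(format(as.Date(date), "%Y"))) %>%
  group_by(year) %>%
  summarise(share_female = mean(gender == "female", na.rm = TRUE), .groups = "drop")

# null homophily values per year and area of research
null_research <- readRDS("null_research.rds")
str(null_research)
```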

  15. Data For: Herbarium specimens provide reliable estimates of phenological...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Jun 22, 2022
    Cite
    Tadeo Ramirez-Parada; Isaac Park; Susan Mazer (2022). Data For: Herbarium specimens provide reliable estimates of phenological responsiveness to climate at unparalleled taxonomic and spatiotemporal scales [Dataset]. http://doi.org/10.25349/D9TK64
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 22, 2022
    Dataset provided by
    University of California, Santa Barbara
    Authors
    Tadeo Ramirez-Parada; Isaac Park; Susan Mazer
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Understanding the effects of climate change on the phenological structure of plant communities will require measuring variation in sensitivity among thousands of co-occurring species across regions. Herbarium collections provide vast resources with which to do this, but may also exhibit biases as sources of phenological data. Despite general recognition of these caveats, validation of herbarium-based estimates of phenological sensitivity against estimates obtained using field observations remains rare and limited in scope. Here, we leveraged extensive datasets of herbarium specimens and of field observations from the USA National Phenology Network for 21 species in the United States and, for each species, compared herbarium- and field-based standardized estimates of peak flowering dates and of sensitivity of peak flowering time to geographic and interannual variation in mean spring minimum temperatures (TMIN). We found strong agreement between herbarium- and field-based estimates for standardized peak flowering time (r=0.91, p<0.001) and for the direction and magnitude of sensitivity to both geographic TMIN variation (r=0.88, p<0.001) and interannual TMIN variation (r=0.82, p<0.001). This agreement was robust to substantial differences between datasets in 1) the long-term TMIN conditions observed among collection and phenological monitoring sites and 2) the interannual TMIN conditions observed in the time periods encompassed by both datasets for most species. Our results show that herbarium-based sensitivity estimates are reliable among species spanning a wide diversity of life histories and biomes, demonstrating their utility in a broad range of ecological contexts, and underscoring the potential of herbarium collections to enable phenoclimatic analysis at taxonomic and spatiotemporal scales not yet captured by observational data.

    Methods

    Phenological data

    The dataset of field observations consisted of all records of flowering onset and termination available in the USA National Phenology Network database (NPNdb), representing an initial 1,105,764 phenological observations. To ensure the quality of the observational data, we retained only observations for which we could determine that the dates of onset and termination of flowering had an arbitrary maximum error of 14 days. To do this, we filtered the data to include only records for which the date on which the first open flower on an individual was observed was preceded by an observation of the same individual without flowers no more than 14 days prior, and for which the date on which the last flower was recorded was followed by an observation of the same individual without flowers no more than 14 days later. After filtering, field observations in our data had an average maximum error of 6.4 days for the onset of flowering, and of 6.6 days for the termination of flowering. The herbarium dataset was constructed using an initial 894,392 digital herbarium specimen records archived by 72 herbaria across North America. We excluded from analysis all specimens not explicitly recorded as being in flower, or for which GPS coordinates or dates of collection were not available. We further filtered both datasets by only retaining species that were found in both datasets and that were represented by observations at a minimum of 15 unique sites in the NPN dataset. For each species, and to more closely match the geographic ranges covered by each dataset, we filtered the herbarium dataset to include only specimens within the range of latitudes and longitudes represented by the field observations in the NPN data. Finally, we retained only species represented by 70 or more herbarium specimens to ensure sufficient sample sizes for phenoclimatic modeling. This procedure identified a final set of 21 native species represented in 3,243 field observations across 1,406 unique site-year combinations, and a final sample of 5,405 herbarium specimens across 4,906 unique site-year combinations. For the herbarium dataset, sample sizes ranged from 69 unique sites and 74 specimens for Prosopis velutina, to 1,323 unique sites containing 1,368 specimens for Achillea millefolium. Sample sizes in the NPN dataset ranged from 15 unique sites with 74 observations for Impatiens capensis to 108 unique sites with 321 observations for Cornus florida. These 21 species represented 15 families and 17 genera, spanning a diverse range of life-history strategies and growth forms, including evergreen and deciduous shrubs and trees (e.g., Quercus agrifolia and Tilia americana, respectively), as well as herbaceous perennials (e.g., Achillea millefolium) and annuals (e.g., Impatiens capensis). Our focal species covered a wide variety of biomes and regions including Western deserts (e.g., Fouquieria splendens), Mediterranean shrublands and oak woodlands (e.g., Baccharis pilularis, Quercus agrifolia), and Eastern deciduous forests (e.g., Quercus rubra, Tilia americana). To estimate flowering dates in the herbarium dataset, we employed the day of year of collection (henceforth ‘DOY’) of each specimen collected while in flower as a proxy. Herbarium specimens in flower could have been collected at any point between the onset and termination of their flowering period and botanists may preferentially collect individuals in their flowering peak for many species.
    Therefore, herbarium specimen collection dates are more likely to reflect peak flowering dates than flowering onset dates. To maximize the phenological equivalence of the field and herbarium datasets, we used the median date between onset and termination of flowering for each individual in each year in the NPN data as a proxy for peak flowering time. Due to the maximum error of 14 days for flowering onset and termination dates in the NPN dataset, median flowering dates also had a maximum error of 14 days, with an average maximum error among observations of 6.5 days. To account for the artificial DOY discontinuity between December 31st (DOY = 365, or 366 in a leap year) and January 1st (DOY = 1), we converted DOY in both datasets into a circular variable using an azimuthal correction.

    Climate data

    Daily minimum temperatures mediate key developmental processes including the break of dormancy, floral induction, and anthesis. Therefore, we used minimum surface temperatures averaged over the three months leading up to (and including) the mean flowering month for each species (hereafter ‘TMIN’) as the climatic correlate of flowering time in this study; consequently, the specific months over which temperatures were averaged varied among species. Using TMIN calculated over different time periods instead (e.g., during spring for all species) did not qualitatively affect our results. Then, we partitioned variation among sites into spatial and temporal components, characterizing TMIN for each observation by the long-term mean TMIN at its site of collection (henceforth ‘TMIN normals’), and by the deviation between its TMIN in the year of collection (for the three-month window of interest) and its long-term mean TMIN (henceforth ‘TMIN anomalies’). For each site, we obtained a monthly time series of TMIN from January 1901 to December 2016 using ClimateNA v6.30, a software package that interpolates 4 km2 resolution climate data from the PRISM Climate Group at Oregon State University (http://prism.oregonstate.edu) to generate elevation-adjusted climate estimates. To calculate TMIN normals, we averaged observed TMIN for the three months leading up to the mean flowering date of each species across all years between 1901 and 2016 for each site. TMIN anomalies relative to long-term conditions were calculated by subtracting TMIN normals from observed TMIN conditions in the year of collection. Therefore, positive and negative values of the anomalies respectively reflect warmer-than-average and colder-than-average conditions in a given year.

    Analysis

    We also provide R code to reproduce all results presented in the main text and the supplemental materials of our study. This code includes 1) all steps necessary to merge herbarium and field data into a single dataset ready for analysis, 2) the formulation and specification of the varying-intercepts and varying-slopes Bayesian model used to generate herbarium- vs. field-based estimates of phenology and its sensitivity to TMINsp, 3) the steps required to process the output of the Bayesian model and to obtain all metrics required for the analyses in the paper, and 4) the code used to generate each figure.
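    As an illustration of the TMIN covariates described above, the sketch below computes the three-month TMIN window, the long-term normals, and the anomalies. The input data frame and its column names are hypothetical placeholders for a per-site monthly TMIN series (1901-2016).

```r
# A minimal sketch of the climate-covariate construction described above,
# assuming a hypothetical data frame `tmin_monthly` with columns site, year,
# month and tmin (monthly minimum temperature per site, 1901-2016).
library(dplyr)

flowering_month <- 5                                # hypothetical mean flowering month for one species
window <- (flowering_month - 2):flowering_month     # three months up to and including it

three_month_tmin <- tmin_monthly %>%
  filter(month %in% window) %>%
  group_by(site, year) %>%
  summarise(tmin_window = mean(tmin), .groups = "drop")

tmin_covariates <- three_month_tmin %>%
  group_by(site) %>%
  mutate(tmin_normal  = mean(tmin_window),              # long-term mean over 1901-2016
         tmin_anomaly = tmin_window - tmin_normal) %>%  # deviation in the year of collection
  ungroup()
```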
    Contributing Herbaria

    Data used in this study was contributed by the Yale Peabody Museum of Natural History, the George Safford Torrey Herbarium at the University of Connecticut, the Acadia University Herbarium, the Chrysler Herbarium at Rutgers University, the University of Montreal Herbarium, the Harvard University Herbarium, the Albion Hodgdon Herbarium at the University of New Hampshire, the Academy of Natural Sciences of Drexel University, the Jepson Herbarium at the University of California-Berkeley, the University of California-Berkeley Sagehen Creek Field Station Herbarium, the California Polytechnic State University Herbarium, the University of Santa Cruz Herbarium, the Black Hills State University Herbarium, the Luther College Herbarium, the Minot State University Herbarium, the Tarleton State University Herbarium, the South Dakota State University Herbarium, the Pittsburg State University Herbarium, the Montana State University-Billings Herbarium, the Sul Ross University Herbarium, the Fort Hays State University Herbarium, the Utah State University Herbarium, the Brigham Young University Herbarium, the Eastern Nevada Landscape Coalition Herbarium, the University of Nevada Herbarium, the Natural History Museum of Utah, the Western Illinois University Herbarium, the Eastern Illinois University Herbarium, the Northern Illinois University Herbarium, the Morton Arboretum Herbarium, the Chicago Botanic Garden

  16. HyG: A hydraulic geometry dataset derived from historical stream gage...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Feb 26, 2024
    Cite
    Thomas L. Enzminger; J. Toby Minear; Ben Livneh; Thomas L. Enzminger; J. Toby Minear; Ben Livneh (2024). HyG: A hydraulic geometry dataset derived from historical stream gage measurements across the conterminous United States [Dataset]. http://doi.org/10.5281/zenodo.10425392
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 26, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas L. Enzminger; J. Toby Minear; Ben Livneh; Thomas L. Enzminger; J. Toby Minear; Ben Livneh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Contiguous United States, United States
    Description

    Regional- and continental-scale models predicting variations in the magnitude and timing of streamflow are important tools for forecasting water availability as well as flood inundation extent and associated damages. Such models must define the geometry of stream channels through which flow is routed. These channel parameters, such as width, depth, and hydraulic resistance, exhibit substantial variability in natural systems. While hydraulic geometry relationships have been extensively studied in the United States, they remain unquantified for thousands of stream reaches across the country. Consequently, large-scale hydraulic models frequently take simplistic approaches to channel geometry parameterization. Over-simplification of channel geometries directly impacts the accuracy of streamflow estimates, with knock-on effects for water resource and hazard prediction.

    Here, we present a hydraulic geometry dataset derived from long-term measurements at U.S. Geological Survey (USGS) stream gages across the conterminous United States (CONUS). This dataset includes (a) at-a-station hydraulic geometry parameters following the methods of Leopold and Maddock (1953), (b) at-a-station Manning's n calculated from the Manning equation, (c) daily discharge percentiles, and (d) downstream hydraulic geometry regionalization parameters based on HUC4 (Hydrologic Unit Code 4). This dataset is referenced in Heldmyer et al. (2022); further details and implications for CONUS-scale hydrologic modeling are available in that article (https://doi.org/10.5194/hess-26-6121-2022).

    At-a-station Hydraulic Geometry

    We calculated hydraulic geometry parameters using historical USGS field measurements at individual station locations. Leopold and Maddock (1953) derived the following power law relationships:

    \(w={aQ^b}\)

    \(d=cQ^f\)

    \(v=kQ^m\)

    where Q is discharge, w is width, d is depth, v is velocity, and a, b, c, f, k, and m are at-a-station hydraulic geometry (AHG) parameters. We downloaded the complete record of USGS field measurements from the USGS NWIS portal (https://waterdata.usgs.gov/nwis/measurements). This raw dataset includes 4,051,682 individual measurements from a total of 66,841 stream gages within CONUS. Quantities of interest in AHG derivations are Q, w, d, and v. USGS field measurements do not include d--we therefore calculated d using d=A/w, where A is measured channel area. We applied the following quality control (QC) procedures in order to ensure the robustness of AHG parameters derived from the field data:

    1. We considered only measurements which reported Q, v, w and A.
    2. For each gage, we excluded measurements older than the most recent five years, so as to minimize the effects of long-term channel evolution on observed hydraulic geometry relationships.
    3. We excluded gages for which measured Q disagreed with the product of measured velocity and measured area by more than 5%. Gages for which \(Q \neq vA\) are often tidally influenced and therefore may not conform to expected channel geometry relationships.
    4. Q, v, w, and d from field measurements at each gage were log-transformed. We performed robust linear regressions on the relationships between log(Q) and log(w), log(v), and log(d). AHG parameters were derived from the regressed explanatory variables.
      1. We applied an iterative outlier detection procedure to the linear regression residuals. Values of log-transformed w, v, and d residuals falling outside a three median absolute deviation (MAD) envelope were excluded. Regression coefficients were recalculated and the outlier detection procedure was reapplied until no new outliers were detected.
      2. Gages for which one or more regression had p-values >0.05 were excluded, as the relationships between log-transformed Q and w, v, or d lacked statistical significance.
      3. Gages were omitted if regressed AHG parameters did not fulfill two additional relationships derived by Leopold and Maddock: \(b+f+m=1 \pm 0.1\) and \(a \times c \times k = 1 \pm 0.1\).
    5. If the number of field measurements for a given gage was less than 10, either initially or after individual measurements were removed via steps 1-4, the gage was excluded from further analysis.

    Application of the QC procedures described above removed 55,328 stream gages, many of which were short-term campaign gages at which very few field measurements had been recorded. We derived AHG parameters for the remaining 11,513 gages which passed our QC.
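    As an illustration of the AHG derivation described above, the sketch below fits the three log-log relationships for a single gage and applies the two Leopold and Maddock consistency checks. Ordinary least squares is shown for brevity (the study used robust regression with iterative MAD-based outlier removal); the data frame `meas` and its columns Q, v, w and A are hypothetical stand-ins for the NWIS field measurements of one gage.

```r
# A minimal sketch; `meas` is a hypothetical data frame of field measurements
# for one gage with columns Q (discharge), v (velocity), w (width) and A (area).
meas <- subset(meas, Q > 0 & v > 0 & w > 0 & A > 0)
meas$d <- meas$A / meas$w                      # depth from area and width

fit_w <- lm(log(w) ~ log(Q), data = meas)      # log(w) = log(a) + b*log(Q)
fit_d <- lm(log(d) ~ log(Q), data = meas)      # log(d) = log(c) + f*log(Q)
fit_v <- lm(log(v) ~ log(Q), data = meas)      # log(v) = log(k) + m*log(Q)

b <- coef(fit_w)[2]; f <- coef(fit_d)[2]; m <- coef(fit_v)[2]
a <- exp(coef(fit_w)[1]); c_ <- exp(coef(fit_d)[1]); k <- exp(coef(fit_v)[1])

# continuity checks from Leopold and Maddock (1953)
abs(b + f + m - 1) <= 0.1
abs(a * c_ * k - 1) <= 0.1
```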

    At-a-station Manning's n

    We calculated hydraulic resistance at each gage location by solving Manning's equation for Manning's n, given by

    \(n = \frac{R^{2/3} S^{1/2}}{v}\)

    where v is velocity, R is hydraulic radius and S is longitudinal slope. We used smoothed reach-scale longitudinal slopes from the NHDPlusv2 (National Hydrography Dataset Plus, version 2) ElevSlope data product. We note that NHDPlusv2 contains a minimum slope constraint of \(10^{-5}\) m/m--no reach may have a slope less than this value. Furthermore, NHDPlusv2 lacks slope values for certain reaches. As such, we could not calculate Manning's n for every gage, and some Manning's n values we report may be inaccurate due to the NHDPlusv2 minimum slope constraint. We report two Manning's n values, both of which take stream depth as an approximation for R. The first takes the median stream depth and velocity measurements from the USGS's database of manual flow measurements for each gage. The second uses stream depth and velocity calculated for a 50th percentile discharge (Q50; see below). Approximating R as stream depth is an assumption which is generally considered valid if the width-to-depth ratio of the stream is greater than 10, which was the case for the vast majority of field measurements. Thus, we report two Manning's n values for each gage, which are each intended to approximately represent median flow conditions.
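    As a small numerical illustration of the calculation described above (approximating R by stream depth), with hypothetical values for depth, velocity and slope:

```r
# A minimal sketch with hypothetical values; units are SI (m, m/s, m/m).
d <- 0.8        # median measured stream depth, used to approximate R
v <- 0.6        # median measured velocity
S <- 2e-4       # NHDPlusv2 longitudinal slope of the reach

n <- d^(2/3) * sqrt(S) / v
n
```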

    Daily discharge percentiles

    We downloaded full daily discharge records from 16,947 USGS stream gages through the NWIS online portal. The data includes records from both operational and retired gages. Records for operational gages were truncated at the end of the 2018 water year (September 30, 2018) in order to avoid use of preliminary data. To ensure the robustness of daily discharge percentiles, we applied the following QC:

    1. For a given gage, we removed blocks of missing discharge values longer than 6 months. These long blocks of missing data generally correspond to intervals in which a gage was temporarily decommissioned for maintenance.
    2. A gage was omitted from further analysis if its discharge record was less than 10 years (3,652 days) long, and/or less than 90% complete (>10% missing values after removal of long blocks in step 1).

    We calculated discharge percentiles for each of the 10,871 gages which passed QC. Discharge percentiles were calculated at increments of 1% between Q1 and Q5, increments of 5% (e.g. Q10, Q15, Q20, etc.) between Q5 and Q95, increments of 1% between Q95 and Q99, and increments of 0.1% between Q99 and Q100 in order to provide higher resolution at the lowest and highest flows, which occur much less frequently.

    HG Regionalization

    We regionalized AHG parameters from gage locations to all stream reaches in the conterminous United States. This downstream hydraulic geometry regionalization was performed using all gages with AHG parameters in each HUC4, as opposed to traditional downstream hydraulic geometry--which involves interpolation of parameters of interest to ungaged reaches on individual streams. We performed linear regressions on log-transformed drainage area and Q at a number of flow percentiles as follows:

    \(log(Q_i) = \beta_1log(DA) + \beta_0\)

    where Qi is streamflow at percentile i, DA is drainage area and \(\beta_1\) and \(\beta_0\) are regression parameters. We report \(\beta_1\), \(\beta_0\) , and the r2 value of the regression relationship for Q percentiles Q10, Q25, Q50, Q75, Q90, Q95, Q99, and Q99.9. Further discussion and additional analysis of HG regionalization are presented in Heldmyer et al. (2022).

    Dataset description

    We present the HyG dataset in a comma-separated value (csv) format. Each row corresponds to a different USGS stream gage. Information in the dataset includes gage ID (column 1), gage location in latitude and longitude (columns 2-3), gage drainage area (from USGS; column 4), longitudinal slope of the gage's stream reach (from NHDPlusv2; column 5), AHG parameters derived from field measurements (columns 6-11), Manning's n calculated from median measured flow conditions (column 12), Manning's n calculated from Q50 (column 13), Q percentiles (columns 14-51), HG regionalization parameters and r2 values (columns 52-75), and geospatial information for the HUC4 in which the gage is located (from USGS; columns 76-87). Users are advised to exercise caution when opening the dataset. Certain software, including Microsoft Excel and Python, may drop the leading zeros in USGS gage IDs and HUC4 IDs if these columns are not explicitly imported as strings.

    Errata

    In version 1, drainage area was mistakenly reported in cubic meters but labeled in cubic kilometers. This error has been corrected in version 2.

  17. Food and Agriculture Biomass Input–Output (FABIO) database

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    Updated Jun 8, 2022
    + more versions
    Cite
    Kuschnig, Nikolas (2022). Food and Agriculture Biomass Input–Output (FABIO) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2577066
    Explore at:
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    Kuschnig, Nikolas
    Bruckner, Martin
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry.

    The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R codes and auxiliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global.

    The database consists of the following main components, in compressed .rds format:

    Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

    Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

    X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

    L: the Leontief inverse, computed as (I – A)^-1, where A is the matrix of input coefficients derived from Z and x. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).

    E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m3), and 4) green water use (in m3).

    mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

    mr_use: the use table captures the quantities of each commodity (rows) used as an input in each process (columns).

    A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

    Except for X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx.

    Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R
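    As an illustration of how the components fit together, the sketch below derives the input coefficients and the Leontief inverse from Z and X for one year under mass allocation. The "2013" element/column selection is an assumption about how the lists are organised (check with str() first), and since L_mass.rds / L_value.rds are already provided, recomputing L is rarely necessary.

```r
# A minimal sketch, assuming the .rds files have been downloaded; the "2013"
# list element / column selection is an assumption about the object structure.
library(Matrix)

Z <- readRDS("Z_mass.rds")[["2013"]]      # inter-commodity flows (sparse, physical units)
X <- readRDS("X.rds")[, "2013"]           # total output per commodity

x_inv <- ifelse(X > 0, 1 / X, 0)          # guard against division by zero
A <- Z %*% Diagonal(x = x_inv)            # input coefficients A = Z * diag(1/x)
L <- solve(Diagonal(n = nrow(A)) - A)     # Leontief inverse (I - A)^-1; large and dense in practice
```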

    How to cite:

    To cite FABIO work please refer to this paper:

    Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

    License:

    This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

    Known issues:

    The underlying FAO data have been manipulated to the minimum extent necessary. Data filling and supply-use balancing, however, required some adaptations. These are documented in the code and are also reflected in the balancing item in the final demand matrices. For proper use of the database, I recommend distributing the balancing item over all other uses proportionally and running analyses with and without balancing to illustrate uncertainties.

  18. Data from: Spatio-temporal dynamics of attacks around deaths of wolves: A...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 19, 2025
    + more versions
    Cite
    Chamaillé-Jammes, Simon (2025). Spatio-temporal dynamics of attacks around deaths of wolves: A statistical assessment of lethal control efficiency in France [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12772867
    Explore at:
    Dataset updated
    Feb 19, 2025
    Dataset provided by
    Duchamp, Christophe
    Grente, Oksana
    Chamaillé-Jammes, Simon
    Drouet-Hoguet, Nolwenn
    Gimenez, Olivier
    Opitz, Thomas
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    France
    Description

    This repository contains the supplementary materials (Supplementary_figures.docx, Supplementary_tables.docx) of the manuscript: "Spatio-temporal dynamics of attacks around deaths of wolves: A statistical assessment of lethal control efficiency in France". This repository also provides the R codes and datasets necessary to run the analyses described in the manuscript.

    The R datasets with suffix "_a" have anonymous spatial coordinates to respect confidentiality. Therefore, the preliminary preparation of the data is not provided in the public codes. These datasets, all geolocated and necessary to the analyses, are:

    Attack_sf_a.RData: 19,302 analyzed wolf attacks on sheep

    ID: unique ID of the attack

    DATE: date of the attack

    PASTURE: the related pasture ID from "Pasture_sf_a" where the attack is located

    STATUS: column resulting from the preparation and the attribution of attacks to pastures (part 2.2.4 of the manuscript); not shown here to respect confidentiality

    Pasture_sf_a.RData: 4987 analyzed pastures grazed by sheep

    ID: unique ID of the pasture

    CODE: Official code in the pastoral census

    FLOCK_SIZE: maximum annual number of sheep grazing in the pasture

    USED_MONTHS: months for which the pasture is grazed by sheep

    Removal_sf_a.RData: 232 analyzed single wolf removal or groups of wolf removals

    ID: unique ID of the removal

    OVERLAP: whether it is a single removal ("non-interacting" in the manuscript => "NO" here) or not ("interacting" in the manuscript; here "SIMULTANEOUS" for removals occurring during the same operation or "NON-SIMULTANEOUS" if not).

    DATE_MIN: date of the single removal or date of the first removal of a group

    DATE_MAX: date of the single removal or date of the last removal of a group

    CLASS: administrative type of the removal according to definitions from 2.1 part of the manuscript

    SEX: sex or sexes of the removed wolves if known

    AGE: class age of the removed wolves if known

    BREEDER: breeding status of the removed female wolves, "Yes" for female breeder, "No" for female non-breeder. Males are "No" by default when necropsied; NA indicates dead individuals that were not found.

    SEASON: season of the removal, as defined in part 2.3.4 of the manuscript

    MASSIF: mountain range attributed to the removal, as defined in part 2.3.4 of the manuscript

    Area_to_exclude_sf_a.RData: one row for each mountain range, corresponding to the area where removal controls of the mountain range could not be sampled, as defined in part 2.3.6 of the manuscript
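    As a starting point, the sketch below simply loads the four anonymised datasets. The names of the objects created by load() are assumptions, so it is worth checking them with ls() after loading.

```r
# A minimal sketch, assuming the .RData files sit in the working directory;
# each file is assumed to contain one sf object (object names may differ).
library(sf)

load("Attack_sf_a.RData")           # wolf attacks on sheep
load("Pasture_sf_a.RData")          # pastures grazed by sheep
load("Removal_sf_a.RData")          # single wolf removals or groups of removals
load("Area_to_exclude_sf_a.RData")  # areas excluded from control sampling

ls()                                # check which objects were actually loaded
```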

    These datasets were used to run the following analyses codes:

    Code 1 : The file Kernel_wolf_culling_attacks_p.R contains the before-after analyses.

    We start by delimiting the spatio-temporal buffer for each row of the "Removal_sf_a.RData" dataset.

    We identify the attacks from "Attack_sf_a.RData" within each buffer, giving the data frame "Buffer_df" (one row per attack)

    We select the pastures from "Pasture_sf_a.RData" within each buffer, giving the data frame "Buffer_sf" (one row per removal)

    We calculate the spatial correction

    We spatially slice each buffer into 200 rings, giving the data frame "Ring_sf" (one row per ring)

    We add the total pastoral area of the ring of the attack ("SPATIAL_WEIGHT"), for each attack of each buffer, within Buffer_df ("Buffer_df.RData")

    We calculate the pastoral correction

    We create the pastoral matrix for each removal, giving a matrix of 200 rows (one for each ring) and 180 columns (one for each day, 90 days before the removal date and 90 day after the removal date), with the total pastoral area in use by sheep for each corresponding cell of the matrix (one element per removal, "Pastoral_matrix_lt.RData")

    We simulate, for each removal, the random distribution of the attacks from "Buffer_df.RData" according to "Pastoral_matrix_lt.RData". The process is done 100 times (one element per simulation, "Buffer_simulation_lt.RData").

    We estimate the attack intensities

    We classified the removals into 20 subsets, according to part 2.3.4 of the manuscript ("Variables_lt.RData") (one element per subset)

    We perform, for each subset, the kernel estimations with the observed attacks ("Kernel_lt.RData"), with the simulated attacks ("Kernel_simulation_lt.RData") and we correct the first kernel computations with the second ("Kernel_controlled_lt.RData") (one element per subset).

    We calculate the trend of attack intensities, for each subset, that compares the total attack intensity before and after the removals (part 2.3.5 of the manuscript), giving "Trends_intensities_df.RData". (one row per subset)

    We calculate the trend of attack intensities, for each subset, along the spatial axis, three times, once for each time analysis scale. This gives "Shift_df" (one row per ring and per time analysis scale).

    Code 2 : The file Control_removals_p.R contains the control-impact analyses.

    It starts with the simulation of 100 removal control sets ("Control_sf_lt_a.RData") from the real set of removals ("Removal_sf_a.RData"), which is done with the function "Control_fn" (l. 92).

    The rest of the analyses follows the same process as in the first code "Kernel_wolf_culling_attacks_p.R", in order to apply the before-after analyses to each control set. All objects have the same structure as before, except that they are now a list, with one resulting element per control set. These objects have "control" in their names (not to be confused with "controlled" which refers to the pastoral correction already applied in the first code).

    The code is also applied again, from l. 92 to l. 433, this time for the real set of removals (l. 121) - with "Simulated = FALSE" (l. 119). We could not simply use the results from the first code because the set of removals is restricted to removals attributed to mountain ranges only. There are 2 resulting objects: "Kernel_real_lt.RData" (observed real trends) and "Kernel_controlled_real_lt.RData" (real trends corrected for pastoral use).

    The part of the code from line 439 to 524 relates to the calculations of the trends (for the real set and the control sets), as in the first code, giving "Trends_intensities_real_df.RData" and "Trends_intensities_control_lt.RData".

    The part of the code from line 530 to 588 relates to the calculation of the 95% confidence intervals and the means of the intensity trends for each subset based on the results of the 100 control sets (Trends_intensities_mean_control_df.RData, Trends_intensities_CImin_control_df.RData and Trends_intensities_CImax_control_df.RData). This will be used to test the significance of the real trends. This comparison is done right after, l. 595-627, and gives the data frame "Trends_comparison_df.RData".

    Code 3: The file Figures.R produces part of the figures from the manuscript:

    "Dataset map": figure 1

    "Buffer": figure 2 (then pasted in powerpoint)

    "Kernel construction": figure 5 (then pasted in powerpoint)

    "Trend distributions": figure 7

    "Kernels": part of figures 10 and S2

    "Attack shifts": figure 9 and S1

    "Significant": figure 8

  19. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 19, 2024
    + more versions
    Cite
    Steven R. Livingstone; Steven R. Livingstone; Frank A. Russo; Frank A. Russo (2024). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [Dataset]. http://doi.org/10.5281/zenodo.1188976
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 19, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Steven R. Livingstone; Steven R. Livingstone; Frank A. Russo; Frank A. Russo
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). The dataset contains 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16bit, 48kHz .wav), Audio-Video (720p H.264, AAC 48kHz, .mp4), and Video-only (no sound). Note, there are no song files for Actor_18.

    The RAVDESS was developed by Dr Steven R. Livingstone, who now leads the Affective Data Science Lab, and Dr Frank A. Russo who leads the SMART Lab.

    Citing the RAVDESS

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike license, so please cite the RAVDESS if it is used in your work in any form. Published academic papers should use the academic paper citation for our PLoS ONE paper. Personal works, such as machine learning projects/blog posts, should provide a URL to this Zenodo page, though a reference to our PLoS ONE paper would also be appreciated.

    Academic paper citation

    Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391.

    Personal use citation

    Include a link to this Zenodo page - https://zenodo.org/record/1188976

    Commercial Licenses

    Commercial licenses for the RAVDESS can be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Contact Information

    If you would like further information about the RAVDESS, to purchase a commercial license, or if you experience any issues downloading files, please contact us at ravdess@gmail.com.

    Example Videos

    Watch a sample of the RAVDESS speech and song videos.

    Emotion Classification Users

    If you're interested in using machine learning to classify emotional expressions with the RAVDESS, please see our new RAVDESS Facial Landmark Tracking data set [Zenodo project page].

    Construction and Validation

    Full details on the construction and perceptual validation of the RAVDESS are described in our PLoS ONE paper - https://doi.org/10.1371/journal.pone.0196391.

    The RAVDESS contains 7356 files. Each file was rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained adult research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity, interrater reliability, and test-retest intrarater reliability were reported. Validation data is open-access, and can be downloaded along with our paper from PLoS ONE.

    Contents

    Audio-only files

    Audio-only files of all actors (01-24) are available as two separate zip files (~200 MB each):

    • Speech file (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440.
    • Song file (Audio_Song_Actors_01-24.zip, 198 MB) contains 1012 files: 44 trials per actor x 23 actors = 1012.

    Audio-Visual and Video-only files

    Video files are provided as separate zip downloads for each actor (01-24, ~500 MB each), and are split into separate speech and song downloads:

    • Speech files (Video_Speech_Actor_01.zip to Video_Speech_Actor_24.zip) collectively contain 2880 files: 60 trials per actor x 2 modalities (AV, VO) x 24 actors = 2880.
    • Song files (Video_Song_Actor_01.zip to Video_Song_Actor_24.zip) collectively contain 2024 files: 44 trials per actor x 2 modalities (AV, VO) x 23 actors = 2024.

    File Summary

    In total, the RAVDESS collection includes 7356 files (2880+2024+1440+1012 files).
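
    These counts follow directly from the design and can be checked in R:

      60 * 24                      # 1440 speech audio files
      44 * 23                      # 1012 song audio files (no song files for Actor_18)
      60 * 2 * 24                  # 2880 speech video files (AV + VO)
      44 * 2 * 23                  # 2024 song video files (AV + VO)
      1440 + 1012 + 2880 + 2024    # 7356 files in total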

    File naming convention

    Each of the 7356 RAVDESS files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 02-01-06-01-02-01-12.mp4). These identifiers define the stimulus characteristics:

    Filename identifiers

    • Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
    • Vocal channel (01 = speech, 02 = song).
    • Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
    • Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
    • Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
    • Repetition (01 = 1st repetition, 02 = 2nd repetition).
    • Actor (01 to 24. Odd numbered actors are male, even numbered actors are female).


    Filename example: 02-01-06-01-02-01-12.mp4

    1. Video-only (02)
    2. Speech (01)
    3. Fearful (06)
    4. Normal intensity (01)
    5. Statement "dogs" (02)
    6. 1st Repetition (01)
    7. 12th Actor (12)
    8. Female, as the actor ID number is even.
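
    The seven fields can be decoded mechanically. A short R sketch (the helper name is ours, not part of the RAVDESS release):

      # Decode a RAVDESS filename into its seven identifier fields.
      parse_ravdess <- function(filename) {
        p <- strsplit(tools::file_path_sans_ext(basename(filename)), "-")[[1]]
        list(
          modality   = c("01" = "full-AV", "02" = "video-only", "03" = "audio-only")[[p[1]]],
          channel    = c("01" = "speech", "02" = "song")[[p[2]]],
          emotion    = c("01" = "neutral", "02" = "calm", "03" = "happy", "04" = "sad",
                         "05" = "angry", "06" = "fearful", "07" = "disgust",
                         "08" = "surprised")[[p[3]]],
          intensity  = c("01" = "normal", "02" = "strong")[[p[4]]],
          statement  = c("01" = "Kids are talking by the door",
                         "02" = "Dogs are sitting by the door")[[p[5]]],
          repetition = as.integer(p[6]),
          actor      = as.integer(p[7]),
          actor_sex  = if (as.integer(p[7]) %% 2 == 1) "male" else "female"
        )
      }

      parse_ravdess("02-01-06-01-02-01-12.mp4")
      # video-only speech, fearful, normal intensity, "Dogs are sitting by the door",
      # 1st repetition, actor 12 (female)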

    License information

    The RAVDESS is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, CC BY-NC-SA 4.0

    Commercial licenses for the RAVDESS can also be purchased. For more information, please visit our license fee page, or contact us at ravdess@gmail.com.

    Related Data sets

  20. SCAR Southern Ocean Diet and Energetics Database

    • data.niaid.nih.gov
    • data.aad.gov.au
    • +3more
    Updated Jul 24, 2023
    + more versions
    Cite
    Scientific Committee on Antarctic Research (2023). SCAR Southern Ocean Diet and Energetics Database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5072527
    Explore at:
    Dataset updated
    Jul 24, 2023
    Dataset authored and provided by
    Scientific Committee on Antarctic Research
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Southern Ocean
    Description

    Information related to diet and energy flow is fundamental to a diverse range of Antarctic and Southern Ocean biological and ecosystem studies. This metadata record describes a database of such information being collated by the SCAR Expert Groups on Antarctic Biodiversity Informatics (EG-ABI) and Birds and Marine Mammals (EG-BAMM) to assist the scientific community in this work. It includes data related to diet and energy flow from conventional (e.g. gut content) and modern (e.g. molecular) studies, stable isotopes, fatty acids, and energetic content. It is a product of the SCAR community and open for all to participate in and use.

    Data have been drawn from published literature, existing trophic data collections, and unpublished data. The database comprises five principal tables, relating to (i) direct sampling methods of dietary assessment (e.g. gut, scat, and bolus content analyses, stomach flushing, and observed predation), (ii) stable isotopes, (iii) lipids, (iv) DNA-based diet assessment, and (v) energetics values. The schemas of these tables are described below, and a list of the sources used to populate the tables is provided with the data.

    A range of manual and automated checks were used to ensure that the entered data were as accurate as possible. These included visual checking of transcribed values, checking of row or column sums against known totals, and checking for values outside of allowed ranges. Suspicious entries were re-checked against the original source.

    Notes on names: Names have been validated against the World Register of Marine Species (http://www.marinespecies.org/). For uncertain taxa, the most specific taxonomic name has been used (e.g. prey reported in a study as "Pachyptila sp." will appear here as "Pachyptila"; "Cephalopods" will appear as "Cephalopoda"). Uncertain species identifications (e.g. "Notothenia rossii?" or "Gymnoscopelus cf. piabilis") have been assigned the genus name (e.g. "Notothenia", "Gymnoscopelus"). Original names have been retained in a separate column to allow future cross-checking. WoRMS identifiers (APHIA_ID numbers) are given where possible.

    Grouped prey data in the diet sample table need to be handled with a bit of care. Papers commonly report prey statistics aggregated over groups of prey - e.g. one might give the diet composition by individual cephalopod prey species, and then an overall record for all cephalopod prey. The PREY_IS_AGGREGATE column identifies such records. This allows us to differentiate grouped data like this from unidentified prey items from a certain prey group - for example, an unidentifiable cephalopod record would be entered as Cephalopoda (the scientific name), with "N" in the PREY_IS_AGGREGATE column. A record that groups together a number of cephalopod records, possibly including some unidentifiable cephalopods, would also be entered as Cephalopoda, but with "Y" in the PREY_IS_AGGREGATE column. See the notes on PREY_IS_AGGREGATE, below.
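
    For example, assuming the diet table has been read into a data frame called "diet", aggregate rows can be excluded before summarising prey composition (a sketch in base R, not part of the database release):

      # Drop rows that aggregate other rows of the same source, to avoid
      # double counting prey when summarising.
      diet_no_agg <- diet[diet$PREY_IS_AGGREGATE == "N", ]

      # e.g. number of distinct prey taxa recorded per predator
      aggregate(PREY_NAME ~ PREDATOR_NAME, data = diet_no_agg,
                FUN = function(x) length(unique(x)))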

    There are two related R packages that provide data access and functionality for working with these data. See the package home pages for more information: https://github.com/SCAR/sohungry and https://github.com/SCAR/solong.

    Data table schemas

    Sources data table

    • SOURCE_ID: The unique identifier of this source

    • DETAILS: The bibliographic details for this source (e.g. "Hindell M (1988) The diet of the royal penguin Eudyptes schlegeli at Macquarie Island. Emu 88:219–226")

    • NOTES: Relevant notes about this source – if it’s a published paper, this is probably the abstract

    • DOI: The DOI of the source (paper or dataset), in the form "10.xxxx/yyyy"

    Diet data table

    • RECORD_ID: The unique identifier of this record

    • SOURCE_ID: The identifier of the source study from which this record was obtained (see corresponding entry in the sources data table)

    • SOURCE_DETAILS, SOURCE_DOI: The details and DOI of the source, copied from the sources data table for convenience

    • ORIGINAL_RECORD_ID: The identifier of this data record in its original source, if it had one

    • LOCATION: The name of the location at which the data was collected

    • WEST: The westernmost longitude of the sampling region, in decimal degrees (negative values for western hemisphere longitudes)

    • EAST: The easternmost longitude of the sampling region, in decimal degrees (negative values for western hemisphere longitudes)

    • SOUTH: The southernmost latitude of the sampling region, in decimal degrees (negative values for southern hemisphere latitudes)

    • NORTH: The northernmost latitude of the sampling region, in decimal degrees (negative values for southern hemisphere latitudes)

    • ALTITUDE_MIN: The minimum altitude of the sampling region, in metres

    • ALTITUDE_MAX: The maximum altitude of the sampling region, in metres

    • DEPTH_MIN: The shallowest depth of the sampling, in metres

    • DEPTH_MAX: The deepest depth of the sampling, in metres

    • OBSERVATION_DATE_START: The start of the sampling period

    • OBSERVATION_DATE_END: The end of the sampling period. If sampling was carried out over multiple seasons (e.g. during January of 2002 and January of 2003), this will be the first and last dates (in this example, from 1-Jan-2002 to 31-Jan-2003)

    • PREDATOR_NAME: The name of the predator. This may differ from PREDATOR_NAME_ORIGINAL if, for example, taxonomy has changed since the original publication, or if the original publication had spelling errors or used common (not scientific) names

    • PREDATOR_NAME_ORIGINAL: The name of the predator, as it appeared in the original source

    • PREDATOR_APHIA_ID: The numeric identifier of the predator in the WoRMS taxonomic register

    • PREDATOR_WORMS_RANK, PREDATOR_WORMS_KINGDOM, PREDATOR_WORMS_PHYLUM, PREDATOR_WORMS_CLASS, PREDATOR_WORMS_ORDER, PREDATOR_WORMS_FAMILY, PREDATOR_WORMS_GENUS: The taxonomic details of the predator, from the WoRMS taxonomic register

    • PREDATOR_GROUP_SOKI: A descriptive label of the group to which the predator belongs (currently used in the Southern Ocean Knowledge and Information wiki, http://soki.aq)

    • PREDATOR_LIFE_STAGE: Life stage of the predator, e.g. "adult", "chick", "larva", "juvenile". Note that if a food sample was taken from an adult animal, but that food was destined for a juvenile, then the life stage will be "juvenile" (this is common with seabirds feeding chicks)

    • PREDATOR_BREEDING_STAGE: Stage of the breeding season of the predator, if applicable, e.g. "brooding", "chick rearing", "nonbreeding", "posthatching"

    • PREDATOR_SEX: Sex of the predator: "male", "female", "both", or "unknown"

    • PREDATOR_SAMPLE_COUNT: The number of predators for which data are given. If (say) 50 predators were caught but only 20 analysed, this column will contain 20. For scat content studies, this will be the number of scats analysed

    • PREDATOR_SAMPLE_ID: The identifier of the predator(s). If predators are being reported at the individual level (i.e. PREDATOR_SAMPLE_COUNT = 1) then PREDATOR_SAMPLE_ID is the individual animal ID. Alternatively, if the data values being entered here are from a group of predators, then the PREDATOR_SAMPLE_ID identifies that group of predators. PREDATOR_SAMPLE_ID values are unique within a source (i.e. SOURCE_ID, PREDATOR_SAMPLE_ID pairs are globally unique). Rows with the same SOURCE_ID and PREDATOR_SAMPLE_ID values relate to the same predator individual or group of individuals, and so can be combined (e.g. for prey diversity analyses). Subsamples are indicated by a decimal number S.nnn, where S is the parent PREDATOR_SAMPLE_ID, and nnn (001-999) is the subsample number. Studies will sometimes report detailed prey information for a large sample, but then report prey information for various subsamples of that sample (e.g. broken down by predator sex, or sampling season). In the simplest case, the diet of each predator will be reported only once in the study, and in this scenario the PREDATOR_SAMPLE_ID values will simply be 1 to N (for N predators).

    • PREDATOR_SIZE_MIN, PREDATOR_SIZE_MAX, PREDATOR_SIZE_MEAN, PREDATOR_SIZE_SD: The minimum, maximum, mean, and standard deviation of the size of the predators in the sample

    • PREDATOR_SIZE_UNITS: The units of size (e.g. "mm")

    • PREDATOR_SIZE_NOTES: Notes on the predator size information, including a definition of what the size value represents (e.g. "total length", "standard length")

    • PREDATOR_MASS_MIN, PREDATOR_MASS_MAX, PREDATOR_MASS_MEAN, PREDATOR_MASS_SD: The minimum, maximum, mean, and standard deviation of the mass of the predators in the sample

    • PREDATOR_MASS_UNITS: The units of mass (e.g. "g", "kg")

    • PREDATOR_MASS_NOTES: Notes on the predator mass information, including a definition of what the mass value represents

    • PREY_NAME: The scientific name of the prey item (corrected, if necessary)

    • PREY_NAME_ORIGINAL: The name of the prey item, as it appeared in the original source

    • PREY_APHIA_ID: The numeric identifier of the prey in the WoRMS taxonomic register

    • PREY_WORMS_RANK, PREY_WORMS_KINGDOM, PREY_WORMS_PHYLUM, PREY_WORMS_CLASS, PREY_WORMS_ORDER, PREY_WORMS_FAMILY, PREY_WORMS_GENUS: The taxonomic details of the prey, from the WoRMS taxonomic register

    • PREY_GROUP_SOKI: A descriptive label of the group to which the prey belongs (currently used in the Southern Ocean Knowledge and Information wiki, http://soki.aq)

    • PREY_IS_AGGREGATE: "Y" indicates that this row is an aggregation of other rows in this data source. For example, a study might give a number of individual squid species records, and then an overall squid record that encompasses the individual records. Use the PREY_IS_AGGREGATE information to avoid double-counting during analyses

    • PREY_LIFE_STAGE: Life stage of the prey (e.g. "adult", "chick", "larva")

    • PREY_SEX: The sex of the prey ("male", "female", "both", or "unknown"). Note that this is generally "unknown"

    • PREY_SAMPLE_COUNT: The number of prey individuals from which size and mass measurements were made (note: this is NOT the total number of individuals of that prey item in the diet sample)

Cite
Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672

Film Circulation dataset

Explore at:
2 scholarly articles cite this dataset (View in Google Scholar)
csv, png, binAvailable download formats
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.


Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.


2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is in the wide format, i.e. the information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to a crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include every film screening, but only the first screening of a film at a given festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the R scripts used for web scraping and matching. They were written using R 3.6.3 for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches each film in the core dataset with the suggested films on the IMDb search page and records matching scores. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach based on two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, while the OSA (optimal string alignment) algorithm matches titles that may contain typos or minor variations.
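
The sketch below is not the authors' matching loop; it only illustrates the two string-distance methods named above, using the stringdist R package (titles and thresholds are arbitrary examples):

    # Illustration of the "cosine" and "osa" fuzzy-matching methods.
    library(stringdist)

    title_core  <- "portrait of a lady on fire"
    titles_imdb <- c("portrait of a lady on fire", "portrait of a lady",
                     "portait of a lady on fire", "an unrelated title")

    cosine_sim <- stringsim(title_core, titles_imdb, method = "cosine", q = 2)
    osa_sim    <- stringsim(title_core, titles_imdb, method = "osa")

    # keep candidates that score highly under either method
    titles_imdb[pmax(cosine_sim, osa_sim) >= 0.9]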

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” defines the functions for scraping the data from the identified matches (based on the scripts described above and the manual checks). These functions are used for scraping the data in the next scripts.

The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does so only for the first 100 films, to check that everything works. Scraping the entire dataset took a few hours, so a test run on a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and for tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
