100+ datasets found
  1. Ecological Concerns Data Dictionary - Ecological Concerns data dictionary

    • catalog.data.gov
    • fisheries.noaa.gov
    Updated May 24, 2025
    + more versions
    Cite
    (Point of Contact, Custodian) (2025). Ecological Concerns Data Dictionary - Ecological Concerns data dictionary [Dataset]. https://catalog.data.gov/dataset/ecological-concerns-data-dictionary-ecological-concerns-data-dictionary2
    Explore at:
    Dataset updated
    May 24, 2025
    Dataset provided by
    (Point of Contact, Custodian)
    Description

    Evaluating the status of threatened and endangered salmonid populations requires information on the current status of the threats (e.g., habitat, hatcheries, hydropower, and invasives) and the risk of extinction (e.g., status and trend in the Viable Salmonid Population criteria). For salmonids in the Pacific Northwest, threats generally result in changes to physical and biological characteristics of freshwater habitat. These changes are often described by terms like "limiting factors" or "habitat impairment." For example, the condition of freshwater habitat directly impacts salmonid abundance and population spatial structure by affecting carrying capacity and the variability and accessibility of rearing and spawning areas. Thus, one way to assess or quantify threats to ESUs and populations is to evaluate whether the ecological conditions on which fish depend are improving, becoming more degraded, or remain unchanged. In the attached spreadsheets, we have attempted to consistently record limiting factors and threats across all populations and ESUs to enable comparison to other datasets (e.g., restoration projects) in a consistent way. Limiting factors and threats (LF/T) identified in salmon recovery plans were translated into a common language using an ecological concerns data dictionary (see "Ecological Concerns" tab in the attached spreadsheets); a data dictionary defines the wording, meaning, and scope of categories. The ecological concerns data dictionary defines how different elements are related, such as the relationships between threats, ecological concerns, and life history stages. The data dictionary includes categories for ecological dynamics and population-level effects such as "reduced genetic fitness" and "behavioral changes." The data dictionary categories are meant to encompass the ecological conditions that directly impact salmonids and can be addressed directly or indirectly by management actions (habitat restoration, hatchery reform, etc.).
Using the ecological concerns data dictionary enables us to more fully capture the range of effects of hydro, hatchery, and invasive threats as well as habitat threat categories. The organization and format of the data dictionary were also chosen so the information we record can be easily related to datasets we already possess (e.g., restoration data). Data Dictionary.

  2. Open-Source GitHub Repos: Stars, Issues & PRs

    • kaggle.com
    zip
    Updated Sep 6, 2024
    Cite
    Mohammed Mebarek Mecheter (2024). Open-Source GitHub Repos: Stars, Issues & PRs [Dataset]. https://www.kaggle.com/datasets/mohammedmecheter/open-source-github-repos-stars-issues-and-prs
    Explore at:
    Available download formats: zip (17491462 bytes)
    Dataset updated
    Sep 6, 2024
    Authors
    Mohammed Mebarek Mecheter
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Introduction to the Data and Fetching Process

    This dataset comprises detailed information about GitHub repositories, issues, and pull requests, collected using the GitHub API. The data includes repository metadata (such as stars, forks, and open issues), along with historical data on issues and pull requests (PRs), including their creation, closure, and merging timelines.

    Repositories Data Dictionary

    This dataset contains information about GitHub repositories, including metadata such as stars, forks, and activity status.

    Column Name | Data Type | Description
    id | object | Unique identifier for the repository.
    name | object | Name of the repository (e.g., "docker").
    full_name | object | Full name of the repository (e.g., "prometheus/alertmanager").
    description | object | Description of the repository, may be empty.
    stars | int64 | Number of stars the repository has.
    forks | int64 | Number of times the repository has been forked.
    open_issues | int64 | Number of open issues in the repository.
    created_at | datetime | Date and time when the repository was created.
    updated_at | datetime | Date and time when the repository was last updated.
    size_category | object | Categorization of the repository based on the number of stars (micro, small, medium, large, mega).
    stale | bool | Boolean flag indicating if the repository is "stale" (hasn't been updated in over 6 months).
    stars_per_fork | float64 | Number of stars per fork (calculated).
    stars_per_issue | float64 | Number of stars per open issue (calculated).
    contributor_per_star | float64 | Number of contributors per star (calculated).
    total_contributors | int64 | Total number of contributors from issues and pull requests.
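
    The three "calculated" columns can be reproduced from the raw counts. A minimal sketch under assumptions not stated in the dataset: the dict shape stands in for one row, and the division-by-zero guards are an illustrative choice.

```python
def derived_metrics(repo):
    """Compute the calculated columns from the table above for one record.

    `repo` is a plain dict standing in for one repository row; the
    zero-denominator guards are an assumption, not documented behavior.
    """
    stars = repo["stars"]
    return {
        "stars_per_fork": stars / repo["forks"] if repo["forks"] else 0.0,
        "stars_per_issue": stars / repo["open_issues"] if repo["open_issues"] else 0.0,
        "contributor_per_star": repo["total_contributors"] / stars if stars else 0.0,
    }

# Hypothetical record shaped like one row of the repositories file.
metrics = derived_metrics(
    {"stars": 120, "forks": 40, "open_issues": 12, "total_contributors": 30}
)
```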

    Issues Data Dictionary

    This dataset contains details of issues raised in the repositories, including information about their creation, closing, and state.

    Column Name | Data Type | Description
    id | object | Unique identifier for the issue.
    created_at | datetime | Date and time when the issue was created.
    updated_at | datetime | Date and time when the issue was last updated.
    closed_at | datetime | Date and time when the issue was closed (optional, null if open).
    number | int64 | Issue number in the GitHub repository.
    repository | object | The repository that the issue belongs to (name).
    state | object | Current state of the issue (either "open" or "closed").
    title | object | Title of the issue.
    resolution_time_days | float64 | Number of days taken to resolve the issue (calculated, -1 for unresolved issues).
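
    The resolution_time_days column can be derived from the two timestamps. One plausible computation (the function name and fractional-day granularity are assumptions; the -1 sentinel for unresolved issues is from the table above):

```python
from datetime import datetime

def resolution_time_days(created_at, closed_at):
    """Days from issue creation to close; -1 marks unresolved issues,
    matching the sentinel described in the data dictionary above."""
    if closed_at is None:
        return -1.0
    return (closed_at - created_at).total_seconds() / 86400.0

# Hypothetical timestamps for illustration.
unresolved = resolution_time_days(datetime(2024, 1, 1), None)
resolved = resolution_time_days(datetime(2024, 1, 1), datetime(2024, 1, 4, 12, 0))
```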

    Pull Requests Data Dictionary

    This dataset contains information about pull requests (PRs) in the repositories, including metadata such as their state, creation, closing, and merging time.

    Column Name | Data Type | Description
    id | object | Unique identifier for the pull request.
    created_at | datetime | Date and time when the pull request was created.
    updated_at | datetime | Date and time when the pull request was last updated.
    closed_at | datetime | Date and time when the pull request was closed (optional, null if open).
    merged_at | datetime | Date and time when the pull request was merged (optional, null if not merged).
  3. Evaluating Family Engagement in Child Welfare: A Primer for Evaluators on Key Issues in Definition, Measurement, and Outcomes

    • catalog.data.gov
    • data.virginia.gov
    Updated Sep 8, 2025
    + more versions
    Cite
    Administration for Children and Families (2025). Evaluating Family Engagement in Child Welfare: A Primer for Evaluators on Key Issues in Definition, Measurement, and Outcomes [Dataset]. https://catalog.data.gov/dataset/evaluating-family-engagement-in-child-welfare-a-primer-for-evaluators-on-key-issues-in-def
    Explore at:
    Dataset updated
    Sep 8, 2025
    Dataset provided by
    Administration for Children and Families
    Description

    ACF Children's Bureau resource. Metadata-only record linking to the original dataset; open the original dataset below.

  4. Data from: Detroit Area Study, 1956: Orientation on Moral Issues in a Metropolis and The Meaning of Work

    • icpsr.umich.edu
    ascii, delimited, sas +2
    Updated Jul 28, 2010
    Cite
    Angell, Robert C.; Kahn, Robert; Weiss, Robert (2010). Detroit Area Study, 1956: Orientation on Moral Issues in a Metropolis and The Meaning of Work [Dataset]. http://doi.org/10.3886/ICPSR07320.v1
    Explore at:
    Available download formats: stata, spss, sas, ascii, delimited
    Dataset updated
    Jul 28, 2010
    Dataset provided by
    Inter-university Consortium for Political and Social Research: https://www.icpsr.umich.edu/web/pages/
    Authors
    Angell, Robert C.; Kahn, Robert; Weiss, Robert
    License

    https://www.icpsr.umich.edu/web/ICPSR/studies/7320/terms

    Time period covered
    1956
    Area covered
    Detroit, United States, Michigan
    Description

    This study of 797 adults in the Detroit metropolitan area provides information on their attitudes toward work and their motivations for working, as well as their orientation toward many social and political issues. The study was a combination of two separate studies: ORIENTATION ON MORAL ISSUES IN A METROPOLIS by Robert Angell, and THE MEANING OF WORK by Robert Kahn and Robert Weiss. Respondents were asked about the importance of work in their life, the things in their job that made them feel important, the things they wanted from their job that it did not provide, the other areas of their life that made them feel useful, and the people in their lives that influenced their choice of occupation. A number of questions that focused on women working outside the home probed respondents' feelings about how a husband was affected by a working wife, and if there were kinds of jobs that women should not have. Other questions probed respondents' views about what the United States should do in the event of an attack by the Soviet Union on a western European country, a parent not allowing a child to recite the Pledge of Allegiance in school, the proposed racial integration of schools, appointment or election of government officials, effecting changes in the United States Constitution, trial by a jury or a judge, ways to effect world peace, the most important problem for the United States in the future, and a Communist revolution in a Latin American country. Additional items explored respondents' opinion of the Detroit newspapers and the Detroit newspaper strike, and their satisfaction with their neighborhood. Respondents were also asked about their political party preference, as well as their use and ownership of telephones. 
Demographic variables specify age, sex, race, education, place of birth, marital status, number of children, nationality, religious preferences, occupation, family income, length of residence in the Detroit area, home ownership, length of time at present residence, and class identification.

  5. Texas Railroads Data Dictionary

    • gis-txdot.opendata.arcgis.com
    • geoportal-mpo.opendata.arcgis.com
    Updated Mar 29, 2025
    Cite
    Texas Department of Transportation (2025). Texas Railroads Data Dictionary [Dataset]. https://gis-txdot.opendata.arcgis.com/documents/56f5fe7f64b84522842b3315363a3708
    Explore at:
    Dataset updated
    Mar 29, 2025
    Dataset authored and provided by
    Texas Department of Transportation: http://txdot.gov/
    Area covered
    Texas
    Description

    Programmatically generated Data Dictionary document detailing the Texas Railroads service.

        The PDF contains service metadata and a complete list of data fields.
        For any questions or issues related to the document, please contact the data owner of the service identified in the PDF and Credits of this portal item.
    
    
      Related Links
      Texas Railroads Service URL
      Texas Railroads Portal Item
    
  6. Cafe Sales - Dirty Data for Cleaning Training

    • kaggle.com
    zip
    Updated Jan 17, 2025
    Cite
    Ahmed Mohamed (2025). Cafe Sales - Dirty Data for Cleaning Training [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/cafe-sales-dirty-data-for-cleaning-training
    Explore at:
    Available download formats: zip (113510 bytes)
    Dataset updated
    Jan 17, 2025
    Authors
    Ahmed Mohamed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dirty Cafe Sales Dataset

    Overview

    The Dirty Cafe Sales dataset contains 10,000 rows of synthetic data representing sales transactions in a cafe. This dataset is intentionally "dirty," with missing values, inconsistent data, and errors introduced to provide a realistic scenario for data cleaning and exploratory data analysis (EDA). It can be used to practice cleaning techniques, data wrangling, and feature engineering.

    File Information

    • File Name: dirty_cafe_sales.csv
    • Number of Rows: 10,000
    • Number of Columns: 8

    Columns Description

    Column Name | Description | Example Values
    Transaction ID | A unique identifier for each transaction. Always present and unique. | TXN_1234567
    Item | The name of the item purchased. May contain missing or invalid values (e.g., "ERROR"). | Coffee, Sandwich
    Quantity | The quantity of the item purchased. May contain missing or invalid values. | 1, 3, UNKNOWN
    Price Per Unit | The price of a single unit of the item. May contain missing or invalid values. | 2.00, 4.00
    Total Spent | The total amount spent on the transaction. Calculated as Quantity * Price Per Unit. | 8.00, 12.00
    Payment Method | The method of payment used. May contain missing or invalid values (e.g., None, "UNKNOWN"). | Cash, Credit Card
    Location | The location where the transaction occurred. May contain missing or invalid values. | In-store, Takeaway
    Transaction Date | The date of the transaction. May contain missing or incorrect values. | 2023-01-01

    Data Characteristics

    1. Missing Values:

      • Some columns (e.g., Item, Payment Method, Location) may contain missing values represented as None or empty cells.
    2. Invalid Values:

      • Some rows contain invalid entries like "ERROR" or "UNKNOWN" to simulate real-world data issues.
    3. Price Consistency:

      • Prices for menu items are consistent but may have missing or incorrect values introduced.

    Menu Items

    The dataset includes the following menu items with their respective price ranges:

    Item | Price ($)
    Coffee | 2
    Tea | 1.5
    Sandwich | 4
    Salad | 5
    Cake | 3
    Cookie | 1
    Smoothie | 4
    Juice | 3

    Use Cases

    This dataset is suitable for:

    • Practicing data cleaning techniques such as handling missing values, removing duplicates, and correcting invalid entries.
    • Exploring EDA techniques like visualizations and summary statistics.
    • Performing feature engineering for machine learning workflows.

    Cleaning Steps Suggestions

    To clean this dataset, consider the following steps:

    1. Handle Missing Values:

      • Fill missing numeric values with the median or mean.
      • Replace missing categorical values with the mode or "Unknown."
    2. Handle Invalid Values:

      • Replace invalid entries like "ERROR" and "UNKNOWN" with NaN or appropriate values.
    3. Date Consistency:

      • Ensure all dates are in a consistent format.
      • Fill missing dates with plausible values based on nearby records.
    4. Feature Engineering:

      • Create new columns, such as Day of the Week or Transaction Month, for further analysis.
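
    A minimal pure-Python sketch of the first two cleaning steps (the median and "Unknown" fills follow the suggestions above; the clean_rows helper and the sample rows are illustrative, not part of the dataset, and date repair is omitted for brevity):

```python
import statistics

def clean_rows(rows):
    """Apply the suggested cleaning steps to rows from dirty_cafe_sales.csv.

    Invalid markers ("ERROR", "UNKNOWN", empty) become missing, numeric gaps
    are filled with the column median, and categorical gaps with "Unknown".
    """
    # Normalize invalid markers to None (missing).
    for row in rows:
        for col, val in row.items():
            if isinstance(val, str) and val.strip().upper() in {"ERROR", "UNKNOWN", ""}:
                row[col] = None

    # Fill numeric columns with the median of the observed values.
    for col in ("Quantity", "Price Per Unit", "Total Spent"):
        observed = [float(r[col]) for r in rows if r[col] is not None]
        fill = statistics.median(observed) if observed else 0.0
        for r in rows:
            r[col] = float(r[col]) if r[col] is not None else fill

    # Fill categorical columns with "Unknown".
    for col in ("Item", "Payment Method", "Location"):
        for r in rows:
            if r[col] is None:
                r[col] = "Unknown"
    return rows

# Two hypothetical rows shaped like the data dictionary (date column omitted).
sample = [
    {"Transaction ID": "TXN_0000001", "Item": "Coffee", "Quantity": "2",
     "Price Per Unit": "2.00", "Total Spent": "4.00",
     "Payment Method": "Cash", "Location": "In-store"},
    {"Transaction ID": "TXN_0000002", "Item": "ERROR", "Quantity": "UNKNOWN",
     "Price Per Unit": "4.00", "Total Spent": "8.00",
     "Payment Method": "", "Location": "Takeaway"},
]
cleaned = clean_rows(sample)
```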

    License

    This dataset is released under the CC BY-SA 4.0 License. You are free to use, share, and adapt it, provided you give appropriate credit.

    Feedback

    If you have any questions or feedback, feel free to reach out through the dataset's discussion board on Kaggle.

  7. Dataset: A Systematic Literature Review on the topic of High-value datasets

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 23, 2023
    Cite
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič (2023). Dataset: A Systematic Literature Review on the topic of High-value datasets [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944424
    Explore at:
    Dataset updated
    Jun 23, 2023
    Dataset provided by
    University of Zagreb
    University of Tartu
    University of the Aegean
    Gdańsk University of Technology
    Authors
    Anastasija Nikiforova; Nina Rizun; Magdalena Ciesielska; Charalampos Alexopoulos; Andrea Miletič
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study ("Towards High-Value Datasets determination for data-driven development: a systematic literature review") conducted by Anastasija Nikiforova (University of Tartu), Nina Rizun and Magdalena Ciesielska (Gdańsk University of Technology), Charalampos Alexopoulos (University of the Aegean), and Andrea Miletič (University of Zagreb). It is being made public both to serve as supplementary data for the "Towards High-Value Datasets determination for data-driven development: a systematic literature review" paper (a pre-print is available in Open Access at https://arxiv.org/abs/2305.10234) and to allow other researchers to use these data in their own work.

    The protocol is intended for the Systematic Literature Review on the topic of High-value Datasets, with the aim of gathering information on how the topic of high-value datasets (HVD) and their determination has been reflected in the literature over the years and what these studies have found to date, incl. the indicators used in them, involved stakeholders, data-related aspects, and frameworks. The data in this dataset were collected as a result of the SLR over Scopus, Web of Science, and the Digital Government Research library (DGRL) in 2023.

    Methodology

    To understand how HVD determination has been reflected in the literature over the years and what these studies have found to date, all relevant literature covering this topic was studied. To this end, the SLR was carried out by searching digital libraries covered by Scopus, Web of Science (WoS), and the Digital Government Research library (DGRL).

    These databases were queried for the keywords ("open data" OR "open government data") AND ("high-value data*" OR "high value data*"), which were applied to the article title, keywords, and abstract to limit the results to papers where these objects were primary research objects rather than merely mentioned in the body, e.g., as future work. After deduplication, 11 unique articles were found and further checked for relevance. As a result, a total of 9 articles were examined in depth. Each study was independently examined by at least two authors.

    To attain the objective of our study, we developed the protocol, where the information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information.

    Test procedure: each study was independently examined by at least two authors; after an in-depth examination of the full text of the article, the structured protocol was filled in for each study. The structure of the survey is available in the supplementary files (see Protocol_HVD_SLR.odt, Protocol_HVD_SLR.docx). The data collected for each study by two researchers were then synthesized into one final version by a third researcher.

    Description of the data in this data set

    Protocol_HVD_SLR provides the structure of the protocol. Spreadsheet #1 provides the filled protocol for the relevant studies. Spreadsheet #2 provides the list of results after the search over the three indexing databases, i.e., before filtering out irrelevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design- related information, (3) quality-related information, (4) HVD determination-related information

    Descriptive information
    1) Article number - a study number, corresponding to the study number assigned in an Excel worksheet
    2) Complete reference - the complete source information to refer to the study
    3) Year of publication - the year in which the study was published
    4) Journal article / conference paper / book chapter - the type of the paper {journal article, conference paper, book chapter}
    5) DOI / Website - a link to the website where the study can be found
    6) Number of citations - the number of citations of the article in Google Scholar, Scopus, Web of Science
    7) Availability in OA - availability of the article in Open Access
    8) Keywords - keywords of the paper as indicated by the authors
    9) Relevance for this study - what is the relevance level of the article for this study? {high / medium / low}

    Approach- and research design-related information
    10) Objective / RQ - the research objective / aim, established research questions
    11) Research method (including unit of analysis) - the methods used to collect data, including the unit of analysis (country, organisation, specific unit that has been analysed, e.g., the number of use-cases, scope of the SLR, etc.)
    12) Contributions - the contributions of the study
    13) Method - whether the study uses a qualitative, quantitative, or mixed-methods approach
    14) Availability of the underlying research data - whether there is a reference to publicly available underlying research data (e.g., transcriptions of interviews, collected data) or an explanation of why these data are not shared
    15) Period under investigation - the period (or moment) in which the study was conducted
    16) Use of theory / theoretical concepts / approaches - does the study mention any theory / theoretical concepts / approaches? If any theory is mentioned, how is it used in the study?

    Quality- and relevance-related information
    17) Quality concerns - whether there are any quality concerns (e.g., limited information about the research methods used)
    18) Primary research object - is the HVD a primary research object in the study? (primary - the paper is focused on HVD determination; secondary - mentioned but not studied (e.g., as part of discussion, future work, etc.))

    HVD determination-related information
    19) HVD definition and type of value - how is the HVD defined in the article and / or any other equivalent term?
    20) HVD indicators - what are the indicators to identify HVD? How were they identified? (components & relationships, "input -> output")
    21) A framework for HVD determination - is there a framework presented for HVD identification? What components does it consist of and what are the relationships between these components? (detailed description)
    22) Stakeholders and their roles - what stakeholders or actors does HVD determination involve? What are their roles?
    23) Data - what data do HVD cover?
    24) Level (if relevant) - what is the level of the HVD determination covered in the article? (e.g., city, regional, national, international)

    Format of the file .xls, .csv (for the first spreadsheet only), .odt, .docx

    Licenses or restrictions CC-BY

    For more info, see README.txt

  8. Replication Data for: Policy Diffusion: The Issue-Definition Stage

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Dec 6, 2021
    Cite
    Fabrizio Gilardi; Charles R. Shipan; Bruno Wüest (2021). Replication Data for: Policy Diffusion: The Issue-Definition Stage [Dataset]. http://doi.org/10.7910/DVN/QEMNP1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 6, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Fabrizio Gilardi; Charles R. Shipan; Bruno Wüest
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.2/customlicense?persistentId=doi:10.7910/DVN/QEMNP1

    Area covered
    United States
    Description

    We put forward a new approach to studying issue definition within the context of policy diffusion. Most studies of policy diffusion---which is the process by which policymaking in one government affects policymaking in other governments---have focused on policy adoptions. We shift the focus to an important but neglected aspect of this process: the issue-definition stage. We use topic models to estimate how policies are framed during this stage and how these frames are predicted by prior policy adoptions. Focusing on smoking restriction in U.S. states, our analysis draws upon an original dataset of over 52,000 paragraphs from newspapers covering 49 states between 1996 and 2013. We find that frames regarding the policy's concrete implications are predicted by prior adoptions in other states, while frames regarding its normative justifications are not. Our approach and findings open the way for a new perspective to studying policy diffusion in many different areas.

  9. Supplementary data for a study on a data driven learning approach for the assessment of data quality

    • resodate.org
    • leopard.tu-braunschweig.de
    Updated Jul 6, 2021
    Cite
    Erik Tute (2021). Supplementary data for a study on a data driven learning approach for the assessment of data quality [Dataset]. http://doi.org/10.24355/dbbs.084-202107061200-0
    Explore at:
    Dataset updated
    Jul 6, 2021
    Dataset provided by
    Technische Universität Braunschweig
    Universitätsbibliothek Braunschweig
    Authors
    Erik Tute
    Description

    This data is 100% artificial (no real patients were involved). The dataset was generated to explore the application of simple machine learning to learn how simple statistical measures about a dataset (e.g., the mean value of a variable, value counts, etc.) can indicate data quality issues. The generated dummy data, outcome data, and MM-results are available in the folder "dummy data, outcome data, MM-results". The numbers of triggered DQ-issue rules are available in the folder "triggered issue rules". The MM-results used for machine learning can be found in the file "MM-results_export_for_machine_learningresult_exports.csv".

  10. Dataset for "A fast dictionary-learning-based classification scheme using undercomplete dictionaries"

    • figshare.com
    bin
    Updated Jun 19, 2025
    Cite
    Saeed Mohseni seh deh (2025). Dataset for "A fast dictionary-learning-based classification scheme using undercomplete dictionaries" [Dataset]. http://doi.org/10.6084/m9.figshare.29367389.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Jun 19, 2025
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Saeed Mohseni seh deh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a preprocessed version of the publicly available MNIST handwritten digit dataset, formatted for use in the research paper "A fast dictionary-learning-based classification scheme using undercomplete dictionaries". The data has been converted into vector form and sorted into .mat files by class label, ranging from 0 to 9. The files are formatted as training and testing sets: the training data has X_train as vectorized images and Y_train as the corresponding labels, and X_test and Y_test are the images and labels for the testing dataset.

    Contents:
    X_train_vector_sort_MNIST
    Y_train_MNIST
    X_test_vector_MNIST
    Y_test_MNIST

    Usage: the dataset is intended for direct use with the code available at https://github.com/saeedmohseni97/fast-udl-classification

  11. Data from: Bibliographic dataset characterizing studies that use online biodiversity databases

    • zenodo.org
    • portalcientifico.unav.edu
    • +1more
    bin, csv
    Updated Jan 24, 2020
    Cite
    Joan E. Ball-Damerow; Laura Brenskelle; Narayani Barve; Raphael LaFrance; Pamela S. Soltis; Petra Sierwald; Rüdiger Bieler; Arturo Ariño; Robert Guralnick (2020). Bibliographic dataset characterizing studies that use online biodiversity databases [Dataset]. http://doi.org/10.5281/zenodo.2589439
    Explore at:
    Available download formats: csv, bin
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Joan E. Ball-Damerow; Laura Brenskelle; Narayani Barve; Raphael LaFrance; Pamela S. Soltis; Petra Sierwald; Rüdiger Bieler; Arturo Ariño; Robert Guralnick
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes bibliographic information for 501 papers that were published from 2010-April 2017 (time of search) and use online biodiversity databases for research purposes. Our overarching goal in this study is to determine how research uses of biodiversity data developed during a time of unprecedented growth of online data resources. We also determine uses with the highest number of citations, how online occurrence data are linked to other data types, and if/how data quality is addressed. Specifically, we address the following questions:

    1.) What primary biodiversity databases have been cited in published research, and which databases have been cited most often?

    2.) Is the biodiversity research community citing databases appropriately, and are the cited databases currently accessible online?

    3.) What are the most common uses, general taxa addressed, and data linkages, and how have they changed over time?

    4.) What uses have the highest impact, as measured through the mean number of citations per year?

    5.) Are certain uses applied more often for plants/invertebrates/vertebrates?

    6.) Are links to specific data types associated more often with particular uses?

    7.) How often are major data quality issues addressed?

    8.) What data quality issues tend to be addressed for the top uses?

    Relevant papers for this analysis include those that use online and openly accessible primary occurrence records, or those that add data to an online database. Google Scholar (GS) provides full-text indexing, which was important to identify data sources that often appear buried in the methods section of a paper. Our search was therefore restricted to GS. All authors discussed and agreed upon representative search terms, which were relatively broad to capture a variety of databases hosting primary occurrence records. The terms included: “species occurrence” database (8,800 results), “natural history collection” database (634 results), herbarium database (16,500 results), “biodiversity database” (3,350 results), “primary biodiversity data” database (483 results), “museum collection” database (4,480 results), “digital accessible information” database (10 results), and “digital accessible knowledge” database (52 results)--note that quotations are used as part of the search terms where specific phrases are needed in whole. We downloaded all records returned by each search (or the first 500 if there were more) into a Zotero reference management database. About one third of the 2500 papers in the final dataset were relevant. Three of the authors with specialized knowledge of the field characterized relevant papers using a standardized tagging protocol based on a series of key topics of interest. We developed a list of potential tags and descriptions for each topic, including: database(s) used, database accessibility, scale of study, region of study, taxa addressed, research use of data, other data types linked to species occurrence data, data quality issues addressed, authors, institutions, and funding sources. Each tagged paper was thoroughly checked by a second tagger.

    The final dataset of tagged papers allows us to quantify general areas of research made possible by the expansion of online species occurrence databases, and trends in those areas over time. Analyses of these data will be published in a separate quantitative review.

  12. LNWB Ch03 Data Processes - data management plan

    • search.dataone.org
    • hydroshare.org
    Updated Dec 5, 2021
    Cite
    Christina Bandaragoda; Bracken Capen; Joanne Greenberg; Mary Dumas; Peter Gill (2021). LNWB Ch03 Data Processes - data management plan [Dataset]. https://search.dataone.org/view/sha256%3Aa7eac4a8f4655389d5169cbe06562ea14e88859d2c4b19a633a0610ca07a329f
    Explore at:
    Dataset updated
    Dec 5, 2021
    Dataset provided by
    Hydroshare
    Authors
    Christina Bandaragoda; Bracken Capen; Joanne Greenberg; Mary Dumas; Peter Gill
    Description

    Overview: The Lower Nooksack Water Budget Project involved assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. This Data Management Plan provides an overview of the data sets, formats and collaboration environment that was used to develop the project. Use of a plan during development of the technical work products provided a forum for the data development and management to be conducted with transparent methods and processes. At project completion, the Data Management Plan provides an accessible archive of the data resources used and supporting information on the data storage, intended access, sharing and re-use guidelines.

    One goal of the Lower Nooksack Water Budget project is to make this “usable technical information” as accessible as possible across technical, policy and general public users. The project data, analyses and documents will be made available through the WRIA 1 Watershed Management Project website http://wria1project.org. This information is intended for use by the WRIA 1 Joint Board and partners working to achieve the adopted goals and priorities of the WRIA 1 Watershed Management Plan.

    Model outputs for the Lower Nooksack Water Budget are summarized by sub-watersheds (drainages) and point locations (nodes). In general, due to changes in land use over time and changes to available streamflow and climate data, the water budget for any watershed needs to be updated periodically. Further detailed information about data sources is provided in review packets developed for specific technical components including climate, streamflow and groundwater level, soils and land cover, and water use.

    Purpose: This project involves assembling a wide range of existing data related to WRIA 1 and specifically the Lower Nooksack Subbasin, updating existing data sets and generating new data sets. Data will be used as input to various hydrologic, climatic and geomorphic components of the Topnet-Water Management (WM) model, but will also be available to support other modeling efforts in WRIA 1. Much of the data used as input to the Topnet model is publicly available and maintained by others (e.g., USGS DEMs and streamflow data, SSURGO soils data, University of Washington gridded meteorological data). Pre-processing is performed to convert these existing data into a format that can be used as input to the Topnet model. Topnet model outputs (ASCII text files) are then post-processed and combined with spatial data to generate GIS data that can be used to create maps and illustrations of the spatial distribution of water information. Other products generated during this project will include documentation of methods, input by WRIA 1 Joint Board Staff Team during review and comment periods, and communication tools developed for public engagement and public comment on the project.
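    The post-processing step above amounts to parsing tabular ASCII model output and keying it by drainage so it can be joined to a GIS layer. A minimal sketch, assuming a simple whitespace-delimited table (the column layout here is illustrative, not Topnet's actual output format):

```python
# Hedged sketch: parse a Topnet-style ASCII output table into a dict
# keyed by drainage ID, ready to join onto a sub-watershed GIS layer.
# Column names and layout are assumed for illustration.
ascii_output = """\
drainage_id year precip_mm et_mm runoff_mm
10 2005 1120.4 540.2 610.7
11 2005 980.1 505.9 471.3
"""

def parse_model_output(text):
    """Return {drainage_id: {variable: value}} from a whitespace table."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    records = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        key = int(row.pop("drainage_id"))
        records[key] = {name: float(value) for name, value in row.items()}
    return records

budget = parse_model_output(ascii_output)
```

    Once keyed by `drainage_id`, the values can be attached to sub-watershed polygons in a GIS to map the spatial distribution of the water budget.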

    In order to maintain an organized system of developing and distributing data, Lower Nooksack Water Budget project collaborators should be familiar with the standards for data management described in this document, and with the following issues related to generating and distributing data:

    1. Standards for metadata and data formats
    2. Plans for short-term storage and data management (i.e., file formats, local storage and backup procedures, and security)
    3. Legal and ethical issues (i.e., intellectual property, confidentiality of study participants)
    4. Access policies and provisions (i.e., how the data will be made available to others, any restrictions needed)
    5. Provisions for long-term archiving and preservation (i.e., establishment of a new data archive or utilization of an existing archive)
    6. Assigned data management responsibilities (i.e., persons responsible for ensuring data management and monitoring compliance with the Data Management Plan)

    This resource is a subset of the LNWB Ch03 Data Processes Collection Resource.

  13. Conceptualization of public data ecosystems

    • data.niaid.nih.gov
    Updated Sep 26, 2024
    Cite
    Anastasija, Nikiforova; Martin, Lnenicka (2024). Conceptualization of public data ecosystems [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_13842001
    Explore at:
    Dataset updated
    Sep 26, 2024
    Dataset provided by
    University of Hradec Králové
    University of Tartu
    Authors
    Anastasija, Nikiforova; Martin, Lnenicka
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains data collected during a study "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems" conducted by Martin Lnenicka (University of Hradec Králové, Czech Republic), Anastasija Nikiforova (University of Tartu, Estonia), Mariusz Luterek (University of Warsaw, Warsaw, Poland), Petar Milic (University of Pristina - Kosovska Mitrovica, Serbia), Daniel Rudmark (Swedish National Road and Transport Research Institute, Sweden), Sebastian Neumaier (St. Pölten University of Applied Sciences, Austria), Karlo Kević (University of Zagreb, Croatia), Anneke Zuiderwijk (Delft University of Technology, Delft, the Netherlands), Manuel Pedro Rodríguez Bolívar (University of Granada, Granada, Spain).

    As there is a lack of understanding of the elements that constitute different types of value-adding public data ecosystems, and of how these elements form and shape the development of these ecosystems over time, which can lead to misguided efforts to develop future public data ecosystems, the aim of the study is twofold: (1) to explore how public data ecosystems have developed over time and (2) to identify the value-adding elements and formative characteristics of public data ecosystems. Using an exploratory retrospective analysis and a deductive approach, we systematically review 148 studies published between 1994 and 2023. Based on the results, this study presents a typology of public data ecosystems, develops a conceptual model of the elements and formative characteristics that contribute most to value-adding public data ecosystems, and develops a conceptual model of the evolutionary generations of public data ecosystems, called the Evolutionary Model of Public Data Ecosystems (EMPDE). Finally, three avenues for a future research agenda are proposed.

    This dataset is being made public both to act as supplementary data for "Understanding the development of public data ecosystems: from a conceptual model to a six-generation model of the evolution of public data ecosystems", Telematics and Informatics, and to document the Systematic Literature Review component that informs the study.

    Description of the data in this data set

    PublicDataEcosystem_SLR provides the structure of the protocol

    Spreadsheet #1 provides the list of results after the search over three indexing databases and the filtering out of irrelevant studies.

    Spreadsheet #2 provides the protocol structure.

    Spreadsheet #3 provides the filled protocol for relevant studies.

    The information on each selected study was collected in four categories: (1) descriptive information, (2) approach- and research design-related information, (3) quality-related information, and (4) HVD determination-related information.

    Descriptive Information

    Article number

    A study number, corresponding to the study number assigned in an Excel worksheet

    Complete reference

    The complete source information to refer to the study (in APA style), including the author(s) of the study, the year in which it was published, the study's title and other source information.

    Year of publication

    The year in which the study was published.

    Journal article / conference paper / book chapter

    The type of the paper, i.e., journal article, conference paper, or book chapter.

    Journal / conference / book

    The journal, conference, or book in which the paper was published.

    DOI / Website

    A link to the website where the study can be found.

    Number of words

    The number of words in the study.

    Number of citations in Scopus and WoS

    The number of citations of the paper in Scopus and WoS digital libraries.

    Availability in Open Access

    Availability of a study in the Open Access or Free / Full Access.

    Keywords

    Keywords of the paper as indicated by the authors (in the paper).

    Relevance for our study (high / medium / low)

    The relevance level of the paper for our study.

    Approach- and research design-related information

    Objective / Aim / Goal / Purpose & Research Questions

    The research objective and established RQs.

    Research method (including unit of analysis)

    The methods used to collect data in the study, including the unit of analysis that refers to the country, organisation, or other specific unit that has been analysed such as the number of use-cases or policy documents, number and scope of the SLR etc.

    Study’s contributions

    The study’s contribution as defined by the authors

    Qualitative / quantitative / mixed method

    Whether the study uses a qualitative, quantitative, or mixed-methods approach.

    Availability of the underlying research data

    Whether the paper references the public availability of the underlying research data (e.g., transcriptions of interviews, collected data) or explains why these data are not openly shared.

    Period under investigation

    Period (or moment) in which the study was conducted (e.g., January 2021-March 2022)

    Use of theory / theoretical concepts / approaches? If yes, specify them

    Does the study mention any theory, theoretical concepts, or approaches? If yes, which ones? If a theory is mentioned, how is it used in the study (e.g., mentioned to explain a certain phenomenon, used as a framework for analysis, tested, or mentioned in the future research section)?

    Quality-related information

    Quality concerns

    Whether there are any quality concerns (e.g., limited information about the research methods used).

    Public Data Ecosystem-related information

    Public data ecosystem definition

    How the public data ecosystem is defined in the paper, including any equivalent term used instead (most often "infrastructure"). If an alternative term is used, what is the public data ecosystem called in the paper?

    Public data ecosystem evolution / development

    Does the paper define the evolution of the public data ecosystem? If yes, how is it defined and what factors affect it?

    What constitutes a public data ecosystem?

    What constitutes a public data ecosystem (components & relationships) - their "FORM / OUTPUT" presented in the paper (general description with more detailed answers to further additional questions).

    Components and relationships

    What components does the public data ecosystem consist of and what are the relationships between these components? Alternative names for components - element, construct, concept, item, helix, dimension etc. (detailed description).

    Stakeholders

    What stakeholders (e.g., governments, citizens, businesses, Non-Governmental Organisations (NGOs) etc.) does the public data ecosystem involve?

    Actors and their roles

    What actors does the public data ecosystem involve? What are their roles?

    Data (data types, data dynamism, data categories etc.)

    What data does the public data ecosystem cover (i.e., what is it intended / designed for)? Refer to all data-related aspects, including but not limited to data types, data dynamism (static data, dynamic data, real-time data, streams), and prevailing data categories / domains / topics.

    Processes / activities / dimensions, data lifecycle phases

    What processes, activities, dimensions and data lifecycle phases (e.g., locate, acquire, download, reuse, transform, etc.) does the public data ecosystem involve or refer to?

    Level (if relevant)

    What is the level of the public data ecosystem covered in the paper? (e.g., city, municipal, regional, national (=country), supranational, international).

    Other elements or relationships (if any)

    What other elements or relationships does the public data ecosystem consist of?

    Additional comments

    Additional comments (e.g., what other topics affected the public data ecosystems and their elements, what is expected to affect the public data ecosystems in the future, what were important topics by which the period was characterised etc.).

    New papers

    Does the study refer to any other potentially relevant papers?

    Additional references to potentially relevant papers that were found in the analysed paper (snowballing).

    Format of the files: .xls, .csv (for the first spreadsheet only), .docx

    Licenses or restrictions: CC-BY

    For more info, see README.txt

  14. Science Education Research Topic Modeling Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Oct 9, 2024
    Cite
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph (2024). Science Education Research Topic Modeling Dataset [Dataset]. http://doi.org/10.5281/zenodo.4094974
    Explore at:
    Available download formats: bin, txt, html, text/x-python
    Dataset updated
    Oct 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Tor Ole B. Odden; Tor Ole B. Odden; Alessandro Marin; Alessandro Marin; John L. Rudolph; John L. Rudolph
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset contains scraped and processed text from roughly 100 years of articles published in the Wiley journal Science Education (formerly General Science Quarterly). This text has been cleaned and filtered in preparation for analysis using natural language processing techniques, particularly topic modeling with latent Dirichlet allocation (LDA). We also include a Jupyter Notebook illustrating how one can use LDA to analyze this dataset and extract latent topics from it, as well as analyze the rise and fall of those topics over the history of the journal.

    The articles were downloaded and scraped in December of 2019. Only non-duplicate articles with a listed author (according to the CrossRef metadata database) were included, and due to missing data and text recognition issues we excluded all articles published prior to 1922. This resulted in 5577 articles in total being included in the dataset. The text of these articles was then cleaned in the following way:

    • We removed duplicated text from each article: prior to 1969, articles in the journal were published in a magazine format in which the end of one article and the beginning of the next would share the same page, so we developed an automated detection of article beginnings and endings that was able to remove any duplicate text.
    • We removed the reference sections of the articles, as well as headings (in all caps) such as “ABSTRACT”.
    • We reunited any partial words that were separated due to line breaks, text recognition issues, or British vs. American spellings (for example converting “per cent” to “percent”)
    • We removed all numbers, symbols, special characters, and punctuation, and lowercased all words.
    • We removed all stop words, which are words without any semantic meaning on their own—“the”, “in,” “if”, “and”, “but”, etc.—and all single-letter words.
    • We lemmatized all words, with the added step of including a part-of-speech tagger so our algorithm would only aggregate and lemmatize words from the same part of speech (e.g., nouns vs. verbs).
    • We detected and created bi-grams: sets of words that frequently co-occur and carry additional meaning together. These words were combined with an underscore, for example “problem_solving” and “high_school”.

    After filtering, each document was then turned into a list of individual words (or tokens) which were then collected and saved (using the python pickle format) into the file scied_words_bigrams_V5.pkl.
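    The cleaning steps above can be sketched with only the standard library. This is a simplified stand-in, not the authors' pipeline: the real one used part-of-speech-aware lemmatization and statistical bi-gram detection, while the stop-word list and bi-gram set here are tiny illustrative examples.

```python
# Minimal sketch of the cleaning pipeline described above. STOP_WORDS and
# BIGRAMS are tiny illustrative stand-ins; lemmatization is omitted.
import pickle
import re

STOP_WORDS = {"the", "in", "if", "and", "but", "of", "to", "a"}
BIGRAMS = {("problem", "solving"), ("high", "school")}

def clean(text):
    """Lowercase, strip numbers/punctuation, drop stop words and
    single-letter words, then join known bi-grams with an underscore."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split()
              if t not in STOP_WORDS and len(t) > 1]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in BIGRAMS:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

doc = clean("Problem solving in the high school classroom, 1922!")
payload = pickle.dumps([doc])  # one token list per article, as in the .pkl file
```

    Running all articles through such a function yields the list-of-token-lists structure stored in scied_words_bigrams_V5.pkl.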

    In addition to this file, we have also included the following files:

    1. SciEd_paper_names_weights.pkl: A file containing limited metadata (title, author, year published, and DOI) for each of the papers, in the same order as they appear within the main datafile. This file also includes the weights assigned by an LDA model used to analyze the data.
    2. Science Education LDA Notebook.ipynb: A notebook file that replicates our LDA analysis, with a written explanation of all of the steps and suggestions on how to explore the results.
    3. Supporting files for the notebook. These include the requirements, the README, a helper script with functions for plotting that were too long to include in the notebook, and two HTML graphs that are embedded into the notebook.

    This dataset is shared under the terms of the Wiley Text and Data Mining Agreement, which allows users to share text and data mining output for non-commercial research purposes. Any questions or comments can be directed to Tor Ole Odden, t.o.odden@fys.uio.no.

  15. Data from: Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Nov 27, 2025
    Cite
    U.S. Geological Survey (2025). Variable Terrestrial GPS Telemetry Detection Rates: Parts 1 - 7—Data [Dataset]. https://catalog.data.gov/dataset/variable-terrestrial-gps-telemetry-detection-rates-parts-1-7data
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    Studies utilizing Global Positioning System (GPS) telemetry rarely result in 100% fix success rates (FSR). Many assessments of wildlife resource use do not account for missing data, either assuming data loss is random or because there is no practical treatment for systematic data loss. Several studies have explored how the environment, technological features, and animal behavior influence rates of missing data in GPS telemetry, but previous spatially explicit models developed to correct for sampling bias have been specified to small study areas, to a small range of data loss, or to single species, limiting their general utility. Here we explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large-mammal habitat use. We also evaluate patterns in missing data that relate to potential animal activities that change the orientation of the antennae, and characterize home-range probability of GPS detection for four focal species: cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus).

    Part 1, Positive Openness Raster (raster dataset): Openness is an angular measure of the relationship between surface relief and horizontal distance. For angles less than 90 degrees it is equivalent to the internal angle of a cone with its apex at a DEM location, constrained by neighboring elevations within a specified radial distance; a 480-meter search radius was used for this calculation of positive openness. Openness incorporates the terrain line-of-sight (viewshed) concept and is calculated from multiple zenith and nadir angles, here along eight azimuths. Positive openness measures openness above the surface, with high values for convex forms and low values for concave forms (Yokoyama et al. 2002). We calculated positive openness using a custom Python script, following the methods of Yokoyama et al. (2002), with a USGS National Elevation Dataset as input.

    Part 2, Northern Arizona GPS Test Collar (csv): Bias correction in GPS telemetry datasets requires a strong understanding of the mechanisms that result in missing data. We tested wildlife GPS collars in a variety of environmental conditions to derive a predictive model of fix acquisition. We found that terrain exposure and tall overstory vegetation are the primary environmental features affecting GPS performance. Model evaluation showed a strong correlation (0.924) between observed and predicted fix success rates (FSR) and little bias in predictions. The model's predictive ability was evaluated using two independent datasets from stationary test collars of different make/model and fix-interval programming, placed at different study sites. No statistically significant differences (95% CI) between predicted and observed FSRs suggest that changes in technological factors have minor influence on the model's ability to predict FSR in new study areas in the southwestern US. The model training data are provided here as fix attempts by hour; this table can be linked with the site location shapefile using the site field.

    Part 3, Probability Raster (raster dataset): The predicted probability of successful GPS fix acquisition derived from the model described in Part 2. We evaluated GPS telemetry datasets by comparing the mean probability of a successful GPS fix across study animals' home ranges to the observed FSR of GPS-downloaded collars deployed on cougars (Puma concolor), desert bighorn sheep (Ovis canadensis nelsoni), Rocky Mountain elk (Cervus elaphus ssp. nelsoni), and mule deer (Odocoileus hemionus). Comparing the mean probability of acquisition within study animals' home ranges to the observed FSRs of GPS-downloaded collars resulted in an approximately 1:1 linear relationship with r-squared = 0.68.

    Part 4, GPS Test Collar Sites (shapefile): Locations of the stationary GPS test collar sites used to derive and evaluate the predictive model of fix acquisition described in Part 2.

    Part 5, Cougar Home Ranges (shapefile): Cougar home ranges were calculated to compare the mean probability of GPS fix acquisition across the home range to the actual fix success rate (FSR) of the collar, as a means of evaluating whether characteristics of an animal's home range affect observed FSR. We estimated home ranges using the Local Convex Hull (LoCoH) method at the 90th isopleth. Only data obtained by direct GPS download of retrieved units were used; satellite-delivered data were omitted for animals whose collars were lost or damaged, because satellite delivery tends to lose an additional 10% of data. Comparisons with home-range mean probability of fix were also used as a reference for assessing whether the frequency with which animals use areas of low GPS acquisition rates plays a role in observed FSRs.

    Part 6, Cougar Fix Success Rate by Hour (csv): Cougar GPS collar fix success varied by hour of day, suggesting that circadian rhythms, with bouts of rest during daylight hours, may change the orientation of the GPS receiver and affect the ability to acquire fixes. Raw data on overall fix success rates (FSR) and FSR by hour were used to predict relative reductions in FSR. Only direct GPS download datasets are included; satellite-delivered data were omitted for animals whose collars were lost or damaged, because satellite delivery tends to lose approximately an additional 10% of data.

    Part 7, Openness Python Script version 2.0: This Python script was used to calculate positive openness using a 30-meter digital elevation model for a large geographic area in Arizona, California, Nevada, and Utah. A scientific research project used the script to explore environmental effects on GPS fix acquisition rates across a wide range of environmental conditions and detection rates for bias correction of terrestrial GPS-derived, large-mammal habitat use.

  16. GEDI L2A Elevation and Height Metrics Data Global Footprint Level V002

    • data.nasa.gov
    • datasets.ai
    • +6more
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). GEDI L2A Elevation and Height Metrics Data Global Footprint Level V002 [Dataset]. https://data.nasa.gov/dataset/gedi-l2a-elevation-and-height-metrics-data-global-footprint-level-v002-d2f2d
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Global Ecosystem Dynamics Investigation (GEDI) mission aims to characterize ecosystem structure and dynamics to enable radically improved quantification and understanding of the Earth’s carbon cycle and biodiversity. The GEDI instrument produces high resolution laser ranging observations of the 3-dimensional structure of the Earth. GEDI is attached to the International Space Station (ISS) and collects data globally between 51.6° N and 51.6° S latitudes at the highest resolution and densest sampling of any light detection and ranging (lidar) instrument in orbit to date. Each GEDI Version 2 granule encompasses one-fourth of an ISS orbit and includes georeferenced metadata to allow for spatial querying and subsetting. The GEDI instrument was removed from the ISS and placed into storage on March 17, 2023. No data were acquired during the hibernation period from March 17, 2023, to April 24, 2024. GEDI has since been reinstalled on the ISS and resumed operations as of April 26, 2024.

    The purpose of the GEDI Level 2A Geolocated Elevation and Height Metrics product (GEDI02_A) is to provide waveform interpretation and extracted products from each GEDI01_B received waveform, including ground elevation, canopy top height, and relative height (RH) metrics. The methodology for generating the GEDI02_A product datasets is adapted from the Land, Vegetation, and Ice Sensor (LVIS) algorithm. The GEDI02_A product is provided in HDF5 format and has a spatial resolution (average footprint) of 25 meters. The GEDI02_A data product contains 156 layers for each of the eight beams, including ground elevation, canopy top height, relative return energy metrics (e.g., canopy vertical structure), and many other interpreted products from the return waveforms. Additional information for the layers can be found in the GEDI Level 2A Data Dictionary.

    Known Issues:
    • Data acquisition gaps: GEDI data acquisitions were suspended on December 19, 2019 (2019 Day 353) and resumed on January 8, 2020 (2020 Day 8).
    • Incorrect Reference Ground Track (RGT) number in the filename for select GEDI files: GEDI Science Data Products for six orbits on August 7, 2020, and November 12, 2021, had the incorrect RGT number in the filename. There is no impact to the science data, but users should reference this document for the correct RGT numbers.
    • Section 8 of the User Guide provides additional information on known issues.

    Improvements/Changes from Previous Versions:
    • Metadata has been updated to include spatial coordinates.
    • Granule size has been reduced from one full ISS orbit (~5.83 GB) to four segments per orbit (~1.48 GB).
    • Filenames have been updated to include the segment number and dataset version number.
    • Improved geolocation for an orbital segment.
    • Added elevation from the SRTM digital elevation model for comparison.
    • Modified the method to predict an optimum algorithm setting group per laser shot.
    • Added additional land cover datasets related to phenology, urban infrastructure, and water persistence.
    • Added the selected_mode_flag dataset to the root beam group using the selected algorithm.
    • Removed shots when the laser is not firing.
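    The georeferenced metadata in each granule supports spatial querying and subsetting; the core operation is a latitude/longitude bounding-box filter over footprint records. A minimal sketch (field names are illustrative stand-ins, not the product's actual HDF5 layer names):

```python
# Hedged sketch of bounding-box subsetting over GEDI-style footprint
# records. The record fields here are illustrative, not the actual
# GEDI02_A HDF5 layers.
shots = [
    {"shot_number": 1, "lat": 10.2, "lon": -84.1, "canopy_height_m": 31.5},
    {"shot_number": 2, "lat": 48.9, "lon": 2.35, "canopy_height_m": 12.0},
    {"shot_number": 3, "lat": 10.6, "lon": -83.9, "canopy_height_m": 28.7},
]

def subset_bbox(records, min_lat, max_lat, min_lon, max_lon):
    """Keep only footprints whose coordinates fall inside the box."""
    return [r for r in records
            if min_lat <= r["lat"] <= max_lat
            and min_lon <= r["lon"] <= max_lon]

tropics_subset = subset_bbox(shots, 8.0, 11.5, -86.0, -82.5)
```

    In practice the same filter is applied to coordinate arrays read from the HDF5 granule, which avoids downloading or processing footprints outside the study area.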

  17. Data from: GEDI L1B Geolocated Waveform Data Global Footprint Level V002

    • data.nasa.gov
    • catalog.data.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). GEDI L1B Geolocated Waveform Data Global Footprint Level V002 [Dataset]. https://data.nasa.gov/dataset/gedi-l1b-geolocated-waveform-data-global-footprint-level-v002-8fe3b
    Explore at:
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    The Global Ecosystem Dynamics Investigation (GEDI) mission aims to characterize ecosystem structure and dynamics to enable radically improved quantification and understanding of the Earth’s carbon cycle and biodiversity. The GEDI instrument produces high resolution laser ranging observations of the 3-dimensional structure of the Earth. GEDI is attached to the International Space Station (ISS) and collects data globally between 51.6° N and 51.6° S latitudes at the highest resolution and densest sampling of any light detection and ranging (lidar) instrument in orbit to date. Each GEDI Version 2 granule encompasses one-fourth of an ISS orbit and includes georeferenced metadata to allow for spatial querying and subsetting.

    The GEDI instrument was removed from the ISS and placed into storage on March 17, 2023. No data were acquired during the hibernation period from March 17, 2023, to April 24, 2024. GEDI has since been reinstalled on the ISS and resumed operations as of April 26, 2024.

    The GEDI Level 1B Geolocated Waveforms product (GEDI01_B) provides geolocated corrected and smoothed waveforms, geolocation parameters, and geophysical corrections for each laser shot for all eight GEDI beams. GEDI01_B data are created by geolocating the GEDI01_A raw waveform data. The GEDI01_B product is provided in HDF5 format and has a spatial resolution (average footprint) of 25 meters. The GEDI01_B data product contains 85 layers for each of the eight beams, including the geolocated corrected and smoothed waveform datasets and parameters and the accompanying ancillary, geolocation, and geophysical correction datasets. Additional information can be found in the GEDI L1B Product Data Dictionary.

    Known Issues

    Data acquisition gaps: GEDI data acquisitions were suspended on December 19, 2019 (2019 Day 353) and resumed on January 8, 2020 (2020 Day 8).

    Incorrect Reference Ground Track (RGT) number in the filename for select GEDI files: GEDI Science Data Products for six orbits on August 7, 2020, and November 12, 2021, had the incorrect RGT number in the filename. There is no impact to the science data, but users should reference this document for the correct RGT numbers.

    Section 8 of the User Guide provides additional information on known issues.

    Improvements/Changes from Previous Versions

    Metadata has been updated to include spatial coordinates.

    Granule size has been reduced from one full ISS orbit (~7.99 GB) to four segments per orbit (~2.00 GB).

    Filenames have been updated to include the segment number and dataset version number.

    Improved geolocation for an orbital segment.

    Added elevation from the SRTM digital elevation model for comparison.

    Modified the method to predict an optimum algorithm setting group per laser shot.

    Added additional land cover datasets related to phenology, urban infrastructure, and water persistence.

    Added the selected_mode_flag dataset to the root beam group using the selected algorithm.

    Removed shots when the laser is not firing.
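    The coverage constraint above follows directly from the ISS orbit: GEDI can only sample latitudes between 51.6° S and 51.6° N. A minimal Python check of that constraint (the helper name is ours, not part of any GEDI tooling):

    ```python
    # GEDI's stated coverage limit: the ISS orbit restricts observations
    # to latitudes between 51.6 degrees S and 51.6 degrees N.
    GEDI_MAX_ABS_LAT = 51.6

    def in_gedi_coverage(lat_deg: float) -> bool:
        """Return True if a latitude falls within GEDI's sampling band."""
        return abs(lat_deg) <= GEDI_MAX_ABS_LAT

    print(in_gedi_coverage(45.0))   # mid-latitude forest: True
    print(in_gedi_coverage(60.0))   # boreal zone above the ISS track: False
    ```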

  18. Z

    Complete Rxivist dataset of scraped biology preprint data

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Mar 2, 2023
    Cite
    Abdill, Richard J.; Blekhman, Ran (2023). Complete Rxivist dataset of scraped biology preprint data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2529922
    Explore at:
    Dataset updated
    Mar 2, 2023
    Dataset provided by
    University of Minnesota
    Authors
    Abdill, Richard J.; Blekhman, Ran
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    rxivist.org allowed readers to sort and filter the tens of thousands of preprints posted to bioRxiv and medRxiv. Rxivist used a custom web crawler to index all papers posted to those two websites; this is a snapshot of the Rxivist production database. The version number indicates the date on which the snapshot was taken. See the included "README.md" file for instructions on how to use the "rxivist.backup" file to import data into a PostgreSQL database server.
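    The import step amounts to creating a database and restoring the dump into it; the exact invocation is documented in the bundled README.md. A hedged sketch of the two commands, wrapped in Python for illustration (the database name "rxivist" and the "postgres" user are assumptions, not values taken from the README):

    ```python
    # Build the shell commands needed to load the "rxivist.backup" snapshot
    # into a local PostgreSQL server. Database name and user are illustrative.
    import subprocess

    def restore_commands(backup_path: str, dbname: str = "rxivist") -> list[list[str]]:
        """Return the createdb and pg_restore invocations for the snapshot."""
        return [
            ["createdb", "-U", "postgres", dbname],
            ["pg_restore", "-U", "postgres", "-d", dbname, "--no-owner", backup_path],
        ]

    for cmd in restore_commands("rxivist.backup"):
        print(" ".join(cmd))
        # subprocess.run(cmd, check=True)  # uncomment with a server running
    ```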

    Please note this is a different repository than the one used for the Rxivist manuscript—that is in a separate Zenodo repository. You're welcome (and encouraged!) to use this data in your research, but please cite our paper, now published in eLife.

    Previous versions are also available pre-loaded into Docker images, available at blekhmanlab/rxivist_data.

    Version notes:

    2023-03-01

    The final Rxivist data upload, more than four years after the first and encompassing 223,541 preprints posted to bioRxiv and medRxiv through the end of February 2023.

    2020-12-07

    In addition to bioRxiv preprints, the database now includes all medRxiv preprints as well.

    The website where a preprint was posted is now recorded in a new field in the "articles" table, called "repo".

    We've significantly refactored the web crawler to take advantage of developments with the bioRxiv API.

    The main difference is that preprints flagged as "published" by bioRxiv are no longer recorded on the same schedule that download metrics are updated: The Rxivist database should now record published DOI entries the same day bioRxiv detects them.

    Twitter metrics have returned, for the most part. Improvements with the Crossref Event Data API mean we can once again tally daily Twitter counts for all bioRxiv DOIs.

    The "crossref_daily" table remains where these are recorded, and daily numbers are now up to date.

    Historical daily counts have also been re-crawled to fill in the empty space that started in October 2019.

    There are still several gaps that are more than a week long due to missing data from Crossref.

    We have recorded available Crossref Twitter data for all papers with DOI numbers starting with "10.1101," which includes all medRxiv preprints. However, there appears to be almost no Twitter data available for medRxiv preprints.

    The download metrics for article id 72514 (DOI 10.1101/2020.01.30.927871) were found to be out of date for February 2020 and are now correct. This is notable because article 72514 is the most downloaded preprint of all time; we're still looking into why this wasn't updated after the month ended.

    2020-11-18

    Publication checks should be back on schedule.

    2020-10-26

    This snapshot fixes most of the data issues found in the previous version. Indexed papers are now up to date, and download metrics are back on schedule. The check for publication status remains behind schedule, however, and the database may not include published DOIs for papers that have been flagged on bioRxiv as "published" over the last two months. Another snapshot will be posted in the next few weeks with updated publication information.

    2020-09-15

    A crawler error caused this snapshot to exclude all papers posted after about August 29, with some papers having download metrics that were more out of date than usual. The "last_crawled" field is accurate.

    2020-09-08

    This snapshot is misconfigured and will not work without modification; it has been replaced with version 2020-09-15.

    2019-12-27

    Several dozen papers did not have dates associated with them; that has been fixed.

    Some authors have had two entries in the "authors" table for portions of 2019, one profile that was linked to their ORCID and one that was not, occasionally with almost identical "name" strings. This happened after bioRxiv began changing author names to reflect the names in the PDFs, rather than the ones manually entered into their system. These database records are mostly consolidated now, but some may remain.

    2019-11-29

    The Crossref Event Data API remains down; Twitter data is unavailable for dates after early October.

    2019-10-31

    The Crossref Event Data API is still experiencing problems; the Twitter data for October is incomplete in this snapshot.

    The README file has been modified to reflect changes in the process for creating your own DB snapshots if using the newly released PostgreSQL 12.

    2019-10-01

    The Crossref API is back online, and the "crossref_daily" table should now include up-to-date tweet information for July through September.

    About 40,000 authors were removed from the author table because the name had been removed from all preprints they had previously been associated with, likely because their name changed slightly on the bioRxiv website ("John Smith" to "J Smith" or "John M Smith"). The "author_emails" table was also modified to remove entries referring to the deleted authors. The web crawler is being updated to clean these orphaned entries more frequently.

    2019-08-30

    The Crossref Event Data API, which provides the data used to populate the table of tweet counts, has not been fully functional since early July. While we are optimistic that accurate tweet counts will be available at some point, the sparse values currently in the "crossref_daily" table for July and August should not be considered reliable.

    2019-07-01

    A new "institution" field has been added to the "article_authors" table that stores each author's institutional affiliation as listed on that paper. The "authors" table still has each author's most recently observed institution.

    We began collecting this data in the middle of May, but it has not been applied to older papers yet.

    2019-05-11

    The README was updated to correct a link to the Docker repository used for the pre-built images.

    2019-03-21

    The license for this dataset has been changed to CC-BY, which allows use for any purpose and requires only attribution.

    A new table, "publication_dates," has been added and will be continually updated. This table will include an entry for each preprint that has been published externally for which we can determine a date of publication, based on data from Crossref. (This table was previously included in the "paper" schema but was not updated after early December 2018.)

    Foreign key constraints have been added to almost every table in the database. This should not impact any read behavior, but anyone writing to these tables will encounter constraints on existing fields that refer to other tables. Most frequently, this means the "article" field in a table will need to refer to an ID that actually exists in the "articles" table.

    The "author_translations" table has been removed. This was used to redirect incoming requests for outdated author profile pages and was likely not of any functional use to others.

    The "README.md" file has been renamed "1README.md" because Zenodo only displays a preview for the file that appears first in the list alphabetically.

    The "article_ranks" and "article_ranks_working" tables have been removed as well; they were unused.

    2019-02-13.1

    After consultation with bioRxiv, the "fulltext" table will not be included in further snapshots until (and if) concerns about licensing and copyright can be resolved.

    The "docker-compose.yml" file was added, with corresponding instructions in the README to streamline deployment of a local copy of this database.

    2019-02-13

    The redundant "paper" schema has been removed.

    BioRxiv has begun making the full text of preprints available online. Beginning with this version, a new table ("fulltext") is available that contains the text of preprints that have been processed already. The format in which this information is stored may change in the future; any digression will be noted here.

    This is the first version that has a corresponding Docker image.

  19. Z

    Public Utility Data Liberation Project (PUDL) Data Release

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 14, 2025
    Cite
    Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella (2025). Public Utility Data Liberation Project (PUDL) Data Release [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3653158
    Explore at:
    Dataset updated
    Feb 14, 2025
    Dataset provided by
    Catalyst Cooperative
    Authors
    Selvans, Zane A.; Gosnell, Christina M.; Sharpe, Austen; Norman, Bennett; Schira, Zach; Lamb, Katherine; Xia, Dazhong; Belfer, Ella
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PUDL v2025.2.0 Data Release

    This is our regular quarterly release for 2025Q1. It includes updates to all the datasets that are published with quarterly or higher frequency, plus initial versions of a few new data sources that have been in the works for a while.

    One major change this quarter is that we are now publishing all processed PUDL data as Apache Parquet files, alongside our existing SQLite databases. See Data Access for more on how to access these outputs.
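    The Parquet files open directly with tools like pandas' read_parquet, and the companion SQLite database can be queried with nothing but the Python standard library. A sketch of the SQLite path, using made-up table and column names (the real schema is listed in the PUDL data dictionary):

    ```python
    # Querying a PUDL-style SQLite database with only the standard library.
    # "demo_plants" and its columns are illustrative, not actual PUDL schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")  # stand-in for a downloaded pudl.sqlite
    conn.execute(
        "CREATE TABLE demo_plants (plant_id_eia INTEGER, net_generation_mwh REAL)"
    )
    conn.executemany(
        "INSERT INTO demo_plants VALUES (?, ?)",
        [(1, 10.5), (1, 12.0), (2, 7.3)],
    )
    total = conn.execute(
        "SELECT SUM(net_generation_mwh) FROM demo_plants WHERE plant_id_eia = 1"
    ).fetchone()[0]
    print(total)  # 22.5
    ```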

    Some potentially breaking changes to be aware of:

    In the EIA Form 930 – Hourly and Daily Balancing Authority Operations Report a number of new energy sources have been added, and some old energy sources have been split into more granular categories. See Changes in energy source granularity over time.

    We are now running the EPA’s CAMD to EIA unit crosswalk code for each individual year starting from 2018, rather than just 2018 and 2021, resulting in more connections between these two datasets and changes to some sub-plant IDs. See the note below for more details.

    Many thanks to the organizations who make these regular updates possible! Especially GridLab, RMI, and the ZERO Lab at Princeton University. If you rely on PUDL and would like to help ensure that the data keeps flowing, please consider joining them as a PUDL Sustainer, as we are still fundraising for 2025.

    New Data

    EIA 176

    Add a couple of semi-transformed interim EIA-176 (natural gas sources and dispositions) tables. They aren’t yet being written to the database, but are one step closer. See #3555 and PRs #3590, #3978. Thanks to @davidmudrauskas for moving this dataset forward.

    Extracted these interim tables up through the latest 2023 data release. See #4002 and #4004.

    EIA 860

    Added EIA 860 Multifuel table. See #3438 and #3946.

    FERC 1

    Added three new output tables containing granular utility accounting data. See #4057, #3642 and the table descriptions in the data dictionary:

    out_ferc1_yearly_detailed_income_statements

    out_ferc1_yearly_detailed_balance_sheet_assets

    out_ferc1_yearly_detailed_balance_sheet_liabilities

    SEC Form 10-K Parent-Subsidiary Ownership

    We have added some new tables describing the parent-subsidiary company ownership relationships reported in the SEC’s Form 10-K, Exhibit 21 “Subsidiaries of the Registrant”. Where possible these tables link the SEC filers or their subsidiary companies to the corresponding EIA utilities. This work was funded by a grant from the Mozilla Foundation. Most of the ML models and data preparation took place in the mozilla-sec-eia repository separate from the main PUDL ETL, as it requires processing hundreds of thousands of PDFs and the deployment of some ML experiment tracking infrastructure. The new tables are handed off as nearly finished products to the PUDL ETL pipeline. Note that these are preliminary, experimental data products and are known to be incomplete and to contain errors. Extracting data tables from unstructured PDFs and the SEC to EIA record linkage are necessarily probabilistic processes.

    See PRs #4026, #4031, #4035, #4046, #4048, #4050 and check out the table descriptions in the PUDL data dictionary:

    out_sec10k_parents_and_subsidiaries

    core_sec10k_quarterly_filings

    core_sec10k_quarterly_exhibit_21_company_ownership

    core_sec10k_quarterly_company_information

    Expanded Data Coverage

    EPA CEMS

    Added 2024 Q4 of CEMS data. See #4041 and #4052.

    EPA CAMD EIA Crosswalk

    In the past, the crosswalk in PUDL has used the EPA’s published crosswalk (run with 2018 data), and an additional crosswalk we ran with 2021 EIA 860 data. To ensure that the crosswalk reflects updates in both EIA and EPA data, we re-ran the EPA R code which generates the EPA CAMD EIA crosswalk with 4 new years of data: 2019, 2020, 2022 and 2023. Re-running the crosswalk pulls the latest data from the CAMD FACT API, which results in some changes to the generator and unit IDs reported on the EPA side of the crosswalk, which feeds into the creation of core_epa_assn_eia_epacamd.

    The changes only result in the addition of new units and generators in the EPA data, with no changes to matches at the plant level. However, the updates to generator and unit IDs have resulted in changes to the subplant IDs - some EIA boilers and generators which previously had no matches to EPA data have now been matched to EPA unit data, resulting in an overall reduction in the number of rows in the core_epa_assn_eia_epacamd_subplant_ids table. See issues #4039 and PR #4056 for a discussion of the changes observed in the course of this update.

    EIA 860M

    Added EIA 860m through December 2024. See #4038 and #4047.

    EIA 923

    Added EIA 923 monthly data through September 2024. See #4038 and #4047.

    EIA Bulk Electricity Data

    Updated the EIA Bulk Electricity data to include data published up through 2024-11-01. See #4042 and PR #4051.

    EIA 930

    Updated the EIA 930 data to include data published up through the beginning of February 2025. See #4040 and PR #4054. 10 new energy sources were added and 3 were retired; see Changes in energy source granularity over time for more information.

    Bug Fixes

    Fix an accidentally swapped set of starting balance / ending balance column rename parameters in the pre-2021 DBF derived data that feeds into core_ferc1_yearly_other_regulatory_liabilities_sched278. See issue #3952 and PRs #3969, #3979. Thanks to @yolandazzz13 for making this fix.

    Added preliminary data validation checks for several FERC 1 tables that were missing it #3860.

    Fix spelling of Lake Huron and Lake Saint Clair in out_vcerare_hourly_available_capacity_factor and related tables. See issue #4007 and PR #4029.

    Quality of Life Improvements

    We added a sources parameter to pudl.metadata.classes.DataSource.from_id() in order to make it possible to use the pudl-archiver repository to archive datasets that won’t necessarily be ingested into PUDL. See this PUDL archiver issue and PRs #4003 and #4013.

    Other PUDL v2025.2.0 Resources

    PUDL v2025.2.0 Data Dictionary

    PUDL v2025.2.0 Documentation

    PUDL in the AWS Open Data Registry

    PUDL v2025.2.0 in a free, public AWS S3 bucket: s3://pudl.catalyst.coop/v2025.2.0/

    PUDL v2025.2.0 in a requester-pays GCS bucket: gs://pudl.catalyst.coop/v2025.2.0/

    Zenodo archive of the PUDL GitHub repo for this release

    PUDL v2025.2.0 release on GitHub

    PUDL v2025.2.0 package in the Python Package Index (PyPI)

    Contact Us

    If you're using PUDL, we would love to hear from you! Even if it's just a note to let us know that you exist, and how you're using the software or data. Here's a bunch of different ways to get in touch:

    Follow us on GitHub

    Use the PUDL Github issue tracker to let us know about any bugs or data issues you encounter

    GitHub Discussions is where we provide user support.

    Watch our GitHub Project to see what we're working on.

    Email us at hello@catalyst.coop for private communications.

    On Mastodon: @CatalystCoop@mastodon.energy

    On BlueSky: @catalyst.coop

    On Twitter: @CatalystCoop

    Connect with us on LinkedIn

    Play with our data and notebooks on Kaggle

    Combine our data with ML models on HuggingFace

    Learn more about us on our website: https://catalyst.coop

    Subscribe to our announcements list for email updates.

  20. Reaction times and other skewed distributions: problems with the mean and...

    • figshare.com
    pdf
    Updated May 31, 2023
    Cite
    Guillaume Rousselet; Rand Wilcox (2023). Reaction times and other skewed distributions: problems with the mean and the median [Dataset]. http://doi.org/10.6084/m9.figshare.6911924.v4
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Guillaume Rousselet; Rand Wilcox
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Reproducibility package for the article:
    Reaction times and other skewed distributions: problems with the mean and the median
    Guillaume A. Rousselet & Rand R. Wilcox
    preprint: https://psyarxiv.com/3y54r
    doi: 10.31234/osf.io/3y54r
    This package contains all the code and data to reproduce the figures and analyses in the article.
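    The article's central point can be seen in miniature with simulated data: for right-skewed distributions such as reaction times, the mean is pulled into the long tail and sits above the median, so the two summaries disagree. A minimal standard-library sketch (exponential samples stand in for real reaction-time data):

    ```python
    # For right-skewed data the mean exceeds the median, so choosing
    # one or the other as "the" summary changes the conclusion.
    import random
    import statistics

    random.seed(42)
    # Exponential samples as a stand-in for reaction times (~300 ms scale).
    samples = [random.expovariate(1 / 300) for _ in range(10_000)]

    mean_rt = statistics.mean(samples)
    median_rt = statistics.median(samples)
    print(mean_rt > median_rt)  # True: skew pulls the mean into the tail
    ```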

Cite
(Point of Contact, Custodian) (2025). Ecological Concerns Data Dictionary - Ecological Concerns data dictionary [Dataset]. https://catalog.data.gov/dataset/ecological-concerns-data-dictionary-ecological-concerns-data-dictionary2

Ecological Concerns Data Dictionary - Ecological Concerns data dictionary

Explore at:
Dataset updated
May 24, 2025
Dataset provided by
(Point of Contact, Custodian)
Description

Evaluating the status of threatened and endangered salmonid populations requires information on the current status of the threats (e.g., habitat, hatcheries, hydropower, and invasives) and the risk of extinction (e.g., status and trend in the Viable Salmonid Population criteria). For salmonids in the Pacific Northwest, threats generally result in changes to physical and biological characteristics of freshwater habitat. These changes are often described by terms like "limiting factors" or "habitat impairment." For example, the condition of freshwater habitat directly impacts salmonid abundance and population spatial structure by affecting carrying capacity and the variability and accessibility of rearing and spawning areas. Thus, one way to assess or quantify threats to ESUs and populations is to evaluate whether the ecological conditions on which fish depend are improving, becoming more degraded, or remain unchanged. In the attached spreadsheets, we have attempted to consistently record limiting factors and threats across all populations and ESUs to enable comparison to other datasets (e.g., restoration projects) in a consistent way. Limiting factors and threats (LF/T) identified in salmon recovery plans were translated into a common language using an ecological concerns data dictionary (see "Ecological Concerns" tab in the attached spreadsheets); a data dictionary defines the wording, meaning, and scope of categories. The ecological concerns data dictionary defines how different elements are related, such as the relationships between threats, ecological concerns, and life history stages. The data dictionary includes categories for ecological dynamics and population level effects such as "reduced genetic fitness" and "behavioral changes." The data dictionary categories are meant to encompass the ecological conditions that directly impact salmonids and can be addressed directly or indirectly by management (habitat restoration, hatchery reform, etc.) actions.
Using the ecological concerns data dictionary enables us to more fully capture the range of effects of hydro, hatchery, and invasive threats as well as habitat threat categories. The organization and format of the data dictionary were also chosen so the information we record can be easily related to datasets we already possess (e.g., restoration data).
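    The translation step described above is, in essence, a lookup from plan-specific wording to standardized categories. A hedged sketch of that idea in Python (the limiting-factor phrases and most category names here are hypothetical illustrations, not entries from the actual data dictionary):

    ```python
    # Map recovery-plan limiting-factor phrasing onto standardized
    # ecological-concern categories. Entries are illustrative only;
    # "Reduced Genetic Fitness" echoes a category named in the description.
    ECOLOGICAL_CONCERNS = {
        "low summer flows": "Altered Hydrology",
        "loss of side channels": "Degraded Channel Structure",
        "hatchery straying": "Reduced Genetic Fitness",
    }

    def translate(limiting_factor: str) -> str:
        """Return the standardized category for a recovery-plan phrase."""
        return ECOLOGICAL_CONCERNS.get(limiting_factor.lower(), "Unclassified")

    print(translate("Loss of side channels"))  # Degraded Channel Structure
    print(translate("unmapped wording"))       # Unclassified
    ```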
