https://creativecommons.org/publicdomain/zero/1.0/
By US Open Data Portal, data.gov [source]
This Kaggle dataset showcases the groundbreaking research undertaken by the GRACEnet program, which attempts to better understand and minimize greenhouse gas (GHG) emissions from agro-ecosystems in order to create a healthier world for all. Through multi-location field studies that use standardized protocols – combined with models, producers, and policy makers – GRACEnet seeks to typify existing production practices, maximize C sequestration, minimize net GHG emissions, and meet sustainable production goals. This Kaggle dataset allows us to evaluate the impact of different management systems on factors such as carbon dioxide and nitrous oxide emissions, C sequestration levels, and crop/forest yield levels, plus additional environmental effects such as air quality. With this data we can start getting an idea of how agricultural policies may be influencing our planet's ever-evolving climate dilemma.
Step 1: Familiarize yourself with the columns in this dataset. In particular, pay attention to Spreadsheet tab description (brief description of each spreadsheet tab), Element or value display name (name of each element or value being measured), Description (detailed description), Data type (type of data being measured), Unit (unit of measurement for the data), Calculation (calculation used to determine a value or percentage), Format (format required for submitting values), and Low Value and High Value (the range of acceptable entries).
Step 2: Familiarize yourself with any additional information related to calculations. Most calculations use accepted best estimates based on standard protocols defined by GRACEnet. Every calculation is described in detail and includes post-processing steps such as quality assurance/quality control adjustments and measurement uncertainty assessment, as available sources permit. Relevant calculations were discussed collaboratively among all participating partners wherever they felt it necessary, and all terms were rigorously reviewed before the partners agreed on any decision. A range was established when several assumptions were needed, or when there was a high possibility that samples might fall outside the ranges associated with the standard protocol conditions set up at GRACEnet Headquarters laboratories due to external factors such as soil type and climate.
Step 3: Determine what types of operations are allowed within each spreadsheet tab (.csv file). For example, on some tabs adding an entire row may be permitted while using formulas is not, since non-standard manipulations often introduce errors into an analysis. Users are therefore encouraged to add new rows or columns only where it fits their specific analysis. Operations such as filling blank cells with zeros, or deleting rows/columns that were already removed from other tabs during the standard filtering process, should be avoided: these non-standard changes introduce unverified noise that can bias results later during robustness testing and self-verification.
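As an illustration of Step 1, the sketch below checks submitted values against the Low Value/High Value ranges documented in the data dictionary. It is only a minimal example under stated assumptions: the file names (data_dictionary.csv, measurement_tab.csv) and the exact CSV column headers are guesses and should be adjusted to the actual spreadsheet tabs.

```python
import pandas as pd

# Assumed file names; replace with the actual exported tabs.
data_dict = pd.read_csv("data_dictionary.csv")     # one row per element/value
measurements = pd.read_csv("measurement_tab.csv")  # a measurement tab

# Acceptable ranges keyed by the element display name (assumed numeric Low/High values).
ranges = data_dict.set_index("Element or value display name")[["Low Value", "High Value"]]

# Flag values that fall outside the documented range for each element column.
for element, (low, high) in ranges.iterrows():
    if element in measurements.columns:
        out_of_range = ~measurements[element].between(low, high)
        if out_of_range.any():
            print(f"{element}: {out_of_range.sum()} value(s) outside [{low}, {high}]")
```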
- Analyzing and comparing the environmental benefits of different agricultural management practices, such as crop yields and carbon sequestration rates.
- Developing an app or other mobile platform to help farmers find management practices that maximize carbon sequestration and minimize GHG emissions in their area, based on their specific soil condition and climate data.
- Building an AI-driven model to predict net greenhouse gas emissions and C sequestration from potential weekly/monthly production plans across different regions of the world, based on optimal allocation of resources such as fertilizers, equipment, and water.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the ...
This is the Boston Housing Dataset, copied from: https://www.kaggle.com/datasets/vikrishnan/boston-house-prices
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per 10,000 USD
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000's
Missing values: None
Duplicate entries: None
This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
It has then been amended to include multiple additional, correlated features:
Directly Derived Features - New features created by applying direct transformations to existing features, for example a scaled version of another feature (e.g., CRIM_dup_2 = CRIM * 2), or an existing feature with added noise (e.g., RM_noisy = RM + random_noise).
Linear Combinations - Combining existing features linearly. For instance, a feature that is a weighted sum of several other features (e.g., weighted_feature = 0.5 * CRIM + 0.3 * NOX + 0.2 * RM).
Polynomial Features - Creating polynomial transformations of existing features. For example, square or cube a feature (e.g., AGE_squared = AGE^2). These will have a predictable correlation with their original feature.
Interaction Terms - Generating features that are the product of two existing features, revealing interactions between variables (e.g., TAX_RAD_interaction = TAX * RAD).
Duplicate Features with Variations - Duplicating some existing features and adding small variations, for example copying a feature and adding a small random value to each entry (e.g., LSTAT_varied = LSTAT + small_random_value).
These were generated by loading the dataset in Python and transforming it, for example:
```python
import random
import numpy as np
import pandas as pd

# df is assumed to already hold the original Boston Housing data,
# e.g. df = pd.read_csv("housing.csv")
original_columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

for col_name in original_columns:
    # Linear combinations: weighted sum with two other randomly chosen columns
    other_cols = random.sample([c for c in original_columns if c != col_name], 2)
    df[f"{col_name}_linear_combo"] = 0.5 * df[col_name] + 0.3 * df[other_cols[0]] + 0.2 * df[other_cols[1]]
    # Polynomial features
    df[f"{col_name}_squared"] = df[col_name] ** 2
    # Interaction terms
    other_col = random.choice([c for c in original_columns if c != col_name])
    df[f"{col_name}_{other_col}_interaction"] = df[col_name] * df[other_col]
    # Duplicate features with small variations
    df[f"{col_name}_varied"] = df[col_name] + np.random.rand(df.shape[0]) * 0.05

print(df)
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Information

The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. Analysis of their coverage of compound entries and biological targets revealed considerable differences, however, suggesting the benefit of a consensus dataset. Therefore, we have combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1,144,803 compounds with 10,915,362 bioactivities on 5,613 targets (including defined macromolecular targets as well as cell lines and phenotypic readouts). It also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence obtained by comparing data from different sources. We have unified the source databases, brought them into a common format, and combined them, enabling easy generic use in multiple applications such as chemogenomics and data-driven drug design. The consensus dataset provides increased target coverage and contains a higher number of molecules than the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to the single source databases. In addition, semi-automated structure and bioactivity annotation checks, with flags for divergent data from different sources, may help data selection and further accurate curation.

Structure and content of the dataset

Dataset structure (columns): ChEMBL ID; PubChem ID; IUPHAR ID; Target; Activity type; Assay type; Unit; Mean C (0) ...; Mean PC (0) ...; Mean B (0) ...; Mean I (0) ...; Mean PD (0) ...; Activity check annotation; Ligand names; Canonical SMILES C ...; Structure check; Source.

The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV file and a compressed CSV file. Except for the canonical SMILES columns, all columns are of datatype 'string'; the canonical SMILES columns use the SMILES format. We recommend the File Reader node for using the dataset in KNIME. With the help of this node the data types of the columns can be adjusted exactly; in addition, only this node can read the compressed format.

Column content:
- ChEMBL ID, PubChem ID, IUPHAR ID: chemical identifier of the databases
- Target: biological target of the molecule expressed as the HGNC gene symbol
- Activity type: for example, pIC50
- Assay type: simplification/classification of the assay into cell-free, cellular, functional and unspecified
- Unit: unit of the bioactivity measurement
- Mean columns of the databases: mean of bioactivity values or activity comments, denoted with the frequency of their occurrence in the database, e.g. Mean C = 7.5 (15) -> the value for this compound-target pair occurs 15 times in the ChEMBL database
- Activity check annotation: a bioactivity check performed by comparing values from the different sources, providing automated activity validation for additional confidence. Possible annotations: no comment (bioactivity values are within one log unit); check activity data (bioactivity values are not within one log unit); only one data point (only one value was available, so no comparison and no range were calculated); no activity value (no precise numeric activity value was available); no log-value could be calculated (no negative decadic logarithm could be calculated, e.g. because the reported unit was not a compound concentration)
- Ligand names: all unique names contained in the five source databases
- Canonical SMILES columns: molecular structure of the compound from each database
- Structure check: denotes matching or differing compound structures in the different source databases: match (molecule structures are the same between different sources); no match (the structures differ); 1 source (no structure comparison is possible because the molecule comes from only one source database)
- Source: the databases the data come from
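To illustrate how the exported CSV might be used outside KNIME, here is a minimal pandas sketch that filters the consensus data down to confidently annotated, structure-consistent entries. The file name is an assumption, and the column labels follow the descriptions above; adjust both to the actual export.

```python
import pandas as pd

# Assumed file name for the exported consensus CSV.
df = pd.read_csv("consensus_dataset.csv", dtype=str)

# Keep rows whose bioactivity values agree across sources ("no comment")
# and whose structures match between databases ("match").
confident = df[
    (df["Activity check annotation"] == "no comment")
    & (df["Structure check"] == "match")
]

# Example: count confident compound-target pairs per target and activity type.
summary = confident.groupby(["Target", "Activity type"]).size().sort_values(ascending=False)
print(summary.head(10))
```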
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Citation metrics are widely used and misused. We have created a publicly available database of top-cited scientists that provides standardized information on citations, h-index, co-authorship-adjusted hm-index, citations to papers in different authorship positions, and a composite indicator (c-score). Data are shown separately for career-long impact and for single recent year impact. Metrics with and without self-citations and the ratio of citations to citing papers are given, and data on retracted papers (based on the Retraction Watch database) as well as citations to/from retracted papers have been added in the most recent iteration. Scientists are classified into 22 scientific fields and 174 sub-fields according to the standard Science-Metrix classification. Field- and subfield-specific percentiles are also provided for all scientists with at least 5 papers. Career-long data are updated to end-of-2023, and single recent year data pertain to citations received during calendar year 2023. The selection is based on the top 100,000 scientists by c-score (with and without self-citations) or a percentile rank of 2% or above in the sub-field. This version (7) is based on the August 1, 2024 snapshot from Scopus, updated to the end of citation year 2023.

This work uses Scopus data. Calculations were performed using all Scopus author profiles as of August 1, 2024. If an author is not on the list, it is simply because the composite indicator value was not high enough to appear on the list; it does not mean that the author does not do good work. PLEASE ALSO NOTE THAT THE DATABASE HAS BEEN PUBLISHED IN AN ARCHIVAL FORM AND WILL NOT BE CHANGED. The published version reflects Scopus author profiles at the time of calculation. We thus advise authors to ensure that their Scopus profiles are accurate. REQUESTS FOR CORRECTIONS OF THE SCOPUS DATA (INCLUDING CORRECTIONS IN AFFILIATIONS) SHOULD NOT BE SENT TO US. They should be sent directly to Scopus, preferably by use of the Scopus to ORCID feedback wizard (https://orcid.scopusfeedback.com/), so that the correct data can be used in any future annual updates of the citation indicator databases.

The c-score focuses on impact (citations) rather than productivity (number of publications), and it also incorporates information on co-authorship and author positions (single, first, last author). If you have additional questions, see the attached file of FREQUENTLY ASKED QUESTIONS. Finally, we alert users that all citation metrics have limitations and their use should be tempered and judicious. For more reading, we refer to the Leiden Manifesto: https://www.nature.com/articles/520429a
InvaCost is the most up-to-date, comprehensive, standardized and robust data compilation and description of economic cost estimates associated with invasive species worldwide1. InvaCost has been constructed to provide a contemporary and freely available repository of monetary impacts that can be relevant for both research and evidence-based policy making. The ongoing work by the InvaCost consortium2,3,4 leads to constant improvements of the structure and content of the database (see sections below). The list of actual contributors to this data resource now largely exceeds the list of authors listed on this page. All details regarding the previous versions of InvaCost can be found by switching from one version to another using the “version” button above.

IMPORTANT UPDATES:
1. All information, files, outcomes, updates and resources related to the InvaCost project are now available on a new website: http://invacost.fr/
2. The names of the following columns have been changed between the previous and the current version: ‘Raw_cost_estimate_local_currency’ is now named ‘Raw_cost_estimate_original_currency’; ‘Min_Raw_cost_estimate_local_currency’ is now named ‘Min_Raw_cost_estimate_original_currency’; ‘Max_Raw_cost_estimate_local_currency’ is now named ‘Max_Raw_cost_estimate_original_currency’; ‘Cost_estimate_per_year_local_currency’ is now named ‘Cost_estimate_per_year_original_currency’.
3. The Frequently Asked Questions (FAQ) about the database and how to (1) understand it, (2) analyse it and (3) add new data are available at: https://farewe.github.io/invacost_FAQ/. There are over 60 questions (and responses), so yours is probably among them.
4. In line with the continuous development and updates of the database, a ‘living figure’ is now available online to display the evolving relative contributions of different taxonomic groups and regions to the overall cost estimates as the database is updated: https://borisleroy.com/invacost/invacost_livingfigure.html
5. We have added a new column called ‘InvaCost_ID’, which is now used to identify each cost entry in the current and future public versions of the database. As this new column only affects the identification of the cost entries and not their categorisation, it is not considered a change to the structure of the whole database; therefore, the first level of the version numbering remains ‘4’ (see VERSION NUMBERING section).

CONTENT: This page contains four files:
(1) 'InvaCost_database_v4.1', which contains 13,553 cost entries described by 66 descriptive columns;
(2) ‘Descriptors 4.1’, which provides full definitions and details about the descriptive columns used in the database;
(3) ‘Update_Invacost_4.1’, which details all the changes made between the previous and current versions of InvaCost;
(4) ‘InvaCost_template_4.1’ (downloadable file), which provides an easier way of entering data in the spreadsheet, standardizing the terms used in it as much as possible to avoid mistakes and save time at post-refining stages (this file should be used by any external contributor to propose new cost data).

METHODOLOGY: All the methodological details and tools used to build and populate this database are available in Diagne et al. 20201 and Angulo et al. 20215. Note that several papers used different approaches to investigate and analyse the database, and they are all available on our website http://invacost.fr/.
VERSION NUMBERING: InvaCost is regularly updated with contributions from both authors and future users in order to improve it both quantitatively (by new cost information) and qualitatively (if errors are identified). Any reader or user can propose to update InvaCost by filling the ‘InvaCost_updates_template’ file with new entries or corrections and sending it to our email address (updates@invacost.fr). Each updated public version of InvaCost is stored in this figshare repository, with a unique version number. For this purpose, we consider the original version of InvaCost publicly released in September 2020 as ‘InvaCost_1.0’. Further updated versions are named using subsequent numbering (e.g., ‘InvaCost_2.0’, ‘InvaCost_2.1’), and all information on the changes made is provided in a dedicated file called ‘Updates-InvaCost’ (named using the same numbering, e.g., ‘Updates-InvaCost_2.0’, ‘Updates-InvaCost_2.1’). We consider changing the first level of this numbering (e.g., from ‘InvaCost_3.x’ to ‘InvaCost_4.x’) only when the structure of the database changes. Every user wanting the most up-to-date version of the database should refer to the latest released version.

RECOMMENDATIONS: Every user should read the ‘Usage notes’ section of Diagne et al. 20201 before considering the database for analysis purposes or specific interpretation. InvaCost compiles cost data published in the literature, but it does not aim to provide a ready-to-use dataset for specific analyses. While the cost data are described in a homogenized way in InvaCost, the intrinsic disparity, complexity, and heterogeneity of the cost data require specific data processing depending on the user's objectives (see our FAQ). However, we provide the necessary information and caveats about recorded costs, and we now have open-source software designed to query and analyse this database6.

CAUTION: InvaCost is currently being analysed by a network of international collaborators within the framework of the InvaCost project2,3,4 (see https://invacost.fr/en/outcomes/). Interested users may contact the InvaCost team if they wish to learn more about or contribute to these current efforts. Users are in no way prevented from performing their own independent analyses, and collaboration with this network is not required. Nonetheless, users and contributors are encouraged to contact the InvaCost team before using the database, as the information contained may not be directly usable for specific analyses.

RELATED LINKS AND PUBLICATIONS:
1 Diagne, C., Leroy, B., Gozlan, R.E. et al. InvaCost, a public database of the economic costs of biological invasions worldwide. Sci Data 7, 277 (2020). https://doi.org/10.1038/s41597-020-00586-z
2 Diagne C, Catford JA, Essl F, Nuñez MA, Courchamp F (2020) What are the economic costs of biological invasions? A complex topic requiring international and interdisciplinary expertise. NeoBiota 63: 25–37. https://doi.org/10.3897/neobiota.63.55260
3 ResearchGate page: https://www.researchgate.net/project/InvaCost-assessing-the-economic-costs-of-biological-invasions
4 InvaCost workshop: https://www.biodiversitydynamics.fr/invacost-workshop/
5 Angulo E, Diagne C, Ballesteros-Mejia L. et al. (2021) Non-English languages enrich scientific knowledge: the example of economic costs of biological invasions. Science of the Total Environment 775: 144441.
https://doi.org/10.1016/j.scitotenv.2020.144441
6 Leroy B, Kramer AM, Vaissière A-C, Courchamp F and Diagne C (2020) Analysing global economic costs of invasive alien species with the invacost R package. bioRxiv. doi: https://doi.org/10.1101/2020.12.10.419432
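For users working in Python rather than with the invacost R package, a minimal pandas sketch along the following lines can give a first overview of the cost entries. The file name and delimiter are assumptions, and only columns explicitly named above ('Cost_estimate_per_year_original_currency', 'InvaCost_ID') are used.

```python
import pandas as pd

# Assumed file name and delimiter for the public release; adjust to the actual download.
invacost = pd.read_csv("InvaCost_database_v4.1.csv", sep=";", low_memory=False)

print(f"{len(invacost)} cost entries, {invacost.shape[1]} descriptive columns")

# The renamed cost columns (see IMPORTANT UPDATES above) are reported in the
# original currency, so they should not be summed across currencies as-is.
cost_col = "Cost_estimate_per_year_original_currency"
per_entry = pd.to_numeric(invacost[cost_col], errors="coerce")
print(f"{per_entry.notna().sum()} entries with a numeric '{cost_col}' value")

# Unique cost entries are identified by the 'InvaCost_ID' column.
print(invacost["InvaCost_ID"].nunique(), "unique InvaCost_ID values")
```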
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LifeSnaps Dataset Documentation
Ubiquitous self-tracking technologies have penetrated various aspects of our lives, from physical and mental health monitoring to fitness and entertainment. Yet, limited data exist on the association between in-the-wild large-scale physical activity patterns, sleep, stress, and overall health, and behavioral patterns and psychological measurements, due to challenges in collecting and releasing such datasets, such as waning user engagement, privacy considerations, and diversity in data modalities. In this paper, we present the LifeSnaps dataset, a multi-modal, longitudinal, and geographically-distributed dataset containing a plethora of anthropological data, collected unobtrusively over a total course of more than 4 months by n=71 participants, under the European H2020 RAIS project. LifeSnaps contains more than 35 different data types from second to daily granularity, totaling more than 71M rows of data. The participants contributed their data through numerous validated surveys, real-time ecological momentary assessments, and a Fitbit Sense smartwatch, and consented to make these data available openly to empower future research. We envision that releasing this large-scale dataset of multi-modal real-world data will open novel research opportunities and potential applications in the fields of medical digital innovations, data privacy and valorization, mental and physical well-being, psychology and behavioral sciences, machine learning, and human-computer interaction.
The following instructions will get you started with the LifeSnaps dataset and are complementary to the original publication.
Data Import: Reading CSV
For ease of use, we provide CSV files containing Fitbit, SEMA, and survey data at daily and/or hourly granularity. You can read the files via any programming language. For example, in Python, you can read the files into a Pandas DataFrame with the pandas.read_csv() command.
Data Import: Setting up a MongoDB (Recommended)
To take full advantage of the LifeSnaps dataset, we recommend that you use the raw, complete data via importing the LifeSnaps MongoDB database.
To do so, open the terminal/command prompt and run the following command for each collection in the DB. Ensure you have the MongoDB Database Tools installed.
For the Fitbit data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c fitbit
For the SEMA data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c sema
For surveys data, run the following:
mongorestore --host localhost:27017 -d rais_anonymized -c surveys
If you have access control enabled, then you will need to add the --username and --password parameters to the above commands.
Data Availability
The MongoDB database contains three collections, fitbit, sema, and surveys, containing the Fitbit, SEMA3, and survey data, respectively. Similarly, the CSV files contain information related to these collections. Each document in any collection follows the format shown below:
{
  _id: <MongoDB primary key>,
  id (or user_id): <user-specific id>,
  type: <data type>,
  data: <embedded object>
}
Each document consists of four fields: _id, id (also found as user_id in the sema and surveys collections), type, and data. The _id field is the MongoDB-defined primary key and can be ignored. The id field refers to a user-specific ID used to uniquely identify each user across all collections. The type field refers to the specific data type within the collection, e.g., steps, heart rate, calories, etc. The data field contains the actual information about the document, e.g., the step count for a specific timestamp for the steps type, in the form of an embedded object. The contents of the data object are type-dependent, meaning that the fields within the data object differ between data types. All times are stored in local time, and user IDs are common across the different collections. For more information on the available data types, see the related publication.
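Once the collections are restored, the documents can be queried directly, for example with pymongo in Python. The sketch below is a minimal example: the connection string matches the mongorestore commands above, while the "steps" value used in the type filter is just one of the data types mentioned in the description and may need adjusting.

```python
from pymongo import MongoClient

# Connect to the locally restored database (same host/port as the mongorestore commands).
client = MongoClient("mongodb://localhost:27017/")
db = client["rais_anonymized"]

# Fetch a few Fitbit documents of one data type, e.g. "steps" (assumed type value).
for doc in db["fitbit"].find({"type": "steps"}).limit(3):
    # Each document carries id, type, and a type-dependent embedded "data" object.
    print(doc["id"], doc["type"], doc["data"])
```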
Surveys Encoding
BREQ2
Why do you engage in exercise?
engage[SQ001]: I exercise because other people say I should
engage[SQ002]: I feel guilty when I don’t exercise
engage[SQ003]: I value the benefits of exercise
engage[SQ004]: I exercise because it’s fun
engage[SQ005]: I don’t see why I should have to exercise
engage[SQ006]: I take part in exercise because my friends/family/partner say I should
engage[SQ007]: I feel ashamed when I miss an exercise session
engage[SQ008]: It’s important to me to exercise regularly
engage[SQ009]: I can’t see why I should bother exercising
engage[SQ010]: I enjoy my exercise sessions
engage[SQ011]: I exercise because others will not be pleased with me if I don’t
engage[SQ012]: I don’t see the point in exercising
engage[SQ013]: I feel like a failure when I haven’t exercised in a while
engage[SQ014]: I think it is important to make the effort to exercise regularly
engage[SQ015]: I find exercise a pleasurable activity
engage[SQ016]: I feel under pressure from my friends/family to exercise
engage[SQ017]: I get restless if I don’t exercise regularly
engage[SQ018]: I get pleasure and satisfaction from participating in exercise
engage[SQ019]: I think exercising is a waste of time
PANAS
Indicate the extent you have felt this way over the past week
P1[SQ001]: Interested
P1[SQ002]: Distressed
P1[SQ003]: Excited
P1[SQ004]: Upset
P1[SQ005]: Strong
P1[SQ006]: Guilty
P1[SQ007]: Scared
P1[SQ008]: Hostile
P1[SQ009]: Enthusiastic
P1[SQ010]: Proud
P1[SQ011]: Irritable
P1[SQ012]: Alert
P1[SQ013]: Ashamed
P1[SQ014]: Inspired
P1[SQ015]: Nervous
P1[SQ016]: Determined
P1[SQ017]: Attentive
P1[SQ018]: Jittery
P1[SQ019]: Active
P1[SQ020]: Afraid
Personality
How Accurately Can You Describe Yourself?
ipip[SQ001]: Am the life of the party.
ipip[SQ002]: Feel little concern for others.
ipip[SQ003]: Am always prepared.
ipip[SQ004]: Get stressed out easily.
ipip[SQ005]: Have a rich vocabulary.
ipip[SQ006]: Don't talk a lot.
ipip[SQ007]: Am interested in people.
ipip[SQ008]: Leave my belongings around.
ipip[SQ009]: Am relaxed most of the time.
ipip[SQ010]: Have difficulty understanding abstract ideas.
ipip[SQ011]: Feel comfortable around people.
ipip[SQ012]: Insult people.
ipip[SQ013]: Pay attention to details.
ipip[SQ014]: Worry about things.
ipip[SQ015]: Have a vivid imagination.
ipip[SQ016]: Keep in the background.
ipip[SQ017]: Sympathize with others' feelings.
ipip[SQ018]: Make a mess of things.
ipip[SQ019]: Seldom feel blue.
ipip[SQ020]: Am not interested in abstract ideas.
ipip[SQ021]: Start conversations.
ipip[SQ022]: Am not interested in other people's problems.
ipip[SQ023]: Get chores done right away.
ipip[SQ024]: Am easily disturbed.
ipip[SQ025]: Have excellent ideas.
ipip[SQ026]: Have little to say.
ipip[SQ027]: Have a soft heart.
ipip[SQ028]: Often forget to put things back in their proper place.
ipip[SQ029]: Get upset easily.
ipip[SQ030]: Do not have a good imagination.
ipip[SQ031]: Talk to a lot of different people at parties.
ipip[SQ032]: Am not really interested in others.
ipip[SQ033]: Like order.
ipip[SQ034]: Change my mood a lot.
ipip[SQ035]: Am quick to understand things.
ipip[SQ036]: Don't like to draw attention to myself.
ipip[SQ037]: Take time out for others.
ipip[SQ038]: Shirk my duties.
ipip[SQ039]: Have frequent mood swings.
ipip[SQ040]: Use difficult words.
ipip[SQ041]: Don't mind being the centre of attention.
ipip[SQ042]: Feel others' emotions.
ipip[SQ043]: Follow a schedule.
ipip[SQ044]: Get irritated easily.
ipip[SQ045]: Spend time reflecting on things.
ipip[SQ046]: Am quiet around strangers.
ipip[SQ047]: Make people feel at ease.
ipip[SQ048]: Am exacting in my work.
ipip[SQ049]: Often feel blue.
ipip[SQ050]: Am full of ideas.
STAI
Indicate how you feel right now
STAI[SQ001]: I feel calm
STAI[SQ002]: I feel secure
STAI[SQ003]: I am tense
STAI[SQ004]: I feel strained
STAI[SQ005]: I feel at ease
STAI[SQ006]: I feel upset
STAI[SQ007]: I am presently worrying over possible misfortunes
STAI[SQ008]: I feel satisfied
STAI[SQ009]: I feel frightened
STAI[SQ010]: I feel comfortable
STAI[SQ011]: I feel self-confident
STAI[SQ012]: I feel nervous
STAI[SQ013]: I am jittery
STAI[SQ014]: I feel indecisive
STAI[SQ015]: I am relaxed
STAI[SQ016]: I feel content
STAI[SQ017]: I am worried
STAI[SQ018]: I feel confused
STAI[SQ019]: I feel steady
STAI[SQ020]: I feel pleasant
TTM
Do you engage in regular physical activity according to the definition above? How frequently did each event or experience occur in the past month?
processes[SQ002]: I read articles to learn more about physical
This archived Paleoclimatology Study is available from the NOAA National Centers for Environmental Information (NCEI), under the World Data Service (WDS) for Paleoclimatology. The associated NCEI study type is Paleoceanography. The data include instrumental parameters with a geographic location of Global. The time period coverage is from -22 to -66 in calendar years before present (BP). See the metadata for parameter and study location details. Please cite this study when using the data.
UPDATE 1/7/2025: On June 28th, 2023, the San Francisco Police Department (SFPD) changed its Stops Data Collection System (SDCS). As a result of this change, record identifiers have changed from the Department of Justice (DOJ) identifier to an internal record numbering system (referred to as "LEA Record ID"). The data that SFPD uploads to the DOJ system will contain the internal record number, which can be used for joins with the data available on DataSF.

A. SUMMARY
The San Francisco Police Department (SFPD) Stop Data was designed to capture information to comply with the Racial and Identity Profiling Act (RIPA), or California Assembly Bill (AB) 953. SFPD officers collect specific information on each stop, including elements of the stop, circumstances, and the perceived identity characteristics of the individual(s) stopped. The information obtained by officers is reported to the California Department of Justice. This dataset includes data on stops starting on July 1st, 2018, which is when the data collection program went into effect. Read the detailed overview for this dataset here.

B. HOW THE DATASET IS CREATED
By the end of each shift, officers enter all stop data into the Stop Data Collection System, which is automatically submitted to the California Department of Justice (CA DOJ). Once a quarter, the Department receives a stops data file from CA DOJ. The SFPD conducts several transformations of this data to ensure privacy, accuracy, and compliance with State law and regulation. For increased usability, text descriptions have also been added for several data fields which include numeric codes (including traffic, suspicion, citation, and custodial arrest offense codes, and actions taken as a result of a stop). See the data dictionaries below for explanations of all coded data fields. Read more about the data collection and transformation, including geocoding and PII cleaning processes, in the detailed overview of this dataset.

C. UPDATE PROCESS
Information is updated on a quarterly basis.

D. HOW TO USE THIS DATASET
This dataset includes information about police stops that occurred, including some details about the person(s) stopped and what happened during the stop. Each row is a person stopped, with a record identifier for the stop and a unique identifier for the person. A single stop may involve multiple people and may produce more than one associated unique identifier for the same record identifier. A certain percentage of stops have stop information that can't be geocoded. This may be due to errors in data input at the officer level (typos in entry or providing an address that doesn't exist). More often, this is due to officers providing a level of detail that isn't codable to a geographic coordinate, most often at the Airport (i.e., Terminal 3, door 22). In these cases, the location of the stops is coded as unknown.

E. DATA DICTIONARIES
CJIS Offense Codes data look-up table
Look-up table for other coded data fields
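Since the post-June-2023 records carry the internal "LEA Record ID" in both the DOJ submission and the DataSF export, the two sources can be joined on that identifier. The sketch below is a hypothetical illustration: the file names and the exact column label for the record identifier are assumptions and must be matched to the actual downloads.

```python
import pandas as pd

# Hypothetical file names for the two exports.
datasf_stops = pd.read_csv("sfpd_stops_datasf.csv")
doj_stops = pd.read_csv("sfpd_stops_doj.csv")

# Join on the internal record identifier shared by both systems
# (column name assumed to be "LEA Record ID"; adjust to the actual schema).
merged = datasf_stops.merge(doj_stops, on="LEA Record ID", how="inner",
                            suffixes=("_datasf", "_doj"))
print(f"{len(merged)} stop records matched across the two sources")
```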
A. SUMMARY
This dataset contains COVID-19 positive confirmed cases aggregated by several different geographic areas and by day. COVID-19 cases are mapped to the residence of the individual and shown on the date the positive test was collected. In addition, 2016-2020 American Community Survey (ACS) population estimates are included to calculate the cumulative rate per 10,000 residents. The dataset covers cases going back to 3/2/2020, when testing began. Data may not be immediately available for recently reported cases and will change as information becomes available. Data updated daily.

Geographic areas summarized are:
1. Analysis Neighborhoods
2. Census Tracts
3. Census Zip Code Tabulation Areas

B. HOW THE DATASET IS CREATED
Addresses from the COVID-19 case data are geocoded by the San Francisco Department of Public Health (SFDPH). Those addresses are spatially joined to the geographic areas. Counts are generated based on the number of address points that match each geographic area for a given date. The 2016-2020 American Community Survey (ACS) population estimates provided by the Census are used to create a cumulative rate, equal to ([cumulative count up to that date] / [acs_population]) * 10000, representing the number of total cases per 10,000 residents (as of the specified date). COVID-19 case data undergo quality assurance and other data verification processes and are continually updated to maximize completeness and accuracy of information. This means data may change for previous days as information is updated.

C. UPDATE PROCESS
Geographic analysis is scripted by SFDPH staff and synced to this dataset daily at 05:00 Pacific Time.

D. HOW TO USE THIS DATASET
San Francisco population estimates for geographic regions can be found in a view based on the San Francisco Population and Demographic Census dataset. These population estimates are from the 2016-2020 5-year American Community Survey (ACS). This dataset can be used to track the spread of COVID-19 throughout the city, in a variety of geographic areas. Note that the new cases column in the data represents the number of new cases confirmed in a certain area on the specified day, while the cumulative cases column is the cumulative total of cases in a certain area as of the specified date.

Privacy rules in effect. To protect privacy, certain rules are in effect:
1. Any area with a cumulative case count less than 10 is dropped for all days the cumulative count was less than 10. These will be null values.
2. Once an area has a cumulative case count of 10 or greater, that area will have a new row of case data every day following.
3. Cases are dropped altogether for areas where acs_population < 1000.
4. Deaths data are not included in this dataset for privacy reasons. The low COVID-19 death rate in San Francisco, along with other publicly available information on deaths, means that deaths data by geography and day are too granular and potentially risky. Read more in our privacy guidelines.

Rate suppression in effect where counts are lower than 20. Rates are not calculated unless the cumulative case count is greater than or equal to 20. Rates are generally unstable at small numbers, so we avoid calculating them directly. We advise you to apply the same approach, as this is best practice in epidemiology.

A note on Census ZIP Code Tabulation Areas (ZCTAs): ZIP Code Tabulation Areas are spec
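As a concrete illustration of the cumulative-rate definition above, the following sketch computes the cumulative case count and the rate per 10,000 residents for one geographic area, applying the suppression rules described (no rows below 10 cumulative cases, no rate below 20). The file name, the column names, and the "Mission" area value are assumptions and should be checked against the actual dataset schema.

```python
import pandas as pd

# Assumed file and column names; adjust to the published schema.
cases = pd.read_csv("covid_cases_by_geography.csv", parse_dates=["specimen_collection_date"])
area = cases[cases["area_id"] == "Mission"].sort_values("specimen_collection_date").copy()

# Cumulative count up to each date, then rate per 10,000 residents.
area["cumulative_cases"] = area["new_cases"].cumsum().astype(float)
area["rate_per_10k"] = area["cumulative_cases"] / area["acs_population"] * 10000

# Mirror the privacy rules described above: suppress rates below 20 cumulative
# cases and blank out rows entirely while the cumulative count is below 10.
area.loc[area["cumulative_cases"] < 20, "rate_per_10k"] = float("nan")
area.loc[area["cumulative_cases"] < 10, ["cumulative_cases", "rate_per_10k"]] = float("nan")

print(area[["specimen_collection_date", "cumulative_cases", "rate_per_10k"]].tail())
```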
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data for Figure 9.13 from Chapter 9 of the Working Group I (WGI) Contribution to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6).
Figure 9.13 shows Arctic sea ice historical records and CMIP6 projections.
How to cite this dataset
When citing this dataset, please include both the data citation below (under 'Citable as') and the following citation for the report component from which the figure originates: Fox-Kemper, B., H.T. Hewitt, C. Xiao, G. Aðalgeirsdóttir, S.S. Drijfhout, T.L. Edwards, N.R. Golledge, M. Hemer, R.E. Kopp, G. Krinner, A. Mix, D. Notz, S. Nowicki, I.S. Nurhati, L. Ruiz, J.-B. Sallée, A.B.A. Slangen, and Y. Yu, 2021: Ocean, Cryosphere and Sea Level Change. In Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Masson-Delmotte, V., P. Zhai, A. Pirani, S.L. Connors, C. Péan, S. Berger, N. Caud, Y. Chen, L. Goldfarb, M.I. Gomis, M. Huang, K. Leitzell, E. Lonnoy, J.B.R. Matthews, T.K. Maycock, T. Waterfield, O. Yelekçi, R. Yu, and B. Zhou (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA, pp. 1211–1362, doi:10.1017/9781009157896.011.
Figure subpanels
The figure has 2 subpanels, with data provided for both panels.
List of data provided
This dataset contains: - (Left panel) Absolute anomaly of monthly-mean Arctic sea ice area during the period 1979 to 2019 relative to the average monthly-mean Arctic sea ice area during the period 1979 to 2008. - (Right panel) Sea ice concentration in the Arctic for March and September, which usually are the months of maximum and minimum sea ice area, respectively.
First column: Satellite-retrieved mean sea ice concentration during the decade 1979–1988. Second column: Satellite-retrieved mean sea ice concentration during the decade 2010-2019. Third column: Absolute change in sea ice concentration between these two decades, with grid lines indicating non-significant differences. Fourth column: Number of available CMIP6 models that simulate a mean sea ice concentration above 15 % for the decade 2045–2054.
The average observational record of sea ice area is derived from the UHH sea ice area product (Doerr et al., 2021), based on the average sea ice concentration of OSISAF/CCI (OSI-450 for 1979–2015, OSI-430b for 2016–2019) (Lavergne et al., 2019), NASA Team (version 1, 1979–2019) (Cavalieri et al., 1996) and Bootstrap (version 3, 1979–2019) (Comiso, 2017) that is also used for the figure panels showing observed sea ice concentration.
Further details on data sources and processing are available in the chapter data table (Table 9.SM.9)
Data provided in relation to figure
Data provided in relation to Figure 9.13
Datafile 'mapplot_data.npz' included in the 'Plotted Data' folder of the dedicated GitHub repository is not archived here but on Zenodo at the link provided in the Related Documents section of this catalogue record.
CMIP6 is the sixth phase of the Coupled Model Intercomparison Project. NSIDC is the National Snow and Ice Data Center. UHH is the University of Hamburg (Universität Hamburg).
Notes on reproducing the figure from the provided data
Both panels were plotted using standard matplotlib software - code is available via the link in the documentation.
Sources of additional information
The following weblinks are provided in the Related Documents section of this catalogue record: - Link to the figure on the IPCC AR6 website - Link to the report component containing the figure (Chapter 9) - Link to the Supplementary Material for Chapter 9, which contains details on the input data used in Table 9.SM.9 - Link to the data and code used to produce this figure and others in Chapter 9, archived on Zenodo. - Link to the output data and scripts for this figure, contained in a dedicated GitHub repository.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data sets accompanying the paper "The FAIR Assessment Conundrum: Reflections on Tools and Metrics", an analysis of a comprehensive set of FAIR assessment tools and the metrics used by these tools for the assessment.
The data set "metrics.csv" consists of the metrics collected from several sources linked to the analysed FAIR assessments tools. It is structured into 11 columns: (i) tool_id, (ii) tool_name, (iii) metric_discarded, (iv) metric_fairness_scope_declared, (v) metric_fairness_scope_observed, (vi) metric_id, (vii) metric_text, (viii) metric_technology, (ix) metric_approach, (x) last_accessed_date, and (xi) provenance.
The columns tool_id and tool_name are used for the identifier we assigned to each tool analysed and the full name of the tool respectively.
The metric_discarded column refers to the selection we operated on the collected metrics, since we excluded the metrics created for testing purposes or written in a language different from English. The possible values are boolean. We assigned TRUE if the metric was discarded.
The columns metric_fairness_scope_declared and metric_fairness_scope_observed are used for indicating the declared intent of the metrics, with respect to the FAIR principle assessed, and the one we observed respectively. Possible values are: (a) a letter of the FAIR acronym (for the metrics without a link declared to a specific FAIR principle), (b) one or more identifiers of the FAIR principles (F1, F2…), (c) n/a, if no FAIR references were declared, or (d) none, if no FAIR references were observed.
The metric_id and metric_text columns are used for the identifiers of the metrics and the textual and human-oriented content of the metrics respectively.
The column metric_technology is used for enumerating the technologies (a term used in its widest acceptation) mentioned or used by the metrics for the specific assessment purpose. Such technologies include very diverse typologies ranging from (meta)data formats to standards, semantic technologies, protocols, and services. For tools implementing automated assessments, the technologies listed take into consideration also the available code and documentation, not just the metric text.
The column metric_approach is used for identifying the type of implementation observed in the assessments. The identification of the implementation types followed a bottom-to-top approach applied to the metrics organised by the metric_fairness_scope_declared values. Consequently, while the labels used for creating the implementation type strings are the same, their combination and specialisation varies based on the characteristics of the actual set of metrics analysed. The main labels used are: (a) 3rd party service-based, (b) documentation-centred, (c) format-centred, (d) generic, (e) identifier-centred, (f) policy-centred, (g) protocol-centred, (h) metadata element-centred, (i) metadata schema-centred, (j) metadata value-centred, (k) service-centred, and (l) na.
The columns provenance and last_accessed_date are used for the main source of information about each metric (at least with regard to the text) and the date we last accessed it respectively.
The data set "classified_technologies.csv" consists of the technologies mentioned or used by the metrics for the specific assessment purpose. It is structured into 3 columns: (i) technology, (ii) class, and (iii) discarded.
The column technology is used for the names of the different technologies mentioned or used by the metrics.
The column class is used for specifying the type of technology used. Possible values are: (a) application programming interface, (b) format, (c) identifier, (d) library, (e) licence, (f) protocol, (g) query language, (h) registry, (i) repository, (j) search engine, (k) semantic artefact, and (l) service.
The discarded column refers to the exclusion of the value 'linked data' from the accepted technologies since it is too generic. The possible values are boolean. We assigned TRUE if the technology was discarded.
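A small pandas sketch like the one below can summarize the two files, for example counting non-discarded metrics per tool and per declared FAIR scope. It assumes plain comma-separated CSVs and uses only the column names documented above.

```python
import pandas as pd

metrics = pd.read_csv("metrics.csv")
technologies = pd.read_csv("classified_technologies.csv")

# Keep only metrics that were not discarded; the boolean may be read as
# True/False or as the strings "TRUE"/"FALSE", so compare on text.
kept = metrics[metrics["metric_discarded"].astype(str).str.upper() != "TRUE"]

# Non-discarded metrics per tool and per declared FAIR scope.
print(kept.groupby("tool_name").size().sort_values(ascending=False))
print(kept["metric_fairness_scope_declared"].value_counts())

# Technology classes among the accepted (non-discarded) technologies.
accepted = technologies[technologies["discarded"].astype(str).str.upper() != "TRUE"]
print(accepted["class"].value_counts())
```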
https://creativecommons.org/publicdomain/zero/1.0/
By FiveThirtyEight [source]
This dataset contains a comprehensive collection of data utilized by FiveThirtyEight in their articles, graphics, and interactive content pertaining to the predictions for NFL games. FiveThirtyEight uses an ELO algorithm to predict the potential winner of each game. Elo ratings are a measure of team strength based on head-to-head results, margin of victory, and quality of opponent. The ratings are always relative — they only have meaning in comparison to other teams’ ratings.
The database is compiled with a wide array of key metrics that add depth to any analysis or evaluation related to NFL game prediction. Available items range from team names and season numbers to probability forecasts that indicate which team won or lost a given game on any given date.
The columns within this dataset include:
- Date: the day on which the game was played.
- Season: the season during which the game took place.
- Neutral: indicates whether the game was played at a neutral venue.
- Playoff: indicates whether it is a playoff game.
- Team1 & Team2: names of both participating teams.
- Elo1_pre & Elo2_pre: each team's Elo rating before the game.
- Elo_prob1 & Elo_prob2: winning probabilities for either team.
- Result1: reveals who won.
This data can be used by sports analysts and enthusiasts alike to make predictions about future games and to uncover trends hidden in past NFL results. The data amassed here serve both individuals who wish to indulge in playful soothsaying based on solid statistics and researchers willing to perform substantial studies spanning decades' worth of American football history.
Do keep in mind as you navigate this extensive repository that all its contents come under the Creative Commons Attribution 4.0 International License, while the source code adheres to the MIT License; rights are retained while productive reuse for meaningful, new outputs is encouraged.
If you find this data useful in your work or personal projects, we would love to hear about your experiences and how our data repository has contributed to them.
This dataset is beneficial for data analysts, data scientists, sports enthusiasts, or anyone who is interested in historical and predictive analysis of NFL games.
Here are some instructions on how to use this dataset:
Understanding the dataset: Before using this dataset, you must understand what each column represents. The information includes game details such as team names (team1 and team2), their corresponding Elo ratings before (elo1_before and elo2_before) and after (elo1_after and elo2_after) the game, the result of individual games (team1_win_prob), etc.
Predictive Analysis: Develop a machine learning model and use features such as the Elo scores before the match to predict match outcomes. It'll be interesting to see how accurate a predictive model can be! For instance, a simple regression model can be implemented on this kind of problem statement (see the sketch after these instructions).
Historical Analysis: Analyze patterns from past results by producing descriptive statistics or creating visualizations with libraries such as Matplotlib or Seaborn in Python. Examples include analyzing trends over time, such as changes in ratings after matches for teams that have faced each other multiple times.
Testing Hypotheses: If you have any hypotheses about NFL games — perhaps that home-field advantage increasingly matters, or that certain teams outperform their predicted winning probabilities — you can test them using statistical methods such as A/B testing or regression analysis with the pandas library's statistical methods.
Free-text analyses are also possible by exploring the rich set of columns described in FiveThirtyEight's documentation (e.g., result, touchdowns).
Remember to always check your data—clean it up if necessary—and approach it from different angles! An initial hypothesis may not hold true under scrutiny, but don't be discouraged, since all findings are valuable when conducting rigorous research.
In conclusion - You could carry out various types of quantitative analysis based on just the Elo ratings and game results, so this dataset holds a wealth of opportunities for predictive modelling, statistical testing and storytelling through data. Happy Exploring!
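As a minimal illustration of the predictive-analysis idea above, the sketch below fits a logistic regression (a close relative of the regression baseline mentioned) that predicts the game result from the pre-game Elo difference. The file name and the exact column names (elo1_pre, elo2_pre, result1) vary between FiveThirtyEight releases, so treat them as assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed file and column names; adjust to the actual FiveThirtyEight export.
games = pd.read_csv("nfl_elo.csv").dropna(subset=["elo1_pre", "elo2_pre", "result1"])

# Single feature: pre-game Elo difference between team1 and team2.
X = (games["elo1_pre"] - games["elo2_pre"]).to_frame("elo_diff")
y = games["result1"].astype(int)  # 1 if team1 won, 0 otherwise (assumed encoding)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```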
- Predicting Future Games: This dataset c...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Work

Meaningful governance of any system requires the system to be assessed and monitored effectively. In the domain of Artificial Intelligence (AI), global efforts have established a set of ethical principles like fairness, transparency, and privacy upon which AI governance expectations are being built. The computing research community has proposed numerous means of measuring an AI system's normative qualities along these principles. Current reporting of these measures is principle-specific, limited in scope, or otherwise dispersed across publication platforms, hindering the domain's ability to critique its practices. To address this, we introduce the Responsible AI Measures Dataset, consolidating 12,067 data points across 791 evaluation measures covering 11 ethical principles. It is extracted from a corpus of computing literature (n=257) published between 2011 and 2023. The dataset includes detailed descriptions of each measure, AI system characteristics, and publication metadata. An accompanying, interactive Sunburst visualization tool supports usability and interpretation. The Responsible AI Measures Dataset enables practitioners to explore existing assessment approaches and critically analyze how the computing domain measures normative concepts.

Using the Interactive Visualization

This dataset has a corresponding visualization that can be dynamically interacted with. It can be found as the "Sunburst_Visualization_Link.md" file in this repository. Note there are two versions: Version 1.0 was released in May 2025, and Version 2.0 was released in August 2025. Demo link: https://bit.ly/RAI_Measures_Demo

To use the visualization:
- Select a principle, followed by the component of the ML system, and the sociotechnical harm that you are interested in exploring. Note that hovering will display three pieces of metadata: the tier you are currently at in the visual, the parent (e.g., the tier prior), and the principle you are currently hovering above.
- Click on a measure to see the corresponding measurement process.
- To learn more about the measure, its formulaic variables (if quantitative), and the relative context of use, click the link on each measurement process to access the authors' publication.
- Some measurement processes include paper-specific references, terms, or formulas that may require further context to understand. Please use the paper title and lead author name(s) to further investigate the measure(s).

Current Version

Currently, this work is on Version 1.0 of the publicly shared dataset and corresponding visualization. The dataset is in Microsoft Excel (.xlsx) format. Note that it is not recommended to open the file in .csv format due to the increased likelihood of corrupted characters and broken file formatting.
Please read the sections below for more information on the dataset.

Version 1.0 (July 2025)
- Target Output (Columns A and B, in blue): the resulting measures collected in this dataset. Columns: Measure; Measurement Process.
- Entry Points (Columns C and D, in orange): the primary features for narrowing down potential measures for an algorithmic system. Columns: Principle; Part of the ML System.
- Connections to Harm (Columns E and F, in pink): the sociotechnical harms which the measure aims to make aware of and/or mitigate. Columns: Primary Harm; Secondary Harm.
- Measurement Properties (Columns G to I, in green): the standard(s) used in each measure's evaluation. Columns: Criterion Name; Criterion Description; Type of Assessment.
- Algorithmic System Characteristics (Columns J to M, in purple): additional features that a user can consider when narrowing down measures to use. Columns: Application Area; Purpose of ML System; Type of Data; Algorithm Type.
- Publication Metadata (Columns N to P, in yellow): documentation of each source that was extracted to collect each feature and measure. Columns: Title; Publication Year; DOI Link.

Using the Interactive Visualization

This dataset has a corresponding visualization that can be dynamically interacted with. It can be found as the "Version_1.0_Sunburst_Visualization_Link.md" file in this repository.
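Since the release is distributed as an Excel workbook, a small pandas sketch like the one below can load it for programmatic exploration. The file and sheet names are assumptions (they are not stated above), and reading .xlsx files requires the openpyxl package.

```python
import pandas as pd

# Assumed file name; pandas uses openpyxl under the hood for .xlsx files.
measures = pd.read_excel("Responsible_AI_Measures_Dataset.xlsx", sheet_name=0)

# Count evaluation measures per ethical principle (column name taken from the
# "Entry Points" description above; adjust if the header differs).
print(measures["Principle"].value_counts())

# Inspect one measure and its measurement process.
print(measures[["Measure", "Measurement Process"]].head(3))
```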
ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
Redirect Notice: The website https://transbase.sfgov.org/ is no longer in operation. Visitors to TransBASE will be redirected to this page, where they can view, visualize, and download Traffic Crash data.

A. SUMMARY
This table contains all crashes resulting in an injury in the City of San Francisco. Fatality year-to-date crash data is obtained from the Office of the Chief Medical Examiner (OME) death records, and only includes those cases that meet the San Francisco Vision Zero Fatality Protocol maintained by the San Francisco Department of Public Health (SFDPH), San Francisco Police Department (SFPD), and San Francisco Municipal Transportation Agency (SFMTA). Injury crash data is obtained from SFPD's Interim Collision System for 2018 through the current year to date, the Crossroads Software Traffic Collision Database (CR) for the years 2013-2017, and the Statewide Integrated Traffic Records System (SWITRS) maintained by the California Highway Patrol for all years prior to 2013. Only crashes with valid geographic information are mapped. All geocodable crash data is represented on the simplified San Francisco street centerline model maintained by the Department of Public Works (SFDPW). Collision injury data is queried and aggregated on a quarterly basis. Crashes occurring at complex intersections with multiple roadways are mapped onto a single point, and injury and fatality crashes occurring on highways are excluded.

The crash, party, and victim tables have a relational structure. The traffic crashes table contains information on each crash, one record per crash. The party table contains information on all parties involved in the crashes, one record per party. Parties are individuals involved in a traffic crash, including drivers, pedestrians, bicyclists, and parked vehicles. The victim table contains information about each party injured in the collision, including any passengers; injury severity is included in the victim table. For example, a crash occurs (1 record in the crash table) that involves a driver party and a pedestrian party (2 records in the party table). Only the pedestrian is injured and thus is the only victim (1 record in the victim table). To learn more about the traffic injury datasets, see the TIMS documentation.

B. HOW THE DATASET IS CREATED
Traffic crash injury data is collected from the California Highway Patrol 555 Crash Report as submitted by the police officer within 30 days after the crash occurred. All fields that match the SWITRS data schema are programmatically extracted, de-identified, geocoded, and loaded into TransBASE. See Section D below for details regarding TransBASE.

C. UPDATE PROCESS
After review by SFPD and SFDPH staff, the data is made publicly available approximately a month after the end of the previous quarter (May for Q1, August for Q2, November for Q3, and February for Q4).

D. HOW TO USE THIS DATASET
This data is being provided as public information as defined under San Francisco and California public records laws. SFDPH, SFMTA, and SFPD cannot limit or restrict the use of this data or its interpretation by other parties in any way. Where the data is communicated, distributed, reproduced, mapped, or used in any other way, the user should acknowledge TransBASE.sfgov.org as the source of the data, provide a reference to the original data source where applicable, include the date the data was pulled, and note any caveats specified in the associated metadata documentation.
However, users should not attribute their analysis or interpretation of this data to the City of San Francisco. While the data has been collected and/or produced for the use of the City of San Francisco, the City cannot guarantee its accuracy or completeness. Accordingly, the City of San Francisco, including SFDPH, SFMTA, and SFPD, makes no representation as to the accuracy of the information or its suitability for any purpose and disclaims any liability for omissions or errors that may be contained therein. As all data is associated with methodological assumptions and limitations, the City recommends that users review the methodological documentation associated with the data prior to its analysis, interpretation, or communication.

This dataset can also be queried on the TransBASE Dashboard. TransBASE is a geospatially enabled database maintained by SFDPH that currently includes over 200 spatially referenced variables from multiple agencies and across a range of geographic scales, including infrastructure, transportation, zoning, sociodemographic, and collision data, all linked to an intersection or street segment. TransBASE facilitates a data-driven approach to understanding and addressing transportation-related health issues, informed by a large and growing evidence base regarding the importance of transportation system design and land use decisions for health. TransBASE's purpose is to inform public and private efforts to improve transportation system safety, sustainability, community health, and equity in San Francisco.

E. RELATED DATASETS
- Traffic Crashes Resulting in Injury: Parties Involved
- Traffic Crashes Resulting in Injury: Victims Involved
- TransBASE Dashboard
- iSWITRS
- TIMS

Data pushed to ArcGIS Online on December 2, 2025 at 4:11 AM by SFGIS.
Data from: https://data.sfgov.org/d/ubvf-ztfx

Description of dataset columns:
unique_id
unique table row identifier
cnn_intrsctn_fkey
nearest intersection centerline node key
cnn_sgmt_fkey
nearest street centerline segment key (empty if crash occurred at intersection)
case_id_pkey
unique crash report number
tb_latitude
latitude of crash (WGS 84)
tb_longitude
longitude of crash (WGS 84)
geocode_source
geocode source
geocode_location
geocode location
collision_datetime
the date and time when the crash occurred
collision_date
the date when the crash occurred
collision_time
the time when the crash occurred (24 hour time)
accident_year
the year when the crash occurred
month
month crash occurred
day_of_week
day of the week crash occurred
time_cat
generic time categories
juris
jurisdiction
officer_id
officer ID
reporting_district
SFPD reporting district
beat_number
SFPD beat number
primary_rd
the road the crash occurred on
secondary_rd
a secondary reference road that DISTANCE and DIRECT are measured from
distance
offset distance from secondary road
direction
direction of offset distance
weather_1
the weather condition at the time of the crash
weather_2
the weather condition at the time of the crash, if a second description is necessary
collision_severity
the injury level severity of the crash (highest level of injury in crash)
type_of_collision
type of crash
mviw
motor vehicle involved with
ped_action
pedestrian action involved
road_surface
road surface
road_cond_1
road condition
road_cond_2
road condition, if a second description is necessary
lighting
lighting at time of crash
control_device
control device status
intersection
indicates whether the crash occurred in an intersection
vz_pcf_code
California vehicle code primary collision factor violated
vz_pcf_group
groupings of similar vehicle codes violated
vz_pcf_description
description of vehicle code violated
vz_pcf_link
link to California vehicle code section
number_killed
counts victims in the crash with degree of injury of fatal
number_injured
counts victims in the crash with degree of injury of severe, visible, or complaint of pain
street_view
link to Google Streetview
dph_col_grp
generic crash groupings based on parties involved
dph_col_grp_description
description of crash groupings
party_at_fault
party number indicated as being at fault
party1_type
party 1 vehicle type
party1_dir_of_travel
party 1 direction of travel
party1_move_pre_acc
party 1 movement preceding crash
party2_type
party 2 vehicle type (empty if no party 2)
party2_dir_of_travel
party 2 direction of travel (empty if no party 2)
party2_move_pre_acc
party 2 movement preceding crash (empty if no party 2)
point
geometry type of crash location
data_as_of
date data added to the source system
data_updated_at
date data last updated the source system
data_loaded_at
date data last loaded here (in the open data portal)
analysis_neighborhood
supervisor_district
police_district
Current Police Districts
This column was automatically created in order to record in what polygon from the dataset 'Current Police Districts' (qgnn-b9vv) the point in column 'point' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
Current Supervisor Districts
This column was automatically created in order to record in what polygon from the dataset 'Current Supervisor Districts' (26cr-cadq) the point in column 'point' is located. This enables the creation of region maps (choropleths) in the visualization canvas and data lens.
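For convenience, here is a small sketch of pulling a sample of this table programmatically, assuming the standard Socrata SODA JSON endpoint for the dataset ID shown in the source link above (ubvf-ztfx); field names follow the column listing.

```python
import pandas as pd
import requests

# Fetch a sample of crash records; the endpoint pattern is the usual Socrata
# SODA layout for the dataset id shown above. An app token and paging would be
# needed for full downloads.
url = "https://data.sfgov.org/resource/ubvf-ztfx.json"
rows = requests.get(url, params={"$limit": 5000}, timeout=60).json()
crashes = pd.DataFrame(rows)

# Column names follow the listing above.
crashes["collision_datetime"] = pd.to_datetime(crashes["collision_datetime"])
by_severity = crashes.groupby("collision_severity")["case_id_pkey"].count()
print(by_severity.sort_values(ascending=False))
```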
This table contains the SwiftFT catalog of point sources detected by the X-ray Telescope (XRT) on board the Swift satellite in observations centered on gamma-ray bursts (GRBs) during the first four years of operation (Jan 2005 - Dec 2008). Swift is a NASA mission with international participation dedicated to the study of gamma-ray bursts. It carries three instruments: the BAT, a large field-of-view instrument operating in the 10-300 keV energy band, and two narrow-field instruments, the XRT and the UVOT, which operate in the X-ray and UV/optical regimes, respectively. The catalog was derived from the pointing positions of the 374 fields centered on the GRBs, covering a total area of ~32.55 square degrees. Since GRBs are distributed randomly in the sky, the survey covers totally unrelated parts of the sky and is highly uniform, courtesy of the XRT's stable point spread function and small vignetting correction factors. The observations for a particular field were merged together, and the source search analysis was restricted to a circular area of 10 arcmin radius centered on the median of the individual observation aim points. The total exposure considering all the fields is 36.8 Ms, with ~32% of the fields having more than 100 ks exposure time and ~28% with exposure time in the range 50-100 ks. The catalog was generated by running the detection algorithm in the XIMAGE package version 4.4.1, which locates point sources using a sliding-cell method. The average background intensity is estimated in several small square boxes uniformly located within the image. The position and intensity of each detected source are calculated in a box whose size maximizes the signal-to-noise ratio. The detection algorithm was run separately in the following three energy bands: 0.3-3 (soft), 2-10 (hard), and 0.3-10 (full) keV. For each detection, the count rates in the soft, hard, and full bands are corrected for dead time and vignetting using exposure maps, and for the PSF. Hardness ratios are calculated from the soft and hard bands and defined as HR = (cH - cS)/(cH + cS), where cS and cH are the count rates in the S(oft) and H(ard) bands, respectively. The catalog was cleaned of spurious and extended sources by visual inspection of all the observations. Count rates in the three bands were converted into fluxes in the 0.5-10, 0.5-2, and 2-10 keV energy bands, respectively. The flux was estimated using a power-law spectrum with a photon spectral index of 1.8 and a Galactic NH of 3.3 x 10^20 cm^-2. Each row in the catalog is a unique source. The detections from the soft, hard, and full bands were merged into a single catalog using a matching radius of 6 arcsec and retaining detections with a probability of being spurious of <= 2 x 10^-5 in at least one band. There are 9387 total entries in the catalog. The SWIFTFT acronym honors both the Swift satellite and the memory of Francesca Tamburelli, who made numerous crucial contributions to the development of the Swift-XRT data reduction software. This database table was created by the HEASARC in November 2021 based on the electronic version available from the ASI Data Center (https://www.asdc.asi.it/xrtgrbdeep_cat/) and published in Astronomy & Astrophysics. This catalog is also available as the CDS catalog J/A+A/528/A122.
The HEASARC added the source_number parameter, a counter to numerically identify each source in the catalog, as well as Galactic coordinates, and changed the source name from SWIFTFTJHHMMSS.s+DDMM.m to SWIFTFT JHHMMSS.s+DDMM.m, adding a space between the catalog prefix and the formatted J2000 coordinates. This is a service provided by NASA HEASARC.
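The hardness-ratio definition quoted above translates directly into code; a minimal example:

```python
def hardness_ratio(c_soft, c_hard):
    """Hardness ratio as defined in the catalog description:
    HR = (cH - cS) / (cH + cS), with count rates in the 0.3-3 keV (soft)
    and 2-10 keV (hard) bands."""
    return (c_hard - c_soft) / (c_hard + c_soft)

# Example: equal soft and hard count rates give HR = 0.
print(hardness_ratio(1.0e-3, 1.0e-3))  # -> 0.0
```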
https://spdx.org/licenses/CC0-1.0.html
The development of high-fidelity mechanical property prediction models for the design of polycrystalline materials relies on large volumes of microstructural feature data. Concurrently, at these same scales, the deformation fields that develop during mechanical loading can be highly heterogeneous. Spatially correlated measurements of 3D microstructure and the ensuing deformation fields at the micro-scale would provide highly valuable insight into the relationship between microstructure and macroscopic mechanical response. They would also provide direct validation for numerical simulations that can guide and speed up the design of new materials and microstructures. However, to date, such data have been rare. Here, a one-of-a-kind, multi-modal dataset is presented that combines recent state-of-the-art experimental developments in 3D tomography and high-resolution deformation field measurements.

Methods

Material and Mechanical Testing. Wrought Inconel 718 (nominal composition in wt%: Ni (bal.) - 0.56 Al - 17.31 Fe - 0.14 Co - 17.97 Cr - 5.4 Nb+Ta - 1.00 Ti - 0.023 C - 0.0062 N) was subjected to a 30 minute annealing treatment at 1050 °C followed by water quenching, producing a grain size distribution centered at 62 microns with a nearly random texture. A two-step precipitation hardening treatment was conducted to form hardening precipitates. Tensile testing was performed at room temperature at a quasi-static strain rate using a custom in-situ 5000 N stage within a ThermoFisher Versa3D microscope on a flat dogbone-shaped specimen. The tensile test was interrupted at macroscopic plastic strain levels of 0.17%, 0.32%, 0.61%, and 1.26% for the collection of high-resolution images for digital image correlation (HR-DIC) measurements while loaded.

High-Resolution Digital Image Correlation. A gold nanoparticle speckle pattern with an average particle size of 60 nanometers was deposited on the sample surface for DIC measurements. SEM image sets were acquired from the middle of the gauge length before loading and under load. Tiles of 8×8 SEM images, before and after deformation, with an image overlap of 15%, were collected. Each image was acquired with a dwell time of 20 microseconds, a pixel resolution of 4096×4096, and a horizontal field width of 137 microns. Consequently, each pixel has a size of 33.4 nanometers. Regions of about 1×1 mm2 were investigated for the Inconel 718 nickel-based superalloy. DIC calculations were performed on these series of images and the results merged using a pixel-resolution merging procedure. A subset size of 31×31 pixels (1036.86×1036.86 nanometers) with a step size of 3 pixels (100.34 nanometers) was used for the DIC measurements. Digital image correlation was performed using the Heaviside-DIC method.

3D Crystallographic Orientation Measurements. The TriBeam system was used for the collection of orientation fields in 3D over a half cubic millimeter volume. After mechanical testing, the specimen was unloaded and surface EBSD measurements were performed on the specimen surface in the same region where the HR-DIC measurements were made. Electrical discharge machining cuts were performed to prepare a pillar with optimal geometry for a TriBeam experiment. The pillar was laser ablated with a step size of 1 micron in Z, the sectioning direction. Between each slice, EBSD measurements were collected with a step size of 1 micron in (X,Y) to form cubic voxels. A set of 526 slices was obtained during the experiment and reconstructed into a 3D dataset using the DREAM.3D software. Prior to reconstruction, each EBSD slice was aligned to match the corresponding BSE image.

Correlative Measurements: Multi-modal Data Merging. The strain fields obtained from DIC corresponding to the investigated free surface of the 3D dataset are provided for the different loading steps. All fields have been aligned to fit the free surface of the 3D dataset. The distortion between the two datasets was modeled using a polynomial function of degree 3. Individual slip traces were segmented from the DIC maps and indexed as individual features using the iterative Hough transformation method. The location of each slip band in the 3D volume (the coordinates of its endpoints on the (XY) surface), its inclination angle relative to the loading direction, its length, and its average in-plane slip intensity and direction are all calculated.

Mesh Generation with XtalMesh. One version of a mesh structure was created with XtalMesh, a publicly available code on GitHub. XtalMesh is used to create smooth representations of voxelized microstructures and leverages the state-of-the-art tetrahedralization algorithm fTetWild to generate an analysis-ready, boundary-conforming tetrahedral mesh. The base workflow of XtalMesh was modified to better preserve the many small and thin features (mainly twins) of the Inconel 718 dataset from the effects of excessive smoothing (shrinkage and/or thinning). First, the default smoothing operation of XtalMesh was applied to the parent grain surface mesh geometry rather than to all of the features/twins in the 3D dataset. This had the effect of smoothing only the twinned domains that bordered neighboring parent grains, leaving the twin boundaries still partially voxelized. At this point, the twinned regions of each parent grain were re-introduced into the parent grain mesh via a constructive solid geometry (CSG) technique. For each twin, in order of smallest to largest based on number of voxels, the intersection of its convex hull and the respective parent grain mesh was computed and inserted into the overall surface mesh of the microstructure. The parent grain mesh was then redefined as the difference between itself and the previously calculated intersection. This new parent grain mesh was then used for the insertion process of the next twin. After the insertion of all twins was complete, tetrahedralization was performed on the resulting surface mesh of the entire microstructure using the fTetWild meshing algorithm.

Geometric Reconstruction and Mesh Generation Using the Simmetrix Software Suite. While it is possible to directly generate a mesh from a voxel dataset, it is advantageous to introduce a geometric model, specifically a non-manifold boundary representation, as an intermediate representation of the analysis domain. Such a model provides an unambiguous representation of the analysis domain and a mechanism to associate information such as material properties in a manner that is independent of the mesh. To build a valid and appropriate (based on the needs of the simulation) finite element model from a voxel dataset assembled from a serial-sectioning EBSD measurement, various procedures to remove artifacts are required. These include the elimination of small groups of disconnected voxels and the removal of noise from the grain boundaries (e.g., through the use of erosion and dilation filters). For the In718 RVE, features smaller than 50 connected voxels were removed, followed by an erosion/dilation step using a 3x3x3 block structuring element. Care was taken not to apply the erosion filter to grains that were very thin (1 to 2 voxels thick) in order to preserve their geometry. This process was followed by the elimination of physically undesirable voxel configurations (e.g., voxel clusters of the same material connecting at a single voxel corner) that could create singularities in the finite element solution. The resulting geometric model represents each grain as a region (volume) with geometric faces (surfaces) representing grain boundaries. Attributes attached to each region allow the user to retrieve the grain ID as it was defined in the originating DREAM.3D dataset. At this stage, the face geometry still reflects the stair-stepped boundaries between the individual voxels; therefore, a geometry-based algorithm is used to create smooth geometric faces while preserving the overall shape of the grain boundaries. The resulting geometric model can be tagged with meshing and analysis attributes to generate a run-ready input deck for the finite element solver.
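To illustrate the kind of voxel clean-up described above (removal of small disconnected clusters followed by an erosion/dilation pass with a 3x3x3 block structuring element), here is a minimal Python sketch using scipy.ndimage on a hypothetical array of grain IDs; it is not the Simmetrix or XtalMesh workflow itself.

```python
import numpy as np
from scipy import ndimage

def clean_voxel_grains(grain_ids, min_voxels=50):
    """Sketch of the artifact-removal steps described above for a 3D array of
    integer grain labels (0 = unassigned). Not the actual In718 workflow."""
    cleaned = grain_ids.copy()
    block = np.ones((3, 3, 3), dtype=bool)  # 3x3x3 block structuring element

    for gid in np.unique(cleaned):
        if gid == 0:
            continue
        mask = cleaned == gid

        # 1. Drop disconnected clusters of this grain smaller than min_voxels.
        labels, n_comp = ndimage.label(mask, structure=block)
        sizes = np.bincount(labels.ravel())
        for comp in range(1, n_comp + 1):
            if sizes[comp] < min_voxels:
                cleaned[labels == comp] = 0

        # 2. Erosion followed by dilation (morphological opening) to remove
        #    single-voxel noise along the grain boundary.
        mask = cleaned == gid
        opened = ndimage.binary_dilation(
            ndimage.binary_erosion(mask, structure=block), structure=block
        )
        cleaned[mask & ~opened] = 0

    return cleaned
```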
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Walking and running are mechanically and energetically different locomotion modes. For selecting one or another, speed is a parameter of paramount importance. Yet, both are likely controlled by similar low-dimensional neuronal networks that reflect in patterned muscle activations called muscle synergies. Here, we investigated how humans synergistically activate muscles during locomotion at different submaximal and maximal speeds. We analysed the duration and complexity (or irregularity) over time of motor primitives, the temporal components of muscle synergies. We found that the challenge imposed by controlling high-speed locomotion forces the central nervous system to produce muscle activation patterns that are wider and less complex relative to the duration of the gait cycle. The motor modules, or time-independent coefficients, were redistributed as locomotion speed changed. These outcomes show that robust locomotion control at challenging speeds is achieved by modulating the relative contribution of muscle activations and producing less complex and wider control signals, whereas slow speeds allow for more irregular control.
In this supplementary data set we made available: a) the metadata with anonymized participant information, b) the raw EMG, c) the touchdown and lift-off timings of the recorded limb, d) the filtered and time-normalized EMG, e) the muscle synergies extracted via NMF and f) the code to process the data, including the scripts to calculate the Higuchi's fractal dimension (HFD) of motor primitives. In total, 180 trials from 30 participants are included in the supplementary data set.
The file “metadata.dat” is available in ASCII and RData format and contains:
Code: the participant’s code
Group: the experimental group in which the participant was involved (G1 = walking and submaximal running; G2 = submaximal and maximal running)
Sex: the participant’s sex (M or F)
Speeds: the type of locomotion (W for walking or R for running) and speed at which the recordings were conducted in 10*[m/s]
Age: the participant’s age in years
Height: the participant’s height in [cm]
Mass: the participant’s body mass in [kg]
PB: 100 m-personal best time (for G2).
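As an illustration, the ASCII metadata table could be loaded in Python roughly as follows; the delimiter is an assumption (whitespace-separated), and the RData version can of course be read directly in R.

```python
import pandas as pd

# Load the participant metadata (ASCII version). The delimiter is assumed to be
# whitespace; adjust `sep` if the file uses tabs or commas.
meta = pd.read_csv("metadata.dat", sep=r"\s+")

# Columns documented above: Code, Group, Sex, Speeds, Age, Height, Mass, PB.
print(meta[["Code", "Group", "Sex", "Age"]].head())

# Example: participants of group G2 (submaximal and maximal running).
g2 = meta[meta["Group"] == "G2"]
```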
The "RAW_DATA.RData" R list consists of elements of S3 class "EMG", each of which is a human locomotion trial containing cycle segmentation timings and raw electromyographic (EMG) data from 13 muscles of the right-side leg. Cycle times are structured as data frames containing two columns that correspond to touchdown (first column) and lift-off (second column). Raw EMG data sets are also structured as data frames with one row for each recorded data point and 14 columns. The first column contains the incremental time in seconds. The remaining 13 columns contain the raw EMG data, named with the following muscle abbreviations: ME = gluteus medius, MA = gluteus maximus, FL = tensor fasciæ latæ, RF = rectus femoris, VM = vastus medialis, VL = vastus lateralis, ST = semitendinosus, BF = biceps femoris, TA = tibialis anterior, PL = peroneus longus, GM = gastrocnemius medialis, GL = gastrocnemius lateralis, SO = soleus. Please note that the following trials include less than 30 gait cycles (the actual number shown between parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28(29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17). All the other trials consist of 30 gait cycles. Trials are named like “P20_R_20,” where the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicate the locomotion type (W=walking, R=running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). The filtered and time-normalized emg data is named, following the same rules, like “FILT_EMG_P03_R_30”.
Old versions not compatible with the R package musclesyneRgies
The files containing the gait cycle breakdown are available in RData format, in the file named “CYCLE_TIMES.RData”. The files are structured as data frames with as many rows as the available number of gait cycles and two columns. The first column named “touchdown” contains the touchdown incremental times in seconds. The second column named “stance” contains the duration of each stance phase of the right foot in seconds. Each trial is saved as an element of a single R list. Trials are named like “CYCLE_TIMES_P20_R_20,” where the characters “CYCLE_TIMES” indicate that the trial contains the gait cycle breakdown times, the characters “P20” indicate the participant number (in this example the 20th), the character “R” indicate the locomotion type (W=walking, R=running), and the numbers “20” indicate the locomotion speed in 10*m/s (in this case the speed is 2.0 m/s). Please note that the following trials include less than 30 gait cycles (the actual number shown between parentheses): P16_R_83 (20), P16_R_95 (25), P17_R_28 (28), P17_R_83 (24), P17_R_95 (13), P18_R_95 (23), P19_R_95 (18), P20_R_28 (25), P20_R_42 (27), P20_R_95 (25), P22_R_28 (23), P23_R_28(29), P24_R_28 (28), P24_R_42 (29), P25_R_28 (29), P25_R_95 (28), P26_R_28 (29), P26_R_95 (28), P27_R_28 (28), P27_R_42 (29), P27_R_95 (24), P28_R_28 (29), P29_R_95 (17).
The files containing the raw, filtered and the normalized EMG data are available in RData format, in the files named “RAW_EMG.RData” and “FILT_EMG.RData”. The raw EMG files are structured as data frames with as many rows as the amount of recorded data points and 13 columns. The first column named “time” contains the incremental time in seconds. The remaining 12 columns contain the raw EMG data, named with muscle abbreviations that follow those reported above. Each trial is saved as an element of a single R list. Trials are named like “RAW_EMG_P03_R_30”, where the characters “RAW_EMG” indicate that the trial contains raw emg data, the characters “P03” indicate the participant number (in this example the 3rd), the character “R” indicate the locomotion type (see above), and the numbers “30” indicate the locomotion speed (see above). The filtered and time-normalized emg data is named, following the same rules, like “FILT_EMG_P03_R_30”.
The files containing the muscle synergies extracted from the filtered and normalized EMG data are available in RData format, in the files named “SYNS_H.RData” and “SYNS_W.RData”. The muscle synergies files are divided in motor primitives and motor modules and are presented as direct output of the factorisation and not in any functional order. Motor primitives are data frames with 6000 rows and a number of columns equal to the number of synergies (which might differ from trial to trial) plus one. The rows contain the time-dependent coefficients (motor primitives), one column for each synergy plus the time points (columns are named e.g. “time, Syn1, Syn2, Syn3”, where “Syn” is the abbreviation for “synergy”). Each gait cycle contains 200 data points, 100 for the stance and 100 for the swing phase which, multiplied by the 30 recorded cycles, result in 6000 data points distributed in as many rows. This output is transposed as compared to the one discussed in the methods section to improve user readability. Each set of motor primitives is saved as an element of a single R list. Trials are named like “SYNS_H_P12_W_07”, where the characters “SYNS_H” indicate that the trial contains motor primitive data, the characters “P12” indicate the participant number (in this example the 12th), the character “W” indicate the locomotion type (see above), and the numbers “07” indicate the speed (see above). Motor modules are data frames with 12 rows (number of recorded muscles) and a number of columns equal to the number of synergies (which might differ from trial to trial). The rows, named with muscle abbreviations that follow those reported above, contain the time-independent coefficients (motor modules), one for each synergy and for each muscle. Each set of motor modules relative to one synergy is saved as an element of a single R list. Trials are named like “SYNS_W_P22_R_20”, where the characters “SYNS_W” indicate that the trial contains motor module data, the characters “P22” indicate the participant number (in this example the 22nd), the character “W” indicates the locomotion type (see above), and the numbers “20” indicate the speed (see above). Given the nature of the NMF algorithm for the extraction of muscle synergies, the supplementary data set might show non-significant differences as compared to the one used for obtaining the results of this paper.
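For orientation, here is a minimal sketch of extracting synergies from a filtered, time-normalized EMG matrix with non-negative matrix factorization using scikit-learn; the matrix orientation and the number of synergies are illustrative assumptions, and the dataset's own R pipeline (see below) remains the reference implementation.

```python
import numpy as np
from sklearn.decomposition import NMF

# emg: filtered, time-normalized EMG, shaped (time points, muscles),
# e.g. (6000, 12) for 30 cycles x 200 points and 12 muscles (old-version files).
rng = np.random.default_rng(0)
emg = rng.random((6000, 12))  # placeholder; load real FILT_EMG data instead

n_syn = 4  # illustrative; the dataset lets this differ from trial to trial
model = NMF(n_components=n_syn, init="nndsvd", max_iter=1000, random_state=0)

primitives = model.fit_transform(emg)   # (6000, n_syn) time-dependent coefficients
modules = model.components_.T           # (12, n_syn) time-independent coefficients
```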
The files containing the HFD calculated from motor primitives are available in RData format, in the file named “HFD.RData”. HFD results are presented in a list of lists containing, for each trial, 1) the HFD, and 2) the interval time k used for the calculations. HFDs are presented as one number (mean HFD of the primitives for that trial), as are the interval times k. Trials are named like “HFD_P01_R_95”, where the characters “HFD” indicate that the trial contains HFD data, the characters “P01” indicate the participant number (in this example the 1st), the character “R” indicates the locomotion type (see above), and the numbers “95” indicate the speed (see above).
All the code used for the pre-processing of the EMG data, the extraction of muscle synergies, and the calculation of the HFD is available in R format. Explanatory comments are provided throughout the script “muscle_synergies.R”.
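For readers working outside R, here is a minimal Python sketch of the standard Higuchi (1988) procedure applied to a single motor primitive; the dataset's “muscle_synergies.R” script remains the authoritative implementation.

```python
import numpy as np

def higuchi_fd(x, kmax=10):
    """Higuchi's fractal dimension of a 1-D series (e.g. a motor primitive),
    following the standard Higuchi (1988) construction."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ks = np.arange(1, kmax + 1)
    mean_lengths = []
    for k in ks:
        lengths = []
        for m in range(k):
            idx = np.arange(m, n, k)
            if idx.size < 2:
                continue
            # normalized curve length for the sub-series starting at offset m
            l_m = np.abs(np.diff(x[idx])).sum() * (n - 1) / ((idx.size - 1) * k)
            lengths.append(l_m / k)
        mean_lengths.append(np.mean(lengths))
    # The fractal dimension is the slope of log L(k) versus log(1/k).
    slope, _ = np.polyfit(np.log(1.0 / ks), np.log(mean_lengths), 1)
    return slope

# Example on white noise (HFD approaches ~2 for very irregular signals).
print(higuchi_fd(np.random.default_rng(0).standard_normal(6000), kmax=10))
```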
We introduce the Global rRNA Universal Metabarcoding Plankton database (McNichol and Williams et al., 2025), which consists of 1194 samples covering extensive latitudinal and longitudinal transects, including depth profiles, in all major ocean basins from 2003-2020. Unfractionated (>0.2 µm) seawater DNA samples were amplified using the 515Y/926R universal 3-domain rRNA primers, quantifying the relative abundance of amplicon sequencing variants (ASVs) from Bacteria, Archaea, and Eukaryotes with one denominator. Thus, the ratio of any organism (or group) to any other in a sample is directly comparable to the ratio in any other sample within the dataset, irrespective of gene copy number differences. This obviates a problem in prior global studies that used size-fractionation and different primers for prokaryotes and eukaryotes, precluding comparisons between abundances across size fractions or domains.
Sample Collection
Samples were collected by multiple collaborations, which used slightly different sample collection techniques. These collection techniques are outlined below by individual cruise.
For the Atlantic Meridional Transects (AMT 19 and AMT 20), 5–10 L of whole seawater was collected from the sea surface using a Niskin bottle and was then filtered onto 0.22 µm Sterivex Durapore filters (Millipore Sigma, Burlington, MA, USA). Samples were collected by Stephanie Sargeant and Andy Rees, Plymouth Marine Laboratory (PML), as part of the Atlantic Meridional Transect (AMT) (Rees, Smyth and Brotas, 2024) research cruises 19 (2009) and 20 (2010) onboard the UK research vessel RRS James Cook (JC039 and JC053; Rees, 2010a, 2010b). Sterivex filters were capped and stored in RNAlater® (ThermoFisher) at -80 °C until analysis.
For samples taken in the FRAM Strait (Wietz et al., 2021), whole seawater was collected using Remote Access Samplers (RAS; McLane) on the seafloor moorings F4-S-1, HG-IV-S-1, Fevi-34, and EGC-5. Moorings were operated within the FRAM / HAUSGARTEN Observatory, covering the West Spitsbergen Current, central Fram Strait, and East Greenland Current as well as the Marginal Ice Zone. The RAS performed continuous, autonomous sampling from July 2016 to August 2017 at programmed intervals (weekly to monthly). Nominal deployment depths were 30 m (F4, HG-IV), 67 m (Fevi), and 80 m (EGC). However, vertical movements in the water column resulted in variable actual sampling depths, ranging from 25 to 150 m. Per sampling event, two lots of 500 mL of whole seawater were pumped into bags containing mercuric chloride for fixation. After RAS recovery, the two samples per sampling event were pooled, and approximately 700 mL of pooled water was filtered onto 0.22 µm Sterivex cartridges. Filtered samples were stored at -20 °C until DNA extraction. For MOSAiC, whole seawater was collected from the upper water column via a rosette sampler equipped with Niskin bottles through a hole in the sea ice next to the RV Polarstern. Where possible, duplicate samples (two Niskins per depth) were collected during the up-casts near the surface (~5 m), at 10 m, the chlorophyll maximum (~20–40 m), 50 m, and 100 m. From these Niskins, 1-4 litres were filtered onto Sterivex filters (0.22 µm pore size) using a peristaltic pump in a temperature-controlled lab at 1 °C in the dark, using only red light. The number of Sterivex filters used per sampling event varied between two during polar night and 3-4 during polar day, depending on the biomass found in the samples. Sterivex filters were stored at -80 °C until further processing in the laboratory.
GEOTRACES cruises (Anderson et al., 2014), including transects GA02, GA03, GA10, and GP13, collected whole seawater using a Niskin bottle, filtering 100 mL of whole seawater between the surface and 5601 m onto 0.2 µm 25 mm polycarbonate filters. After filtration, 3 mL of sterile preservation solution (10 mM Tris, pH 8.0; 100 mM EDTA; 0.5 M NaCl) was added, and samples were stored in cryovials at -80°C until DNA extraction.
During the 2017 and 2019 SCOPE (Simons Collaboration on Ocean Processes and Ecology) - Gradients cruises, 0.7-4 L of whole seawater was collected at sea using the ship's underway system, which draws from approximately 7 m below the surface, as well as the rosette sampler for depths between 15 and 125 m, by Mary R. Gradoville, Brittany Stewart, and Esther Wing (Zehr lab) (Gradoville et al., 2020). This water was filtered onto 0.22 µm 25 mm Supor membrane filters (Pall Corporation, New York) and stored at -80 °C until DNA extraction.
The collection of Southern Ocean transects includes 1) the IND-2017 dataset, taken during the Totten Glacier-Sabrina Coast voyage in 2017 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2017_V01; 2) the Kerguelen-Axis Marine Science program (K-AXIS) in 2016 on the Australian Antarctic Division RV Aurora Australis 2015/16 voyage 3; 3) the Global Ocean Ship-based Hydrographic Investigations Program (GO-SHIP) P15S cruise in 2016 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2016_V03; and 4) the Heard Earth-Ocean-Biosphere Interactions (HEOBI) voyage in 2016 as part of the CSIRO Marine National Facility RV Investigator Voyage IN2016_V01. For these cruises, 2 L of whole seawater was filtered onto 0.22 µm Sterivex-GP polyethersulfone membrane filters (Millipore). This water was collected from the ship's underway system during IND-2017 by Amaranta Focardi (Paulsen Lab, Macquarie University), from between 5 and 4625 m during K-AXIS by Bruce Deagle and Lawrence Clarke (Australian Antarctic Division), from between 5 and 6015 m during GO-SHIP P15S by Eric J. Raes, Swan LS Sow and Gabriela Paniagua Cabarrus (Environmental Genomics Team, CSIRO Environment), Nicole Hellessey (University of Tasmania) and Bernhard Tschitschko (University of New South Wales), and from between 7 and 3579 m during HEOBI by Thomas Trull (CSIRO Environment). After filtration, samples were stored at -80 °C until analysis.
As part of GO-SHIP, there were several additional transects (i.e., I08S, I09N, P16 S/N), including some that also traversed into the Southern Ocean (i.e., I08S, P16S) or Arctic Ocean (P16N). For I08S and I09N, 2 L of whole seawater was filtered onto 0.22 µm 25 mm filters (Supor® hydrophilic polyethersulfone membrane) by Norm Nelson (I08S) and Elisa Halewood (I09N), UCSB, as part of the U.S. Global Ocean Ship-based Hydrographic Investigations Program aboard the R/V Roger Revelle during the cruises in 2007. Sucrose lysis buffer was added to filters, which were then stored at -80°C until DNA extraction. For P16N and P16S, samples were collected at various depths by Elisa Halewood and Meredith Meyers (Carlson Lab, UCSB) onto 0.22 µm 25 mm Supor filters during two latitudinal transects of the Pacific Ocean in 2005 and 2006 as part of the GO-SHIP repeat hydrography program (then known as CLIVAR). Samples were stored as partially extracted lysates in sucrose lysis buffer at -80°C until DNA extraction.
Finally, for samples from the Production Observations Through Another Trans-Latitudinal Oceanic Expedition (POTATOE) cruise, 20 L of whole seawater was collected from the sea surface between 1-2 m and filtered onto 0.22 µm Sterivex® filters during a “ship of opportunity” cruise on the RVIB Nathaniel B Palmer in 2003 (Baldwin et al., 2005). Sterivex filters were stored dry at -80°C until DNA extraction.
All datasets had corresponding environmental data. We included date, time, latitude, longitude, depth, temperature, salinity, and oxygen for all transects, and nutrient data where available. However, some cruises have other environmental data, which can be found at the British Oceanographic Data Centre https://www.bodc.ac.uk/ for both AMT cruises, at the CSIRO National Collections and Marine Infrastructure Data Trawler https://www.cmar.csiro.au/data/trawler/survey_details.cfm?survey=IN2016_V01 for IND-2017 and HEOBI, at the CLIVAR and Carbon Hydrographic Data Office https://cchdo.ucsd.edu/ for GO-SHIP P15S, P16N and P16S, at the Australian Antarctic Division Data Centre https://data.aad.gov.au/aadc/voyages/ for the K-AXIS cruise, at https://doi.org/10.6075/J0CCHLY9 for the I08S and I09N cruises, at the MGDS (Marine Geoscience Data System: https://www.marine-geo.org) for POTATOE, at https://scope.soest.hawaii.edu/data/gradients/documents/ for both SCOPE-Gradients cruises, and at PANGAEA https://www.pangaea.de/ for FRAM Strait and MOSAiC. Finally, we have also used satellite data to estimate the euphotic zone depth, i.e. the depth at which photosynthetically available radiation (PAR) is 1% of its surface value (Lee et al., 2007; Kirk, 2010). We approximated the euphotic zone depth using the light attenuation at 490 nm (Kd490) product and the relationship Zeu(1%) = 4.6 / Kd490. We also used the Longhurst-Province-Finder script https://github.com/thechisholmlab/Longhurst-Province-Finder to assign each sample to the Longhurst Province in which it was sampled, another useful column to help subset the data and investigate specific regions of the ocean.
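The euphotic zone relationship quoted above is simple to reproduce; for example:

```python
import numpy as np

def euphotic_depth_1pct(kd_490):
    """Euphotic zone depth (m) where PAR falls to 1% of its surface value,
    using the relationship quoted above: Zeu(1%) = 4.6 / Kd490."""
    kd_490 = np.asarray(kd_490, dtype=float)
    return 4.6 / kd_490

# Example: a Kd490 of 0.05 m^-1 gives a euphotic depth of ~92 m.
print(euphotic_depth_1pct(0.05))
```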
DNA Extraction
For the AMT cruises, DNA was isolated using the Qiagen AllPrep DNA/RNA Mini kit (Hilden, Germany) with modifications to be compatible with RNAlater® and to disrupt cell membranes (Varaljay et al., 2015). Briefly, the filter was removed from the Sterivex housing and immersed in RLT+ buffer that had been amended with 10 µl of 1 N NaOH per 1 ml of buffer, followed by a 2 minute agitation in a Mini-Beadbeater-96 (Biospec Inc., Bartlesville, OK, USA) with 0.1 and 0.5 mm sterile glass beads.
https://creativecommons.org/publicdomain/zero/1.0/
By US Open Data Portal, data.gov [source]
- Analyzing and comparing the environmental benefits of different agricultural management practices, such as crop yields and carbon sequestration rates.
- Developing an app or other mobile platform to help farmers find management practices that maximize carbon sequestration and minimize GHG emissions in their area, based on their specific soil condition and climate data.
- Building an AI-driven model to predict net greenhouse gas emissions and C sequestration from potential weekly/monthly production plans across different regions of the world, based on optimal allocation of resources such as fertilizers, equipment, and water.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the ...