Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Example DataFrame (Teeny-Tiny Castle)
This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.
How to Use
from datasets import load_dataset

# Load the training split of the example data frame from the Hugging Face Hub
dataset = load_dataset("AiresPucrs/example-data-frame", split="train")
This dataset was created by Leonie
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.
The dataset was generated with the following R script.
# Set seed for reproducibility
set.seed(42)
# Define number of observations (students)
n <- 5000
# Generate study hours (independent variable)
# Uniform distribution between 0 and 12 hours
study_hours <- runif(n, min = 0, max = 12)
# Create relationship between study hours and grade
# Base grade: 40 points
# Each study hour adds an average of 5 points
# Add normal noise (standard deviation = 10)
theoretical_grade <- 40 + 5 * study_hours
# Add normal noise to make it realistic
noise <- rnorm(n, mean = 0, sd = 10)
# Calculate final grade
grade <- theoretical_grade + noise
# Limit grades between 0 and 100
grade <- pmin(pmax(grade, 0), 100)
# Create the dataframe
dataset <- data.frame(
student_id = 1:n,
study_hours = round(study_hours, 2),
grade = round(grade, 2)
)
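A quick sanity check on the simulated data: because the generating process above uses an intercept of 40, a slope of 5 and Gaussian noise (sd = 10), an ordinary linear regression should approximately recover these values (the clamping to [0, 100] attenuates them slightly). A minimal sketch using the dataset created above:
# Fit grade ~ study_hours and inspect the estimated intercept and slope;
# they should be close to the true values of 40 and 5.
model <- lm(grade ~ study_hours, data = dataset)
summary(model)$coefficients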
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
"WeAreHere!" Management and teaching teams' questionnaire. This dataset includes: (1) the WaH management and teaching teams' questionnaire (21 questions including 5-point Likert scale questions, dichotomous questions, multiple choice questions, open questions and an open space for comments). The Catalan version (original), and the Spanish and English versions of the questionnaire can be found in this dataset in pdf format. (2) The data frame in xlsx format, with the management and teaching teams' answers to the questionnaire (a total of 322 answers).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by _anxious
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
"WeAreHere!" Children's questionnaire. This dataset includes: (1) the WaH children's questionnaire (20 questions including 5-point Likert scale questions, dichotomous questions and an open space for comments). The Catalan version (original), and the Spanish and English versions of the questionnaire can be found in this dataset in pdf format. (2) The data frame in xlsx format, with the children's answers to the questionnaire (a total of 3664 answers) and a reduced version of it for doing the regression (with the 5-point likert scale variable "ask for help" transformed into a dichotomous variable). (3) The data frame in xlsx format, with the children's answers to the questionnaire and the categorization of their comments (sheet 1), the data frame with only the MCA variables selected (sheet 2), and the categories and subcategories table (sheet 3). (4) The data analysis procedure for the regression, the component and multiple component analysis (R script).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles, with 35,028 real and 37,106 fake news items. For this, we merged four popular news datasets (Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training. The dataset contains four columns: Serial number (starting from 0), Title (the news heading), Text (the news content), and Label (0 = fake, 1 = real). The csv file holds 78,098 data entries, of which only 72,134 are accessible via the data frame. This dataset is part of our ongoing research on "Fake News Prediction on Social Media Website" within the doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union's Horizon 2020 research and innovation programme.
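A minimal sketch for loading the csv in R and checking the class balance (the file name and column capitalisation are assumptions; the columns are as described above):
# File name and column name are assumptions; adjust to the downloaded file.
welfake <- read.csv("WELFake_Dataset.csv", stringsAsFactors = FALSE)
nrow(welfake)            # number of entries accessible via the data frame
table(welfake$label)     # 0 = fake, 1 = real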
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Datasets and code used in the paper "Using large language models to address the bottleneck of georeferencing natural history collections".
1. System requirements: Windows 10; R: v4.2.2; Python: v3.8.12.
2. Instructions for use: the "data" folder contains the key sampling and intermediate data from the analysis process of this study. The initial specimen dataset, comprising a total of 13,064,051 records from the Global Biodiversity Information Facility (GBIF), can be downloaded from GBIF, DOI: https://doi.org/10.15468/dl.fj3sqk.
Data file names and their meaning or purpose:
occurrence_filter_clean.csv: the data before sampling 5,000 records based on continents, after cleaning the initial specimen data
main data frame 5000_only country state county locality.csv: the 5,000 sample records used for georeferencing, containing only basic information such as country, state/province, county, locality, and the true latitude and longitude from GBIF
main data frame 100_only country state county locality.csv: the 100 sub-sample records used for human and reasoning-LLM georeferencing, containing only basic information such as country, state/province, county, locality, and the true latitude and longitude from GBIF
main data frame 5000.csv: all output data and required records from the analysis of the 5,000 sample points, including coordinates and error distances from the various georeferencing methods, locality text features, and readability metrics
main data frame 100.csv: all output data and required records from the analysis of the 100 sub-sample points, including coordinates and error distances from the various georeferencing methods, locality text features, and readability metrics
georef_errorDis.csv: used for Figure 1b
summary_error_time_cost.csv: time taken and cost records for the various georeferencing methods, used for Figure 4
for_human_completed.csv: results of manual georeferencing by the participants
hf_v2geo.tif: Global Human Footprint Dataset (Geographic) (Version 2.00), from https://gis.earthdata.nasa.gov/portal/home/item.html?id=048c92f5ce50462a86b0837254924151, used for Figure 5a
country file folder: global country and county polygon vector data, used to extract centroid coordinates of counties in ArcGIS v10.8
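The error distances recorded in main data frame 5000.csv are, in essence, great-circle distances between each method's predicted coordinates and the true GBIF coordinates. A minimal sketch of that computation (the column names are illustrative, not the actual headers):
library(geosphere)   # distHaversine() returns great-circle distances in metres

df <- read.csv("main data frame 5000.csv", check.names = FALSE)
# Illustrative column names for one georeferencing method.
error_km <- distHaversine(
  cbind(df$pred_longitude, df$pred_latitude),
  cbind(df$true_longitude, df$true_latitude)
) / 1000
summary(error_km)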
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Here you can find the raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON. Each dataset is stored in a separate folder containing 4 files:
json_info: the number of features (with their names) and the number of subjects available for the dataset
data_testing: data frame with the data used to test the trained model
data_training: data frame with the data used to train models
results: direct, unfiltered data from the database
Files are written in feather format, and an example of the data structure for each file in the repository is provided. Files were compressed using 7-Zip, available at https://www.7-zip.org/.
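Since the files are written in feather format, they can be read into R data frames, for example with the arrow package (the folder and file names below follow the layout described above but are illustrative):
library(arrow)   # provides read_feather()

# Illustrative paths: one dataset folder with its training and testing frames.
data_training <- read_feather("dataset_01/data_training")
data_testing  <- read_feather("dataset_01/data_testing")
dim(data_training)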
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset presents network traffic traces of 14 D-Link IoT devices of different types, including camera, network camera, smart plug, door-window sensor, and home hub. It consists of:
• Network packet traces (inbound and outbound traffic) and
• IEEE 802.11 MAC frame traces.
The experimental testbed, which included an access point running on a laptop, was set up in the Network Systems and Signal Processing (NSSP) laboratory at Universiti Brunei Darussalam (UBD) to collect all the network traffic traces from 9 September 2020 to 10 January 2021. The network traffic traces were captured by passively observing the Ethernet interface and the WiFi interface at the access point.
The packet traces capture data from the typical communication protocols that IoT devices use to communicate on the Internet, such as TCP, UDP, IP, ICMP, ARP, DNS, SSDP and TLS/SSL. The probe request frame traces (a subtype of management frames) record the data that IoT devices use to connect to the access point on the local area network.
The authors would like to thank the Faculty of Integrated Technologies, Universiti Brunei Darussalam, for the support to conduct this research experiment in the Network Systems and Signal Processing laboratory.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This repo provides all datasets, or external links to them, for the CoSense3D project. The related datasets are:
COMAP: A synthetic dataset generated by CARLA for cooperative perception.
OPV2Vt: A synthetic dataset generated by CARLA with the replay files provided by the OPV2V dataset, for globally time-aligned cooperative object detection (TA-COOD). The original replay files are interpolated to obtain the object and sensor locations at sub-frames; each frame is split into 10 sub-frames for simulation.
DairV2Xt: Newly generated meta files based on the DAIR-V2X dataset for the CoSense3D project, with localization corrections and ground truth generated for TA-COOD.
OPV2Va: A synthetic dataset generated by CARLA with the replay files provided by the OPV2V dataset, augmented with semantic labels.
[!NOTE] If the download speed is very slow, you can also try Baidu Cloud: https://pan.baidu.com/s/12HZ1yk0y84NJfStZADMstA?pwd=hkja (extraction code: hkja).
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
About
The following datasets were captured at a busy Belgian train station between 9 pm and 10 pm; they contain all 802.11 management frames that were observed. The two datasets were captured approximately 20 minutes apart.
Both datasets are represented by a pcap and a CSV file. The CSV file contains the frame type, timestamp, signal strength, SSID and MAC addresses for every frame. In the pcap file, all generic 802.11 elements were removed for anonymization purposes.
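A minimal sketch for tabulating the frames in one of the CSV files with R (the file path and column names are assumptions; check the README for the actual headers):
# File path and column names ("frame_type", "mac_address") are assumptions.
frames <- read.csv("1/capture.csv", stringsAsFactors = FALSE)
table(frames$frame_type)               # beacon / request / response counts
length(unique(frames$mac_address))     # identified MAC addresses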
Anonymization
All frames were anonymized by removing identifying information or renaming identifiers. Concretely, the following transformations were applied to both datasets:
In the pcap file, anonymization actions could lead to "corrupted" frames because length tags do not correspond with the actual data. However, the file and its frames are still readable in packet analyzing tools such as Wireshark or Scapy.
The script which was used to anonymize is available in the dataset.
Data
| No. | Dataset 1 | Dataset 2 |
|---|---|---|
| Frames | 36306 | 60984 |
| Beacon frames | 19693 | 27983 |
| Request frames | 798 | 1580 |
| Response frames | 15815 | 31421 |
| Identified Wi-Fi Networks | 54 | 70 |
| Identified MAC addresses | 2092 | 2705 |
| Identified Wireless devices | 128 | 186 |
| Capture time | 480 s | 422 s |
Dataset contents
The two datasets are stored in the directories `1/` and `2/`. Each directory contains:
`anonymization.py` is the script which was used to remove identifiers.
`README.md` contains the documentation about the datasets
License
Copyright 2022-2023 Benjamin Vermunicht, Beat Signer, Maxim Van de Wynckel, Vrije Universiteit Brussel
Permission is hereby granted, free of charge, to any person obtaining a copy of this dataset and associated documentation files (the “Dataset”), to deal in the Dataset without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Dataset, and to permit persons to whom the Dataset is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions that make use of the Dataset.
THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
After the digital data were transmitted to Earth and received at the Jet Propulsion Laboratory, they were subjected to a variety of processes to produce the final digital tapes and photoproducts. The first step was to strip out all the non-video data and produce a System Data Record (SDR). This was compiled into video format, and an Experiment Data Record (EDR) was produced. The EDR data consist of unprocessed (raw) instrument data. Substantial processing is required to reconstruct each image owing to the unique manner in which the data were transmitted to Earth. Images were initially recorded on 7-track magnetic tape recorders on the spacecraft. Each raw data frame retrieved from the tracking station thus contains every seventh pixel, arranged in either increasing or decreasing order. Image data reconstructed from these raw data frames by the Mission Test Imaging System (MTIS) form the EDR digital archive tape.
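As a purely illustrative sketch (not the MTIS code), the reconstruction amounts to interleaving seven raw data frames, each of which carries every seventh pixel of a line:
# Illustrative de-interleaving: frame k holds pixels k, k+7, k+14, ... of a line.
line_length <- 7 * 100                                                # hypothetical line length
raw_frames  <- lapply(1:7, function(k) seq(k, line_length, by = 7))   # stand-in data
line <- numeric(line_length)
for (k in 1:7) {
  line[seq(k, line_length, by = 7)] <- raw_frames[[k]]
}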
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Households are the fundamental units of co-residence and play a crucial role in social and economic reproduction worldwide. They are also widely used as units of enumeration for data collection purposes, with substantive implications for research on poverty, living conditions, family structure, and gender dynamics. However, reliable comparative data on households and on changes in living arrangements around the world are still under development. The CORESIDENCE database (CoDB) aims to bridge this data gap by offering valuable insights not only into the documented disparities between countries but also into the often-elusive regional differences within countries. By providing comprehensive data, it facilitates a deeper understanding of the complex dynamics of co-residence around the world. The database is a significant contribution to research, as it sheds light on both macro-level variations across nations and micro-level variations within specific regions, facilitating more nuanced analyses and evidence-based policymaking. The CoDB is composed of three datasets covering 155 countries (National Dataset), 3563 regions (Subnational Dataset), and 1511 harmonized regions (Subnational-Harmonized Dataset) for the period 1960 to 2021, and it provides 146 indicators on household composition and family arrangements across the world.
This repository is composed of an RData file named CORESIDENDE_DATABASE containing the CoDB in the form of a list. The CORESIDENDE_DB list object is composed of six elements:
NATIONAL: a data frame with the household composition and living arrangements indicators at the national level.
SUBNATIONAL: a data frame with the household composition and living arrangements indicators at the subnational level, computed over the original subnational division provided in each sample and data source.
SUBNATIONAL_HARMONIZED: a data frame with the household composition and living arrangements indicators computed over the harmonized subnational regions.
SUBNATIONAL_BOUNDARIES_CORESIDENCE: a spatial data frame (an sf object) with the boundary delimitation of the subnational harmonized regions created for this project.
CODEBOOK: a data frame with the complete list of indicators, their code names and descriptions.
HARMONIZATION_TABLE: a data frame with the full list of individual country-year samples employed in this project and their state of inclusion in the three datasets composing the CoDB.
Elements 1, 2, 3, 5 and 6 of the R list are also provided as csv files under the same names. Element 4, the harmonized boundaries, is available as a gpkg (GeoPackage) file.
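A minimal sketch for loading the database in R and accessing its elements (the file extension is an assumption; the object and element names are as listed above):
# Loading the RData file makes the CORESIDENDE_DB list available in the session.
load("CORESIDENDE_DATABASE.RData")   # file extension assumed

national   <- CORESIDENDE_DB$NATIONAL
codebook   <- CORESIDENDE_DB$CODEBOOK
boundaries <- CORESIDENDE_DB$SUBNATIONAL_BOUNDARIES_CORESIDENCE   # sf object
head(codebook)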
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is associated with the research article titled "Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection". This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with an emphasis on explainability, cross-modal generalization, and robust performance.
🗂️ Dataset Contents
This repository contains the following resources:
Aggregated Raw Corpus (aggregated_raw.csv): 286,260 samples across 8 datasets. Binary labels (1 = Fake, 0 = Real). Includes metadata: source dataset, topic (if available), speaker/source, etc.
Preprocessed Text Corpus (aggregated_cleaned.csv): includes a standardized and cleaned cleaned_text column. Text normalization applied using SpaCy (lowercasing, lemmatization, punctuation/URL/user removal).
Fully Encoded Feature Matrix (xframe_features_encoded.csv): 104 structured features derived from communication theory and media psychology. Includes source encoding, speaker credibility, social engagement, sentiment, subjectivity, sensationalism, and readability scores. All numerical features scaled to [0, 1]; categorical features one-hot encoded.
Data Splits: train.csv, val.csv, test.csv, stratified splits of the cleaned and encoded data.
Feature Metadata (feature_description.pdf): documentation of all 104 features with descriptions, data sources, and rationales.
🔧 Preprocessing Overview
To ensure robust and generalizable modeling, the following standardized pipeline was applied:
Text Preprocessing: cleaned using SpaCy, lowercased, lemmatized, and stripped of stopwords, URLs, and usernames.
Label Mapping: datasets with multiclass labels (e.g. LIAR, FNC-1) were mapped to a unified binary schema using theory-informed rules. 1 = Fake includes false, pants-on-fire, disagree, etc.; 0 = Real includes true, agree, mostly-true.
Deduplication: removed near-duplicate entries across datasets using fuzzy string matching and content hashing.
Feature Engineering: source credibility features (e.g. speaker credibility from LIAR), social context (e.g. tweet volume, user mentions), and framing indicators (e.g. sentiment, subjectivity, sensationalism, readability).
Feature Encoding: one-hot encoding for categorical attributes, Min-Max scaling for numerical features.
📚 Original Data Sources
This aggregated corpus was derived from the following datasets. Please cite them individually alongside this collection:
LIAR – Wang (2017): https://doi.org/10.18653/v1/P17-2067
FakeNewsNet (PolitiFact, BuzzFeed, GossipCop) – Shu et al.: https://doi.org/10.1145/3363574
ISOT – Ahmed et al.: https://doi.org/10.48550/arXiv.1708.07104
WELFake – Verma et al.: https://doi.org/10.1109/TCSS.2021.3068519
FNC-1 – https://www.fakenewschallenge.org/
FakeNewsAMT – Pérez-Rosas et al.: https://doi.org/10.18653/v1/C18-1287
Celebrity Rumors – Horne & Adalı: https://doi.org/10.1609/icwsm.v11i1.15015
PHEME – Zubiaga et al.: https://doi.org/10.6084/m9.figshare.4010619.v1
📖 How to Cite This Dataset
Nwaiwu, S.; Jongsawat, N.; Tungkasthan, A. Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection. Appl. Sci. 2025, 15, 9498. https://doi.org/10.3390/app15179498
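A minimal sketch for loading the stratified splits in R (file names as listed above; the label column name is an assumption):
train <- read.csv("train.csv", stringsAsFactors = FALSE)
val   <- read.csv("val.csv",   stringsAsFactors = FALSE)
test  <- read.csv("test.csv",  stringsAsFactors = FALSE)

# Binary label balance (1 = Fake, 0 = Real); "label" is an assumed column name.
prop.table(table(train$label))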
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Overview
The HA4M dataset is a collection of multi-modal data relative to actions performed by different subjects in an assembly scenario for manufacturing. It has been collected to provide a good test-bed for developing, validating and testing techniques and methodologies for the recognition of assembly actions. To the best of the authors' knowledge, few vision-based datasets exist in the context of object assembly. The HA4M dataset provides a considerable variety of multi-modal data compared to existing datasets. Six types of simultaneous data are supplied: RGB frames, Depth maps, IR frames, RGB-Depth-Aligned frames, Point Clouds and Skeleton data. These data allow the scientific community to make consistent comparisons among processing or machine learning approaches by using one or more data modalities. Researchers in computer vision, pattern recognition and machine learning can use/reuse the data for different investigations in different application domains such as motion analysis, human-robot cooperation, action recognition, and so on.
Dataset details
The dataset includes 12 assembly actions performed by 41 subjects for building an Epicyclic Gear Train (EGT). The assembly task involves three phases: first, the assembly of Block 1 and Block 2 separately, and then the final setting up of both Blocks to build the EGT. The EGT is made up of a total of 12 components divided into two sets: the first eight components for building Block 1 and the remaining four components for Block 2. Finally, two screws are fixed with an Allen key to assemble the two blocks and thus obtain the EGT.
Acquisition setup
The acquisition experiment took place in two laboratories (one in Italy and one in Spain), where an acquisition area was reserved for the experimental setup. A Microsoft Azure Kinect camera acquired videos during the execution of the assembly task. It was placed in front of the operator and the table where the components are spread over, on a tripod at a height of 1.54 m and a distance of 1.78 m, and down-tilted by an angle of 17 degrees.
Technical information
The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects (15 females and 26 males). Their ages ranged from 23 to 60. All the subjects participated voluntarily and were provided with a written description of the experiment. Each subject was asked to execute the task several times and to perform the actions at their own convenience (e.g. with both hands), independently of their dominant hand. The HA4M project is a growing project, so new acquisitions, planned for the near future, will expand the current dataset.
Actions
Twelve actions are considered in HA4M. Actions 1 to 4 are needed to build Block 1, actions 5 to 8 build Block 2, and actions 9 to 12 complete the EGT. The actions are listed below:
1. Pick up/Place Carrier
2. Pick up/Place Gear Bearings (x3)
3. Pick up/Place Planet Gears (x3)
4. Pick up/Place Carrier Shaft
5. Pick up/Place Sun Shaft
6. Pick up/Place Sun Gear
7. Pick up/Place Sun Gear Bearing
8. Pick up/Place Ring Bear
9. Pick up Block 2 and place it on Block 1
10. Pick up/Place Cover
11. Pick up/Place Screws (x2)
12. Pick up/Place Allen Key, Turn Screws, Return Allen Key and EGT
Annotation
Data annotation concerns the labeling of the different actions in the video sequences. The annotation of the actions has been done manually by observing the RGB videos frame by frame. The start frame of each action is identified as the subject starts to move the arm toward the component to be grasped. The end frame, instead, is recorded when the subject releases the component, so the next frame becomes the start frame of the subsequent action. The total number of actions annotated in this study is 4123, including the "don't care" action (ID=0) and the action repetitions in the case of actions 2, 3 and 11.
Available code
The dataset has been acquired using the Multiple Azure Kinect GUI software, available at https://gitlab.com/roberto.marani/multiple-azure-kinect-gui, based on the Azure Kinect Sensor SDK v1.4.1 and Azure Kinect Body Tracking SDK v1.1.2. The software records device data to a Matroska (.mkv) file containing video tracks, IMU samples, and device calibration. In this work, IMU samples are not considered. The same Multiple Azure Kinect GUI software processes the Matroska file and returns the different types of data provided with our dataset: RGB images, RGB-Depth-Aligned (RGB-A) images, Depth images, IR images, Point Cloud and Skeleton data.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.
Each competition has a text description and metadata reflecting the characteristics of the competition, the dataset used and the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code-block files with corresponding metadata: snippets from kernels published up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv). The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.
Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
Since the marked-up code block data contain the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
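A minimal sketch for loading the main tables in R (file names as listed above):
# Competition metadata and the two code-block tables described above.
competitions <- read.csv("competitions.csv", stringsAsFactors = FALSE)
code_blocks  <- read.csv("code_blocks_upto_20.csv", stringsAsFactors = FALSE)
markup       <- read.csv("markup_data_20220415.csv", stringsAsFactors = FALSE)
nrow(code_blocks)   # code blocks from kernels published up to 2020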
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Data frame and analysis script
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
General description
SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels. SAPFLUXNET is a harmonised database, compiled from contributions from researchers worldwide.
The SAPFLUXNET version 0.1.5 database harbours 202 globally distributed datasets, from 121 geographical locations. SAPFLUXNET contains sap flow data for 2714 individual plants (1584 angiosperms and 1130 gymnosperms), belonging to 174 species (141 angiosperms and 33 gymnosperms), 95 different genera and 45 different families. More information on the database coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.
The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researcher's Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00) and an Alexander von Humboldt Research Fellowship for Experienced Researchers.
Changelog
Compared to version 0.1.4, this version includes some changes in the metadata, but all time series data (sap flow, environmental) remain the same.
For all datasets, climate metadata (temperature and precipitation, ‘si_mat’ and ‘si_map’) have been extracted from CHELSA (https://chelsa-climate.org/), replacing the previous climate data obtained with WorldClim. This change has modified the biome classification of the datasets in ‘si_biome’.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) is now assigned a value of 0 if species are in the understorey. This affects two datasets: AUS_MAR_UBD and AUS_MAR_UBW, where, previously, the sum of species basal area percentages could add up to more than 100%.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) has been corrected for datasets USA_SIL_OAK_POS, USA_SIL_OAK_1PR, USA_SIL_OAK_2PR.
In ‘site’ metadata, the vegetation type (‘si_igbp’) has been changed to SAV for datasets CHN_ARG_GWD and CHN_ARG_GWS.
Variables and units
SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.
Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. The description and units of all data and metadata variables can be found here: Metadata and data units.
To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:
library(sapfluxnetr)
vignette(package='sapfluxnetr')
vignette('metadata-and-data-units', package='sapfluxnetr')
vignette('data-flags', package='sapfluxnetr')
Data formats
SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.
Working with sfn_data files
To work with SAPFLUXNET data, first they have to be downloaded from Zenodo, maintaining the folder structure. A first level in the folder hierarchy corresponds to file format, either RData files or csv's. A second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in a sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of csv files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).
After downloading the entire database, the sapfluxnetr package can be used to: work with data from a single site (data access, plotting and time aggregation); select the subset of datasets to work with; and work with data from multiple sites (data access, plotting and time aggregation).
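As a minimal sketch of that workflow (assuming the read_sfn_data() and daily_metrics() helpers described in the sapfluxnetr vignettes; please check the vignettes for the exact arguments):
library(sapfluxnetr)

# Read one dataset from the downloaded folder of per-plant RData files;
# the site code and folder path are illustrative.
site <- read_sfn_data("ARG_MAZ", folder = "RData/plant")

# Aggregate the sub-daily time series to daily metrics.
site_daily <- daily_metrics(site)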
Please check the following package vignettes to learn more about how to work with sfn_data files:
Quick guide
Metadata and data units
sfn_data classes
Custom aggregation
Memory and parallelization
Working with text files
We recommend working with sfn_data objects using R and the sapfluxnetr package; we do not currently provide code for working with the text files.
Data issues and reporting
Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.
Temporary data fixes, detected but not yet included in released versions, will be published on the SAPFLUXNET main web page ('Known data errors').
Data access, use and citation
This version of the SAPFLUXNET database is open access and corresponds to the data paper submitted to Earth System Science Data in August 2020.
When using SAPFLUXNET data in an academic work, please cite the data paper, when available, or alternatively, the Zenodo dataset (see the ‘Cite as’ section on the right panels of this web page).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Publication
Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.), Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387
Description of R codes and data files in the repository
This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).
The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).
These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); it is available in input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt.
Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).
The repository adopts the project-oriented workflow in RStudio; double-click the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
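The normalisation performed by the second script boils down to scaling each raw co-occurrence frequency by the decade's corpus size (a sketch of the per-million-words calculation, not the script itself):
# Normalise a raw collocate frequency to per-million-words, given the
# COHA word count for that decade (as stored in coha_size.txt).
normalise_pmw <- function(freq, decade_size) {
  freq / decade_size * 1e6
}
normalise_pmw(150, 24e6)   # e.g. 150 tokens in a 24-million-word decade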