100+ datasets found
  1. example-data-frame

    • huggingface.co
    + more versions
    Cite
    AI Robotics Ethics Society (PUCRS), example-data-frame [Dataset]. https://huggingface.co/datasets/AiresPucrs/example-data-frame
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    AI Robotics Ethics Society (PUCRS)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Example DataFrame (Teeny-Tiny Castle)

    This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.

      How to Use
    

    from datasets import load_dataset

    dataset = load_dataset("AiresPucrs/example-data-frame", split="train")

  2. Sample Dataset for DataFrame Styling

    • kaggle.com
    zip
    Updated Jun 11, 2022
    Cite
    Leonie (2022). Sample Dataset for DataFrame Styling [Dataset]. https://www.kaggle.com/datasets/iamleonie/sample-dataset-for-dataframe-styling
    Explore at:
    Available download formats: zip (257 bytes)
    Dataset updated
    Jun 11, 2022
    Authors
    Leonie
    Description

    Dataset

    This dataset was created by Leonie

    Contents

  3. Study Hours vs Grades Dataset

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Andrey Silva (2025). Study Hours vs Grades Dataset [Dataset]. https://www.kaggle.com/datasets/andreylss/study-hours-vs-grades-dataset
    Explore at:
    Available download formats: zip (33964 bytes)
    Dataset updated
    Oct 12, 2025
    Authors
    Andrey Silva
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.

    Dataset Features

    • student_id: Unique identifier for each student (1-5000)
    • study_hours: Hours spent studying (0-12 hours, continuous)
    • grade: Final exam score (0-100 points, continuous)

    Potential Use Cases

    • Linear regression modeling and practice
    • Data visualization exercises
    • Statistical analysis tutorials
    • Machine learning for beginners
    • Educational research simulations

    Data Quality

    • No missing values
    • Normally distributed residuals
    • Realistic educational scenario
    • Ready for immediate analysis

    Data Generation Code

    This dataset was generated using R.

    R Code

    # Set seed for reproducibility
    set.seed(42)
    
    # Define number of observations (students)
    n <- 5000
    
    # Generate study hours (independent variable)
    # Uniform distribution between 0 and 12 hours
    study_hours <- runif(n, min = 0, max = 12)
    
    # Create relationship between study hours and grade
    # Base grade: 40 points
    # Each study hour adds an average of 5 points
    # Add normal noise (standard deviation = 10)
    theoretical_grade <- 40 + 5 * study_hours
    
    # Add normal noise to make it realistic
    noise <- rnorm(n, mean = 0, sd = 10)
    
    # Calculate final grade
    grade <- theoretical_grade + noise
    
    # Limit grades between 0 and 100
    grade <- pmin(pmax(grade, 0), 100)
    
    # Create the dataframe
    dataset <- data.frame(
     student_id = 1:n,
     study_hours = round(study_hours, 2),
     grade = round(grade, 2)
    )
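    For readers working in Python, the generation logic above can be sketched with the standard library (a rough equivalent of the R code, not the author's code; note that Python's random streams differ from R's, so the exact values will not match):

    ```python
    import random

    # Seed for reproducibility (not the same stream as R's set.seed(42))
    random.seed(42)

    n = 5000
    rows = []
    for student_id in range(1, n + 1):
        study_hours = random.uniform(0, 12)                # uniform 0-12 hours
        grade = 40 + 5 * study_hours + random.gauss(0, 10) # base 40, +5/hour, sd-10 noise
        grade = min(max(grade, 0), 100)                    # clamp to [0, 100]
        rows.append((student_id, round(study_hours, 2), round(grade, 2)))
    ```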
    
  4. Management and teaching teams' questionnaire (data frame)

    • dataverse.csuc.cat
    pdf, txt, xlsx
    Updated Jul 27, 2024
    Cite
    Carme Montserrat; Carme Montserrat; Marta Garcia-Molsosa; Marta Garcia-Molsosa (2024). Management and teaching teams' questionnaire (data frame) [Dataset]. http://doi.org/10.34810/data609
    Explore at:
    Available download formats: pdf (384755), pdf (385007), txt (5183), pdf (385663), xlsx (142753)
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Carme Montserrat; Carme Montserrat; Marta Garcia-Molsosa; Marta Garcia-Molsosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    https://ror.org/03zhx9h04
    Description

    "WeAreHere!" Management and teaching teams' questionnaire. This dataset includes: (1) the WaH management and teaching teams' questionnaire (21 questions including 5-point Likert scale questions, dichotomous questions, multiple choice questions, open questions and an open space for comments). The Catalan version (original), and the Spanish and English versions of the questionnaire can be found in this dataset in pdf format. (2) The data frame in xlsx format, with the management and teaching teams' answers to the questionnaire (a total of 322 answers).

  5. Dataset for pandas data-frame 1.1

    • kaggle.com
    zip
    Updated Jun 16, 2024
    Cite
    _anxious (2024). Dataset for pandas data-frame 1.1 [Dataset]. https://www.kaggle.com/datasets/par7h0/dataset-for-pandas-data-frame-1-1/code
    Explore at:
    Available download formats: zip (763342 bytes)
    Dataset updated
    Jun 16, 2024
    Authors
    _anxious
    License

    CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by _anxious

    Released under CC0: Public Domain

    Contents

  6. Children's questionnaire (data frame)

    • dataverse.csuc.cat
    pdf, txt +2
    Updated Jul 12, 2023
    Cite
    Carme Montserrat; Carme Montserrat; Marta Garcia-Molsosa; Marta Garcia-Molsosa (2023). Children's questionnaire (data frame) [Dataset]. http://doi.org/10.34810/data247
    Explore at:
    Available download formats: pdf (485871), pdf (330192), pdf (331430), xlsx (2484824), pdf (485221), txt (7161), pdf (355715), xlsx (2504364), type/x-r-syntax (1161), pdf (355899), type/x-r-syntax (3928)
    Dataset updated
    Jul 12, 2023
    Dataset provided by
    CORA.Repositori de Dades de Recerca
    Authors
    Carme Montserrat; Carme Montserrat; Marta Garcia-Molsosa; Marta Garcia-Molsosa
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 20, 2021 - Oct 31, 2022
    Dataset funded by
    https://ror.org/03zhx9h04
    Description

    "WeAreHere!" Children's questionnaire. This dataset includes: (1) the WaH children's questionnaire (20 questions including 5-point Likert scale questions, dichotomous questions and an open space for comments). The Catalan version (original), and the Spanish and English versions of the questionnaire can be found in this dataset in pdf format. (2) The data frame in xlsx format, with the children's answers to the questionnaire (a total of 3664 answers) and a reduced version of it for doing the regression (with the 5-point likert scale variable "ask for help" transformed into a dichotomous variable). (3) The data frame in xlsx format, with the children's answers to the questionnaire and the categorization of their comments (sheet 1), the data frame with only the MCA variables selected (sheet 2), and the categories and subcategories table (sheet 3). (4) The data analysis procedure for the regression, the component and multiple component analysis (R script).

  7. WELFake dataset for fake news detection in text data

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Feb 24, 2021
    + more versions
    Cite
    Zenodo (2021). WELFake dataset for fake news detection in text data [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-4561253?locale=es
    Explore at:
    Available download formats: unknown (245086152)
    Dataset updated
    Feb 24, 2021
    Dataset authored and provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles, with 35,028 real and 37,106 fake news items. For this, we merged four popular news datasets (Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training. The dataset contains four columns: Serial number (starting from 0), Title (the news heading), Text (the news content), and Label (0 = fake, 1 = real). There are 78,098 entries in the CSV file, of which only 72,134 are accessible via the data frame. This dataset is part of our ongoing research on "Fake News Prediction on Social Media Website" within the doctoral degree program of Mr. Pawan Kumar Verma, and is partially supported by the ARTICONF project funded by the European Union's Horizon 2020 research and innovation program.
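    A minimal sketch of consuming the four-column schema described above with the standard library (the column names `serial`, `title`, `text`, `label` here are illustrative placeholders, not the dataset's exact CSV header; an in-memory sample stands in for the real file):

    ```python
    import csv
    import io

    # Small in-memory sample mimicking the described schema:
    # serial number, title, text, label (0 = fake, 1 = real)
    sample = io.StringIO(
        "serial,title,text,label\n"
        '0,"Headline A","Body text A",1\n'
        '1,"Headline B","Body text B",0\n'
    )

    rows = list(csv.DictReader(sample))
    n_real = sum(1 for r in rows if r["label"] == "1")
    n_fake = sum(1 for r in rows if r["label"] == "0")
    ```

    For the real file, replace the `io.StringIO` sample with `open(...)` on the downloaded CSV and adjust the column names to the actual header.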

  8. Dataset & code for "Using large language models to address the bottleneck of...

    • figshare.com
    txt
    Updated Nov 17, 2025
    Cite
    Yuyang Xie; xiao feng (2025). Dataset & code for "Using large language models to address the bottleneck of georeferencing natural history collections" [Dataset]. http://doi.org/10.6084/m9.figshare.28904936.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 17, 2025
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Yuyang Xie; xiao feng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets and code used in the paper "Using large language models to address the bottleneck of georeferencing natural history collections".

    1. System requirements: Windows 10; R v4.2.2; Python v3.8.12.
    2. Instructions for use: The "data" folder contains the key sampling and intermediate data from the analysis in this study. The initial specimen dataset of 13,064,051 records from the Global Biodiversity Information Facility (GBIF) can be downloaded via GBIF DOI: https://doi.org/10.15468/dl.fj3sqk.

    Data file names and their meaning or purpose:

    • occurrence_filter_clean.csv: the data before sampling 5,000 records by continent, after cleaning the initial specimen data
    • main data frame 5000_only country state county locality.csv: the 5,000 sample records used for georeferencing, containing only basic information such as country, state/province, county, locality, and the true latitude and longitude from GBIF
    • main data frame 100_only country state county locality.csv: the 100 sub-sample records used for human and reasoning-LLM georeferencing, containing the same basic information
    • main data frame 5000.csv: all output data and required records from the analysis of the 5,000 sample points, including coordinates and error distances from the various georeferencing methods, locality text features, and readability metrics
    • main data frame 100.csv: all output data and required records from the analysis of the 100 sub-sample points, with the same contents
    • georef_errorDis.csv: used for Figure 1b
    • summary_error_time_cost.csv: time taken and cost records for the various georeferencing methods, used for Figure 4
    • for_human_completed.csv: results of manual georeferencing by the participants
    • hf_v2geo.tif: Global Human Footprint Dataset (Geographic) (Version 2.00), from https://gis.earthdata.nasa.gov/portal/home/item.html?id=048c92f5ce50462a86b0837254924151, used for Figure 5a
    • country file folder: global country and county polygon vector data, used to extract centroid coordinates of counties in ArcGIS v10.8

  9. Raw data from datasets used in SIMON analysis

    • data.europa.eu
    unknown
    Updated Jan 27, 2022
    Cite
    Zenodo (2022). Raw data from datasets used in SIMON analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-2580414?locale=hr
    Explore at:
    Available download formats: unknown (312591)
    Dataset updated
    Jan 27, 2022
    Dataset authored and provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON. Each dataset is stored in a separate folder containing 4 files:

    • json_info: the number of features (with their names) and the number of subjects available for the dataset
    • data_testing: data frame with the data used to test the trained model
    • data_training: data frame with the data used to train the models
    • results: direct, unfiltered data from the database

    Files are written in the feather format. An example of the data structure for each file is provided in the repository. Files were compressed using 7-Zip, available at https://www.7-zip.org/.

  10. Data from: Packet-level and IEEE 802.11 MAC frame-level Network Traffic...

    • data.mendeley.com
    • narcis.nl
    Updated Jan 14, 2021
    Cite
    Rajarshi Roy Chowdhury (2021). Packet-level and IEEE 802.11 MAC frame-level Network Traffic Traces Data of the D-Link IoT devices [Dataset]. http://doi.org/10.17632/84cc8grtkt.1
    Explore at:
    Dataset updated
    Jan 14, 2021
    Authors
    Rajarshi Roy Chowdhury
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset presents network traffic traces of 14 D-Link IoT devices of different types, including a camera, a network camera, a smart plug, a door-window sensor, and a home hub. It consists of:

    • Network packet traces (inbound and outbound traffic) and
    • IEEE 802.11 MAC frame traces.
    

    The experimental testbed, including an access point running on a laptop, was set up in the Network Systems and Signal Processing (NSSP) laboratory at Universiti Brunei Darussalam (UBD) to collect all the network traffic traces from 9 September 2020 to 10 January 2021. The traces were captured by passively observing the Ethernet interface and the WiFi interface at the access point.

    The packet traces capture data from the typical communication protocols that IoT devices use on the Internet, such as TCP, UDP, IP, ICMP, ARP, DNS, SSDP and TLS/SSL. The probe request frame traces (a subtype of management frames) record the data that IoT devices use to connect to an access point on the local area network.

    The authors would like to thank the Faculty of Integrated Technologies, Universiti Brunei Darussalam, for the support to conduct this research experiment in the Network Systems and Signal Processing laboratory.

  11. CoSense3D

    • data.uni-hannover.de
    json, zip
    Updated Sep 15, 2025
    Cite
    Institut für Kartographie und Geoinformatik (2025). CoSense3D [Dataset]. https://data.uni-hannover.de/dataset/cosense3d
    Explore at:
    Available download formats: json, zip
    Dataset updated
    Sep 15, 2025
    Dataset authored and provided by
    Institut für Kartographie und Geoinformatik
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This repo provides all datasets, or external links to them, for the CoSense3D project. The related datasets are:

    • COMAP: a synthetic dataset generated by CARLA for cooperative perception.

    • OPV2Vt: a synthetic dataset generated by CARLA with the replay files provided by the OPV2V dataset, for globally time-aligned cooperative object detection (TA-COOD). The original replay files are interpolated to obtain the object and sensor locations at sub-frames; each frame is split into 10 sub-frames for simulation.

    • DairV2Xt: newly generated meta files based on the DAIR-V2X dataset for the CoSense3D project, with localization correction and ground truth generated for TA-COOD.

    • OPV2Va: a synthetic dataset generated by CARLA with the replay files provided by the OPV2V dataset, augmented with semantic labels.

    [!NOTE] If the download speed is very slow, you can also try Baidu Cloud: https://pan.baidu.com/s/12HZ1yk0y84NJfStZADMstA?pwd=hkja (extraction code: hkja).

  12. 802.11 Management frames from a public location

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Apr 24, 2025
    Cite
    Benjamin Vermunicht; Benjamin Vermunicht (2025). 802.11 Management frames from a public location [Dataset]. http://doi.org/10.5281/zenodo.8003772
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Benjamin Vermunicht; Benjamin Vermunicht
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    About

    The following datasets were captured at a busy Belgian train station between 9 pm and 10 pm; they contain all 802.11 management frames that were captured. The two datasets were captured approximately 20 minutes apart.

    Both datasets are represented by a pcap and CSV file. The CSV file contains the frame type, timestamps, signal strength, SSID and MAC addresses for every frame. In the pcap file, all generic 802.11 elements were removed for anonymization purposes.

    Anonymization

    All frames were anonymized by removing identifying information or renaming identifiers. Concretely, the following transformations were applied to both datasets:

    • All MAC addresses were renamed (e.g. 00:00:00:00:00:01)
    • All SSIDs were renamed (e.g. NETWORK_1)
    • All generic 802.11 elements were removed from the pcap

    In the pcap file, anonymization actions could lead to "corrupted" frames because length tags do not correspond with the actual data. However, the file and its frames are still readable in packet analyzing tools such as Wireshark or Scapy.

    The script which was used to anonymize is available in the dataset.
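    The renaming scheme described above can be sketched as a simple mapping that assigns a stable pseudonym to each distinct identifier in order of first appearance (an illustration of the idea, not the dataset's actual `anonymization.py`):

    ```python
    def make_renamer(fmt):
        """Return a function mapping each distinct value to a stable pseudonym."""
        seen = {}
        def rename(value):
            if value not in seen:
                seen[value] = fmt.format(len(seen) + 1)  # 1-based pseudonym index
            return seen[value]
        return rename

    rename_mac = make_renamer("00:00:00:00:00:{:02x}")
    rename_ssid = make_renamer("NETWORK_{}")

    m1 = rename_mac("aa:bb:cc:dd:ee:ff")  # first MAC seen -> 00:00:00:00:00:01
    m2 = rename_mac("aa:bb:cc:dd:ee:ff")  # same input, same pseudonym
    s1 = rename_ssid("CoffeeShopWiFi")    # first SSID seen -> NETWORK_1
    ```

    A stable mapping like this preserves which frames came from the same device or network while removing the real identifiers.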

    Data

    Specifications for the two datasets:

                                     Dataset 1    Dataset 2
    Frames                           36306        60984
    Beacon frames                    19693        27983
    Request frames                   798          1580
    Response frames                  15815        31421
    Identified Wi-Fi networks        54           70
    Identified MAC addresses         2092         2705
    Identified wireless devices      128          186
    Capture time                     480 s        422 s

    Dataset contents

    The two datasets are stored in the directories `1/` and `2/`. Each directory contains:

    • `capture-X.pcap`: an anonymized version of the original capture
    • `capture-X.csv`: content of each captured frame (timestamp, MAC address...) saved as a CSV file

    `anonymization.py` is the script which was used to remove identifiers.

    `README.md` contains the documentation for the datasets.

    License

    Copyright 2022-2023 Benjamin Vermunicht, Beat Signer, Maxim Van de Wynckel, Vrije Universiteit Brussel

    Permission is hereby granted, free of charge, to any person obtaining a copy of this dataset and associated documentation files (the “Dataset”), to deal in the Dataset without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Dataset, and to permit persons to whom the Dataset is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions that make use of the Dataset.

    THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.

  13. VO1/VO2 MARS VISUAL IMAGING SUBSYSTEM EXPERIMENT DATA RECORD

    • data.nasa.gov
    • datasets.ai
    • +4 more
    Updated Mar 31, 2025
    Cite
    nasa.gov (2025). VO1/VO2 MARS VISUAL IMAGING SUBSYSTEM EXPERIMENT DATA RECORD [Dataset]. https://data.nasa.gov/dataset/vo1-vo2-mars-visual-imaging-subsystem-experiment-data-record-67f5e
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA: http://nasa.gov/
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    After the digital data were transmitted to Earth and received at the Jet Propulsion Laboratory, they were subject to a variety of processes to produce the final digital tapes and photoproducts. The first step was to strip out all the non-video data and produce a System Data Record (SDR). This was compiled into video format, and an Experiment Data Record (EDR) was produced. The EDR data consist of unprocessed (raw) instrument data. Substantial processing is required to reconstruct each image owing to the unique manner in which data were transmitted to earth. Images were initially recorded on 7-track magnetic tape recorders on the spacecraft. Each raw data frame retrieved from the tracking station thus contains every seventh pixel arranged in either increasing or decreasing order. Image data reconstructed from these raw data frames by the Mission Test Imaging System (MTIS) form the EDR digital archive tape.
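    The seven-track layout described above can be illustrated with a toy de-interleaver: if each raw frame holds every seventh pixel, a line of image data is rebuilt by interleaving the seven frames (a sketch of the pixel layout only, not the actual MTIS reconstruction, which also handles the decreasing-order case and other processing):

    ```python
    def deinterleave(frames):
        """Rebuild a pixel line from 7 frames, each holding every 7th pixel."""
        n = sum(len(f) for f in frames)
        line = [None] * n
        for offset, frame in enumerate(frames):
            for i, px in enumerate(frame):
                line[offset + 7 * i] = px  # frame k supplies pixels k, k+7, ...
        return line

    # 14 pixels split across 7 frames: frame k holds pixels k and k+7
    frames = [[k, k + 7] for k in range(7)]
    line = deinterleave(frames)  # -> [0, 1, 2, ..., 13]
    ```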

  14. The CORESIDENCE Database: National and Subnational Data on Household and...

    • data.europa.eu
    • zenodo.org
    unknown
    Cite
    Zenodo, The CORESIDENCE Database: National and Subnational Data on Household and Living Arrangements Around the World, 1964-2021 [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8142652?locale=hu
    Explore at:
    Available download formats: unknown (18275)
    Dataset authored and provided by
    Zenodo: http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Households are the fundamental units of co-residence and play a crucial role in social and economic reproduction worldwide. They are also widely used as units of enumeration for data collection purposes, with substantive implications for research on poverty, living conditions, family structure, and gender dynamics. However, reliable comparative data on households and living arrangements around the world, and on how they change, is still under development. The CORESIDENCE database (CoDB) aims to bridge this data gap by offering valuable insights not only into the documented disparities between countries but also into the often-elusive regional differences within countries. By providing comprehensive data, it sheds light on both macro-level variations across nations and micro-level variations within specific regions, enabling more nuanced analyses and evidence-based policymaking. The CoDB is composed of three datasets covering 155 countries (National Dataset), 3563 regions (Subnational Dataset), and 1511 harmonized regions (Subnational-Harmonized Dataset) for the period 1960 to 2021, and it provides 146 indicators on household composition and family arrangements across the world.

    This repository contains an RData file named CORESIDENDE_DATABASE holding the CoDB as a list. The CORESIDENDE_DB list object is composed of six elements:

    • NATIONAL: a data frame with the household composition and living arrangements indicators at the national level.
    • SUBNATIONAL: a data frame with the indicators at the subnational level, computed over the original subnational division provided in each sample and data source.
    • SUBNATIONAL_HARMONIZED: a data frame with the indicators computed over the harmonized subnational regions.
    • SUBNATIONAL_BOUNDARIES_CORESIDENCE: a spatial data frame (an sf object) with the boundary delimitation of the subnational harmonized regions created for this project.
    • CODEBOOK: a data frame with the complete list of indicators, their code names and descriptions.
    • HARMONIZATION_TABLE: a data frame with the full list of individual country-year samples employed in this project and their state of inclusion in the 3 datasets composing the CoDB.

    Elements 1, 2, 3, 5 and 6 of the R list are also provided as csv files under the same names. Element 4, the harmonized boundaries, is available as a gpkg (GeoPackage) file.

  15. Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset...

    • figshare.com
    application/csv
    Updated Sep 24, 2025
    Cite
    Steve Nwaiwu (2025). Aggregated Fake News Corpus for X-FRAME: Preprocessed Multi-Domain Dataset for Explainable Misinformation Detection [Dataset]. http://doi.org/10.6084/m9.figshare.29539820.v2
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Sep 24, 2025
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Steve Nwaiwu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is associated with the research article "Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection". This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with emphasis on explainability, cross-modal generalization, and robust performance.

    Dataset Contents

    • Aggregated Raw Corpus (aggregated_raw.csv): 286,260 samples across 8 datasets. Binary labels (1 = Fake, 0 = Real). Includes metadata: source dataset, topic (if available), speaker/source, etc.
    • Preprocessed Text Corpus (aggregated_cleaned.csv): includes a standardized and cleaned cleaned_text column. Text normalization applied using SpaCy (lowercasing, lemmatization, punctuation/URL/user removal).
    • Fully Encoded Feature Matrix (xframe_features_encoded.csv): 104 structured features derived from communication theory and media psychology. Includes source encoding, speaker credibility, social engagement, sentiment, subjectivity, sensationalism, and readability scores. All numerical features scaled to [0, 1]; categorical features one-hot encoded.
    • Data Splits (train.csv, val.csv, test.csv): stratified splits of the cleaned and encoded data.
    • Feature Metadata (feature_description.pdf): documentation of all 104 features with descriptions, data sources, and rationales.

    Preprocessing Overview

    To ensure robust and generalizable modeling, the following standardized pipeline was applied:

    • Text Preprocessing: cleaned using SpaCy, lowercased, lemmatized, and stripped of stopwords, URLs, and usernames.
    • Label Mapping: datasets with multiclass labels (e.g. LIAR, FNC-1) were mapped to a unified binary schema using theory-informed rules. 1 = Fake includes false, pants-on-fire, disagree, etc.; 0 = Real includes true, agree, mostly-true.
    • Deduplication: removed near-duplicate entries across datasets using fuzzy string matching and content hashing.
    • Feature Engineering: source credibility features (e.g. speaker credibility from LIAR), social context (e.g. tweet volume, user mentions), and framing indicators (e.g. sentiment, subjectivity, sensationalism, readability).
    • Feature Encoding: one-hot encoding for categorical attributes, Min-Max scaling for numerical features.

    Original Data Sources

    This aggregated corpus was derived from the following datasets. Please cite them individually alongside this collection:

    • LIAR (Wang, 2017): https://doi.org/10.18653/v1/P17-2067
    • FakeNewsNet (PolitiFact, BuzzFeed, GossipCop) (Shu et al.): https://doi.org/10.1145/3363574
    • ISOT (Ahmed et al.): https://doi.org/10.48550/arXiv.1708.07104
    • WELFake (Verma et al.): https://doi.org/10.1109/TCSS.2021.3068519
    • FNC-1: https://www.fakenewschallenge.org/
    • FakeNewsAMT (Pérez-Rosas et al.): https://doi.org/10.18653/v1/C18-1287
    • Celebrity Rumors (Horne & Adalı): https://doi.org/10.1609/icwsm.v11i1.15015
    • PHEME (Zubiaga et al.): https://doi.org/10.6084/m9.figshare.4010619.v1

    How to Cite This Dataset

    Nwaiwu, S.; Jongsawat, N.; Tungkasthan, A. Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection. Appl. Sci. 2025, 15, 9498. https://doi.org/10.3390/app15179498
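    The Min-Max scaling step mentioned in the preprocessing pipeline maps each numerical feature to [0, 1]; a minimal sketch of that transformation:

    ```python
    def min_max_scale(values):
        """Scale a list of numbers to [0, 1]; constant columns map to 0.0."""
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0.0] * len(values)  # avoid division by zero
        return [(v - lo) / (hi - lo) for v in values]

    scaled = min_max_scale([10, 15, 20])  # -> [0.0, 0.5, 1.0]
    ```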

  16. HA4M - Human Action Multi-Modal Monitoring in Manufacturing

    • scidb.cn
    • resodate.org
    Updated Jul 6, 2022
    Cite
    Roberto Marani; Laura Romeo; Grazia Cicirelli; Tiziana D'Orazio (2022). HA4M - Human Action Multi-Modal Monitoring in Manufacturing [Dataset]. http://doi.org/10.57760/sciencedb.01872
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 6, 2022
    Dataset provided by
    Science Data Bank
    Authors
    Roberto Marani; Laura Romeo; Grazia Cicirelli; Tiziana D'Orazio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    The HA4M dataset is a collection of multi-modal data relative to actions performed by different subjects in an assembly scenario for manufacturing. It has been collected to provide a good test-bed for developing, validating and testing techniques and methodologies for the recognition of assembly actions. To the best of the authors' knowledge, few vision-based datasets exist in the context of object assembly. The HA4M dataset provides a considerable variety of multi-modal data compared to existing datasets. Six types of simultaneous data are supplied: RGB frames, Depth maps, IR frames, RGB-Depth-Aligned frames, Point Clouds and Skeleton data. These data allow the scientific community to make consistent comparisons among processing or machine learning approaches using one or more data modalities. Researchers in computer vision, pattern recognition and machine learning can use/reuse the data for investigations in application domains such as motion analysis, human-robot cooperation, action recognition, and so on.

    Dataset details

    The dataset includes 12 assembly actions performed by 41 subjects for building an Epicyclic Gear Train (EGT). The assembly task involves three phases: first, Block 1 and Block 2 are assembled separately, and then both blocks are set up together to build the EGT. The EGT is made up of a total of 12 components divided into two sets: the first eight components build Block 1 and the remaining four build Block 2. Finally, two screws are fixed with an Allen key to assemble the two blocks and thus obtain the EGT.

    Acquisition setup

    The acquisition experiment took place in two laboratories (one in Italy and one in Spain), where an acquisition area was reserved for the experimental setup. A Microsoft Azure Kinect camera acquires videos during the execution of the assembly task. It is placed in front of the operator and the table where the components are spread out, on a tripod at a height of 1.54 m and a distance of 1.78 m, down-tilted by an angle of 17 degrees.

    Technical information

    The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects (15 females and 26 males), aged from 23 to 60. All subjects participated voluntarily and were provided with a written description of the experiment. Each subject was asked to execute the task several times and to perform the actions at their own convenience (e.g. with both hands), independently of their dominant hand. HA4M is a growing project, so new acquisitions, planned for the near future, will expand the current dataset.

    Actions

    Twelve actions are considered in HA4M. Actions 1 to 4 build Block 1, actions 5 to 8 build Block 2, and actions 9 to 12 complete the EGT:

    1. Pick up/Place Carrier
    2. Pick up/Place Gear Bearings (x3)
    3. Pick up/Place Planet Gears (x3)
    4. Pick up/Place Carrier Shaft
    5. Pick up/Place Sun Shaft
    6. Pick up/Place Sun Gear
    7. Pick up/Place Sun Gear Bearing
    8. Pick up/Place Ring Bear
    9. Pick up Block 2 and place it on Block 1
    10. Pick up/Place Cover
    11. Pick up/Place Screws (x2)
    12. Pick up/Place Allen Key, Turn Screws, Return Allen Key and EGT

    Annotation

    Data annotation concerns the labeling of the different actions in the video sequences. The annotation has been done manually by observing the RGB videos frame by frame. The start frame of each action is identified when the subject starts to move the arm towards the component to be grasped. The end frame is recorded when the subject releases the component, so the next frame becomes the start frame of the subsequent action. The total number of annotated actions is 4123, including the "don't care" action (ID=0) and the repetitions of actions 2, 3 and 11.

    Available code

    The dataset has been acquired using the Multiple Azure Kinect GUI software, available at https://gitlab.com/roberto.marani/multiple-azure-kinect-gui, based on the Azure Kinect Sensor SDK v1.4.1 and Azure Kinect Body Tracking SDK v1.1.2. The software records device data to a Matroska (.mkv) file containing video tracks, IMU samples, and device calibration; IMU samples are not considered in this work. The same software processes the Matroska file and returns the different types of data provided with the dataset: RGB images, RGB-Depth-Aligned (RGB-A) images, Depth images, IR images, Point Clouds and Skeleton data.
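    Interval annotations of this kind (start frame, end frame, action ID) are commonly expanded into per-frame labels before training a recognition model. A minimal sketch, assuming a simple tuple representation rather than the dataset's actual annotation file format:

```python
def frames_to_labels(annotations, n_frames, dont_care=0):
    """Expand (start_frame, end_frame, action_id) intervals into a
    per-frame label list; unannotated frames keep the 'don't care' ID."""
    labels = [dont_care] * n_frames
    for start, end, action_id in annotations:
        for frame in range(start, end + 1):
            labels[frame] = action_id
    return labels

# Consecutive actions: the frame after one action's end frame
# becomes the start frame of the next action.
labels = frames_to_labels([(0, 4, 1), (5, 9, 2)], n_frames=12)
```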

  17. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anonymous authors; Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    csvAvailable download formats
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Anonymous authors; Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.

    Each competition has a text description and metadata reflecting the characteristics of the competition and its dataset, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.

    The code blocks and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code-block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2,743,615 ML code blocks collected from 107,524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12,000 labeled snippets (markup_data_20220415.csv).

    As the marked-up code block data contains the numeric id of each block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
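    Since each table is a flat .csv, exploring the corpus with pandas is straightforward. A minimal sketch on a toy stand-in table (the column names below are illustrative assumptions, not the corpus's actual headers):

```python
from io import StringIO
import pandas as pd

# Toy stand-in for a Code4ML code-blocks table.
# NOTE: column names are illustrative; check the real .csv headers.
toy_csv = StringIO(
    "code_block,competition,semantic_type_id\n"
    '"import pandas as pd",titanic,1\n'
    '"df = df.fillna(0)",titanic,7\n'
)
blocks = pd.read_csv(toy_csv)

# Count snippets per semantic-type id, as one might before joining
# against the semantic-type mapping table.
counts = blocks.groupby("semantic_type_id").size()
```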

  18. Data frame and script

    • figshare.com
    application/csv
    Updated Dec 13, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Linda Kashikar; Timo Lüke; Michael Grosche (2024). Data frame and script [Dataset]. http://doi.org/10.6084/m9.figshare.25334125.v1
    Explore at:
    application/csvAvailable download formats
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    figshare
    Figsharehttp://figshare.com/
    Authors
    Linda Kashikar; Timo Lüke; Michael Grosche
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data frame and analysis script

  19. SAPFLUXNET: A global database of sap flow measurements

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1more
    Updated Sep 26, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Rafael Poyatos; Víctor Granda; Víctor Flo; Roberto Molowny-Horas; Kathy Steppe; Maurizio Mencuccini; Jordi Martínez-Vilalta (2020). SAPFLUXNET: A global database of sap flow measurements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2530797
    Explore at:
    Dataset updated
    Sep 26, 2020
    Dataset provided by
    ICREA/CREAF
    CREAF
    CREAF/Universitat Autònoma de Barcelona
    Ghent University
    Authors
    Rafael Poyatos; Víctor Granda; Víctor Flo; Roberto Molowny-Horas; Kathy Steppe; Maurizio Mencuccini; Jordi Martínez-Vilalta
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General description

    SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels. SAPFLUXNET is a harmonised database, compiled from contributions from researchers worldwide.

    The SAPFLUXNET version 0.1.5 database harbours 202 globally distributed datasets, from 121 geographical locations. SAPFLUXNET contains sap flow data for 2714 individual plants (1584 angiosperms and 1130 gymnosperms), belonging to 174 species (141 angiosperms and 33 gymnosperms), 95 different genera and 45 different families. More information on the database coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.

    The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researcher's Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00) and an Alexander von Humboldt Research Fellowship for Experienced Researchers.

    Changelog

    Compared to version 0.1.4, this version includes some changes in the metadata, but all time series data (sap flow, environmental) remain the same.

    For all datasets, climate metadata (temperature and precipitation, ‘si_mat’ and ‘si_map’) have been extracted from CHELSA (https://chelsa-climate.org/), replacing the previous climate data obtained with Wordclim. This change has modified the biome classification of the datasets in ‘si_biome’.

    In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) is now assigned a value of 0 if species are in the understorey. This affects two datasets: AUS_MAR_UBD and AUS_MAR_UBW, where, previously, the sum of species basal area percentages could add up to more than 100%.

    In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) has been corrected for datasets USA_SIL_OAK_POS, USA_SIL_OAK_1PR, USA_SIL_OAK_2PR.

    In ‘site’ metadata, the vegetation type (‘si_igbp’) has been changed to SAV for datasets CHN_ARG_GWD and CHN_ARG_GWS.

    Variables and units

    SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.

    Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. The description and units of all data and metadata variables can be found here: Metadata and data units.

    To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:

    remotes::install_github(
      'sapfluxnet/sapfluxnetr',
      build_opts = c("--no-resave-data", "--no-manual", "--build-vignettes")
    )
    library(sapfluxnetr)

    # list all vignettes
    vignette(package = 'sapfluxnetr')

    # variables and units
    vignette('metadata-and-data-units', package = 'sapfluxnetr')

    # data flags
    vignette('data-flags', package = 'sapfluxnetr')

    Data formats

    SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.

    Working with sfn_data files

    To work with SAPFLUXNET data, first they have to be downloaded from Zenodo, maintaining the folder structure. A first level in the folder hierarchy corresponds to file format, either RData files or csv's. A second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in a sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of csv files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).
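    Navigating that hierarchy programmatically is just path composition. A minimal Python sketch (the folder names "csv" and "plant" are illustrative assumptions; use whatever the downloaded archive actually contains):

```python
from pathlib import Path

def dataset_dir(root, file_format, units, dataset):
    """Compose the path to one dataset following the described hierarchy:
    file format (e.g. RData or csv) / sap flow expression (per plant,
    sapwood area or leaf area) / dataset code."""
    return Path(root) / file_format / units / dataset

# Hypothetical example: the csv files of dataset AUS_MAR_UBD,
# with sap flow expressed per plant.
path = dataset_dir("SAPFLUXNET", "csv", "plant", "AUS_MAR_UBD")
```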

    After downloading the entire database, the sapfluxnetr package can be used to: - Work with data from a single site: data access, plotting and time aggregation. - Select the subset datasets to work with. - Work with data from multiple sites: data access, plotting and time aggregation.

    Please check the following package vignettes to learn more about how to work with sfn_data files:

    Quick guide

    Metadata and data units

    sfn_data classes

    Custom aggregation

    Memory and parallelization

    Working with text files

    We recommend working with sfn_data objects using R and the sapfluxnetr package; we do not currently provide code to work with text files.

    Data issues and reporting

    Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.

    Temporary data fixes, detected but not yet included in released versions, will be published on the SAPFLUXNET main web page ('Known data errors').

    Data access, use and citation

    This version of the SAPFLUXNET database is open access and corresponds to the data paper submitted to Earth System Science Data in August 2020.

    When using SAPFLUXNET data in an academic work, please cite the data paper, when available, or alternatively, the Zenodo dataset (see the ‘Cite as’ section on the right panels of this web page).

  20. R codes and dataset for Visualisation of Diachronic Constructional Change...

    • bridges.monash.edu
    • researchdata.edu.au
    zip
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gede Primahadi Wijaya Rajeg (2023). R codes and dataset for Visualisation of Diachronic Constructional Change using Motion Chart [Dataset]. http://doi.org/10.26180/5c844c7a81768
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Monash University
    Authors
    Gede Primahadi Wijaya Rajeg
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Publication

    Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.). Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387

    Description of R codes and data files in the repository

    This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).

    The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).

    These two input files are used in the R script 1-script-create-input-data-raw.r, which preprocesses and combines them into a long-format data frame with the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); it is available as input_data_raw.txt. The script 2-script-create-motion-chart-input-data.R then processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output from the second script is input_data_futurate.txt.

    Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R) and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).

    The repository adopts the project-oriented workflow in RStudio; double-click on the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
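    The per-million-words normalisation performed by the second script reduces to a simple proportion. A Python sketch of the arithmetic (the repository's actual implementation is in R; function and variable names here are illustrative):

```python
def per_million(freq, corpus_size):
    """Normalise a raw co-occurrence frequency to a rate per million words."""
    return freq / corpus_size * 1_000_000

# e.g. 150 tokens of a collocate in a 25-million-word decade subcorpus
rate = per_million(150, 25_000_000)  # 6.0 per million words
```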
