Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Example DataFrame (Teeny-Tiny Castle)
This dataset is part of a tutorial tied to the Teeny-Tiny Castle, an open-source repository containing educational tools for AI Ethics and Safety research.
How to Use
from datasets import load_dataset

# Load the training split of the example data frame from the Hugging Face Hub
dataset = load_dataset("AiresPucrs/example-data-frame", split="train")
This dataset was created by Leonie
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.
The dataset was generated with the following R script.
# Set seed for reproducibility
set.seed(42)
# Define number of observations (students)
n <- 5000
# Generate study hours (independent variable)
# Uniform distribution between 0 and 12 hours
study_hours <- runif(n, min = 0, max = 12)
# Create relationship between study hours and grade
# Base grade: 40 points
# Each study hour adds an average of 5 points
# Add normal noise (standard deviation = 10)
theoretical_grade <- 40 + 5 * study_hours
# Add normal noise to make it realistic
noise <- rnorm(n, mean = 0, sd = 10)
# Calculate final grade
grade <- theoretical_grade + noise
# Limit grades between 0 and 100
grade <- pmin(pmax(grade, 0), 100)
# Create the dataframe
dataset <- data.frame(
student_id = 1:n,
study_hours = round(study_hours, 2),
grade = round(grade, 2)
)
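A quick sanity check on the simulated data: because the generating process above uses an intercept of 40, a slope of 5 and Gaussian noise (sd = 10), an ordinary linear regression should approximately recover these values (the clamping to [0, 100] attenuates them slightly). A minimal sketch using the dataset created above:
# Fit grade ~ study_hours and inspect the estimated intercept and slope;
# they should be close to the true values of 40 and 5.
model <- lm(grade ~ study_hours, data = dataset)
summary(model)$coefficients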
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
"WeAreHere!" Management and teaching teams' questionnaire. This dataset includes: (1) the WaH management and teaching teams' questionnaire (21 questions including 5-point Likert scale questions, dichotomous questions, multiple choice questions, open questions and an open space for comments). The Catalan version (original), and the Spanish and English versions of the questionnaire can be found in this dataset in pdf format. (2) The data frame in xlsx format, with the management and teaching teams' answers to the questionnaire (a total of 322 answers).
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by _anxious
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
"WeAreHere!" Children's questionnaire. This dataset includes: (1) the WaH children's questionnaire (20 questions including 5-point Likert scale questions, dichotomous questions and an open space for comments). The Catalan version (original), and the Spanish and English versions of the questionnaire can be found in this dataset in pdf format. (2) The data frame in xlsx format, with the children's answers to the questionnaire (a total of 3664 answers) and a reduced version of it for doing the regression (with the 5-point likert scale variable "ask for help" transformed into a dichotomous variable). (3) The data frame in xlsx format, with the children's answers to the questionnaire and the categorization of their comments (sheet 1), the data frame with only the MCA variables selected (sheet 2), and the categories and subcategories table (sheet 3). (4) The data analysis procedure for the regression, the component and multiple component analysis (R script).
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles, with 35,028 real and 37,106 fake news items. For this, we merged four popular news datasets (Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training. The dataset contains four columns: Serial number (starting from 0), Title (the news heading), Text (the news content), and Label (0 = fake, 1 = real). The csv file holds 78,098 data entries, of which only 72,134 are accessible via the data frame. This dataset is part of our ongoing research on "Fake News Prediction on Social Media Website" within the doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union's Horizon 2020 research and innovation programme.
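A minimal sketch for loading the csv in R and checking the class balance (the file name and column capitalisation are assumptions; the columns are as described above):
# File name and column name are assumptions; adjust to the downloaded file.
welfake <- read.csv("WELFake_Dataset.csv", stringsAsFactors = FALSE)
nrow(welfake)            # number of entries accessible via the data frame
table(welfake$label)     # 0 = fake, 1 = real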
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Datasets and code used in the paper "Using large language models to address the bottleneck of georeferencing natural history collections".
1. System requirements: Windows 10; R: v4.2.2; Python: v3.8.12.
2. Instructions for use: the "data" folder contains the key sampling and intermediate data from the analysis process of this study. The initial specimen dataset, comprising a total of 13,064,051 records from the Global Biodiversity Information Facility (GBIF), can be downloaded from GBIF, DOI: https://doi.org/10.15468/dl.fj3sqk.
Data file names and their meaning or purpose:
occurrence_filter_clean.csv: the data before sampling 5,000 records based on continents, after cleaning the initial specimen data
main data frame 5000_only country state county locality.csv: the 5,000 sample records used for georeferencing, containing only basic information such as country, state/province, county, locality, and the true latitude and longitude from GBIF
main data frame 100_only country state county locality.csv: the 100 sub-sample records used for human and reasoning-LLM georeferencing, containing only basic information such as country, state/province, county, locality, and the true latitude and longitude from GBIF
main data frame 5000.csv: all output data and required records from the analysis of the 5,000 sample points, including coordinates and error distances from the various georeferencing methods, locality text features, and readability metrics
main data frame 100.csv: all output data and required records from the analysis of the 100 sub-sample points, including coordinates and error distances from the various georeferencing methods, locality text features, and readability metrics
georef_errorDis.csv: used for Figure 1b
summary_error_time_cost.csv: time taken and cost records for the various georeferencing methods, used for Figure 4
for_human_completed.csv: results of manual georeferencing by the participants
hf_v2geo.tif: Global Human Footprint Dataset (Geographic) (Version 2.00), from https://gis.earthdata.nasa.gov/portal/home/item.html?id=048c92f5ce50462a86b0837254924151, used for Figure 5a
country file folder: global country and county polygon vector data, used to extract centroid coordinates of counties in ArcGIS v10.8
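The error distances recorded in main data frame 5000.csv are, in essence, great-circle distances between each method's predicted coordinates and the true GBIF coordinates. A minimal sketch of that computation (the column names are illustrative, not the actual headers):
library(geosphere)   # distHaversine() returns great-circle distances in metres

df <- read.csv("main data frame 5000.csv", check.names = FALSE)
# Illustrative column names for one georeferencing method.
error_km <- distHaversine(
  cbind(df$pred_longitude, df$pred_latitude),
  cbind(df$true_longitude, df$true_latitude)
) / 1000
summary(error_km)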
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Here you can find the raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON. Each dataset is stored in a separate folder containing 4 files:
json_info: the number of features (with their names) and the number of subjects available for the dataset
data_testing: data frame with the data used to test the trained model
data_training: data frame with the data used to train models
results: direct, unfiltered data from the database
Files are written in feather format, and an example of the data structure for each file in the repository is provided. Files were compressed using 7-Zip, available at https://www.7-zip.org/.
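Since the files are written in feather format, they can be read into R data frames, for example with the arrow package (the folder and file names below follow the layout described above but are illustrative):
library(arrow)   # provides read_feather()

# Illustrative paths: one dataset folder with its training and testing frames.
data_training <- read_feather("dataset_01/data_training")
data_testing  <- read_feather("dataset_01/data_testing")
dim(data_training)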
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset presents network traffic traces of 14 D-Link IoT devices of different types, including camera, network camera, smart plug, door-window sensor, and home hub. It consists of:
• Network packet traces (inbound and outbound traffic) and
• IEEE 802.11 MAC frame traces.
The experimental testbed, which included an access point running on a laptop, was set up in the Network Systems and Signal Processing (NSSP) laboratory at Universiti Brunei Darussalam (UBD) to collect all the network traffic traces from 9 September 2020 to 10 January 2021. The network traffic traces were captured by passively observing the Ethernet interface and the WiFi interface at the access point.
The packet traces capture data from the typical communication protocols that IoT devices use to communicate on the Internet, such as TCP, UDP, IP, ICMP, ARP, DNS, SSDP and TLS/SSL. The probe request frame traces (a subtype of management frames) record the data that IoT devices use to connect to the access point on the local area network.
The authors would like to thank the Faculty of Integrated Technologies, Universiti Brunei Darussalam, for the support to conduct this research experiment in the Network Systems and Signal Processing laboratory.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This repo provides all datasets, or external links to them, for the CoSense3D project. The related datasets are:
COMAP: A synthetic dataset generated by CARLA for cooperative perception.
OPV2Vt: A synthetic dataset generated by CARLA with the replay files provided by the OPV2V dataset, for globally time-aligned cooperative object detection (TA-COOD). The original replay files are interpolated to obtain the object and sensor locations at sub-frames; each frame is split into 10 sub-frames for simulation.
DairV2Xt: Newly generated meta files based on the DAIR-V2X dataset for the CoSense3D project, with localization corrections and ground truth generated for TA-COOD.
OPV2Va: A synthetic dataset generated by CARLA with the replay files provided by the OPV2V dataset, augmented with semantic labels.
[!NOTE] If the download speed is very slow, you can also try Baidu Cloud: https://pan.baidu.com/s/12HZ1yk0y84NJfStZADMstA?pwd=hkja (extraction code: hkja).
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
About
The following datasets were captured at a busy Belgian train station between 9 pm and 10 pm; they contain all 802.11 management frames that were observed. The two datasets were captured approximately 20 minutes apart.
Both datasets are represented by a pcap and a CSV file. The CSV file contains the frame type, timestamp, signal strength, SSID and MAC addresses for every frame. In the pcap file, all generic 802.11 elements were removed for anonymization purposes.
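A minimal sketch for tabulating the frames in one of the CSV files with R (the file path and column names are assumptions; check the README for the actual headers):
# File path and column names ("frame_type", "mac_address") are assumptions.
frames <- read.csv("1/capture.csv", stringsAsFactors = FALSE)
table(frames$frame_type)               # beacon / request / response counts
length(unique(frames$mac_address))     # identified MAC addresses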
Anonymization
All frames were anonymized by removing identifying information or renaming identifiers. Concretely, the following transformations were applied to both datasets:
In the pcap file, anonymization actions could lead to "corrupted" frames because length tags do not correspond with the actual data. However, the file and its frames are still readable in packet analyzing tools such as Wireshark or Scapy.
The script which was used to anonymize is available in the dataset.
Data
| No. | Dataset 1 | Dataset 2 |
|---|---|---|
| Frames | 36306 | 60984 |
| Beacon frames | 19693 | 27983 |
| Request frames | 798 | 1580 |
| Response frames | 15815 | 31421 |
| Identified Wi-Fi Networks | 54 | 70 |
| Identified MAC addresses | 2092 | 2705 |
| Identified Wireless devices | 128 | 186 |
| Capture time | 480 s | 422 s |
Dataset contents
The two datasets are stored in the directories `1/` and `2/`. Each directory contains:
`anonymization.py` is the script which was used to remove identifiers.
`README.md` contains the documentation about the datasets
License
Copyright 2022-2023 Benjamin Vermunicht, Beat Signer, Maxim Van de Wynckel, Vrije Universiteit Brussel
Permission is hereby granted, free of charge, to any person obtaining a copy of this dataset and associated documentation files (the “Dataset”), to deal in the Dataset without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Dataset, and to permit persons to whom the Dataset is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions that make use of the Dataset.
THE DATASET IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE DATASET.
U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
After the digital data were transmitted to Earth and received at the Jet Propulsion Laboratory, they were subjected to a variety of processes to produce the final digital tapes and photoproducts. The first step was to strip out all the non-video data and produce a System Data Record (SDR). This was compiled into video format, and an Experiment Data Record (EDR) was produced. The EDR data consist of unprocessed (raw) instrument data. Substantial processing is required to reconstruct each image owing to the unique manner in which the data were transmitted to Earth. Images were initially recorded on 7-track magnetic tape recorders on the spacecraft. Each raw data frame retrieved from the tracking station thus contains every seventh pixel, arranged in either increasing or decreasing order. Image data reconstructed from these raw data frames by the Mission Test Imaging System (MTIS) form the EDR digital archive tape.
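As a purely illustrative sketch (not the MTIS code), the reconstruction amounts to interleaving seven raw data frames, each of which carries every seventh pixel of a line:
# Illustrative de-interleaving: frame k holds pixels k, k+7, k+14, ... of a line.
line_length <- 7 * 100                                                # hypothetical line length
raw_frames  <- lapply(1:7, function(k) seq(k, line_length, by = 7))   # stand-in data
line <- numeric(line_length)
for (k in 1:7) {
  line[seq(k, line_length, by = 7)] <- raw_frames[[k]]
}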
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Households are the fundamental units of co-residence and play a crucial role in social and economic reproduction worldwide. They are also widely used as units of enumeration for data collection purposes, with substantive implications for research on poverty, living conditions, family structure, and gender dynamics. However, reliable comparative data on households and on changes in living arrangements around the world are still under development. The CORESIDENCE database (CoDB) aims to bridge this data gap by offering valuable insights not only into the documented disparities between countries but also into the often-elusive regional differences within countries. By providing comprehensive data, it facilitates a deeper understanding of the complex dynamics of co-residence around the world. The database is a significant contribution to research, as it sheds light on both macro-level variations across nations and micro-level variations within specific regions, facilitating more nuanced analyses and evidence-based policymaking. The CoDB is composed of three datasets covering 155 countries (National Dataset), 3563 regions (Subnational Dataset), and 1511 harmonized regions (Subnational-Harmonized Dataset) for the period 1960 to 2021, and it provides 146 indicators on household composition and family arrangements across the world.
This repository is composed of an RData file named CORESIDENDE_DATABASE containing the CoDB in the form of a list. The CORESIDENDE_DB list object is composed of six elements:
NATIONAL: a data frame with the household composition and living arrangements indicators at the national level.
SUBNATIONAL: a data frame with the household composition and living arrangements indicators at the subnational level, computed over the original subnational division provided in each sample and data source.
SUBNATIONAL_HARMONIZED: a data frame with the household composition and living arrangements indicators computed over the harmonized subnational regions.
SUBNATIONAL_BOUNDARIES_CORESIDENCE: a spatial data frame (an sf object) with the boundary delimitation of the subnational harmonized regions created for this project.
CODEBOOK: a data frame with the complete list of indicators, their code names and descriptions.
HARMONIZATION_TABLE: a data frame with the full list of individual country-year samples employed in this project and their state of inclusion in the three datasets composing the CoDB.
Elements 1, 2, 3, 5 and 6 of the R list are also provided as csv files under the same names. Element 4, the harmonized boundaries, is available as a gpkg (GeoPackage) file.
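A minimal sketch for loading the database in R and accessing its elements (the file extension is an assumption; the object and element names are as listed above):
# Loading the RData file makes the CORESIDENDE_DB list available in the session.
load("CORESIDENDE_DATABASE.RData")   # file extension assumed

national   <- CORESIDENDE_DB$NATIONAL
codebook   <- CORESIDENDE_DB$CODEBOOK
boundaries <- CORESIDENDE_DB$SUBNATIONAL_BOUNDARIES_CORESIDENCE   # sf object
head(codebook)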
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset is associated with the research article titled "Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection". This corpus aggregates, harmonizes, and standardizes data from eight widely used fake news datasets. It supports multi-domain fake news detection with an emphasis on explainability, cross-modal generalization, and robust performance.
🗂️ Dataset Contents
This repository contains the following resources:
Aggregated Raw Corpus (aggregated_raw.csv): 286,260 samples across 8 datasets. Binary labels (1 = Fake, 0 = Real). Includes metadata: source dataset, topic (if available), speaker/source, etc.
Preprocessed Text Corpus (aggregated_cleaned.csv): includes a standardized and cleaned cleaned_text column. Text normalization applied using SpaCy (lowercasing, lemmatization, punctuation/URL/user removal).
Fully Encoded Feature Matrix (xframe_features_encoded.csv): 104 structured features derived from communication theory and media psychology. Includes source encoding, speaker credibility, social engagement, sentiment, subjectivity, sensationalism, and readability scores. All numerical features scaled to [0, 1]; categorical features one-hot encoded.
Data Splits: train.csv, val.csv, test.csv, stratified splits of the cleaned and encoded data.
Feature Metadata (feature_description.pdf): documentation of all 104 features with descriptions, data sources, and rationales.
🔧 Preprocessing Overview
To ensure robust and generalizable modeling, the following standardized pipeline was applied:
Text Preprocessing: cleaned using SpaCy, lowercased, lemmatized, and stripped of stopwords, URLs, and usernames.
Label Mapping: datasets with multiclass labels (e.g. LIAR, FNC-1) were mapped to a unified binary schema using theory-informed rules. 1 = Fake includes false, pants-on-fire, disagree, etc.; 0 = Real includes true, agree, mostly-true.
Deduplication: removed near-duplicate entries across datasets using fuzzy string matching and content hashing.
Feature Engineering: source credibility features (e.g. speaker credibility from LIAR), social context (e.g. tweet volume, user mentions), and framing indicators (e.g. sentiment, subjectivity, sensationalism, readability).
Feature Encoding: one-hot encoding for categorical attributes, Min-Max scaling for numerical features.
📚 Original Data Sources
This aggregated corpus was derived from the following datasets. Please cite them individually alongside this collection:
LIAR – Wang (2017): https://doi.org/10.18653/v1/P17-2067
FakeNewsNet (PolitiFact, BuzzFeed, GossipCop) – Shu et al.: https://doi.org/10.1145/3363574
ISOT – Ahmed et al.: https://doi.org/10.48550/arXiv.1708.07104
WELFake – Verma et al.: https://doi.org/10.1109/TCSS.2021.3068519
FNC-1 – https://www.fakenewschallenge.org/
FakeNewsAMT – Pérez-Rosas et al.: https://doi.org/10.18653/v1/C18-1287
Celebrity Rumors – Horne & Adalı: https://doi.org/10.1609/icwsm.v11i1.15015
PHEME – Zubiaga et al.: https://doi.org/10.6084/m9.figshare.4010619.v1
📖 How to Cite This Dataset
Nwaiwu, S.; Jongsawat, N.; Tungkasthan, A. Decoding Disinformation: A Feature-Driven Explainable AI Approach to Multi-Domain Fake News Detection. Appl. Sci. 2025, 15, 9498. https://doi.org/10.3390/app15179498
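A minimal sketch for loading the stratified splits in R (file names as listed above; the label column name is an assumption):
train <- read.csv("train.csv", stringsAsFactors = FALSE)
val   <- read.csv("val.csv",   stringsAsFactors = FALSE)
test  <- read.csv("test.csv",  stringsAsFactors = FALSE)

# Binary label balance (1 = Fake, 0 = Real); "label" is an assumed column name.
prop.table(table(train$label))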
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Overview
The HA4M dataset is a collection of multi-modal data relative to actions performed by different subjects in an assembly scenario for manufacturing. It has been collected to provide a good test-bed for developing, validating and testing techniques and methodologies for the recognition of assembly actions. To the best of the authors' knowledge, few vision-based datasets exist in the context of object assembly. The HA4M dataset provides a considerable variety of multi-modal data compared to existing datasets. Six types of simultaneous data are supplied: RGB frames, Depth maps, IR frames, RGB-Depth-Aligned frames, Point Clouds and Skeleton data. These data allow the scientific community to make consistent comparisons among processing or machine learning approaches by using one or more data modalities. Researchers in computer vision, pattern recognition and machine learning can use/reuse the data for different investigations in different application domains such as motion analysis, human-robot cooperation, action recognition, and so on.
Dataset details
The dataset includes 12 assembly actions performed by 41 subjects for building an Epicyclic Gear Train (EGT). The assembly task involves three phases: first, the assembly of Block 1 and Block 2 separately, and then the final setting up of both Blocks to build the EGT. The EGT is made up of a total of 12 components divided into two sets: the first eight components for building Block 1 and the remaining four components for Block 2. Finally, two screws are fixed with an Allen key to assemble the two blocks and thus obtain the EGT.
Acquisition setup
The acquisition experiment took place in two laboratories (one in Italy and one in Spain), where an acquisition area was reserved for the experimental setup. A Microsoft Azure Kinect camera acquired videos during the execution of the assembly task. It was placed in front of the operator and the table where the components are spread over, on a tripod at a height of 1.54 m and a distance of 1.78 m, and down-tilted by an angle of 17 degrees.
Technical information
The HA4M dataset contains 217 videos of the assembly task performed by 41 subjects (15 females and 26 males). Their ages ranged from 23 to 60. All the subjects participated voluntarily and were provided with a written description of the experiment. Each subject was asked to execute the task several times and to perform the actions at their own convenience (e.g. with both hands), independently of their dominant hand. The HA4M project is a growing project, so new acquisitions, planned for the near future, will expand the current dataset.
Actions
Twelve actions are considered in HA4M. Actions 1 to 4 are needed to build Block 1, actions 5 to 8 build Block 2, and actions 9 to 12 complete the EGT. The actions are listed below:
1. Pick up/Place Carrier
2. Pick up/Place Gear Bearings (x3)
3. Pick up/Place Planet Gears (x3)
4. Pick up/Place Carrier Shaft
5. Pick up/Place Sun Shaft
6. Pick up/Place Sun Gear
7. Pick up/Place Sun Gear Bearing
8. Pick up/Place Ring Bear
9. Pick up Block 2 and place it on Block 1
10. Pick up/Place Cover
11. Pick up/Place Screws (x2)
12. Pick up/Place Allen Key, Turn Screws, Return Allen Key and EGT
Annotation
Data annotation concerns the labeling of the different actions in the video sequences. The annotation of the actions has been done manually by observing the RGB videos frame by frame. The start frame of each action is identified as the subject starts to move the arm toward the component to be grasped. The end frame, instead, is recorded when the subject releases the component, so the next frame becomes the start frame of the subsequent action. The total number of actions annotated in this study is 4123, including the "don't care" action (ID=0) and the action repetitions in the case of actions 2, 3 and 11.
Available code
The dataset has been acquired using the Multiple Azure Kinect GUI software, available at https://gitlab.com/roberto.marani/multiple-azure-kinect-gui, based on the Azure Kinect Sensor SDK v1.4.1 and Azure Kinect Body Tracking SDK v1.1.2. The software records device data to a Matroska (.mkv) file containing video tracks, IMU samples, and device calibration. In this work, IMU samples are not considered. The same Multiple Azure Kinect GUI software processes the Matroska file and returns the different types of data provided with our dataset: RGB images, RGB-Depth-Aligned (RGB-A) images, Depth images, IR images, Point Cloud and Skeleton data.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored in .csv format.
Each competition has a text description and metadata reflecting the characteristics of the competition, the dataset used and the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.
The code blocks themselves and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code-block files with corresponding metadata: snippets from kernels published up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv). The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.
Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
Since the marked-up code block data contain the numeric id of each code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
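A minimal sketch for loading the main tables in R (file names as listed above):
# Competition metadata and the two code-block tables described above.
competitions <- read.csv("competitions.csv", stringsAsFactors = FALSE)
code_blocks  <- read.csv("code_blocks_upto_20.csv", stringsAsFactors = FALSE)
markup       <- read.csv("markup_data_20220415.csv", stringsAsFactors = FALSE)
nrow(code_blocks)   # code blocks from kernels published up to 2020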
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Data frame and analysis script
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
General description
SAPFLUXNET contains a global database of sap flow and environmental data, together with metadata at different levels. SAPFLUXNET is a harmonised database, compiled from contributions from researchers worldwide.
The SAPFLUXNET version 0.1.5 database harbours 202 globally distributed datasets, from 121 geographical locations. SAPFLUXNET contains sap flow data for 2714 individual plants (1584 angiosperms and 1130 gymnosperms), belonging to 174 species (141 angiosperms and 33 gymnosperms), 95 different genera and 45 different families. More information on the database coverage can be found here: http://sapfluxnet.creaf.cat/shiny/sfn_progress_dashboard/.
The SAPFLUXNET project has been developed by researchers at CREAF and other institutions (http://sapfluxnet.creaf.cat/#team), coordinated by Rafael Poyatos (CREAF, http://www.creaf.cat/staff/rafael-poyatos-lopez), and funded by two Spanish Young Researcher's Grants (SAPFLUXNET, CGL2014-55883-JIN; DATAFORUSE, RTI2018-095297-J-I00) and an Alexander von Humboldt Research Fellowship for Experienced Researchers.
Changelog
Compared to version 0.1.4, this version includes some changes in the metadata, but all time series data (sap flow, environmental) remain the same.
For all datasets, climate metadata (temperature and precipitation, ‘si_mat’ and ‘si_map’) have been extracted from CHELSA (https://chelsa-climate.org/), replacing the previous climate data obtained with WorldClim. This change has modified the biome classification of the datasets in ‘si_biome’.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) is now assigned a value of 0 if species are in the understorey. This affects two datasets: AUS_MAR_UBD and AUS_MAR_UBW, where, previously, the sum of species basal area percentages could add up to more than 100%.
In ‘species’ metadata, the percentage of basal area with sap flow measurements for each species (‘sp_basal_area_perc’) has been corrected for datasets USA_SIL_OAK_POS, USA_SIL_OAK_1PR, USA_SIL_OAK_2PR.
In ‘site’ metadata, the vegetation type (‘si_igbp’) has been changed to SAV for datasets CHN_ARG_GWD and CHN_ARG_GWS.
Variables and units
SAPFLUXNET contains whole-plant sap flow and environmental variables at sub-daily temporal resolution. Both sap flow and environmental time series have accompanying flags in a data frame, one for sap flow and another for environmental variables. These flags store quality issues detected during the quality control process and can be used to add further quality flags.
Metadata contain relevant variables informing about site conditions, stand characteristics, tree and species attributes, sap flow methodology and details on environmental measurements. The description and units of all data and metadata variables can be found here: Metadata and data units.
To learn more about variables, units and data flags please use the functionalities implemented in the sapfluxnetr package (https://github.com/sapfluxnet/sapfluxnetr). In particular, have a look at the package vignettes using R:
library(sapfluxnetr)
vignette(package='sapfluxnetr')
vignette('metadata-and-data-units', package='sapfluxnetr')
vignette('data-flags', package='sapfluxnetr')
Data formats
SAPFLUXNET data can be found in two formats: 1) RData files belonging to the custom-built 'sfn_data' class and 2) Text files in .csv format. We recommend using the sfn_data objects together with the sapfluxnetr package, although we also provide the text files for convenience. For each dataset, text files are structured in the same way as the slots of sfn_data objects; if working with text files, we recommend that you check the data structure of 'sfn_data' objects in the corresponding vignette.
Working with sfn_data files
To work with SAPFLUXNET data, first they have to be downloaded from Zenodo, maintaining the folder structure. A first level in the folder hierarchy corresponds to file format, either RData files or csv's. A second level corresponds to how sap flow is expressed: per plant, per sapwood area or per leaf area. Please note that interconversions among the magnitudes have been performed whenever possible. Below this level, data have been organised per dataset. In the case of RData files, each dataset is contained in a sfn_data object, which stores all data and metadata in different slots (see the vignette 'sfn-data-classes'). In the case of csv files, each dataset has 9 individual files, corresponding to metadata (5), sap flow and environmental data (2) and their corresponding data flags (2).
After downloading the entire database, the sapfluxnetr package can be used to: work with data from a single site (data access, plotting and time aggregation); select the subset of datasets to work with; and work with data from multiple sites (data access, plotting and time aggregation).
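As a minimal sketch of that workflow (assuming the read_sfn_data() and daily_metrics() helpers described in the sapfluxnetr vignettes; please check the vignettes for the exact arguments):
library(sapfluxnetr)

# Read one dataset from the downloaded folder of per-plant RData files;
# the site code and folder path are illustrative.
site <- read_sfn_data("ARG_MAZ", folder = "RData/plant")

# Aggregate the sub-daily time series to daily metrics.
site_daily <- daily_metrics(site)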
Please check the following package vignettes to learn more about how to work with sfn_data files:
Quick guide
Metadata and data units
sfn_data classes
Custom aggregation
Memory and parallelization
Working with text files
We recommend working with sfn_data objects using R and the sapfluxnetr package; we do not currently provide code for working with the text files.
Data issues and reporting
Please report any issue you may find in the database by sending us an email: sapfluxnet@creaf.uab.cat.
Temporary data fixes, detected but not yet included in released versions, will be published on the SAPFLUXNET main web page ('Known data errors').
Data access, use and citation
This version of the SAPFLUXNET database is open access and corresponds to the data paper submitted to Earth System Science Data in August 2020.
When using SAPFLUXNET data in an academic work, please cite the data paper, when available, or alternatively, the Zenodo dataset (see the ‘Cite as’ section on the right panels of this web page).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
Publication
Primahadi Wijaya R., Gede. 2014. Visualisation of diachronic constructional change using Motion Chart. In Zane Goebel, J. Herudjati Purwoko, Suharno, M. Suryadi & Yusuf Al Aried (eds.), Proceedings: International Seminar on Language Maintenance and Shift IV (LAMAS IV), 267-270. Semarang: Universitas Diponegoro. doi: https://doi.org/10.4225/03/58f5c23dd8387
Description of R codes and data files in the repository
This repository is imported from its GitHub repo. Versioning of this figshare repository is associated with the GitHub repo's Releases, so check the Releases page for updates (the next version is to include the unified version of the codes in the first release with the tidyverse).
The raw input data consist of two files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates for will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (from the 1810s to the 2000s).
These two input files are used in the R code file 1-script-create-input-data-raw.r. The code preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (frequency of the collocates with be going to) and (iv) will (frequency of the collocates with will); it is available in input_data_raw.txt. Then, the script 2-script-create-motion-chart-input-data.R processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt). The output of the second script is input_data_futurate.txt.
Next, input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R), and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R).
The repository adopts the project-oriented workflow in RStudio; double-click the Future Constructions.Rproj file to open an RStudio session whose working directory is associated with the contents of this repository.
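The normalisation performed by the second script boils down to scaling each raw co-occurrence frequency by the decade's corpus size (a sketch of the per-million-words calculation, not the script itself):
# Normalise a raw collocate frequency to per-million-words, given the
# COHA word count for that decade (as stored in coha_size.txt).
normalise_pmw <- function(freq, decade_size) {
  freq / decade_size * 1e6
}
normalise_pmw(150, 24e6)   # e.g. 150 tokens in a 24-million-word decade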