43 datasets found
  1. UCI and OpenML Data Sets for Ordinal Quantification

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 25, 2023
    Cite
    Moreo, Alejandro (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Bunse, Mirko
    Senz, Martin
    Sebastiani, Fabrizio
    Moreo, Alejandro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
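    For example, a minimal pandas sketch for re-drawing one evaluation sample from its indices (whether the index files carry a header row is not stated, so header=None is an assumption, and the extracted data filename is hypothetical):

    import pandas as pd

    # Load one extracted data set and the APP test-set sample indices.
    data = pd.read_csv("extracted_data.csv")
    indices = pd.read_csv("app_tst_indices.csv", header=None)

    # Each row of the index file lists the data items that form one evaluation sample.
    sample = data.iloc[indices.iloc[0].dropna().astype(int)]

    # A quantifier should predict these class prevalences from the (unlabeled) sample.
    print(sample["class_label"].value_counts(normalize=True).sort_index())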

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  2. Data from: Dataset for Anomaly Detection in a Production Wireless Mesh...

    • zenodo.org
    application/gzip
    Updated Apr 27, 2023
    Cite
    Llorenç Cerdà-Alabern; Llorenç Cerdà-Alabern (2023). Dataset for Anomaly Detection in a Production Wireless Mesh Community Network [Dataset]. http://doi.org/10.5281/zenodo.6169917
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Llorenç Cerdà-Alabern; Llorenç Cerdà-Alabern
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    CSV dataset generated by gathering data from a production wireless mesh community network. Data was gathered every 5 minutes during the interval 2021-04-13 00:00:00 to 2021-04-16 00:00:00. During the interval 2021-04-14 02:00:00 to 2021-04-14 17:50:00 (both included), a gateway in the mesh (nodeid 24) failed.

    Live mesh network monitoring link: http://dsg.ac.upc.edu/qmpsu

    The dataset consists of a single gzip-compressed CSV file. The first line of the file is a header describing the features. The first column is the GMT timestamp of the sample in the format "2021-03-16 00:00:00". The rest of the columns provide the comma-separated values of the features collected from each node in the corresponding capture.

    A suffix with the nodeid is appended to each feature name. For instance, the feature holding the number of processes of the node with nodeid 24 is named "processes-24". In total, 63 different nodes appeared during the capture, each assigned a different nodeid.
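    For example, a minimal pandas sketch for loading the file and isolating the failed gateway's features (the CSV filename inside the archive is assumed here and may differ):

    import pandas as pd

    # Load the gzip-compressed CSV; the first column is the GMT timestamp.
    df = pd.read_csv("mesh_dataset.csv.gz", parse_dates=[0], index_col=0)

    # Select all features reported by the failed gateway (nodeid 24), e.g. "processes-24".
    node24 = df[[c for c in df.columns if c.endswith("-24")]]

    # Mark the rows that fall inside the documented gateway-failure interval.
    anomaly = (node24.index >= "2021-04-14 02:00:00") & (node24.index <= "2021-04-14 17:50:00")
    print(node24.shape, int(anomaly.sum()), "samples during the failure")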


    Features are of two types: (i) absolute values, for instance the CPU 1-minute load average, and (ii) monotonically increasing counters, for instance the number of transmitted packets. We have converted counter-type kernel variables to rates by dividing the difference between two consecutive samples by the difference of the corresponding timestamps in seconds, as shown in the following pseudo-code:
    feature.rate columns are computed from a counter-type feature as
    feature.rate <- (feature[2:n] - feature[1:(n-1)]) / (epoch[2:n] - epoch[1:(n-1)])
    feature.rate <- feature.rate[feature.rate >= 0] # discard samples where the counter is restarted
    where n is the number of samples and epoch is the timestamp in seconds

    features
    - processes number of processes
    - loadavg.m1 1 minute load average
    - softirq.rate servicing softirqs
    - iowait.rate waiting for I/O to complete
    - intr.rate
    - system.rate processes executing in kernel mode
    - idle.rate twiddling thumbs
    - user.rate normal processes executing in user mode
    - irq.rate servicing interrupts
    - ctxt.rate total number of context switches across all CPUs
    - nice.rate niced processes executing in user mode
    - nr_slab_unreclaimable The part of the Slab that can't be reclaimed under memory pressure
    - nr_anon_pages anonymous memory pages
    - swap_cache Memory that once was swapped out, is swapped back in but still also is in the swapfile
    - page_tables Memory used to map between virtual and physical memory addresses
    - swap
    - eth.txe.rate tx errors over all ethernet interfaces
    - eth.rxe.rate rx errors over all ethernet interfaces
    - eth.txb.rate tx bytes over all ethernet interfaces
    - eth.rxb.rate rx bytes over all ethernet interfaces
    - eth.txp.rate tx packets over all ethernet interfaces
    - eth.rxp.rate rx packets over all ethernet interfaces
    - wifi.txe.rate tx errors over all wireless interfaces
    - wifi.rxe.rate rx errors over all wireless interfaces
    - wifi.txb.rate tx bytes over all wireless interfaces
    - wifi.rxb.rate rx bytes over all wireless interfaces
    - wifi.txp.rate tx packets over all wireless interfaces
    - wifi.rxp.rate rx packets over all wireless interfaces
    - txb.rate tx bytes over all ethernet and wifi interfaces
    - txp.rate tx packets over all ethernet and wifi interfaces
    - rxb.rate rx bytes over all ethernet and wifi interfaces
    - rxp.rate rx packets over all ethernet and wifi interfaces
    - sum.xb.rate tx+rx bytes over all ethernet and wifi interfaces
    - sum.xp.rate tx+rx packets over all ethernet and wifi interfaces
    - diff.xb.rate tx-rx bytes over all ethernet and wifi interfaces
    - diff.xp.rate tx-rx packets over all ethernet and wifi interfaces

  3. Network traffic and code for machine learning classification

    • data.mendeley.com
    Updated Feb 20, 2020
    + more versions
    Cite
    Víctor Labayen (2020). Network traffic and code for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.2
    Explore at:
    Dataset updated
    Feb 20, 2020
    Authors
    Víctor Labayen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive), and the label is shown in the filename. There is also a file (mapping.csv) with the mapping between the host's IP address, the csv/pcap filename and the activity label.

    Activities:

    Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.
    Bulk data transfer: applications that transfer large files over the network, for example SCP/FTP applications and direct downloads of large files from web servers such as Mediafire, Dropbox or the university repository.
    Web browsing: all the traffic generated while searching and consuming different web pages; examples are several blogs, news sites and the university's Moodle.
    Video playback: traffic from applications that consume video in streaming or pseudo-streaming; the best-known servers used are Twitch and YouTube, but the university's online classroom has also been used.
    Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some pages open, such as Google Docs, YouTube and several web pages, but always without user interaction.

    The capture is performed on a network probe attached, via a SPAN port, to the router that forwards the user's network traffic. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP addresses, and source and destination UDP/TCP ports. The fields are also included as a header in every csv file.
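    As an illustration, a minimal sketch for loading one trace and its label mapping (assuming pandas; the capture filename and the exact header spellings are assumptions, so adjust them to the actual files):

    import pandas as pd

    # Read one capture trace; the header row names the per-packet fields listed above.
    pkts = pd.read_csv("capture_video_01.csv")

    # mapping.csv links the host IP address, the csv/pcap filename and the activity label.
    mapping = pd.read_csv("mapping.csv")

    # Example of a crude per-trace summary: packet count per protocol (second column).
    print(pkts.groupby(pkts.columns[1]).size())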

    The amount of data is stated as follows:

    Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    Video: 23 traces, 4496 s, 1405 MBytes
    Web: 23 traces, 4203 s, 148 MBytes
    Interactive: 42 traces, 8934 s, 30.5 MBytes
    Idle: 52 traces, 6341 s, 0.69 MBytes

    The code of our machine learning approach is also included. There is a README.txt file with the documentation of how to use the code.

  4. Windows Malware Detection Dataset

    • figshare.com
    txt
    Updated Mar 15, 2023
    Cite
    Irfan Yousuf (2023). Windows Malware Detection Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21608262.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 15, 2023
    Dataset provided by
    figshare
    Authors
    Irfan Yousuf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of Windows Portable Executable (PE) samples with four feature sets, one CSV file per feature set.
    1. The first feature set (DLLs_Imported.csv) contains the DLLs imported by each malware sample. The first column contains SHA256 values, the second column contains the label or family type of the malware, and the remaining columns list the names of the imported DLLs (a loading sketch follows the label list below).
    2. The second feature set (API_Functions.csv) contains the API functions called by these malware samples, along with their SHA256 hash values and labels.
    3. The third feature set (PE_Header.csv) contains the values of 52 fields of the PE header. All fields are labelled in the CSV file.
    4. The fourth feature set (PE_Section.csv) contains 9 field values for 10 different PE sections. All fields are labelled in the CSV file.

    Malware type / family labels:

    0=Benign, 1=RedLineStealer, 2=Downloader, 3=RAT, 4=BankingTrojan, 5=SnakeKeyLogger, 6=Spyware
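    For example, a minimal sketch (assuming scikit-learn is available) for turning the variable-length DLL lists into a binary feature matrix; whether the file carries a header row is not stated, so skip it if present:

    import csv
    from sklearn.preprocessing import MultiLabelBinarizer

    hashes, labels, dll_lists = [], [], []
    with open("DLLs_Imported.csv", newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            hashes.append(row[0])                         # SHA256
            labels.append(row[1])                         # family label (0-6)
            dll_lists.append([d for d in row[2:] if d])   # imported DLL names

    # One-hot encode the imported DLLs into a matrix suitable for a classifier.
    X = MultiLabelBinarizer().fit_transform(dll_lists)
    print(X.shape, sorted(set(labels)))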

  5. Annotated 12 lead ECG dataset

    • zenodo.org
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro (2021). Annotated 12 lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3625007
    Explore at:
    Available download formats: zip
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Antonio H Ribeiro; Manoel Horta Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Gabriela M. Paixão; Derick M. Oliveira; Derick M. Oliveira; Paulo R. Gomes; Paulo R. Gomes; Jéssica A. Canazart; Jéssica A. Canazart; Milton P. Ferreira; Milton P. Ferreira; Carl R. Andersson; Carl R. Andersson; Peter W. Macfarlane; Peter W. Macfarlane; Wagner Meira Jr.; Wagner Meira Jr.; Thomas B. Schön; Thomas B. Schön; Antonio Luiz P. Ribeiro; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students.
    It is used as the test set in the paper:
    "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network".
    
    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a
    `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different
    patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12
    different leads of the ECG exam.
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of
    10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples).
    In order to make them all have the same size (4096 samples), we pad them with zeros
    on both sides. For instance, for a 7-second ECG signal with 2800 samples we include 648
    samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved
    in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V: so they should
    be multiplied by 1000 in order to obtain the signals in V.
    
    In python, one can read this file using the following sequence:
    ```python
    import h5py
    import numpy as np

    with h5py.File("ecg_tracings.hdf5", "r") as f:
        x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It
    contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header).
    In all csv files, the i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`.
    The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`
    corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2
     agree, the common diagnosis was considered the gold standard. In cases where there was any disagreement, a
     third senior specialist, aware of the annotations from the other two, decided the diagnosis.
     3. `dnn.csv` predictions from the deep neural network described in
     "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set in such a way
     that it maximizes the F1 score (see the example after this list).
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
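    As an illustration, a minimal sketch (assuming pandas and scikit-learn) for scoring the network's predictions against the gold standard:
    ```python
    import pandas as pd
    from sklearn.metrics import f1_score

    labels = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]
    gold = pd.read_csv("annotations/gold_standard.csv")
    dnn = pd.read_csv("annotations/dnn.csv")

    # Per-abnormality F1 score of the thresholded DNN predictions vs. the gold standard.
    for col in labels:
        print(col, f1_score(gold[col], dnn[col]))
    ```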
    
  6. Sample data file with TOAR air quality data for machine learning excercise -...

    • b2find.eudat.eu
    Updated Jan 18, 2025
    Cite
    (2025). Sample data file with TOAR air quality data for machine learning excercise - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/b3b0256e-4819-5c15-859a-56909c66085f
    Explore at:
    Dataset updated
    Jan 18, 2025
    Description

    This file has been obtained from the Tropospheric Ozone Assessment Report (TOAR) database described by Schultz, M.G. et al., Elementa Sci. Anthrop., 2017, doi: http://doi.org/10.1525/elementa.244. It contains 6 years of annual NO2 concentration percentiles at German measurement sites and corresponding station metadata. The intended use of these data is to demonstrate the set-up and training of a simple feed-forward neural network that attempts to predict the NO2 statistics from the station characterisation given in the metadata. The data are stored as a csv file (comma delimited) with 7 header lines plus column headings. The column headings are: year,id,station_id,station_type,station_type_of_area,station_nightlight_1km,station_wheat_production,station_nox_emissions,station_omi_no2_column,station_max_population_density_5km,perc75,perc98. station_id, station_type, and station_type_of_area are string variables; all other columns are numeric. year, id, and station_id should be ignored for the machine learning. perc75 and perc98 are the 75th and 98th percentiles, respectively, given in units of nmol per mol (equivalent to ppbv).
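    For example, a minimal pandas sketch for reading the file and separating inputs from targets (the filename is assumed here):

    import pandas as pd

    # Skip the 7 header lines so that the column headings become the header row.
    df = pd.read_csv("toar_no2_sample.csv", skiprows=7)

    # year, id and station_id are ignored; perc75 and perc98 are the prediction targets.
    X = df.drop(columns=["year", "id", "station_id", "perc75", "perc98"])
    X = pd.get_dummies(X, columns=["station_type", "station_type_of_area"])  # encode string variables
    y = df[["perc75", "perc98"]]
    print(X.shape, y.shape)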

  7. Data and scripts associated with a manuscript investigating impacts of solid...

    • search.dataone.org
    • osti.gov
    Updated Aug 21, 2023
    Cite
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg (2023). Data and scripts associated with a manuscript investigating impacts of solid phase extraction on freshwater organic matter optical signatures and mass spectrometry pairing [Dataset]. http://doi.org/10.15485/1995543
    Explore at:
    Dataset updated
    Aug 21, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Alan Roebuck; Brieanne Forbes; Vanessa A. Garayburu-Caruso; Samantha Grieger; Khadijah Homolka; James C. Stegen; Allison Myers-Pigg
    Time period covered
    Aug 30, 2021 - Sep 15, 2021
    Area covered
    Description

    This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system” submitted to “Limnology and Oceanography: Methods.” This data is an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914). Other associated data and field metadata can be found at the link provided. The goal of this manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected from within the Yakima River Basin, Washington were analyzed for dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement. The extraction efficiencies for DOC and common optical indices were calculated. In addition, SPE samples were subject to ultra-high resolution mass spectrometry and compared with the ambient and SPE-generated optical data. Finally, in addition to this cross-platform inter-comparison, we further performed an intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of the samples (ranging from ~30 to ~75%), compared to the common practice of assuming a DOC extraction efficiency of 60% for freshwater samples.

    This data package folder consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) readme; (2) data dictionary (dd); (3) file-level metadata (flmd); (4) final data summary output from the processing script; and (5) the processing script. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce manuscript statistics and figures (with the exception of that stated below). The Data_Input folder has two subfolders: (1) FTICR and (2) Optics. Additionally, the Data_Input folder contains dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC) data (SPS_NPOC_Summary.csv) and relevant supporting solid phase extraction volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. In addition, the data dictionary (SPS_SPE_dd.csv), file-level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are provided. The FTICR subfolder contains all raw FTICR data as well as instructions for processing. In addition, post-processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) are provided that can be directly read into R with the associated R-markdown file. The Optics subfolder contains all absorbance and fluorescence spectra. Fluorescence spectra have been blank corrected, inner filter corrected, and undergone scatter removal. In addition, this folder contains Matlab code used to make a portion of Figure 1 within the manuscript, derive various spectral parameters used within the manuscript, and perform parallel factor analysis (PARAFAC) modeling.
Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are directly read into the associated R-markdown file.

  8. IHerbSpec Metadata Spreadsheet v1

    • dataverse.harvard.edu
    Updated Jul 4, 2025
    Cite
    Dawson White (2025). IHerbSpec Metadata Spreadsheet v1 [Dataset]. http://doi.org/10.7910/DVN/LXBD43
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 4, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Dawson White
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains version 1 of the metadata schema developed by the International Herbarium Spectral Digitization Working Group (IHerbSpec) to support standardized documentation of spectral reflectance measurements of herbarium specimens. The spreadsheet defines session-, specimen-, and tissue-level metadata fields as column headers for recording data according to the IHerbSpec Base Protocol v1.0. The dataset includes two CSV files:
    - IHerbSpec_metadata_v1.csv – a clean version containing only the column headers (field names) used for metadata capture. The simpleFilename and filename fields have also been populated with example entries to show the recommended filename prefix structure corresponding to the filename format outlined in Part 3 of the IHerbSpec Protocol v1.0.
    - IHerbSpec_metadata_v1-examples.csv – the same template, populated with example records for spectral measurement files to illustrate proper field usage and formatting.
    This metadata schema is intended to promote consistent, interoperable data practices across projects and institutions, in alignment with FAIR data principles. It can be used during or after measurement sessions to organize and export structured metadata for each spectral sample. Future updates will reflect community feedback and protocol revisions. Users are encouraged to cite the specific version used in any shared dataset or publication.

  9. Next-gen sequencing and metadata analyses of Great Lakes fungal data

    • databank.illinois.edu
    Updated Dec 18, 2017
    + more versions
    Cite
    Andrew N. Miller (2017). Next-gen sequencing and metadata analyses of Great Lakes fungal data [Dataset]. http://doi.org/10.13012/B2IDB-9320144_V2
    Explore at:
    Dataset updated
    Dec 18, 2017
    Authors
    Andrew N. Miller
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    The Great Lakes
    Dataset funded by
    U.S. National Institutes of Health (NIH)
    Description

    The data set consists of Illumina sequences derived from 48 sediment samples, collected in 2015 from Lake Michigan and Lake Superior for the purpose of inventorying the fungal diversity in these two lakes. DNA was extracted from ca. 0.5 g of sediment using the MoBio PowerSoil DNA isolation kits following the Earth Microbiome protocol. PCR was completed with the fungal primers ITS1F and fITS7 using the Fluidigm Access Array. The resulting amplicons were sequenced using the Illumina Hi-Seq2500 platform with rapid 2 x 250nt paired-end reads. The enclosed data sets contain the forward read files for both primers, both fixed-header index files, and the associated map files needed for processing in QIIME. In addition, enclosed are two rarefied OTU files used to evaluate fungal diversity. All decimal latitude and decimal longitude coordinates of our collecting sites are also included.

    File descriptions:
    - Great_lakes_Map_coordinates.xlsx = coordinates of sample sites

    QIIME processing, ITS1 region (raw files used to process the ITS1 Illumina reads in QIIME; only forward reads were processed):
    - GL_ITS1_HW_mapFile_meta.txt = map file used in QIIME
    - ITS1F_Miller_Fludigm_I1_fixedheader.fastq = index file from Illumina; headers were fixed to match the forward reads (R1) file in order to process in QIIME
    - ITS1F_Miller_Fludigm_R1.fastq = forward Illumina reads for the ITS1 region

    QIIME processing, ITS2 region (raw files used to process the ITS2 Illumina reads in QIIME; only forward reads were processed):
    - GL_ITS2_HW_mapFile_meta.txt = map file used in QIIME
    - ITS7_Miller_Fludigm_I1_Fixedheaders.fastq = index file from Illumina; headers were fixed to match the forward reads (R1) file in order to process in QIIME
    - ITS7_Miller_Fludigm_R1.fastq = forward Illumina reads for the ITS2 region

    Resulting OTU tables (with and without taxonomy):
    - wahl_ITS1_R1_otu_table.csv = representative OTUs based on the ITS1 region for all the R1 data and the number of each OTU found in each sample
    - wahl_ITS1_R1_otu_table_w_tax.csv = representative OTUs based on the ITS1 region for all the R1 data and the number of each OTU found in each sample, along with taxonomic determination based on the database sh_taxonomy_qiime_ver7_97_s_31.01.2016_dev
    - wahl_ITS2_R1_otu_table.csv = representative OTUs based on the ITS2 region for all the R1 data and the number of each OTU found in each sample
    - wahl_ITS2_R1_otu_table_w_tax.csv = representative OTUs based on the ITS2 region for all the R1 data and the number of each OTU found in each sample, along with taxonomic determination based on the database sh_taxonomy_qiime_ver7_97_s_31.01.2016_dev

    Rarefied Illumina dataset for each ITS region:
    - ITS1_R1_nosing_rare_5000.csv = environmental parameters and rarefied OTU dataset for the ITS1 region
    - ITS2_R1_nosing_rare_5000.csv = environmental parameters and rarefied OTU dataset for the ITS2 region

    Column headings:
    - #SampleID = code including researcher initials and sequential run number
    - BarcodeSequence
    - LinkerPrimerSequence = two sequences used: CTTGGTCATTTAGAGGAAGTAA or GTGARTCATCGAATCTTTG
    - ReversePrimer = two sequences used: GCTGCGTTCTTCATCGATGC or TCCTCCGCTTATTGATATGC
    - run_prefix = initials of run operator
    - Sample = location code; see thesis figures 1 and 2 for mapped locations and Great_lakes_Map_coordinates.xlsx for exact coordinates
    - DepthGroup = S = shallow (50-100 m), MS = mid-shallow (101-150 m), MD = mid-deep (151-200 m), D = deep (>200 m)
    - Depth_Meters = depth in meters
    - Lake = lake name, Michigan or Superior
    - Nitrogen %
    - Carbon %
    - Date = mm/dd/yyyy
    - pH = acidity, potential of Hydrogen (pH) scale
    - SampleDescription = Sample or control
    - X = sequential run number
    - OTU ID = Operational taxonomic unit ID

  10. CTD data from casts before and after larval vertical distribution sampling...

    • bco-dmo.org
    • search.dataone.org
    • +1more
    csv, zip
    Updated Dec 30, 2019
    Cite
    Philip O. Yund (2019). CTD data from casts before and after larval vertical distribution sampling from R/V C-Hawk day cruises in the Eastern Gulf of Maine from 2012 to 2014 [Dataset]. http://doi.org/10.1575/1912/bco-dmo.783736.1
    Explore at:
    Available download formats: csv (499.60 KB), zip (323.58 KB)
    Dataset updated
    Dec 30, 2019
    Dataset provided by
    Biological and Chemical Data Management Office
    Authors
    Philip O. Yund
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Aug 1, 2012 - Aug 7, 2014
    Area covered
    Variables measured
    Time, Depth, Density, Pressure, Salinity, cast_name, Temperature, Conductivity, End_latitude, Cast_time_UTC, and 5 more
    Description

    These data were presented in Weinstock et al., 2018 (see Fig 1 and 2). Only downcast density data were analyzed and presented in Weinstock et al., 2018. Water density data were converted to % of water column maximum in Figure 2 of Weinstock et al., 2018.

    These data are available in two formats. The version in the BCO-DMO data system contains data from all casts concatenated together, with added columns cast_name, Cast_time_UTC, start_latitude, start_longitude, end_latitude, and end_longitude, which were originally contained in comment and header lines of the 35 cast csv files. The 35 individual cast csv files are available in the "Data Files" section as CTD.zip: "CTD csv files with seabird headers."

    Related Datasets: CTD casts were conducted immediately before and after the associated larval vertical distribution sampling (separate dataset description) and on both flood and ebb tides.
    * CTD cast log for mussel study: https://www.bco-dmo.org/dataset/783749
    * Mussel Larvae Vertical Distribution: https://www.bco-dmo.org/dataset/783755

  11. Index to Marine and Lacustrine Geological Samples (IMLGS)

    • purl.stanford.edu
    Updated May 22, 2025
    Cite
    Curators of Marine and Lacustrine Geological Samples Consortium; NOAA National Centers for Environmental Information (2025). Index to Marine and Lacustrine Geological Samples (IMLGS) [Dataset]. https://purl.stanford.edu/kb879ww0098
    Explore at:
    Dataset updated
    May 22, 2025
    Authors
    Curators of Marine and Lacustrine Geological Samples Consortium; NOAA National Centers for Environmental Information
    License

    Public Domain Mark 1.0: https://creativecommons.org/publicdomain/mark/1.0/
    License information was derived automatically

    Description

    This copy of the Index to Marine and Lacustrine Geological Samples (IMLGS) was created on April 22, 2025 before its decommission on May 5, 2025. In addition to the csv file of the sample data, this deposit includes the html of the original NCEI page (https://www.ncei.noaa.gov/products/index-marine-lacustrine-samples,) a webarchive of metadata provided by NCEI from https://data.noaa.gov//metaview/page?xml=NOAA/NESDIS/NGDC/MGG/Geology/iso/xml/G00028.xml&view=getDataView&header=none, and an ML Commons Croissant metadata file that was generated for the csv file. The keywords below come from the NCEI dataset overview page. The Croissant file contains basic information about the columns. See the NCEI overview for more context on this dataset.

    Original Description from dataset landing page (https://www.ncei.noaa.gov/products/index-marine-lacustrine-samples,): The Index to Marine and Lacustrine Geological Samples (IMLGS) is a community designed and maintained resource that enables scientists to discover and access geological material from seabed and lakebed cores, grabs, and dredges archived at participating institutions from around the world. Sample material is available directly from each repository. Before proposing research on any sample, please contact the repository’s curator for sample condition and availability.

    Each repository submits data gleaned from physical samples to the IMLGS database, which is maintained by NOAA's National Centers for Environmental Information (NCEI). All sample data include basic collection and storage information, whereas some samples, at the discretion of the curator, may also include lithology, texture, age, mineralogy, weathering, metamorphism, glass remarks, color, physiographic province, principal investigator, and/or descriptive information. The public can access the IMLGS database by using NOAA NCEI’s data access resources.

  12. Mineralogy of floodplain sediments from Meanders C, O, and Z in the East...

    • dataone.org
    • data.ess-dive.lbl.gov
    • +1more
    Updated Mar 7, 2025
    Cite
    Sergio Carrero; Patricia Fox; Peter Nico (2025). Mineralogy of floodplain sediments from Meanders C, O, and Z in the East River Watershed, CO, USA [Dataset]. http://doi.org/10.15485/2526687
    Explore at:
    Dataset updated
    Mar 7, 2025
    Dataset provided by
    ESS-DIVE
    Authors
    Sergio Carrero; Patricia Fox; Peter Nico
    Time period covered
    Jul 16, 2016 - Sep 26, 2017
    Area covered
    Description

    This dataset includes bulk X-ray diffraction data from floodplain sediments collected as a part of the Watershed Function Scientific Focus Area (SFA) located in the Upper Colorado River Basin. The data were collected in order to investigate the role of biogeochemical cycling and other river corridor processes on riverine export of solutes. Sediment cores were collected from Meander C, Meander O, and Meander Z in July 2016 to September 2017 to depths of approximately 40-95 cm. Sample metadata including locations, depths, and sample dates are included in a csv file ("sample_list_and_locations.csv"). The file "diffraction_data.csv" contains raw diffraction data, and mineral quantification is in the file "mineral_abundance.csv". This dataset also includes a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata and a data dictionary (dd.csv) file that contains column/row headers used throughout the files along with a definition, units, and data type.

  13. UMAHand: Hand Activity Dataset (Universidad de Málaga)

    • figshare.com
    • portaldelainvestigacion.uma.es
    zip
    Updated Jul 2, 2024
    Cite
    Eduardo Casilari; Jennifer Barbosa-Galeano; Francisco Javier González-Cañete (2024). UMAHand: Hand Activity Dataset (Universidad de Málaga) [Dataset]. http://doi.org/10.6084/m9.figshare.25638246.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 2, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Eduardo Casilari; Jennifer Barbosa-Galeano; Francisco Javier González-Cañete
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The objective of the UMAHand dataset is to provide a systematic, Internet-accessible benchmarking database for evaluating algorithms for the automatic identification of manual activities. The database was created by monitoring 29 predefined activities involving specific movements of the dominant hand. These activities were performed by 25 participants, each completing a certain number of repetitions. During each movement, participants wore a 'mote' or Shimmer sensor device on their dominant hand's wrist. This sensor, comparable in weight and volume to a wristwatch, was attached with an elastic band according to a predetermined orientation. The Shimmer device contains an Inertial Measurement Unit (IMU) with a triaxial accelerometer, gyroscope, magnetometer, and barometer. These sensors recorded measurements of acceleration, angular velocity, magnetic field, and atmospheric pressure at a constant sampling frequency of 100 Hz during each movement.

    The UMAHand Dataset comprises a main directory and three subdirectories: TRACES (containing measurements), VIDEOS (containing video sequences) and SCRIPTS (with two scripts that automate the downloading, unzipping and processing of the dataset). The main directory also includes three descriptive plain text files and an image:
    • "readme.txt": a brief guide to the dataset which describes the basic characteristics of the database, the testbed or experimental framework used to generate it and the organization of the data files.
    • "user_characteristics.txt": contains a line of six numerical (comma-separated) values for each participant describing their personal characteristics in the following order: 1) an abstract user identifier (a number from 01 to 25), 2) a binary value indicating whether the participant is left-handed (0) or right-handed (1), 3) a numerical value indicating gender: male (0), female (1), undefined or undisclosed (2), 4) the weight in kg, 5) the height in cm and 6) the age in years.
    • "activity_description.txt": for each activity, this text file incorporates a line with the activity identifier (numbered from 01 to 29) and an alphanumeric string that briefly describes the performed action.
    • "sensor_orientation.jpg": a JPEG-type image file illustrating the way the sensor is worn and the orientation of the measurement axes.

    The TRACES subfolder with the data is, in turn, organized into 25 secondary subfolders, one for each participant, named with the word "output" followed by an underscore symbol (_) and the corresponding participant identifier (a number from 1 to 25). Each subdirectory contains one CSV (Comma Separated Values) file for each trial (each repetition of any activity) performed by the corresponding volunteer. The filenames with the monitored data follow the format "user_XX_activity_YY_trial_ZZ.csv", where XX, YY, and ZZ represent the identifiers of the participant (XX), the activity (YY) and the repetition number (ZZ), respectively.

    The files do not include any header; each line corresponds to a sample taken by the sensing node, i.e. a set of simultaneous measurements captured by the sensors of the Shimmer mote at a certain instant. The values in each line are arranged as follows:

    Timestamp, Ax, Ay, Az, Gx, Gy, Gz, Mx, My, Mz, P

    where:
    - Timestamp is the time indication of the moment when the following measurements were taken. Time is measured in milliseconds elapsed since the start of the recording. Therefore, the first sample, in the first line of the file, has a zero value, while the rest of the timestamps in the file are relative to this first sample.
    - Ax, Ay, Az are the measurements of the three axes of the triaxial accelerometer (in g units).
    - Gx, Gy, Gz indicate the components of the angular velocity measured by the triaxial gyroscope (in degrees per second or dps).
    - Mx, My, Mz represent the 3-axis data in microteslas (µT) captured by the magnetometer.
    - P is the measurement of pressure in millibars.
    (A minimal loading sketch for one of these files is given at the end of this description.)

    Besides, the VIDEOS directory includes 29 anonymized video clips that illustrate, with corresponding examples, the 29 manual activities carried out by the participants. The video files are encoded in MPEG4 format and named according to the format "Example_Activity_XX.mp4", where XX indicates the identifier of the movement (as described in the activity_description.txt file).

    Finally, the SCRIPTS subfolder comprises two scripts written in Python and Matlab. These two programs (named Load_traces), which perform the same function, are designed to automate the downloading and processing of the data. Specifically, these scripts perform the following tasks:
    1. Download the database from the public repository as a single compressed zip file.
    2. Unzip the aforementioned file and create the subfolder structure of the dataset in a specific directory named UMAHand_Dataset. As previously commented, in the subfolder named TRACES, one CSV trace file per experiment (i.e. per movement, user, and trial) is created.
    3. Read all the CSV files and store their information in a list of dictionaries (Python) or a matrix of structures (Matlab) named datasetTraces. Each element in that list/matrix has two fields: the filename (which identifies the user, the type of performed activity, and the trial number) and a numerical array of 11 columns containing the timestamps and the measurements of the sensors for that experiment (arranged as mentioned above).

    All experiments and data acquisition were conducted in private home environments. Participants were asked to perform activities involving sustained or continuous hand movements (e.g. clapping hands) for at least 10 seconds. In the case of brief, punctual movements that might require less than 10 seconds (e.g. picking up an object from the floor), volunteers were simply asked to execute the action until its conclusion. In total, 752 samples were collected, with durations ranging from 1.98 to 119.98 seconds.
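    For example, a minimal pandas sketch for loading one trace file (the specific participant/activity/trial in the path is only an illustration):

    import pandas as pd

    # Column layout of every trace file (the CSV files themselves carry no header row).
    cols = ["Timestamp", "Ax", "Ay", "Az", "Gx", "Gy", "Gz", "Mx", "My", "Mz", "P"]

    # Load one trial; the folder and file names follow the scheme described above.
    trial = pd.read_csv("TRACES/output_1/user_01_activity_05_trial_01.csv", header=None, names=cols)

    # Timestamps are in milliseconds relative to the first sample (sampling rate 100 Hz).
    duration_s = trial["Timestamp"].iloc[-1] / 1000.0
    print(trial.shape, round(duration_s, 2), "s")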

  14. Aerobic respiration controls on shale weathering, Geochimica et Cosmochimica...

    • search.dataone.org
    • osti.gov
    Updated Jun 12, 2024
    Cite
    Lucien Stolze; Bhavna Arora; Dipankar Dwivedi; Carl Steefel; Zhi Li; Sergio Carrero; Benjamin Gilbert; Peter Nico; Markus Bill (2024). Aerobic respiration controls on shale weathering, Geochimica et Cosmochimica Acta, 2023: Dataset [Dataset]. http://doi.org/10.15485/1987859
    Explore at:
    Dataset updated
    Jun 12, 2024
    Dataset provided by
    ESS-DIVE
    Authors
    Lucien Stolze; Bhavna Arora; Dipankar Dwivedi; Carl Steefel; Zhi Li; Sergio Carrero; Benjamin Gilbert; Peter Nico; Markus Bill
    Time period covered
    Jan 30, 2018 - Mar 19, 2019
    Area covered
    Description

    This data package was generated in order to support the development of a deep-time weathering model and to assess the coupling between shale weathering and aerobic respiration in the paper “Aerobic respiration controls on shale weathering” by Stolze et al., Geochimica et Cosmochimica Acta (2023). The package contains two csv files providing the average CO2(g) concentration profiles [ppm] and mineral concentration profiles [wt%], respectively. The CO2(g) concentration profiles were measured in the vicinity of the monitoring well PLM2 between January 2018 and April 2019. The gas samples were collected in the unsaturated zone to a depth of 1.52 m. The mineral concentration profiles were determined by X-Ray diffraction (XRD). The XRD measurements were performed on sub-core samples collected in the monitoring well PLM3 down to a depth of 7.01 m. The dataset additionally includes a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata; and a data dictionary (dd.csv) file that contains column/row headers used throughout the files along with a definition, units, and data type. Update on 2024-05-28: Revised versions of the CSV data files (CO2_data_GCA_Stolze_et_al_2023.csv and XRD_data_GCA_Stolze_et_al_2023.csv) were made to apply ESS-DIVE's CSV reporting format guidelines. Updated versions of the File Level Metadata (v2_20240528_flmd.csv) and Data Dictionary (v2_20240528_dd.csv) files were updated to reflect the changes made to the CSV files.

  15. Character Encoding Examples

    • kaggle.com
    Updated Dec 15, 2017
    Cite
    Rachael Tatman (2017). Character Encoding Examples [Dataset]. https://www.kaggle.com/datasets/rtatman/character-encoding-examples/discussion?sortBy=hot&group=owned
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 15, 2017
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rachael Tatman
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context:

    Character encodings are sets of mappings from raw bits (0’s and 1’s) to text characters. When a text encoded with a specific encoder is decoded with a different encoder, it changes the output text. Sometimes this results in completely unreadable text.

    This dataset is intended to provide a list of example texts in different character encodings to help you diagnose which encoding your source file is actually in.

    Content

    This dataset is made up of six text files that represent five different character encodings and six different languages. The character encodings represented in this dataset are ISO-8859-1 (also known as Latin 1), ASCII, Windows 1251, UTF-16 that has been successfully converted into UTF-8, and BIG-5. More information on the files is available in the file_guide.csv file.
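    For example, a minimal sketch using the third-party chardet package (the filename is hypothetical):

    import chardet

    # Guess the encoding of the raw bytes before decoding them into text.
    with open("some_text_file.txt", "rb") as f:
        raw = f.read()

    guess = chardet.detect(raw)   # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
    print(guess)
    text = raw.decode(guess["encoding"])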

    Each text file contains a header and footer. The body text is delimited by this text:

    *** START OF THE PROJECT GUTENBERG EBOOK [TITLE OF BOOK GOES HERE] ***

    *** END OF THE PROJECT GUTENBERG EBOOK [TITLE OF BOOK GOES HERE]***

    Acknowledgements:

    The texts in this dataset were prepared by Project Gutenberg volunteers. These texts are in the public domain.

    Inspiration:

    • Can you build a tool to automatically detect when a file is read in with the wrong encoding?
    • You can use this dataset to explore what happens when you read in text using different encoders.
  16. Cherenkov Telescope Data for Ordinal Quantification

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 23, 2023
    Cite
    Bunse, Mirko (2023). Cherenkov Telescope Data for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7090094
    Explore at:
    Dataset updated
    Jul 23, 2023
    Dataset provided by
    Bunse, Mirko
    Senz, Martin
    Sebastiani, Fabrizio
    Moreo, Alejandro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This labeled data set is targeted at ordinal quantification. The goal of quantification is not to predict the class label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract the relevant features and labels from the public data set of the FACT Cherenkov telescope. These features are precisely the ones that domain experts from astro-particle physics employ in their analyses. The labels stem from a binning of a continuous energy label, which is common practice in these analyses.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, app-oq_tst_indices.csv, real_val_indices.csv, and real_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ(5%), is a variant thereof, where only the smoothest 5% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed. The labels of the FACT data lie on an ordinal scale and, hence, pose such an ordinal quantification task. The third protocol considers "real" distributions of labels. These distributions would be expected by observing the Crab Nebula through the FACT telescope.

    Usage

    You can extract the data fact.csv through the provided script extract-fact.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended) or

    curl --fail -o fact.hdf5 https://factdata.app.tu-dortmund.de/dl2/FACT-Tools/v1.1.2/gamma_simulations_facttools_dl2.hdf5
    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-fact.jl

    Outcome: The first row in the resulting fact.csv file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

    Original data repository: https://factdata.app.tu-dortmund.de/

    Reference analysis by astro-particle physicists: https://github.com/fact-project/open_crab_sample_analysis

  17. Dissolved Inorganic Carbon and Dissolved Organic Carbon Data for the East...

    • data.nceas.ucsb.edu
    • data.ess-dive.lbl.gov
    • +4more
    Updated Aug 7, 2023
    Cite
    Wenming Dong; Curtis Beutler; Wendy Brown; Alexander Newman; Dylan O'Ryan; Roelof Versteeg; Kenneth Williams (2023). Dissolved Inorganic Carbon and Dissolved Organic Carbon Data for the East River Watershed, Colorado (2015-2021) [Dataset]. http://doi.org/10.15485/1660459
    Explore at:
    Dataset updated
    Aug 7, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Wenming Dong; Curtis Beutler; Wendy Brown; Alexander Newman; Dylan O'Ryan; Roelof Versteeg; Kenneth Williams
    Time period covered
    Sep 1, 2015 - Dec 31, 2021
    Area covered
    Description

    This data package contains mean values for dissolved organic carbon (DOC) and dissolved inorganic carbon (DIC) for water samples taken from the East River Watershed in Colorado. The East River is part of the Watershed Function Scientific Focus Area (WFSFA) located in the Upper Colorado River Basin, United States. DOC and DIC concentrations in water samples were determined using a TOC-VCPH analyzer (Shimadzu Corporation, Japan). DOC was analyzed as non-purgeable organic carbon (NPOC) by purging HCl-acidified samples with carbon-free air to remove DIC prior to measurement. After the acidified sample has been sparged, it is injected into a combustion tube filled with oxidation catalyst heated to 680 degrees C. The DOC in samples is combusted to CO2 and measured by a non-dispersive infrared (NDIR) detector. The peak area of the analog signal produced by the NDIR detector is proportional to the DOC concentration of the sample. DIC was determined by acidifying the samples with HCl first and then purging with carbon-free air to release CO2 for analysis by the NDIR detector. All files are labeled by location and variable, and the data reported are the mean values of a minimum of three replicate measurements with a relative standard deviation < 3%. All samples were analyzed under a rigorous quality assurance and quality control (QA/QC) process as detailed in the methods. This data package contains (1) a zip file (dic_npoc_data_2014-2021.zip) containing a total of 261 files: 260 data files of DIC and NPOC data from across the Lawrence Berkeley National Laboratory (LBNL) Watershed Function Scientific Focus Area (SFA), reported in .csv files per location, and a locations.csv (1 file) with latitude and longitude for each location; (2) a file-level metadata (flmd.csv) file that lists each file contained in the dataset with associated metadata; and (3) a data dictionary (dd.csv) file that contains terms/column_headers used throughout the files along with a definition, units, and data type. There are a total of 99 locations containing isotope data.
    Update on 10/07/2020: Updated the data files to remove times from the timestamps, so that only dates remain. The data values have not changed.
    Update on 4/11/2021: Added the Determination of Method Detection Limits (MDLs) for DIC, NPOC and TDN Analyses document, which can be accessed as a PDF or with Microsoft Word.
    Update on 6/10/2022: versioned updates to this dataset were made along with these changes: (1) updated dissolved inorganic carbon and dissolved organic carbon data for all locations up to 2021-12-31, (2) removal of units from column headers in data files, (3) added a row underneath headers to contain units of variables, (4) restructure of units to comply with CSV reporting format requirements, (5) added -9999 for empty numerical cells, and (6) addition of the file-level metadata (flmd.csv) and data dictionary (dd.csv) files to comply with the File-Level Metadata Reporting Format.
    Update on 2022-09-09: Updates were made to reporting format specific files (file-level metadata and data dictionary) to correct swapped file names, add additional details on metadata descriptions in both files, add a header_row column to enable parsing, and add version number and date to file names (v2_20220909_flmd.csv and v2_20220909_dd.csv).

  18. UWB Positioning and Tracking Data Set

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jul 3, 2025
    Cite
    Zenodo (2025). UWB Positioning and Tracking Data Set [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-8280736?locale=nl
    Explore at:
    Available download formats: unknown
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    UWB Positioning and Tracking Data Set UWB positioning data set contains measurements from four different indoor environments. The data set contains measurements that can be used for range-based positioning evaluation in different indoor environments. # Measurement system The measurements were made using 9 DW1000 UWB transceivers (DWM1000 modules) connected to the networked RaspberryPi computer using in-house radio board SNPN_UWB. 8 nodes were used as positioning anchor nodes with fixed locations in individual indoor environment and one node was used as a mobile positioning tag. Each UWB node is designed arround the RaspberryPi computer and are wirelessly connected to the measurement controller (e.g. laptop) using Wi-Fi and MQTT communication technologies. All tag positions were generated beforehand to as closelly resemble the human walking path as possible. All walking path points are equally spaced to represent the equidistand samples of a walking path in a time-domain. The sampled walking path (measurement TAG positions) are included in a downloadable data set file under downloads section. # Folder structure Folder structure is represented below this text. Folder contains four subfolders named by the indoor environments measured during the measurement campaign and a folder raw_data where raw measurement data is saved. Each environment folder has a anchors.csv file with anchor names and locations, .json file data.json with measurements, file walking_path.csv file with tag positions and subfolder floorplan with floorplan.dxf (AutoCAD format), floorplan.png and floorplan_track.jpg. Subfolder raw_data contains raw data in subfolders named by the four indor environments where the measurements were taken. Each location subfolder contains a subfolder data where data from each tag position from the walking_path.csv is collected in a separate folder. There is exactly the same number of folders in data folder as is the number of measurement points in the walking_path.csv. Each measurement subfolder contains 48 .csv files named by communication channel and anchor used for those measurements. For example: ch1_A1.csv contains all measurements at selected tag location with anchor A1 on UWB channel ch1. The location folder contains also anchors.csv and walking_path.csv files which are identical to the files mentioned previously. The last folder in the data set is the technical_validation folder, where results of technical validation of the data set are collected. They are separated into 8 subfolders: - cir_min_max_mean - los_nlos - positioning_wls - range - range_error - range_error_A6 - range_error_histograms - rss The organization of the data set is the following: data_set + location0 - anchors.csv - data.json - walking_path.csv + floorplan - floorplan.dxf - floorplan.png - floorplan_track.jpg - walking_path.csv + location1 - ... + location2 - ... + location3 - ... + raw_data + location0 + data + 1.07_9.37_1.2 - ch1_A1.csv - ch7_A8.csv - ... + 1.37_9.34_1.2 - ... + ... + location1 + ... + location2 + ... + location3 + ... + technical validation + cir_min_max_mean + positioning_wls + range + range_error + range_error_histograms + rss - LICENSE - README # Data format Raw measurements are saved in .csv files. Each file starts with a header, where first line represents the version of the file and the second line represents the data column names. The column names have a missing column name. 
The actual column names included in the .csv files are: TAG_ID, ANCHOR_ID, X_TAG, Y_TAG, Z_TAG, X_ANCHOR, Y_ANCHOR, Z_ANCHOR, NLOS, RANGE, FP_INDEX, RSS, RSS_FP, FP_POINT1, FP_POINT2, FP_POINT3, STDEV_NOISE, CIR_POWER, MAX_NOISE, RXPACC, CHANNEL_NUMBER, FRAME_LENGTH, PREAMBLE_LENGTH, BITRATE, PRFR, PREAMBLE_CODE, CIR (the channel impulse response starts with this column; all columns until the end of the line belong to it).

# Availability of CODE

The code for data analysis and preprocessing of all data available in this data set is published on GitHub: https://github.com/KlemenBr/uwb_positioning.git. The code is licensed under the Apache License 2.0.

# Authors and License

The author of the data set in this repository is Klemen Bregar, klemen.bregar@ijs.si. This work is licensed under a Creative Commons Attribution 4.0 International License.

# Funding

The research leading to the data collection has been partially funded by the European Horizon 2020 Programme project eWINE under grant agreement No. 688116, by the Slovenian Research Agency under grant numbers P2-0016 and J2-2507, and by the bilateral project with grant number BI-ME/21-22-007.
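For a quick look at a single raw measurement file, the Python sketch below loads the fixed columns and the channel impulse response samples. The file path is illustrative and the pandas-based parsing is an assumption based on the format described above; the official analysis code lives in the GitHub repository linked above.

import pandas as pd

# Illustrative path to one raw measurement file (tag position 1.07_9.37_1.2,
# anchor A1, UWB channel 1); adjust to your local copy of the data set.
path = "raw_data/location0/data/1.07_9.37_1.2/ch1_A1.csv"

# Line 1 holds the file version and line 2 the (incomplete) column names,
# so both are skipped and the fixed column names are assigned explicitly.
meta_cols = ["TAG_ID", "ANCHOR_ID", "X_TAG", "Y_TAG", "Z_TAG",
             "X_ANCHOR", "Y_ANCHOR", "Z_ANCHOR", "NLOS", "RANGE",
             "FP_INDEX", "RSS", "RSS_FP", "FP_POINT1", "FP_POINT2",
             "FP_POINT3", "STDEV_NOISE", "CIR_POWER", "MAX_NOISE",
             "RXPACC", "CHANNEL_NUMBER", "FRAME_LENGTH", "PREAMBLE_LENGTH",
             "BITRATE", "PRFR", "PREAMBLE_CODE"]

df = pd.read_csv(path, skiprows=2, header=None)
meta = df.iloc[:, :len(meta_cols)].copy()
meta.columns = meta_cols
cir = df.iloc[:, len(meta_cols):].to_numpy()   # channel impulse response samples
ranges = meta["RANGE"].to_numpy()              # measured ranges to anchor A1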

  19. Z

    KGCW 2024 Challenge @ ESWC 2024

    • data.niaid.nih.gov
    • investigacion.usc.gal
    • +1more
    Updated Jun 11, 2024
    + more versions
    Cite
    Dimou, Anastasia (2024). KGCW 2024 Challenge @ ESWC 2024 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10721874
    Explore at:
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Iglesias, Ana
    Van Assche, Dylan
    Dimou, Anastasia
    Serles, Umutcan
    Chaves-Fraga, David
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Knowledge Graph Construction Workshop 2024: challenge

Knowledge graph construction from heterogeneous data has seen a lot of uptake in the last decade, from compliance to performance optimizations with respect to execution time. However, metrics other than execution time, such as CPU or memory usage, are rarely considered when comparing knowledge graph construction systems. This challenge aims at benchmarking systems to find which RDF graph construction system optimizes for metrics such as execution time, CPU, memory usage, or a combination of these metrics.

    Task description

The task is to reduce and report the execution time and computing resources (CPU and memory usage) for the parameters listed in this challenge, compared to the state of the art of existing tools and the baseline results provided by this challenge. The challenge is not limited to execution time, i.e. creating the fastest pipeline, but also covers computing resources, i.e. achieving the most efficient pipeline.

We provide a tool which can execute such pipelines end-to-end. This tool also collects and aggregates, as CSV files, the metrics necessary for this challenge, such as execution time and CPU and memory usage. Moreover, information about the hardware used during the execution of the pipeline is available as well, to allow fairly comparing different pipelines. Your pipeline should consist of Docker images which can be executed on Linux to run the tool. The tool has already been tested with existing systems, relational databases (e.g. MySQL and PostgreSQL), and triplestores (e.g. Apache Jena Fuseki and OpenLink Virtuoso), which can be combined in any configuration. It is strongly encouraged to use this tool for participating in this challenge. If you prefer to use a different tool or our tool imposes technical requirements you cannot solve, please contact us directly.

    Track 1: Conformance

The set of new specifications for the RDF Mapping Language (RML), established by the W3C Community Group on Knowledge Graph Construction, provides a set of test-cases for each module:

    RML-Core

    RML-IO

    RML-CC

    RML-FNML

    RML-Star

These test-cases are evaluated in this Track of the Challenge to determine their feasibility, correctness, etc. by applying them in implementations. This Track is in Beta status because these new specifications have not seen any implementations yet, so it may contain bugs and issues. If you find problems with the mappings, output, etc., please report them to the corresponding repository of each module.

    Note: validating the output of the RML Star module automatically through the provided tooling is currently not possible, see https://github.com/kg-construct/challenge-tool/issues/1.

    Through this Track we aim to spark development of implementations for the new specifications and improve the test-cases. Let us know your problems with the test-cases and we will try to find a solution.

    Track 2: Performance

    Part 1: Knowledge Graph Construction Parameters

These parameters are evaluated using synthetically generated data to gain more insight into their influence on the pipeline; a toy data-generator sketch follows the Data list below.

    Data

    Number of data records: scaling the data size vertically by the number of records with a fixed number of data properties (10K, 100K, 1M, 10M records).

    Number of data properties: scaling the data size horizontally by the number of data properties with a fixed number of data records (1, 10, 20, 30 columns).

    Number of duplicate values: scaling the number of duplicate values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of empty values: scaling the number of empty values in the dataset (0%, 25%, 50%, 75%, 100%).

    Number of input files: scaling the number of datasets (1, 5, 10, 15).
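The official synthetic data is provided with the challenge; the sketch below merely illustrates what scaling the number of records, properties, duplicate values, and empty values means for a generated CSV. All file, function, and parameter names are assumptions.

import csv
import random

def generate_csv(path, n_records=10_000, n_properties=10,
                 duplicate_ratio=0.0, empty_ratio=0.0, seed=42):
    """Write a synthetic CSV with a header row, illustrating the four data
    parameters above (records, properties, duplicate values, empty values)."""
    rng = random.Random(seed)
    header = ["id"] + [f"p{i}" for i in range(1, n_properties + 1)]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for i in range(n_records):
            # A "duplicate" record repeats the values of the first record.
            row_id = 0 if rng.random() < duplicate_ratio else i
            values = ["" if rng.random() < empty_ratio else f"v{row_id}_{j}"
                      for j in range(n_properties)]
            writer.writerow([row_id] + values)

# Example: 10K records, 10 properties, 25% duplicate values, no empty values.
generate_csv("synthetic_10k_10.csv", duplicate_ratio=0.25)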

    Mappings

    Number of subjects: scaling the number of subjects with a fixed number of predicates and objects (1, 10, 20, 30 TMs).

    Number of predicates and objects: scaling the number of predicates and objects with a fixed number of subjects (1, 10, 20, 30 POMs).

Number and type of joins: scaling the number of joins and the type of joins (1-1, N-1, 1-N, N-M).

    Part 2: GTFS-Madrid-Bench

The GTFS-Madrid-Bench provides insights into the pipeline with real data from the public transport domain in Madrid.

    Scaling

    GTFS-1 SQL

    GTFS-10 SQL

    GTFS-100 SQL

    GTFS-1000 SQL

    Heterogeneity

    GTFS-100 XML + JSON

    GTFS-100 CSV + XML

    GTFS-100 CSV + JSON

    GTFS-100 SQL + XML + JSON + CSV

    Example pipeline

The ground truth dataset and baseline results are generated in different steps for each parameter:

    The provided CSV files and SQL schema are loaded into a MySQL relational database.

Mappings are executed by accessing the MySQL relational database to construct a knowledge graph in the N-Triples RDF format.

The pipeline is executed 5 times, and the median execution time of each step is calculated. Each step with the median execution time is then reported in the baseline results with all its measured metrics. The knowledge graph construction timeout is set to 24 hours. The execution is performed with the following tool: https://github.com/kg-construct/challenge-tool; you can adapt the execution plans of this example pipeline to your own needs.
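As a rough illustration of that aggregation step (the actual collection and reporting is done by the challenge tool; the CSV layout below, including the file and column names, is an assumption), the per-step medians over the five runs could be computed like this:

import pandas as pd

# Hypothetical layout: one row per (run, step) with the measured metrics;
# the files actually produced by the challenge tool may differ.
runs = pd.read_csv("metrics.csv")   # columns: run, step, execution_time_s,
                                    # cpu_time_s, mem_min_mb, mem_max_mb

# Median of each metric per step across the five runs, as described above.
summary = runs.groupby("step")[["execution_time_s", "cpu_time_s",
                                "mem_min_mb", "mem_max_mb"]].median()
print(summary)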

    Each parameter has its own directory in the ground truth dataset with thefollowing files:

    Input dataset as CSV.

    Mapping file as RML.

    Execution plan for the pipeline in metadata.json.

    Datasets

    Knowledge Graph Construction Parameters

    The dataset consists of:

    Input dataset as CSV for each parameter.

    Mapping file as RML for each parameter.

    Baseline results for each parameter with the example pipeline.

    Ground truth dataset for each parameter generated with the example pipeline.

    Format

All input datasets are provided as CSV; depending on the parameter being evaluated, the number of rows and columns may differ. The first row is always the header of the CSV.

    GTFS-Madrid-Bench

    The dataset consists of:

Input dataset as CSV with SQL schema for the scaling, and a combination of XML, CSV, and JSON for the heterogeneity.

    Mapping file as RML for both scaling and heterogeneity.

    SPARQL queries to retrieve the results.

    Baseline results with the example pipeline.

    Ground truth dataset generated with the example pipeline.

    Format

CSV datasets always have a header as their first row. JSON and XML datasets have their own schema.

    Evaluation criteria

Submissions must evaluate the following metrics; a measurement sketch follows the list:

    Execution time of all the steps in the pipeline. The execution time of a step is the difference between the begin and end time of a step.

    CPU time as the time spent in the CPU for all steps of the pipeline. The CPU time of a step is the difference between the begin and end CPU time of a step.

Minimal and maximal memory consumption for each step of the pipeline. The minimal and maximal memory consumption of a step are the minimum and maximum of the memory consumption measured during the execution of that step.
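The Python sketch below shows one way such per-step metrics could be measured for a child process using psutil. It is an illustration only, not the challenge tool's implementation; the sampling approach, interval, and names are assumptions.

import time
import psutil

def run_step(cmd, poll_interval=0.1):
    """Run one pipeline step and report wall-clock execution time,
    CPU time, and minimal/maximal memory consumption (RSS)."""
    start = time.monotonic()
    proc = psutil.Popen(cmd)
    mem_min, mem_max, cpu_time = float("inf"), 0.0, 0.0
    while proc.poll() is None:
        try:
            rss = proc.memory_info().rss
            cpu = proc.cpu_times()
            cpu_time = cpu.user + cpu.system          # last observed CPU time
            mem_min, mem_max = min(mem_min, rss), max(mem_max, rss)
        except psutil.NoSuchProcess:
            break                                     # step finished between polls
        time.sleep(poll_interval)
    if mem_min == float("inf"):
        mem_min = mem_max                             # step ended before first sample
    return {"execution_time_s": time.monotonic() - start,
            "cpu_time_s": cpu_time,
            "mem_min_bytes": mem_min,
            "mem_max_bytes": mem_max}

# Example with a placeholder step.
print(run_step(["sleep", "1"]))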

    Expected output

    Duplicate values

    Scale Number of Triples

    0 percent 2000000 triples

    25 percent 1500020 triples

    50 percent 1000020 triples

    75 percent 500020 triples

    100 percent 20 triples

    Empty values

    Scale Number of Triples

    0 percent 2000000 triples

    25 percent 1500000 triples

    50 percent 1000000 triples

    75 percent 500000 triples

    100 percent 0 triples

    Mappings

    Scale Number of Triples

    1TM + 15POM 1500000 triples

    3TM + 5POM 1500000 triples

    5TM + 3POM 1500000 triples

    15TM + 1POM 1500000 triples

    Properties

    Scale Number of Triples

    1M rows 1 column 1000000 triples

    1M rows 10 columns 10000000 triples

    1M rows 20 columns 20000000 triples

    1M rows 30 columns 30000000 triples

    Records

    Scale Number of Triples

    10K rows 20 columns 200000 triples

    100K rows 20 columns 2000000 triples

    1M rows 20 columns 20000000 triples

    10M rows 20 columns 200000000 triples

    Joins

    1-1 joins

    Scale Number of Triples

    0 percent 0 triples

    25 percent 125000 triples

    50 percent 250000 triples

    75 percent 375000 triples

    100 percent 500000 triples

    1-N joins

    Scale Number of Triples

    1-10 0 percent 0 triples

    1-10 25 percent 125000 triples

    1-10 50 percent 250000 triples

    1-10 75 percent 375000 triples

    1-10 100 percent 500000 triples

    1-5 50 percent 250000 triples

    1-10 50 percent 250000 triples

    1-15 50 percent 250005 triples

    1-20 50 percent 250000 triples

N-1 joins

    Scale Number of Triples

    10-1 0 percent 0 triples

    10-1 25 percent 125000 triples

    10-1 50 percent 250000 triples

    10-1 75 percent 375000 triples

    10-1 100 percent 500000 triples

    5-1 50 percent 250000 triples

    10-1 50 percent 250000 triples

    15-1 50 percent 250005 triples

    20-1 50 percent 250000 triples

    N-M joins

    Scale Number of Triples

    5-5 50 percent 1374085 triples

    10-5 50 percent 1375185 triples

    5-10 50 percent 1375290 triples

    5-5 25 percent 718785 triples

    5-5 50 percent 1374085 triples

    5-5 75 percent 1968100 triples

    5-5 100 percent 2500000 triples

    5-10 25 percent 719310 triples

    5-10 50 percent 1375290 triples

    5-10 75 percent 1967660 triples

    5-10 100 percent 2500000 triples

    10-5 25 percent 719370 triples

    10-5 50 percent 1375185 triples

    10-5 75 percent 1968235 triples

    10-5 100 percent 2500000 triples

    GTFS Madrid Bench

    Generated Knowledge Graph

    Scale Number of Triples

    1 395953 triples

    10 3959530 triples

    100 39595300 triples

    1000 395953000 triples
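A quick way to sanity-check a generated knowledge graph against the triple counts above is to count the lines of the N-Triples output. This is only a sketch; it assumes canonical N-Triples output with one triple per non-empty, non-comment line, while the official comparison is handled by the challenge tooling.

def count_ntriples(path):
    """Count triples in an N-Triples file: one triple per non-empty,
    non-comment line (assumes canonical N-Triples output)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f
                   if line.strip() and not line.lstrip().startswith("#"))

# Example: the GTFS-1 graph should contain 395953 triples.
# print(count_ntriples("gtfs1.nt") == 395953)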

    Queries

    Query Scale 1 Scale 10 Scale 100 Scale 1000

    Q1 58540 results 585400 results No results available No results available

Q2 636 results 11998 results 125565 results 1261368 results

    Q3 421 results 4207 results 42067 results 420667 results

    Q4 13 results 130 results 1300 results 13000 results

    Q5 35 results 350 results 3500 results 35000 results

    Q6 1 result 1 result 1 result 1 result

    Q7 68 results 67 results 67 results 53 results

    Q8 35460 results 354600 results No results available No results available

    Q9 130 results 1300

  20. m

    Medical radar signal dataset

    • data.mendeley.com
    Updated Nov 29, 2021
    Cite
    Keisuke Edanami (2021). Medical radar signal dataset [Dataset]. http://doi.org/10.17632/6rp6wrd2pr.2
    Explore at:
    Dataset updated
    Nov 29, 2021
    Authors
    Keisuke Edanami
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Data information
Subjects: 9 subjects (5 males, 4 females, mean age 24±5 years)
Data format: raw signals (.lvm and .csv files)
File structure: The first 22 lines form a header and line 23 holds the variable names of each column; the signals follow from line 24 onward. The information stored in each column is as follows: column 1: time; column 2: 24 GHz radar I-channel; column 3: 24 GHz radar Q-channel; column 4: 10 GHz radar I-channel; column 5: respiratory band signal; column 6: ECG signal.
Sampling rate: 1000 Hz
Measurement time: 10 min

    Also refer to the MATLAB code that pre-processes the signals and calculates the respiration rate and heart rates.
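For readers without MATLAB, a minimal Python sketch for loading one of the CSV recordings is shown below. The file name is illustrative and the descriptive column names are chosen for readability; the parsing follows the file structure described above.

import pandas as pd

# Illustrative file name; use one of the .csv recordings from the data set.
path = "subject01.csv"

# Lines 1-22 are the header and line 23 holds the variable names,
# so the first 22 lines are skipped and line 23 is read as the header row.
df = pd.read_csv(path, skiprows=22)
df.columns = ["time", "radar24_I", "radar24_Q",
              "radar10_I", "respiration", "ecg"]   # descriptive names

fs = 1000                    # Hz, sampling rate given above
duration_s = len(df) / fs    # about 600 s for a 10 min recording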

