36 datasets found
  1. Demo dataset for: SPACEc, a streamlined, interactive Python workflow for...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Jul 8, 2024
    Cite
    Yuqi Tan; Tim Kempchen (2024). Demo dataset for: SPACEc, a streamlined, interactive Python workflow for multiplexed image processing and analysis [Dataset]. http://doi.org/10.5061/dryad.brv15dvj1
    Available download formats: zip
    Dataset updated
    Jul 8, 2024
    Dataset provided by
    Stanford University School of Medicine
    Authors
    Yuqi Tan; Tim Kempchen
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Multiplexed imaging technologies provide insights into complex tissue architectures. However, challenges arise due to software fragmentation with cumbersome data handoffs, inefficiencies in processing large images (8 to 40 gigabytes per image), and limited spatial analysis capabilities. To efficiently analyze multiplexed imaging data, we developed SPACEc, a scalable end-to-end Python solution that handles image extraction, cell segmentation, and data preprocessing, and incorporates machine-learning-enabled, multi-scaled spatial analysis, operated through a user-friendly and interactive interface. The demonstration dataset was derived from a previous analysis and contains TMA cores from a human tonsil and tonsillitis sample that were acquired with the Akoya PhenoCycler Fusion platform. The dataset can be used to test the workflow and establish it on a user's system, or to familiarize oneself with the pipeline.

    Methods

    Tissue samples: Tonsil cores were extracted from a larger multi-tumor tissue microarray (TMA), which included a total of 66 unique tissues (51 malignant and semi-malignant tissues, as well as 15 non-malignant tissues). Representative tissue regions were annotated on corresponding hematoxylin and eosin (H&E)-stained sections by a board-certified surgical pathologist (S.Z.). Annotations were used to generate the 66 cores, each 1 mm in diameter. FFPE tissue blocks were retrieved from the tissue archives of the Institute of Pathology, University Medical Center Mainz, Germany, and the Department of Dermatology, University Medical Center Mainz, Germany. The multi-tumor TMA block was sectioned at 3 µm thickness onto SuperFrost Plus microscopy slides before being processed for CODEX multiplex imaging as previously described.

    CODEX multiplexed imaging and processing: To run the CODEX machine, the slide was taken from the storage buffer and placed in PBS for 10 minutes to equilibrate. After drying the PBS with a tissue, a flow cell was sealed onto the tissue slide. The assembled slide and flow cell were then placed in a PhenoCycler Buffer made from 10X PhenoCycler Buffer & Additive for at least 10 minutes before starting the experiment. A 96-well reporter plate was prepared with each reporter corresponding to the correct barcoded antibody for each cycle, with up to 3 reporters per cycle per well. The fluorescence reporters were mixed with 1X PhenoCycler Buffer, Additive, nuclear-staining reagent, and assay reagent according to the manufacturer's instructions. With the reporter plate and assembled slide and flow cell placed into the CODEX machine, the automated multiplexed imaging experiment was initiated. Each imaging cycle included steps for reporter binding, imaging of three fluorescent channels, and reporter stripping to prepare for the next cycle and set of markers. This was repeated until all markers were imaged. After the experiment, a .qptiff image file containing individual antibody channels and the DAPI channel was obtained.

    Data preprocessing: Image stitching, drift compensation, deconvolution, and cycle concatenation were performed in the Akoya PhenoCycler software. The raw imaging data (qptiff, 377.442 nm per pixel for 20x CODEX) were first examined with QuPath software (https://qupath.github.io/) to inspect staining quality; markers with unexpected patterns or low signal-to-noise ratios were excluded from further analysis. A custom CODEX analysis pipeline was used to process all acquired CODEX data (scripts available upon request). The qptiff files were then converted into tiff files for input into SPACEc, tissue detection (watershed algorithm), and cell segmentation.
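
    A minimal sketch of the qptiff-to-tiff conversion step, assuming the .qptiff is TIFF-compatible and readable with the tifffile package (file names are illustrative, not part of the dataset):

    import tifffile

    # Read the multi-channel acquisition (antibody channels + DAPI).
    with tifffile.TiffFile("tonsil_core.qptiff") as qptiff:
        stack = qptiff.asarray()  # shape: (channels, height, width)

    # Write a plain (Big)TIFF for input into SPACEc.
    tifffile.imwrite("tonsil_core.tiff", stack, bigtiff=True)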

  2. Pre-Processed Power Grid Frequency Time Series

    • data.subak.org
    csv
    Updated Feb 16, 2023
    + more versions
    Cite
    Zenodo (2023). Pre-Processed Power Grid Frequency Time Series [Dataset]. https://data.subak.org/dataset/pre-processed-power-grid-frequency-time-series
    Available download formats: csv
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Description

    Overview

    This repository contains ready-to-use frequency time series as well as the corresponding pre-processing scripts in python. The data covers three synchronous areas of the European power grid:

    • Continental Europe
    • Great Britain
    • Nordic

    This work is part of the paper "Predictability of Power Grid Frequency" [1]. Please cite this paper when using the data or the code. For detailed documentation of the pre-processing procedure, we refer to the supplementary material of the paper.

    Data sources

    We downloaded the frequency recordings from publicly available repositories of three different Transmission System Operators (TSOs).

    • Continental Europe [2]: We downloaded the data from the German TSO TransnetBW GmbH, which retains the copyright on the data but allows republication upon request [3].
    • Great Britain [4]: The download was supported by National Grid ESO Open Data, which belongs to the British TSO National Grid. They publish the frequency recordings under the NGESO Open License [5].
    • Nordic [6]: We obtained the data from the Finnish TSO Fingrid, which provides the data under the open license CC-BY 4.0 [7].

    Content of the repository

    A) Scripts

    1. In the "Download_scripts" folder you will find three scripts to automatically download frequency data from the TSO's websites.
    2. In "convert_data_format.py" we save the data with corrected timestamp formats. Missing data is marked as NaN (processing step (1) in the supplementary material of [1]).
    3. In "clean_corrupted_data.py" we load the converted data and identify corrupted recordings. We mark them as NaN and clean some of the resulting data holes (processing step (2) in the supplementary material of [1]).

    The Python scripts run with Python 3.7 and with the packages listed in "requirements.txt".

    B) Yearly converted and cleansed data

    The folders "

    • File type: The files are zipped csv-files, where each file comprises one year.
    • Data format: The files contain two columns. The first column contains the time stamps in the format Year-Month-Day Hour-Minute-Second, given as naive local time; the second column contains the frequency values in Hz. The local time refers to the following time zones and includes Daylight Saving Time (python time zone in brackets):
      • TransnetBW: Continental European Time (CET)
      • Nationalgrid: Great Britain (GB)
      • Fingrid: Finland (Europe/Helsinki)
    • NaN representation: We mark corrupted and missing data as "NaN" in the csv-files.
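
    A minimal sketch of loading one yearly file with pandas (the file name is illustrative, and the absence of a header row is an assumption):

    import pandas as pd

    # Two columns as described above: naive local timestamp and frequency in Hz;
    # corrupted/missing samples appear as NaN.
    df = pd.read_csv("2019.csv.zip", header=None, names=["time", "frequency_hz"],
                     index_col="time", parse_dates=["time"])
    print(f"missing/corrupted share: {df['frequency_hz'].isna().mean():.2%}")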

    Use cases

    We point out that this repository can be used in two different ways:

    • Use pre-processed data: You can directly use the converted or the cleansed data. Note, however, that both datasets include segments of NaN values due to missing and corrupted recordings. Only a very small share of the NaN values was eliminated in the cleansed data, so as not to manipulate the data too much.

    • Produce your own cleansed data: Depending on your application, you might want to cleanse the data in a custom way. You can easily add your custom cleansing procedure in "clean_corrupted_data.py" and then produce cleansed data from the raw data in "

    License

    This work is licensed under multiple licenses, which are located in the "LICENSES" folder.

    • We release the code in the folder "Scripts" under the MIT license.
    • The pre-processed data in the subfolders "**/Fingrid" and "**/Nationalgrid" are licensed under CC-BY 4.0.
    • TransnetBW originally did not publish their data under an open license. We have explicitly received the permission to publish the pre-processed version from TransnetBW. However, we cannot publish our pre-processed version under an open license due to the missing license of the original TransnetBW data.

    Changelog

    Version 2:

    • Add time zone information to description
    • Include new frequency data
    • Update references
    • Change folder structure to yearly folders

    Version 3:

    • Correct TransnetBW files for missing data in May 2016
  3. DATS 6401 - Final Project - Yon ho Cheong.zip

    • figshare.com
    zip
    Updated Dec 15, 2018
    Cite
    Yon ho Cheong (2018). DATS 6401 - Final Project - Yon ho Cheong.zip [Dataset]. http://doi.org/10.6084/m9.figshare.7471007.v1
    Available download formats: zip
    Dataset updated
    Dec 15, 2018
    Dataset provided by
    figshare
    Authors
    Yon ho Cheong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied for by international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and trends among foreign applicants and employers in H1B visa applications. Because locations, employers, job titles, and salary ranges make up most of the H1B petitions, different visualization tools are used to analyze and interpret H1B visa trends and provide recommendations to applicants. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; it analyzes the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) to trace how the H1B visa has changed over the past several decades.

    Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js

    Dataset: The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.
    Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
    Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm

    Running the code: Open Index.html

    Data processing:
    • Preprocess the raw data into an understandable format.
    • Find and combine external datasets to enrich the analysis, such as the FY2017 dataset.
    • Develop the required variables and compile them into the visualization programs.
    • Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
    • Extract relevant aspects and analyze changes in employers' preferences as well as forecasts of future trends.

    Visualizations:
    • Combo chart: shows the overall volume of receipts and the approval rate.
    • Scatter plot: shows the beneficiary country of birth.
    • Geo map: shows all states with H1B petitions filed.
    • Line chart: shows the top 10 states with H1B petitions filed.
    • Pie chart: compares education level and occupations for petitions, FY2011 vs FY2017.
    • Tree map: shows the top employers who submit the greatest number of applications.
    • Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
    • Highlight table: shows the mean wage of a Data Scientist and Data Analyst with case status certified.
    • Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.

    Related research:
    • The H-1B Visa Debate, Explained (Harvard Business Review): https://hbr.org/2017/05/the-h-1b-visa-debate-explained
    • Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
    • Key facts about the U.S. H-1B visa program (Pew Research Center): http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
    • H1B visa news and updates (The Economic Times): https://economictimes.indiatimes.com/topic/H1B-visa/news
    • H-1B visa (Wikipedia): https://en.wikipedia.org/wiki/H-1B_visa

    Key findings:
    • The analysis shows the government cutting down the number of H1B approvals in 2017.
    • Over the past decade, owing to the demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from Asian countries such as China and India.
    • Technical jobs such as Computer Systems Analyst and Software Developer make up the majority of the top 10 jobs among foreign workers.
    • Employers located in metro areas strive to find foreign workers who can fill their technical positions.
    • States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are prime locations for foreign workers and provide many job opportunities.
    • Top companies submitting the most H1B visa applications, such as Infosys, Tata, and IBM India, are India-based companies in software and IT services.
    • The Data Scientist position has experienced exponential growth in H1B visa applications, with jobs clustered most heavily in the West region.

    Visualization programs: HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
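
    As a hedged illustration of the data-processing step, a pandas sketch along these lines could reproduce the "top 10 states" summary (the file and column names are assumptions based on the Kaggle dataset description, not confirmed by this report):

    import pandas as pd

    df = pd.read_csv("h1b_kaggle.csv")                  # ~3M petition records (assumed file name)
    df = df.dropna(subset=["WORKSITE", "CASE_STATUS"])
    # WORKSITE is assumed to be "CITY, STATE"; keep the state part.
    df["STATE"] = df["WORKSITE"].str.split(",").str[-1].str.strip()

    top10 = (df[df["CASE_STATUS"] == "CERTIFIED"]
             .groupby("STATE").size()
             .nlargest(10))
    print(top10)                                        # feeds the "top 10 states" line chart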

  4. Data from: S1 Data -

    • plos.figshare.com
    zip
    Updated Oct 11, 2023
    Cite
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang (2023). S1 Data - [Dataset]. http://doi.org/10.1371/journal.pone.0292466.s001
    Available download formats: zip
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yancong Zhou; Wenyue Chen; Xiaochen Sun; Dandan Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analyzing customers' characteristics and giving early warning of customer churn based on machine learning algorithms can help enterprises provide targeted marketing strategies and personalized services, and save substantial operating costs. Data cleaning, oversampling, data standardization, and other preprocessing operations were performed in Python on a dataset of 900,000 telecom customers' personal characteristics and historical behavior. Appropriate model parameters were selected to build a BPNN (Back Propagation Neural Network). Random Forest (RF) and AdaBoost, two classic ensemble learning models, were introduced, and an AdaBoost dual-ensemble learning model with RF as the base learner was put forward. These four models, together with four other classical machine learning models (decision tree, naive Bayes, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM)), were used to analyze the customer churn data. The results show that the four models have better performance in terms of recall rate, precision rate, F1 score, and other indicators, and the RF-AdaBoost dual-ensemble model has the best performance. Among them, the recall rates of the BPNN, RF, AdaBoost, and RF-AdaBoost dual-ensemble models on positive samples are 79%, 90%, 89%, and 93%, respectively; the precision rates are 97%, 99%, 98%, and 99%; and the F1 scores are 87%, 95%, 94%, and 96%. The RF-AdaBoost dual-ensemble model has the best performance, and its three indicators are 10%, 1%, and 6% higher than the reference. The prediction results of customer churn provide strong data support for telecom companies to adopt appropriate retention strategies for pre-churn customers and reduce customer churn.
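
    A minimal sketch of the RF-AdaBoost dual-ensemble idea with scikit-learn (>= 1.2 for the `estimator` argument); the synthetic data and hyperparameters are illustrative assumptions, not the paper's settings:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Imbalanced toy data standing in for the (unavailable) churn dataset.
    X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # AdaBoost with a Random Forest as the base learner: the "dual ensemble".
    rf_adaboost = AdaBoostClassifier(
        estimator=RandomForestClassifier(n_estimators=50, max_depth=8),
        n_estimators=20,
    )
    rf_adaboost.fit(X_tr, y_tr)
    print(classification_report(y_te, rf_adaboost.predict(X_te)))  # recall/precision/F1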

  5. Adult dataset preprocessed

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2024
    Cite
    Adult dataset preprocessed [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12533513
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Schuster, Verena
    Pustozerova, Anastasia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The files "adult_train.csv" and "adult_test.csv" contain preprocessed versions of the Adult dataset from the USI repository.

    The file "adult_preprocessing.ipynb" contains a python notebook file with all the preprocessing steps used to generate "adult_train.csv" and "adult_test.csv" from the original Adult dataset.

    The preprocessing steps include:

    • One-hot encoding of categorical values
    • Imputation of missing values using a KNN imputer with k=1
    • Standard scaling of ordinal attributes

    Note: we assume the scenario in which the test set is available before training (every attribute besides the target, "income"); therefore, we combine the train and test sets before preprocessing.
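
    A hedged sketch of these steps with pandas and scikit-learn (the raw file names and the numeric column list are illustrative assumptions):

    import pandas as pd
    from sklearn.impute import KNNImputer
    from sklearn.preprocessing import StandardScaler

    df = pd.concat([pd.read_csv("adult_train_raw.csv"),    # assumed raw file names
                    pd.read_csv("adult_test_raw.csv")])    # combined before preprocessing
    y = df.pop("income")                                   # target kept aside

    df = pd.get_dummies(df)                                # one-hot-encode categoricals
    df = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(df),
                      columns=df.columns)                  # impute missing values, k=1
    num_cols = ["age", "hours-per-week"]                   # assumed ordinal attributes
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])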

  6. Storage and Transit Time Data and Code

    • zenodo.org
    zip
    Updated Nov 15, 2024
    + more versions
    Cite
    Andrew Felton (2024). Storage and Transit Time Data and Code [Dataset]. http://doi.org/10.5281/zenodo.14171251
    Available download formats: zip
    Dataset updated
    Nov 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrew Felton
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Author: Andrew J. Felton
    Date: 11/15/2024

    This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:

    "Global estimates of the storage and transit time of water through vegetation"

    Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated throughout the peer review process.

    #Data information:

    The data folder contains key data sets used for analysis. In particular:

    "data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.

    #Code information

    Python scripts can be found in the "supporting_code" folder.

    Each R script in this project has a role:

    "01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).

    "02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.

    "03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
    `source()` function in the 01_start.R script.

    "04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.

    "supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.

    "supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.

  7. Dataset for 'Stream Temperature Predictions for River Basin Management in...

    • data.nceas.ucsb.edu
    • search.dataone.org
    • +1 more
    Updated Aug 8, 2023
    Cite
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan (2023). Dataset for 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water 2022 [Dataset]. http://doi.org/10.15485/1854257
    Dataset updated
    Aug 8, 2023
    Dataset provided by
    ESS-DIVE
    Authors
    Helen Weierbach; Aranildo R. Lima; Jared D. Willard; Valerie C. Hendrix; Danielle S. Christianson; Misha Lubich; Charuleka Varadharajan
    Time period covered
    Jan 1, 1980 - Jun 30, 2021
    Area covered
    Description

    This data package presents forcing data, model code, and model output for classical machine learning models that predict monthly stream water temperature, as presented in the manuscript 'Stream Temperature Predictions for River Basin Management in the Pacific Northwest and Mid-Atlantic Regions Using Machine Learning', Water (Weierbach et al., 2022). Specifically, for input forcing datasets we include two files, each generated using the BASIN-3D data integration tool (Varadharajan et al., 2022) for stations in the Pacific Northwest and Mid-Atlantic hydrologic regions. Model code (written in Python with the use of Jupyter notebooks) includes code for data preprocessing; training Multiple Linear Regression, Support Vector Regression, and Extreme Gradient Boosted Tree models; and additional notebooks for analysis of model output. We include specific model output files representing the modeling configurations presented in the manuscript, provided in HDF5 format. Together, these data make up the workflow for predictions across three scenarios (single station, regional, and predictions in unmonitored basins) presented in the manuscript and allow for reproducibility of the modeling procedures.

  8. Enhancing Stock Market Forecasting with Machine Learning A PineScript-Driven...

    • ieee-dataport.org
    • dataverse.harvard.edu
    Updated Nov 19, 2024
    + more versions
    Cite
    Gautam Narla (2024). Enhancing Stock Market Forecasting with Machine Learning A PineScript-Driven Approach [Dataset]. http://doi.org/10.21227/8cbk-bc40
    Dataset updated
    Nov 19, 2024
    Dataset provided by
    IEEE Dataport
    Authors
    Gautam Narla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study investigates the application of machine learning (ML) models in stock market forecasting, with a focus on their integration using PineScript, a domain-specific language for algorithmic trading. Leveraging diverse datasets, including historical stock prices and market sentiment data, we developed and tested various ML models such as neural networks, decision trees, and linear regression. Rigorous backtesting over multiple timeframes and market conditions allowed us to evaluate their predictive accuracy and financial performance. The neural network model demonstrated the highest accuracy, achieving a 75% success rate, significantly outperforming traditional models. Additionally, trading strategies derived from these ML models yielded a return on investment (ROI) of up to 12%, compared to an 8% benchmark index ROI. These findings underscore the transformative potential of ML in refining trading strategies, providing critical insights for financial analysts, investors, and developers. The study draws on insights from 15 peer-reviewed articles, financial datasets, and industry reports, establishing a robust foundation for future exploration of ML-driven financial forecasting.

    Tools and Technologies Used:
    • PineScript: a scripting language integrated within the TradingView platform; it was the primary tool used to develop and implement the machine learning models. Its robust features allowed for custom indicator creation, strategy backtesting, and real-time market data analysis.
    • Python: used for data preprocessing, model training, and performance evaluation.

  9. Results for pseudo-3D Stokes simulations with a geometry-informed drag term...

    • darus.uni-stuttgart.de
    Updated Dec 20, 2024
    + more versions
    Cite
    David Krach; Felix Weinhardt; Mingfeng Wang; Martin Schneider; Holger Class; Holger Steeb (2024). Results for pseudo-3D Stokes simulations with a geometry-informed drag term formulation for porous media with varying apertures [Dataset]. http://doi.org/10.18419/DARUS-4347
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    DaRUS
    Authors
    David Krach; Felix Weinhardt; Mingfeng Wang; Martin Schneider; Holger Class; Holger Steeb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    Content: This dataset includes all the necessary input data to fully replicate the benchmarks and applications in Krach et al. (2024). In addition, the results for all solvers are provided both as raw simulation output and in condensed form.

    Complete datasets: 01_CYLINDER, 02_SINGLE_PRECIPITATE, 03_SEGMENT1, 04_SEGMENT2, 05_SEGMENT3. Each dataset consists of all relevant input data for poremaps and DuMux as well as the results of both solvers. A README contains relevant information such as voxel size, geometry size, and boundary conditions. Pseudo-3D input data and results comprise different folders, each containing a different drag formulation. Input data for DuMux may be generated from the 3D input data using the localdrag python module; see the README or the source code for detailed information. For further details on the input files, please refer to the software documentation of the corresponding DuMux module (Pseudo-3D-Stokes Module) and poremaps. For input, evaluation, or modification of the simulation results, I/O routines, interpolation algorithms, and analysis tools are provided in the localdrag python module.

    Condensed results: All permeabilities and relative errors for all domains can be found in condensed form (.csv files) in 06_EVALUATION_PERMEABILITIES. To plot these with the scripts provided, the functionality of the localdrag python module is required.

    Related datasets and repositories: POREMAPS (git repository, DaRUS dataset); DuMux (website, git repository); image dataset of micromodel with precipitation (DaRUS dataset, research paper); localdrag python module for preprocessing and data handling (git repository, DaRUS dataset).

  10. DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip

    • frontiersin.figshare.com
    zip
    Updated Oct 25, 2023
    + more versions
    Cite
    Yuta Nagano; Benjamin Chain (2023). DataSheet_2_tidytcells: standardizer for TR/MH nomenclature.zip [Dataset]. http://doi.org/10.3389/fimmu.2023.1276106.s002
    Available download formats: zip
    Dataset updated
    Oct 25, 2023
    Dataset provided by
    Frontiers
    Authors
    Yuta Nagano; Benjamin Chain
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    T cell receptors (TR) underpin the diversity and specificity of T cell activity. As such, TR repertoire data is valuable both as an adaptive immune biomarker and as a way to identify candidate therapeutic TR. Analysis of TR repertoires relies heavily on computational analysis, and therefore it is of vital importance that the data is standardized and computer-readable. However, in practice, the usage of different abbreviations and non-standard nomenclature in different datasets makes this data pre-processing non-trivial. tidytcells is a lightweight, platform-independent Python package that provides easy-to-use standardization tools specifically designed for TR nomenclature. The software is open-sourced under the MIT license and is available to install from the Python Package Index (PyPI). At the time of publishing, tidytcells is on version 2.0.0.
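
    A hedged usage sketch (the tr/mh submodules and standardize functions are assumed from the v2 API naming; consult the package documentation on PyPI for the exact signatures):

    import tidytcells as tt

    # Assumed v2 entry points for TR and MH (MHC) nomenclature standardization.
    print(tt.tr.standardize("TRBV20/OR9-2*01"))  # normalize a TR gene symbol
    print(tt.mh.standardize("HLA-A*01:01"))      # normalize an MH gene symbol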

  11. VegeNet - Image datasets and Codes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 27, 2022
    Cite
    Jo Yen Tan (2022). VegeNet - Image datasets and Codes [Dataset]. http://doi.org/10.5281/zenodo.7254508
    Available download formats: zip
    Dataset updated
    Oct 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jo Yen Tan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Compilation of python codes for data preprocessing and VegeNet building, as well as image datasets (zip files).

    Image datasets:

    1. vege_original: images of vegetables captured manually in the data acquisition stage
    2. vege_cropped_renamed: images in (1) cropped to remove background areas, with image labels renamed
    3. non-vege images: images of non-vegetable foods, so the CNN can recognize foods other than vegetables
    4. food_image_dataset: complete set of vege (2) and non-vege (3) images for architecture building
    5. food_image_dataset_split: image dataset (4) split into train and test sets (see the sketch after this list)
    6. process: intermediate images created during cropping (pre-processing step) to create dataset (2)
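
    A minimal sketch of how the split in (5) could be recreated from dataset (4) using the split-folders package (directory names come from the list above; the 80/20 ratio is an assumption):

    import splitfolders  # pip install split-folders

    # Split class-subfolder image directories into train/test subsets.
    splitfolders.ratio("food_image_dataset", output="food_image_dataset_split",
                       seed=42, ratio=(0.8, 0.2))  # assumed train/test ratio
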
  12. Curated list of HAR datasets

    • zenodo.org
    • data.niaid.nih.gov
    bin, text/x-python
    Updated May 18, 2020
    Cite
    Matej Králik (2020). Curated list of HAR datasets [Dataset]. http://doi.org/10.5281/zenodo.3831958
    Available download formats: bin, text/x-python
    Dataset updated
    May 18, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matej Králik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A curated list of preprocessed Human Activity Recognition datasets, ready to use in under a minute.

    All the datasets are preprocessed in HDF5 format, created using the h5py python library. Scripts used for data preprocessing are provided as well (Load.ipynb and load_jordao.py).

    Each HDF5 file contains at least the keys:

    • x: a single array of size [sample count, temporal length, sensor channel count], containing the actual sensor data. Metadata contains the names of the individual sensor channels. All samples are zero-padded to a constant length in the file; the original lengths before padding are available under the meta keys.
    • y: a single array of size [sample count] with integer values for the target classes (zero-based). Metadata contains the names of the target classes.
    • meta: contains various metadata, depending on the dataset (original length before padding, subject no., trial no., etc.)

    Usage example

    import h5py
    
    with h5py.File('data/waveglove_multi.h5', 'r') as h5f:
       x = h5f['x']
       y = h5f['y']['class']
       print(f'WaveGlove-multi: {x.shape[0]} samples')
       print(f'Sensor channels: {h5f["x"].attrs["channels"]}')
       print(f'Target classes: {h5f["y"].attrs["labels"]}')
       first_sample = x[0]
    # Output:   
    # WaveGlove-multi: 10044 samples
    # Sensor channels: ['acc1-x' 'acc1-y' 'acc1-z' 'gyro1-x' 'gyro1-y' 'gyro1-z' 'acc2-x'
    # 'acc2-y' 'acc2-z' 'gyro2-x' 'gyro2-y' 'gyro2-z' 'acc3-x' 'acc3-y'
    # 'acc3-z' 'gyro3-x' 'gyro3-y' 'gyro3-z' 'acc4-x' 'acc4-y' 'acc4-z'
    # 'gyro4-x' 'gyro4-y' 'gyro4-z' 'acc5-x' 'acc5-y' 'acc5-z' 'gyro5-x'
    # 'gyro5-y' 'gyro5-z']
    # Target classes: ['null' 'hand swipe left' 'hand swipe right' 'pinch in' 'pinch out'
    # 'thumb double tap' 'grab' 'ungrab' 'page flip' 'peace' 'metal']
    

    Current list of datasets:

    • WaveGlove-single (waveglove_single.h5)
    • WaveGlove-multi (waveglove_multi.h5)
    • uWave (uwave.h5)
    • OPPORTUNITY (opportunity.h5)
    • PAMAP2 (pamap2.h5)
    • SKODA (skoda.h5)
    • MHEALTH (non overlapping windows) (mhealth.h5)
    • Six datasets with all four predefined train/test folds
      as preprocessed by Jordao et al. originally in WearableSensorData
      (FNOW, LOSO, LOTO and SNOW prefixed .h5 files)
  13. LSMI Dataset

    • paperswithcode.com
    Updated Nov 15, 2022
    Cite
    Dongyoung Kim; Jinwoo Kim; Seonghyeon Nam; Dongwoo Lee; Yeonkyung Lee; Nahyup Kang; Hyong-Euk Lee; ByungIn Yoo; Jae-Joon Han; Seon Joo Kim (2022). LSMI Dataset [Dataset]. https://paperswithcode.com/dataset/lsmi
    Dataset updated
    Nov 15, 2022
    Authors
    Dongyoung Kim; Jinwoo Kim; Seonghyeon Nam; Dongwoo Lee; Yeonkyung Lee; Nahyup Kang; Hyong-Euk Lee; ByungIn Yoo; Jae-Joon Han; Seon Joo Kim
    Description

    Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination (ICCV 2021)

    Change Log (LSMI Dataset Version: 1.1)

    1.0 : LSMI dataset released. (Aug 05, 2021)

    1.1 : Add option for saving sub-pair images for 3-illuminant scene (ex. _1,_12,_13) & saving subtracted image (ex. _2,_3,_23) (Feb 20, 2022)

    About [Paper] [Project site] [Download Dataset] [Video]

    This is an official repository of "Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm under Mixed Illumination", which is accepted as a poster in ICCV 2021.

    This repository provides:
    1. Preprocessing code for the "Large Scale Multi Illuminant (LSMI) Dataset"
    2. Code for the pixel-level illumination inference U-Net
    3. Pre-trained model parameters for testing the U-Net

    If you use our code or dataset, please cite our paper:

    @inproceedings{kim2021large,
      title={Large Scale Multi-Illuminant (LSMI) Dataset for Developing White Balance Algorithm Under Mixed Illumination},
      author={Kim, Dongyoung and Kim, Jinwoo and Nam, Seonghyeon and Lee, Dongwoo and Lee, Yeonkyung and Kang, Nahyup and Lee, Hyong-Euk and Yoo, ByungIn and Han, Jae-Joon and Kim, Seon Joo},
      booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
      pages={2410--2419},
      year={2021}
    }

    Requirements. Our running environment is as follows:
    • Python 3.8.3
    • PyTorch 1.7.0
    • CUDA 11.2

    We provide a docker image, which supports all extra requirements (e.g., dcraw, rawpy, tensorboard), including the specified versions of Python, PyTorch, and CUDA above.

    You can download the docker image here.

    The following instructions are assumed to run in a docker container that uses the docker image we provided.

    Getting Started

    Clone this repo: in the docker container, clone this repository first.

    git clone https://github.com/DY112/LSMI-dataset.git

    Download the LSMI dataset: You should first download the LSMI dataset from here.

    The dataset is composed of 3 sub-folders named "galaxy", "nikon", and "sony".

    Folders named after each camera include several scenes, and each scene folder contains full-resolution RAW files and JPG files converted to sRGB color space.

    Move all three folders to the root of the cloned repository.

    In each sub-folder, we provide metadata (meta.json) and a train/val/test scene index (split.json).

    In meta.json, we provide the following information:

    • NumOfLights: number of illuminants in the scene
    • MCCCoord: locations of the Macbeth color chart
    • Light1,2,3: normalized chromaticities of each illuminant (calculated by running 1_make_mixture_map.py)

    Preprocess the LSMI dataset

    Convert raw images to tiff files

    To convert the original 1-channel Bayer-pattern images to 3-channel RGB tiff images, run the following code:

    python 0_cvt2tiff.py

    You should modify the SOURCE and EXT variables appropriately.

    The converted tiff files are generated at the same location as the source file.

    This process uses the DCRAW command with '-h -D -4 -T' as options.

    No black-level subtraction, saturated-pixel clipping, or other processing is applied.

    You can change the parameters as appropriate for your purpose.
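
    Conceptually, the conversion step amounts to calling dcraw with those options on every RAW file; a hedged sketch (the paths and .dng extension are illustrative, controlled by SOURCE and EXT in 0_cvt2tiff.py):

    import glob
    import subprocess

    for raw in glob.glob("galaxy/*/*.dng"):  # illustrative source pattern
        subprocess.run(["dcraw", "-h", "-D", "-4", "-T", raw], check=True)
        # dcraw writes the converted .tiff next to each source file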

    Make mixture map

    python 1_make_mixture_map.py

    Change the CAMERA variable to the target directory you want.

    This code does the following operations for each scene:

    • Subtract the black level (no saturation clipping)
    • Find each illuminant's chromaticities using the achromatic patches of the Macbeth Color Chart
    • Calculate the pixel-level illuminant mixture map using green-channel pixel values
    • Mask uncalculable pixel positions (which have 0 as value for all scene pairs) to ZERO_MASK

    After running this code, mixture map data in .npy format will be generated in each scene's directory.

    :warning: If you run this code with ZERO_MASK=-1, the full-resolution mixture map may contain -1 for uncalculable pixels. You MUST replace this value appropriately before resizing, to prevent this negative value from being interpolated with other values.

    Crop for train/test U-Net (optional)

    python 2_preprocess_data.py

    This preprocessing code is written only for U-Net, so you can skip this step and freely process the full resolution LSMI set (tiff and npy files).

    The image and the mixture map are resized to a square whose side length is given by the SIZE variable inside the code, and the ground-truth image is also generated.

    Note that the side of the image will be cropped to make the image shape square.

    If you don't want to crop the side of the image and just want to resize whole image anyway, use SQUARE_CROP=False

    We set the default test size to 256 and the train size to 512, with SQUARE_CROP=True.

    The new dataset is created in a folder named CAMERA_SIZE (e.g., galaxy_512).

    Use U-Net for pixel-level AWB: You can download the pre-trained model parameters here.

    The pre-trained model is trained on 512x512 data with random-crop and random pixel-level relighting augmentation.

    Place the downloaded models folder into SVWB_Unet.

    Test U-Net

    cd SVWB_Unet
    sh test.sh

    Train U-Net

    cd SVWB_Unet
    sh train.sh

    Dataset License: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/).

  14. mnist_784

    • openml.org
    Updated Sep 29, 2014
    Cite
    Yann LeCun; Corinna Cortes; Christopher J.C. Burges (2014). mnist_784 [Dataset]. https://www.openml.org/d/554
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2014
    Authors
    Yann LeCun; Corinna Cortes; Christopher J.C. Burges
    Description

    Author: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
    Source: MNIST Website - Date unknown
    Please cite:

    The MNIST database of handwritten digits, with 784 features; raw data available at http://yann.lecun.com/exdb/mnist/. It can be split into a training set of the first 60,000 examples and a test set of 10,000 examples.

    It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.

    With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's Special Database 3 (SD-3) and Special Database 1 (SD-1). NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found in the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test set among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets.

    The MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.
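
    This OpenML entry can be loaded directly with scikit-learn's fetch_openml:

    from sklearn.datasets import fetch_openml

    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X_train, X_test = X[:60000], X[60000:]   # the conventional 60k/10k split
    y_train, y_test = y[:60000], y[60000:]
    print(X_train.shape, X_test.shape)       # (60000, 784) (10000, 784)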

  15. Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment...

    • zenodo.org
    zip
    Updated Jan 6, 2023
    Cite
    Miha Mohorčič; Aleš Simončič; Mihael Mohorčič; Andrej Hrovat (2023). Dataset of IEEE 802.11 probe requests from an uncontrolled urban environment [Dataset]. http://doi.org/10.5281/zenodo.7509280
    Available download formats: zip
    Dataset updated
    Jan 6, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Miha Mohorčič; Aleš Simončič; Mihael Mohorčič; Andrej Hrovat
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame part of PRs consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of a mobile device, such as supported data rates.

    This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.

    It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.

    Related dataset

    The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, with the same data layout and recording equipment.


    Measurement setup

    The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device).
    Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.

    The following information about each received PR is collected:
    - MAC address
    - supported data rates
    - extended supported rates
    - HT capabilities
    - extended capabilities
    - data under extended tag and vendor specific tag
    - interworking
    - VHT capabilities
    - RSSI
    - SSID
    - timestamp when the PR was received

    The collected data was forwarded to a remote database via a secure VPN connection.
    A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.


    Data preprocessing


    The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database.
    For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:

    PR_IE_data =
    {
      'DATA_RTS': {'SUPP': DATA_supp , 'EXT': DATA_ext},
      'HT_CAP': DATA_htcap,
      'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
      'VHT_CAP': DATA_vhtcap,
      'INTERWORKING': DATA_inter,
      'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext ...},
      'VENDOR_SPEC': {VENDOR_1:{
                    'ID_1': DATA_1_vendor1,
                    'ID_2': DATA_2_vendor1
                    ...},
              VENDOR_2:{
                    'ID_1': DATA_1_vendor2,
                    'ID_2': DATA_2_vendor2
                    ...}
              ...}
    }


    Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
    Missing IE fields in the captured PR are not included in PR_IE_DATA.

    When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:

    {'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },

    where PR_data is structured as follows:

    {
      'TIME': [ DATA_time ],
      'RSSI': [ DATA_rssi ],
      'DATA': PR_IE_data
    }.

    This data structure allows storing only 'TIME' and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored.
    The data of the newly detected PR is compared with the already stored data of the same MAC in the current scan time interval.
    If identical PR's IE data from the same MAC address is already stored, only data for the keys 'TIME' and 'RSSI' are appended.
    If identical PR's IE data from the same MAC address has not yet been received, then the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key.
    The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png

    At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
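
    A hedged sketch of reading one gateway file and counting unique MAC addresses (the top-level layout of each *.json file, assumed here to be a list of scan-interval captures each holding the per-MAC entries described above, should be verified against Single_PR_capture_example.json):

    import json

    with open("2022-09-22T22-00-00_2022-09-23T22-00-00/1.json") as f:
        captures = json.load(f)

    # Assumed nesting: file -> captures -> per-MAC entries with a 'MAC' key.
    macs = {entry["MAC"] for capture in captures for entry in capture}
    print("unique MAC addresses at location 1:", len(macs))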


    Folder structure

    For ease of processing of the data, the dataset is divided into 7 folders, each containing a 24-hour period.
    Each folder contains four files, each containing samples from that device.

    The folders are named after the start and end time (in UTC).
    For example, the folder [2022-09-22T22-00-00_2022-09-23T22-00-00](2022-09-22T22-00-00_2022-09-23T22-00-00) contains samples collected from the 23rd of September 2022, 00:00 local time, until the 24th of September 2022, 00:00 local time.

    Files representing their location via mapping:
    - 1.json -> location 1
    - 2.json -> location 2
    - 3.json -> location 3
    - 4.json -> location 4

    Environments description

    The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo.
    The gateway devices (rPIs with WiFi dongle) were set up and gathering data before the start time of this dataset.
    As of September 23, 2022, the devices were placed in their final configuration and personally checked for correctness of installation and data status of the entire data collection system.
    Devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.

    Four Raspberry Pis were used:
    - location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
    - location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
    - location 3 -> northernmost window in the building of Via Etnea near Piazza Università
    - location 4 -> first window to the right of the entrance of the University of Catania

    Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access).
    Under ideal circumstances, the locations of the devices and their coverage area would cover both squares and the part of Via Etna between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.

    Known dataset shortcomings

    Due to technical and physical limitations, the dataset contains some identified deficiencies.

    PRs are collected and transmitted in 10-second chunks.
    Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.

    Every 20 minutes the service is restarted on the recording device.
    This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding.
    For this reason, up to 20 seconds of data will not be recorded in each 20-minute period.

    The devices had a scheduled reboot at 4:00 each day, which shows up as up to a few minutes of missing data.

    Location 1 - Piazza del Duomo - Chierici

    The gateway device (rPi) is located on the second floor balcony and is hardwired to the Ethernet port. This device appears to function stably throughout the data collection period.
    Its location is constant and undisturbed; the dataset appears to have complete coverage.

    Location 2 - Via Etnea - Piazza del Duomo

    The device is located inside the building.
    During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed.
    As the device was moved back and forth, power outages and internet connection issues occurred.
    The last three days in the record contain no PRs from this location.

    Location 3 - Via Etnea - Piazza Università

    Similar to Location 2, the device is placed on the windowsill and moved around by people working in the building.
    Similar behavior is also observed, e.g., it is placed on the windowsill and moved inside a thick wall when no people are present.
    This device appears to have been collecting data throughout the whole dataset period.

    Location 4 - Piazza Università

    This location is wirelessly connected to the access point.
    The device was placed statically on a windowsill overlooking the square.
    Due to physical limitations, the device had lost power several times during the deployment.
    The internet connection was also interrupted sporadically.

    Recognitions

    The data was collected within the scope of the Resiloc project, with the help of the City of Catania and the project partners.

  16. ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023)

    • datacatalogue.cessda.eu
    • search.gesis.org
    • +1 more
    Updated Oct 17, 2023
    + more versions
    Cite
    Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan (2023). ClaimsKG - A Knowledge Graph of Fact-Checked Claims (January, 2023) [Dataset]. http://doi.org/10.7802/2620
    Explore at:
    Dataset updated
    Oct 17, 2023
    Dataset provided by
    GESIS - Leibniz-Institut für Sozialwissenschaften
    GESIS - Leibniz-Institut für Sozialwissenschaften & Heinrich-Heine-University Düsseldorf
    LGI2P / IMT Mines Ales / University of Montpellier
    Institute of Computer Science, FORTH-ICS
    LIRMM / University of Montpellier
    Authors
    Gangopadhyay, Susmita; Schellhammer, Sebastian; Boland, Katarina; Schüller, Sascha; Todorov, Konstantin; Tchechmedjiev, Andon; Zapilko, Benjamin; Fafalios, Pavlos; Jabeen, Hajira; Dietze, Stefan
    Measurement technique
    Web scraping
    Description

    ClaimsKG is a knowledge graph of metadata information for fact-checked claims scraped from popular fact-checking sites. In addition to providing a single dataset of claims and associated metadata, truth ratings are harmonized and additional information is provided for each claim, e.g., about mentioned entities. Please see https://data.gesis.org/claimskg/ for further details about the data model, query examples, and statistics.

    The dataset facilitates structured queries about claims, their truth values, involved entities, authors, dates, and other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline, which harvests claim-related data from popular fact-checking web sites, annotates them with related entities from DBpedia/Wikipedia, and lifts all data to RDF using established vocabularies (such as schema.org).
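
    Since the graph is lifted to RDF, it can be queried with SPARQL, for instance through the SPARQLWrapper Python package. The sketch below is illustrative only: the endpoint URL and the property paths are assumptions based on the schema.org vocabulary mentioned above, so consult https://data.gesis.org/claimskg/ for the authoritative data model and query examples.

    # Hedged sketch: list a few fact-checked claim texts from ClaimsKG.
    # Endpoint URL and property paths are assumptions; verify them against
    # the data model documented at https://data.gesis.org/claimskg/
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://data.gesis.org/claimskg/sparql")  # assumed endpoint
    sparql.setQuery("""
      PREFIX schema: <http://schema.org/>
      SELECT ?claim ?text WHERE {
        ?review a schema:ClaimReview ;
            schema:itemReviewed ?claim .
        ?claim schema:text ?text .
      } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
      print(row["text"]["value"])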

    The latest release of ClaimsKG covers 74066 claims and 72127 claim reviews. This is the fourth release of the dataset, with data scraped up to Jan 31, 2023, containing claims published between 1996 and 2023 from 13 fact-checking websites: Fullfact, Politifact, TruthOrFiction, Checkyourfact, Vishvanews, AFP (French), AFP, Polygraph, EU factcheck, Factograph, Fatabyyano, Snopes, and Africacheck. The claim-review (fact-checking) period likewise ranges from 1996 to 2023. As in the previous release, the Entity fishing Python client (https://github.com/hirmeos/entity-fishing-client-python) was used for entity linking and disambiguation. Improvements have been made in the web scraping and data preprocessing pipeline to extract more entities from both claims and claim reviews. Currently, ClaimsKG contains 3408386 entities detected and referenced with DBpedia.

    This latest release of ClaimsKG supersedes the previous versions, as it contains all claims from the previous versions together with the newly added claims, as well as improved entity annotation resulting in a higher number of entities.

  17. Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jun 30, 2022
    Cite
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. http://doi.org/10.5281/zenodo.6568778
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

    We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1000 subfolders, one per ImageNet output class.

    An example of how to use the dataset is shown below.

    # Code for testing the robustness of a pretrained model against the
    # ImageNet-Patch images: measures accuracy on the true labels
    # ("robust accuracy") and how often the patch target class is
    # predicted ("attack success rate").
    import os

    import torch
    import torch.utils.data
    from torchvision import datasets, models, transforms


    class ImageFolderWithEmptyDirs(datasets.ImageFolder):
      """
      Required for handling empty folders from the ImageFolder class:
      every target-label folder ships all 1000 ImageNet class subfolders,
      many of which are empty.
      """

      def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
          raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        # Index by position in the full class list so labels keep their
        # ImageNet indices, but only map non-empty folders.
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx


    # extract and unzip the dataset, then write the top folder here
    dataset_folder = 'data/ImageNet-Patch'

    available_labels = {
      487: 'cellular telephone',
      513: 'cornet',
      546: 'electric guitar',
      585: 'hair spray',
      804: 'soap dispenser',
      806: 'sock',
      878: 'typewriter keyboard',
      923: 'plate',
      954: 'banana',
      968: 'cup'
    }

    # select the folder with a specific patch target
    target_label = 954

    dataset_folder = os.path.join(dataset_folder, str(target_label))
    normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
    # Renamed from `transforms` to avoid shadowing the torchvision module.
    transform = transforms.Compose([
      transforms.ToTensor(),
      normalizer
    ])

    dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transform)
    model = models.resnet50(pretrained=True)
    loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
    model.eval()

    batches = 10
    correct, attack_success, total = 0, 0, 0
    with torch.no_grad():  # inference only; no gradients needed
      for batch_idx, (images, labels) in enumerate(loader):
        if batch_idx == batches:
          break
        pred = model(images).argmax(dim=1)
        correct += (pred == labels).sum().item()               # predictions matching the true label
        attack_success += (pred == target_label).sum().item()  # predictions hijacked to the patch target
        total += pred.shape[0]

    accuracy = correct / total
    attack_sr = attack_success / total

    print("Robust Accuracy: ", accuracy)
    print("Attack Success: ", attack_sr)

  18. Code and benchmarks for geometry-informed drag term computation for...

    • darus.uni-stuttgart.de
    Updated Dec 20, 2024
    Cite
    David Krach; Felix Weinhardt; Mingfeng Wang; Martin Schneider; Holger Class; Holger Steeb (2024). Code and benchmarks for geometry-informed drag term computation for pseudo-3D Stokes simulations with varying apertures [Dataset]. http://doi.org/10.18419/DARUS-4313
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    DaRUS
    Authors
    David Krach; Felix Weinhardt; Mingfeng Wang; Martin Schneider; Holger Class; Holger Steeb
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.18419/DARUS-4313

    Dataset funded by
    DFG
    Description

    Content: This data set includes snapshots of the code used to compute the benchmarks and applications in Krach et al. (2024). All software tools provided enable the user to perform pseudo-3D Stokes simulations with a geometry-informed drag term using DuMux and to determine permeability and volumetric flux, as well as local pressure and velocity fields, for both the domains in the publication and user-defined geometries.

    Pseudo3D_Stokes (DuMux submodule): The Pseudo-3D-Stokes Module for Varying Apertures is a DuMux module developed at research institutions. DuMux is a simulation framework focusing on finite-volume discretization methods, model coupling for multi-physics applications, and flow and transport applications in porous media. This module aims to assist researchers in planning, improving, or interpreting microfluidic experiments through numerical simulations based on the Stokes equations in an easy and intuitive way. It uses .pgm files as input to create numerical grids. These .pgm files should contain 8-bit grayscale values encoding the relative height of a microfluidic cell; they can be created from microscopy images of a microfluidic experiment using suitable image-processing procedures (a short loading sketch follows this description). Based on the .pgm files, the Python module localdrag (see below) should be used to create the suitable drag prefactor fields lambda1 and lambda2. For further details, refer to our publication (see below). Please check out the README for information on requirements, building DuMux including its submodule, and examples.

    localdrag (Python preprocessing module): localdrag is a Python module to create geometry-informed prefactor maps for pseudo-3D Stokes simulations with DuMux based on local pore morphology. localdrag is used as a preprocessing tool for the DuMux module pseudo3D_stokes and is delivered directly when pseudo3D_stokes is cloned with the recurse-submodules option (recommended, see README).

    Related datasets and repositories:

    - POREMAPS, 3D Stokes solver used to create reference solutions: git repository, DaRUS dataset
    - DuMux: website, git repository
    - Pseudo3D_Stokes: git repository
    - Image dataset of micromodel with precipitation: DaRUS dataset, research paper
    - localdrag, Python module for preprocessing and data handling: git repository
    - Input data and results for all domains in Krach et al. (2024): DaRUS dataset
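
    Since the grid input is a plain 8-bit grayscale .pgm, a minimal sketch for reading one and converting grayscale to a relative-height field is shown below; the file name and the linear gray-to-height mapping are illustrative assumptions, not part of the published pipeline.

    # Hedged sketch: read a binary (P5) 8-bit .pgm and map its grayscale
    # values to relative aperture height in [0, 1].
    import numpy as np

    def read_pgm_p5(path):
      with open(path, 'rb') as f:
        assert f.readline().strip() == b'P5'  # binary PGM magic number
        line = f.readline()
        while line.startswith(b'#'):  # skip comment lines
          line = f.readline()
        width, height = map(int, line.split())
        maxval = int(f.readline())
        data = np.frombuffer(f.read(), dtype=np.uint8)
        return data.reshape(height, width), maxval

    pixels, maxval = read_pgm_p5('domain.pgm')  # hypothetical input file
    # Assumed convention: 0 = closed cell, maxval = full relative height.
    relative_height = pixels.astype(float) / maxval
    print(relative_height.shape, relative_height.min(), relative_height.max())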

  19. EEG dataset for the analysis of age-related changes in motor-related...

    • figshare.com
    png
    Updated Nov 19, 2020
    Cite
    Nikita Frolov; Elena Pitsik; Vadim V. Grubov; Anton R. Kiselev; Vladimir Maksimenko; Alexander E. Hramov (2020). EEG dataset for the analysis of age-related changes in motor-related cortical activity during a series of fine motor tasks performance [Dataset]. http://doi.org/10.6084/m9.figshare.12301181.v2
    Explore at:
    Available download formats: png
    Dataset updated
    Nov 19, 2020
    Dataset provided by
    figshare
    Authors
    Nikita Frolov; Elena Pitsik; Vadim V. Grubov; Anton R. Kiselev; Vladimir Maksimenko; Alexander E. Hramov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EEG signals were acquired from 20 healthy right-handed subjects performing a series of fine motor tasks cued by an audio command. The participants were divided equally into two distinct age groups: (i) 10 elderly adults (EA group, aged 55-72, 6 females); (ii) 10 young adults (YA group, aged 19-33, 3 females).

    The active phase of the experimental session included sequential execution of 60 fine motor tasks: squeezing a hand into a fist after the first audio command and holding it until the second audio command (30 repetitions per hand) (see Fig. 1). The duration of the audio command determined the type of motor action to be executed: 0.25 s for left hand (LH) movement and 0.75 s for right hand (RH) movement. The time interval between the two audio signals was selected randomly in the range 4-5 s for each trial. The sequence of motor tasks was randomized, and the pause between tasks was also chosen randomly in the range 6-8 s to exclude possible training or motor-preparation effects caused by the sequential execution of the same tasks.

    Acquired EEG signals were then processed via preprocessing tools implemented in the MNE Python package. Specifically, raw EEG signals were filtered by a Butterworth 5th-order filter in the range 1-100 Hz and by a 50 Hz notch filter. Further, Independent Component Analysis (ICA) was applied to remove ocular and cardiac artifacts. Artifact-free EEG recordings were then segmented into 60 epochs according to the experimental protocol. Each epoch was 14 s long, including 3 s of baseline and 11 s of motor-related brain activity, and was time-locked to the first audio command indicating the start of motor execution. After visual inspection, epochs that still contained artifacts were rejected. Finally, 15 epochs per movement type were stored for each subject.

    Individual epochs for each subject are stored in the attached MNE .fif files. The prefix EA or YA in the file name identifies the age group to which the subject belongs. The postfix LH or RH in the file name indicates the type of motor task.
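
    A minimal sketch for loading one subject's epochs with MNE-Python is shown below; the file name is an assumption based on the naming scheme described above (EA/YA prefix, LH/RH postfix).

    # Hedged sketch: load one subject's motor-task epochs from a .fif file.
    # The file name is hypothetical; adjust it to the actual naming scheme.
    import mne

    epochs = mne.read_epochs('EA_subject01_LH-epo.fif')
    print(epochs)                # number of epochs, channels, time span
    print(epochs.info['sfreq'])  # sampling frequency

    # Average across epochs to obtain the motor-related evoked response.
    evoked = epochs.average()
    data = epochs.get_data()     # shape: (n_epochs, n_channels, n_times)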

  20. Annotations for ConfLab A Rich Multimodal Multisensor Dataset of...

    • data.4tu.nl
    Updated Jun 8, 2022
    Cite
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung (2022). Annotations for ConfLab A Rich Multimodal Multisensor Dataset of Free-Standing Social Interactions In-the-Wild [Dataset]. http://doi.org/10.4121/20017664.v1
    Explore at:
    Dataset updated
    Jun 8, 2022
    Dataset provided by
    4TU.ResearchData
    Authors
    Chirag Raman; Jose Vargas Quiros; Stephanie Tan; Ashraful Islam; Ekin Gedik; Hayley Hung
    License

    https://data.4tu.nl/info/fileadmin/user_upload/Documenten/4TU.ResearchData_Restricted_Data_2022.pdf

    Description

    This file contains the annotations for the ConfLab dataset, including actions (speaking status), pose, and F-formations.

    ------------------

    ./actions/speaking_status:

    ./processed: the processed speaking status files, aggregated into a single data frame per segment. Skipped rows in the raw data (see https://josedvq.github.io/covfee/docs/output for details) have been imputed using the code at: https://github.com/TUDelft-SPC-Lab/conflab/tree/master/preprocessing/speaking_status

    The processed annotations consist of:

    ./speaking: The first row contains person IDs matching the sensor IDs; the remaining rows contain binary speaking status annotations at 60 fps for the corresponding 2-min video segment (7200 frames).

    ./confidence: Same format as above; these annotations reflect the annotators' continuous-valued confidence in their speaking annotations.

    To load these files with pandas: pd.read_csv(p, index_col=False)
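
    For example, a short sketch for loading one processed segment and computing each person's speaking fraction; the file name is hypothetical, and the layout (person IDs in the header row, one binary row per frame) follows the description above.

    # Hedged sketch: per-person speaking fraction for one 2-min segment.
    import pandas as pd

    df = pd.read_csv('speaking/seg1.csv', index_col=False)  # hypothetical path
    # Columns are person IDs; rows are binary speaking status at 60 fps
    # (7200 frames for a 2-minute segment).
    speaking_fraction = df.mean(axis=0)
    print(speaking_fraction.sort_values(ascending=False))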


    ./raw.zip: the raw outputs from speaking status annotation for each of the eight annotated 2-min video segments. These were output by the covfee annotation tool (https://github.com/josedvq/covfee)

    Annotations were done at 60 fps.

    --------------------

    ./pose:

    ./coco: the processed pose files in coco JSON format, aggregated into a single data frame per video segment. These files have been generated from the raw files using the code at: https://github.com/TUDelft-SPC-Lab/conflab-keypoints

    To load in Python: f = json.load(open('/path/to/cam2_vid3_seg1_coco.json'))

    The skeleton structure (limbs) is contained within each file in:

    f['categories'][0]['skeleton']

    and keypoint names at:

    f['categories'][0]['keypoints']

    ./raw.zip: the raw outputs from continuous pose annotation. These were output by the covfee annotation tool (https://github.com/josedvq/covfee)

    Annotations were done at 60 fps.

    ---------------------

    ./f_formations:

    seg 2: 14:00 onwards, for videos of the form x2xxx.MP4 in /video/raw/ for the relevant cameras (2,4,6,8,10).

    seg 3: for videos of the form x3xxx.MP4 in /video/raw/ for the relevant cameras (2,4,6,8,10).

    Note that camera 10 doesn't include meaningful subject information or body parts that are not already covered by camera 8.

    First column: time stamp

    Second column: "()" delineates groups, "<>" delineates subjects, cam X indicates the best camera view for which a particular group exists.


    phone.csv: time stamp (pertaining to seg3), corresponding group, ID of person using the phone
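
    A short sketch for parsing the group column is shown below; the sample row is invented purely to illustrate the "()" / "<>" convention described above, so verify it against the actual files.

    # Hedged sketch: parse one f-formation annotation value into groups
    # of subject IDs with their best camera view. The example string is
    # invented for illustration.
    import re

    row = '(<3><12><25>) cam 4 (<7><9>) cam 8'  # hypothetical value
    groups = []
    for group_str, cam in re.findall(r'\(([^)]*)\)\s*cam\s*(\d+)', row):
      subjects = re.findall(r'<([^>]+)>', group_str)
      groups.append({'subjects': subjects, 'camera': int(cam)})
    print(groups)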
