Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
- data/sample.csv: Sample dataset file.
- data/train.csv: Training dataset.
- data/test.csv: Testing dataset.
- scripts/preprocess.py: Script for preprocessing the dataset.
- scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:

import pandas as pd
df = pd.read_csv('data/sample.csv')
Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Processed PAMAP2 dataset
This dataset is based on the [PAMAP2 Dataset for Physical Activity Monitoring](https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring).
Compared to v0.2.0, this preprocessed dataset contains fewer activities. It only includes: lying, sitting, standing, walking, cycling, vacuum_cleaning and ironing.
The data is processed with the code from [this script](https://github.com/NLeSC/mcfly-tutorial/blob/master/utils/tutorial_pamap2.py), with the following function call:
```python
columns_to_use = ['hand_acc_16g_x', 'hand_acc_16g_y', 'hand_acc_16g_z',
'ankle_acc_16g_x', 'ankle_acc_16g_y', 'ankle_acc_16g_z',
'chest_acc_16g_x', 'chest_acc_16g_y', 'chest_acc_16g_z']
# Activity IDs to exclude, keeping only lying, sitting, standing,
# walking, cycling, vacuum_cleaning and ironing
exclude_activities = [5, 7, 9, 10, 11, 12, 13, 18, 19, 20, 24, 0]
outputpath = tutorial_pamap2.fetch_and_preprocess(directory_to_extract_to, columns_to_use,
                                                  exclude_activities=exclude_activities,
                                                  val_test_size=(100, 1000))
```
## References
A. Reiss and D. Stricker. Introducing a New Benchmarked Dataset for Activity Monitoring. The 16th IEEE International Symposium on Wearable Computers (ISWC), 2012.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These datasets contain a set of news articles in English, French and Spanish extracted from Medisys (i.e., advanced search) according to the following criteria: (1) keywords (at least one of): COVID-19, ncov2019, cov2019, coronavirus; (2) keywords (all of): masque (French), mask (English), máscara (Spanish); (3) periods: March 2020, May 2020, July 2020; (4) countries: UK (English), Spain (Spanish), France (French). A corpus per country has been manually collected (copy/paste) from Medisys. For each country, 100 snippets per period (the 1st, 10th, 15th and 20th of each month) are built. The datasets are composed of: (1) a corpus preprocessed for the BioTex tool - https://gitlab.irstea.fr/jacques.fize/biotex_python (.txt) [~900 texts]; (2) the same corpus preprocessed for the Weka tool - https://www.cs.waikato.ac.nz/ml/weka/ (.arff); (3) terms extracted with BioTex according to spatio-temporal criteria (*.csv) [~9,000 terms]. Other corpora can be collected with this same method. The Perl code used to preprocess the textual data for the terminology extraction (with BioTex) and classification (with Weka) tasks is available. A new version of this dataset (December 2020) includes additional data: Python preprocessing and BioTex code [Execution_BioTex.tgz], and terms extracted with different ranking measures (i.e., C-Value, F-TFIDF-C_M) and methods (i.e., extraction of words and multi-word terms) with the online version of BioTex [Terminology_with_BioTex_online_dec2020.tgz].
Preprocessing data in a reproducible and robust way is one of the current challenges in untargeted metabolomics workflows. Data curation in liquid chromatography-mass spectrometry (LC-MS) involves the removal of unwanted features (retention time, m/z pairs) to retain only high-quality data for subsequent analysis and interpretation. The present work introduces a package for the Python programming language for preprocessing LC-MS data for quality control procedures in untargeted metabolomics workflows. It is a versatile strategy that can be customized or fit for purpose according to the specific metabolomics application. It allows performing quality control procedures to ensure accuracy and reliability in LC-MS measurements, and it allows preprocessing metabolomics data to obtain cleaned matrices for subsequent statistical analysis. The capabilities of the package are showcased with pipelines for an LC-MS system suitability check, system conditioning, signal drift evaluation, and data curation. These applications were implemented to preprocess data corresponding to a new suite of candidate plasma reference materials developed by the National Institute of Standards and Technology (NIST; hypertriglyceridemic, diabetic, and African-American plasma pools) to be used in untargeted metabolomics studies, in addition to NIST SRM 1950 – Metabolites in Frozen Human Plasma. The package offers a rapid and reproducible workflow that can be used in an automated or semi-automated fashion, and it is an open and free tool available to all users.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets contain the raw data and preprocessed data (following the steps in the Jupyter Notebook) of 9 DHT22 sensors in a cold storage room. Details on how the data was gathered can be found in the publication "Self-Adaptive Integration of Distributed Sensor Systems for Monitoring Cold Storage Environments" by Elia Henrichs, Florian Stoll, and Christian Krupitzer.
This dataset consists of the following files:
This resource includes materials for the workshop about configuring and running a NextGen simulation and analyzing model outputs, presented during the 2025 NWCSI Bootcamp.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supercoiling-mediated feedback simulation dataset
Background
These files represent simulation datasets generated for the publication "Supercoiling-mediated feedback rapidly couples and tunes transcription" by Christopher Johnstone and Kate E. Galloway.
All figures in the paper can be replicated by using the code available at https://github.com/GallowayLabMIT/tangles_model (permalink) and these datasets.
File summary
unprocessed_datasets.zip
contains the merged Julia simulation files.
preprocessed_datasets.zip
contains the smaller, preprocessed datasets used for the actual plotting of data figures.
File format
The preprocessed datasets are serialized Pandas dataframes (gzipped Parquet files).
The unprocessed datasets are self-describing HDF/H5 files.
Usage
The main figure-plotting notebook, notebooks/modeling_paper_figures.ipynb, contained in the code repository mentioned above, can use either the unprocessed or the preprocessed datasets. If the preprocessed datasets are present, it will load them directly. If the preprocessed datasets are not present, the notebook will preprocess the data.
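For quick inspection outside the notebook, the sketch below loads one of the serialized dataframes with pandas; the filename is hypothetical and should be replaced with an actual file extracted from preprocessed_datasets.zip.

```python
import pandas as pd

# Load one preprocessed dataframe (gzip compression is handled internally by Parquet).
# The path below is an assumption, not a file name documented by the authors.
df = pd.read_parquet("preprocessed_datasets/example.parquet")
print(df.head())
```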
License
This data is available under a CC-BY 4.0 International License. Please attribute:
Christopher Johnstone (cjohnsto@mit.edu)
Kate E. Galloway (katiegal@mit.edu)
The increasingly high number of big data applications in seismology has made quality control tools to filter, discard, or rank data extremely important. In this framework, machine learning algorithms, already established in several seismic applications, are good candidates to perform the task flexibly and efficiently. sdaas (seismic data/metadata amplitude anomaly score) is a Python library and command line tool for detecting a wide range of amplitude anomalies on any seismic waveform segment, such as recording artifacts (e.g., anomalous noise, peaks, gaps, spikes), sensor problems (e.g., digitizer noise), and metadata field errors (e.g., wrong stage gain in StationXML). The underlying machine learning model, based on the isolation forest algorithm, has been trained and tested on a broad variety of seismic waveforms of different lengths, from local to teleseismic earthquakes to noise recordings, from both broadband sensors and accelerometers. For this reason, the software assures a high degree of flexibility and ease of use: for any given input (a waveform in miniSEED format and its metadata as StationXML, either given as file paths or FDSN URLs), the computed anomaly score is a probability-like numeric value in [0, 1] indicating the degree of belief that the analyzed waveform represents an anomaly (or outlier), where scores ≤ 0.5 indicate no distinct anomaly. sdaas can be employed to filter malformed data in a preprocessing routine, to assign robustness weights, or as a metadata checker by computing scores on randomly selected segments from a given station/channel: in this case, a persistent sequence of high scores clearly indicates problems in the metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The 802.11 standard includes several management features and corresponding frame types. One of them is the Probe Request (PR), which is sent by mobile devices in an unassociated state to scan the nearby area for existing wireless networks. The frame body of a PR consists of variable-length fields, called Information Elements (IEs), which represent the capabilities of a mobile device, such as supported data rates.
This dataset contains PRs collected over a seven-day period by four gateway devices in an uncontrolled urban environment in the city of Catania.
It can be used for various use cases, e.g., analyzing MAC randomization, determining the number of people in a given location at a given time or in different time periods, analyzing trends in population movement (streets, shopping malls, etc.) in different time periods, etc.
Related dataset
The same authors also produced the Labeled dataset of IEEE 802.11 probe requests, which uses the same data layout and recording equipment.
Measurement setup
The system for collecting PRs consists of a Raspberry Pi 4 (RPi) with an additional WiFi dongle to capture WiFi signal traffic in monitoring mode (gateway device). Passive PR monitoring is performed by listening to 802.11 traffic and filtering out PR packets on a single WiFi channel.
The following information about each received PR is collected:
- MAC address
- Supported data rates
- Extended supported rates
- HT capabilities
- Extended capabilities
- Data under the Extended tag and Vendor Specific tag
- Interworking
- VHT capabilities
- RSSI
- SSID
- Timestamp when the PR was received
The collected data was forwarded to a remote database via a secure VPN connection. A Python script was written using the Pyshark package to collect, preprocess, and transmit the data.
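As an illustration of the capture step, the minimal Pyshark sketch below listens for probe requests on a monitor-mode interface. This is not the authors' script; the interface name and display filter are assumptions, and tshark must be installed.

```python
import pyshark

# Capture 802.11 probe requests on a WiFi interface in monitor mode.
# "wlan1mon" is an assumed interface name for the monitoring dongle.
capture = pyshark.LiveCapture(
    interface="wlan1mon",
    display_filter="wlan.fc.type_subtype == 0x0004",  # probe request frames
)

for packet in capture.sniff_continuously(packet_count=10):
    # The transmitter address carries the (possibly randomized) device MAC.
    print(packet.wlan.ta)
```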
Data preprocessing
The gateway collects PRs for each successive predefined scan interval (10 seconds). During this interval, the data is preprocessed before being transmitted to the database. For each detected PR in the scan interval, the IEs fields are saved in the following JSON structure:
    PR_IE_data = {
        'DATA_RTS': {'SUPP': DATA_supp, 'EXT': DATA_ext},
        'HT_CAP': DATA_htcap,
        'EXT_CAP': {'length': DATA_len, 'data': DATA_extcap},
        'VHT_CAP': DATA_vhtcap,
        'INTERWORKING': DATA_inter,
        'EXT_TAG': {'ID_1': DATA_1_ext, 'ID_2': DATA_2_ext, ...},
        'VENDOR_SPEC': {
            VENDOR_1: {'ID_1': DATA_1_vendor1, 'ID_2': DATA_2_vendor1, ...},
            VENDOR_2: {'ID_1': DATA_1_vendor2, 'ID_2': DATA_2_vendor2, ...},
            ...
        }
    }
Supported data rates and extended supported rates are represented as arrays of values that encode information about the rates supported by a mobile device. The rest of the IEs data is represented in hexadecimal format. Vendor Specific Tag is structured differently than the other IEs. This field can contain multiple vendor IDs with multiple data IDs with corresponding data. Similarly, the extended tag can contain multiple data IDs with corresponding data.
Missing IE fields in the captured PR are not included in PR_IE_DATA.
When a new MAC address is detected in the current scan time interval, the data from PR is stored in the following structure:
{'MAC': MAC_address, 'SSIDs': [ SSID ], 'PROBE_REQs': [PR_data] },
where PR_data is structured as follows:
{ 'TIME': [ DATA_time ], 'RSSI': [ DATA_rssi ], 'DATA': PR_IE_data }.
This data structure allows storing only the time of arrival ('TIME') and 'RSSI' for all PRs originating from the same MAC address and containing the same 'PR_IE_data'. All SSIDs from the same MAC address are also stored. The data of a newly detected PR is compared with the already stored data for the same MAC in the current scan time interval. If identical PR IE data from the same MAC address is already stored, only the data for the keys 'TIME' and 'RSSI' is appended. If identical PR IE data from the same MAC address has not yet been received, the PR_data structure of the new PR for that MAC address is appended to the 'PROBE_REQs' key. The preprocessing procedure is shown in Figure ./Figures/Preprocessing_procedure.png.
At the end of each scan time interval, all processed data is sent to the database along with additional metadata about the collected data, such as the serial number of the wireless gateway and the timestamps for the start and end of the scan. For an example of a single PR capture, see the Single_PR_capture_example.json file.
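A minimal sketch of the merging logic described above is shown below; the function and field names are illustrative assumptions and do not come from the actual collection script.

```python
def add_probe_request(interval_store, pr):
    """Merge one decoded probe request into the per-scan-interval structure."""
    entry = interval_store.setdefault(
        pr['MAC'], {'MAC': pr['MAC'], 'SSIDs': [], 'PROBE_REQs': []}
    )
    if pr['SSID'] and pr['SSID'] not in entry['SSIDs']:
        entry['SSIDs'].append(pr['SSID'])
    for pr_data in entry['PROBE_REQs']:
        if pr_data['DATA'] == pr['IE']:          # identical IE data already stored
            pr_data['TIME'].append(pr['TIME'])   # keep only time and RSSI
            pr_data['RSSI'].append(pr['RSSI'])
            return
    entry['PROBE_REQs'].append(
        {'TIME': [pr['TIME']], 'RSSI': [pr['RSSI']], 'DATA': pr['IE']}
    )
```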
Folder structure
For ease of processing, the dataset is divided into 7 folders, each covering a 24-hour period. Each folder contains four files, one per gateway device, each containing the samples collected by that device.
The folders are named after the start and end time (in UTC). For example, the folder 2022-09-22T22-00-00_2022-09-23T22-00-00 contains samples collected from 23 September 2022 00:00 local time until 24 September 2022 00:00 local time.
The files map to locations as follows:
- 1.json -> location 1
- 2.json -> location 2
- 3.json -> location 3
- 4.json -> location 4
Environments description
The measurements were carried out in the city of Catania, in Piazza Università and Piazza del Duomo. The gateway devices (RPis with WiFi dongles) were set up and gathering data before the start time of this dataset. As of September 23, 2022, the devices were placed in their final configuration and the correctness of the installation and the data status of the entire data collection system were personally checked. The devices were connected either to a nearby Ethernet outlet or via WiFi to the access point provided.
Four Raspberry Pis were used:
- location 1 -> Piazza del Duomo - Chierici building (balcony near Fontana dell’Amenano)
- location 2 -> southernmost window in the building of Via Etnea near Piazza del Duomo
- location 3 -> northernmost window in the building of Via Etnea near Piazza Università
- location 4 -> first window to the right of the entrance of the University of Catania
Locations were suggested by the authors and adjusted during deployment based on physical constraints (locations of electrical outlets or internet access). Under ideal circumstances, the locations of the devices and their coverage area would cover both squares and the part of Via Etnea between them, with a partial overlap of signal detection. The locations of the gateways are shown in Figure ./Figures/catania.png.
Known dataset shortcomings
Due to technical and physical limitations, the dataset contains some identified deficiencies.
PRs are collected and transmitted in 10-second chunks. Due to the limited capabilities of the recording devices, some time (in the range of seconds) may not be accounted for between chunks if the transmission of the previous packet took too long or an unexpected error occurred.
Every 20 minutes the service is restarted on the recording device. This is a workaround for undefined behavior of the USB WiFi dongle, which can stop responding. For this reason, up to 20 seconds of data are not recorded in each 20-minute period.
The devices had a scheduled reboot at 4:00 each day, which appears as up to a few minutes of missing data.
Location 1 - Piazza del Duomo - Chierici
The gateway device (RPi) is located on the second-floor balcony and is hardwired to the Ethernet port. This device appears to have functioned stably throughout the data collection period. Its location was constant and undisturbed, and the dataset appears to have complete coverage.
Location 2 - Via Etnea - Piazza del Duomo
The device is located inside the building. During working hours (approximately 9:00-17:00), the device was placed on the windowsill. However, the movement of the device cannot be confirmed. As the device was moved back and forth, power outages and internet connection issues occurred. The last three days in the record contain no PRs from this location.
Location 3 - Via Etnea - Piazza Università
Similar to location 2, the device was placed on the windowsill and moved around by people working in the building. Similar behavior is also observed, e.g., it is placed on the windowsill and then moved inside, behind a thick wall, when no people are present. This device appears to have been collecting data throughout the whole dataset period.
Location 4 - Piazza Università
This location is wirelessly connected to the access point. The device was placed statically on a windowsill overlooking the square. Due to physical limitations, the device lost power several times during the deployment. The internet connection was also interrupted sporadically.
Recognitions
The data was collected within the scope of the Resiloc project with the help of the City of Catania and project partners.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zipped file includes the dataset in .csv file and python scripts used to preprocess video data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a Python script for classifying apple leaf diseases using a Vision Transformer (ViT) model. The dataset used is the Plant Village dataset, which contains images of apple leaves with four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The goal of this project is to classify apple leaf diseases using a Vision Transformer (ViT) model. The dataset is divided into four classes: Healthy, Apple Scab, Black Rot, and Cedar Apple Rust. The script includes data preprocessing, model training, and evaluation steps.
The script uses the following Python libraries: matplotlib, seaborn, numpy, pandas, tensorflow, and sklearn. These libraries are used for data visualization, data manipulation, and building/training the deep learning model. The walk_through_dir function is used to explore the dataset directory structure and count the number of images in each class. The dataset is organized into Train, Val, and Test directories, each containing subdirectories for the four classes. The script uses ImageDataGenerator from Keras to apply data augmentation techniques such as rotation, horizontal flipping, and rescaling to the training data. This helps in improving the model's generalization ability. The model includes a Patches layer that extracts patches from the images; this is a crucial step in Vision Transformers, where images are divided into smaller patches that are then processed by the transformer (see the sketch after this section). Results are visualized with seaborn to provide a clear understanding of the model's predictions.
Dataset Preparation
Organize the dataset into Train, Val, and Test directories, with each directory containing subdirectories for each class (Healthy, Apple Scab, Black Rot, Cedar Apple Rust).
Install Required Libraries
pip install tensorflow matplotlib seaborn numpy pandas scikit-learn
Run the Script
Analyze Results
Fine-Tuning
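As referenced above, the following is a minimal sketch of a patch-extraction layer of the kind used in Vision Transformers, following the standard Keras ViT example; the actual layer in the script may be implemented differently.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Patches(layers.Layer):
    """Splits input images into flattened, non-overlapping square patches."""

    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        patch_dims = patches.shape[-1]
        # Flatten the spatial grid of patches into a sequence of patch vectors.
        return tf.reshape(patches, [batch_size, -1, patch_dims])
```

For example, Patches(16) applied to a batch of 224x224x3 images yields sequences of 196 flattened 16x16 patches per image.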
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages
Overview
BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of the 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from WordProject, a subset of BhasaAnuvaad.
How to use
The datasets library allows you to load and pre-process your dataset in pure Python… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/WordProject.
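A minimal sketch of loading the data with the datasets library is shown below; the split name and use of streaming are assumptions, so check the dataset card for the available configurations and splits.

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front; the "train" split
# name is an assumption and may differ for this repository.
ds = load_dataset("ai4bharat/WordProject", split="train", streaming=True)
print(next(iter(ds)))
```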
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects’ meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (6 using the Huawei Watch 2™ and the MobVoi TicWatch™ for the rest) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.
Description
FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and to continue wearing it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participants, who documented the start and end moments of their meals to the best of their abilities, as well as the hand on which they wore the smartwatch. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional data statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.
$ python stats_dataset.py
$ python viz_dataset.py
FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.
Publications
If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:
@article{kyritsis2020data,
title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
journal={IEEE Journal of Biomedical and Health Informatics},
year={2020},
publisher={IEEE}}
@inproceedings{kyritsis2017automated,
title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
year={2019},
organization={IEEE}}
Technical details
We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:
import pickle as pkl
import numpy as np

with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
    dataset = pkl.load(fh)
The dataset variable in the snippet above is a dictionary with 5 keys. Namely:
'subject_id'
'session_id'
'signals_raw'
'signals_proc'
'meal_gt'
The contents under a specific key can be obtained by:
sub = dataset['subject_id']     # subject id
ses = dataset['session_id']     # session id
raw = dataset['signals_raw']    # raw IMU signals
proc = dataset['signals_proc']  # processed IMU signals
gt = dataset['meal_gt']         # meal ground truth
The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.
sub: list Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject, which can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC’s subject with id equal to 2 is the same person as FreeFIC’s subject with id equal to 2.
ses: list Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session, which can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.
raw: list Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays differ (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android platform.
proc: list Each element of this list is an M × 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each meal. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike the elements of the raw list, the processed measurements (in the proc list) have a constant sampling rate of 100 Hz and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is in line with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The interested researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth them and remove the gravitational component).
meal_gt: list Each element of this list is a K × 2 matrix. Each row represents a meal interval of the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., there is a recording where a participant consumed two meals).
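As an illustration of how the aligned lists can be combined, the sketch below extracts the processed IMU samples recorded during the first meal of the first session. It reuses the variables from the loading snippet above; the slicing logic is an assumption based on the layout described here, not part of the official scripts.

```python
session_idx = 0                    # first in-the-wild session
proc_session = proc[session_idx]   # M x 7 array: time, acc x/y/z, gyro x/y/z
meals = gt[session_idx]            # K x 2 array of meal (start, end) times in seconds

start, end = meals[0]              # first meal interval of this session
timestamps = proc_session[:, 0]
mask = (timestamps >= start) & (timestamps <= end)
meal_samples = proc_session[mask]  # IMU samples recorded during that meal
print(meal_samples.shape)
```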
Ethics and funding
Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.
Contact
Any inquiries regarding the FreeFIC dataset should be addressed to:
Dr. Konstantinos KYRITSIS
Multimedia Understanding Group (MUG)
Department of Electrical & Computer Engineering
Aristotle University of Thessaloniki
University Campus, Building C, 3rd floor
Thessaloniki, Greece, GR54124
Tel: +30 2310 996359, 996365
Fax: +30 2310 996398
E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Geological Survey Ireland has a core scanning suite consisting of a Short-Wave Infra-red (SWIR) camera and a Medium-Wave Infra-red (MWIR) camera. We have over 400 km of drill core in our core store and are in the process of scanning all of it. We currently have ~7 TB of data. This data is freely available, but due to the size of the files please email gsi.corestore[AT]gsi.ie so we can facilitate delivery. This is a sample dataset consisting of 1 box of core: a single core box scanned in the Short-Wave Infra-red range for use with the explanatory notebooks available on our GitHub repository. This data consists of box 25 of drillhole GSI-17-007, 105.98 m to 110.35 m. This box contains the contact between the Ballymore Formation and the Oakport Formation. We are open to collaboration using either the scanner or the data with any of our stakeholders. For questions, issues, suggestions for improvement or to discuss collaboration, please contact Russell Rogers, c/o duty.geologist[AT]gsi.ie. We also have a GitHub repository that hosts notebooks using the sample dataset, explaining some of the methods we have used in Python to pre-process and process our image data:
1. Opening and Starting with Geological Survey Ireland Hyperspectral Data
2. Denoising Geological Survey Ireland Hyperspectral Data
3. Removing the core box from the image
4. Removing the continuum
5. Clustering
The clustering notebook uses the Minisom module, because it is a very lightweight implementation with minimal dependencies, but there are many other SOM implementations available in Python.
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
The Hunter groundwater model. This was created using the preprocess.py script in the "HUN GW Model code v01" acting on the index.xml file contained in the top-level directory of this dataset. The index.xml file contains provenance information for the raw-data (HUN GW Model Mines raw data v01) and tells the preprocessing scripts where to find this raw data. As the groundwater model is successively built using the python scripts, provenance is successively added to the files generated (as headers, or similar data structures). The exception to this is the finite-element mesh (mesh3D/mesh3D.*) which is the main output of the process, and which eventually contains such an enormous amount of provenance that the code suffers from buffer overflows: therefore its provenance is dumped to mesh3D/provenance_dump periodically.
As the scripts gradually build the groundwater model, they modify index.xml: it acts as a journal file for model creation, allowing provenance backtracking when using any standard xml viewer.
The final MOOSE input files are found in the "simulate" directory, and an example of the final MOOSE output is the "HUN GW Model simulate ua999 pawsey v01" dataset.
Created using preprocess.py found in the "HUN GW Model code v01" dataset, acting on the index.xml file, and hence using the raw data found in the "HUN GW Model Mines raw data v01" dataset.
Bioregional Assessment Programme (XXXX) HUN GW Model v01. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/90554dbf-4992-49ec-98b1-53c6067e97a2.
Derived From HUN GW Model code v01
Derived From HUN GW Model Mines raw data v01
This dataset is presented in the context of real-world data science work and how data analysts and data scientists work.
The dataset consists of four columns: Year, Level_1 (Ethnic group/gender), Level_2 (Age group), and population.
I would sincerely like to thank GeoIQ for sharing this dataset with me along with the tasks. Just having a basic knowledge of Pandas, NumPy and other Python data science libraries is not enough; how you execute tasks and how you preprocess the data before making any prediction is very important. Most of the datasets on Kaggle are clean and well arranged, but this dataset taught me how real-world data science and analysis works. Every data science beginner should work on this dataset and try to execute the tasks. It will give them good exposure to the real data science world.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
New version: The Python scripts to run the lab experiment were added.
Open data: Neural electrophysiological correlates of detection and identification awareness. Supplementary material for the associated publication.
OVERVIEW
Humans have conscious experiences of the events in their environment. Previous research using electroencephalography (EEG) has shown visual awareness negativity (VAN) at about 200 ms to be a neural correlate of consciousness (NCC). In the present study, the stimulus was a ring with a Gabor patch tilting either left or right. On each trial, subjects rated their awareness on a three-level perceptual awareness scale that captured both detection (something vs. nothing) and identification (identification vs. something). Separate staircases were used to adjust stimulus opacity to the detection threshold and the identification threshold. Event-related potentials were extracted for VAN and late positivity.
DATE & LOCATION OF DATA COLLECTION
Subjects (N = 43, student volunteers) were tested between 23 May and 30 June 2022 at the Department of Psychology, Campus Albano, Stockholm, Sweden.
DATA & FILE OVERVIEW
The files contain the raw data, scripts, and results of the main and supplementary analyses of the electroencephalography (EEG) study reported in the main publication. For convenience, the report files of the main analyses in the manuscript are saved separately.
- Visual awareness negativity (VAN) results: analysis_VANo_clean_data_blocklength_16_pawarelimit0.8_maxopadetect_maxopaidentify_badEEGyes_ntrials25.html
- Late positivity (LP) results: analysis_LPo_clean_data_blocklength_16_pawarelimit0.8_maxopadetect_maxopaidenify_badEEGyes_ntrials25.html
- bdf_up_to_20.zip: EEG data files for the first 20 subjects in .bdf format (generated by the Biosemi amplifier)
- bdf_after_20.zip: EEG data files for the remaining subjects in .bdf format (generated by the Biosemi amplifier)
- Log.zip: log files of the EEG session (generated by Python)
- readme_notes_on_id.txt: information about issues during data collection
- psychopy.zip: scripts in Python and PsychoPy to run the experiment. Scripts were written by Rasmus Eklund.
- MNE-python.zip: scripts in MNE-Python to preprocess the EEG data. Scripts were written by Rasmus Eklund.
- R_graded.zip: the main reports are in R_graded > results > reports. They are .html files generated with Quarto.
- photodiode_supplement.pdf: supplementary analysis of the relationship between Python opacity settings and actual changes on the computer screen
METHODOLOGICAL INFORMATION
The visual stimuli were Gabor-grated rings. Subjects rated their awareness of the rings. Event-related potentials were computed from the EEG data. The experiment was programmed in Python: https://www.python.org/. The EEG data were recorded as .bdf files with an Active Two BioSemi system (BioSemi, Amsterdam, Netherlands; www.biosemi.com).
Instrument- or software-specific information needed to interpret the data: MNE-Python (Gramfort A., et al., 2013): https://mne.tools/stable/index.html; R and relevant packages: https://www.r-project.org/
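As an illustration, a minimal MNE-Python sketch for opening one of the .bdf recordings is shown below; the filename and filter settings are assumptions, and the actual preprocessing is defined in the scripts inside MNE-python.zip.

```python
import mne

# Hypothetical filename; the real .bdf files are inside bdf_up_to_20.zip / bdf_after_20.zip.
raw = mne.io.read_raw_bdf("subject_01.bdf", preload=True)
raw.filter(l_freq=0.1, h_freq=40.0)  # assumed band-pass for ERP analysis
print(raw.info)
```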
The original data generated by our idealized experiments using the WRF model is very large, so we used Fortran (you can also use Python, MATLAB and other tools) to preprocess the data and obtain the main variables needed for our research analysis. The preprocessed data is in binary format. The WRF model is a numerical weather prediction and atmospheric research model developed by organizations including the National Center for Atmospheric Research (NCAR) and the National Centers for Environmental Prediction (NCEP) in the USA. WRF is open-source software and can be downloaded from https://github.com/wrf-model/WRF/releases. The specific parameters and settings used to configure the WRF model runs are described in detail in the paper. Interested researchers can follow the settings in the paper to regenerate the original raw data. However, the raw data files are very large (tens of GB per file), making direct analysis difficult. Therefore, we used tools like Fortran to preprocess the raw da...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages
Overview
BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of the 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from SeamlessAlign, a subset of BhasaAnuvaad.
How to use
The datasets library allows you to load and pre-process your dataset in pure Python… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/SeamlessAlign.
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
This study investigated the dependence of the early tropical cyclone (TC) weakening rate in response to an imposed moderate environmental vertical wind shear (VWS) on the warm-core strength and height of the TC vortex using idealized numerical simulations. Results show that the weakening of the warm core by upper-level ventilation is the primary factor leading to the early TC weakening in response to an imposed environmental VWS. The upper-level ventilation is dominated by eddy radial advection of the warm-core air. The TC weakening rate is roughly proportional to the warm-core strength and height of the initial TC vortex. The boundary-layer ventilation shows no relationship with the early weakening rate of the TC in response to an imposed moderate VWS. The findings suggest that some previous diverse results regarding the TC weakening in environmental VWS could be partly due to the different warm-core strengths and heights of the initial TC vortex.
Methods
The original data generated by our idealized experiments using the WRF model is very large, so we used Fortran (you can also use Python, MATLAB and other tools) to preprocess the data and obtain the main variables needed for our research analysis. The preprocessed data is in binary format. The WRF model is a numerical weather prediction and atmospheric research model developed by organizations including the National Center for Atmospheric Research (NCAR) and the National Centers for Environmental Prediction (NCEP) in the USA. WRF is open-source software and can be downloaded from https://github.com/wrf-model/WRF/releases. The specific parameters and settings used to configure the WRF model runs are described in detail in the paper. Interested researchers can follow the settings in the paper to regenerate the original raw data. However, the raw data files are very large (tens of GB per file), making direct analysis difficult. Therefore, we used tools like Fortran to preprocess the raw data into smaller binary files containing the key variables needed for analysis, such as potential temperature, etc. The binary files are around a few hundred MB in size. We strongly recommend that subsequent researchers directly use these preprocessed binary data files, which will greatly simplify the data processing workflow.
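For illustration, a minimal sketch of reading one of the preprocessed binary files with Python is shown below; the filename, dtype, and grid dimensions are assumptions and must match how the preprocessing wrote the file (including any Fortran record markers, if present).

```python
import numpy as np

# Hypothetical example: a single 3D field (e.g., potential temperature) stored
# as raw 32-bit floats; adjust dtype and shape to the actual preprocessing output.
nz, ny, nx = 50, 300, 300
theta = np.fromfile("theta.bin", dtype=np.float32).reshape(nz, ny, nx)
print(theta.shape, float(theta.mean()))
```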