https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: none of the datasets published here contain actual data; they are for testing purposes only.
This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the two files of the same graph, their names share a common identifier based on the number of nodes. For example:
dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
dataset_30_edges_interactions.csv: contains 47 rows (edges).
Both files with the dataset_30 identifier refer to the same graph.
Each node file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| UniProt ID | string | protein identification |
| label | string | protein label (type of node) |
| properties | string | a dictionary containing properties related to the protein |
Each edge file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| Relationship ID | string | relationship identification |
| Source ID | string | identification of the source protein in the relationship |
| Target ID | string | identification of the target protein in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_30* | 30 | 47 | Y |
| dataset_60* | 60 | 181 | Y |
| dataset_120* | 120 | 689 | Y |
| dataset_240* | 240 | 2819 | Y |
| dataset_300* | 300 | 4658 | Y |
| dataset_600* | 600 | 18004 | Y |
| dataset_1200* | 1200 | 71785 | Y |
| dataset_2400* | 2400 | 288600 | Y |
| dataset_3000* | 3000 | 449727 | Y |
| dataset_6000* | 6000 | 1799413 | Y |
| dataset_12000* | 12000 | 7199863 | Y |
| dataset_24000* | 24000 | 28792361 | Y |
| dataset_30000* | 30000 | 44991744 | Y |
This repository includes two (2) additional tiny graph datasets for experimenting before dealing with the larger datasets.
Each node file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | node identification |
| label | string | node label (type of node) |
| properties | string | a dictionary containing properties related to the node |
Each edge file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | relationship identification |
| source | string | identification of the source node in the relationship |
| target | string | identification of the target node in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_dummy* | 3 | 6 | N |
| dataset_dummy2* | 3 | 6 | N |
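As a quick, non-authoritative sketch of working with these files, the node and edge CSVs that share an identifier can be loaded with pandas (the file paths are illustrative, and treating the properties column as a Python-style dictionary literal is an assumption):

```python
import ast
import pandas as pd

# Node and edge files that share the dataset_30 identifier (paths are illustrative).
nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
edges = pd.read_csv("dataset_30_edges_interactions.csv")

# The 'properties' column holds a dictionary serialized as a string; if it is a
# Python-style literal, ast.literal_eval turns it back into a dict (assumption).
nodes["properties"] = nodes["properties"].apply(ast.literal_eval)
edges["properties"] = edges["properties"].apply(ast.literal_eval)

print(len(nodes), "nodes,", len(edges), "edges")
```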
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This CSV represents a dummy dataset to test the functionality of trusted repository search capabilities and of research data governance practices. The associated dummy dissertation is entitled Financial Econometrics Dummy Dissertation. The dummy file is a 7KB CSV containing 5000 rows of notional demographic tabular data.
This data set comes from data held by the Driver and Vehicle Standards Agency (DVSA).
It is not classed as an ‘official statistic’. This means it’s not subject to scrutiny and assessment by the UK Statistics Authority.
The MOT test checks that your vehicle meets road safety and environmental standards. Different types of vehicles (for example, cars and motorcycles) fall into different ‘classes’.
This data table shows the number of initial tests. It does not include abandoned tests, aborted tests, or retests.
The initial fail rate is the rate for vehicles as they were brought for the MOT. The final fail rate excludes vehicles that pass the test after rectification of minor defects at the time of the test.
This data table is updated every 3 months.
Ref: DVSA/MOT/01. Download CSV (16.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060287/dvsa-mot-01-mot-test-results-by-class-of-vehicle1.csv
These tables give data for the following classes of vehicles:
All figures are for vehicles as they were brought in for the MOT.
A failed test usually has multiple failure items.
The percentage of tests is worked out as the number of tests with one or more failure items in the defect category as a percentage of total tests.
The percentage of defects is worked out as the total defects in the category as a percentage of total defects for all categories.
The average defects per initial test failure is worked out as the total number of failure items divided by the total of tests failed plus tests that passed after rectification of a minor defect at the time of the test.
These data tables are updated every 3 months.
Ref: DVSA/MOT/02. Download CSV (19.1 KB): https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1060255/dvsa-mot-02-mot-class-1-and-2-vehicles-initial-failures-by-defect-category-.csv
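To make the three calculations above concrete, here is a small sketch with made-up numbers (the column names are illustrative, not DVSA's):

```python
import pandas as pd

# Illustrative figures only -- real values come from the DVSA tables.
df = pd.DataFrame({
    "defect_category": ["Brakes", "Lamps", "Tyres"],
    "tests_with_defect": [1200, 2500, 900],    # tests with >= 1 failure item in the category
    "defects_in_category": [1500, 4100, 950],  # total failure items in the category
})
total_tests = 20000
total_defects = df["defects_in_category"].sum()

# Percentage of tests: tests with one or more failure items in the category / total tests.
df["pct_of_tests"] = 100 * df["tests_with_defect"] / total_tests
# Percentage of defects: defects in the category / total defects across all categories.
df["pct_of_defects"] = 100 * df["defects_in_category"] / total_defects
print(df)
```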
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files provide an overview of the database, and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
The data is licensed through the Creative Commons Attribution 4.0 International.
If you have used our data and are publishing your work, we ask that you please reference both:
this database through its DOI, and
any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.
Included Files
Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.
Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.
Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data
Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.
We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Clean_Data_v1-0-0.zip: contains all the downsampled data
The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.
There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.
The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.
Database_References_v1-0-0.bib
Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.
File Format: Downsampled Data
These are the "LP_Specimen_processed_data.csv" files in the "Clean_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:
The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data
Time[s]: time in seconds since the start of the test
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: the surface temperature in degC
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
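Building on the snippet above, a minimal sketch of pulling one stress-strain history out of a downsampled file (the file name is illustrative; the column names are those listed above):

```python
import pandas

# One downsampled test file; the first (unnamed) column is the sample index.
data = pandas.read_csv("LP1_S1_processed_data.csv", index_col=0)  # illustrative file name

# True strain and true stress (MPa) columns, as listed above.
strain = data["e_true"].to_numpy()
stress = data["Sigma_true"].to_numpy()
print(strain.shape, stress.shape)
```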
File Format: Unreduced Data
These are the "LP_Specimen_processed_data.csv" files in the "Unreduced_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:
The first column is the index of each data point
S/No: sample number recorded by the DAQ
System Date: Date and time of sample
Time[s]: time in seconds since the start of the test
C_1_Force[kN]: load cell force
C_1_Déform1[mm]: extensometer displacement
C_1_Déplacement[mm]: cross-head displacement
Eng_Stress[MPa]: engineering stress
Eng_Strain[]: engineering strain
e_true: true strain
Sigma_true: true stress in MPa
(optional) Temperature[C]: specimen surface temperature in degC
The data can be loaded and used similarly to the downsampled data.
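For reference, the engineering and true quantities in these columns are related by the standard uniform-deformation formulas; whether the database applied exactly these relations is not stated here, so treat this as an assumption:

```python
import numpy as np

def true_from_engineering(eng_stress_mpa, eng_strain):
    """Standard relations, valid up to necking and assuming uniform deformation."""
    e_true = np.log(1.0 + eng_strain)                  # true strain
    sigma_true = eng_stress_mpa * (1.0 + eng_strain)   # true stress in MPa
    return e_true, sigma_true
```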
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
hidden_index: internal reference ID
grade: material grade
spec: specifications for the material
source: base material for the test specimen
id: internal name for the specimen
lp: load protocol
size: type of specimen (M8, M12, M20)
gage_length_mm_: unreduced section length in mm
avg_reduced_dia_mm_: average measured diameter for the reduced section in mm
avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm
avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm
fy_n_mpa_: nominal yield stress
fu_n_mpa_: nominal ultimate stress
t_a_deg_c_: ambient temperature in degC
date: date of test
investigator: person(s) who conducted the test
location: laboratory where test was conducted
machine: setup used to conduct test
pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control
pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control
pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control
citekey: reference corresponding to the Database_References.bib file
yield_stress_mpa_: computed yield stress in MPa
elastic_modulus_mpa_: computed elastic modulus in MPa
fracture_strain: computed average true strain across the fracture surface
c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass
file: file name of corresponding clean (downsampled) stress-strain data
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
import pandas as pd
date, version = '2022-08-25', '_v1-0-0'  # match the file name in this release
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
                   index_col=[0, 1, 2, 3], skipinitialspace=True,
                   header=[0, 1], keep_default_na=False, na_values='')
citekey: reference in "Campaign_References.bib".
Grade: material grade.
Spec.: specifications (e.g., J2+N).
Yield Stress [MPa]: initial yield stress in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
Elastic Modulus [MPa]: initial elastic modulus in MPa
size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign
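Continuing the loading snippet above, individual campaign statistics can then be pulled out of the multi-indexed table; the exact column labels are assumed from the description above:

```python
# Assuming the two header rows yield ('Yield Stress [MPa]', 'mean')-style column labels
# and the row index carries the citekey / Grade / Spec levels described above.
mean_yield = tab1[('Yield Stress [MPa]', 'mean')]
print(mean_yield.head())
```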
Caveats
The specimens in the following directories were tested before the protocol was established; therefore, only the true stress-strain data is available for each:
A500
A992_Gr50
BCP325
BCR295
HYP400
S460NL
S690QL/25mm
S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm
This is the Sample Submission CSV file after running the CellSegmentator tool on the images and recording relevant outputs.
The extra data included is:
- RLE Masks (for each cell)
- Submission Style RLE Masks (for each cell)
- Bounding Boxes (for each cell)
Dataset Name
This dataset contains structured data for machine learning and analysis purposes.
Contents
data/sample.csv: Sample dataset file.
data/train.csv: Training dataset.
data/test.csv: Testing dataset.
scripts/preprocess.py: Script for preprocessing the dataset.
scripts/analyze.py: Script for data analysis.
Usage
Load the dataset using Pandas:
import pandas as pd
df = pd.read_csv('data/sample.csv')
Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
can-csv
This dataset contains controller area network (CAN) traffic for the 2017 Subaru Forester, the 2016 Chevrolet Silverado, the 2011 Chevrolet Traverse, and the 2011 Chevrolet Impala. For each vehicle, there are samples of attack-free traffic (that is, normal traffic) as well as samples of various types of attacks. The spoofing attacks, such as RPM spoofing and speed spoofing, have an observable effect on the vehicle under test. This repository contains only .csv files. It is a subset of the can-dataset repository.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An example of TRTH intraday top-of-book transaction data for a single Johannesburg Stock Exchange (JSE) listed equity. The data is for teaching, learning and research projects, and was sourced from the legacy Tick History v1 SOAP API interface at https://tickhistory.thomsonreuters.com/TickHistory in May 2016. Related raw data and similar data structures can now be accessed using Tick History v2 and the REST API https://hosted.datascopeapi.reuters.com/RestApi/v1.
Configuration control: the test dataset contains 16 CSV files with names: "
Attributes: The data set is for the ticker AGLJ.J from May 2010 until May 2016. The files include the following attributes: RIC, Local Date-Time, Event Type, Price at the Event, Volume at the Event, Best Bid Changes, Best Ask Changes, and Trade Event Sign, with column headers RIC, DateTimeL, Type, Price, Volume, L1 Bid, L1 Ask, and Trade Sign. The Local Date-Time (DateTimeL) is a serial date number where 1 corresponds to Jan-1-0000; for example, 736333.382013 corresponds to 4-Jan-2016 09:10:05 (or 20160104T091005 in ISO 8601 format). The trade event sign (Trade Sign) indicates whether the transaction was buyer (+1) or seller (-1) initiated and was prepared using the method of Lee and Ready (2008).
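As a quick check of the DateTimeL convention, a serial date number of this kind can be converted to a calendar date as follows (this assumes the MATLAB-style day-1 = Jan-1-0000 convention stated above):

```python
from datetime import datetime, timedelta

def serial_to_datetime(serial):
    """Serial date number (day 1 = Jan-1-0000, MATLAB datenum style) -> datetime."""
    # Python ordinals count from Jan-1-0001, hence the 366-day offset.
    return datetime.fromordinal(int(serial) - 366) + timedelta(days=serial % 1)

print(serial_to_datetime(736333.382013))  # -> 2016-01-04 09:10:05 (approximately)
```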
Disclaimer: The data is not up-to-date, is incomplete, and has been pre-processed; as such it is not fit for any purpose other than teaching, learning, and algorithm testing. For complete, up-to-date, and error-free data please use the Tick History v2 interface directly.
Research Objectives: The data has been used to build empirical evidence in support of hierarchical causality and universality in financial markets by considering price impact on different time and averaging scales, feature selection on different scales as inputs into scale dependent machine learning applications, and for various aspects of agent-based model calibration and market ecology studies on different time and averaging scales.
Acknowledgements to: Diane Wilcox, Dieter Hendricks, Michael Harvey, Fayyaaz Loonat, Michael Gant, Nicholas Murphy and Donovan Platt.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Datasets related to the manuscript describing the ScanGrow [Proof of Concept] application:
Worth RM and Espina L (2022) ScanGrow: Deep Learning-Based Live Tracking of Bacterial Growth in Broth. Front. Microbiol. 13:900596.
The contents of the three compressed folders are described below.
TRAINING_MODEL.ZIP
Collection of images and spreadsheets that was used in the training of the image classification model that ScanGrow [PoC] uses by default. This training dataset should be subjected to the pre-processing workflow provided with ScanGrow to obtain the grouped images to be fed to the model training utility.
TEST_MODEL.ZIP
Collection of images and spreadsheets comprising the Test dataset used in the evaluation of the image classification model. This includes:
- New scans and spreadsheets (represented in Figure 3 as gray triangles).
- Evaluation.csv: combined results of the output files from the command "Test Model" when run with:
  * Dataset Test: these scans and spreadsheets (not used for training),
  * Dataset Training: the dataset used for training the model, or
  * Dataset Validation: the Training dataset after horizontally flipping and offsetting the images and adjusting the spectrophotometric values according to the newly inverted well positions.
Data from a sample run used to test ScanGrow on a microplate containing different concentrations of several antibiotics. This includes:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is the "development dataset" for the DCASE 2022 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques".
The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:
Overview of the task
Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial intelligence (AI)-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.
This task is the follow-up to DCASE 2020 Task 2 and DCASE 2021 Task 2. The task this year is to detect anomalous sounds under three main conditions:
1. Only normal sound clips are provided as training data (i.e., unsupervised learning scenario). In real-world factories, anomalies rarely occur and are highly diverse. Therefore, exhaustive patterns of anomalous sounds are impossible to create or collect, and unknown anomalous sounds that were not observed in the given training data must be detected. This condition is the same as in DCASE 2020 Task 2 and DCASE 2021 Task 2.
2. Factors other than anomalies change the acoustic characteristics between training and test data (i.e., domain shift). In real-world cases, operational conditions of machines or environmental noise often differ between the training and testing phases. For example, the operation speed of a conveyor can change due to seasonal demand, or environmental noise can fluctuate depending on the states of surrounding machines. This condition is the same as in DCASE 2021 Task 2.
3. In test data, samples unaffected by domain shifts (source domain data) and those affected by domain shifts (target domain data) are mixed, and the source/target domain of each sample is not specified. Therefore, the model must detect anomalies regardless of the domain (i.e., domain generalization).
Definition
We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."
Dataset
This dataset consists of three sections for each machine type (Sections 00, 01, and 02), and each section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.
File names and attribute csv files
File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by their respective file names. Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, name of parameters that cause domain shifts (domain shift parameter, dp), and the value or type of these parameters (domain shift value, dv) are listed. Each row takes the following format:
[filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
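A minimal sketch of reading one of these attribute csv files into per-clip (parameter, value) pairs, following the row format above (the file name and the presence or absence of a header row are assumptions):

```python
import csv

# Parse an attribute csv into {filename: {parameter: value}} following the
# [filename], [d1p], [d1v], [d2p], [d2v], ... row layout described above.
attributes = {}
with open("attributes_00.csv", newline="") as f:  # illustrative file name
    reader = csv.reader(f)
    # next(reader)  # uncomment to skip a header row, if the file has one
    for row in reader:
        filename, rest = row[0], row[1:]
        attributes[filename] = dict(zip(rest[0::2], rest[1::2]))
```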
Recording procedure
Normal/anomalous operating sounds of machines and their related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. To simplify the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings from a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.
Directory structure
- /dev_data
- /fan
- /train (only normal clips)
- /section_00_source_train_normal_0000_
Baseline system
Two baseline systems, baseline_ae and baseline_mobile_net_v2, are available on GitHub. The baseline systems provide a simple entry-level approach that gives a reasonable performance on the dataset of Task 2. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.
Condition of use
This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
Citation
If you use this dataset, please cite all the following three papers.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Annotated 12 lead ECG dataset

Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students. It is used as the test set in the paper: "Automatic diagnosis of the 12-lead ECG using a deep neural network". https://www.nature.com/articles/s41467-020-15432-4.

It contains annotations for 6 different ECG abnormalities:
- 1st degree AV block (1dAVb);
- right bundle branch block (RBBB);
- left bundle branch block (LBBB);
- sinus bradycardia (SB);
- atrial fibrillation (AF); and,
- sinus tachycardia (ST).

Companion python scripts are available in: https://github.com/antonior92/automatic-ecg-diagnosis

--------

Citation
```
Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
```
Bibtex:
```
@article{ribeiro_automatic_2020,
  title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
  author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
  year = {2020},
  volume = {11},
  pages = {1760},
  doi = {https://doi.org/10.1038/s41467-020-15432-4},
  journal = {Nature Communications},
  number = {1}
}
```
-----

## Folder content:

- `ecg_tracings.hdf5`: The HDF5 file containing a single dataset named `tracings`. This dataset is an `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12 different leads of the ECG exams in the following order: `{DI, DII, DIII, AVR, AVL, AVF, V1, V2, V3, V4, V5, V6}`. The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples) we fill them with zeros on both sides. For instance, for a 7-second ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V: so they should be multiplied by 1000 in order to obtain the signals in V. In python, one can read this file using the following sequence:
```python
import h5py
import numpy as np

with h5py.File(args.tracings, "r") as f:
    x = np.array(f['tracings'])
```
- The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
- `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header). The i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5` in all csv files. The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST` corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`).
  1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
  2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2 agree, the common diagnosis was considered as gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
  3. `dnn.csv` prediction from the deep neural network described in the paper. The threshold is set in such a way that it maximizes the F1 score.
  4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
  5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
  6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
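As an illustration of how the annotation files line up, here is a hedged sketch that loads the gold standard and the DNN predictions and reports per-label agreement (relative paths assume an unpacked copy of the dataset):

```python
import pandas as pd

# Gold-standard annotations and DNN predictions; both have the six abnormality columns.
gold = pd.read_csv("annotations/gold_standard.csv")
dnn = pd.read_csv("annotations/dnn.csv")

labels = ["1dAVb", "RBBB", "LBBB", "SB", "AF", "ST"]
# Fraction of the 827 tracings where the DNN matches the gold standard, per label.
agreement = (gold[labels] == dnn[labels]).mean()
print(agreement)
```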
🗣️ English Speech Audio Dataset (Sample)
This dataset contains English speech samples, annotated by dialect, speaking rate, and environmental condition, and includes ground-truth transcriptions. It is intended to support research and applications in automatic speech recognition (ASR) and spoken language understanding.
📁 Dataset Structure
Audio segments are stored in .wav format, accompanied by a CSV file (En_dataset.csv) with rich metadata.
📊Dataset Statistics… See the full description on the dataset page: https://huggingface.co/datasets/Marianne0Habib/stt-english-test-dataset-sample.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description Welcome to the "Hung Vuong Hospital Embryo Classification" dataset. This page provides a comprehensive overview of the data files, their formats, and the essential columns you'll encounter in this competition. Taking a moment to understand the data will help you navigate the challenge effectively and make informed decisions during your analysis and modeling.
The dataset comprises the following key files:
train folder - Contains images of embryos at day-3 and day-5 for training purposes.
test folder - Contains images of embryos at day-3 and day-5 for testing purposes.
train.csv - Contains information about the training set.
test.csv - Contains information about the test set.
sample_submission.csv - A sample submission file that demonstrates the correct submission format.
Data Format Expectations
The embryo images are arranged within subfolders under the train and test directories. Each image is saved in JPG format and is labeled with a prefix. Images corresponding to day-3 embryos have the prefix D3 while images related to day-5 embryos bear the prefix D5. This prefix-based categorization allows for easy identification of the embryo's developmental stage.
Expected Output
Your task in this competition is to create a deep learning model that can accurately classify embryo images as 1 for good or 0 for not good for both day-3 and day-5 stages. The model should be trained on the training set and then used to predict the embryo quality in the test set. The ID column assigns an ID to each image. You will create the Class column as the result of model classification. The submission file contains only 2 columns: ID and Class (See the sample submission file)
Columns
You will encounter the following columns throughout the dataset:
ID - Refers to the ID of the images in the test set.
Image - Refers to the file name of the embryo images in the train or test folder.
Class - Represents the evaluation of the embryo images. This column provides the ground truth label for each image, indicating whether the embryo is classified as 'good' or 'not good'.
We encourage you to explore, analyze, and preprocess the provided data to build a robust model for accurate embryo quality classification. Good luck, and may your innovative solutions contribute to advancements in reproductive science!
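A minimal sketch of producing a correctly shaped submission file, assuming test.csv exposes the ID column described above (the zero predictions are placeholders):

```python
import pandas as pd

# Build a submission with the required ID and Class columns
# (the zeros are placeholders, not real model predictions).
test_info = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "ID": test_info["ID"],
    "Class": 0,  # replace with model output: 1 = good, 0 = not good
})
submission.to_csv("submission.csv", index=False)
```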
ODC Public Domain Dedication and Licence (PDDL) v1.0 http://www.opendatacommons.org/licenses/pddl/1.0/
License information was derived automatically
The Yahoo! Answers topic classification dataset is constructed using 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and testing samples 60,000 in this dataset. From all the answers and other meta-information, we only used the best answer content and the main category information.
The file classes.txt contains a list of classes corresponding to each label.
The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
Original Data Source: Yahoo Answers 10 categories for NLP CSV
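For illustration, the files can be loaded with pandas as follows (assuming, as is common for this release, that they carry no header row; the column names are chosen here for readability):

```python
import pandas as pd

# Assumes the files carry no header row; assign names for the four columns described above.
cols = ["class_index", "question_title", "question_content", "best_answer"]
train = pd.read_csv("train.csv", header=None, names=cols)
test = pd.read_csv("test.csv", header=None, names=cols)
print(train["class_index"].value_counts().sort_index())
```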
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The Pesticide Data Program (PDP) is a national pesticide residue database program. Through cooperation with State agriculture departments and other Federal agencies, PDP manages the collection, analysis, data entry, and reporting of pesticide residues on agricultural commodities in the U.S. food supply, with an emphasis on those commodities highly consumed by infants and children. This dataset provides information on where each tested sample was collected, where the product originated from, what type of product it was, and what residues were found on the product, for calendar years 1992 through 2020. The data can measure residues of individual compounds and classes of compounds, as well as provide information about the geographic distribution of the origin of samples, from growers, packers and distributors. The dataset also includes information on where the samples were taken, what laboratory was used to test them, and all testing procedures (by sample, so can be linked to the compound that is identified). The dataset also contains a reference variable for each compound that denotes the limit of detection for a pesticide/commodity pair (LOD variable). The metadata also includes EPA tolerance levels or action levels for each pesticide/commodity pair. The dataset will be updated on a continual basis, with a new resource data file added annually after the PDP calendar-year survey data is released.
Resources in this dataset:
- Resource Title: CSV Data Dictionary for PDP. File Name: PDP_DataDictionary.csv. Resource Description: Machine-readable Comma Separated Values (CSV) format data dictionary for PDP Database Zip files. Defines variables for the sample identity and analytical results data tables/files. The ## characters in the Table and Text Data File name refer to the 2-digit year for the PDP survey, like 97 for 1997 or 01 for 2001. For details on table linking, see PDF. Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
- Resource Title: Data dictionary for Pesticide Data Program. File Name: PDP DataDictionary.pdf. Resource Description: Data dictionary for PDP Database Zip files. Resource Software Recommended: Adobe Acrobat, url: https://www.adobe.com
- Resource Title: 2019 PDP Database Zip File. File Name: 2019PDPDatabase.zip
- Resource Title: 2018 PDP Database Zip File. File Name: 2018PDPDatabase.zip
- Resource Title: 2017 PDP Database Zip File. File Name: 2017PDPDatabase.zip
- Resource Title: 2016 PDP Database Zip File. File Name: 2016PDPDatabase.zip
- Resource Title: 2015 PDP Database Zip File. File Name: 2015PDPDatabase.zip
- Resource Title: 2014 PDP Database Zip File. File Name: 2014PDPDatabase.zip
- Resource Title: 2013 PDP Database Zip File. File Name: 2013PDPDatabase.zip
- Resource Title: 2012 PDP Database Zip File. File Name: 2012PDPDatabase.zip
- Resource Title: 2011 PDP Database Zip File. File Name: 2011PDPDatabase.zip
- Resource Title: 2010 PDP Database Zip File. File Name: 2010PDPDatabase.zip
- Resource Title: 2009 PDP Database Zip File. File Name: 2009PDPDatabase.zip
- Resource Title: 2008 PDP Database Zip File. File Name: 2008PDPDatabase.zip
- Resource Title: 2007 PDP Database Zip File. File Name: 2007PDPDatabase.zip
- Resource Title: 2006 PDP Database Zip File. File Name: 2006PDPDatabase.zip
- Resource Title: 2005 PDP Database Zip File. File Name: 2005PDPDatabase.zip
- Resource Title: 2004 PDP Database Zip File. File Name: 2004PDPDatabase.zip
- Resource Title: 2003 PDP Database Zip File. File Name: 2003PDPDatabase.zip
- Resource Title: 2002 PDP Database Zip File. File Name: 2002PDPDatabase.zip
- Resource Title: 2001 PDP Database Zip File. File Name: 2001PDPDatabase.zip
- Resource Title: 2000 PDP Database Zip File. File Name: 2000PDPDatabase.zip
- Resource Title: 1999 PDP Database Zip File. File Name: 1999PDPDatabase.zip
- Resource Title: 1998 PDP Database Zip File. File Name: 1998PDPDatabase.zip
- Resource Title: 1997 PDP Database Zip File. File Name: 1997PDPDatabase.zip
- Resource Title: 1996 PDP Database Zip File. File Name: 1996PDPDatabase.zip
- Resource Title: 1995 PDP Database Zip File. File Name: 1995PDPDatabase.zip
- Resource Title: 1994 PDP Database Zip File. File Name: 1994PDPDatabase.zip
- Resource Title: 1993 PDP Database Zip File. File Name: 1993PDPDatabase.zip
- Resource Title: 1992 PDP Database Zip File. File Name: 1992PDPDatabase.zip
- Resource Title: 2020 PDP Database Zip File. File Name: 2020PDPDatabase.zip. Resource Description: Data and supporting files for PDP 2020 survey. Resource Software Recommended: Microsoft Access, url: https://products.office.com/en-us/access
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 testing samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 testing samples.
Dataset Description
The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 2 columns in them, corresponding to class index (1 to 5) and review text. The review texts are… See the full description on the dataset page: https://huggingface.co/datasets/yassiracharki/Yelp_Reviews_for_Sentiment_Analysis_fine_grained_5_classes.
https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
This data release supports interpretations of field-observed root distributions within a shallow landslide headscarp (CB1) located below Mettman Ridge within the Oregon Coast Range, approximately 15 km northeast of Coos Bay, Oregon, USA. (Schmidt_2021_CB1_topo_far.png and Schmidt_2021_CB1_topo_close.png). Root species, diameter (greater than or equal to 1 mm), general orientation relative to the slide scarp, and depth below ground surface were characterized immediately following landsliding in response to large-magnitude precipitation in November 1996 which triggered thousands of landslides within the area (Montgomery and others, 2009). The enclosed data includes: (1) tests of root-thread failure as a function of root diameter and tensile load for different plant species applicable to the broader Oregon Coast Range and (2) tape and compass survey of the planform geometry of the CB1 landslide and the roots observed in the slide scarp. Root diameter and load measurements were principally collected in the general area of the CB1 slide for 12 species listed in: Schmidt_2021_OR_root_species_list.csv. Methodology of the failure tests included identifying roots of a given plant species, trimming root threads into 15-20 cm long segments, measuring diameters including bark (up to 6.5 mm) with a micrometer at multiple points along the segment to arrive at an average, clamping a segment end to a calibrated spring and loading roots until failure recording the maximum load. Files containing the tensile failure tests described in Schmidt and others (2001) include root diameter (mm), critical tensile load at failure (kg), root cross-sectional area (m^2), and tensile strength (MPa). Tensile strengths were calculated as: (critical tensile load at failure * gravitational acceleration)/root cross-sectional area. The files are labeled: Schmidt_2021_OR_root_AceCir.csv, Schmidt_2021_OR_root_AceMac.csv, Schmidt_2021_OR_root_AlnRub.csv, Schmidt_2021_OR_root_AnaMar.csv, Schmidt_2021_OR_root_DigPur.csv, Schmidt_2021_OR_root_MahNer.csv, Schmidt_2021_OR_root_PolMun.csv, Schmidt_2021_OR_root_PseMen_damaged.csv, Schmidt_2021_OR_root_PseMen_healthy.csv, Schmidt_2021_OR_root_RubDis.csv, Schmidt_2021_OR_root_RubPar.csv, Schmidt_2021_OR_root_SamCae.csv, and Schmidt_2021_OR_root_TsuHet.csv. File naming follows the convention of adopting the first three letters of the binomial system defining genus and species of their Latin names. Live and damaged roots were identified based on their color, texture, plasticity, adherence of bark to woody material, and compressibility. For example, healthy live Douglas-fir (Pseudotsuga menziesii) roots (Schmidt_2021_OR_root_PseMen_healthy.csv) have a crimson-colored inner bark, darkening to a brownish red in dead Douglas-fir roots. Both are distinctive colors. Live roots exhibited plastic responses to bending and strong adherence of bark, whereas dead roots displayed brittle behavior with bending and poor adherence of bark to the underlying woody material. Measured tensile strengths of damaged root threads with fungal infections following selective tree harvest using yarding operations that damaged bark of standing trees expressed significantly lower tensile strengths than their ultimate living tensile strengths (Schmidt_2021_OR_root_PseMen_damaged.csv). The CB1 site was clear cut logged in 1987 and replanted with Douglas fir saplings in 1989. 
Vegetation in the vicinity of the failure scarp is dominated by young Douglas fir saplings planted two years after the clear cut, blue elderberry (Sambucus caerulea), thimbleberry (Rubus parviflorus), foxglove (Digitalis purpurea), and Himalayan blackberry (Rubus discolor). The remaining seven species are provided for context of more regional studies. The CB1 site is a hillslope hollow that failed as a shallow landslide and mobilized as a debris flow during heavy rainfall in November 1996. Prior to debris flow mobilization, the ~5-m wide slide with a source area of roughly 860 m^2 and an average slope of 43° displaced and broke numerous roots. Following landsliding, field observations noted a preponderance of exposed, blunt broken root stubs within the scarp. Roots were not straight and smooth, but rather exhibited tortuous growth paths with firmly anchored, interlocking structures. The planform geometry represented by a tape and compass field survey is presented as starting and ending points of slide margin segments of roughly equal colluvial soil depths above saprolite or bedrock (Schmidt_2021_CB1_scarp_geometry.csv and Schmidt_2021_CB1_scarp_pts.shp). The graphic Schmidt_2021_CB1_scarp_pts_poly.png shows the horse-shoe shaped profile and its numbered scarp segments. Segment numbers enclosed within parentheses indicate segments where roots were not counted owing to occlusion by prior ground disturbance. The shapefile Schmidt_2021_CB1_scarp_poly.shp also represents the scarp line segments. The file Schmidt_2021_CB1_segment_info.csv presents the segment information as left and right cumulative lengths, averaged colluvium soils depths for each segment, and inclinations of the ground surface slope relative to horizontal along the perimeter (P) and the slide scarp face (F). Lastly, Schmidt_2021_CB1_rootdata_scarp.csv represents root diameter of individual threads measured by a micrometer, species, depth below ground surface, live vs. dead roots, general root orientation (parallel or perpendicular) relative to scarp perimeter, and cumulative perimeter distance within the scarp segments. At CB1 specifically and more generally across the Oregon Coast Range, root reinforcement occurs primarily by lateral reinforcement with typically much smaller basal reinforcements.
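For reference, the tensile-strength calculation described above (critical load at failure times gravitational acceleration, divided by root cross-sectional area) works out as in this sketch; the numeric example is made up:

```python
import math

def tensile_strength_mpa(diameter_mm, failure_load_kg, g=9.81):
    """Tensile strength in MPa from average root diameter (mm) and critical load at failure (kg)."""
    area_m2 = math.pi * (diameter_mm / 2000.0) ** 2  # diameter in mm -> radius in m
    return failure_load_kg * g / area_m2 / 1e6       # Pa -> MPa

print(round(tensile_strength_mpa(2.0, 5.0), 1))  # made-up example: a 2 mm root failing at 5 kg -> ~15.6 MPa
```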