100+ datasets found
  1. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1 more
    Updated Oct 13, 2014
    + more versions
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIRO (http://www.csiro.au/)
    License

    CSIRO Data Licence: https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  2. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño; Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contain actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • The common identifier dataset_30 refers to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    UniProt ID | string | protein identification
    label | string | protein label (type of node)
    properties | string | a dictionary containing properties related to the protein.

    CSV edges

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    Relationship ID | string | relationship identification
    Source ID | string | identification of the source protein in the relationship
    Target ID | string | identification of the target protein in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.
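A minimal loading sketch for one node/edge file pair, assuming pandas and networkx are installed and that the CSV headers match the column names listed above (file names follow the dataset_30 example):

```python
import pandas as pd
import networkx as nx

# File names follow the dataset_30 example above.
nodes = pd.read_csv("dataset_30_nodes_interactions.csv")
edges = pd.read_csv("dataset_30_edges_interactions.csv")

g = nx.DiGraph()  # Source/Target columns suggest directed relationships
for _, row in nodes.iterrows():
    g.add_node(row["UniProt ID"], label=row["label"], properties=row["properties"])
for _, row in edges.iterrows():
    g.add_edge(row["Source ID"], row["Target ID"],
               label=row["label"], properties=row["properties"])

print(g.number_of_nodes(), g.number_of_edges())  # expected: 30 47
```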

    Metadata

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_30* | 30 | 47 | Y
    dataset_60* | 60 | 181 | Y
    dataset_120* | 120 | 689 | Y
    dataset_240* | 240 | 2819 | Y
    dataset_300* | 300 | 4658 | Y
    dataset_600* | 600 | 18004 | Y
    dataset_1200* | 1200 | 71785 | Y
    dataset_2400* | 2400 | 288600 | Y
    dataset_3000* | 3000 | 449727 | Y
    dataset_6000* | 6000 | 1799413 | Y
    dataset_12000* | 12000 | 7199863 | Y
    dataset_24000* | 24000 | 28792361 | Y
    dataset_30000* | 30000 | 44991744 | Y

    This repository includes two (2) additional tiny graph datasets to experiment with before dealing with larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | node identification
    label | string | node label (type of node)
    properties | string | a dictionary containing properties related to the node.

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | relationship identification
    source | string | identification of the source node in the relationship
    target | string | identification of the target node in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.

    Metadata (tiny graphs)

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_dummy* | 3 | 6 | N
    dataset_dummy2* | 3 | 6 | N
  3. Test Data Dummy CSV

    • figshare.com
    txt
    Updated Nov 6, 2023
    Cite
    Tori Duckworth (2023). Test Data Dummy CSV [Dataset]. http://doi.org/10.6084/m9.figshare.24500965.v2
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tori Duckworth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This CSV represents a dummy dataset to test the functionality of trusted repository search capabilities and of research data governance practices. The associated dummy dissertation is entitled Financial Econometrics Dummy Dissertation. The dummy file is a 7KB CSV containing 5000 rows of notional demographic tabular data.

  4. MOT testing data for Great Britain

    • s3.amazonaws.com
    • gov.uk
    Updated Mar 24, 2022
    Cite
    Driver and Vehicle Standards Agency (2022). MOT testing data for Great Britain [Dataset]. https://s3.amazonaws.com/thegovernmentsays-files/content/179/1797262.html
    Dataset updated
    Mar 24, 2022
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Driver and Vehicle Standards Agency
    Area covered
    Great Britain, United Kingdom
    Description

    About this data set

    This data set comes from data held by the Driver and Vehicle Standards Agency (DVSA).

    It is not classed as an ‘official statistic’. This means it’s not subject to scrutiny and assessment by the UK Statistics Authority.

    MOT test results by class

    The MOT test checks that your vehicle meets road safety and environmental standards. Different types of vehicles (for example, cars and motorcycles) fall into different ‘classes’.

    This data table shows the number of initial tests. It does not include abandoned tests, aborted tests, or retests.

    The initial fail rate is the rate for vehicles as they were brought for the MOT. The final fail rate excludes vehicles that pass the test after rectification of minor defects at the time of the test.

    This data table is updated every 3 months.


    Initial failures by defect category

    These tables give data for the following classes of vehicles:

    • class 1 and 2 vehicles - motorcycles
    • class 3 and 4 vehicles - cars and light vans up to 3,000kg
    • class 5 vehicles - private passenger vehicles with more than 12 seats
    • class 7 vehicles - goods vehicles between 3,000kg and 3,500kg gross vehicle weight

    All figures are for vehicles as they were brought in for the MOT.

    A failed test usually has multiple failure items.

    The percentage of tests is worked out as the number of tests with one or more failure items in the defect category as a percentage of total tests.

    The percentage of defects is worked out as the total defects in the category as a percentage of total defects for all categories.

    The average defects per initial test failure is worked out as the total failure items divided by the total tests failed plus tests that passed after rectification of a minor defect at the time of the test.
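As a rough worked example of these definitions (the counts below are invented for illustration only; they are not DVSA figures):

```python
# Invented counts for one defect category, to illustrate the definitions above.
total_tests = 1_000_000
tests_with_item_in_category = 150_000   # tests with >= 1 failure item in this category
defects_in_category = 210_000           # total failure items recorded in this category
defects_all_categories = 900_000        # total failure items across all categories
tests_failed = 300_000
passed_after_rectification = 50_000     # passed after minor defects rectified at the test

pct_of_tests = 100 * tests_with_item_in_category / total_tests          # 15.0%
pct_of_defects = 100 * defects_in_category / defects_all_categories     # ~23.3%
avg_defects_per_failure = defects_all_categories / (tests_failed + passed_after_rectification)  # ~2.6
```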

    These data tables are updated every 3 months.


    MOT class 3 and 4 vehicles: initial failures by defect category

  5. Sample Dataset for Testing

    • ieee-dataport.org
    Updated Apr 28, 2025
    Cite
    Alex Outman (2025). Sample Dataset for Testing [Dataset]. https://ieee-dataport.org/documents/sample-dataset-testing
    Dataset updated
    Apr 28, 2025
    Authors
    Alex Outman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    10

  6. Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Dec 24, 2022
    Cite
    de Castro e Sousa, Albano (2022). Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6965146
    Dataset updated
    Dec 24, 2022
    Dataset provided by
    Ozden, Selimcan
    Hartloper, Alexander R.
    de Castro e Sousa, Albano
    Lignos, Dimitrios G.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials

    Background

    This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels; one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database, and data from the individual experiments are also included.

    The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).

    Usage

    The data is licensed through the Creative Commons Attribution 4.0 International.

    If you have used our data and are publishing your work, we ask that you please reference both:

    this database through its DOI, and

    any publication that is associated with the experiments. See the Overall_Summary and Database_References files for the associated publication references.

    Included Files

    Overall_Summary_2022-08-25_v1-0-0.csv: summarises the specimen information for all experiments in the database.

    Summarized_Mechanical_Props_Campaign_2022-08-25_v1-0-0.csv: summarises the average initial yield stress and average initial elastic modulus per campaign.

    Unreduced_Data-#_v1-0-0.zip: contains the original (not downsampled) data

    Where # is one of: 1, 2, 3, 4, 5, 6. The unreduced data is broken into separate archives because of upload limitations to Zenodo. Together they provide all the experimental data.

    We recommend you un-zip all the folders and place them in one "Unreduced_Data" directory, similar to the "Clean_Data" directory.

    The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

    There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the unreduced data.

    The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

    Clean_Data_v1-0-0.zip: contains all the downsampled data

    The experimental data is provided through .csv files for each test that contain the processed data. The experiments are organised by experimental campaign and named by load protocol and specimen. A .pdf file accompanies each test showing the stress-strain graph.

    There is a "db_tag_clean_data_map.csv" file that is used to map the database summary with the clean data.

    The computed yield stresses and elastic moduli are stored in the "yield_stress" directory.

    Database_References_v1-0-0.bib

    Contains a bibtex reference for many of the experiments in the database. Corresponds to the "citekey" entry in the summary files.

    File Format: Downsampled Data

    These are the "LP_Specimen_processed_data.csv" files in the "Clean_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:

    The header of the first column is empty: the first column corresponds to the index of the sample point in the original (unreduced) data

    Time[s]: time in seconds since the start of the test

    e_true: true strain

    Sigma_true: true stress in MPa

    (optional) Temperature[C]: the surface temperature in degC

    These data files can be easily loaded using the pandas library in Python through:

    import pandas
    data = pandas.read_csv(data_file, index_col=0)

    The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.

    File Format: Unreduced Data

    These are the "LP_Specimen_processed_data.csv" files in the "Unreduced_Data" directory. The is the load protocol designation and the is the specimen number for that load protocol and material source. Each file contains the following columns:

    The first column is the index of each data point

    S/No: sample number recorded by the DAQ

    System Date: Date and time of sample

    Time[s]: time in seconds since the start of the test

    C_1_Force[kN]: load cell force

    C_1_Déform1[mm]: extensometer displacement

    C_1_Déplacement[mm]: cross-head displacement

    Eng_Stress[MPa]: engineering stress

    Eng_Strain[]: engineering strain

    e_true: true strain

    Sigma_true: true stress in MPa

    (optional) Temperature[C]: specimen surface temperature in degC

    The data can be loaded and used similarly to the downsampled data.

    File Format: Overall_Summary

    The overall summary file provides data on all the test specimens in the database. The columns include:

    hidden_index: internal reference ID

    grade: material grade

    spec: specifications for the material

    source: base material for the test specimen

    id: internal name for the specimen

    lp: load protocol

    size: type of specimen (M8, M12, M20)

    gage_length_mm_: unreduced section length in mm

    avg_reduced_dia_mm_: average measured diameter for the reduced section in mm

    avg_fractured_dia_top_mm_: average measured diameter of the top fracture surface in mm

    avg_fractured_dia_bot_mm_: average measured diameter of the bottom fracture surface in mm

    fy_n_mpa_: nominal yield stress

    fu_n_mpa_: nominal ultimate stress

    t_a_deg_c_: ambient temperature in degC

    date: date of test

    investigator: person(s) who conducted the test

    location: laboratory where test was conducted

    machine: setup used to conduct test

    pid_force_k_p, pid_force_t_i, pid_force_t_d: PID parameters for force control

    pid_disp_k_p, pid_disp_t_i, pid_disp_t_d: PID parameters for displacement control

    pid_extenso_k_p, pid_extenso_t_i, pid_extenso_t_d: PID parameters for extensometer control

    citekey: reference corresponding to the Database_References.bib file

    yield_stress_mpa_: computed yield stress in MPa

    elastic_modulus_mpa_: computed elastic modulus in MPa

    fracture_strain: computed average true strain across the fracture surface

    c,si,mn,p,s,n,cu,mo,ni,cr,v,nb,ti,al,b,zr,sn,ca,h,fe: chemical compositions in units of %mass

    file: file name of corresponding clean (downsampled) stress-strain data

    File Format: Summarized_Mechanical_Props_Campaign

    Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,

    tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv', index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1], keep_default_na=False, na_values='')

    citekey: reference in "Campaign_References.bib".

    Grade: material grade.

    Spec.: specifications (e.g., J2+N).

    Yield Stress [MPa]: initial yield stress in MPa

    size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Elastic Modulus [MPa]: initial elastic modulus in MPa

    size, count, mean, coefvar: number of experiments in campaign, number of experiments in mean, mean value for campaign, coefficient of variation for campaign

    Caveats

    The files in the following directories were tested before the protocol was established. Therefore, only the true stress-strain is available for each:

    A500

    A992_Gr50

    BCP325

    BCR295

    HYP400

    S460NL

    S690QL/25mm

    S355J2_Plates/S355J2_N_25mm and S355J2_N_50mm

  7. HPA - Sample Submission With Extra Metadata

    • kaggle.com
    Updated Feb 28, 2021
    Cite
    Darien Schettler (2021). HPA - Sample Submission With Extra Metadata [Dataset]. https://www.kaggle.com/dschettler8845/hpa-sample-submission-with-extra-metadata/metadata
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 28, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Darien Schettler
    Description

    This is the Sample Submission CSV file after running the CellSegmentator tool on the images and recording relevant outputs.

    The extra data included is:

    • RLE Masks (for each cell)
    • Submission Style RLE Masks (for each cell)
    • Bounding Boxes (for each cell)

  8. warvan-ml-dataset

    • huggingface.co
    Cite
    warvan, warvan-ml-dataset [Dataset]. https://huggingface.co/datasets/warvan/warvan-ml-dataset
    Authors
    warvan
    Description

    Dataset Name

    This dataset contains structured data for machine learning and analysis purposes.

      Contents
    

    • data/sample.csv: Sample dataset file.
    • data/train.csv: Training dataset.
    • data/test.csv: Testing dataset.
    • scripts/preprocess.py: Script for preprocessing the dataset.
    • scripts/analyze.py: Script for data analysis.

      Usage
    

    Load the dataset using Pandas:

    import pandas as pd
    df = pd.read_csv('data/sample.csv')

    Run preprocessing: python scripts/preprocess.py… See the full description on the dataset page: https://huggingface.co/datasets/warvan/warvan-ml-dataset.

  9. can-csv

    • data.dtu.dk
    zip
    Updated Dec 15, 2023
    + more versions
    Cite
    Brooke Elizabeth Lampe (2023). can-csv [Dataset]. http://doi.org/10.11583/DTU.24805509.v1
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    Technical University of Denmark
    Authors
    Brooke Elizabeth Lampe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    can-csv

    This dataset contains controller area network (CAN) traffic for the 2017 Subaru Forester, the 2016 Chevrolet Silverado, the 2011 Chevrolet Traverse, and the 2011 Chevrolet Impala. For each vehicle, there are samples of attack-free traffic--that is, normal traffic--as well as samples of various types of attacks. The spoofing attacks, such as RPM spoofing, speed spoofing, etc., have an observable effect on the vehicle under test. This repository contains only .csv files. It is a subset of the can-dataset repository.

  10. TRTH JSE AGLJ.J Intraday Transaction Test Data

    • data.mendeley.com
    Updated May 2, 2019
    + more versions
    Cite
    Tim Gebbie (2019). TRTH JSE AGLJ.J Intraday Transaction Test Data [Dataset]. http://doi.org/10.17632/4rrk89c3b2.2
    Dataset updated
    May 2, 2019
    Authors
    Tim Gebbie
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An example of TRTH intraday top-of-book transaction data for a single Johannesburg Stock Exchange (JSE) listed equity. The data is intended for teaching, learning and research projects, and was sourced from the legacy Tick History v1 SOAP API interface at https://tickhistory.thomsonreuters.com/TickHistory in May 2016. Related raw data and similar data-structures can now be accessed using Tick History v2 and the REST API https://hosted.datascopeapi.reuters.com/RestApi/v1.

    Configuration control: the test dataset contains 16 CSV files with names: "

    Attributes: The data set is for the ticker: AGLJ.J from May 2010 until May 2016. The files include the following attributes: RIC, Local Date-Time, Event Type, Price at the Event, Volume at the Event, Best Bid Changes, Best Ask Changes, and Trade Event Sign: RIC, DateTimeL, Type, Price, Volume, L1 Bid, L1 Ask, Trade Sign. The Local Date-Time (DateTimeL) is a serial date number where 1 corresponds to Jan-1-0000, for example, 736333.382013 corresponds to 4-Jan-2016 09:10:05 (or 20160104T091005 in ISO 8601 format). The trade event sign (Trade Sign) indicates whether the transaction was buyer (or seller) initiated as +1 (-1) and was prepared using the method of Lee and Ready (2008).
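For illustration, a minimal conversion sketch, assuming the serial dates follow the MATLAB-style datenum convention implied by the example above (day 1 = 1-Jan-0000):

```python
from datetime import datetime, timedelta

def datenum_to_datetime(datenum: float) -> datetime:
    """Convert a serial date number (day 1 = 1-Jan-0000) to a Python datetime."""
    days = int(datenum)
    frac = datenum - days
    # Python ordinal day 1 is 1-Jan-0001, which is 366 days after this serial epoch.
    return datetime.fromordinal(days - 366) + timedelta(days=frac)

print(datenum_to_datetime(736333.382013))  # approx. 2016-01-04 09:10:05
```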

    Disclaimer: The data is not up-to-date, is incomplete, it has been pre-processed; as such it is not fit for any other purpose than teaching and learning, and algorithm testing. For complete, up-to-date, and error-free data please use the Tick History v2 interface directly.

    Research Objectives: The data has been used to build empirical evidence in support of hierarchical causality and universality in financial markets by considering price impact on different time and averaging scales, feature selection on different scales as inputs into scale dependent machine learning applications, and for various aspects of agent-based model calibration and market ecology studies on different time and averaging scales.

    Acknowledgements to: Diane Wilcox, Dieter Hendricks, Michael Harvey, Fayyaaz Loonat, Michael Gant, Nicholas Murphy and Donovan Platt.

  11. ScanGrow Manuscript files

    • figshare.com
    txt
    Updated Jun 3, 2023
    Cite
    Laura Espina; Ross Worth (2023). ScanGrow Manuscript files [Dataset]. http://doi.org/10.6084/m9.figshare.16822924.v1
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    figshare
    Authors
    Laura Espina; Ross Worth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets related to the manuscript describing the ScanGrow [Proof of Concept] application:

    Worth RM and Espina L (2022) ScanGrow: Deep Learning-Based Live Tracking of Bacterial Growth in Broth. Front. Microbiol. 13:900596.

    doi: 10.3389/fmicb.2022.900596

    The contents of the three compressed folders are described below.

    1. TRAINING_MODEL.ZIP

    Collection of images and spreadsheets that was used in the training of the image classification model that ScanGrow [PoC] uses by default. This training dataset should be subjected to the pre-processing workflow provided with ScanGrow to obtain the grouped images to be fed to the model training utility.

    2. TEST_MODEL.ZIP

    Collection of images and spreadsheets comprising the Test dataset used in the evaluation of the image classification model. This includes:
    • New scans and spreadsheets (represented in Figure 3 as gray triangles).
    • Evaluation.csv: combined results of the output files from command "Test Model" when run with:
      * Dataset Test: these scans and spreadsheets (not used for training),
      * Dataset Training: the dataset used for training the model, or
      * Dataset Validation: the Training dataset after having flipped horizontally and offsetting the images and adjusted the spectrophotometric values according to the newly inverted well positions.

    3. SAMPLE_RUN.ZIP

    Data from a sample run used to test ScanGrow on a microplate containing different concentrations of several antibiotics. This includes:

    • Scans used in the "Sample run" with added antibiotics in the bacterial cultures.
    • Sample_run_raw.csv: Data exported from the Table view after the run.
    • Sample_run_processed.csv: Data from the Sample_run_raw.csv file after the introduction of metadata (e.g. contents of each well) and calculation of the AUC (area under the curve).
    • Sample_run_json.json: JSON file showing the results of this run. It can be loaded into a ScanGrow session by clicking on "Show Graphs" -> "Open".
    • ImageMask.csv: alternative ImageMask to substitute the original one in "C:\Program Files\Riverwell Consultancy Services Ltd\Scan Grow\Configuration". In this alternative ImageMask file, well C11 was modified to overcome an artefact in the scan.
  12. DCASE 2022 Challenge Task 2 Development Dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Jun 14, 2022
    + more versions
    Cite
    Kota Dohi; Keisuke Imoto; Yuma Koizumi; Noboru Harada; Daisuke Niizumi; Tomoya Nishida; Harsh Purohit; Takashi Endo; Masaaki Yamamoto; Yohei Kawaguchi (2022). DCASE 2022 Challenge Task 2 Development Dataset [Dataset]. http://doi.org/10.5281/zenodo.6355122
    Dataset updated
    Jun 14, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kota Dohi; Keisuke Imoto; Yuma Koizumi; Noboru Harada; Daisuke Niizumi; Tomoya Nishida; Harsh Purohit; Takashi Endo; Masaaki Yamamoto; Yohei Kawaguchi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description

    This dataset is the "development dataset" for the DCASE 2022 Challenge Task 2 "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques".

    The data consists of the normal/anomalous operating sounds of seven types of real/toy machines. Each recording is a single-channel 10-second audio that includes both a machine's operating sound and environmental noise. The following seven types of real/toy machines are used in this task:

    • Fan
    • Gearbox
    • Bearing
    • Slide rail
    • ToyCar
    • ToyTrain
    • Valve

    Overview of the task

    Anomalous sound detection (ASD) is the task of identifying whether the sound emitted from a target machine is normal or anomalous. Automatic detection of mechanical failure is an essential technology in the fourth industrial revolution, which involves artificial intelligence (AI)-based factory automation. Prompt detection of machine anomalies by observing sounds is useful for monitoring the condition of machines.

    This task is the follow-up to DCASE 2020 Task 2 and DCASE 2021 Task 2. The task this year is to detect anomalous sounds under three main conditions:

    1. Only normal sound clips are provided as training data (i.e., unsupervised learning scenario). In real-world factories, anomalies rarely occur and are highly diverse. Therefore, exhaustive patterns of anomalous sounds are impossible to create or collect and unknown anomalous sounds that were not observed in the given training data must be detected. This condition is the same as in DCASE 2020 Task 2 and DCASE 2021 Task 2.

    2. Factors other than anomalies change the acoustic characteristics between training and test data (i.e., domain shift). In real-world cases, operational conditions of machines or environmental noise often differ between the training and testing phases. For example, the operation speed of a conveyor can change due to seasonal demand, or environmental noise can fluctuate depending on the states of surrounding machines. This condition is the same as in DCASE 2021 Task 2.

    3. In test data, samples unaffected by domain shifts (source domain data) and those affected by domain shifts (target domain data) are mixed, and the source/target domain of each sample is not specified. Therefore, the model must detect anomalies regardless of the domain (i.e., domain generalization).

    Definition

    We first define key terms in this task: "machine type," "section," "source domain," "target domain," and "attributes."

    • "Machine type" indicates the kind of machine, which in this task is one of seven: fan, gearbox, bearing, slide rail, valve, ToyCar, and ToyTrain.
    • A section is defined as a subset of the dataset for calculating performance metrics. Each section is dedicated to a specific type of domain shift.
    • The source domain is the domain under which most of the training data and part of the test data were recorded, and the target domain is a different set of domains under which a few of the training data and part of the test data were recorded. There are differences between the source and target domains in terms of operating speed, machine load, viscosity, heating temperature, type of environmental noise, SNR, etc.
    • Attributes are parameters that define states of machines or types of noise.

    Dataset

    This dataset consists of three sections for each machine type (Sections 00, 01, and 02), and each section is a complete set of training and test data. For each section, this dataset provides (i) 990 clips of normal sounds in the source domain for training, (ii) ten clips of normal sounds in the target domain for training, and (iii) 100 clips each of normal and anomalous sounds for the test. The source/target domain of each sample is provided. Additionally, the attributes of each sample in the training and test data are provided in the file names and attribute csv files.

    File names and attribute csv files

    File names and attribute csv files provide reference labels for each clip. The given reference labels for each training/test clip include machine type, section index, normal/anomaly information, and attributes regarding the condition other than normal/anomaly. The machine type is given by the directory name. The section index is given by their respective file names. For the datasets other than the evaluation dataset, the normal/anomaly information and the attributes are given by their respective file names. Attribute csv files are for easy access to attributes that cause domain shifts. In these files, the file names, name of parameters that cause domain shifts (domain shift parameter, dp), and the value or type of these parameters (domain shift value, dv) are listed. Each row takes the following format:

    [filename (string)], [d1p (string)], [d1v (int | float | string)], [d2p], [d2v]...
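A minimal parsing sketch for rows in that format (the file name below is hypothetical; only the row layout comes from the description above):

```python
import csv

# Hypothetical attribute csv path; each row is [filename, d1p, d1v, d2p, d2v, ...]
with open("attributes_00.csv", newline="") as f:
    for row in csv.reader(f):
        filename, rest = row[0], row[1:]
        # Pair up (domain shift parameter, value) columns.
        attrs = dict(zip(rest[0::2], rest[1::2]))
        print(filename, attrs)
```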

    Recording procedure

    Normal/anomalous operating sounds of machines and its related equipment are recorded. Anomalous sounds were collected by deliberately damaging target machines. For simplifying the task, we use only the first channel of multi-channel recordings; all recordings are regarded as single-channel recordings of a fixed microphone. We mixed a target machine sound with environmental noise, and only noisy recordings are provided as training/test data. The environmental noise samples were recorded in several real factory environments. We will publish papers on the dataset to explain the details of the recording procedure by the submission deadline.

    Directory structure

    - /dev_data
      - /fan
        - /train (only normal clips)
          - /section_00_source_train_normal_0000_

    Baseline system

    Two baseline systems are available in the GitHub repositories baseline_ae and baseline_mobile_net_v2. The baseline systems provide a simple entry-level approach that gives a reasonable performance in the dataset of Task 2. They are good starting points, especially for entry-level researchers who want to get familiar with the anomalous-sound-detection task.

    Condition of use

    This dataset was created jointly by Hitachi, Ltd. and NTT Corporation and is available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

    Citation

    If you use this dataset, please cite all the following three papers.

    • Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Masaaki Yamamoto, Yohei Kawaguchi, Description and Discussion on DCASE 2022 Challenge Task 2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Applying Domain Generalization Techniques. In arXiv e-prints: 2206.05876, 2022. [URL]
    • Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi. MIMII DG: sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task. In arXiv e-prints: 2205.13879, 2022. [URL]
    • Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito. ToyADMOS2: another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions. In

  13. CODE-test: An annotated 12-lead ECG dataset

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jun 7, 2021
    + more versions
    Cite
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro (2021). CODE-test: An annotated 12-lead ECG dataset [Dataset]. http://doi.org/10.5281/zenodo.3765780
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Antonio H Ribeiro; Manoel Horta Ribeiro; Gabriela M. Paixão; Derick M. Oliveira; Paulo R. Gomes; Jéssica A. Canazart; Milton P. Ferreira; Carl R. Andersson; Peter W. Macfarlane; Wagner Meira Jr.; Thomas B. Schön; Antonio Luiz P. Ribeiro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    # Annotated 12 lead ECG dataset
    
    Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students. It is used as the test set in the paper: "Automatic diagnosis of the 12-lead ECG using a deep neural network". https://www.nature.com/articles/s41467-020-15432-4.
    
    It contains annotations for 6 different ECG abnormalities:
    - 1st degree AV block (1dAVb);
    - right bundle branch block (RBBB);
    - left bundle branch block (LBBB);
    - sinus bradycardia (SB);
    - atrial fibrillation (AF); and,
    - sinus tachycardia (ST).
    
    Companion python scripts are available in:
    https://github.com/antonior92/automatic-ecg-diagnosis
    
    --------
    
    Citation
    ```
    Ribeiro, A.H., Ribeiro, M.H., Paixão, G.M.M. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 11, 1760 (2020). https://doi.org/10.1038/s41467-020-15432-4
    ```
    
    Bibtex:
    ```
    @article{ribeiro_automatic_2020,
     title = {Automatic Diagnosis of the 12-Lead {{ECG}} Using a Deep Neural Network},
     author = {Ribeiro, Ant{\^o}nio H. and Ribeiro, Manoel Horta and Paix{\~a}o, Gabriela M. M. and Oliveira, Derick M. and Gomes, Paulo R. and Canazart, J{\'e}ssica A. and Ferreira, Milton P. S. and Andersson, Carl R. and Macfarlane, Peter W. and Meira Jr., Wagner and Sch{\"o}n, Thomas B. and Ribeiro, Antonio Luiz P.},
     year = {2020},
     volume = {11},
     pages = {1760},
     doi = {https://doi.org/10.1038/s41467-020-15432-4},
     journal = {Nature Communications},
     number = {1}
    }
    ```
    -----
    
    
    ## Folder content:
    
    - `ecg_tracings.hdf5`: The HDF5 file containing a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12 different leads of the ECG exams in the following order: `{DI, DII, DIII, AVR, AVL, AVF, V1, V2, V3, V4, V5, V6}`.
    
    The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples) we fill them with zeros on both sides. For instance, for a 7 seconds ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V: so they should be multiplied by 1000 in order to obtain the signals in V.
    
    In python, one can read this file using the following sequence:
    ```python
import h5py
import numpy as np

with h5py.File("ecg_tracings.hdf5", "r") as f:
    x = np.array(f['tracings'])
    ```
    
    - The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
    - `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header). The i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5`; this correspondence is the same in all csv files. The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST`, corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`).
     1. `cardiologist[1,2].csv` contains annotations from two different cardiologists.
     2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2 agree, the common diagnosis was considered as gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
     3. `dnn.csv` prediction from the deep neural network described in the paper. The threshold is set in such a way that it maximizes the F1 score.
     4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
     5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
     6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
    
  14. stt-english-test-dataset-sample

    • huggingface.co
    Updated May 11, 2025
    Cite
    Marian Ashraf Boshra (2025). stt-english-test-dataset-sample [Dataset]. https://huggingface.co/datasets/Marianne0Habib/stt-english-test-dataset-sample
    Dataset updated
    May 11, 2025
    Authors
    Marian Ashraf Boshra
    Description

    🗣️ English Speech Audio Dataset (Sample)

    This dataset contains English speech samples, annotated by dialect, speaking rate, environmental condition, and includes ground truth transcriptions. It is intended to support research and applications in Automatic Speech Recognition (ASR), and Spoken language understanding.

      📁 Dataset Structure
    

    Audio segments are stored in .wav format, accompanied by a CSV file (En_dataset.csv) with rich metadata.

      📊Dataset Statistics… See the full description on the dataset page: https://huggingface.co/datasets/Marianne0Habib/stt-english-test-dataset-sample.
    
  15. Embryo classification based on microscopic images

    • kaggle.com
    Updated Oct 3, 2023
    Cite
    Gaurav Dutta (2023). Embryo classification based on microscopic images [Dataset]. https://www.kaggle.com/datasets/gauravduttakiit/embryo-classification-based-on-microscopic-images/data
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Dutta
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Description

    Welcome to the "Hung Vuong Hospital Embryo Classification" dataset. This page provides a comprehensive overview of the data files, their formats, and the essential columns you'll encounter in this competition. Taking a moment to understand the data will help you navigate the challenge effectively and make informed decisions during your analysis and modeling.

    The dataset comprises the following key files:

    • train folder - Contains images of embryos at day-3 and day-5 for training purposes.
    • test folder - Contains images of embryos at day-3 and day-5 for testing purposes.
    • train.csv - Contains information about the training set.
    • test.csv - Contains information about the test set.
    • sample_submission.csv - A sample submission file that demonstrates the correct submission format.

    Data Format Expectations

    The embryo images are arranged within subfolders under the train and test directories. Each image is saved in JPG format and is labeled with a prefix. Images corresponding to day-3 embryos have the prefix D3 while images related to day-5 embryos bear the prefix D5. This prefix-based categorization allows for easy identification of the embryo's developmental stage.

    Expected Output

    Your task in this competition is to create a deep learning model that can accurately classify embryo images as 1 for good or 0 for not good for both day-3 and day-5 stages. The model should be trained on the training set and then used to predict the embryo quality in the test set. The ID column assigns an ID to each image. You will create the Class column as the result of model classification. The submission file contains only 2 columns: ID and Class (See the sample submission file)
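A minimal sketch of writing a submission in that two-column format (the IDs and predicted classes below are placeholders, not real model outputs):

```python
import pandas as pd

# Placeholder IDs and model outputs; Class is 1 (good) or 0 (not good).
test_ids = [101, 102, 103]
predictions = [1, 0, 1]

submission = pd.DataFrame({"ID": test_ids, "Class": predictions})
submission.to_csv("submission.csv", index=False)
```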

    Columns

    You will encounter the following columns throughout the dataset:

    • ID - Refers to the ID of the images in the test set.
    • Image - Refers to the file name of the embryo images in the train or test folder.
    • Class - Represents the evaluation of the embryo images. This column provides the ground truth label for each image, indicating whether the embryo is classified as 'good' or 'not good'.

    We encourage you to explore, analyze, and preprocess the provided data to build a robust model for accurate embryo quality classification. Good luck, and may your innovative solutions contribute to advancements in reproductive science!

  16. Yahoo Answers 10 categories for NLP CSV

    • opendatabay.com
    Updated Jun 23, 2025
    Cite
    Datasimple (2025). Yahoo Answers 10 categories for NLP CSV [Dataset]. https://www.opendatabay.com/data/ai-ml/d892a07d-269c-4f84-9183-4821b036731f
    Dataset updated
    Jun 23, 2025
    Dataset authored and provided by
    Datasimple
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Area covered
    Art & Digital Creations
    Description

    The Yahoo! Answers topic classification dataset is constructed using 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and testing samples 60,000 in this dataset. From all the answers and other meta-information, we only used the best answer content and the main category information.

    The file classes.txt contains a list of classes corresponding to each label.

    The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed with an "n" character, that is "\n".
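A minimal loading sketch, assuming the files ship without a header row (the column names below are made up for readability; only the column order comes from the description above):

```python
import pandas as pd

cols = ["class_index", "question_title", "question_content", "best_answer"]
train = pd.read_csv("train.csv", header=None, names=cols)
print(train["class_index"].value_counts())  # classes 1..10
```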

    Original Data Source: Yahoo Answers 10 categories for NLP CSV

  17. Data from: Pesticide Data Program (PDP)

    • agdatacommons.nal.usda.gov
    txt
    Updated Nov 30, 2023
    + more versions
    Cite
    U.S. Department of Agriculture (USDA), Agricultural Marketing Service (AMS) (2023). Pesticide Data Program (PDP) [Dataset]. http://doi.org/10.15482/USDA.ADC/1520764
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Ag Data Commons
    Authors
    U.S. Department of Agriculture (USDA), Agricultural Marketing Service (AMS)
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    The Pesticide Data Program (PDP) is a national pesticide residue database program. Through cooperation with State agriculture departments and other Federal agencies, PDP manages the collection, analysis, data entry, and reporting of pesticide residues on agricultural commodities in the U.S. food supply, with an emphasis on those commodities highly consumed by infants and children. This dataset provides information on where each tested sample was collected, where the product originated from, what type of product it was, and what residues were found on the product, for calendar years 1992 through 2020. The data can measure residues of individual compounds and classes of compounds, as well as provide information about the geographic distribution of the origin of samples, from growers, packers and distributors. The dataset also includes information on where the samples were taken, what laboratory was used to test them, and all testing procedures (by sample, so can be linked to the compound that is identified). The dataset also contains a reference variable for each compound that denotes the limit of detection for a pesticide/commodity pair (LOD variable). The metadata also includes EPA tolerance levels or action levels for each pesticide/commodity pair. The dataset will be updated on a continual basis, with a new resource data file added annually after the PDP calendar-year survey data is released.

    Resources in this dataset:

    • CSV Data Dictionary for PDP (File Name: PDP_DataDictionary.csv). Machine-readable Comma Separated Values (CSV) format data dictionary for PDP Database Zip files. Defines variables for the sample identity and analytical results data tables/files. The ## characters in the Table and Text Data File name refer to the 2-digit year for the PDP survey, like 97 for 1997 or 01 for 2001. For details on table linking, see PDF. Resource Software Recommended: Microsoft Excel, url: https://www.microsoft.com/en-us/microsoft-365/excel
    • Data dictionary for Pesticide Data Program (File Name: PDP DataDictionary.pdf). Data dictionary for PDP Database Zip files. Resource Software Recommended: Adobe Acrobat, url: https://www.adobe.com
    • 2019 PDP Database Zip File (File Name: 2019PDPDatabase.zip)
    • 2018 PDP Database Zip File (File Name: 2018PDPDatabase.zip)
    • 2017 PDP Database Zip File (File Name: 2017PDPDatabase.zip)
    • 2016 PDP Database Zip File (File Name: 2016PDPDatabase.zip)
    • 2015 PDP Database Zip File (File Name: 2015PDPDatabase.zip)
    • 2014 PDP Database Zip File (File Name: 2014PDPDatabase.zip)
    • 2013 PDP Database Zip File (File Name: 2013PDPDatabase.zip)
    • 2012 PDP Database Zip File (File Name: 2012PDPDatabase.zip)
    • 2011 PDP Database Zip File (File Name: 2011PDPDatabase.zip)
    • 2010 PDP Database Zip File (File Name: 2010PDPDatabase.zip)
    • 2009 PDP Database Zip File (File Name: 2009PDPDatabase.zip)
    • 2008 PDP Database Zip File (File Name: 2008PDPDatabase.zip)
    • 2007 PDP Database Zip File (File Name: 2007PDPDatabase.zip)
    • 2005 PDP Database Zip File (File Name: 2005PDPDatabase.zip)
    • 2004 PDP Database Zip File (File Name: 2004PDPDatabase.zip)
    • 2003 PDP Database Zip File (File Name: 2003PDPDatabase.zip)
    • 2002 PDP Database Zip File (File Name: 2002PDPDatabase.zip)
    • 2001 PDP Database Zip File (File Name: 2001PDPDatabase.zip)
    • 2000 PDP Database Zip File (File Name: 2000PDPDatabase.zip)
    • 1999 PDP Database Zip File (File Name: 1999PDPDatabase.zip)
    • 1998 PDP Database Zip File (File Name: 1998PDPDatabase.zip)
    • 1997 PDP Database Zip File (File Name: 1997PDPDatabase.zip)
    • 1996 PDP Database Zip File (File Name: 1996PDPDatabase.zip)
    • 1995 PDP Database Zip File (File Name: 1995PDPDatabase.zip)
    • 1994 PDP Database Zip File (File Name: 1994PDPDatabase.zip)
    • 1993 PDP Database Zip File (File Name: 1993PDPDatabase.zip)
    • 1992 PDP Database Zip File (File Name: 1992PDPDatabase.zip)
    • 2006 PDP Database Zip File (File Name: 2006PDPDatabase.zip)
    • 2020 PDP Database Zip File (File Name: 2020PDPDatabase.zip). Data and supporting files for PDP 2020 survey. Resource Software Recommended: Microsoft Access, url: https://products.office.com/en-us/access

  18. Yelp_Reviews_for_Sentiment_Analysis_fine_grained_5_classes

    • huggingface.co
    Updated Mar 6, 2012
    Cite
    yassir acharki (2012). Yelp_Reviews_for_Sentiment_Analysis_fine_grained_5_classes [Dataset]. https://huggingface.co/datasets/yassiracharki/Yelp_Reviews_for_Sentiment_Analysis_fine_grained_5_classes
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 6, 2012
    Authors
    yassir acharki
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 testing samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 testing samples.

      Dataset Description
    

    The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 2 columns in them, corresponding to class index (1 to 5) and review text. The review texts are… See the full description on the dataset page: https://huggingface.co/datasets/yassiracharki/Yelp_Reviews_for_Sentiment_Analysis_fine_grained_5_classes.
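Since this copy is hosted on the Hugging Face Hub, a minimal loading sketch (assuming the datasets library is installed and the repo loads as-is):

```python
from datasets import load_dataset

ds = load_dataset("yassiracharki/Yelp_Reviews_for_Sentiment_Analysis_fine_grained_5_classes")
print(ds)
```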

  19. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Dec 7, 2023
    Cite
    Dylan Westfall; Mullins James (2023). Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. http://doi.org/10.5061/dryad.w3r2280w0
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Vaccine Trials Network (http://www.hvtn.org/)
    HIV Prevention Trials Network
    National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    CC0 1.0 Universal: https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies. Methods This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies" Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005 For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub. The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). 
    More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub. The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file with the same name, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample passing various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections; this information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of families where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, then combined into a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and individual datasets for each figure, which were copied into Prism software to create the final figures for the paper.
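    As an illustration of the kind of check compare_seqs.py performs, the sketch below pairs sequences that share an identifier between the sUMI and dUMI fasta files and lists any pairs whose sequences differ. The fasta parsing, identifier matching, and exact string comparison are simplifying assumptions made here, not the script's actual implementation, and the file names are placeholders.

        # Minimal sketch of a sUMI/dUMI concordance check; file names are placeholders.
        def read_fasta(path):
            seqs, name = {}, None
            with open(path) as fh:
                for line in fh:
                    line = line.strip()
                    if line.startswith(">"):
                        name = line[1:].split()[0]
                        seqs[name] = []
                    elif name is not None:
                        seqs[name].append(line)
            return {k: "".join(v) for k, v in seqs.items()}

        def discordant_pairs(sumi_fasta, dumi_fasta):
            sumi, dumi = read_fasta(sumi_fasta), read_fasta(dumi_fasta)
            shared = sumi.keys() & dumi.keys()
            return sorted(n for n in shared if sumi[n] != dumi[n])

        # Example usage with placeholder paths:
        # for name in discordant_pairs("all_sUMI.fasta", "all_rank1_dUMI.fasta"):
        #     print(name)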

  20. Root thread strength, landslide headscarp geometry, and observed root...

    • catalog.data.gov
    • data.usgs.gov
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Root thread strength, landslide headscarp geometry, and observed root characteristics at the monitored CB1 landslide, Oregon, USA [Dataset]. https://catalog.data.gov/dataset/root-thread-strength-landslide-headscarp-geometry-and-observed-root-characteristics-at-the
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Oregon, United States
    Description

    This data release supports interpretations of field-observed root distributions within a shallow landslide headscarp (CB1) located below Mettman Ridge in the Oregon Coast Range, approximately 15 km northeast of Coos Bay, Oregon, USA (Schmidt_2021_CB1_topo_far.png and Schmidt_2021_CB1_topo_close.png). Root species, diameter (greater than or equal to 1 mm), general orientation relative to the slide scarp, and depth below ground surface were characterized immediately following landsliding in response to large-magnitude precipitation in November 1996, which triggered thousands of landslides within the area (Montgomery and others, 2009). The enclosed data include (1) tests of root-thread failure as a function of root diameter and tensile load for different plant species applicable to the broader Oregon Coast Range, and (2) a tape and compass survey of the planform geometry of the CB1 landslide and the roots observed in the slide scarp.

    Root diameter and load measurements were principally collected in the general area of the CB1 slide for the 12 species listed in Schmidt_2021_OR_root_species_list.csv. The failure tests involved identifying roots of a given plant species, trimming root threads into 15-20 cm long segments, measuring diameters including bark (up to 6.5 mm) with a micrometer at multiple points along each segment to arrive at an average, clamping a segment end to a calibrated spring, and loading the root until failure while recording the maximum load. Files containing the tensile failure tests described in Schmidt and others (2001) include root diameter (mm), critical tensile load at failure (kg), root cross-sectional area (m^2), and tensile strength (MPa). Tensile strengths were calculated as (critical tensile load at failure * gravitational acceleration) / root cross-sectional area. The files are labeled: Schmidt_2021_OR_root_AceCir.csv, Schmidt_2021_OR_root_AceMac.csv, Schmidt_2021_OR_root_AlnRub.csv, Schmidt_2021_OR_root_AnaMar.csv, Schmidt_2021_OR_root_DigPur.csv, Schmidt_2021_OR_root_MahNer.csv, Schmidt_2021_OR_root_PolMun.csv, Schmidt_2021_OR_root_PseMen_damaged.csv, Schmidt_2021_OR_root_PseMen_healthy.csv, Schmidt_2021_OR_root_RubDis.csv, Schmidt_2021_OR_root_RubPar.csv, Schmidt_2021_OR_root_SamCae.csv, and Schmidt_2021_OR_root_TsuHet.csv. File names adopt the first three letters of the genus and of the species from each plant's Latin binomial name.

    Live and damaged roots were identified based on their color, texture, plasticity, adherence of bark to woody material, and compressibility. For example, healthy live Douglas-fir (Pseudotsuga menziesii) roots (Schmidt_2021_OR_root_PseMen_healthy.csv) have a crimson-colored inner bark, which darkens to a distinctive brownish red in dead Douglas-fir roots. Live roots exhibited plastic responses to bending and strong adherence of bark, whereas dead roots displayed brittle behavior with bending and poor adherence of bark to the underlying woody material. Damaged root threads with fungal infections, resulting from selective tree harvest with yarding operations that damaged the bark of standing trees, exhibited significantly lower tensile strengths than their ultimate living tensile strengths (Schmidt_2021_OR_root_PseMen_damaged.csv). The CB1 site was clear-cut logged in 1987 and replanted with Douglas fir saplings in 1989.
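    The tensile strength calculation described above can be reproduced directly from the per-root measurements. The snippet below is a minimal sketch; the column names diameter_mm and load_kg are assumptions and should be checked against the actual headers of the Schmidt_2021_OR_root_*.csv files.

        # Recompute tensile strength (MPa) from diameter (mm) and failure load (kg).
        import csv
        import math

        G = 9.81  # gravitational acceleration, m/s^2

        def tensile_strength_mpa(diameter_mm, load_kg):
            area_m2 = math.pi * (diameter_mm / 1000.0 / 2.0) ** 2  # cross-sectional area
            return (load_kg * G) / area_m2 / 1e6                   # Pa -> MPa

        def strengths_from_csv(path, d_col="diameter_mm", load_col="load_kg"):
            # Column names are assumed, not taken from the data release.
            with open(path, newline="") as fh:
                return [tensile_strength_mpa(float(row[d_col]), float(row[load_col]))
                        for row in csv.DictReader(fh)]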
    Vegetation in the vicinity of the failure scarp is dominated by young Douglas fir saplings planted two years after the clear cut, blue elderberry (Sambucus caerulea), thimbleberry (Rubus parviflorus), foxglove (Digitalis purpurea), and Himalayan blackberry (Rubus discolor). The remaining seven species are provided for context in broader regional studies. The CB1 site is a hillslope hollow that failed as a shallow landslide and mobilized as a debris flow during heavy rainfall in November 1996. Prior to debris flow mobilization, the ~5-m wide slide, with a source area of roughly 860 m^2 and an average slope of 43°, displaced and broke numerous roots. Following landsliding, field observations noted a preponderance of exposed, blunt, broken root stubs within the scarp. Roots were not straight and smooth, but rather exhibited tortuous growth paths with firmly anchored, interlocking structures.

    The planform geometry, recorded by a tape and compass field survey, is presented as starting and ending points of slide margin segments of roughly equal colluvial soil depth above saprolite or bedrock (Schmidt_2021_CB1_scarp_geometry.csv and Schmidt_2021_CB1_scarp_pts.shp). The graphic Schmidt_2021_CB1_scarp_pts_poly.png shows the horseshoe-shaped profile and its numbered scarp segments; segment numbers enclosed within parentheses indicate segments where roots were not counted owing to occlusion by prior ground disturbance. The shapefile Schmidt_2021_CB1_scarp_poly.shp also represents the scarp line segments. The file Schmidt_2021_CB1_segment_info.csv presents the segment information as left and right cumulative lengths, averaged colluvium soil depths for each segment, and inclinations of the ground surface slope relative to horizontal along the perimeter (P) and the slide scarp face (F). Lastly, Schmidt_2021_CB1_rootdata_scarp.csv records, for individual root threads, the diameter measured with a micrometer, species, depth below ground surface, live versus dead status, general root orientation (parallel or perpendicular) relative to the scarp perimeter, and cumulative perimeter distance within the scarp segments. At CB1 specifically, and more generally across the Oregon Coast Range, root reinforcement occurs primarily as lateral reinforcement, with typically much smaller basal reinforcement.
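    As a worked example of combining the segment survey with the root observations, the sketch below assigns each root thread to the scarp segment whose cumulative perimeter lengths bracket its recorded perimeter distance and sums root cross-sectional area per segment, a crude proxy for lateral root reinforcement. The column names used here (segment, left_m, right_m, diameter_mm, perimeter_m) are assumptions, not the actual headers of Schmidt_2021_CB1_segment_info.csv and Schmidt_2021_CB1_rootdata_scarp.csv.

        # Sum root cross-sectional area per scarp segment; column names are assumed.
        import csv
        import math
        from collections import defaultdict

        def load_rows(path):
            with open(path, newline="") as fh:
                return list(csv.DictReader(fh))

        def root_area_by_segment(segment_csv, root_csv):
            segments = load_rows(segment_csv)   # assumed columns: segment, left_m, right_m
            roots = load_rows(root_csv)         # assumed columns: diameter_mm, perimeter_m
            totals = defaultdict(float)
            for r in roots:
                dist = float(r["perimeter_m"])
                area = math.pi * (float(r["diameter_mm"]) / 1000.0 / 2.0) ** 2
                for s in segments:
                    if float(s["left_m"]) <= dist <= float(s["right_m"]):
                        totals[s["segment"]] += area
                        break
            return dict(totals)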
