100+ datasets found
  1. Soccer Universe

    • kaggle.com
    zip
    Updated Jan 18, 2024
    Cite
    willian oliveira (2024). Soccer Universe [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/soccer-universe
    Explore at:
    Available download formats: zip (21133975 bytes)
    Dataset updated
    Jan 18, 2024
    Authors
    willian oliveira
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This comprehensive football dataset, derived primarily from Transfermarkt, serves as a valuable resource for football enthusiasts, offering structured information on competitions, clubs, and players. With over 60,000 games across major global competitions, the dataset delves into the performance metrics of 400+ clubs and detailed statistics for more than 30,000 players.

    Structured in CSV files, each with unique IDs, users can seamlessly join datasets to perform in-depth analyses. The dataset encompasses market values, historical valuations, and detailed player statistics, including physical attributes, contract statuses, and individual performances. A specialized Python-based web scraper ensures consistent updates, with data meticulously processed through Python scripts and SQL databases.

    To use the dataset effectively, users are encouraged to understand the relevant files, join datasets using unique IDs, and leverage compatible software tools like Python's pandas or R's ggplot2 for analysis. The guide emphasizes the potential for fantasy football predictions, tracking player value over time, assessing market value versus performance, and exploring the impact of cards on match outcomes.
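The join-by-ID workflow described above can be sketched with the Python standard library alone. The two miniature tables and the column names (`player_id`, `club_id`, `club_name`) are hypothetical stand-ins, not the dataset's actual files or headers; with pandas, a `merge` on the shared ID column would serve the same purpose.

```python
import csv
import io

# Hypothetical miniature tables; the real CSV files and column names may differ.
PLAYERS_CSV = """player_id,name,club_id
1,Player A,10
2,Player B,20
3,Player C,10
"""

CLUBS_CSV = """club_id,club_name
10,Club X
20,Club Y
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on(left_rows, right_rows, key):
    """Left-join two row lists on a shared ID column."""
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})  # empty dict when no club matches
        joined.append({**row, **match})
    return joined


players_with_clubs = join_on(read_rows(PLAYERS_CSV), read_rows(CLUBS_CSV), "club_id")
for row in players_with_clubs:
    print(row["name"], "->", row.get("club_name"))
```

Building the lookup dictionary once keeps the join linear in the number of rows, which matters for tables with tens of thousands of players.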

    Research ideas include player performance analysis for fantasy football or recruitment purposes, studying market value trends for economic insights, evaluating club performance for strategic decision-making, developing predictive models for match outcomes, and conducting social network analysis to understand interactions among clubs and players.

    Acknowledging the dataset's unknown license, users are encouraged to credit the original authors, particularly David Cereijo, if used in research. The dataset's dedication to accessibility is evident through active discussions on GitHub for improvements and bug fixes.

    In conclusion, this football dataset offers a wealth of information, empowering users to explore diverse analyses and research ideas, bridging the gap between structured data and the dynamic world of football.

  2. A dataset for machine learning research in the field of stress analyses of...

    • data.mendeley.com
    • narcis.nl
    Updated Jul 25, 2020
    Cite
    Jaroslav Matej (2020). A dataset for machine learning research in the field of stress analyses of mechanical structures [Dataset]. http://doi.org/10.17632/wzbzznk8z3.2
    Explore at:
    Dataset updated
    Jul 25, 2020
    Authors
    Jaroslav Matej
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is prepared and intended as a data source for development of a stress analysis method based on machine learning. It consists of finite element stress analyses of randomly generated mechanical structures. The dataset contains more than 270,794 pairs of stress analyses images (von Mises stress) of randomly generated 2D structures with predefined thickness and material properties. All the structures are fixed at their bottom edges and loaded with gravity force only. See PREVIEW directory with some examples. The zip file contains all the files in the dataset.
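For readers unfamiliar with the quantity shown in the images, the von Mises stress can be computed from the in-plane stress components. The sketch below assumes a plane-stress state, which is an assumption made here for illustration; the dataset description does not say which formulation its finite element analyses used.

```python
from math import sqrt


def von_mises_plane_stress(sigma_x, sigma_y, tau_xy):
    """Von Mises equivalent stress for a 2D plane-stress state.

    Assumes plane stress (out-of-plane stress components are zero).
    """
    return sqrt(sigma_x**2 - sigma_x * sigma_y + sigma_y**2 + 3.0 * tau_xy**2)


# Uniaxial tension: the equivalent stress equals the applied stress.
uniaxial = von_mises_plane_stress(100.0, 0.0, 0.0)

# Pure shear: the equivalent stress is sqrt(3) times the shear stress.
pure_shear = von_mises_plane_stress(0.0, 0.0, 10.0)

print(uniaxial, pure_shear)
```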

  3. Dissecting the Space-Time Structure of Tree-Ring Datasets Using the Partial...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 23, 2014
    Cite
    Martinez-Meier, Alejandro; Godefroid, Martin; Rossi, Jean-Pierre; Rozenberg, Philippe; Sergent, Anne-Sophie; Ruiz-Diaz, Manuela; Nardin, Maxime; Pâques, Luc (2014). Dissecting the Space-Time Structure of Tree-Ring Datasets Using the Partial Triadic Analysis [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001257495
    Explore at:
    Dataset updated
    Sep 23, 2014
    Authors
    Martinez-Meier, Alejandro; Godefroid, Martin; Rossi, Jean-Pierre; Rozenberg, Philippe; Sergent, Anne-Sophie; Ruiz-Diaz, Manuela; Nardin, Maxime; Pâques, Luc
    Description

    Tree-ring datasets are used in a variety of circumstances, including archeology, climatology, forest ecology, and wood technology. These data are based on microdensity profiles and consist of a set of tree-ring descriptors, such as ring width or early/latewood density, measured for a set of individual trees. Because successive rings correspond to successive years, the resulting dataset is a ring variables × trees × time datacube. Multivariate statistical analyses, such as principal component analysis, have been widely used for extracting worthwhile information from ring datasets, but they typically address two-way matrices, such as ring variables × trees or ring variables × time. Here, we explore the potential of the partial triadic analysis (PTA), a multivariate method dedicated to the analysis of three-way datasets, to apprehend the space-time structure of tree-ring datasets. We analyzed a set of 11 tree-ring descriptors measured in 149 georeferenced individuals of European larch (Larix decidua Miller) during the period of 1967–2007. The processing of densitometry profiles led to a set of ring descriptors for each tree and for each year from 1967–2007. The resulting three-way data table was subjected to two distinct analyses in order to explore i) the temporal evolution of spatial structures and ii) the spatial structure of temporal dynamics. We report the presence of a spatial structure common to the different years, highlighting the inter-individual variability of the ring descriptors at the stand scale. We found a temporal trajectory common to the trees that could be separated into a high and low frequency signal, corresponding to inter-annual variations possibly related to defoliation events and a long-term trend possibly related to climate change. We conclude that PTA is a powerful tool to unravel and hierarchize the different sources of variation within tree-ring datasets.

  4. Biases of STRUCTURE software when exploring introduction routes: Datasets,...

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jan 21, 2020
    + more versions
    Cite
    Lombaert, Eric; Guillemaud, Thomas; Deleury, Emeline (2020). Biases of STRUCTURE software when exploring introduction routes: Datasets, STRUCTURE outputs, simulation and analysis pipeline [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_1002657
    Explore at:
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    INRA
    Authors
    Lombaert, Eric; Guillemaud, Thomas; Deleury, Emeline
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This archive is associated with the article "Biases of STRUCTURE software when exploring introduction routes of invasive species". Authors: Eric Lombaert, Thomas Guillemaud & Emeline Deleury.

    The file contains the 22,500 simulated datasets, the corresponding 900,000 STRUCTURE outputs and the summary statistics files. It also contains SIM_STRUCT, a home-made pipeline developed to carry out the analyses described in the manuscript. It can be used to simulate and summarize datasets, and to perform STRUCTURE analyses in batch on those simulated datasets. It is currently based on several software packages, such as DIYABC, ARLSUMSTAT and STRUCTURE, as well as on some home-made PERL scripts. A tutorial is included. See the Readme file for details.

  5. Complete DAX Practice Dataset

    • kaggle.com
    zip
    Updated Oct 29, 2025
    Cite
    Mohamed Mahmoud Ali (2025). Complete DAX Practice Dataset [Dataset]. https://www.kaggle.com/datasets/thesnak/complete-dax-practice-dataset
    Explore at:
    Available download formats: zip (2980320 bytes)
    Dataset updated
    Oct 29, 2025
    Authors
    Mohamed Mahmoud Ali
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    🧼 Complete DAX Practice Dataset — Power BI / DAX Learning Resource

    📘 Overview

    This synthetic dataset is designed specifically for Power BI and DAX (Data Analysis Expressions) learners and professionals. It provides a complete star schema for practicing DAX measures, relationships, filters, and time intelligence — just like in real-world business analytics projects.

    The dataset simulates a multi-year sales environment with customers, employees, products, geographies, and dates — allowing you to perform calculations across multiple business dimensions.

    🧩 Dataset Structure

    This dataset contains 6 CSV files, forming a clean star schema:

    Table Name | Type | Description
    FactSales | Fact | Contains transactional sales data with quantities, amounts, profits, discounts, and references to all dimension keys.
    DimDate | Dimension | A complete date table (2018–2024) including Year, Quarter, Month, DayOfWeek, Weekend/Holiday flags, etc.
    DimProduct | Dimension | Product catalog with Category, SubCategory, Color, Size, StandardCost, and ListPrice.
    DimCustomer | Dimension | Customer information including name, gender, signup date, loyalty tier, and geographic key.
    DimEmployee | Dimension | Sales employee data including name, role, hire date, and region.
    DimGeography | Dimension | Geographic data covering countries, regions, and cities.

    đŸ—‚ïž Fact Table Fields

    Column | Description
    SalesKey | Unique identifier for each transaction
    OrderDateKey, ShipDateKey | Foreign keys to DimDate
    ProductKey, CustomerKey, EmployeeKey, GeographyKey | Foreign keys to respective dimensions
    Quantity | Number of units sold
    UnitPrice | Price per unit
    Discount | Discount applied to the sale
    SalesAmount | Total sales value after discount
    TotalCost | Total cost of goods sold
    Profit | SalesAmount - TotalCost
    Channel | Online, Retail, or Distributor
    PaymentMethod | Credit, Cash, or Transfer
    OrderPriority | Low, Medium, or High priority

    📅 DimDate Fields

    Includes:

    • DateKey (YYYYMMDD)
    • Date
    • Year, Quarter, Month, MonthName, Day, DayOfWeek
    • IsWeekend, IsHoliday

    Perfect for DAX time intelligence functions like: TOTALYTD, SAMEPERIODLASTYEAR, DATESINPERIOD, and PARALLELPERIOD.
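Outside Power BI, the integer `DateKey (YYYYMMDD)` format can be converted to a calendar date with the standard library. This is a minimal sketch; the helper names are illustrative, and the Saturday/Sunday rule for the weekend flag is an assumption, since the listing does not say how `IsWeekend` was derived.

```python
from datetime import date, datetime


def date_from_key(date_key):
    """Convert an integer key such as 20240131 into a date object."""
    return datetime.strptime(str(date_key), "%Y%m%d").date()


def is_weekend(day):
    """Assumed rule: Saturday (5) and Sunday (6) count as weekend days."""
    return day.weekday() >= 5


order_date = date_from_key(20240131)
print(order_date, order_date.year, is_weekend(order_date))
```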

    🌍 Business Scenario

    Imagine a mid-sized electronics retailer operating across multiple regions and sales channels. The dataset captures 7 years of simulated performance — including seasonal patterns, regional sales variations, and customer loyalty effects.

    🧠 Learning Objectives

    This dataset is designed for:

    • Practicing Power BI data modeling
    • Learning and mastering DAX functions
    • Building interactive dashboards
    • Applying time intelligence and advanced calculations
    • Teaching data modeling concepts in analytics courses

    💡 Example DAX Practice Topics

    You can use this dataset to practice almost every DAX concept:

    Basic Aggregations

    Total Sales = SUM(FactSales[SalesAmount])
    Total Profit = SUM(FactSales[Profit])
    

    Context & Filters

    Online Sales = CALCULATE([Total Sales], FactSales[Channel] = "Online")
    

    Time Intelligence

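    YTD Sales = TOTALYTD([Total Sales], DimDate[Date])
    -- Assumed definition: the next measure references [Previous Year Sales], which the listing does not define
    Previous Year Sales = CALCULATE([Total Sales], SAMEPERIODLASTYEAR(DimDate[Date]))
    Sales YoY % = DIVIDE([Total Sales] - [Previous Year Sales], [Previous Year Sales])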
    

    Relationship Functions

    Shipped Sales = CALCULATE([Total Sales], USERELATIONSHIP(FactSales[ShipDateKey], DimDate[DateKey]))
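To check measure results outside Power BI, the same calculations can be approximated in plain Python. This is a rough sketch over a handful of made-up rows, not the dataset's actual values; the field names are illustrative, and returning `None` when the previous year has no sales is meant to mirror how `DIVIDE` yields a blank on division by zero.

```python
from datetime import date

# Hypothetical rows standing in for FactSales joined to DimDate.
rows = [
    {"order_date": date(2023, 3, 1), "channel": "Online", "sales": 100.0},
    {"order_date": date(2023, 11, 5), "channel": "Retail", "sales": 250.0},
    {"order_date": date(2024, 2, 10), "channel": "Online", "sales": 80.0},
    {"order_date": date(2024, 6, 20), "channel": "Distributor", "sales": 120.0},
]


def total_sales(data):
    """Counterpart of SUM(FactSales[SalesAmount])."""
    return sum(r["sales"] for r in data)


def online_sales(data):
    """Counterpart of CALCULATE([Total Sales], Channel = "Online")."""
    return total_sales([r for r in data if r["channel"] == "Online"])


def ytd_sales(data, as_of):
    """Sales from 1 January of as_of's year up to and including as_of."""
    return total_sales(
        [r for r in data
         if r["order_date"].year == as_of.year and r["order_date"] <= as_of]
    )


def yoy_pct(data, year):
    """Year-over-year change; None when the previous year has no sales."""
    current = total_sales([r for r in data if r["order_date"].year == year])
    previous = total_sales([r for r in data if r["order_date"].year == year - 1])
    return None if previous == 0 else (current - previous) / previous


print(total_sales(rows), online_sales(rows))
print(ytd_sales(rows, date(2024, 3, 31)), yoy_pct(rows, 2024))
```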
    

    Ranki...

  6. Raw data from datasets used in SIMON analysis

    • data.europa.eu
    unknown
    Updated Jan 27, 2022
    Cite
    Zenodo (2022). Raw data from datasets used in SIMON analysis [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-2580414?locale=hr
    Explore at:
    Available download formats: unknown (312591)
    Dataset updated
    Jan 27, 2022
    Dataset authored and provided by
    Zenodo http://zenodo.org/
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here you can find raw data and information about each of the 34 datasets generated by the mulset algorithm and used for further analysis in SIMON. Each dataset is stored in a separate folder, which contains 4 files:

    • json_info: the number of features with their names, and the number of subjects available for the same dataset
    • data_testing: data frame with the data used to test the trained model
    • data_training: data frame with the data used to train models
    • results: direct, unfiltered data from the database

    Files are written in feather format. Here is an example of data structure for each file in repository. File was compressed using 7-Zip available at https://www.7-zip.org/.

  7. Dataset for Targeted GC-MS Analysis of Firefighters' Exhaled Breath

    • catalog.data.gov
    • datasets.ai
    • +1more
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset for Targeted GC-MS Analysis of Firefighters' Exhaled Breath [Dataset]. https://catalog.data.gov/dataset/dataset-for-targeted-gc-ms-analysis-of-firefighters-exhaled-breath
    Explore at:
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency http://www.epa.gov/
    Description

    This dataset includes a table of the VOC concentrations detected in firefighter breath samples. QQ-plots for benzene, toluene, and ethylbenzene levels in breath samples as well as box-and-whisker plots of pre-, post-, and 1 h post-exposure breath levels of VOCs for firefighters participating in attack, search, and outside ventilation positions are provided. Graphs detailing the responses of individuals to pre-, post-, and 1 h post-exposure concentrations of benzene, toluene, and ethylbenzene are shown. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The original dataset contains identification information for the firefighters who participated in the controlled structure burns. The analyzed tables and graphs can be made publicly available. Format: The original dataset contains identification information for the firefighters who participated in the controlled structure burns. The analyzed tables and graphs can be made publicly available. This dataset is associated with the following publication: Wallace, A., J. Pleil, K. Oliver, D. Whitaker, S. Mentese, K. Fent, and G. Horn. Targeted GC-MS analysis of firefighters’ exhaled breath: Exploring biomarker response at the individual level. JOURNAL OF OCCUPATIONAL AND ENVIRONMENTAL HYGIENE. Taylor & Francis, Inc., Philadelphia, PA, USA, 16(5): 355-366, (2019).

  8. THVD (Talking Head Video Dataset)

    • data.mendeley.com
    Updated Apr 29, 2025
    + more versions
    Cite
    Mario Peedor (2025). THVD (Talking Head Video Dataset) [Dataset]. http://doi.org/10.17632/ykhw8r7bfx.2
    Explore at:
    Dataset updated
    Apr 29, 2025
    Authors
    Mario Peedor
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0) https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    About

    We provide a comprehensive talking-head video dataset with over 50,000 videos, totaling more than 500 hours of footage and featuring 20,841 unique identities from around the world.

    Distribution

    Detailing the format, size, and structure of the dataset:

    Data Volume:

    • Total Size: 2.7 TB
    • Total Videos: 47,547
    • Identities Covered: 20,841
    • Resolution: 60% 4K (1980), 33% Full HD (1080)
    • Formats: MP4
    • Full-length videos with visible mouth movements in every frame.
    • Minimum face size of 400 pixels.
    • Video durations range from 20 seconds to 5 minutes.
    • Faces have not been cut out; videos are full screen and include backgrounds.

    Usage

    This dataset is ideal for a variety of applications:

    Face Recognition & Verification: Training and benchmarking facial recognition models.

    Action Recognition: Identifying human activities and behaviors.

    Re-Identification (Re-ID): Tracking identities across different videos and environments.

    Deepfake Detection: Developing methods to detect manipulated videos.

    Generative AI: Training high-resolution video generation models.

    Lip Syncing Applications: Enhancing AI-driven lip-syncing models for dubbing and virtual avatars.

    Background AI Applications: Developing AI models for automated background replacement, segmentation, and enhancement.

    Coverage

    Explaining the scope and coverage of the dataset:

    Geographic Coverage: Worldwide

    Time Range: Time range and size of the videos have been noted in the CSV file.

    Demographics: Includes information about age, gender, ethnicity, format, resolution, and file size.

    Languages Covered (Videos):

    English: 23,038 videos

    Portuguese: 1,346 videos

    Spanish: 677 videos

    Norwegian: 1,266 videos

    Swedish: 1,056 videos

    Korean: 848 videos

    Polish: 1,807 videos

    Indonesian: 1,163 videos

    French: 1,102 videos

    German: 1,276 videos

    Japanese: 1,433 videos

    Dutch: 1,666 videos

    Indian: 1,163 videos

    Czech: 590 videos

    Chinese: 685 videos

    Italian: 975 videos

    Filipino: 920 videos

    Bulgarian: 340 videos

    Romanian: 1,144 videos

    Arabic: 1,691 videos
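The per-language counts above can be tallied with a short sketch. The dictionary transcribes the listed figures, and the variable names are illustrative. On these figures the listed languages sum to 44,186 videos, fewer than the 47,547 stated under Distribution, so the list presumably does not account for every video.

```python
# Per-language video counts transcribed from the list above.
LANGUAGE_COUNTS = {
    "English": 23_038, "Portuguese": 1_346, "Spanish": 677, "Norwegian": 1_266,
    "Swedish": 1_056, "Korean": 848, "Polish": 1_807, "Indonesian": 1_163,
    "French": 1_102, "German": 1_276, "Japanese": 1_433, "Dutch": 1_666,
    "Indian": 1_163, "Czech": 590, "Chinese": 685, "Italian": 975,
    "Filipino": 920, "Bulgarian": 340, "Romanian": 1_144, "Arabic": 1_691,
}

# "Total Videos" figure quoted in the Distribution section.
TOTAL_VIDEOS_STATED = 47_547

listed_total = sum(LANGUAGE_COUNTS.values())
english_share = LANGUAGE_COUNTS["English"] / listed_total
unlisted = TOTAL_VIDEOS_STATED - listed_total

print(f"listed: {listed_total}, unlisted: {unlisted}, English share: {english_share:.1%}")
```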

    Who Can Use It

    List examples of intended users and their use cases:

    Data Scientists: Training machine learning models for video-based AI applications.

    Researchers: Studying human behavior, facial analysis, or video AI advancements.

    Businesses: Developing facial recognition systems, video analytics, or AI-driven media applications.

    Additional Notes

    Ensure ethical usage and compliance with privacy regulations. The dataset's quality and scale make it valuable for high-performance AI training. Preprocessing (cropping, downsampling) may be needed for different use cases. The dataset is not yet complete and expands daily; please contact me for the most up-to-date CSV file. The dataset has been divided into 100GB zipped files and is hosted on a private server (with the option to upload to the cloud if needed). To verify the dataset's quality, please contact me for the full CSV file.

  9. Analysis code and raw data for 'Automatic Etch Pit Density Analysis in...

    • data.mendeley.com
    • narcis.nl
    Updated Jul 10, 2020
    Cite
    Martin Fleck (2020). Analysis code and raw data for 'Automatic Etch Pit Density Analysis in Multicrystalline Silicon' [Dataset]. http://doi.org/10.17632/dv43z9x72t.1
    Explore at:
    Dataset updated
    Jul 10, 2020
    Authors
    Martin Fleck
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "etch_pit_density_analysis.zip" contains the analysis code. See the README.txt for more information.

    "secco_etched_mc_Si_wafer_image.png" - Optical microscope image depicting a 2.5 cm × 1.2 cm Secco-etched multicrystalline silicon wafer. Dark spots are etch pits, typically associated with dislocation lines that intersect the wafer surface. Dark lines are grain boundaries. The leftmost 20% of the wafer was in contact with the sample carrier during defect etching, which explains the uneven etch result.

  10. Students Data Analysis

    • kaggle.com
    zip
    Updated Jul 20, 2022
    Cite
    MOMONO (2022). Students Data Analysis [Dataset]. https://www.kaggle.com/datasets/erqizhou/students-data-analysis
    Explore at:
    Available download formats: zip (2174 bytes)
    Dataset updated
    Jul 20, 2022
    Authors
    MOMONO
    Description

    A small excerpt from a real dataset, with a few minor changes to protect students' private information. Permission to share has been given.

    Goals

    You are going to help teachers using only the data:

    1. Prediction: determine what makes a brilliant student who can successfully apply to graduate school, whether abroad or not.
    2. Application: give job-search advice to those who fail to get into graduate school.

    Tips

    1. Educational data may have subtle structure; hierarchies and heterogeneity are probably involved, so simple regressions are unlikely to be adequate. Also keep an eye on collinearity among the indicators, which were collected by teachers who have already forgotten their statistics.
    2. Not all students are free to choose whether to apply to graduate school, and some were born with privileges.
    3. Some students have been trying (or are planning to try) to get into graduate school for years; you are responsible for giving advice that is accurate for their circumstances.

    About the Data

    Some of the original structure has been deleted or censored. What remains:

    Basic data:

    • ID
    • class: categorical. Students were initially divided into 2 classes, and teachers suspect that students in different classes may perform significantly differently.
    • gender
    • race: categorical and censored
    • GPA: real numbers, float

    Some teachers assume that scores in mathematics courses can perfectly represent a student's likelihood of success:

    • Algebra: real numbers, Advanced Algebra
    • ......

    Some assume that students' backgrounds significantly affect their choices and likelihood of success. These fields are all censored:

    • from1: students' home locations
    • from2: a probably bad indicator of preference for mathematics
    • from3: how students applied to this university (undergraduate)
    • from4: a probably bad indicator of family background; 0 indicates more wealth, 4 more poverty

    The final indicator y:

    • 0: the application to graduate school failed; the student may apply again or look for a job in the future
    • 1: success, inland
    • 2: success, abroad
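As a starting point for the tips above, the sketch below decodes the outcome indicator `y`, tallies outcomes per class, and checks two score columns for collinearity with a hand-written Pearson correlation. The four records and the `analysis` column are hypothetical, invented only for illustration; only the `y` coding follows the description.

```python
from collections import Counter
from math import sqrt

# Coding of y as described above.
OUTCOME_LABELS = {0: "failed", 1: "success, inland", 2: "success, abroad"}

# Hypothetical records; "analysis" is an invented second math score column.
students = [
    {"class": 1, "algebra": 85.0, "analysis": 88.0, "y": 2},
    {"class": 1, "algebra": 72.0, "analysis": 70.0, "y": 1},
    {"class": 2, "algebra": 60.0, "analysis": 63.0, "y": 0},
    {"class": 2, "algebra": 91.0, "analysis": 90.0, "y": 1},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Count (class, outcome) pairs to see whether outcomes differ by class.
outcomes_by_class = Counter((s["class"], OUTCOME_LABELS[s["y"]]) for s in students)

# A coefficient near 1 suggests the two indicators are close to collinear.
score_correlation = pearson(
    [s["algebra"] for s in students],
    [s["analysis"] for s in students],
)

print(outcomes_by_class)
print(round(score_correlation, 3))
```

A pairwise correlation only catches the simplest form of collinearity; with more indicators, a variance inflation factor or a hierarchical model grouped by class would be a more thorough check.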

  11. Data for "Training data composition affects performance of protein structure...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Oct 1, 2021
    Cite
    Alexander Derry; Kristy A. Carpenter; Russ B. Altman (2021). Data for "Training data composition affects performance of protein structure analysis algorithms" by A. Derry, K. A. Carpenter, & R. B. Altman [Dataset]. http://doi.org/10.5281/zenodo.5542201
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 1, 2021
    Dataset provided by
    Zenodo http://zenodo.org/
    Authors
    Alexander Derry; Kristy A. Carpenter; Russ B. Altman
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains all data used in "Training data composition affects performance of protein structure analysis algorithms", published in the Pacific Symposium on Biocomputing 2022 by A. Derry, K. A. Carpenter, & R. B. Altman.

    The data consists of the following files:

    • ema_zenodo_data.tar.gz: train, validation, and test splits for Estimation of Model Accuracy task, in LMDB format
    • design_zenodo_data.tar.gz: train, validation, and test splits for Protein Sequence Design task, in JSON format
    • enz_cat_res_zenodo_data.tar.gz: train, validation, and test splits for Catalytic Residue and Enzyme Prediction task, in TF record format

    Details on dataset construction can be found in our paper and dataloaders can be found in our Github repo.
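Before unpacking one of the `.tar.gz` files, it can help to list what it contains. The sketch below uses Python's `tarfile` module on a small in-memory archive built for the demonstration; the member names are hypothetical, as the internal layout of the real archives is not described here. For a downloaded file, `tarfile.open("ema_zenodo_data.tar.gz", "r:gz")` would take the place of the in-memory buffer.

```python
import io
import tarfile


def list_members(fileobj):
    """Return the names of regular files inside a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


def make_demo_archive():
    """Build a tiny in-memory .tar.gz with hypothetical split files."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("train/part0.json", "val/part0.json", "test/part0.json"):
            payload = b"{}"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)  # rewind so the archive can be read back
    return buffer


members = list_members(make_demo_archive())
print(members)
```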

    Reference

    A. Derry*, K. A. Carpenter*, & R. B. Altman, "Training data composition affects performance of protein structure analysis algorithms", 2021.

    Dataset References

    Datasets used were derived from the following works:

    Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., & Moult, J. (2019). Critical assessment of methods of protein structure prediction (CASP)—Round XIII. In Proteins: Structure, Function and Bioinformatics (Vol. 87, Issue 12, pp. 1011–1020). https://doi.org/10.1002/prot.25823

    Ingraham, J., Garg, V. K., Barzilay, R., & Jaakkola, T. (2019). Generative Models for Graph-Based Protein Design. https://openreview.net/pdf?id=SJgxrLLKOE

    Furnham, N., Holliday, G. L., de Beer, T. A. P., Jacobsen, J. O. B., Pearson, W. R., & Thornton, J. M. (2014). The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Research, 42 (Database issue), D485–D489.

  12. Data from: Genomic structural differences between cattle and River Buffalo...

    • catalog.data.gov
    • agdatacommons.nal.usda.gov
    • +1more
    Updated Apr 21, 2025
    Cite
    Agricultural Research Service (2025). Data from: Genomic structural differences between cattle and River Buffalo identified through comparative genomic and transcriptomic analysis [Dataset]. https://catalog.data.gov/dataset/data-from-genomic-structural-differences-between-cattle-and-river-buffalo-identified-throu-10fbb
    Explore at:
    Dataset updated
    Apr 21, 2025
    Dataset provided by
    Agricultural Research Service
    Description

    Water buffalo (Bubalus bubalis L.) is an important livestock species worldwide. Like many other livestock species, water buffalo lacks a high-quality, continuous reference genome assembly, which is required for fine-scale comparative genomics studies. In this work, we present a dataset that characterizes genomic differences between the water buffalo genome and the extensively studied cattle (Bos taurus taurus) reference genome. This data set was obtained after alignment of 14 river buffalo whole genome sequencing datasets to the cattle reference. This data set consisted of 13,444 deletion CNV regions, and 11,050 merged mobile element insertion (MEI) events within the upstream regions of annotated cattle genes. Gene expression data from cattle and buffalo were also presented for genes impacted by these regions. This study sought to characterize differences in gene content, regulation and structure between taurine cattle and river buffalo (2n=50) (one extant type of water buffalo) using the extensively annotated UMD3.1 cattle reference genome as a basis for comparisons. Using 14 WGS datasets from river buffalo, we identified 13,444 deletion CNV regions (Supplemental Table 1) in river buffalo, but not identified in cattle. We also presented 11,050 merged mobile element insertion (MEI) events (Supplemental Table 2) in river buffalo, of which 568 are within the upstream regions of annotated cattle genes. Furthermore, our tissue transcriptomics analysis provided expression profiles of genes impacted by MEI (Supplemental Tables 3–6) and CNV (Supplemental Table 7) events identified in this study. This data provides the genomic coordinates of identified CNV-deletions and MEI events. Additionally, normalized read counts of impacted genes, along with their adjusted p-values from statistical analysis, were presented (Supplemental Tables 3–6).
    • Genomic coordinates of identified CNV-deletion and MEI events, and Ensembl gene names of impacted genes (Supplemental Tables 1 and 2)
    • Gene expression profiles and statistical significance (adjusted p-values) of genes impacted by MEI in liver (Supplemental Tables 3 and 4)
    • Gene expression profiles and statistical significance (adjusted p-values) of genes impacted by MEI in muscle (Supplemental Tables 5 and 6)
    • Gene expression profiles and statistical significance (adjusted p-values) of genes impacted by CNV deletions in river buffalo (Supplemental Table 7)

    Public assessment of this dataset will allow for further analyses and functional annotation of genes that are potentially associated with phenotypic differences between cattle and water buffalo. Raw read data of whole genome and transcriptome sequencing were deposited to NCBI Bioprojects.

    Resources in this dataset:

    Resource Title: Genomic structural differences between cattle and River Buffalo identified through comparative genomic and transcriptomic analysis.
    File Name: Web Page, url: https://www.sciencedirect.com/science/article/pii/S2352340918305183

    Data in Brief presenting a dataset which characterizes genomic differences between water buffalo genome and the extensively studied cattle (Bos taurus Taurus) reference genome. This data set is obtained after alignment of 14 river buffalo whole genome sequencing datasets to the cattle reference. This data set consisted of 13,444 deletion CNV regions, and 11,050 merged mobile element insertion (MEI) events within the upstream regions of annotated cattle genes. Gene expression data from cattle and buffalo were also presented for genes impacted by these regions. Tables are with this article.

    Raw read data of whole genome and transcriptome sequencing were deposited to NCBI Bioprojects as the following:

    • PRJNA350833 (https://www.ncbi.nlm.nih.gov/bioproject/?term=350833)
    • PRJNA277147 (https://www.ncbi.nlm.nih.gov/bioproject/?term=277147)
    • PRJEB4351 (https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB4351)

  13. Dataset for: Experiment for validation of fluid-structure interaction models...

    • wiley.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Andreas Hessenthaler; N Gaddum; Ondrej Holub; Ralph Sinkus; Oliver Röhrle; David Nordsletten (2023). Dataset for: Experiment for validation of fluid-structure interaction models and algorithms [Dataset]. http://doi.org/10.6084/m9.figshare.4141836.v1
    Available download formats: zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Andreas Hessenthaler; N Gaddum; Ondrej Holub; Ralph Sinkus; Oliver Röhrle; David Nordsletten
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This paper presents a fluid-structure interaction (FSI) experiment. The aim of the experiment is to provide a challenging yet easy-to-set-up FSI test case that addresses the need for rigorous testing of FSI algorithms and modeling frameworks. Steady-state and periodic steady-state test cases with constant and periodic inflow were established. The experiment focuses on biomedical engineering applications, with flow in the laminar regime at Reynolds numbers of 1283 and 651. Flow and solid domains were defined using CAD tools. The experimental design aimed at providing a straightforward boundary condition definition. Material parameters and the mechanical response of a moderately viscous Newtonian fluid and a nonlinear incompressible solid were determined experimentally. A comprehensive data set was acquired by using magnetic resonance imaging to record the interaction between the fluid and the solid, quantifying flow and solid motion.

  14. causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in...

    • zenodo.org
    zip
    Updated Sep 5, 2025
    Cite
    Carl Willy Mehling; Sven Pieper; Tobias Lüke (2025). causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in Machinery [Dataset]. http://doi.org/10.5281/zenodo.15876410
    Available download formats: zip
    Dataset updated
    Sep 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carl Willy Mehling; Sven Pieper; Tobias Lüke
    License

    http://www.apache.org/licenses/LICENSE-2.0

    Description

    causRCA: Real-World Dataset for Causal Discovery and Root Cause Analysis in Machinery

    causRCA is a collection of time series datasets recorded from the CNC control of an industrial vertical lathe.

    The datasets comprise real-world recordings from normal factory operation and labeled fault data from a hardware-in-the-loop simulation. The fault datasets come with labels for the underlying (simulated) cause of the failure, a labeled diagnosis, and a causal model of all variables in the datasets.

    The extensive metadata and provided ground truth causal structure enable benchmarking of methods in causal discovery, root cause analysis, anomaly detection, and fault diagnosis in general.

    Use Cases & Applications

    • Causal Discovery: Benchmark learned causal graphs against an expert-derived causal graph.
    • Supervised Root Cause Analysis: Train and test models on labeled diagnosis for different fault scenarios.
    • Unsupervised Root Cause Analysis: Identify manipulated variables in different fault scenarios with known ground truth.
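
    As a minimal sketch of the causal discovery use case, a learned graph can be scored against a reference graph by comparing directed edge sets. The function name `edge_scores` and the node names below are hypothetical and are not taken from the dataset; the metric choice (precision, recall, F1 on edges) is one common option, not a method prescribed by the authors.

```python
def edge_scores(true_edges, learned_edges):
    """Return (precision, recall, F1) of directed edges in a learned graph."""
    true_set = set(true_edges)
    learned_set = set(learned_edges)
    hits = len(true_set & learned_set)  # edges with matching direction
    precision = hits / len(learned_set) if learned_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical node names, for illustration only.
expert = [("pump_on", "pressure"), ("pressure", "flow"), ("valve_open", "flow")]
learned = [("pump_on", "pressure"), ("flow", "pressure"), ("valve_open", "flow")]

precision, recall, f1 = edge_scores(expert, learned)
```

    In this toy case the reversed edge between `flow` and `pressure` counts as a miss, so two of three edges match in each direction of the comparison.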

    Data & File Overview

    data/
     ┣ real_op/
     ┣ dig_twin/
     ┃ ┣ exp_coolant/
     ┃ ┣ exp_hydraulics/
     ┃ ┗ exp_probe/
     ┣ expert_graph/
     ┗ README_DATASET.md

    The data folder contains:

    • real_op/: CSV files with time series data from normal operation.
    • dig_twin/: Data from the digital twin experiments. Each group (coolant, hydraulics, probe) contains a causal subgraph as ground truth, several fault scenarios, and multiple runs per scenario:
      • exp_coolant/: Coolant system faults
      • exp_hydraulics/: Hydraulic system faults
      • exp_probe/: Probe system faults
    • expert_graph/: GML and interactive HTML file with the expert-derived causal graph and lists of nodes and edges.
    • README_DATASET.md: Dataset description
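
    The folder layout above can be walked with a short helper. This is a sketch under assumptions: the listing only states that real_op/ holds CSV files, so the use of CSV inside the dig_twin/ groups and the nesting depth below each exp_* folder are guesses, and `group_csv_files` is a hypothetical name.

```python
from collections import defaultdict
from pathlib import Path


def group_csv_files(data_dir):
    """Map each top-level group (e.g. 'real_op', 'dig_twin/exp_coolant') to its CSV files."""
    data_dir = Path(data_dir)
    groups = defaultdict(list)
    for path in sorted(data_dir.rglob("*.csv")):
        rel = path.relative_to(data_dir)
        if len(rel.parts) < 2:
            continue  # skip files that sit directly under data/
        top = rel.parts[0]
        if top == "dig_twin" and len(rel.parts) > 2:
            # Assumption: one sub-folder per experiment group under dig_twin/.
            key = f"{top}/{rel.parts[1]}"
        else:
            key = top
        groups[key].append(path)
    return dict(groups)
```

    Calling `group_csv_files("data")` on an extracted copy would return one list of paths per group, which can then be read one file at a time.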

    Datasets summary

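    (Sub-)graph        | #Nodes | #Edges | #Datasets normal | #Datasets Fault | #Fault Scenarios | #Different Diagnoses | #Causing Variables
    Lathe (Full graph) | 92     | 104    | 170              | 100             | 19               | 10                   | 14
    --Probe            | 11     | 15     | 170              | 34              | 6                | 3                    | 2
    --Hydraulics       | 17     | 18     | 170              | 41              | 9                | 5                    | 6
    --Coolant          | 15     | 10     | 170              | 25              | 4                | 2                    | 6
    --(Other Vars)     | 49     | 61     | 170              | -               | -                | -                    | -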

    *datasets from normal operation contain all machine variables and therefore all subgraphs and their respective variables within it.

    Methodological Information

    Real Operation Data (real_op)

    Data were recorded through an OPC UA interface during normal production cycles on a vertical lathe. These files capture baseline machine behavior under standard operating conditions, without induced or known faults.

    Digital Twin Data (dig_twin)

    A hardware-in-the-loop digital twin was developed by connecting the original machine controller to a real-time simulation. Faults (e.g., valve leaks, filter clogs) were injected by manipulating specific twin variables, providing known ground-truth causes. Data were recorded via the same OPC UA interface to ensure consistent structure.

    Known limitations

    Data were sampled via an OPC UA interface. The timestamps reflect only the time at which the CNC published a value change, not necessarily the exact time at which the value changed.

    Consequently, the chronological order of changes across different variables is not strictly guaranteed. This may affect time-series analyses that are highly sensitive to precise temporal ordering.
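
    One way to reduce sensitivity to this limitation is to aggregate change events onto a regular time grid, so that the analysis does not rely on ordering finer than the grid step. The sketch below assumes a long-format table with hypothetical columns `timestamp`, `variable`, and `value`; the actual file schema is described in README_DATASET.md and may differ, and the one-second step is arbitrary.

```python
import pandas as pd


def to_regular_grid(events, freq="1s"):
    """Pivot change events to one column per variable on a regular, forward-filled grid."""
    wide = events.pivot_table(
        index="timestamp", columns="variable", values="value", aggfunc="last"
    )
    # Keep the last published value in each bin, then carry it forward.
    return wide.resample(freq).last().ffill()


# Hypothetical change events for two variables.
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2025-01-01 00:00:00.2", "2025-01-01 00:00:00.7", "2025-01-01 00:00:02.1"]
        ),
        "variable": ["A", "B", "A"],
        "value": [1.0, 5.0, 2.0],
    }
)
grid = to_regular_grid(events)
```

    Within one grid step the relative order of the changes to `A` and `B` is discarded, which is the intent here; a coarser `freq` trades temporal resolution for robustness.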

    Methods for Processing

    Acknowledgements

    The authors gratefully acknowledge the contributions of:

    • KAMAX Holding GmbH & Co. KG for providing real production data from the vertical lathe.
    • Schuster Maschinenbau GmbH for supporting the digital twin development with knowledge and the PLC project.
    • ISG Industrielle Steuerungstechnik GmbH for developing the digital twin implementation.
    • SEITEC GmbH for hosting the hardware-in-the-loop setup and developing the OPC UA data recording solution.

    Declaration of GenAI and AI-assisted Technologies

    During the preparation of the dataset, the author(s) used generative AI tools to enhance the dataset's applicability by structuring data in an accessible format with extensive metadata, assist in coding transformations, and draft description content. All AI-generated output was reviewed and edited under human oversight, and no original dataset content was created by AI.

  15. Exploratory data analysis of a clinical study group: Development of a...

    • plos.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański (2023). Exploratory data analysis of a clinical study group: Development of a procedure for exploring multidimensional data [Dataset]. http://doi.org/10.1371/journal.pone.0201950
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Bogumil M. Konopka; Felicja Lwow; Magdalena Owczarz; Łukasz Łaczmański
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Thorough knowledge of the structure of the analyzed data makes it possible to form detailed scientific hypotheses and research questions. The structure of data can be revealed with methods for exploratory data analysis. Given the multitude of available methods, selecting those that work well together and facilitate data interpretation is not an easy task. In this work we present a well-fitted set of tools for a complete exploratory analysis of a clinical dataset and perform a case study analysis on a set of 515 patients. The proposed procedure comprises several steps: 1) robust data normalization, 2) outlier detection with Mahalanobis (MD) and robust Mahalanobis distances (rMD), 3) hierarchical clustering with Ward’s algorithm, 4) Principal Component Analysis with biplot vectors. The analyzed set comprised elderly patients who participated in the PolSenior project. Each patient was characterized by over 40 biochemical and socio-geographical attributes. Introductory analysis showed that the case-study dataset comprises two clusters separated along the axis of sex hormone attributes. Further analysis was carried out separately for male and female patients. The optimal partitioning of the male set resulted in five subgroups. Two of them were related to diseased patients: 1) diabetes and 2) hypogonadism patients. Analysis of the female set suggested that it was more homogeneous than the male set. No evidence of pathological patient subgroups was found. In the study we showed that outlier detection with MD and rMD not only identifies outliers but can also be used to assess the heterogeneity of a dataset. The case study showed that our procedure is well suited for identification and visualization of biologically meaningful patient subgroups.
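
    Steps 1 and 2 of the procedure can be illustrated with a small NumPy sketch. It is not the authors' implementation: scaling by the median and the median absolute deviation (MAD) is an assumed choice of robust normalization, only the classical Mahalanobis distance is shown (a robust variant would need a robust covariance estimate), and the data are synthetic with one planted outlier.

```python
import numpy as np


def robust_scale(X):
    """Center each column on its median and scale by the MAD (assumed normalization)."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / (1.4826 * mad)  # 1.4826 makes the MAD comparable to a std. dev.


def mahalanobis_distances(X):
    """Classical Mahalanobis distance of every row from the sample mean."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))


# Synthetic data: 200 samples, 3 attributes, one planted outlier in row 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [8.0, -8.0, 8.0]

Z = robust_scale(X)
distances = mahalanobis_distances(Z)
outlier_index = int(np.argmax(distances))
```

    Rows with unusually large distances are outlier candidates; the spread of the distances also gives a rough impression of how heterogeneous the sample is.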

  16. 3D quantification of vascular-like structures in z-stack confocal images:...

    • data.mendeley.com
    Updated Oct 28, 2020
    Cite
    Laura Bray (2020). 3D quantification of vascular-like structures in z-stack confocal images: Supplementary Material [Dataset]. http://doi.org/10.17632/btrrwrmt7z.1
    Dataset updated
    Oct 28, 2020
    Authors
    Laura Bray
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is an example dataset provided as part of the Supplementary Material for our manuscript "3D quantification of vascular-like structures in z-stack confocal images" in STAR Protocols. The dataset provides an example raw confocal image stack, shows the data visualisation at major steps of the protocol, and includes the output received from WinFiber3D.

  17. Dataset for targeted and non-targeted analysis of firefighter breath samples...

    • catalog.data.gov
    • datasets.ai
    Updated Nov 12, 2020
    Cite
    U.S. EPA Office of Research and Development (ORD) (2020). Dataset for targeted and non-targeted analysis of firefighter breath samples [Dataset]. https://catalog.data.gov/dataset/dataset-for-targeted-and-non-targeted-analysis-of-firefighter-breath-samples
    Dataset updated
    Nov 12, 2020
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    This dataset includes a list of chemicals used to create the ChromGenius retention time prediction model used for validation of non-targeted compounds. The list of identified non-targeted compounds in the samples is also provided. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: By viewing the analyzed spreadsheets attached to the Journal Article. Format: The original dataset contains identification information for the firefighters who participated in the controlled structure burns. The analyzed data can be made publicly available. This dataset is associated with the following publication: Wallace, A., J. Pleil, K. Oliver, D. Whitaker, S. Mentese, K. Fent, and G. Horn. Non-targeted GC/MS analysis of exhaled breath samples: Exploring human biomarkers of exogenous exposure and endogenous response from professional firefighting activity. JOURNAL OF TOXICOLOGY AND ENVIRONMENTAL HEALTH - PART A: CURRENT ISSUES. Taylor & Francis, Inc., Philadelphia, PA, USA, 82(4): 244-260, (2019).

  18. Analysis of failure mechanisms of additively manufactured graded lattice...

    • data.uni-hannover.de
    • service.tib.eu
    7z, jpeg, xlsx, zip
    Updated Dec 12, 2024
    Cite
    Institut für Produktentwicklung und Gerätebau (2024). Analysis of failure mechanisms of additively manufactured graded lattice structures [Dataset]. https://data.uni-hannover.de/dataset/analysis-of-failure-mechanisms-of-additively-manufactured-graded-lattice-structures
    Available download formats: 7z, jpeg, zip, xlsx
    Dataset updated
    Dec 12, 2024
    Dataset authored and provided by
    Institut für Produktentwicklung und Gerätebau
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This dataset was created as part of a student project and contains microCT data from additively manufactured tensile specimens with different lattice structures, images of the failure mechanisms, and other experimental data.

    Information on file structuring:

    The specimen designations can be found in the Excel table SampleDesignations.xlsx. These designations identify both the measured values from the tensile tests and the microCT recordings of the specimens. Photos and videos of the failure mechanisms are also provided, in part unstructured. The material parameters are given in MaterialProperties, and an overview of the sample's design space is given in SampleGeometry+LatticeDesignSpace.


  19. Drone-based Infrastructure Assessment

    • india-data.org
    mp4 data
    Updated Jan 2, 2025
    Cite
    IIIT Hyderabad, IHUB (2025). Drone-based Infrastructure Assessment [Dataset]. https://india-data.org/googleSEO-list-dataset-search
    Available download formats: mp4 data
    Dataset updated
    Jan 2, 2025
    Dataset authored and provided by
    IIIT Hyderabad, IHUB
    License

    https://india-data.org/terms-conditions

    Area covered
    India
    Description

    This dataset contains drone-captured video footage (.mp4 format) for automated building infrastructure assessment tasks. The dataset is organized into seven modules:

    1. Window Detection: Identifying and segmenting windows on building facades.
    2. Storey Count: Estimating and counting the number of floors (stories) in buildings.
    3. Roof Area Estimation: Calculating the total area of building roofs from drone footage.
    4. Roof Layout and Occupancy Estimation: Analyzing roof layouts and occupancy patterns.
    5. Distance Between Adjacent Buildings: Measuring the spatial distance between neighboring buildings.
    6. Crack Detection: Detecting and localizing cracks or structural damage on building surfaces.
    7. Building Tilt/Slope Estimation: Estimating the tilt or slope of buildings for structural analysis.

  20. Data for Ply-orientation measurements in composites using structure-tensor...

    • data.bris.ac.uk
    Updated Oct 17, 2017
    Cite
    (2017). Data for Ply-orientation measurements in composites using structure-tensor analysis [Dataset]. https://data.bris.ac.uk/data/dataset/3v25hapbaw48j25r3zm237f3ug
    Dataset updated
    Oct 17, 2017
    Description

    The NDT for High value manufacturing of Composites project is an EPSRC fellowship in Manufacturing aimed at developing new 3D non-destructive characterisation algorithms for ultrasonic data inversion. These will map 3D fibre-tow orientation and porosity and will offer the ability to create Finite Element Analysis models of the actual as-manufactured structure to determine strength and performance.

Cite
willian oliveira (2024). Soccer Universe [Dataset]. https://www.kaggle.com/datasets/willianoliveiragibin/soccer-universe

Soccer Universe

Football Dataset: Insights, Structure, and Analysis Guide

Available download formats: zip (21133975 bytes)
Dataset updated
Jan 18, 2024
Authors
willian oliveira
License

https://creativecommons.org/publicdomain/zero/1.0/

Description

This comprehensive football dataset, derived primarily from Transfermarkt, serves as a valuable resource for football enthusiasts, offering structured information on competitions, clubs, and players. With over 60,000 games across major global competitions, the dataset delves into the performance metrics of 400+ clubs and detailed statistics for more than 30,000 players.

The data are structured in CSV files, each with unique IDs, so users can join the files to perform in-depth analyses. The dataset encompasses market values, historical valuations, and detailed player statistics, including physical attributes, contract statuses, and individual performances. A specialized Python-based web scraper provides regular updates, and the data are processed through Python scripts and SQL databases.

To use the dataset effectively, users are encouraged to understand the relevant files, join datasets using unique IDs, and leverage compatible software tools like Python's pandas or R's ggplot2 for analysis. The guide emphasizes the potential for fantasy football predictions, tracking player value over time, assessing market value versus performance, and exploring the impact of cards on match outcomes.
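
A minimal sketch of joining tables on unique IDs with pandas, here to track player value over time. The table and column names (`players`, `clubs`, `valuations`, `player_id`, `club_id`, `market_value`) and all values are hypothetical stand-ins; the actual CSV files and their schemas should be checked before adapting this.

```python
import pandas as pd

# Hypothetical stand-ins for CSV files that would be loaded with pd.read_csv.
players = pd.DataFrame(
    {"player_id": [1, 2], "name": ["Player A", "Player B"], "club_id": [10, 20]}
)
clubs = pd.DataFrame({"club_id": [10, 20], "club_name": ["Club X", "Club Y"]})
valuations = pd.DataFrame(
    {
        "player_id": [1, 1, 2],
        "date": pd.to_datetime(["2023-01-01", "2024-01-01", "2024-01-01"]),
        "market_value": [1_000_000, 1_500_000, 800_000],
    }
)

# Attach club details to each player; validate guards against duplicate club IDs.
player_clubs = players.merge(clubs, on="club_id", how="left", validate="many_to_one")

# Attach player and club details to every valuation record, ordered in time.
history = valuations.merge(
    player_clubs, on="player_id", how="left", validate="many_to_one"
).sort_values(["player_id", "date"])

# Most recent valuation per player.
latest = history.groupby("player_id").tail(1)
```

The `validate` argument makes the merge fail loudly if an ID that is expected to be unique turns out not to be, which is a cheap check when joining scraped data.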

Research ideas include player performance analysis for fantasy football or recruitment purposes, studying market value trends for economic insights, evaluating club performance for strategic decision-making, developing predictive models for match outcomes, and conducting social network analysis to understand interactions among clubs and players.

Because the dataset's license is unknown, users are encouraged to credit the original authors, particularly David Cereijo, when using it in research. The dataset's dedication to accessibility is evident in the active discussions on GitHub about improvements and bug fixes.

In conclusion, this football dataset offers a wealth of information, empowering users to explore diverse analyses and research ideas, bridging the gap between structured data and the dynamic world of football.
