60 datasets found
  1. Linear Transformation

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    csv
    Updated Sep 23, 2020
    Cite
    Gao Ruohan; Gao Ruohan (2020). Linear Transformation [Dataset]. http://doi.org/10.6084/m9.figshare.12992972.v1
    Available download formats: csv
    Dataset updated
    Sep 23, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Gao Ruohan; Gao Ruohan
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a CSV file resulting from the linear transformation y = 3*x + 6 applied to 1000 randomly generated numbers between 0 and 100. It was generated by applying the linear transformation to 1000 data points produced by Python's random.randint() function.
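
    A minimal sketch of how such a file could be generated (the column names are assumptions for illustration, not taken from the dataset):

    import csv
    import random
    
    # 1000 random integers in [0, 100], each passed through y = 3*x + 6.
    xs = [random.randint(0, 100) for _ in range(1000)]
    rows = [(x, 3 * x + 6) for x in xs]
    
    # Write the (x, y) pairs to a CSV file; the header names are assumed.
    with open('linear_transformation.csv', 'w', newline='') as f:
      writer = csv.writer(f)
      writer.writerow(['x', 'y'])
      writer.writerows(rows)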

  2. Train Jason File

    • kaggle.com
    zip
    Updated Jul 26, 2024
    Cite
    Raja Ahmed Ali Khan (2024). Train Jason File [Dataset]. https://www.kaggle.com/datasets/datascientist97/train-jason-file
    Available download formats: zip (1,888,079 bytes)
    Dataset updated
    Jul 26, 2024
    Authors
    Raja Ahmed Ali Khan
    Description

    JSON Files

    This dataset comprises a collection of JSON files designed for use in various Python projects. Each JSON file contains structured data, making it ideal for tasks such as data analysis, machine learning, and application development. The data within these files can be easily manipulated using Python's extensive libraries, such as json, pandas, and numpy.

    Whether you are training a machine learning model, developing an API, or working on data transformation tasks, this dataset provides the flexibility and structure needed to work effectively with JSON data in Python.
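
    A minimal sketch of loading one of the JSON files with Python's standard library and pandas (the file name is a placeholder):

    import json
    import pandas as pd
    
    # Placeholder file name; substitute any JSON file from the archive.
    with open('train.json', 'r', encoding='utf-8') as f:
      records = json.load(f)
    
    # If the file holds a list of records, it can be flattened into a DataFrame.
    df = pd.json_normalize(records)
    print(df.head())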

  3. Soil images in DICOM format including Python programs for data...

    • search.dataone.org
    • datadryad.org
    Updated Apr 24, 2025
    Cite
    Ralf Wieland (2025). Soil images in DICOM format including Python programs for data transformation, 3D analysis, CNN traininig, CNN analysis [Dataset]. http://doi.org/10.5061/dryad.66t1g1k0c
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Ralf Wieland
    Time period covered
    Jan 1, 2020
    Description

    The study 'Use of Deep Learning for structural analysis of CT-images of soil samples' used a set of soil sample data (CT images). All the data and programs used here are open source and were created with the help of open-source software. All processing steps are carried out by Python programs which are included in the data set.
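
    A minimal sketch of reading one of the DICOM images with the pydicom library (the file name is a placeholder; the Python programs included in the dataset cover the full workflow):

    import pydicom
    
    # Placeholder file name; substitute any DICOM file from the dataset.
    ds = pydicom.dcmread('soil_sample_slice.dcm')
    
    # The pixel data of the slice as a NumPy array, ready for 3D analysis or CNN input.
    slice_array = ds.pixel_array
    print(slice_array.shape, slice_array.dtype)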

  4. A dataset for conduction heat transfer and deep learning

    • data.mendeley.com
    Updated Jun 25, 2020
    + more versions
    Cite
    Mohammad Edalatifar (2020). A dataset for conduction heat transer and deep learning [Dataset]. http://doi.org/10.17632/rw9yk3c559.1
    Dataset updated
    Jun 25, 2020
    Authors
    Mohammad Edalatifar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Big data images for conduction heat transfer. The related paper has been published here: M. Edalatifar, M.B. Tavakoli, M. Ghalambaz, F. Setoudeh, Using deep learning to learn physics of conduction heat transfer, Journal of Thermal Analysis and Calorimetry; 2020. https://doi.org/10.1007/s10973-020-09875-6

    Steps to reproduce: The dataset is saved in two formats, .npz for Python and .mat for MATLAB. The .mat file is large, so it is compressed with WinZip. ReadDataset_Python.py and ReadDataset_Matlab.m are examples of reading the data using Python and MATLAB, respectively. To use the dataset in MATLAB, download Dataset/HeatTransferPhenomena_35_58.zip, unzip it, and then use ReadDataset_Matlab.m as an example. In the case of Python, download Dataset/HeatTransferPhenomena_35_58.npz and run ReadDataset_Python.py.
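
    A minimal sketch of inspecting the .npz archive in Python (the array names inside the archive are not documented here, so they are listed rather than assumed):

    import numpy as np
    
    # Load the compressed NumPy archive distributed with the dataset.
    data = np.load('HeatTransferPhenomena_35_58.npz')
    
    # List the stored array names before indexing into the archive.
    print(data.files)
    for name in data.files:
      print(name, data[name].shape, data[name].dtype)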

  5. Avokado gelişim

    • kaggle.com
    zip
    Updated May 22, 2025
    Cite
    ABDÜLKADİR UYĞUR (2025). Avokado gelişim [Dataset]. https://www.kaggle.com/datasets/abdlkadruyur/avokado-geliim
    Available download formats: zip (5,439 bytes)
    Dataset updated
    May 22, 2025
    Authors
    ABDÜLKADİR UYĞUR
    Description

    This dataset has been created for educational purposes, specifically to help learners practice SQL-like operations using Python’s pandas library. It is ideal for beginners who want to improve their data manipulation, querying, and transformation skills in a notebook environment such as Kaggle.

    The dataset simulates a simple personnel and department system. It includes two tables:

    • personel: Contains employee data such as names, ages, salaries, and department IDs.
    • departman: Contains department IDs and corresponding department names.

    Throughout this project, key SQL operations have been demonstrated with their pandas equivalents. These include:

    • Basic commands: SELECT, INSERT, UPDATE, DELETE
    • Table structure operations: ALTER, DROP, TRUNCATE, COPY
    • Filtering and logical expressions: WHERE, AND, OR, IN, IS NULL, BETWEEN, LIKE
    • Aggregations and sorting: COUNT(), ORDER BY, LIMIT, DISTINCT
    • String functions: LOWER, TRIM, REPLACE, SPLIT, LENGTH
    • Joins: INNER JOIN, LEFT JOIN
    • Comparison operators: =, !=, <, >

    The goal is to provide a hands-on, interactive environment for practicing SQL logic using real Python code (see the sketch below). This dataset does not represent real individuals or businesses; it is entirely fictional and meant for training, teaching, and experimentation purposes only.
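
    A minimal sketch of the kind of pandas equivalents the dataset targets (the column names below are assumptions for illustration, not taken from the files):

    import pandas as pd
    
    # Assumed column names for the two tables described above.
    personel = pd.read_csv('personel.csv')  # e.g. name, age, salary, dept_id
    departman = pd.read_csv('departman.csv') # e.g. dept_id, dept_name
    
    # SELECT ... WHERE salary > 10000 AND age < 40
    filtered = personel[(personel['salary'] > 10000) & (personel['age'] < 40)]
    
    # INNER JOIN on the department id, then ORDER BY salary DESC LIMIT 5
    joined = filtered.merge(departman, on='dept_id', how='inner')
    top5 = joined.sort_values('salary', ascending=False).head(5)
    
    # COUNT() ... GROUP BY dept_name
    counts = joined.groupby('dept_name')['name'].count()
    print(top5, counts, sep='\n')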

  6. Wrist-mounted IMU data towards the investigation of free-living human eating...

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    + more versions
    Cite
    Kyritsis, Konstantinos; Diou, Christos; Delopoulos, Anastasios (2022). Wrist-mounted IMU data towards the investigation of free-living human eating behavior - the Free-living Food Intake Cycle (FreeFIC) dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4420038
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Harokopio University of Athens
    Aristotle University of Thessaloniki
    Authors
    Kyritsis, Konstantinos; Diou, Christos; Delopoulos, Anastasios
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects' meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (6 with the Huawei Watch 2™ and the MobVoi TicWatch™ for the rest) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.

    Description

    FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and to continue wearing it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participants, who documented the start and end moments of their meals to the best of their abilities, as well as the hand on which they wore the smartwatch. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional data statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.

    The script(s) and the pickle file must be located in the same directory.

    Tested with Python 3.6.4

    Requirements: Numpy, Pickle and Matplotlib

    Calculate and echo dataset statistics

    $ python stats_dataset.py

    Visualize signals and ground truth

    $ python viz_dataset.py

    FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.

    Publications

    If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:

    @article{kyritsis2020data,
    title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
    author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
    journal={IEEE Journal of Biomedical and Health Informatics}, year={2020},
    publisher={IEEE}}

    @inproceedings{kyritsis2017automated,
    title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch},
    author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
    booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
    year={2019}, organization={IEEE}}

    Technical details

    We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:

    import pickle as pkl
    import numpy as np
    
    with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
      dataset = pkl.load(fh)

    The dataset variable in the snippet above is a dictionary with 5 keys, namely:

    'subject_id'

    'session_id'

    'signals_raw'

    'signals_proc'

    'meal_gt'

    The contents under a specific key can be obtained by:

    sub = dataset['subject_id']   # for the subject id
    ses = dataset['session_id']   # for the session id
    raw = dataset['signals_raw']  # for the raw IMU signals
    proc = dataset['signals_proc'] # for the processed IMU signals
    gt = dataset['meal_gt']     # for the meal ground truth

    The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.

    sub: list. Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject, which can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC's subject with id equal to 2 is the same person as FreeFIC's subject with id equal to 2.

    ses: list. Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session, which can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.

    raw: list. Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the FIC dataset (more info here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays differ (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android platform.

    proc: list. Each element of this list is an M × 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each meal. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike elements in the raw list, processed measurements (in the proc list) have a constant sampling rate of 100 Hz, and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is on par with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The potential researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth them and remove the gravitational component).

    meal_gt: list. Each element of this list is a K × 2 matrix. Each row represents a meal interval for the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., there is a recording in which the participant consumed two meals).
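
    A minimal sketch of combining the aligned lists above, e.g. extracting the processed IMU samples that fall within the first annotated meal of a recording (this continues the loading snippet above and assumes the arrays are numpy.ndarrays as described):

    idx = 0            # pick one of the 22 recordings
    signals = proc[idx]    # M x 7 array: timestamp, 3x accelerometer, 3x gyroscope
    start, end = gt[idx][0]  # first meal interval of that recording, in seconds
    
    # A boolean mask over the timestamp column selects the in-meal samples.
    in_meal = (signals[:, 0] >= start) & (signals[:, 0] <= end)
    meal_signals = signals[in_meal]
    print(meal_signals.shape)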

    Ethics and funding

    Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.

    Contact

    Any inquiries regarding the FreeFIC dataset should be addressed to:

    Dr. Konstantinos KYRITSIS

    Multimedia Understanding Group (MUG)
    Department of Electrical & Computer Engineering
    Aristotle University of Thessaloniki
    University Campus, Building C, 3rd floor
    Thessaloniki, Greece, GR54124

    Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr

  7. DATS 6401 - Final Project - Yon ho Cheong.zip

    • figshare.com
    zip
    Updated Dec 15, 2018
    Cite
    Yon ho Cheong (2018). DATS 6401 - Final Project - Yon ho Cheong.zip [Dataset]. http://doi.org/10.6084/m9.figshare.7471007.v1
    Available download formats: zip
    Dataset updated
    Dec 15, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Yon ho Cheong
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied for by international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, and foreign applicant and employer trends for H1B visa applications. Because locations, employers, job titles and salary ranges make up most of the H1B petitions, different visualization tools are used to analyze and interpret the data in relation to H1B visa trends and to provide a recommendation to the applicant. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; some examples in this project analyze the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.

    Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js

    Dataset

    The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns in the dataset include case status, employer name, SOC name, job title, full-time position, prevailing wage, year, worksite, and latitude and longitude information.

    Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
    Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm

    Running the code

    Open Index.html

    Data Processing

    • Perform data preprocessing to transform the raw data into an understandable format.
    • Find and combine other external datasets to enrich the analysis, such as the FY2017 dataset.
    • Develop and compile the variables into the visualization programs to produce appropriate visualizations.
    • Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
    • Extract key aspects and analyze the changes in employers' preferences as well as forecasts for future trends.

    Visualizations

    • Combo chart: shows the overall volume of receipts and the approval rate.
    • Scatter plot: shows the beneficiary country of birth.
    • Geo map: shows all states of H1B petitions filed.
    • Line chart: shows the top 10 states of H1B petitions filed.
    • Pie chart: shows a comparison of education level and occupations for petitions, FY2011 vs FY2017.
    • Tree map: shows the overall top employers who submit the greatest number of applications.
    • Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
    • Highlight table: shows the mean wage of a Data Scientist and Data Analyst with case status certified.
    • Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.

    Related Research

    • The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
    • Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
    • Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
    • H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
    • H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa

    Key Findings

    • From the analysis, the government is cutting down the number of approvals for H1B in 2017.
    • In the past decade, due to the nature of demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
    • Technical jobs fill up the majority of the top 10 jobs among foreign workers, such as Computer Systems Analyst and Software Developer.
    • Employers located in metro areas strive to find a foreign workforce that can fill the technical positions in their organizations.
    • States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are prime locations for foreign workers and provide many job opportunities.
    • Top companies such as Infosys, Tata, and IBM India, which submit the most H1B visa applications, are companies based in India associated with software and IT services.
    • The Data Scientist position has experienced exponential growth in terms of H1B visa applications, and these jobs are clustered in the West region with the highest numbers.

    Visualization programs

    HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau

  8. Classicmodels

    • kaggle.com
    zip
    Updated Dec 15, 2024
    Cite
    Javier Landaeta (2024). Classicmodels [Dataset]. https://www.kaggle.com/datasets/javierlandaeta/classicmodels
    Available download formats: zip (65,751 bytes)
    Dataset updated
    Dec 15, 2024
    Authors
    Javier Landaeta
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Abstract

    This project presents a comprehensive analysis of a company's annual sales, using the classic classicmodels dataset as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.

    The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.

    Methodology

    1. Data Extraction:

    • A connection is established with the PostgreSQL database to extract the relevant data from the orders, orderdetails, customers, products and employees tables.
    • A reusable function is created to read each table and load it into a Pandas DataFrame (a sketch follows below).
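
    A minimal sketch of such a reusable extraction function (the connection string is a placeholder; the table names follow the description above):

    import pandas as pd
    from sqlalchemy import create_engine
    
    # Placeholder credentials; replace with the actual PostgreSQL connection string.
    engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/classicmodels')
    
    def read_table(table_name):
      """Load one table from the classicmodels database into a Pandas DataFrame."""
      return pd.read_sql_table(table_name, con=engine)
    
    tables = ['orders', 'orderdetails', 'customers', 'products', 'employees']
    frames = {name: read_table(name) for name in tables}
    print({name: df.shape for name, df in frames.items()})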

    2. Data Cleansing and Transformation:

    • An exploratory analysis of the data is performed to identify missing values, inconsistencies, and outliers.
    • New variables are calculated, such as the total value of each sale, cost, and profit.
    • Different DataFrames are joined using primary and foreign keys to obtain a complete view of sales.

    3. Exploratory Data Analysis (EDA):

    • Key metrics such as total sales, number of unique customers, and average order value are calculated.
    • Data is grouped by different dimensions (products, customers, dates) to identify patterns and trends.
    • Results are visualized using relevant graphics (histograms, bar charts, etc.).

    4. Modeling and Prediction:

    • Although the main focus of the project is descriptive, predictive modeling techniques (e.g., time series) could be explored to forecast future sales.

    5. Report Generation:

    • Detailed reports are created in Pandas DataFrames format that answer specific business questions.
    • These reports are stored in new PostgreSQL tables for further analysis and visualization.

    Results

    • Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
    • Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
    • Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.

    Conclusions

    This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.

    Technologies Used

    • Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
    • Database: PostgreSQL
    • Tools: Jupyter Notebook
    • Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence

  9. Python scripts with instructions for the extraction and transformation of...

    • plos.figshare.com
    zip
    Updated May 7, 2025
    Cite
    Timur Olzhabaev; Lukas Müller; Daniel Krause; Dominik Schwudke; Andrew Ernest Torda (2025). Python scripts with instructions for the extraction and transformation of original datasets; Transformed datasets; Dataset FA/ LCB constraints. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012892.s006
    Available download formats: zip
    Dataset updated
    May 7, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Timur Olzhabaev; Lukas Müller; Daniel Krause; Dominik Schwudke; Andrew Ernest Torda
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Python scripts with instructions for the extraction and transformation of original datasets; Transformed datasets; Dataset FA/ LCB constraints.

  10. Photocatalysis Ontology - Dataset and RO-Crates packages

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 26, 2022
    Cite
    Oier Beaskoetxea Aldazabal (2022). Photocatalysis Ontology - Dataset and RO-Crates packages [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7097811
    Dataset updated
    Sep 26, 2022
    Authors
    Oier Beaskoetxea Aldazabal
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This package contains the datasets extracted from the Artleafs database, as well as the RO-Crate packages and the RDF dataset generated from them for the project to create RO-Crates using the PHCAT ontology. You can also find the Python scripts used to transform the extracted CSV data into a new RDF dataset, allowing you to create more RO-Crate packages if desired.

    • ./data: contains the set of data extracted from the database in CSV format.

    • ./resources: contains the generated RO-Crate packages as well as the mapping files used and the RDF subsets of each article.

    • ./OutputPhotocatalysisMapping.ttl: is the file in turtle format in charge of storing the global RDF data set after the translation of the database data.

    The rest of the folders and files contain mapping rules and scripts used in the data transformation process. For more information check the following GitHub repository: https://github.com/oeg-upm/photocatalysis-ontology.
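
    A minimal, illustrative sketch of a CSV-to-RDF step of this kind using rdflib (the namespace, file name and column names are hypothetical; the actual project relies on the mapping rules referenced above):

    import pandas as pd
    from rdflib import Graph, Literal, Namespace, RDF
    
    # Hypothetical namespace and CSV layout, for illustration only.
    PHCAT = Namespace('https://example.org/phcat#')
    df = pd.read_csv('data/articles.csv')  # assumed columns: article_id, title
    
    g = Graph()
    g.bind('phcat', PHCAT)
    for _, row in df.iterrows():
      article = PHCAT['article/' + str(row['article_id'])]
      g.add((article, RDF.type, PHCAT.Article))
      g.add((article, PHCAT.title, Literal(row['title'])))
    
    g.serialize(destination='OutputPhotocatalysisMapping.ttl', format='turtle')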

  11. Data from: Grammar transformations of topographic feature type annotations...

    • catalog.data.gov
    • data.usgs.gov
    • +2more
    Updated Oct 29, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Grammar transformations of topographic feature type annotations of the U.S. to structured graph data. [Dataset]. https://catalog.data.gov/dataset/grammar-transformations-of-topographic-feature-type-annotations-of-the-u-s-to-structured-g
    Dataset updated
    Oct 29, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States
    Description

    These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. Objectives of our study were to analyze the semantic structure of input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to a knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries were proven to effectively convey graph triples which displayed semantic significance. These data represent and characterize the lexicon of our input text which are used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before data was processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was utilized only to remove punctuation and symbols from the text, excluding hyphenated words (ex. bowl-shaped) which remained as such. The tokens’ lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.
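
    A minimal sketch of the spaCy processing described above (the input sentence is illustrative; the pipeline name refers to spaCy's pre-trained English transformer model):

    import spacy
    
    # spaCy's pre-trained English transformer pipeline.
    nlp = spacy.load('en_core_web_trf')
    doc = nlp('A cirque is a bowl-shaped depression carved by a glacier.')
    
    # Characterize each token by part-of-speech, tag, dependency relation, head and lemma,
    # dropping only punctuation and symbols (hyphenated words are kept intact).
    for token in doc:
      if token.is_punct or token.pos_ == 'SYM':
        continue
      print(token.text, token.pos_, token.tag_, token.dep_, token.head.text, token.lemma_)
    
    # Noun chunks are tokenized from the same definitions.
    print([chunk.text for chunk in doc.noun_chunks])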

  12. Yabadaba: Yay, a base database

    • data.nist.gov
    Updated Jul 7, 2025
    Cite
    National Institute of Standards and Technology (2025). Yabadaba: Yay, a base database [Dataset]. http://doi.org/10.18434/mds2-3989
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    License

    https://www.nist.gov/open/license

    Description

    The yabadaba Python package provides an abstraction of database and data interactions. This makes it possible to interact with multiple different database infrastructures, and to manage the data interpretation and transformations of multiple different data schemas. The end goal is to provide a tool that makes it easy for data generators and maintainers to build user-friendly APIs that use common query methods and can be added in a modular fashion.

  13. DataScience for Work - Human Resources

    • kaggle.com
    zip
    Updated Apr 28, 2024
    Cite
    Beytullah Soylev (2024). DataScience for Work - Human Resources [Dataset]. https://www.kaggle.com/datasets/soylevbeytullah/ds4work-human-resources
    Available download formats: zip (51,278 bytes)
    Dataset updated
    Apr 28, 2024
    Authors
    Beytullah Soylev
    Description

    Case Study: Improving Human Resources with Data Science

    Objective: Utilize data science to predict employee turnover and enhance the Human Resources department.

    Key Learnings:

    Leveraging Data Science for HR Transformation: Understand how data science can reduce employee turnover and revolutionize HR.

    Logistic Regression and Random Forest Classifiers: Grasp the theory behind these classifiers and implement them using scikit-learn.

    Sigmoid Functions and Pandas DataFrames: Extract probability values using sigmoid functions and manipulate datasets with Pandas.

    Python Functions and Pandas Dataframe Applications: Develop and apply Python functions to Pandas dataframes.

    Exploratory Data Analysis with Matplotlib and Seaborn: Perform EDA using Matplotlib and Seaborn, generating KDE plots, box plots, and count plots.

    Categorical Variable Transformation and Data Set Division: Convert categorical variables into dummy variables and divide datasets into training and testing sets using scikit-learn.

    Artificial Neural Networks for Classification: Understand the theory and application of artificial neural networks in classification tasks.

    Classification Model Evaluation and Result Interpretation: Evaluate classification models using confusion matrices and classification reports, distinguishing between precision, recall, and F1 scores.
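
    A minimal sketch tying these steps together with scikit-learn (the file and column names, e.g. 'Attrition', are assumptions about the HR data, not confirmed by this dataset):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.model_selection import train_test_split
    
    # Hypothetical file and target column for illustration.
    df = pd.read_csv('human_resources.csv')
    X = pd.get_dummies(df.drop(columns=['Attrition'])) # categorical -> dummy variables
    y = df['Attrition']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=200)):
      model.fit(X_train, y_train)
      pred = model.predict(X_test)
      print(type(model).__name__)
      print(confusion_matrix(y_test, pred))
      print(classification_report(y_test, pred))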

    Embark on this data-driven journey to transform Human Resources!

  14. Data from: Indoor MIMO Channel Measurements of the Near-Field to Far-Field...

    • researchdata.tuwien.ac.at
    • researchdata.tuwien.at
    application/gzip +2
    Updated Jul 8, 2025
    Cite
    Richard Prüller; Richard Prüller; Robert Langwieser; Markus Rupp; Markus Rupp; Robert Langwieser; Robert Langwieser; Robert Langwieser (2025). Indoor MIMO Channel Measurements of the Near-Field to Far-Field Transition in FR3 [Dataset]. http://doi.org/10.48436/hzn2f-hbb79
    Available download formats: application/gzip, text/markdown, text/x-python
    Dataset updated
    Jul 8, 2025
    Dataset provided by
    TU Wien
    Authors
    Richard Prüller; Richard Prüller; Robert Langwieser; Markus Rupp; Markus Rupp; Robert Langwieser; Robert Langwieser; Robert Langwieser
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    This dataset provides detailed measurements of wireless channel sounding in the 6 GHz to 24 GHz frequency range, focused on the near-field to far-field transition in MIMO systems. The measurements were performed using virtual uniform linear arrays (ULAs) for Tx and Rx. All channel traces, including the ones for the noise estimation, are S21 parameters over frequency and were recorded using a vector network analyzer.

    The dataset includes the following components:

    1. channel.json.gz: The captured channel data in JSON format and compressed using GZIP.
    2. noise.json.gz: A noise measurement for judging system performance, in the same format as the channel data, also a GZIP compressed JSON file.
    3. demo.py: A Python script demonstrating how to interact with the dataset for analysis.
    4. README.md: The readme file.

    Dataset Components

    The two main files of the dataset are the channel data file and the noise data file.
    Both are GZIP compressed, UTF-8/ASCII encoded, JSON files that are structured as key/value pairs.
    The values are always arrays with at least one dimension.

    The channel traces in both files are complex numbers, split into two real-valued arrays in the JSON file, where the last two dimensions are MxN channel matrices.

    Channel Data File: channel.json.gz

    Dimensions

    A list of shorthands for the dimensions of the arrays appearing in the channel data file.

    • P / 6: Rx positions, higher values are closer to the Tx
    • A / 9: Far-field factors, 0.1 - 10, higher values are more in the far-field
    • L / 6001: Frequency points
    • M / 9: MIMO channel Rx antennas/ports
    • N / 9: MIMO channel Tx antennas/ports
    • 3: The cartesian coordinates of 3D space

    JSON - Key/Value pairs

    A list of key [value dimensions] for the entries in the channel data file.

    • frequency [L]: The frequency points in Hz
    • far_field_factor [A]: The far-field factors
    • room_size [3]: The bounding box size of the measurement environment in m
    • position_rx [P, A, M, 3]: The Rx antenna positions in m and room coordinates
    • rotation_rx [3, 3]: The rotation matrix for the transformation from room coordinates to Rx antenna local coordinates
    • position_tx [P, A, N, 3]: The Tx antenna positions in m and room coordinates
    • rotation_tx [3, 3]: The rotation matrix for the transformation from room coordinates to Tx antenna local coordinates
    • channel_real [P, A, L, M, N]: The real part of the channel transfer matrices
    • channel_imag [P, A, L, M, N]: The imaginary part of the channel transfer matrices.

    Noise Data File: noise.json.gz

    Dimensions

    A list of shorthands for the dimensions of the arrays appearing in the noise data file.

    • R / 200: Measurement repetitions for statistical evaluation
    • L / 6001: Frequency points
    • M / 3: MIMO channel Rx antennas/ports
    • N / 3: MIMO channel Tx antennas/ports
    • 3: The cartesian coordinates of 3D space

    JSON - Key/Value pairs

    A list of key [value dimensions] for the entries in the noise data file.

    • frequency [L]: The frequency points in Hz
    • room_size [3]: The bounding box size of the measurement environment in m
    • position_rx [M, 3]: The Rx antenna positions in m and room coordinates
    • rotation_rx [3, 3]: The rotation matrix for the transformation from room coordinates to Rx antenna local coordinates
    • position_tx [N, 3]: The Tx antenna positions in m and room coordinates
    • rotation_tx [3, 3]: The rotation matrix for the transformation from room coordinates to Tx antenna local coordinates
    • channel_real [R, L, M, N]: The real part of the channel transfer matrices
    • channel_imag [R, L, M, N]: The imaginary part of the channel transfer matrices.
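
    A minimal sketch of loading one of the GZIP-compressed JSON files and reassembling the complex channel matrices from their real and imaginary parts (independent of the provided demo.py):

    import gzip
    import json
    
    import numpy as np
    
    # Load the compressed JSON channel data file.
    with gzip.open('channel.json.gz', 'rt', encoding='utf-8') as f:
      data = json.load(f)
    
    frequency = np.asarray(data['frequency'])  # [L] frequency points in Hz
    channel = np.asarray(data['channel_real']) + 1j * np.asarray(data['channel_imag']) # [P, A, L, M, N]
    print(frequency.shape, channel.shape)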

    Demo Code: demo.py

    The demo code is a Python script designed to demonstrate how to load, process, and analyze the channel data.

    Capabilities

    • Loading channel data and noise data from the GZIP compressed JSON files
    • Plot frequency- and delay-domain channel traces
    • Plot an analysis of the near-field to far-field transition
    • Plot the positions and orientations of Tx and Rx antennas
    • Plot the SNR over frequency, estimated from the noise data

    Requirements

    To run the demo code, the following Python setup is required:

    python >= 3.12
    numpy >= 2.0.0
    matplotlib >= 3.9

    The code was tested with these packages; however, it likely also works with previous versions.

    Acknowledgment

    This work has been funded by the Christian Doppler Laboratory for Digital Twin assisted AI for sustainable Radio Access Networks.
    The financial support by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development and the Christian Doppler Research Association is gratefully acknowledged.

    License

    The data files channel.json.gz, noise.json.gz and README.md are licensed under CC BY 4.0. The code file demo.py is licensed under the MIT License.

  15. Replication Data and Code for: 'Machine learning isotropic g values of...

    • data-legacy.fz-juelich.de
    Updated Mar 11, 2024
    Cite
    Jülich DATA (2024). Replication Data and Code for: 'Machine learning isotropic g values of radical polymers' [Dataset]. http://doi.org/10.26165/JUELICH-DATA/TOBXWP
    Explore at:
    xyz(7432), application/x-ipynb+json(24454), xyz(7426), xyz(7590), xyz(7473), xyz(7220), xyz(7438), xyz(7440), xyz(7325), xyz(7431), xyz(7405), xyz(7437), xyz(7673), xyz(7377), xyz(7587), xyz(7439), xyz(7548), xyz(7585), xyz(7433), xyz(7456), xyz(7672), xyz(7448), xyz(7451), bin(1062), xyz(7392), xyz(7400), xyz(7370), xyz(7413), xyz(7586), xyz(7395), xyz(7402), xyz(7316), xyz(7122), xyz(7481), xyz(7564), xyz(7404), xyz(7209), xyz(7631), xyz(7529), pdf(441670), xyz(7435), xyz(7403), xyz(7314), xyz(7349), xyz(7503), xyz(7469), xyz(7407), xyz(7353), xyz(7040), xyz(7644), xyz(7441), xyz(7465), xyz(7665), xyz(7198), xyz(7464), xyz(7381), xyz(7434), xyz(7397), xyz(7463), xyz(7414), xyz(7531), bin(1328), xyz(7367), bin(371648), xyz(10059), xyz(7506), xyz(7043), bin(5702528), xyz(7466), xyz(7380), xyz(7358), xyz(7352), bin(2032), xyz(7444), xyz(7499), xyz(7488), xyz(7635), bin(224), bin(894128), xyz(7555), xyz(7562), xyz(7092), xyz(7422), xyz(7524), xyz(7577), xyz(7317), xyz(7211), xyz(7183), xyz(7280), xyz(7390), xyz(7335), xyz(7417), xyz(7070), xyz(7127), xyz(7643), xyz(7406), xyz(7201), xyz(7130), xyz(7470), xyz(7326), xyz(7046), xyz(7674), bin(804728), xyz(7592), xyz(7624), xyz(7312), xyz(7099), xyz(7475), xyz(7119), xyz(7482), xyz(7423), xyz(7126), xyz(7420), xyz(7063), xyz(7371), xyz(7374), xyz(7354), xyz(7457), xyz(7494), xyz(7342), xyz(7339), bin(1208), xyz(7453), xyz(7554), xyz(7500), png(127746), xyz(7580), xyz(7399), xyz(10081), xyz(7447), xyz(7461), xyz(7429), xyz(7669), xyz(7510), xyz(7331), xyz(7647), txt(2985), xyz(7625), xyz(7450), xyz(7452), xyz(7068), xyz(7534), xyz(7572), xyz(7446), xyz(7372), xyz(7389), text/x-python(2048), xyz(7140), xyz(7283), xyz(7269), xyz(7103), bin(8359328), xyz(7639), xyz(7293), xyz(7421), xyz(7327), xyz(7472), xyz(7100), bin(2755), xyz(7474), xyz(7459), xyz(7018), xyz(7536), xyz(7145), xyz(7468), xyz(7401), xyz(7462), bin(708927), application/x-ipynb+json(3971), xyz(7366), xyz(7436), xyz(7073), xyz(7428), xyz(7393), xyz(7556), xyz(7386), xyz(7398), xyz(7368), xyz(7376), bin(650678), xyz(7642), xyz(7087), xyz(7291), bin(3240128), xyz(7615), xyz(7520), xyz(7442), xyz(7454), xyz(7384), xyz(7443), xyz(7343), bin(1936719), xyz(7409), xyz(7412), xyz(7101), xyz(7516), bin(4089), xyz(7582)Available download formats
    Dataset updated
    Mar 11, 2024
    Dataset provided by
    Jülich DATA
    Dataset funded by
    DFG
    RWTH Aachen University
    Description

    This data repository contains the data sets and Python scripts associated with the manuscript 'Machine learning isotropic g values of radical polymers'. Electron paramagnetic resonance measurements allow for obtaining experimental g values of radical polymers. Analogous to chemical shifts, g values give insight into the identity and environment of the paramagnetic center. In this work, machine-learning-based prediction of g values is explored as a viable alternative to computationally expensive density functional theory (DFT) methods.

    Description of folder contents:

    • Datasets: Contains PTMA polymer structures from the TR, TE-1 and TE-2 data sets transformed using a molecular descriptor (SOAP, MBTR or DAD) and the corresponding DFT-calculated g values. Filenames contain 'PTMA_X', where X denotes the number of monomers which are radicals. Structure data sets have 'structure_data' in the title; DFT-calculated g values have 'giso_DFT_data' in the title. The files are in .npy (NumPy) format.
    • Models: ERT models trained on SOAP, MBTR and DAD feature vectors.
    • Scripts: Contains scripts which can be used to predict g values from XYZ files of PTMA structures with 6 monomer units and varying radical density. The script 'prediction_functions.py' contains the functions which transform the XYZ coordinates into an appropriate feature vector which the trained model uses to predict. Descriptions of the individual functions are also given as docstrings (Python documentation strings) in the code. The folder also contains additional files needed for the ERT-DAD model in .pkl format.
    • XYZ_files: Contains atomic coordinates of PTMA structures in XYZ format. Two subfolders, WSD and TE-2, correspond to structures present in the whole structure data set and the TE-2 test data set (see the main text of the manuscript for details). Filenames in the folder 'XYZ_files/TE-2/PTMA-X/' are of the type 'chainlength_6ptma_Y'_Y''.xyz', where 'chainlength_6ptma' denotes the length of the polymer chain (6 monomers), Y' denotes the proportion of monomers which are radicals (for instance, Y' = 50 means 3 out of 6 monomers are radicals) and Y'' denotes the order of the MD time frame. Actual time frame values of Y'' in ps are given in the manuscript.
    • PTMA-ML.ipynb: Jupyter notebook detailing the workflow of generating the trained model. The file includes steps to load the data sets, transform XYZ files using molecular descriptors, optimise hyperparameters, train the model, cross-validate using the training data set and evaluate the model.
    • PTMA-ML.pdf: PTMA-ML.ipynb in PDF format.

    List of abbreviations:

    • PTMA: poly(2,2,6,6-tetramethyl-1-piperidinyloxy-4-yl methacrylate)
    • TR: Training data set
    • TE-1: Test data set 1
    • TE-2: Test data set 2
    • ERT: Extremely randomized trees
    • WSD: Whole structure data set
    • SOAP: Smooth overlap of atomic positions
    • MBTR: Many-body tensor representation
    • DAD: Distances-Angles-Dihedrals

  16. Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip
    Updated Jun 30, 2022
    Cite
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli; Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. http://doi.org/10.5281/zenodo.6568778
    Available download formats: application/gzip
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli; Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

    We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1000 subfolders as the ImageNet output classes.

    An example showing how to use the dataset is shown below.

    # code for testing robustness of a model
    import os.path
    
    from torchvision import datasets, transforms, models
    import torch.utils.data
    
    
    class ImageFolderWithEmptyDirs(datasets.ImageFolder):
      """
      This is required for handling empty folders from the ImageFolder Class.
      """
    
      def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
          raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx
    
    
    # extract and unzip the dataset, then write top folder here
    dataset_folder = 'data/ImageNet-Patch'
    
    available_labels = {
      487: 'cellular telephone',
      513: 'cornet',
      546: 'electric guitar',
      585: 'hair spray',
      804: 'soap dispenser',
      806: 'sock',
      878: 'typewriter keyboard',
      923: 'plate',
      954: 'banana',
      968: 'cup'
    }
    
    # select folder with specific target
    target_label = 954
    
    dataset_folder = os.path.join(dataset_folder, str(target_label))
    normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
    transforms = transforms.Compose([
      transforms.ToTensor(),
      normalizer
    ])
    
    dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transforms)
    model = models.resnet50(pretrained=True)
    loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
    model.eval()
    
    batches = 10
    correct, attack_success, total = 0, 0, 0
    for batch_idx, (images, labels) in enumerate(loader):
      if batch_idx == batches:
        break
      pred = model(images).argmax(dim=1)
      correct += (pred == labels).sum()
      attack_success += sum(pred == target_label)
      total += pred.shape[0]
    
    accuracy = correct / total
    attack_sr = attack_success / total
    
    print("Robust Accuracy: ", accuracy)
    print("Attack Success: ", attack_sr)
    

  17. Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping

    • figshare.com
    Updated Jan 6, 2025
    Cite
    Maryam Binti Haji Abdul Halim (2025). Enhancing UNCDF Operations: Power BI Dashboard Development and Data Mapping [Dataset]. http://doi.org/10.6084/m9.figshare.28147451.v1
    Dataset updated
    Jan 6, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Maryam Binti Haji Abdul Halim
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.

    Key Features and Tools:

    • Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
    • Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
    • Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
    • Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
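
    A minimal, illustrative sketch of the pandas-based cleaning and standardization step mentioned above (the file and column names are hypothetical):

    import numpy as np
    import pandas as pd
    
    # Hypothetical raw export from one of the operational apps.
    raw = pd.read_csv('timetrack_export.csv')
    
    cleaned = (
      raw.rename(columns=lambda c: c.strip().lower().replace(' ', '_')) # standardize headers
        .drop_duplicates()
        .replace({'': np.nan})
        .assign(
          submitted_on=lambda d: pd.to_datetime(d['submitted_on'], errors='coerce'),
          hours=lambda d: pd.to_numeric(d['hours'], errors='coerce').fillna(0.0),
        )
    )
    print(cleaned.dtypes)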

  18. compressed data directory for tcm_lib_search package

    • figshare.com
    application/gzip
    Updated Jun 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ya Wang; Qiang Song (2023). compressed data directory for tcm_lib_search package [Dataset]. http://doi.org/10.6084/m9.figshare.5504944.v3
    Available download formats: application/gzip
    Dataset updated
    Jun 18, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ya Wang; Qiang Song
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This compressed file contains the data directory for building the tcm_lib_search package from scratch. The data directory contains the SQLite database file with the extracted formulas, the variable transformation and the logistic regression model for predicting herbal tokens. After obtaining the source of the tcm_lib_search package from https://github.com/wang-shuyu/tcm_lib_search/, download tcm_lib_search-data-v1.0.tar.gz and decompress it into the data directory:

    cd tcm_lib_search
    tar xfvz tcm_lib_search-data-v1.0.tar.gz

  19. RailEnV-PASMVS: a dataset for multi-view stereopsis training and...

    • zenodo.org
    • resodate.org
    • +2more
    bin, csv, png, txt +1
    Updated Jul 18, 2024
    + more versions
    Cite
    André Broekman; André Broekman; Petrus Johannes Gräbe; Petrus Johannes Gräbe (2024). RailEnV-PASMVS: a dataset for multi-view stereopsis training and reconstruction applications [Dataset]. http://doi.org/10.5281/zenodo.5233840
    Available download formats: bin, csv, txt, zip, png
    Dataset updated
    Jul 18, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    André Broekman; André Broekman; Petrus Johannes Gräbe; Petrus Johannes Gräbe
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.

    File descriptions

    • RailEnV-PASMVS.blend (227 Mb) - Blender file (developed using Blender version 2.8.1) used to generate the dataset. The Blender file packs only one of the HDR environmental textures to use as an example, along with all the other asset textures.
    • RailEnV-PASMVS_sample.png (28 Mb) - A visual collage of 30 scenes, illustrating the variability introduced by using different models, illumination, material properties and camera focal lengths.
    • geometry.zip (2 Mb) - Geometry CSV files used for scenes 01 to 20. The Bezier curve defines the geometry of the rail profile (10 mm intervals).
    • PhysicalDataset.7z (2.0 Gb) - A smaller, secondary dataset of 320 manually annotated photographs of railway environments; only the railway profiles are annotated.
    • 01.7z-40.7z (2.0 Gb each) - Archive of every scene (01 through 40).
    • all_list.txt, training_list.txt, validation_list.txt - Text files containing all the scene names, together with those used for validation (validation_list.txt) and training (training_list.txt), used by MVSNet.
    • index.csv - CSV file provides a convenient reference for all the sample files, linking the corresponding file and relative data path.

    Steps to reproduce

    The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API interface. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position, coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.

    import numpy as np
    from scipy.spatial.transform import Rotation as R
    
    #The intrinsic matrix K is constructed using the following formulation
    #(the principal point is taken at the image centre):
    focalLengthPixel = focalLengthmm * pixelDimensionX / sensorWidthmm
    K = [[focalLengthPixel, 0, pixelDimensionX/2],
       [0, focalLengthPixel, pixelDimensionY/2],
       [0, 0, 1]]
    
    #The rotation vector as provided by Blender was first transformed to a rotation matrix:
    r = R.from_euler('xyz', vectR, degrees=True)
    matR = r.as_matrix()
    
    #Transpose the rotation matrix, to find matrix from the WORLD to BLENDER coordinate system:
    R_world2bcam = np.transpose(matR)
    
    #The matrix describing the transformation from BLENDER to CV/STANDARD coordinates is:
    R_bcam2cv = np.array([[1, 0, 0],
                   [0, -1, 0],
                   [0, 0, -1]])
    
    #Thus the representation from WORLD to CV/STANDARD coordinates is:
    R_world2cv = R_bcam2cv.dot(R_world2bcam)
    
    #The camera coordinate vector requires a similar transformation moving from BLENDER to WORLD coordinates:
    T_world2bcam = -1 * R_world2bcam.dot(vectC)
    T_world2cv = R_bcam2cv.dot(T_world2bcam)

    The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.
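
    For reference, a short sketch (not part of the original pipeline) of following that process in reverse, i.e. recovering Blender's rotation vector and camera position from R_world2cv and T_world2cv under the same conventions as above:

    #Undo the BLENDER to CV/STANDARD change of basis (R_bcam2cv is its own inverse):
    R_world2bcam = R_bcam2cv.dot(R_world2cv)
    T_world2bcam = R_bcam2cv.dot(T_world2cv)
    
    #Recover Blender's rotation matrix and Euler angles (vectR, in degrees):
    matR = np.transpose(R_world2bcam)
    vectR = R.from_matrix(matR).as_euler('xyz', degrees=True)
    
    #Recover the camera coordinate vector vectC in WORLD coordinates:
    vectC = -1 * np.transpose(R_world2bcam).dot(T_world2bcam)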

    Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.

  20. ONE DATA Data Sience Workflows

    • zenodo.org
    json
    Updated Sep 17, 2021
    Cite
    Lorenz Wendlinger; Emanuel Berndl; Michael Granitzer; Lorenz Wendlinger; Emanuel Berndl; Michael Granitzer (2021). ONE DATA Data Sience Workflows [Dataset]. http://doi.org/10.5281/zenodo.4633704
    Available download formats: json
    Dataset updated
    Sep 17, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lorenz Wendlinger; Emanuel Berndl; Michael Granitzer; Lorenz Wendlinger; Emanuel Berndl; Michael Granitzer
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The ONE DATA data science workflow dataset ODDS-full comprises 815 unique workflows in temporally ordered versions.
    A version of a workflow describes its evolution over time, so whenever a workflow is altered meaningfully, a new version of this respective workflow is persisted.
    Overall, 16035 versions are available.

    The ODDS-full workflows represent machine learning workflows expressed as node-heterogeneous DAGs with 156 different node types.
    These node types represent various kinds of processing steps of a general machine learning workflow and are grouped into 5 categories, which are listed below.

    • Load Processors for loading or generating data (e.g. via a random number generator).
    • Save Processors for persisting data (possible in various data formats, via external connections or as a contained result within the ONE DATA platform) or for providing data to other places as a service.
    • Transformation Processors for altering and adapting data. This includes e.g. database-like operations such as renaming columns or joining tables as well as fully fledged dataset queries.
    • Quantitative Methods Various aggregation or correlation analysis, bucketing, and simple forecasting.
    • Advanced Methods Advanced machine learning algorithms such as BNN or Linear Regression. Also includes special meta processors that for example allow the execution of external workflows within the original workflow.

    Any metadata beyond the structure and node types of a workflow has been removed for anonymization purposes.

    ODDS, a filtered variant which enforces weak connectedness and only contains workflows with at least 5 different versions and 5 nodes, is available as the default version for supervised and unsupervised learning.

    Workflows are served as JSON node-link graphs via networkx.

    They can be loaded into python as follows:

    import pandas as pd
    import networkx as nx
    import json
    
    with open('ODDS.json', 'r') as f:
      graphs = pd.Series(list(map(nx.node_link_graph, json.load(f)['graphs'])))
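
    As a small follow-up (an illustrative continuation, not part of the original snippet), the loaded graphs can be inspected with networkx, e.g. to verify acyclicity and count node types; the node attribute name 'type' is an assumption about the node-link data:

    from collections import Counter
    
    g = graphs.iloc[0]
    print(nx.is_directed_acyclic_graph(g))  # workflows are DAGs
    
    # Count the processor (node) types appearing in the first workflow.
    print(Counter(attrs.get('type') for _, attrs in g.nodes(data=True)))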
