Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a CSV file resulting from the linear transformation y = 3*x + 6 applied to 1000 randomly generated integers between 0 and 100. It was generated by applying the linear transformation to 1000 data points produced with Python's random.randint() function.
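The generation procedure can be reproduced with a short script along the following lines (a minimal sketch; the output file name and column names are illustrative, not part of the dataset):

import csv
import random

# 1000 random integers between 0 and 100, transformed by y = 3*x + 6
rows = []
for _ in range(1000):
    x = random.randint(0, 100)
    rows.append((x, 3 * x + 6))

# write the (x, y) pairs to a CSV file; file and column names are illustrative
with open('linear_transformation.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['x', 'y'])
    writer.writerows(rows)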
JSON Files
This dataset comprises a collection of JSON files designed for use in various Python projects. Each JSON file contains structured data, making it ideal for tasks such as data analysis, machine learning, and application development. The data within these files can be easily manipulated using Python's extensive libraries, such as json, pandas, and numpy.
Whether you are training a machine learning model, developing an API, or working on data transformation tasks, this dataset provides the flexibility and structure needed to work effectively with JSON data in Python.
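For example, a single file from the collection could be loaded and flattened into a DataFrame as follows (a sketch only; 'example.json' stands in for any file in the collection, whose internal structure is not specified here):

import json
import pandas as pd

# load one JSON file from the collection (file name is illustrative)
with open('example.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# flatten nested records into a tabular form for analysis
df = pd.json_normalize(data)
print(df.head())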
The project 'Use of Deep Learning for structural analysis of CT-images of soil samples' used a set of soil sample data (CT images). All the data and programs used here are open source and were created with the help of open-source software. All processing steps are performed by Python programs that are included in the dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Big data images for conduction heat transfer
The related paper has been published here: M. Edalatifar, M.B. Tavakoli, M. Ghalambaz, F. Setoudeh, Using deep learning to learn physics of conduction heat transfer, Journal of Thermal Analysis and Calorimetry; 2020. https://doi.org/10.1007/s10973-020-09875-6
Steps to reproduce: The dataset is saved in two formats, .npz for Python and .mat for MATLAB. The .mat file is large, so it is compressed with WinZip. ReadDataset_Python.py and ReadDataset_Matlab.m are examples of reading the data with Python and MATLAB, respectively. To use the dataset in MATLAB, download Dataset/HeatTransferPhenomena_35_58.zip, unzip it, and then use ReadDataset_Matlab.m as an example. For Python, download Dataset/HeatTransferPhenomena_35_58.npz and run ReadDataset_Python.py.
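If ReadDataset_Python.py is not at hand, the .npz archive can also be inspected directly with NumPy; the keys stored inside the archive are not listed here, so they are printed first:

import numpy as np

# load the compressed archive and list the arrays it contains
data = np.load('HeatTransferPhenomena_35_58.npz')
print(data.files)

# access an individual array by its key, e.g. the first listed one
first_key = data.files[0]
print(data[first_key].shape)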
This dataset has been created for educational purposes, specifically to help learners practice SQL-like operations using Python's pandas library. It is ideal for beginners who want to improve their data manipulation, querying, and transformation skills in a notebook environment such as Kaggle.
The dataset simulates a simple personnel and department system. It includes two tables:
personel: Contains employee data such as names, ages, salaries, and department IDs.
departman: Contains department IDs and corresponding department names.
Throughout this project, key SQL operations have been demonstrated with their pandas equivalents. These include:
Basic commands: SELECT, INSERT, UPDATE, DELETE
Table structure operations: ALTER, DROP, TRUNCATE, COPY
Filtering and logical expressions: WHERE, AND, OR, IN, IS NULL, BETWEEN, LIKE
Aggregations and sorting: COUNT(), ORDER BY, LIMIT, DISTINCT
String functions: LOWER, TRIM, REPLACE, SPLIT, LENGTH
Joins: INNER JOIN, LEFT JOIN
Comparison operators: =, !=, <, >
The goal is to provide a hands-on, interactive environment for practicing SQL logic using real Python code. This dataset does not represent real individuals or businesses — it is entirely fictional and meant for training, teaching, and experimentation purposes only.
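A few representative pandas equivalents are sketched below; the example rows and column names are assumptions about the fictional tables, not the dataset's actual schema:

import pandas as pd

# fictional example rows; the actual column names in the dataset may differ
personel = pd.DataFrame({
    'name': ['Ada', 'Berk', 'Cem'],
    'age': [28, 35, 41],
    'salary': [52000, 61000, 58000],
    'department_id': [1, 2, 1],
})
departman = pd.DataFrame({
    'department_id': [1, 2],
    'department_name': ['Engineering', 'Finance'],
})

# SELECT * FROM personel WHERE age > 30 ORDER BY salary DESC
over_30 = personel[personel['age'] > 30].sort_values('salary', ascending=False)

# SELECT * FROM personel INNER JOIN departman USING (department_id)
joined = personel.merge(departman, on='department_id', how='inner')

# SELECT department_name, COUNT(*) FROM ... GROUP BY department_name
counts = joined.groupby('department_name').size()

print(over_30, joined, counts, sep='\n\n')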
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
The Free-living Food Intake Cycle (FreeFIC) dataset was created by the Multimedia Understanding Group towards the investigation of in-the-wild eating behavior. This is achieved by recording the subjects' meals as a small part of their everyday, unscripted activities. The FreeFIC dataset contains the 3D acceleration and orientation velocity signals (6 DoF) from 22 in-the-wild sessions provided by 12 unique subjects. All sessions were recorded using a commercial smartwatch (6 using the Huawei Watch 2™ and the MobVoi TicWatch™ for the rest) while the participants performed their everyday activities. In addition, FreeFIC also contains the start and end moments of each meal session as reported by the participants.
Description
FreeFIC includes 22 in-the-wild sessions that belong to 12 unique subjects. Participants were instructed to wear the smartwatch on the hand of their preference well ahead of any meal and to continue wearing it throughout the day until the battery was depleted. In addition, we followed a self-report labeling model, meaning that the ground truth is provided by the participants, who documented the start and end moments of their meals to the best of their abilities, as well as the hand on which they wore the smartwatch. The total duration of the 22 recordings sums up to 112.71 hours, with a mean duration of 5.12 hours. Additional data statistics can be obtained by executing the provided Python script stats_dataset.py. Furthermore, the accompanying Python script viz_dataset.py will visualize the IMU signals and ground truth intervals for each of the recordings. Information on how to execute the Python scripts can be found below.
$ python stats_dataset.py
$ python viz_dataset.py
FreeFIC is also tightly related to Food Intake Cycle (FIC), a dataset we created in order to investigate the in-meal eating behavior. More information about FIC can be found here and here.
Publications
If you plan to use the FreeFIC dataset or any of the resources found in this page, please cite our work:
@article{kyritsis2020data,
title={A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
journal={IEEE Journal of Biomedical and Health Informatics},
year={2020},
publisher={IEEE}}
@inproceedings{kyritsis2017automated,
title={Detecting Meals In the Wild Using the Inertial Data of a Typical Smartwatch},
author={Kyritsis, Konstantinos and Diou, Christos and Delopoulos, Anastasios},
booktitle={2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)},
year={2019},
organization={IEEE}}
Technical details
We provide the FreeFIC dataset as a pickle. The file can be loaded using Python in the following way:
import pickle as pkl
import numpy as np

with open('./FreeFIC_FreeFIC-heldout.pkl', 'rb') as fh:
    dataset = pkl.load(fh)
The dataset variable in the snippet above is a dictionary with 5 keys. Namely:
'subject_id'
'session_id'
'signals_raw'
'signals_proc'
'meal_gt'
The contents under a specific key can be obtained by:
sub = dataset['subject_id']     # for the subject id
ses = dataset['session_id']     # for the session id
raw = dataset['signals_raw']    # for the raw IMU signals
proc = dataset['signals_proc']  # for the processed IMU signals
gt = dataset['meal_gt']         # for the meal ground truth
The sub, ses, raw, proc and gt variables in the snippet above are lists with a length equal to 22. Elements across all lists are aligned; e.g., the 3rd element of the list under the 'session_id' key corresponds to the 3rd element of the list under the 'signals_proc' key.
sub: list Each element of the sub list is a scalar (integer) that corresponds to the unique identifier of the subject and can take the following values: [1, 2, 3, 4, 13, 14, 15, 16, 17, 18, 19, 20]. It should be emphasized that the subjects with ids 15, 16, 17, 18, 19 and 20 belong to the held-out part of the FreeFIC dataset (more information can be found in the publication titled "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al.). Moreover, the subject identifier in FreeFIC is in line with the subject identifier in the FIC dataset (more info here and here); i.e., FIC's subject with id equal to 2 is the same person as FreeFIC's subject with id equal to 2.
ses: list Each element of this list is a scalar (integer) that corresponds to the unique identifier of the session and can range between 1 and 5. It should be noted that not all subjects have the same number of sessions.
raw: list Each element of this list is a dictionary with the 'acc' and 'gyr' keys. The data under the 'acc' key is an N_acc × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw accelerometer measurements in g (second, third and fourth columns, representing the x, y and z axes, respectively). The data under the 'gyr' key is an N_gyr × 4 numpy.ndarray that contains the timestamps in seconds (first column) and the 3D raw gyroscope measurements in degrees/second (second, third and fourth columns, representing the x, y and z axes, respectively). All sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is consistent with the signals in the FIC dataset (more info here and here). Finally, the lengths of the raw accelerometer and gyroscope numpy.ndarrays differ (N_acc ≠ N_gyr). This behavior is expected and is caused by the Android platform.
proc: list Each element of this list is an M × 7 numpy.ndarray that contains the timestamps and the 3D accelerometer and gyroscope measurements for each in-the-wild session. Specifically, the first column contains the timestamps in seconds, the second, third and fourth columns contain the x, y and z accelerometer values in g, and the fifth, sixth and seventh columns contain the x, y and z gyroscope values in degrees/second. Unlike elements in the raw list, processed measurements (in the proc list) have a constant sampling rate of 100 Hz and the accelerometer/gyroscope measurements are aligned with each other. In addition, all sensor streams are transformed in such a way that reflects all participants wearing the smartwatch on the same hand with the same orientation, thus achieving data uniformity. This transformation is consistent with the signals in the FIC dataset (more info here and here). No other preprocessing is performed on the data; e.g., the acceleration component due to the Earth's gravitational field is present in the processed acceleration measurements. The interested researcher can consult the article "A Data Driven End-to-end Approach for In-the-wild Monitoring of Eating Behavior Using Smartwatches" by Kyritsis et al. on how to further preprocess the IMU signals (i.e., smooth them and remove the gravitational component).
meal_gt: list Each element of this list is a K × 2 matrix. Each row represents a meal interval for the specific in-the-wild session. The first column contains the timestamps of the meal start moments, whereas the second one contains the timestamps of the meal end moments. All timestamps are in seconds. The number of meals K varies across recordings (e.g., there exist recordings where a participant consumed two meals).
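As an illustration of how the aligned lists can be combined, the following minimal sketch (assuming the dataset has been loaded as shown above and that each meal_gt element behaves as a K × 2 NumPy array) extracts the processed IMU samples that fall within the reported meal intervals of the first session:

# isolate processed IMU samples recorded during the meals of the first session
session_proc = proc[0]   # M x 7 array: timestamp, 3D accelerometer, 3D gyroscope
session_meals = gt[0]    # K x 2 array: meal start and end timestamps in seconds

for start_t, end_t in session_meals:
    mask = (session_proc[:, 0] >= start_t) & (session_proc[:, 0] <= end_t)
    meal_samples = session_proc[mask]
    print('Meal from', start_t, 'to', end_t, 'covers', meal_samples.shape[0], 'samples')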
Ethics and funding
Informed consent, including permission for third-party access to anonymised data, was obtained from all subjects prior to their engagement in the study. The work has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 727688 - BigO: Big data against childhood obesity.
Contact
Any inquiries regarding the FreeFIC dataset should be addressed to:
Dr. Konstantinos KYRITSIS
Multimedia Understanding Group (MUG) Department of Electrical & Computer Engineering Aristotle University of Thessaloniki University Campus, Building C, 3rd floor Thessaloniki, Greece, GR54124
Tel: +30 2310 996359, 996365 Fax: +30 2310 996398 E-mail: kokirits [at] mug [dot] ee [dot] auth [dot] gr
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The H1B is an employment-based visa category for temporary foreign workers in the United States. Every year, the US immigration department receives over 200,000 petitions and selects 85,000 applications through a random process; the U.S. employer must submit a petition for an H1B visa to the US immigration department. This is the most common visa status applied for by international students once they complete college or higher education and begin working in a full-time position. The project provides essential information on job titles, preferred regions of settlement, foreign applicants and employers' trends for H1B visa applications. Because locations, employers, job titles and salary ranges make up most of the H1B petitions, different visualization tools are used to analyze and interpret the trends of the H1B visa and provide recommendations to applicants. This report is the basis of the project for the Visualization of Complex Data class at the George Washington University; the project analyzes the relevant variables (Case Status, Employer Name, SOC Name, Job Title, Prevailing Wage, Worksite, and Latitude and Longitude information) from Kaggle and the Office of Foreign Labor Certification (OFLC) in order to see how the H1B visa has changed over the past several decades.
Keywords: H1B visa, Data Analysis, Visualization of Complex Data, HTML, JavaScript, CSS, Tableau, D3.js
Dataset
The dataset contains 10 columns and covers a total of 3 million records spanning 2011-2016. The relevant columns in the dataset include case status, employer name, SOC name, job title, full time position, prevailing wage, year, worksite, and latitude and longitude information.
Link to dataset: https://www.kaggle.com/nsharan/h-1b-visa
Link to dataset (FY2017): https://www.foreignlaborcert.doleta.gov/performancedata.cfm
Running the code
Open Index.html
Data Processing
Perform data preprocessing to transform the raw data into an understandable format.
Find and combine other external datasets to enrich the analysis, such as the FY2017 dataset.
Develop the variables and compile them into visualization programs to make appropriate visualizations.
Draw a geo map and scatter plot to compare the fastest growth in fixed value and in percentages.
Extract some aspects and analyze the changes in employers' preferences as well as forecasts for future trends.
Visualizations
Combo chart: shows the overall volume of receipts and the approval rate.
Scatter plot: shows the beneficiary country of birth.
Geo map: shows all states of H1B petitions filed.
Line chart: shows the top 10 states of H1B petitions filed.
Pie chart: shows a comparison of education level and occupations for petitions, FY2011 vs FY2017.
Tree map: shows the overall top employers who submit the greatest number of applications.
Side-by-side bar chart: shows an overall comparison of Data Scientist and Data Analyst.
Highlight table: shows the mean wage of a Data Scientist and a Data Analyst with case status certified.
Bubble chart: shows the top 10 companies for Data Scientist and Data Analyst.
Related Research
The H-1B Visa Debate, Explained - Harvard Business Review: https://hbr.org/2017/05/the-h-1b-visa-debate-explained
Foreign Labor Certification Data Center: https://www.foreignlaborcert.doleta.gov
Key facts about the U.S. H-1B visa program: http://www.pewresearch.org/fact-tank/2017/04/27/key-facts-about-the-u-s-h-1b-visa-program/
H1B visa News and Updates from The Economic Times: https://economictimes.indiatimes.com/topic/H1B-visa/news
H-1B visa - Wikipedia: https://en.wikipedia.org/wiki/H-1B_visa
Key Findings
From the analysis, the government cut down the number of approvals for H1B in 2017.
In the past decade, due to the nature of demand for high-skilled workers, visa holders have clustered in STEM fields and come mostly from countries in Asia such as China and India.
Technical jobs, such as Computer Systems Analyst and Software Developer, fill up the majority of the top 10 jobs among foreign workers.
Employers located in metro areas strive to find a foreign workforce that can fill the technical positions in their organizations.
States like California, New York, Washington, New Jersey, Massachusetts, Illinois, and Texas are the prime locations for foreign workers and provide many job opportunities.
Top companies such as Infosys, Tata and IBM India that submit the most H1B visa applications are companies based in India associated with software and IT services.
The Data Scientist position has experienced exponential growth in terms of H1B visa applications, and jobs are clustered in the West region with the highest number.
Visualization programs used
HTML, JavaScript, CSS, D3.js, Google API, Python, R, and Tableau
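As a sketch of the data processing step, the Kaggle CSV could be loaded and summarized with pandas; the file name and column names below (h1b_kaggle.csv, YEAR, EMPLOYER_NAME) are assumptions and may need to be adapted to the actual download:

import pandas as pd

# assumed file and column names; adjust to the actual Kaggle download
df = pd.read_csv('h1b_kaggle.csv')

# count petitions per fiscal year
petitions_per_year = df.groupby('YEAR').size()

# top 10 employers by number of petitions
top_employers = df['EMPLOYER_NAME'].value_counts().head(10)

print(petitions_per_year)
print(top_employers)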
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Abstract This project presents a comprehensive analysis of a company's annual sales, using the classic dataset classicmodels as the database. Python is used as the main programming language, along with the Pandas, NumPy and SQLAlchemy libraries for data manipulation and analysis, and PostgreSQL as the database management system.
The main objective of the project is to answer key questions related to the company's sales performance, such as: Which were the most profitable products and customers? Were sales goals met? The results obtained serve as input for strategic decision making in future sales campaigns.
Methodology
1. Data Extraction:
2. Data Cleansing and Transformation:
3. Exploratory Data Analysis (EDA):
4. Modeling and Prediction:
5. Report Generation:
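A minimal sketch of the extraction step (1), assuming a local PostgreSQL instance with the classicmodels schema loaded and an illustrative connection string; table and column names follow the standard classicmodels schema and their case may differ depending on how the database was imported:

import pandas as pd
from sqlalchemy import create_engine

# illustrative connection string; adjust user, password, host and database name
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/classicmodels')

# pull the order details into a DataFrame for cleansing and analysis
orderdetails = pd.read_sql('SELECT * FROM orderdetails', engine)

# example aggregation: revenue per product
orderdetails['revenue'] = orderdetails['quantityordered'] * orderdetails['priceeach']
revenue_per_product = orderdetails.groupby('productcode')['revenue'].sum().sort_values(ascending=False)
print(revenue_per_product.head())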
Results
- Identification of top products and customers: The best-selling products and the customers that generate the most revenue are identified.
- Analysis of sales trends: Sales trends over time are analyzed and possible factors that influence sales behavior are identified.
- Calculation of key metrics: Metrics such as average profit margin and sales growth rate are calculated.
Conclusions
This project demonstrates how Python and PostgreSQL can be effectively used to analyze large data sets and obtain valuable insights for business decision making. The results obtained can serve as a starting point for future research and development in the area of sales analysis.
Technologies Used
- Python: Pandas, NumPy, SQLAlchemy, Matplotlib/Seaborn
- Database: PostgreSQL
- Tools: Jupyter Notebook
- Keywords: data analysis, Python, PostgreSQL, Pandas, NumPy, SQLAlchemy, EDA, sales, business intelligence
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Python scripts with instructions for the extraction and transformation of original datasets; Transformed datasets; Dataset FA/LCB constraints.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this package are the datasets extracted from the Artleafs database, as well as the RO-Crate packages and the RDF dataset generated from them for the project to create RO-Crates using the PHCAT ontology. You can also find the Python scripts used to transform the extracted CSV data into a new RDF dataset, allowing you to create more RO-Crate packages if desired.
./data: contains the set of data extracted from the database in CSV format.
./resources: contains the generated RO-Crate packages as well as the mapping files used and the RDF subsets of each article.
./OutputPhotocatalysisMapping.ttl: the file in Turtle format that stores the global RDF data set after the translation of the database data.
The rest of the folders and files contain mapping rules and scripts used in the data transformation process. For more information check the following GitHub repository: https://github.com/oeg-upm/photocatalysis-ontology.
These data were used to examine grammatical structures and patterns within a set of geospatial glossary definitions. Objectives of our study were to analyze the semantic structure of input definitions, use this information to build triple structures of RDF graph data, upload our lexicon to a knowledge graph software, and perform SPARQL queries on the data. Upon completion of this study, SPARQL queries were proven to effectively convey graph triples which displayed semantic significance. These data represent and characterize the lexicon of our input text, which is used to form graph triples. These data were collected in 2024 by passing text through multiple Python programs utilizing spaCy (a natural language processing library) and its pre-trained English transformer pipeline. Before the data were processed by the Python programs, input definitions were first rewritten as natural language and formatted as tabular data. Passages were then tokenized and characterized by their part-of-speech, tag, dependency relation, dependency head, and lemma. Each word within the lexicon was tokenized. A stop-words list was utilized only to remove punctuation and symbols from the text, excluding hyphenated words (e.g., bowl-shaped), which remained as such. The tokens' lemmas were then aggregated and totaled to find their recurrences within the lexicon. This procedure was repeated for tokenizing noun chunks using the same glossary definitions.
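A sketch of the tokenization and lemma-aggregation procedure using spaCy's pre-trained English transformer pipeline (the input definition below is illustrative, not taken from the glossary):

import spacy
from collections import Counter

# requires the transformer pipeline: python -m spacy download en_core_web_trf
nlp = spacy.load('en_core_web_trf')

definition = 'A basin is a bowl-shaped depression in the surface of the land or ocean floor.'
doc = nlp(definition)

# characterize each token by part-of-speech, tag, dependency relation, head and lemma
for token in doc:
    if token.is_punct or token.pos_ == 'SYM':  # drop punctuation and symbol tokens
        continue
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text, token.lemma_)

# aggregate lemma recurrences across the processed text
lemma_counts = Counter(t.lemma_ for t in doc if not (t.is_punct or t.pos_ == 'SYM'))

# the same document also yields noun chunks
noun_chunks = [chunk.text for chunk in doc.noun_chunks]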
https://www.nist.gov/open/license
The yabadaba Python package provides an abstraction of database and data interactions. This makes it possible to interact with multiple different database infrastructures, and to manage the data interpretation and transformations of multiple different data schemas. The end goal is to provide a tool that makes it easy for data generators and maintainers to build user-friendly APIs that use common query methods and can be added in a modular fashion.
Objective: Utilize data science to predict employee turnover and enhance the Human Resources department.
Key Learnings:
Leveraging Data Science for HR Transformation: Understand how data science can reduce employee turnover and revolutionize HR.
Logistic Regression and Random Forest Classifiers: Grasp the theory behind these classifiers and implement them using scikit-learn.
Sigmoid Functions and Pandas DataFrames: Extract probability values using sigmoid functions and manipulate datasets with Pandas.
Python Functions and Pandas Dataframe Applications: Develop and apply Python functions to Pandas dataframes.
Exploratory Data Analysis with Matplotlib and Seaborn: Perform EDA using Matplotlib and Seaborn, generating KDE plots, box plots, and count plots.
Categorical Variable Transformation and Data Set Division: Convert categorical variables into dummy variables and divide datasets into training and testing sets using scikit-learn.
Artificial Neural Networks for Classification: Understand the theory and application of artificial neural networks in classification tasks.
Classification Model Evaluation and Result Interpretation: Evaluate classification models using confusion matrices and classification reports, distinguishing between precision, recall, and F1 scores.
Embark on this data-driven journey to transform Human Resources!
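The workflow described above can be sketched with pandas and scikit-learn; the file name hr.csv, the Attrition label and the automatic dummy encoding below are placeholders rather than the actual course material:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# placeholder file and column names
df = pd.read_csv('hr.csv')

# convert categorical variables into dummy variables
X = pd.get_dummies(df.drop(columns=['Attrition']))
y = df['Attrition']

# divide the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# train both classifiers and evaluate them on the held-out set
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=200)):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(type(model).__name__)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))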
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides detailed measurements of wireless channel sounding in the 6 GHz to 24 GHz frequency range, focused on the near-field to far-field transition in MIMO systems. The measurements were performed using virtual uniform linear arrays (ULAs) for Tx and Rx. All channel traces, including the ones for the noise estimation, are S21 parameters over frequency and were recorded using a vector network analyzer.
The dataset includes the following components:
The two main files of the dataset are the channel data file and the noise data file.
Both are GZIP compressed, UTF-8/ASCII encoded, JSON files that are structured as key/value pairs.
The values are always arrays with at least one dimension.
The channel traces in both files are complex numbers, split into two real-valued arrays in the JSON file, where the last two dimensions are MxN channel matrices.
A list of shorthands for the dimensions of the arrays appearing in the channel data file.
A list of key [value dimensions] for the entries in the channel data file.
A list of shorthands for the dimensions of the arrays appearing in the noise data file.
A list of key [value dimensions] for the entries in the noise data file.
The demo code is a Python script designed to demonstrate how to load, process, and analyze the channel data.
To run the demo code, the following Python setup is required:
python >= 3.12
numpy >= 2.0.0
matplotlib >= 3.9
The code was tested with these package versions; however, it likely also works with previous versions.
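The description above suggests a loading routine along the following lines; the key names 'S21_real' and 'S21_imag' are purely illustrative, the actual keys are documented in the key lists shipped with the dataset:

import gzip
import json
import numpy as np

# load the GZIP-compressed, UTF-8 encoded JSON channel data file
with gzip.open('channel.json.gz', 'rt', encoding='utf-8') as f:
    channel = json.load(f)

print(list(channel.keys()))  # inspect the available entries

# recombine a complex-valued channel trace from its two real-valued arrays
# (key names are illustrative; consult the key list provided with the dataset)
s21 = np.array(channel['S21_real']) + 1j * np.array(channel['S21_imag'])
print(s21.shape)  # the last two dimensions are the MxN channel matrices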
This work has been funded by the Christian Doppler Laboratory for Digital Twin assisted AI for sustainable Radio Access Networks.
The financial support by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology and Development and the Christian Doppler Research Association is gratefully acknowledged.
The data files channel.json.gz, noise.json.gz and README.md are licensed under CC BY 4.0. The code file demo.py is licensed under the MIT License.
This data repository contains the data sets and Python scripts associated with the manuscript 'Machine learning isotropic g values of radical polymers'. Electron paramagnetic resonance measurements allow for obtaining experimental g values of radical polymers. Analogous to chemical shifts, g values give insight into the identity and environment of the paramagnetic center. In this work, machine learning based prediction of g values is explored as a viable alternative to computationally expensive density functional theory (DFT) methods.
Description of folder contents:
Datasets: Contains PTMA polymer structures from the TR, TE-1, and TE-2 data sets transformed using a molecular descriptor (SOAP, MBTR or DAD) and the corresponding DFT-calculated g values. Filenames contain 'PTMA_X', where X denotes the number of monomers which are radicals. Structure data sets have 'structure_data' in the title; DFT-calculated g values have 'giso_DFT_data' in the title. The files are in .npy (NumPy) format.
Models: ERT models trained on SOAP, MBTR and DAD feature vectors.
Scripts: Contains scripts which can be used to predict g values from XYZ files of PTMA structures with 6 monomer units and varying radical density. The script 'prediction_functions.py' contains the functions which transform the XYZ coordinates into an appropriate feature vector which the trained model uses to predict. Descriptions of individual functions are also given as docstrings (Python documentation strings) in the code. The folder also contains additional files needed for the ERT-DAD model in .pkl format.
XYZ_files: Contains atomic coordinates of PTMA structures in XYZ format. Two subfolders, WSD and TE-2, correspond to structures present in the whole structure data set and the TE-2 test data set (see the main text of the manuscript for details). Filenames in the folder 'XYZ_files/TE-2/PTMA-X/' are of the type 'chainlength_6ptma_Y'_Y''.xyz', where 'chainlength_6ptma' denotes the length of the polymer chain (6 monomers), Y' denotes the proportion of monomers which are radicals (for instance, Y' = 50 means 3 out of 6 monomers are radicals) and Y'' denotes the order of the MD time frame. Actual time frame values of Y'' in ps are given in the manuscript.
PTMA-ML.ipynb: Jupyter notebook detailing the workflow of generating the trained model. The file includes steps to load data sets, transform XYZ files using molecular descriptors, optimise hyperparameters, train the model, cross validate using the training data set and evaluate the model.
PTMA-ML.pdf: PTMA-ML.ipynb in PDF format.
List of abbreviations:
PTMA: poly(2,2,6,6-tetramethyl-1-piperidinyloxy-4-yl methacrylate)
TR: Training data set
TE-1: Test data set 1
TE-2: Test data set 2
ERT: Extremely randomized trees
WSD: Whole structure data set
SOAP: Smooth overlap of atomic positions
MBTR: Many-body tensor representation
DAD: Distances-Angles-Dihedrals
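As a sketch, one transformed structure data set and its g values could be loaded with NumPy; the exact file names below are assumptions based on the naming convention described above and may need to be adjusted:

import numpy as np

# assumed file names following the 'PTMA_X' / 'structure_data' / 'giso_DFT_data' convention
structures = np.load('PTMA_2_SOAP_structure_data.npy', allow_pickle=True)
g_values = np.load('PTMA_2_giso_DFT_data.npy', allow_pickle=True)

print(structures.shape, g_values.shape)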
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.
We release our dataset as a set of folders indicating the patch target label (e.g., `banana`), each containing 1000 subfolders as the ImageNet output classes.
An example showing how to use the dataset is shown below.
# code for testing robustness of a model
import os.path

import torch.utils.data
from torchvision import datasets, transforms, models


class ImageFolderWithEmptyDirs(datasets.ImageFolder):
    """
    This is required for handling empty folders from the ImageFolder class.
    """

    def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
            raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                        len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx


# extract and unzip the dataset, then write top folder here
dataset_folder = 'data/ImageNet-Patch'

available_labels = {
    487: 'cellular telephone',
    513: 'cornet',
    546: 'electric guitar',
    585: 'hair spray',
    804: 'soap dispenser',
    806: 'sock',
    878: 'typewriter keyboard',
    923: 'plate',
    954: 'banana',
    968: 'cup'
}

# select folder with specific target
target_label = 954

dataset_folder = os.path.join(dataset_folder, str(target_label))
normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                  std=[0.229, 0.224, 0.225])
transform = transforms.Compose([
    transforms.ToTensor(),
    normalizer
])

dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transform)
model = models.resnet50(pretrained=True)
loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
model.eval()

batches = 10
correct, attack_success, total = 0, 0, 0
for batch_idx, (images, labels) in enumerate(loader):
    if batch_idx == batches:
        break
    pred = model(images).argmax(dim=1)
    correct += (pred == labels).sum()
    attack_success += (pred == target_label).sum()
    total += pred.shape[0]

accuracy = correct / total
attack_sr = attack_success / total
print("Robust Accuracy: ", accuracy)
print("Attack Success: ", attack_sr)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.
Key Features and Tools:
Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.
Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.
Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.
Collaboration Across Platforms: Integrated Google Colab for code collaboration and Microsoft Excel for data validation and analysis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This compressed file contains the data directory for building the tcm_lib_search package from scratch. This data directory contains the SQLite database file with extracted formulas, the variable transformation and the logistic regression model for predicting herbal tokens. After obtaining the source of the tcm_lib_search package from https://github.com/wang-shuyu/tcm_lib_search/, download tcm_lib_search-data-v1.0.tar.gz and decompress it into the data directory:
cd tcm_lib_search
tar xfvz tcm_lib_search-data-v1.0.tar.gz
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.
File descriptions
Steps to reproduce
The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API interface. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position, coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.
import numpy as np
from scipy.spatial.transform import Rotation as R

# The intrinsic matrix K is constructed using the following formulation
# (focalLengthmm, pixelDimensionX, pixelDimensionY, sensorWidthmm, vectR and vectC are taken from the per-scene scene.csv export):
focalLengthPixel = focalLengthmm * pixelDimensionX / sensorWidthmm
K = [[focalLengthPixel, 0, pixelDimensionX / 2],
     [0, focalLengthPixel, pixelDimensionY / 2],
     [0, 0, 1]]

# The rotation vector as provided by Blender is first transformed to a rotation matrix:
r = R.from_euler('xyz', vectR, degrees=True)
matR = r.as_matrix()

# Transpose the rotation matrix to find the matrix from the WORLD to the BLENDER coordinate system:
R_world2bcam = np.transpose(matR)

# The matrix describing the transformation from BLENDER to CV/STANDARD coordinates is:
R_bcam2cv = np.array([[1, 0, 0],
                      [0, -1, 0],
                      [0, 0, -1]])

# Thus the rotation from WORLD to CV/STANDARD coordinates is:
R_world2cv = R_bcam2cv.dot(R_world2bcam)

# The camera coordinate vector requires a similar transformation moving from BLENDER to WORLD coordinates:
T_world2bcam = -1 * R_world2bcam.dot(vectC)
T_world2cv = R_bcam2cv.dot(T_world2bcam)
The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.
Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The ONE DATA data science workflow dataset ODDS-full comprises 815 unique workflows in temporally ordered versions.
A version of a workflow describes its evolution over time, so whenever a workflow is altered meaningfully, a new version of this respective workflow is persisted.
Overall, 16035 versions are available.
The ODDS-full workflows represent machine learning workflows expressed as node-heterogeneous DAGs with 156 different node types.
These node types represent various kinds of processing steps of a general machine learning workflow and are grouped into 5 categories, which are listed below.
Any metadata beyond the structure and node types of a workflow has been removed for anonymization purposes.
ODDS, a filtered variant that enforces weak connectedness and only contains workflows with at least 5 different versions and 5 nodes, is available as the default version for supervised and unsupervised learning.
Workflows are served as JSON node-link graphs via networkx.
They can be loaded into python as follows:
import pandas as pd
import networkx as nx
import json
with open('ODDS.json', 'r') as f:
    graphs = pd.Series(list(map(nx.node_link_graph, json.load(f)['graphs'])))
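Once loaded, each element of the resulting pandas Series is a networkx graph, so basic structural statistics are directly available; for example:

# inspect the first workflow version
g = graphs.iloc[0]
print(g.number_of_nodes(), g.number_of_edges())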