Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Freesound Loop Dataset
This dataset contains 9,455 loops from Freesound.org and the corresponding annotations. These loops have tempo, key, genre and instrumentation annotations.
Dataset Construction
To collect this dataset, the following steps were performed:
Freesound was queried with "loop" and "bpm" to collect loops that have a beats-per-minute (BPM) annotation.
The sounds were analysed with the AudioCommons extractor to obtain key information.
The textual metadata of each sound was analysed to obtain the BPM proposed by the user and to obtain genre information.
Annotators used a web interface to annotate around 3,000 loops.
Dataset Organisation
The dataset contains two folders and two files in the root directory:
'FSL10K' encloses the audio files and their metadata and analysis. The audio files are in the 'audio' folder and are named '
'annotations' holds the expert-provided annotations for the sounds in the dataset. The annotations are separated into a folder for each annotator, and each annotation is stored as a .json file, named 'sound-
Licenses
All the sounds have some kind of Creative Commons license. The license of each sound in the dataset can be obtained from the 'FSL10K/metadata.json' file.
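For example, a minimal Python sketch for looking up a sound's license (the exact structure of 'FSL10K/metadata.json' is not documented here, so the keys used below are assumptions):

import json

with open("FSL10K/metadata.json") as f:
    metadata = json.load(f)

# Assumed structure: a mapping from Freesound sound id to a dict with a 'license' field.
for sound_id, info in list(metadata.items())[:5]:
    print(sound_id, info.get("license"))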
Authors and Contact
This dataset was developed by António Ramires et al.
For any questions related to this dataset, please contact:
António Ramires
References
Please cite this paper if you use this dataset:
@inproceedings{ramires2020,
  author = "Antonio Ramires and Frederic Font and Dmitry Bogdanov and Jordan B. L. Smith and Yi-Hsuan Yang and Joann Ching and Bo-Yu Chen and Yueh-Kao Wu and Hsu Wei-Han and Xavier Serra",
  title = "The Freesound Loop Dataset and Annotation Tool",
  booktitle = "Proc. of the 21st International Society for Music Information Retrieval (ISMIR)",
  year = "2020"
}
Acknowledgements
This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068 (MIP-Frontiers).
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Overview: Three new datasets available here represent normal household areas with common objects - lounge, kitchen and garden - with varying trajectories.
Description:
Lounge: The lounge dataset with common household objects.
Lounge_oc: The lounge dataset with object occlusions near the end of the trajectory.
Kitchen: The kitchen dataset with common household objects.
Kitchen_oc: The kitchen dataset with object occlusions near the end of the trajectory.
Garden: The garden dataset with common household objects.
Garden_oc: The garden dataset with object occlusions near the end of the trajectory.
convert.py: Python script to convert a video file into JPGs.
Paper: The datasets were used for the paper "SymbioLCD: Ensemble-Based Loop Closure Detection using CNN-Extracted Objects and Visual Bag-of-Words", accepted at the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems.
Abstract: Loop closure detection is an essential tool of Simultaneous Localization and Mapping (SLAM) to minimize drift in its localization. Many state-of-the-art loop closure detection (LCD) algorithms use visual Bag-of-Words (vBoW), which is robust against partial occlusions in a scene but cannot perceive the semantics or spatial relationships between feature points. CNN object extraction can address those issues by providing semantic labels and spatial relationships between objects in a scene. Previous work has mainly focused on replacing vBoW with CNN-derived features. In this paper we propose SymbioLCD, a novel ensemble-based LCD that utilizes both CNN-extracted objects and vBoW features for LCD candidate prediction. When used in tandem, the added elements of object semantics and spatial awareness create a more robust and symbiotic loop closure detection system. The proposed SymbioLCD uses scale-invariant spatial and semantic matching, Hausdorff distance with temporal constraints, and a Random Forest that utilizes combined information from both CNN-extracted objects and vBoW features for predicting accurate loop closure candidates. Evaluation of the proposed method shows it outperforms other Machine Learning (ML) algorithms - such as SVM, Decision Tree and Neural Network - and demonstrates that there is a strong symbiosis between CNN-extracted object information and vBoW features which assists accurate LCD candidate prediction. Furthermore, it is able to perceive loop closure candidates earlier than state-of-the-art SLAM algorithms, utilizing the added spatial and semantic information from CNN-extracted objects.
Citation: Please use the BibTeX below for citing the paper:
@inproceedings{kim2021symbiolcd,
  title = {SymbioLCD: Ensemble-Based Loop Closure Detection using CNN-Extracted Objects and Visual Bag-of-Words},
  author = {Jonathan Kim and Martin Urschler and Pat Riddle and J\"{o}rg Wicker},
  year = {2021},
  date = {2021-09-27},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems},
  keywords = {},
  pubstate = {forthcoming},
  tppubtype = {inproceedings}
}
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains behavioral events and intracranial electrophysiology recordings from a delayed free recall task with closed-loop stimulation at encoding, using a classifier trained on encoding data. The experiment consists of participants studying a list of words, presented visually one at a time, completing simple arithmetic problems that function as a distractor, and then freely recalling the words from the just-presented list in any order. The data was collected at clinical sites across the country as part of a collaboration with the Computational Memory Lab at the University of Pennsylvania. This dataset is a closed-loop stimulation version of the FR1 and FR2 datasets.
This study contains closed-loop electrical stimulation of the brain during encoding. There is no stimulation during the distractor or retrieval phases. Stimulation is delivered to a single electrode at a time, and the stimulation parameters are included in the behavioral events tsv files, denoting the anode/cathode labels, amplitude, pulse frequency, pulse width, and pulse count.
The L2 logistic regression classifier is trained to predict whether an encoded item will be subsequently recalled based on the neural features during encoding, using data from a participant's FR1 sessions. The bipolar recordings during the 0-1366 ms interval after word presentation are filtered with a Butterworth band stop filter (58-62 Hz, 4th order) to remove 60 Hz line noise, and then a Morlet wavelet transformation (wavenumber = 5) is applied to the signal to estimate spectral power, using 8 log-spaced wavelets between 3-180 Hz (center frequencies 3.0, 5.4, 9.7, 17.4, 31.1, 55.9, 100.3, 180 Hz) and 1365 ms mirrored buffers. The powers are log-transformed prior to removal of the buffer, and then z-transformed based on the within-session mean and standard deviation across all encoding events. These z-transformed log power values represent the feature matrix, and the label vector is the recalled status of the encoded items. The penalty parameter is chosen based on the value that leads to the highest average AUC for all prior participants with at least two FR1 sessions, and is inversely weighted according to the class (i.e., recalled vs. not recalled) imbalance to ensure the best-fit values of the penalty parameter are comparable across different class distributions (recall rates). Class weights are computed as: (1/Na) / ((1/Na + 1/Nb) / 2) where Na and Nb are the number of events in each class.
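For concreteness, a minimal sketch of the class-weight formula above (the event counts are hypothetical, not code from the original pipeline):

def class_weights(n_recalled, n_not_recalled):
    # w_a = (1/Na) / ((1/Na + 1/Nb) / 2), and symmetrically for class b.
    na, nb = n_recalled, n_not_recalled
    mean_inverse = (1 / na + 1 / nb) / 2
    return {"recalled": (1 / na) / mean_inverse, "not_recalled": (1 / nb) / mean_inverse}

# Illustrative counts: 80 recalled vs. 220 not-recalled encoding events.
print(class_weights(80, 220))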
After at least 3 training sessions with a minimum of 15 lists, each participant's classifier is tested using leave-one-session-out (LOSO) cross validation, and the true AUC is compared to a 200-sample AUC distribution generated from classification of label-permuted data. p < 0.05 (one-sided) is used as the significance threshold for continuing to the closed-loop task.
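A minimal sketch of this label-permutation test, assuming scikit-learn and arrays of classifier outputs and recall labels (the variable names are illustrative):

import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_auc_p(scores, labels, n_perm=200, seed=0):
    # One-sided p-value: fraction of label-permuted AUCs at or above the true AUC.
    rng = np.random.default_rng(seed)
    true_auc = roc_auc_score(labels, scores)
    null_aucs = [roc_auc_score(rng.permutation(labels), scores) for _ in range(n_perm)]
    return true_auc, float(np.mean([auc >= true_auc for auc in null_aucs]))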
Each session contains 26 lists (the first being a practice list) and there is no stimulation on the first 4 lists. The classifier output for each presented item on the first 4 lists is compared to the classifier output when tested on data from all previous sessions using a two-sample Kolmogorov-Smirnov test. The null hypothesis that the current session and the training data come from the same distribution must not be rejected (p > 0.05) for the closed-loop task to continue.
The remaining 22 lists are equally divided into stimulation and no stimulation lists, with conditions balanced in each half of the session. On stimulation lists, classifier output is evaluated during the 0-1366 ms interval following word presentation onset. The input values are normalized using the mean and standard deviation across encoding events on all prior no stimulation lists in the session. If the classifier output is below the median classifier output from the training sessions, stimulation occurs immediately following the 1366 ms decoding interval and lasts for 500 ms. With a 750-1000 ms inter-stimulus interval, there is enough time for stimulation artifacts to subside before the next word onset (next classifier decoding).
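A rough sketch of the per-item stimulation decision described above, assuming a scikit-learn-style classifier; the names and array shapes are assumptions, not the lab's control code:

import numpy as np

def should_stimulate(features, no_stim_mean, no_stim_std, clf, train_median):
    # Normalize the word's spectral-power features with statistics from prior
    # no-stimulation lists, then stimulate only if the predicted recall
    # probability falls below the median classifier output from training.
    z = (np.asarray(features) - no_stim_mean) / no_stim_std
    p_recall = clf.predict_proba(z.reshape(1, -1))[0, 1]
    return p_recall < train_median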
For questions or inquiries, please contact sas-kahana-sysadmin@sas.upenn.edu.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Features of the Poses Loop PPRI
This benchmark dataset contains 4,364 real-world Solidity smart contracts, manually labeled with ten types of vulnerabilities.
The address.delegatecall() function allows a smart contract to dynamically load external contracts from address at runtime. If the attacker can control the external contract and affect the current contract status, the contract is vulnerable to DC.
An arithmetic overflow or underflow, often called Integer Overflow or Underflow (IOU), occurs when an arithmetic operation attempts to create a numeric variable value that is larger than the maximum value or smaller than the minimum value of the variable type. If the arithmetic operation may pass a variable type’s maximum or minimum value and is performed without using SafeMath, the contract is vulnerable to IOU.
A function containing a loop has a high risk of exceeding its gas limit and causing an out-of-gas error. If the attacker can control the loop iteration and cause the out-of-gas error, the contract is vulnerable to NC.
A contract vulnerable to RE uses the call() function to transfer ether to an external contract. The external contract can re-enter the vulnerable contract via its fallback function. If the state variable change happens after the call() function, the reentrancy will cause status inconsistency.
The contract uses the timestamp as the deciding factor for critical operations, e.g., sending ether. If the attacker can get ether from the contract by manipulating the timestamp or affecting the critical operations, the contract is vulnerable to TD.
If the contract only uses tx.origin to verify the caller's identification for critical operations, it is vulnerable to TO.
The contract may send out ether differently according to different values of a global state variable or different balance values of the contract. If the attackers can get ether from the contract by manipulating the transaction sequences, the contract is vulnerable to TOD.
The contract uses the function call() or send() without result checking. If the send() or call() function fails and leads to status inconsistency, the contract is vulnerable to UcC.
If an attacker can self-destruct the contract by calling the selfdestruct(address) function, the contract is vulnerable to UpS.
If the contract can receive ether but cannot transfer it by itself, it is vulnerable to FE.
To protect the smart contracts, the dataset is available upon request.
This set of data files is one of the four test data sets acquired by the USDOT Data Capture and Management program. It contains the following data for the six months from May 1, 2011 to October 31, 2011:
- Raw and cleaned data for traffic detectors deployed by the Washington Department of Transportation (WSDOT) along I-5 in Seattle. Data includes 20-second raw reports.
- Incident response records from WSDOT's Washington Incident Tracking System (WITS).
- A record of all messages and travel times posted on WSDOT's Active Traffic Management signs and conventional variable message signs on I-5.
- Loop detector volume and occupancy data from arterials parallel to I-5, estimated travel times on arterials derived from Automatic License Plate Reader (ALPR) data, and arterial signal timing plans.
- Scheduled and actual bus arrival times from King County Metro buses and Sound Transit buses.
- Incidents on I-5 during the six-month period.
- Seattle weather data for the six-month period.
A dataset of GPS breadcrumb data from commercial trucks described in the documentation is not available to the public because of data ownership and privacy issues. This legacy dataset was created before data.transportation.gov and is only currently available via the attached file(s). Please contact the dataset owner if there is a need for users to work with this data using the data.transportation.gov analysis features (online viewing, API, graphing, etc.), and the USDOT will consider modifying the dataset to fully integrate it into data.transportation.gov. Note: All extras are attached in Seattle Freeway Travel Times https://data.transportation.gov/Automobiles/Seattle-Freeway-Travel-Times/9v5g-t8u8
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains a comprehensive dataset to assess cognitive states, workload, situational awareness, stress, and performance in human-in-the-loop process control rooms. The dataset includes objective and subjective measures from various data collection tools such as NASA-TLX, SART, eye tracking, EEG, a health monitoring watch, surveys, and think-aloud situational awareness assessments. It is based on an experimental study of a formaldehyde production plant, drawing on participants' interactions in a controlled control-room experimental setting.
The study compared three different setups of human-system interfaces in four human-in-the-loop (HITL) configurations, incorporating two alarm design formats (prioritised vs. non-prioritised) and three procedural guidance setups (one presenting paper procedures, one offering digitised screen-based procedures, and one providing an AI-based procedural guidance system).
The dataset provides an opportunity for various applications, including:
The dataset is instrumental for researchers, decision-makers, system engineers, human factor engineers, and teams developing guidelines and standards. It is also applicable for validating proposed solutions for the industry and for researchers in similar or close domains.
The concatenated Excel file for the dataset may include the following detailed data:
Demographic and Educational Background Data:
SPAM Metrics:
NASA-TLX Responses:
SART Data:
AI Decision Support System Feedback:
Performance Metrics:
This detailed breakdown provides a comprehensive view of the specific data elements that could be included in the concatenated Excel file, allowing for thorough analysis and exploration of the participants' experiences, cognitive states, workload, and decision-making processes in control room environments.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is designed to facilitate the task of detecting and recognizing characters on a water meter. It includes annotations for individual digits (0-9) as well as any other textual or numeric information present. The goal is to provide comprehensive labels for training object detection models to accurately identify and interpret the numeric readings.
The "Number" class includes any sequences or isolated numeric or alphabetic characters that are not part of the specific digit classes (0-9). These can be alphanumeric codes, labels, or any other text visible on the water meter that is not a single digit.
The digit "0" is represented as a single closed loop character, often circular or oval in shape, found on the water meter display.
The digit "1" typically appears as a single vertical line, occasionally with a small base or top serif, depending on the font style.
The digit "2" usually has a rounded top loop and a descending diagonal stroke ending in a horizontal base.
The digit "3" consists of two rounded loops stacked vertically with their centers aligned.
The digit "4" appears with a vertical line intersected by a diagonal line forming a triangle and a horizontal base.
The digit "5" features a top horizontal line, a curved back, and a flat base, resembling an incomplete circle with a flat top.
The digit "6" includes a closed loop at the bottom with an open top loop, appearing as a partially twisted circle.
The digit "7" has a flat top line connected to a diagonal descending line, often lacking additional embellishments.
The digit "8" consists of two equal-sized closed loops stacked vertically.
The digit "9" appears as a top loop with a straight or slightly curved descending tail, resembling an upside-down "6".
This dataset comprises the mean and variance of the surface velocity field of the Gulf of Mexico, obtained from a large set of historical surface drifter data from the Gulf of Mexico—3770 trajectories spanning 28 years and more than a dozen data sources—which were uniformly processed, quality controlled, and assimilated into a spatially and temporally gridded dataset. A gridded product, called GulfFlow, is created by averaging all available data from the GulfDrifters dataset within quarter-degree spatial bins, and within overlapping month-long temporal bins having a semimonthly spacing. The dataset spans monthly time bins centered on July 16, 1992 through July 1, 2020, for a total of 672 overlapping time slices. Odd-numbered slices correspond to calendar months, while even-numbered slices run from halfway through one month to halfway through the following month. A higher-spatial-resolution version, GulfFlow-1/12 degree, is created in the identical way but using 1/12-degree bins instead of quarter-degree bins.
In addition to the average velocities within each 3D bin, the count of sources contributing to each bin is also distributed, as is the subgridscale velocity variance. The count variable is a four-dimensional array of integers, the fourth dimension of which has length 45. This variable gives the number of hourly observations from each source dataset contributing to each three-dimensional bin. Values 1–15 are the count of velocity observations from drifters from each of the 15 experiments that are flagged as having retained their drogues, values 16–30 are for observations from drifters that are flagged as having lost their drogues, and values 31–45 are for observations from drifters of unknown drogue status.
In defining averaged quantities, we represent the velocity as a vector, \(\mathbf{u} = [u\ v]^T\), where the superscript \(T\) denotes the transpose. Let an overbar, \(\overline{\mathbf{u}}\), denote an average over a spatial bin and over all times, while angled brackets, \(\langle \mathbf{u} \rangle\), denote an average over a spatial bin and a particular temporal bin. Thus, \(\langle \mathbf{u} \rangle\) is a function of time while \(\overline{\mathbf{u}}\) is not. We refer to \(\langle \mathbf{u} \rangle\) as the local average, \(\overline{\mathbf{u}}\) as the global average, and \(\overline{\langle \mathbf{u} \rangle}\) as the double average. Given the inhomogeneity of the drifter data, it turns out that the global average is biased towards intensive but short-duration programs, hence the double average results in a much better representation of the true mean velocity field. The dataset includes the global average \(\overline{\langle \mathbf{u} \rangle}\), the local covariance defined as
\(\boldsymbol{\varepsilon} = \langle (\mathbf{u} - \langle \mathbf{u} \rangle)(\mathbf{u} - \langle \mathbf{u} \rangle)^T \rangle\)
and \(\epsilon^2\), which is the trace of \(\overline{\boldsymbol{\varepsilon}}\):
\(\epsilon^2 = \mathrm{tr}\{\overline{\boldsymbol{\varepsilon}}\}\)
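For intuition, a small numpy sketch of the difference between the global average and the double average in one spatial bin (the arrays below are illustrative, not the GulfFlow processing code):

import numpy as np

# Hypothetical hourly velocity samples in one spatial bin, grouped by temporal bin;
# the second bin stands in for an intensive but short-duration drifter program.
samples_by_time_bin = [
    np.array([[0.1, 0.0], [0.2, 0.1]]),
    np.array([[0.5, 0.3]] * 50),
]

# Global average: pool every sample, so heavily sampled periods dominate.
global_avg = np.concatenate(samples_by_time_bin).mean(axis=0)

# Double average: first form the local average <u> in each temporal bin,
# then average those local averages, weighting each temporal bin equally.
local_avgs = np.stack([s.mean(axis=0) for s in samples_by_time_bin])
double_avg = local_avgs.mean(axis=0)

print(global_avg, double_avg)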
The data is distributed in two separate netCDF files, one for each grid resolution.
The article describing this dataset:
Lilly, J. M. and P. Pérez-Brunius (2021). A gridded surface current product for the Gulf of Mexico from consolidated drifter measurements. Earth System Science Data, 13: 645–669. https://doi.org/10.5194/essd-13-645-2021.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The purpose of data mining analysis is always to find patterns in the data using certain kinds of techniques, such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset; before doing any work on the data, the data has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, after using clustering prior to classification, the performance did not improve much. The reason it did not improve could be that the features we selected to perform clustering on are not well suited for it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics.
From the dimensionality reduction perspective: this approach is different from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters as a technique for reducing the data dimension can lose a lot of information, since clustering techniques are based on a metric of 'distance', and at high dimensions Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information.
From the creating-new-features perspective: clustering analysis creates labels based on the patterns of the data, and it brings uncertainties into the data. When using clustering prior to classification, the choice of the number of clusters will strongly affect the performance of the clustering, and in turn affect the performance of classification. If the subset of features we apply clustering techniques to is well suited for it, it might increase the overall classification performance. For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better.
We did not lock in the clustering outputs using a random_state, in an effort to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, maybe the data just does not cluster well with the methods selected at all. Basically, the ramification we saw was that our results were not much better than random when applying clustering in the data preprocessing.
Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format from which the models were created. This feedback loop can be used to measure the model's real-world effectiveness and also to continue to revise the models from time to time as things change.
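A minimal scikit-learn sketch of the pipeline discussed above, in which cluster assignments are appended as an extra feature before classification (the synthetic data and parameter choices are illustrative only):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the project's dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cluster the training features and append the cluster label as a new feature.
# No random_state is set for KMeans, mirroring the stability check described above.
kmeans = KMeans(n_clusters=8, n_init=10).fit(X_train)
X_train_aug = np.column_stack([X_train, kmeans.predict(X_train)])
X_test_aug = np.column_stack([X_test, kmeans.predict(X_test)])

clf = RandomForestClassifier().fit(X_train_aug, y_train)
print("with cluster feature:", clf.score(X_test_aug, y_test))

baseline = RandomForestClassifier().fit(X_train, y_train)
print("baseline:", baseline.score(X_test, y_test))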
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
0: The digit 0
1: The digit 1
2: The digit 2
3: The digit 3
4: The digit 4
5: The digit 5
6: The digit 6
7: The digit 7
8: The digit 8
9: The digit 9
This dataset contains images of water meter readings with the purpose of digitizing the numeric values. There are 10 classes representing the digits 0 through 9. Annotators will label the digits as they appear on the meters to facilitate accurate recognition.
The digit "0" is characterized by its oval or circular shape, often with a distinctive horizontal thickness.
The digit "1" typically appears as a straight vertical line, sometimes with a short horizontal base.
The digit "2" has a curved top and straight middle section, finishing with a horizontal or diagonal stroke at the base.
The digit "3" is identified by two stacked curved sections without intersecting lines.
The digit "4" often features intersecting horizontal and vertical lines with a triangle-like top section.
The digit "5" combines a prominent upper loop with a lower horizontal stroke and a straight vertical line.
The digit "6" features a closed top loop with an extended lower curve that continues downward.
The digit "7" is characterized by a horizontal top line connecting to a diagonal downward stroke.
The digit "8" resembles two stacked circles or loops, one above the other.
The digit "9" starts with a circular or elliptical loop at the top, leading into a straight downward stroke.
Problem Statement
Investors and buyers in the real estate market faced challenges in accurately assessing property values and market trends. Traditional valuation methods were time-consuming and lacked precision, making it difficult to make informed investment decisions. A real estate firm sought a predictive analytics solution to provide accurate property price forecasts and market insights.
Challenge
Developing a real estate price prediction system involved addressing the following challenges:
Collecting and processing vast amounts of data, including historical property prices, economic indicators, and location-specific factors.
Accounting for diverse variables such as neighborhood quality, proximity to amenities, and market demand.
Ensuring the model’s adaptability to changing market conditions and economic fluctuations.
Solution Provided
A real estate price prediction system was developed using machine learning regression models and big data analytics. The solution was designed to:
Analyze historical and real-time data to predict property prices accurately.
Provide actionable insights on market trends, enabling better investment strategies.
Identify undervalued properties and potential growth areas for investors.
Development Steps
Data Collection
Collected extensive datasets, including property listings, sales records, demographic data, and economic indicators.
Preprocessing
Cleaned and structured data, removing inconsistencies and normalizing variables such as location, property type, and size.
Model Development
Built regression models using techniques such as linear regression, decision trees, and gradient boosting to predict property prices. Integrated feature engineering to account for location-specific factors, amenities, and market trends.
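As an illustration of this step, a minimal scikit-learn sketch of a gradient-boosting price regressor with simple categorical encoding (the column names and listing data are hypothetical, not the firm's system):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical listings with a few location and size features.
listings = pd.DataFrame({
    "neighborhood": ["A", "B", "A", "C", "B", "C"],
    "sqft": [900, 1400, 1100, 2000, 1250, 1750],
    "beds": [2, 3, 2, 4, 3, 3],
    "dist_to_transit_km": [0.5, 2.0, 1.1, 3.5, 0.8, 2.7],
    "price": [250000, 340000, 280000, 450000, 330000, 410000],
})

model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"])],
        remainder="passthrough",
    )),
    ("gbr", GradientBoostingRegressor()),
])

# Cross-validation here mirrors the validation step described below.
features, target = listings.drop(columns="price"), listings["price"]
print(cross_val_score(model, features, target, cv=2, scoring="r2"))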
Validation
Tested the models using historical data and cross-validation to ensure high prediction accuracy and robustness.
Deployment
Implemented the prediction system as a web-based platform, allowing users to input property details and receive price estimates and market insights.
Continuous Monitoring & Improvement
Established a feedback loop to update models with new data and refine predictions as market conditions evolved.
Results
Increased Prediction Accuracy
The system delivered highly accurate property price forecasts, improving investor confidence and decision-making.
Informed Investment Decisions
Investors and buyers gained valuable insights into market trends and property values, enabling better strategies and reduced risks.
Enhanced Market Insights
The platform provided detailed analytics on neighborhood trends, demand patterns, and growth potential, helping users identify opportunities.
Scalable Solution
The system scaled seamlessly to include new locations, property types, and market dynamics.
Improved User Experience
The intuitive platform design made it easy for users to access predictions and insights, boosting engagement and satisfaction.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The Giotto Radio Science Experiment data set consists of four tables. Each table contains a measurement value listed as a function of time. The measurements are: closed-loop receiver carrier signal amplitude, closed-loop receiver carrier frequency residual, open-loop receiver carrier signal amplitude, and open-loop receiver carrier frequency.
The Digital Geologic-GIS Map of The Loop and Druid Arch Quadrangles, Utah is composed of GIS data layers and GIS tables, and is available in the following GRI-supported GIS data formats: 1.) an ESRI file geodatabase (thdr_geology.gdb), 2.) an Open Geospatial Consortium (OGC) geopackage, and 3.) a 2.2 KMZ/KML file for use in Google Earth; however, this format version of the map is limited in data layers presented and in access to GRI ancillary table information. The file geodatabase format is supported with 1.) an ArcGIS Pro map file (.mapx) (thdr_geology.mapx) and individual Pro layer (.lyrx) files (for each GIS data layer). The OGC geopackage is supported with a QGIS project (.qgz) file. Upon request, the GIS data is also available in ESRI shapefile format. Contact Stephanie O'Meara (see contact information below) to acquire the GIS data in these GIS data formats. In addition to the GIS data and supporting GIS files, three additional files comprise a GRI digital geologic-GIS dataset or map: 1.) a readme file (cany_geology_gis_readme.pdf), 2.) the GRI ancillary map information document (.pdf) file (cany_geology.pdf), which contains geologic unit descriptions as well as other ancillary map information and graphics from the source map(s) used by the GRI in the production of the GRI digital geologic-GIS data for the park, and 3.) a user-friendly FAQ PDF version of the metadata (thdr_geology_metadata_faq.pdf). Please read the cany_geology_gis_readme.pdf for information pertaining to the proper extraction of the GIS data and other map files. Google Earth software is available for free at: https://www.google.com/earth/versions/. QGIS software is available for free at: https://www.qgis.org/en/site/. Users are encouraged to only use the Google Earth data for basic visualization, and to use the GIS data for any type of data analysis or investigation. The data were completed as a component of the Geologic Resources Inventory (GRI) program, a National Park Service (NPS) Inventory and Monitoring (I&M) Division funded program that is administered by the NPS Geologic Resources Division (GRD). For a complete listing of GRI products visit the GRI publications webpage: https://www.nps.gov/subjects/geology/geologic-resources-inventory-products.htm. For more information about the Geologic Resources Inventory Program visit the GRI webpage: https://www.nps.gov/subjects/geology/gri.htm. At the bottom of that webpage is a "Contact Us" link if you need additional information. You may also directly contact the program coordinator, Jason Kenworthy (jason_kenworthy@nps.gov). Source geologic maps and data used to complete this GRI digital dataset were provided by the following: U.S. Geological Survey. Detailed information concerning the sources used and their contribution to the GRI product are listed in the Source Citation section(s) of this metadata record (thdr_geology_metadata.txt or thdr_geology_metadata_faq.pdf). Users of this data are cautioned about the locational accuracy of features within this dataset. Based on the source map scale of 1:24,000 and United States National Map Accuracy Standards, features are within (horizontally) 12.2 meters or 40 feet of their actual location as presented by this dataset. Users of this data should thus not assume the location of features is exactly where they are portrayed in Google Earth, ArcGIS Pro, QGIS or other software used to display this dataset. All GIS and ancillary tables were produced as per the NPS GRI Geology-GIS Geodatabase Data Model v. 2.3 (available at: https://www.nps.gov/articles/gri-geodatabase-model.htm).
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains surface current data collected from UGOS MASTR Drifters (Far Horizon Drifters) deployed in the Gulf. They were specifically positioned to capture dynamic features such as the Loop Current, Dry Tortugas Eddy, Florida Current, Gulf Stream, and Eddy Denali. The dataset includes quality-controlled, hourly measurements intended for model assimilation and validation. Data covers the temporal range from January to July 2024.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the accompanying dataset that was generated by the GitHub project: https://github.com/tonyreina/tdc-tcr-epitope-antibody-binding. In that repository I show how to create machine learning models for predicting whether a T-cell receptor (TCR) and a protein epitope will bind to each other.
A model that can predict how well a TCR binds to an epitope can lead to more effective treatments that use immunotherapy. For example, in anti-cancer therapies it is important for the T-cell receptor to bind to the protein marker in the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell.
HuggingFace provides a "one-stop shop" to train and deploy AI models. In this case, we use Facebook's open-source Evolutionary Scale Model (ESM-2). These embeddings turn the protein sequences into a vector of numbers that the computer can use in a mathematical model.
To load them into Python use the Pandas library:
import pandas as pd
train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")
test_data = pd.read_pickle("test_data.pkl")
The epitope_aa and the tcr_full columns are the protein (peptide) sequences for the epitope and the T-cell receptor, respectively. The letters correspond to the standard amino acid codes.
The epitope_smi column is the SMILES notation for the chemical structure of the epitope. We won't use this information. Instead, the ESM-1b embedder should be sufficient for the input to our binary classification model.
The tcr column is the hypervariable CDR3 loop. It's the part of the TCR that actually binds (assuming it binds) to the epitope.
The label column is whether the two proteins bind. 0 = No. 1 = Yes.
The tcr_vector and epitope_vector columns are the bio-embeddings of the TCR and epitope sequences generated by the Facebook ESM-1b model. These two vectors can be used to create a machine learning model that predicts whether the combination will produce a successful protein binding.
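For example, a minimal scikit-learn sketch of such a model, assuming the vector columns hold fixed-length numeric embeddings (the preprocessing and model choice here are illustrative, not the repository's exact pipeline):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

train_data = pd.read_pickle("train_data.pkl")
validation_data = pd.read_pickle("validation_data.pkl")

def to_features(df):
    # Concatenate the TCR and epitope embeddings into one feature vector per pair.
    return np.hstack([np.stack(df["tcr_vector"].to_numpy()),
                      np.stack(df["epitope_vector"].to_numpy())])

clf = LogisticRegression(max_iter=1000)
clf.fit(to_features(train_data), train_data["label"])
print("validation accuracy:", clf.score(to_features(validation_data), validation_data["label"]))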
From the TDC website:
T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.
Weber et al.
Dataset Description: The dataset is from Weber et al., who assembled a large and diverse dataset from the VDJ database and the ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with fewer than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed representing the peptides as SMILES strings (which reformulates the problem as protein-ligand binding prediction), the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e., associating TCR sequences with epitopes they have not been shown to bind.
Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.
Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.
References:
Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244.
Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062.
Dines, Jennifer N., et al. “The ImmuneRACE study: A prospective multicohort study of immune response action to COVID-19 events with the ImmuneCODE™ open access database.” medRxiv (2020).
Dataset License: CC BY 4.0.
Contributed by: Anna Weber and Jannis Born.
The Facebook ESM-2 model has the MIT license and was published in:
HuggingFace has several versions of the trained model.
Checkpoint name Number of layers Number of parameters
esm2_t48_15B_UR50D 48 15B
esm2_t36_3B_UR50D 36 3B
esm2_t33_650M_UR50D 33 650M
esm2_t30_150M_UR50D 30 150M
esm2_t12_35M_UR50D 12 35M
esm2_t6_8M_UR50D 6 8M
https://doi.org/10.5061/dryad.dr7sqvb5b
To gain genome-scale insight into the sly1∆loop allele’s loss of function, we used synthetic genetic array (SGA) analysis. SGA measures the synthetic sickness or rescue (suppression) of a query allele versus a genome-scale collection of loss-of-function alleles (Tong and Boone, 2005). The sly1∆loop allele was knocked into the genomic SLY1 locus. The SGA data from this analysis were then aligned with the BioGRID dataset.
sly1∆loop SGA contains the raw SGA dataset. The SGA score algorithm processes raw colony size data, normalizes them for a series of experimental systematic effects and calculates a quantitative genetic interaction score.
LogRatios indicates the log-transformed ratio of the growth of the indicated double mutant to the growth of the single mutant with the indicated quer...
Problem Statement
A pharmaceutical manufacturer faced significant challenges in ensuring consistent quality during the production of medications. Manual quality control processes were prone to errors and inefficiencies, leading to product recalls and compliance risks. The company needed an advanced solution to automate quality control, reduce production errors, and comply with stringent regulatory standards.
Challenge
Implementing automated quality control in pharmaceutical manufacturing posed several challenges:
Detecting microscopic defects, contamination, or irregularities in products and packaging.
Ensuring high-speed inspection without disrupting production workflows.
Meeting strict industry regulations for product quality and traceability.
Solution Provided
An AI-powered quality control system was developed using machine vision and advanced inspection algorithms. The solution was designed to:
Automatically inspect pharmaceutical products for defects, contamination, and compliance with production standards.
Analyze packaging integrity to detect labeling errors, seal defects, or missing components.
Provide real-time quality control insights to production teams for immediate corrective actions.
Development Steps
Data Collection
Captured high-resolution images and videos of pharmaceutical products during production, including tablets, capsules, and packaging components.
Preprocessing
Preprocessed visual data to enhance features such as shape, texture, and color, enabling accurate defect detection.
Model Training
Developed machine vision models to detect defects and anomalies at microscopic levels. Integrated AI algorithms to classify defects and provide actionable insights for process improvement.
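As a sketch of what such a model could look like, a small PyTorch example that trains an image classifier over hypothetical defect classes (the classes, data, and backbone choice are illustrative, not the deployed system):

import torch
import torch.nn as nn
from torchvision import models

# Hypothetical defect classes for tablet, capsule, and packaging inspection.
classes = ["ok", "chipped", "discolored", "seal_defect", "label_error"]

# A standard image backbone repurposed as a defect classifier; weights=None
# avoids downloading pretrained weights in this sketch.
model = models.resnet18(weights=None, num_classes=len(classes))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step on a random batch standing in for real images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(len(classes), (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))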
Validation
Tested the system on a variety of production scenarios to ensure high accuracy and reliability in defect detection.
Deployment
Installed AI-powered inspection systems on production lines, integrating them with existing manufacturing processes and quality control frameworks.
Continuous Monitoring & Improvement
Established a feedback loop to refine models based on new production data and evolving quality standards.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pulmonary function tests (PFTs) are usually interpreted by clinicians using rule-based strategies and pattern recognition. The interpretation, however, has variability due to patient and interpreter errors. Most PFTs have recognizable patterns that can be categorized into specific physiological defects. In this study, we developed a computerized algorithm using the Python package pdfplumber and validated it against clinicians’ interpretation. We downloaded PFT reports from the electronic medical record system that were in PDF format. We digitized the flow volume loop (FVL) and extracted numeric values from the reports. The algorithm used FEV1/FVC
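A minimal pdfplumber sketch of the kind of numeric extraction described above (the report file name, layout, and regular expression are assumptions for illustration):

import re
import pdfplumber

# Hypothetical PFT report; real reports vary by EMR vendor and layout.
with pdfplumber.open("pft_report.pdf") as pdf:
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# Illustrative pattern for a line such as "FEV1/FVC  72" in the extracted text.
match = re.search(r"FEV1/FVC\s+(\d+(?:\.\d+)?)", text)
if match:
    print("FEV1/FVC:", float(match.group(1)))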
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
HystLab (Hysteresis Loop analysis box) is MATLAB-based software for the advanced processing and analysis of magnetic hysteresis data. Hysteresis loops are one of the most ubiquitous rock magnetic measurements, and with the growing need for high-resolution analyses of ever larger datasets, there is a need to rapidly, consistently, and accurately process and analyze these data. HystLab is an easy-to-use graphical interface that is compatible with a wide range of software platforms. The software can read a wide range of data formats and rapidly process the data. It includes functionality to re-center loops, correct for drift, and perform a range of slope saturation corrections.