MIT License (https://opensource.org/licenses/MIT). License information was derived automatically.
Dataset Card for "PAQ_pairs"
Dataset Summary
Pairs of questions and answers obtained from Wikipedia. Disclaimer: The team releasing PAQ QA pairs did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset contains… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/PAQ_pairs.
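As a rough illustration (not taken from the card), the pairs can be pulled directly from the Hub with the datasets library; the split name and the exact field layout are assumptions here, so the sketch just prints one example to reveal the real schema.

```python
# Minimal sketch: load the PAQ_pairs dataset from the Hugging Face Hub.
# The "train" split name is an assumption; inspect one example to see the schema.
from datasets import load_dataset

ds = load_dataset("embedding-data/PAQ_pairs", split="train")
print(ds[0])
```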
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
This repository hosts data and code presented in the article "Parsimonious machine learning for the global mapping of aboveground biomass potential". The repository contains a compressed archive with all the code needed to reproduce the methodology that we developed and to analyse its results. We did not upload all the temporary and intermediate data files that are created during the execution of the method; instead, we uploaded "milestone" data, i.e. final results or important intermediate ones. This includes the final training dataset, model calibration data, the final trained model, the global data for prediction, the final global map of potential aboveground biomass density (AGBD) under present-day conditions (raster files at 1 km² and 10 km² resolution), maps depicting regions where climatic conditions are outside of the training range of positive AGBD instances, and maps depicting world regions without trees.
Files:
code.zip : Compressed directory with all the code needed to reproduce the methodology presented in the manuscript. Contains a README file. Also contains temporary data generated in the process, the training dataset, the trained model, and model calibration data.
potential_AGBD_Mgha_1km_present_climate_1980_2010.tif : the predicted global potential AGBD under contemporary climate conditions at a resolution of 1 km².
potential_AGBD_Mgha_10km_present_climate_1980_2010.tif : the predicted global potential AGBD under contemporary climate conditions downsampled to a resolution of 10 km².
potential_AGBD_Mgha_10km_model_difference.tif : the difference between our prediction of potential AGBD and the prediction from a complex state-of-the-art model from Walker et al. (2022).
potential_AGB_Mg_1km_present_climate_1980_2010.tif : the predicted global potential pixel-level AGB under contemporary climate conditions downsampled to a resolution of 1 km².
number_predictors_out_of_range.zip : tiled maps representing the number of climatic predictors outside of the training range before including 0 AGBD instances in the training dataset.
tree_absence_map.zip : tiled maps representing world regions without trees. Based on Crowther et al. (2015) (https://elischolar.library.yale.edu/yale_fes_data/1/).
inference_pipeline_potential_agbd_Mgha_climate.pkl : Calibrated model for the prediction of potential AGBD given bioclimatic conditions.
predictors_data_global.zip : Global predictors data to apply the model on.
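As an illustrative sketch (not part of the repository itself), the released rasters and the calibrated model can be opened in Python. The rasterio dependency and the assumption that the .pkl file holds a scikit-learn-style pipeline with a predict() method are ours, not statements from the authors.

```python
# Minimal sketch, assuming rasterio is installed and the pickle contains a
# scikit-learn-style pipeline.
import pickle
import rasterio

# Read the 10 km potential AGBD map (band 1) as a NumPy array.
with rasterio.open("potential_AGBD_Mgha_10km_present_climate_1980_2010.tif") as src:
    agbd = src.read(1)
    print(agbd.shape, src.crs)

# Load the calibrated inference pipeline.
with open("inference_pipeline_potential_agbd_Mgha_climate.pkl", "rb") as f:
    model = pickle.load(f)
# predictions = model.predict(bioclimatic_features)  # features from predictors_data_global.zip
```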
MIT License (https://opensource.org/licenses/MIT). License information was derived automatically.
Dataset Card for "QQP_triplets"
Dataset Summary
This dataset will give anyone the opportunity to train and test models of semantic equivalence, based on actual Quora data. The data is organized as triplets (anchor, positive, negative). Disclaimer: The team releasing Quora data did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/QQP_triplets.
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
This upload contains slices 1 – 1,000 from the data collection described in
Maximilian B. Kiss, Sophia B. Coban, K. Joost Batenburg, Tristan van Leeuwen, and Felix Lucka, "2DeteCT - A large 2D expandable, trainable, experimental Computed Tomography dataset for machine learning", Sci Data 10, 576 (2023), or arXiv:2306.05907 (2023).
Abstract:
"Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline."
The data collection has been acquired using a highly flexible, programmable and custom-built X-ray CT scanner, the FleX-ray scanner, developed by TESCAN-XRE NV, located in the FleX-ray Lab at the Centrum Wiskunde & Informatica (CWI) in Amsterdam, Netherlands. It consists of a cone-beam microfocus X-ray point source (limited to 90 kV and 90 W) that projects polychromatic X-rays onto a 14-bit CMOS (complementary metal-oxide semiconductor) flat panel detector with a CsI(Tl) scintillator (Dexella 1512NDT) and 1536-by-1944 pixels, 74.8 µm² each. To create a 2D dataset, a fan-beam geometry was mimicked by only reading out the central row of the detector. Between source and detector there is a rotation stage, upon which samples can be mounted. The machine components (i.e., the source, the detector panel, and the rotation stage) are mounted on translation belts that allow the components to be moved independently of one another.
Please refer to the paper for all further technical details.
The complete dataset can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The reference reconstructions and segmentations can be found via the following links: 1-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, OOD.
The corresponding Python scripts for loading, pre-processing, reconstructing and segmenting the projection data in the way described in the paper can be found on GitHub. A machine-readable file with the scanning parameters and instrument data used for each acquisition mode, as well as a script for loading it, can be found in the GitHub repository as well.
Note: It is advisable to use a graphical user interface when decompressing the .zip archives. If you experience a zipbomb error when unzipping a file on a Linux system, rerun the command with the UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE environment variable set, e.g. by adding "export UNZIP_DISABLE_ZIPBOMB_DETECTION=TRUE" to your .bashrc.
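As a hedged alternative not mentioned in the original description, the archives can also be extracted with Python's standard zipfile module, which does not apply the same zipbomb heuristic; the archive name below is a placeholder, not an actual file name from the release.

```python
# Minimal sketch: extract a downloaded 2DeteCT archive with the standard library.
# "2DeteCT_slices_1-1000.zip" is a hypothetical name; use the file you downloaded.
import zipfile

archive = "2DeteCT_slices_1-1000.zip"
with zipfile.ZipFile(archive) as zf:
    zf.extractall("2DeteCT_slices")  # target directory of your choice
```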
For more information or guidance in using the data collection, please get in touch with
Maximilian.Kiss [at] cwi.nl
Felix.Lucka [at] cwi.nl
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
Selected MRI datasets for training, validation, and testing.
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
Context and Aim
Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.
We found that the federal forest inventory of Lower Saxony, Germany, represents an untapped treasure of annotated samples for training data generation. The respective 20-cm Color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.
Description
The data archive is highly suitable for benchmarking, as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples, which are supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.
The TreeSatAI Benchmark Archive contains:
50,381 image triplets (aerial, Sentinel-1, Sentinel-2)
synchronized time steps and locations
all original spectral bands/polarizations from the sensors
20 species classes (single labels)
12 age classes (single labels)
15 genus classes (multi labels)
60 m and 200 m patches
fixed split for train (90%) and test (10%) data
additional single labels such as English species name, genus, forest stand type, foliage type, land cover
The geoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and publications in the reference section.
Version history
v1.0.0 - First release
Citation
Ahlswede et al. (in prep.)
GitHub
Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.
Folder structure
We refer to the proposed folder structure in the PDF file.
Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.
Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.
Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. An example entry for an image sample with proportions of roughly 94% Abies and 6% Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]]
The two files “test_filenames.lst” and “train_filenames.lst” define the filenames used for the train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
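As an illustrative sketch (not part of the archive's documentation), the multi-label JSON and the fixed split lists can be read in Python. The label file name and the file locations below are assumptions; use the actual paths shipped in the archive.

```python
# Minimal sketch, assuming the multi-label JSON sits in the "labels" folder.
# "TreeSatAI_labels.json" is a hypothetical file name.
import json

with open("labels/TreeSatAI_labels.json") as f:
    labels = json.load(f)
print(labels["Abies_alba_3_834_WEFL_NLF.tif"])  # e.g. [["Abies", 0.93771], ["Larix", 0.06229]]

with open("train_filenames.lst") as f:
    train_files = [line.strip() for line in f]
with open("test_filenames.lst") as f:
    test_files = [line.strip() for line in f]
print(len(train_files), len(test_files))  # fixed 90/10 split
```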
CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).
Join the archive
Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs or aerial imagery from different time steps are very welcome. This helps the research community develop better deep learning and machine learning models for forest applications. Do you have questions or want to share code/results/publications using the archive? Feel free to contact the authors.
Project description
This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TU Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).
Publications
Ahlswede et al. (2022, in prep.): TreeSatAI Dataset Publication
Ahlswede S., Nimisha, T.M., and Demir, B. (2022, in revision): Embedded Self-Enhancement Maps for Weakly Supervised Tree Species Mapping in Remote Sensing Images. IEEE Trans Geosci Remote Sens
Schulz et al. (2022, in prep.): Phenoprofiling
Conference contributions
S. Ahlswede, N. T. Madam, C. Schulz, B. Kleinschmit and B. Demir, "Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods", IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, “Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series”, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova and B. Kleinschmit, “The temporal fingerprints of common European forest types from SAR and optical remote sensing data”, AGU Fall Meeting, New Orleans, USA, 2021.
B. Kleinschmit, M. Förster, C. Schulz, F. Arias, B. Demir, S. Ahlswede, A. K. Aksoy, T. Ha Minh, J. Hees, C. Gava, P. Helber, B. Bischke, P. Habelitz, A. Frick, R. Klinke, S. Gey, D. Seidel, S. Przywarra, R. Zondag and B. Odermatt, “Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests”, Living Planet Symposium, Bonn, Germany, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, (2022, submitted): “Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series”, ForestSAT, Berlin, Germany, 2022.
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
This dataset consists of synthetically spoken captions for the STAIR dataset. Following the same methodology as Chrupała et al. (see article | dataset | code) we generated speech for each caption of the STAIR dataset using Google's Text-to-Speech API.
This dataset was used for visually grounded speech experiments (see article accepted at ICASSP2019).
@INPROCEEDINGS{8683069,
  author={W. N. {Havard} and J. {Chevrot} and L. {Besacier}},
  booktitle={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Models of Visually Grounded Speech Signal Pay Attention to Nouns: A Bilingual Experiment on English and Japanese},
  year={2019},
  pages={8618-8622},
  keywords={information retrieval;natural language processing;neural nets;speech processing;word processing;artificial neural attention;human attention;monolingual models;part-of-speech tags;nouns;neural models;visually grounded speech signal;English language;Japanese language;word endings;cross-lingual speech-to-speech retrieval;grounded language learning;attention mechanism;cross-lingual speech retrieval;recurrent neural networks},
  doi={10.1109/ICASSP.2019.8683069},
  ISSN={2379-190X},
  month={May},
}
The dataset comprises the following files :
mp3-stair.tar.gz : MP3 files of each caption in the STAIR dataset. Filenames have the following pattern imageID_captionID, where both imageID and captionID correspond to those provided in the original dataset (see annotation format here)
dataset.mfcc.npy : NumPy array with MFCC vectors for each caption. MFCCs were extracted using python_speech_features with the default configuration. To know which caption each set of MFCC vectors belongs to, you can use the files dataset.words.txt and dataset.ids.txt (see the sketch after this file list).
dataset.words.txt : Captions corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)
dataset.ids.txt : IDs of the captions (imageID_captionID) corresponding to each MFCC vector (line number = position in Numpy array, starting from 0)
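As a rough illustration (not part of the original description), the MFCC array and its companion text files can be paired up in Python; the allow_pickle flag is an assumption about how the per-caption matrices were saved.

```python
# Minimal sketch: load the MFCC vectors and map them back to caption IDs/text.
# allow_pickle=True is an assumption (variable-length MFCC matrices are likely
# stored as an object array).
import numpy as np

mfcc = np.load("dataset.mfcc.npy", allow_pickle=True)
with open("dataset.ids.txt") as f:
    ids = [line.strip() for line in f]
with open("dataset.words.txt") as f:
    words = [line.strip() for line in f]

print(ids[0], words[0], getattr(mfcc[0], "shape", None))
```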
Splits
test
test.txt : captions comprising the test split
test_ids.txt: IDs of the captions in the test split
test_tagged.txt : tagged version of the test split
test-alignments.json.zip : Forced alignments of all the captions in the test split. (dictionary where the key corresponds to the caption ID in the STAIR dataset). Due to an unknown error during upload, the JSON file had to be zipped...
train
train.txt : captions comprising the train split
train_ids.txt : IDs of the captions in the train split
train_tagged.txt : tagged version of the train split
val
val.txt : captions comprising the val split
val_ids.txt : IDs of the captions in the val split
val_tagged.txt : tagged version of the val split
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
New dataset link: https://www.kaggle.com/datasets/tashiee/malebin-2-0-rgb-malware-binary-images
**Important Notice (PLEASE READ):** A more comprehensive dataset has been developed, featuring improved preprocessing steps and yielding more accurate classification results. This is because the current model, which was trained on this dataset, performs poorly on current malware variants, and there are issues with resizing that lead to distorted images.
Due to current time constraints, I am unable to upload the new datasets and accompanying notebooks along with detailed documentation. If you require access to the updated resources, please feel free to contact me at tashvin.raj56@gmail.com — I will be happy to share them personally or update the dataset as soon as possible.
Additionally, while the Malimg dataset performs reliably within a closed-set environment, it should be noted that its malware samples are outdated. As a result, it may not generalize well to modern, real-world malware threats.
I would therefore advise against using this dataset for model training; please contact me during office hours instead. Thanks.
The dataset is built from two sources:
1. Malimg Dataset by Nataraj et al. (2011)
2. A portion of samples from https://www.kaggle.com/datasets/walt30/malware-images. Full credits to: https://www.kaggle.com/walt30.
The first dataset, the Malimg dataset, is widely recognized in the field of malware detection and consists of malware images generated by transforming binaries into grayscale images based on byte-to-pixel mapping. For the second sample, the malicious files were downloaded from MalwareBazaar, and as stated by the author, the malware images were visualized following the approach presented by Nataraj et al.
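For readers unfamiliar with the byte-to-pixel visualization of Nataraj et al., a hedged sketch of the idea in Python is shown below; the fixed width of 256 and the final resize are illustrative choices, not the exact parameters used to build this dataset.

```python
# Minimal sketch of byte-to-pixel malware visualization (after Nataraj et al.).
# Width 256 and the 256x256 output size are illustrative assumptions.
import numpy as np
from PIL import Image

def binary_to_grayscale(path, width=256):
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    height = len(data) // width
    img = data[: height * width].reshape(height, width)  # one byte -> one pixel
    return Image.fromarray(img, mode="L")

# binary_to_grayscale("sample.bin").resize((256, 256)).save("sample.png")
```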
The aims of this dataset are:
1. To balance the number of samples across each family.
2. To resize all samples to 256x256.
3. To overcome the lack of datasets (most existing datasets, such as Malimg, are outdated, and newer ones contain a mix of greyscale and RGB images).
Note that some samples were omitted to maintain balance, which helps avoid overfitting and reduces the overall workload.
Also, please note that I do not take credit for the original datasets. Full credits are due to the respective owners.
MIT License (https://opensource.org/licenses/MIT). License information was derived automatically.
Dataset Card for "coco_captions"
Dataset Summary
COCO is a large-scale object detection, segmentation, and captioning dataset. This repo contains five captions per image; useful for sentence similarity tasks. Disclaimer: The team releasing COCO did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/coco_captions_quintets.
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
Intelligent Invoice Management System
Project Description:
The Intelligent Invoice Management System is an advanced AI-powered platform designed to revolutionize traditional invoice processing. By automating the extraction, validation, and management of invoice data, this system addresses the inefficiencies, inaccuracies, and high costs associated with manual methods. It enables businesses to streamline operations, reduce human error, and expedite payment cycles.
Problem Statement:
Manual invoice processing involves labor-intensive tasks such as data entry, verification, and reconciliation. These processes are time-consuming, prone to errors, and can result in financial losses and delays. The diversity of invoice formats from various vendors adds complexity, making automation a critical need for efficiency and scalability.
Proposed Solution:
The Intelligent Invoice Management System automates the end-to-end process of invoice handling using AI and machine learning techniques. Core functionalities include:
1. Invoice Generation: Automatically generate PDF invoices in at least four formats, populated with synthetic data.
2. Data Development: Leverage a dataset containing fields such as receipt numbers, company details, sales tax information, and itemized tables to create realistic invoice samples.
3. AI-Powered Labeling: Use Tesseract OCR to extract labeled data from invoice images, and train YOLO for label recognition, ensuring precise identification of fields (see the OCR sketch after this list).
4. Database Integration: Store extracted information in a structured database for seamless retrieval and analysis.
5. Web-Based Information System: Provide a user-friendly platform to upload invoices and retrieve key metrics, such as:
- Total sales within a specified duration.
- Total sales tax paid during a given timeframe.
- Detailed invoice information in tabular form for specific date ranges.
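As a hedged illustration of the OCR step in point 3 above, the snippet below uses pytesseract to pull raw text from an invoice image. The file name, the regex post-processing, and the presence of a local Tesseract installation are assumptions, not the project's actual pipeline.

```python
# Minimal sketch: extract raw text from an invoice image with Tesseract OCR.
# "invoice_0001.png" is a hypothetical file name; requires the tesseract binary.
import re
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("invoice_0001.png"))

# Toy post-processing: pull a receipt number of the form "Receipt No: 12345".
match = re.search(r"Receipt\s*No[:.]?\s*(\w+)", text, flags=re.IGNORECASE)
print(match.group(1) if match else "receipt number not found")
```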
Key Features and Deliverables:
1. Invoice Generation:
- Generate 20,000 invoices using an automated script.
- Include dummy logos, company details, and itemized tables for four items per invoice.
2. Label Definition and Format:
3. OCR and AI Training:
4. Database Management:
5. Web-Based Interface:
Expected Outcomes:
- Reduction in manual effort and operational costs.
- Improved accuracy in invoice processing and financial reporting.
- Enhanced scalability and adaptability for diverse invoice formats.
- Faster turnaround time for invoice-related tasks.
By automating critical aspects of invoice management, this system delivers a robust and intelligent solution to meet the evolving needs of businesses.
MIT License (https://opensource.org/licenses/MIT). License information was derived automatically.
Dataset Card for "SPECTER"
Dataset Summary
Dataset containing triplets (three sentences): anchor, positive, and negative. Contains titles of papers. Disclaimer: The team releasing SPECTER did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Dataset Structure
Each example in the dataset contains triplets of equivalent sentences and is formatted as a dictionary with the key "set" and a list with… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/SPECTER.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/).
Data science beginners start with curated sets of data, but it is well known that in a real data science project, most of the time is spent collecting, cleaning, and organizing data. Domain expertise is also considered an important aspect of creating good ML models. Being an automobile enthusiast, I took up the challenge of collecting images of two popular car models from a used-car website, where users upload pictures of the car they want to sell, and then training a deep neural network to identify the model of a car from its image. In my search for images I found that approximately 10 percent of the car pictures did not represent the intended car correctly, and those pictures had to be deleted from the final data.
There are 4000 images of two popular cars in India (Swift and Wagonr), both made by Maruti Suzuki, with 2000 pictures per model. The data is divided into a training set of 2400 images, a validation set of 800 images, and a test set of 800 images. The data was randomized before splitting into training, validation, and test sets.
A starter kernel is provided for Keras with a CNN. I have also created a GitHub project documenting advanced techniques in PyTorch and Keras for image classification, such as data augmentation, dropout, batch normalization, and transfer learning.
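As a hedged sketch of a Keras baseline setup (not the author's starter kernel), the images could be loaded and a small CNN defined as below, assuming a directory layout of train/ and val/ with one subfolder per model (swift/, wagonr/); that layout is an assumption, not the dataset's documented structure.

```python
# Minimal sketch, assuming directories train/ and val/ each contain
# swift/ and wagonr/ subfolders with the images.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "val", image_size=(224, 224), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```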
With a small dataset like this, how much accuracy can we achieve, and is more data always better? The baseline model trained in Keras achieves 88% accuracy on the validation set; can we achieve even better performance, and by how much?
Is the data collected for the two car models representative of all possible cars from all over the country, or is there sample bias?
I would also like someone to extend the concept to build a use case so that if a user uploads an incorrect car picture, the ML model could automatically flag it, for example a user uploading the wrong model or an image that is not a car.
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
This version update includes changes to Generated_peptides.csv to fix cyclization. The prior upload did not have ring closures generated correctly as SMILES strings. The model in the publication was trained on the dataset containing errors; however, to support the community, we decided it would be best to release a 10M-peptide SMILES dataset for use in future pretraining applications. All strings should now load correctly to mol files with RDKit.
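As a quick illustrative check (not from the original description), the SMILES strings can be validated with RDKit; the column name "smiles" is an assumption about the CSV layout and may need to be adjusted.

```python
# Minimal sketch: parse a few peptide SMILES from Generated_peptides.csv.
# The column name "smiles" is an assumption; adjust to the actual header.
import pandas as pd
from rdkit import Chem

df = pd.read_csv("Generated_peptides.csv")
smiles = df["smiles"].head(1000)
mols = [Chem.MolFromSmiles(s) for s in smiles]
print(f"{sum(m is not None for m in mols)} / {len(mols)} strings parsed successfully")
```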
We modify the Diving48 dataset ("RESOUND: Towards Action Recognition without Representation Bias", Li et al., ECCV 2020) into three new domains: two based on shape and one based on texture (following Geirhos et al., ICLR 2019). Note that the Statistical Visual Computing Lab in San Diego (http://www.svcl.ucsd.edu) holds the copyright to the Diving48 dataset. Please cite the RESOUND paper if you use any data related to the Diving48 dataset, including our modified versions here.
In the shape domains, we blur the background and keep only the segmented diver(s) (S1) or their bounding boxes (S2). In the texture domain (T), we conversely mask out the bounding boxes where the diver(s) are and keep only the background. The masked boxes are filled with the average ImageNet pixel value (following Choi et al., NeurIPS 2019). The class evidence should lie only in the divers' movement; hence, the texture version should not contain any relevant signal, and the accuracy should drop to random performance. Thus, we can study how different models drop in score when tested on the shape or texture domains, indicating both cross-domain robustness (for S1 and S2) and texture bias (for T).
This modified dataset was introduced in "Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition", Broomé et al., arXiv:2112.12175. Only the test set of Diving48 was used there -- we did not train on these modified domains; they were used only for evaluation. The files are .mp4 videos consisting of 32 frames each, regardless of the length of the original clip (clips are typically around 5 seconds long). We may consider uploading the training set as well; please contact us if you need it urgently. Otherwise, the trained model for diver segmentation is released in this repository https://github.com/sofiabroome/diver-segmentation if you want to perform the cropping and saving yourself, at your own desired frame rate.
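For readers who want to peek at the clips, a hedged Python sketch using OpenCV to read the 32 frames of one .mp4 is shown below; OpenCV and the placeholder file name are assumptions, not part of the release.

```python
# Minimal sketch: read all frames from one of the 32-frame .mp4 clips.
# "clip.mp4" is a placeholder; OpenCV (cv2) is an assumed dependency.
import cv2

cap = cv2.VideoCapture("clip.mp4")
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
cap.release()
print(f"read {len(frames)} frames")  # expected: 32 per clip
```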
Creative Commons Attribution 4.0 International (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/). License information was derived automatically.
The annotation files were adjusted to conform to the YOLO Keras TXT format prior to upload, as the original format did not include a label map file.
v1
contains the original imported images, without augmentations. This is the version to download and import to your own project if you'd like to add your own augmentations.
v2
contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "FAST" model.
v3
contains an augmented version of the dataset, with annotations. This version of the project was trained with Roboflow's "ACCURATE" model.
MIT License (https://opensource.org/licenses/MIT). License information was derived automatically.
Dataset Card for "altlex"
Dataset Summary
Git repository for software associated with the 2016 ACL paper "Identifying Causal Relations Using Parallel Wikipedia Articles." Disclaimer: The team releasing altlex did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/altlex.
Description
👋🛳️ Ahoy, welcome to Kaggle! You’re in the right place. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.
If you want to talk with other users about this competition, come join our Discord! We've got channels for competitions, job postings and career discussions, resources, and socializing with your fellow data scientists. Follow the link here: https://discord.gg/kaggle
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
Read on or watch the video below to explore more details. Once you’re ready to start competing, click on the "Join Competition" button to create an account and gain access to the competition data. Then check out Alexis Cook’s Titanic Tutorial that walks you through step by step how to make your first submission!
The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Recommended Tutorial
We highly recommend Alexis Cook’s Titanic Tutorial that walks you through making your very first submission step by step and this starter notebook to get started.
How Kaggle’s Competitions Work
1. Join the Competition: Read about the challenge description, accept the Competition Rules, and gain access to the competition dataset.
2. Get to Work: Download the data, build models on it locally or on Kaggle Notebooks (our no-setup, customizable Jupyter Notebooks environment with free GPUs), and generate a prediction file.
3. Make a Submission: Upload your prediction as a submission on Kaggle and receive an accuracy score.
4. Check the Leaderboard: See how your model ranks against other Kagglers on our leaderboard.
5. Improve Your Score: Check out the discussion forum to find lots of tutorials and insights from other competitors.
Kaggle Lingo Video
You may run into unfamiliar lingo as you dig into the Kaggle discussion forums and public notebooks. Check out Dr. Rachael Tatman’s video on Kaggle Lingo to get up to speed!
What Data Will I Use in This Competition?
In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
Check out the “Data” tab to explore the datasets even further. Once you feel you’ve created a competitive model, submit it to Kaggle to see where your model stands on our leaderboard against other Kagglers.
How to Submit your Prediction to Kaggle
Once you’re ready to make a submission and get on the leaderboard:
Click on the “Submit Predictions” button
Upload a CSV file in the submission file format. You’re able to submit 10 submissions a day.
Submission File Format: You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.
The file should have exactly 2 columns:
- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)
Got it! I’m ready to get started. Where do I get help if I need it?
For Competition Help: Titanic Discussion Forum. Kaggle doesn’t have a dedicated team to help troubleshoot your code, so you’ll typically find that you receive a response more quickly by asking your question in the appropriate forum. The forums are full of useful information on the data, metric, and different approaches. We encourage you to use the forums often. If you share your knowledge, you'll find that others will share a lot in turn!
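As an illustrative sketch (not an official Kaggle example), a valid submission file with the two required columns could be written like this; the all-zeros prediction is a placeholder for your model's output.

```python
# Minimal sketch: build a submission CSV with exactly the two required columns.
# The constant 0 prediction is a placeholder; swap in your model's predictions.
import pandas as pd

test = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": 0,  # placeholder predictions (0 = did not survive)
})
submission.to_csv("submission.csv", index=False)  # 418 rows + header
```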
A Last Word on Kaggle Notebooks
As we mentioned before, Kaggle Notebooks is our no-setup, customizable, Jupyter Notebooks environment with free GPUs and a huge repository ...
MIT License (https://opensource.org/licenses/MIT). License information was derived automatically.
Dataset Card for "sentence-compression"
Dataset Summary
Dataset with pairs of equivalent sentences. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from using the dataset. Disclaimer: The team releasing sentence-compression did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/sentence-compression.
MIT License (https://opensource.org/licenses/MIT). License information was derived automatically.
Dataset Card for "Amazon-QA"
Dataset Summary
This dataset contains Question and Answer data from Amazon. Disclaimer: The team releasing Amazon-QA did not upload the dataset to the Hub and did not write a dataset card. These steps were done by the Hugging Face team.
Supported Tasks
Sentence Transformers training; useful for semantic search and sentence similarity.
Languages
English.
Dataset Structure
Each example in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/Amazon-QA.