90 datasets found

Data from: Indian Car Dataset
kaggle.com
zip
Updated Oct 7, 2022
Cite
Tejaswi Kumar (2022). Indian Car Dataset [Dataset]. https://www.kaggle.com/datasets/tejaswikumar24/indian-car-dataset
Explore at:
Available download formats: zip (4068 bytes)
Dataset updated
Oct 7, 2022
Authors
Tejaswi Kumar
Description
This dataset was collected from an Indian car website and can be used for exploration as well as research purposes. The attributes used in the dataset are:
- Car Name: the name of the car
- Price: the price range in which the car is usually sold
- Engine: the volume of fuel and air that can be pushed through the car's cylinders, measured in cubic centimetres
- Mileage: the fuel efficiency of the vehicle
Indian Housing Datasets -ML-ready data for - Citys
kaggle.com
zip
Updated Nov 8, 2025
Cite
Vishal Baghel (2025). Indian Housing Datasets -ML-ready data for - Citys [Dataset]. https://www.kaggle.com/datasets/vishalbaghel28/indian-housing-datasets-ml-ready-data-for-citys
Explore at:
Available download formats: zip (153559 bytes)
Dataset updated
Nov 8, 2025
Authors
Vishal Baghel
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
🇮🇳 Indian Housing Datasets — Ready-to-use for ML & Data Analysis

This dataset provides clean, ready-to-use Indian housing data for:
- 🏙️ Ahmedabad
- 🏙️ Gurugram
- 🏙️ Mumbai

Each dataset includes features like:
- Property size (sqft)
- Location & locality
- Price
- Number of bedrooms
- Furnishing details
- Property type (apartment, villa, etc.)
- Age of property

All datasets are formatted in CSV for quick loading and analysis in Python, Pandas, or any ML pipeline.
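As a rough illustration of the "quick loading" mentioned above, the sketch below reads housing rows with Python's standard csv module and derives a price per square foot. The column names (locality, size_sqft, price, bedrooms) and the two sample rows are assumptions made up for this example; the actual files may use different headers and values.

```python
import csv
import io

# Made-up sample with hypothetical column names; not taken from the real files.
SAMPLE_CSV = """locality,size_sqft,price,bedrooms
Locality A,850,25500000,2
Locality B,600,12000000,1
"""


def price_per_sqft(csv_text):
    """Return {locality: price / size_sqft} for each row of a housing CSV."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return {
        row["locality"]: float(row["price"]) / float(row["size_sqft"])
        for row in rows
    }


print(price_per_sqft(SAMPLE_CSV))
```

For a real file, the same function could be fed the file's text once the headers are adjusted to match.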

📦 Python Package (for easy access)

You can directly load these datasets using my PyPI library:

pip install india-housing-datasets
WordProject
huggingface.co
Updated Jan 16, 2025
+ more versions
Cite
AI4Bharat (2025). WordProject [Dataset]. https://huggingface.co/datasets/ai4bharat/WordProject
Explore at:
Dataset updated
Jan 16, 2025
Dataset authored and provided by
AI4Bharat
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages

Overview

BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from WordProject, a subset of BhasaAnuvaad.

How to use

The datasets library allows you to load and pre-process your dataset in pure Python… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/WordProject.
CGRdb2.0: A Python Database Management System for Molecules, Reactions, and...
figshare.com
xlsx
Updated May 30, 2023
Cite
Timur Gimadiev; Ramil Nugmanov; Aigul Khakimova; Adeliya Fatykhova; Timur Madzhidov; Pavel Sidorov; Alexandre Varnek (2023). CGRdb2.0: A Python Database Management System for Molecules, Reactions, and Chemical Data [Dataset]. http://doi.org/10.1021/acs.jcim.1c01105.s002
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.1021/acs.jcim.1c01105.s002
Dataset updated
May 30, 2023
Dataset provided by
ACS Publications
Authors
Timur Gimadiev; Ramil Nugmanov; Aigul Khakimova; Adeliya Fatykhova; Timur Madzhidov; Pavel Sidorov; Alexandre Varnek
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
This work introduces CGRdb2.0an open-source database management system for molecules, reactions, and chemical data. CGRdb2.0 is a Python package connecting to a PostgreSQL database that enables native searches for molecules and reactions without complicated SQL syntax. The library provides out-of-the-box implementations for similarity and substructure searches for molecules, as well as similarity and substructure searches for reactions in two waysbased on reaction components and based on the Condensed Graph of Reaction approach, the latter significantly accelerating the performance. In benchmarking studies with the RDKit database cartridge, we demonstrate that CGRdb2.0 performs searches faster for smaller data sets, while allowing for interactive access to the retrieved data.
Data from: Cytonuclear discordance in the Florida Everglades invasive...
catalog.data.gov
data.usgs.gov
Updated Nov 21, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Cytonuclear discordance in the Florida Everglades invasive Burmese python (Python bivittatus) population reveals possible hybridization with the Indian python (P. molurus) [Dataset]. https://catalog.data.gov/dataset/cytonuclear-discordance-in-the-florida-everglades-invasive-burmese-python-python-bivittatu
Explore at:
Dataset updated
Nov 21, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Florida, Everglades
Description
Invasive Burmese pythons (Python bivittatus) have been reproducing in the Florida Everglades since the 1980s. Introduction of the species was either due to unintentional escapes or intentional releases from snakes obtained through the commercial pet trade. Burmese pythons have caused a precipitous decline in small mammal populations in south Florida. To better understand the invasive population, two mitochondrial loci (mtDNA; 1398 bps) were sequenced on 426 snakes and 22 microsatellites were genotyped on 389 snakes. Concatenated cytochrome b and cytochrome oxidase 1 mtDNA sequences produced six haplotypes with a nucleotide and haplotype diversity of π=0.002 and h=0.097, respectively. The dominant haplotype was highly divergent from the second most frequent haplotype (π =0.0388). The average number of microsatellite alleles and expected heterozygosity were NA = 5.50 and HE = 0.60, respectively. Nuclear Bayesian assignment tests supported two genetically distinct groups and an admixed group. The effective population size was lower than expected for a population of this size (Ne =315.1), but reflective of the overall low genetic diversity. Patterns for genetic diversity between mtDNA and microsatellites were disparate, indicating nuclear introgression of separate mtDNA stocks due to interbreeding among sympatric populations/stocks of P. bivittatus. Alternatively, hybridization between P. molurus and P. bivittatus may have occurred in native or captive populations. The introgression may have occurred in the native range, breeding of disparate stocks in the pet trade, or in the invasive habitat. The invasive Florida Burmese python sequences were similar to the published sequences identified as P. bivittatus and P. molurus, however the nuclear diversity was nearly half of that reported in wild populations sampled within the native range.
Data from: Burmese python environmental DNA data, and associated attributes,...
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 26, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Burmese python environmental DNA data, and associated attributes, collected from ARM Loxahatchee NWR and surrounding areas, from 2014-2016 [Dataset]. https://catalog.data.gov/dataset/burmese-python-environmental-dna-data-and-associated-attributes-collected-from-arm-lo-2014
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
Environmental DNA (eDNA) detection of invasive species can be used to delimit occupied ranges and estimate probabilities to inform management decisions. Environmental DNA is shed into the environment through skin cells and bodily fluids and can be detected in water samples collected from lakes, rivers, and swamps. In south Florida, invasive Burmese pythons occupy much of the Greater Everglades in mostly inaccessible habitat and are credited with causing severe declines of native species’ populations. Detection of Burmese pythons by traditional methods, such as trapping and visual searching, has been largely ineffective, making eDNA a superior method for differentiating invaded habitat. We adapted a quantitative PCR eDNA assay for droplet digital PCR, a state-of-the-art method that improves precision and accuracy. From August 2014 to October 2016, locations in and around Arthur R. Marshall Loxahatchee National Wildlife Refuge in southeast Florida were surveyed for Burmese python eDNA. The Refuge is maintained to provide water storage and is considered one of the last remnants of the northern Everglades wetlands. Positive eDNA detections were made at each of the five sampling events, assessing a total of 399 samples, with moderate occurrence (ψ=58-91%) and detection (p=40-70%) probabilities, potentially reduced by high PCR inhibition levels. The high occurrence rates and geographic distribution of the positive samples within the Refuge suggest a steady release of python eDNA from a resident Burmese python population and reduce support for transport of eDNA primarily through boats or flowing water from the north. The first confirmed sighting of a Burmese python in the Refuge occurred in September 2016, after eDNA testing had indicated the presence of pythons. An established population is not expected this far north; however, the detections likely indicate the northern range limit of a consistent population at Loxahatchee on the eastern side of the Florida peninsula.
Our study demonstrates the benefit of eDNA for determining more accurate range limits and expansion information for Burmese pythons, as well as laying the foundation for the assessment of control efforts.
Code_Vulnerability_Security_DPO
huggingface.co
Updated Apr 21, 2024
Cite
Byte (2024). Code_Vulnerability_Security_DPO [Dataset]. https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO
Explore at:
Dataset updated
Apr 21, 2024
Authors
Byte
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Cybernative.ai Code Vulnerability and Security Dataset

Dataset Description

The Cybernative.ai Code Vulnerability and Security Dataset is a dataset of synthetic Data Programming by Demonstration (DPO) pairs, focusing on the intricate relationship between secure and insecure code across a variety of programming languages. This dataset is meticulously crafted to serve as a pivotal resource for researchers, cybersecurity professionals, and AI developers who are keen on… See the full description on the dataset page: https://huggingface.co/datasets/CyberNative/Code_Vulnerability_Security_DPO.
medical-instruction-100k
huggingface.co
Updated Nov 2, 2024
+ more versions
Cite
Mohammed Altaf (2024). medical-instruction-100k [Dataset]. https://huggingface.co/datasets/Mohammed-Altaf/medical-instruction-100k
Explore at:
Dataset updated
Nov 2, 2024
Authors
Mohammed Altaf
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
What is the Dataset About?🤷🏼‍♂️

The dataset is useful for training a generative language model for medical applications and instruction purposes. It consists of various thoughts proposed by people (marked as the Human) and their responses, which include medical terminology such as, but not limited to, names of drugs, prescriptions, yogic exercise suggestions, breathing exercise suggestions, and a few natural home-made prescriptions.

How the Dataset… See the full description on the dataset page: https://huggingface.co/datasets/Mohammed-Altaf/medical-instruction-100k.
Python Import Data India – Buyers & Importers List
seair.co.in
Cite
Seair Exim, Python Import Data India – Buyers & Importers List [Dataset]. https://www.seair.co.in
Explore at:
Available download formats: .bin, .xml, .csv, .xls
Dataset provided by
Seair Info Solutions PVT LTD
Authors
Seair Exim
Area covered
India
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Benchmarking on Microservices Configurations and the Impact on the...
zenodo.org
data.europa.eu
csv
Updated Sep 21, 2023
Cite
Mohamed Mekki; Nassima Toumi; Adlen Ksentini (2023). Benchmarking on Microservices Configurations and the Impact on the Performance in Cloud Native Environments [Dataset]. http://doi.org/10.5281/zenodo.6907619
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.6907619
Dataset updated
Sep 21, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mohamed Mekki; Nassima Toumi; Adlen Ksentini
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
The peer reviewed publication for this dataset has been published in LCN 2022, 47th Annual IEEE Conference on Local Computer Networks. Please cite this paper when referring to the dataset: https://www.eurecom.fr/publication/6971.

Cloud-native and containerization have changed the way to develop and deploy applications. Cloud-native rethinks the application architecture by embracing a microservice approach, where each microservice is packaged into containers to run in a centralized or an edge cloud. When deploying the container running the micro-service, the tenant has to specify the needed computing resources to run their workload in terms of the amount of CPU and memory limit. However, it is not straightforward for a tenant to know in advance the computing amount that allows running the microservice optimally. This will have an impact not only on the service performances but also on the infrastructure provider, particularly if the resource overprovisioning approach is used. To overcome this issue, we conduct an experimental study aiming to detect if a tenant's configuration allows running its service optimally. We run several experiments on a cloud-native platform, using different types of applications under different resource configurations. The obtained results are presented in the accepted IEEE LCN paper (https://www.eurecom.fr/publication/6971) and are shared in this dataset.

The datasets are collected for 3 types of applications: Web servers written in python and Golang, RabbitMQ data broker and the OpenAirInterface 5G Core network function AMF (Access and Mobility Management Function).

Web Servers:

files: golang-web-server-performance.csv, python-web-server-performance.csv

We used Golang- and Python-based web servers for the test. Each request to the web server returns a 43 MB video. For testing we used ApacheBench, a command-line program for benchmarking HTTP web servers. ApacheBench allows parallel requests from multiple clients. For each web server instance we sent between 100 and 1000 requests at a concurrency level between 1 and 100, where the concurrency level is the number of parallel clients performing the requests.
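A sweep like the one described could be scripted roughly as follows. This is only a sketch, not the authors' test harness: the URL and the specific grid values are placeholders (the description gives only the ranges), while -n (total requests) and -c (concurrency) are standard ApacheBench options.

```python
from itertools import product


def ab_commands(url, request_counts, concurrency_levels):
    """Build one ApacheBench command per (n, c) pair, skipping c > n."""
    commands = []
    for n, c in product(request_counts, concurrency_levels):
        if c > n:  # ab refuses a concurrency level above the request count
            continue
        commands.append(["ab", "-n", str(n), "-c", str(c), url])
    return commands


# Hypothetical URL and grid; only the ranges (100-1000 requests,
# 1-100 parallel clients) come from the dataset description.
for cmd in ab_commands("http://localhost:8080/", [100, 500, 1000], [1, 10, 100]):
    print(" ".join(cmd))
```

Each command list can be passed to subprocess.run to execute the benchmark against a running server.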

The information available in the dataset is as follows:

time: timestamp of collection of metrics.

ram_limit: the memory allocated to the container in megabytes.

cpu_limit: the CPU allocated to the container.

ram_usage: the amount of memory used by the container at the time of the metrics collection, in bytes.

cpu_usage: the amount of CPU used by the container at the time of the metrics collection.

n: the number of requests sent to the container.

c: the concurrency level in the requests.

lat50: the least response time for the best 50% requests in microseconds.

lat66: the least response time for the best 66% requests in microseconds.

lat75: the least response time for the best 75% requests in microseconds.

lat80: the least response time for the best 80% requests in microseconds.

lat90: the least response time for the best 90% requests in microseconds.

lat95: the least response time for the best 95% requests in microseconds.

lat98: the least response time for the best 98% requests in microseconds.

lat99: the least response time for the best 99% requests in microseconds.

lat100: the least response time in microseconds.
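One way to use the latency columns is to compare the tail with the median for each test configuration. The sketch below assumes that lat50 and lat99 are the 50th and 99th percentile response times (the wording above is ambiguous on this point), and the two rows are made-up values that only reuse the column names listed here.

```python
import csv
import io

# Made-up rows using the column names listed above; values are illustrative only.
SAMPLE = """time,ram_limit,cpu_limit,ram_usage,cpu_usage,n,c,lat50,lat99
t0,512,1,100000000,0.5,100,10,2000,8000
t1,512,1,120000000,0.9,1000,100,5000,40000
"""


def tail_ratios(csv_text):
    """Return (n, c, lat99 / lat50) per row: how much slower the tail is than the median."""
    ratios = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        ratio = float(row["lat99"]) / float(row["lat50"])
        ratios.append((int(row["n"]), int(row["c"]), ratio))
    return ratios


for n, c, ratio in tail_ratios(SAMPLE):
    print(f"n={n} c={c} lat99/lat50={ratio:.1f}")
```

A ratio that grows with the concurrency level would hint that the container is under-provisioned for that load.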

5G Core network’s AMF:

file: amf-performance.csv

For testing we use my5G-RANTester, a tool for emulating control and data planes of the UE and gNB (5G base station). The number of simultaneous registration requests that are sent to each instance of the AMF varies between 10 and 400.

The information available in the dataset is as follows:

time: timestamp of collection of metrics.

ram_limit: the memory allocated to the container in megabytes.

cpu_limit: the CPU allocated to the container.

ram_usage: the amount of memory used by the container at the time of the metrics collection, in bytes.

cpu_usage: the amount of CPU used by the container at the time of the metrics collection.

n: the number of parallel registration requests sent to the AMF.

mean: the mean registration time for all the registration requests in microseconds.

lat50: the median registration time for registration requests in microseconds.

lat75: the least registration time for the best 75% registration requests in microseconds.

lat80: the least registration time for the best 80% registration requests in microseconds.

lat90: the least registration time for the best 90% registration requests in microseconds.

lat95: the least registration time for the best 95% registration requests in microseconds.

lat98: the least registration time for the best 98% registration requests in microseconds.

lat99: the least registration time for the best 99% registration requests in microseconds.

lat100: the least registration time in microseconds.
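Because ram_limit is given in megabytes and ram_usage in bytes, the two need a unit conversion before they can be compared. The helper below is a sketch of that conversion; whether a "megabyte" here means 1024 * 1024 bytes or 10**6 bytes is not stated in the description, so the factor is an assumption exposed as a parameter.

```python
def ram_utilization(ram_usage_bytes, ram_limit_mb, bytes_per_mb=1024 * 1024):
    """Fraction of the container's memory limit in use.

    bytes_per_mb is an assumption: pass 10**6 if the limits are decimal megabytes.
    """
    return ram_usage_bytes / (ram_limit_mb * bytes_per_mb)


# Illustrative values only: 256 MiB used out of a 512 MB limit.
print(ram_utilization(268435456, 512))
```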

RabbitMQ data broker:

file: rabbitmq-performance.csv

For testing we used RabbitMQ PerfTest which is a throughput testing tool that simulates basic workloads and provides the throughput and the time that a message takes to be consumed by a consumer. For each deployed RabbitMQ server we used a number of producers and consumers that ranges from 50 to 500. Each producer sends messages to the broker with a rate of 100 messages per second for a period of time of 90 seconds.
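The load this implies can be worked out directly from the figures above: each producer sends 100 messages per second for 90 seconds. The short sketch below does that arithmetic for the two ends of the stated range; the function name is just for illustration.

```python
def messages_sent(producers, rate_per_second=100, duration_seconds=90):
    """Total messages published in one run: producers x rate x duration."""
    return producers * rate_per_second * duration_seconds


print(messages_sent(50))   # smallest configuration: 450000 messages
print(messages_sent(500))  # largest configuration: 4500000 messages
```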

The information available in the dataset is as follows:

time: timestamp of collection of metrics.

ram_limit: the memory allocated to the container in megabytes.

cpu_limit: the CPU allocated to the container.

ram_usage: the amount of memory used by the container at the time of the metrics collection, in bytes.

cpu_usage: the amount of CPU used by the container at the time of the metrics collection.

n: the number of producers sending messages to the RabbitMQ server.

Min: the minimum consumption time for the producer messages.

lat50: the median consumption time for the producer messages.

lat75: the least consumption time for the best 75% messages in microseconds.

lat95: the least consumption time for the best 95% messages in microseconds.

lat99: the least consumption time for the best 99% messages in microseconds.
DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning
data.niaid.nih.gov
Updated May 16, 2023
Cite
Olsen, Alex; Konovalov, Dimitriv A.; Philippa, Bronson; Ridd, Peter; Wood, Jake C.; Johns, Jamie; Banks, Wesley; Girgenti, Benjamin; Kenny, Owen; Whinney, James; Calvert, Brendan; Rahimi Azghadi, Mostafa; White, Ronald D. (2023). DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7939059
Explore at:
Dataset updated
May 16, 2023
Authors
Olsen, Alex; Konovalov, Dimitriv A.; Philippa, Bronson; Ridd, Peter; Wood, Jake C.; Johns, Jamie; Banks, Wesley; Girgenti, Benjamin; Kenny, Owen; Whinney, James; Calvert, Brendan; Rahimi Azghadi, Mostafa; White, Ronald D.
License
http://www.apache.org/licenses/LICENSE-2.0
Description
DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning

This repository makes available the source code and public dataset for the work, "DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning", published with open access by Scientific Reports: https://www.nature.com/articles/s41598-018-38343-3. The DeepWeeds dataset consists of 17,509 images capturing eight different weed species native to Australia in situ with neighbouring flora. In our work, the dataset was classified to an average accuracy of 95.7% with the ResNet50 deep convolutional neural network.

The source code, images and annotations are licensed under CC BY 4.0 license. The contents of this repository are released under an Apache 2 license.

Download the dataset images and our trained models

images.zip (468 MB)

models.zip (477 MB)

Due to the size of the images and models they are hosted outside of the Github repository. The images and models must be downloaded into directories named "images" and "models", respectively, at the root of the repository. If you execute the python script (deepweeds.py), as instructed below, this step will be performed for you automatically.

TensorFlow Datasets

Alternatively, you can access the DeepWeeds dataset with TensorFlow Datasets, TensorFlow's official collection of ready-to-use datasets. DeepWeeds was officially added to the TensorFlow Datasets catalog in August 2019.

Weeds and locations

The selected weed species are local to pastoral grasslands across the state of Queensland. They include: "Chinee apple", "Snake weed", "Lantana", "Prickly acacia", "Siam weed", "Parthenium", "Rubber vine" and "Parkinsonia". The images were collected from weed infestations at the following sites across Queensland: "Black River", "Charters Towers", "Cluden", "Douglas", "Hervey Range", "Kelso", "McKinlay" and "Paluma". The table and figure below break down the dataset by weed, location and geographical distribution.

Data organization

Images are assigned unique filenames that include the date/time the image was photographed and an ID number for the instrument which produced the image. The format is like so: YYYYMMDD-HHMMSS-ID, where the ID is simply an integer from 0 to 3. The unique filenames are strings of 17 characters, such as 20170320-093423-1.
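The filename format can be decoded with the standard library. The following is a minimal sketch, not part of the DeepWeeds repository; parse_image_name is a hypothetical helper that also tolerates a trailing file extension.

```python
from datetime import datetime


def parse_image_name(name):
    """Split 'YYYYMMDD-HHMMSS-ID' (optionally with an extension) into (datetime, instrument_id)."""
    stem = name.rsplit(".", 1)[0] if "." in name else name
    date_part, time_part, instrument = stem.split("-")
    captured = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return captured, int(instrument)


captured, instrument = parse_image_name("20170320-093423-1")
print(captured.isoformat(), instrument)
```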

labels

The labels.csv file assigns species labels to each image. It is a comma separated text file in the format:

Filename,Label,Species
...
20170207-154924-0.jpg,7,Snake weed
20170610-123859-1.jpg,1,Lantana
20180119-105722-1.jpg,8,Negative
...

Note: The specific label subsets of training (60%), validation (20%) and testing (20%) for the five-fold cross validation used in the paper are also provided here as CSV files in the same format as "labels.csv".
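A file in this format can be read with Python's csv module. The sketch below counts images per species using only the three example rows quoted in the description (with the first filename read as 20170207-154924-0.jpg); species_counts is a hypothetical helper, and the real labels.csv has one row per image.

```python
import csv
import io
from collections import Counter

# The three example rows from the description, not the full file.
SAMPLE_LABELS = """Filename,Label,Species
20170207-154924-0.jpg,7,Snake weed
20170610-123859-1.jpg,1,Lantana
20180119-105722-1.jpg,8,Negative
"""


def species_counts(csv_text):
    """Count how many images carry each species label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Species"] for row in reader)


print(species_counts(SAMPLE_LABELS))
```

The same function should work on the provided train, validation and test subset files, since they are described as sharing the labels.csv format.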

models

We provide the most successful ResNet50 and InceptionV3 models saved in Keras' hdf5 model format. The ResNet50 model, which provided the best results, has also been converted to UFF format in order to construct a TensorRT inference engine.

resnet.hdf5 inception.hdf5 resnet.uff

deepweeds.py

This Python script trains and evaluates Keras' base implementation of ResNet50 and InceptionV3 on the DeepWeeds dataset, pre-trained with ImageNet weights. The performance of the networks is cross-validated over 5 folds. The final classification accuracy is taken to be the average across the five folds. Similarly, the final confusion matrix from the associated paper aggregates across the five independent folds. The script also provides the ability to measure the inference speeds within the TensorFlow environment.
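The aggregation step described here can be sketched in a few lines. This is not code from deepweeds.py: the function names are hypothetical, and summing the per-fold confusion matrices element-wise is an assumption about what "aggregates" means.

```python
def average_accuracy(fold_accuracies):
    """Mean classification accuracy across folds."""
    return sum(fold_accuracies) / len(fold_accuracies)


def aggregate_confusion(fold_matrices):
    """Element-wise sum of square per-fold confusion matrices (lists of lists)."""
    size = len(fold_matrices[0])
    total = [[0] * size for _ in range(size)]
    for matrix in fold_matrices:
        for i in range(size):
            for j in range(size):
                total[i][j] += matrix[i][j]
    return total


# Illustrative two-class, two-fold inputs; not results from the paper.
print(average_accuracy([0.5, 1.0]))
print(aggregate_confusion([[[1, 0], [0, 1]], [[2, 1], [0, 3]]]))
```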

The script can be executed to carry out these computations using the following commands.

To train and evaluate the ResNet50 model with five-fold cross validation, use python3 deepweeds.py cross_validate --model resnet.

To train and evaluate the InceptionV3 model with five-fold cross validation, use python3 deepweeds.py cross_validate --model inception.

To measure inference times for the ResNet50 model, use python3 deepweeds.py inference --model models/resnet.hdf5.

To measure inference times for the InceptionV3 model, use python3 deepweeds.py inference --model models/inception.hdf5.

Dependencies

The required Python packages to execute deepweeds.py are listed in requirements.txt.

tensorrt

This folder includes C++ source code for creating and executing a ResNet50 TensorRT inference engine on an NVIDIA Jetson TX2 platform. To build and run on your Jetson TX2, execute the following commands:

cd tensorrt/src
make -j4
cd ../bin
./resnet_inference

Citations

If you use the DeepWeeds dataset in your work, please cite it as:

IEEE style citation: “A. Olsen, D. A. Konovalov, B. Philippa, P. Ridd, J. C. Wood, J. Johns, W. Banks, B. Girgenti, O. Kenny, J. Whinney, B. Calvert, M. Rahimi Azghadi, and R. D. White, “DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning,” Scientific Reports, vol. 9, no. 2058, 2 2019. [Online]. Available: https://doi.org/10.1038/s41598-018-38343-3 ”

BibTeX

@article{DeepWeeds2019,
  author = {Alex Olsen and Dmitry A. Konovalov and Bronson Philippa and Peter Ridd and Jake C. Wood and Jamie Johns and Wesley Banks and Benjamin Girgenti and Owen Kenny and James Whinney and Brendan Calvert and Mostafa {Rahimi Azghadi} and Ronald D. White},
  title = {{DeepWeeds: A Multiclass Weed Species Image Dataset for Deep Learning}},
  journal = {Scientific Reports},
  year = 2019,
  number = 2058,
  month = 2,
  volume = 9,
  issue = 1,
  day = 14,
  url = "https://doi.org/10.1038/s41598-018-38343-3",
  doi = "10.1038/s41598-018-38343-3"
}
Data from: Serpentoviruses in free-ranging invasive pythons and native...
catalog.data.gov
data.usgs.gov
+1more
Updated Oct 8, 2025
+ more versions
Cite
U.S. Geological Survey (2025). Serpentoviruses in free-ranging invasive pythons and native colubrids in southern Florida, United States, 2018-2020 [Dataset]. https://catalog.data.gov/dataset/serpentoviruses-in-free-ranging-invasive-pythons-and-native-colubrids-in-southern-flo-2018
Explore at:
Dataset updated
Oct 8, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
South Florida, United States, Florida
Description
Data on the presence of serpentoviruses in Burmese pythons (Python bivittatus) and native snakes were collected and compiled to characterize serpentovirus in wild free-ranging pythons and free-ranging native snakes within the invasive range of the pythons in southern Florida. Virus presence was tested in 318 pythons and 219 native snakes, primarily within the Greater Everglades Ecosystem of south Florida. When available, variables collected from submitted samples used for analysis included sampling date, sampling season (Summer/Fall/Winter/Spring), capture date, sample number (if tested more than once), reverse transcription polymerase chain reaction (rtPCR) result (positive/negative), virus type (categorical), sex (male/female), snout-vent length (centimeters), mass (grams), oral mucosal appearance, capture coordinates (Universal Transverse Mercator [UTM] Easting [CaptureUTMx] and Northing [CaptureUTMy]), and capture subpopulation designation (categorical). Snake samples were provided through various local, state, and federal organizations, including the United States Geological Survey (USGS), National Park Service (NPS), United States Department of Agriculture (USDA), the Conservancy of Southwest Florida (CSF), and the Florida Fish and Wildlife Conservation Commission (FWC).
NPTEL
huggingface.co
Updated Jan 16, 2025
Cite
AI4Bharat (2025). NPTEL [Dataset]. https://huggingface.co/datasets/ai4bharat/NPTEL
Explore at:
Dataset updated
Jan 16, 2025
Dataset authored and provided by
AI4Bharat
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages

Overview

BhasaAnuvaad is the largest Indic-language AST dataset, spanning over 44,400 hours of speech and 17M text segments for 13 of 22 scheduled Indian languages and English. This repository consists of parallel data for Speech Translation from NPTEL, a subset of BhasaAnuvaad.

How to use

The datasets library allows you to load and pre-process your dataset in pure Python, at… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/NPTEL.
Data from: Effects of an invasive top predator on Ecosystem structure and...
data.niaid.nih.gov
datadryad.org
zip
Updated May 20, 2025
Cite
Shelby LeClare; Christina Romagosa; Benjamin Baiser (2025). Effects of an invasive top predator on Ecosystem structure and function in a Graminoid Marsh food web [Dataset]. http://doi.org/10.5061/dryad.tqjq2bw9k
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.tqjq2bw9k
Dataset updated
May 20, 2025
Dataset provided by
University of Florida
Authors
Shelby LeClare; Christina Romagosa; Benjamin Baiser
License
https://spdx.org/licenses/CC0-1.0.html
Description
In this era of global change, understanding the effects of human-mediated dispersal of organisms has become a priority for ecological and conservation research. Non-native species introductions can result in biological invasions that have substantial impacts on native population abundances, community interactions, and ecosystem processes. Over the past twenty years, the establishment and subsequent invasion of the Burmese python (Python molurus bivittatus) in the Greater Everglades ecosystem has become a topic of concern amongst land managers and conservation practitioners in southern Florida. The objective of this study is to assess community-wide impacts of the python on the native food web as well as the function of the Greater Everglades Ecosystem by attempting to answer two central questions: 1) What functional role does the python occupy within the native food web and 2) is there a shift in overall food web structure and/or function post-invasion? We used ecological network analysis and an extensive diet dataset to quantify the python’s trophic role relative to other residents in the food web as well as compare ecosystem characteristics between pre- and post-invaded network models. Our findings demonstrate that the python functions similarly to the largemouth bass, another torrid, highly invasive predator that exhibits strong top-down impacts within its aquatic habitats. The python also behaves as a dominant predator akin to the Florida panther, primarily affecting native mammal populations through top-down predation effects, displacing other top predators within the food web, and altering patterns of carbon flow along the food chain. Finally, at the system scale, although we see increasing trends in ecosystem activity and decreasing trends in ecosystem structure and organization, functional metrics remain relatively stable pre- and post- invasion indicating functional resilience. 
Our findings provide a holistic assessment of the Burmese python invasion on the native community and function of the greater Everglades graminoid marsh ecosystem.
Data from: Photo-documented sequences from 01 Jun 2021-30 Aug 2021 showing...
catalog.data.gov
data.usgs.gov
Updated Nov 20, 2025
U.S. Geological Survey (2025). Photo-documented sequences from 01 Jun 2021-30 Aug 2021 showing novel interactions between intraguild predators in southern Florida, USA, bobcat and Burmese python [Dataset]. https://catalog.data.gov/dataset/photo-documented-sequences-from-01-jun-2021-30-aug-2021-showing-novel-interactions-between
Dataset updated
Nov 20, 2025
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Area covered
South Florida, United States
Description
Entire photo-documented sequence from 01 June 2021–09 September 2021, including novel interactions between intraguild predators in southern Florida – the native bobcat (Lynx rufus) and the invasive Burmese python (Python bivittatus). A bobcat depredated an unguarded Burmese python nest and subsequently the python exhibited nest defense behavior following the return of both animals to the nest. First, a bobcat discovers an unguarded nest then proceeds to depredate, cache, and uncover the eggs over several days. The bobcat returns to find the female python back on the nest and later proceeds to swipe at the snake. After biologists attempted to the nest but leave the camera, the bobcat returns to scavenge discarded, inviable eggs over several weeks. This is the first documentation of any animal in Florida preying on python eggs, and the first evidence or description of such antagonistic interactions at a python nest. Photos were captured by U.S. Geological Survey equipment in Big Cypress National Preserve within the Greater Everglades Ecosystem, Florida USA.
Using Python and Jupyter Notebook to Retrieve and Visualize the Water...
hydroshare.org
zip
Updated Apr 21, 2022
Ali Farshid (2022). Using Python and Jupyter Notebook to Retrieve and Visualize the Water Temperature Data of the Logan River, Utah [Dataset]. https://www.hydroshare.org/resource/8c565dc2f9244182a575f91515e83d1d
Available download formats: zip (358.6 MB)
Dataset updated
Apr 21, 2022
Dataset provided by
HydroShare
Authors
Ali Farshid
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Time period covered
Jan 1, 2014 - Dec 18, 2021
Description
The mainstem Logan River is a suitable habitat for cold-water fishes such as native populations of cutthroat trout (Budy & Gaeta, 2018). However, high water temperatures can harm cold-water fish populations by creating physiological stresses, intensifying metabolic demands, and limiting suitable habitats (Williams et al., 2015). Accordingly, the State of Utah Department of Environmental Quality (UDEQ) has identified the Logan River as a suitable habitat for cold-water species, one that can become unsuitable when the water temperature rises above 20 degrees Celsius (Rule R317-2, 2022). However, the UDEQ does not provide any details on how to evaluate violations of the standard. One way to evaluate violations is to look at water temperature distributions (i.e., histograms) at different locations along the river, from high to low elevations. In this report, I used three different Python libraries to manipulate, extract, and explore the water temperature data of the Logan River from 2014 to 2021, obtained from the Logan River Observatory website. The results (i.e., the histograms generated by executing the Jupyter Notebook in the HydroShare environment) show that the Logan River tends to experience higher water temperatures as its elevation drops, regardless of the season. This can provide some insights for the UDEQ to consider space and time simultaneously in assessing violations of the standard.
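The histogram-based check described above can be illustrated with a minimal sketch. It is not taken from the notebook in this resource: the site names and readings are made up, the 2-degree bin width is an arbitrary choice, and only the 20 degrees Celsius threshold comes from the description.

```python
from collections import Counter

THRESHOLD_C = 20.0  # cold-water limit quoted in the description


def temperature_histogram(temps_c, bin_width=2.0):
    """Count readings per bin; each key is the lower edge of a bin."""
    counts = Counter()
    for t in temps_c:
        lower_edge = (t // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


def exceedance_fraction(temps_c, threshold=THRESHOLD_C):
    """Share of readings strictly above the threshold."""
    if not temps_c:
        return 0.0
    return sum(t > threshold for t in temps_c) / len(temps_c)


# Hypothetical readings (degrees Celsius) for two invented sites,
# ordered from higher to lower elevation.
sites = {
    "upper_site": [6.5, 8.0, 9.5, 11.0, 12.5],
    "lower_site": [12.0, 16.5, 19.0, 21.0, 22.5],
}

for name, temps in sites.items():
    print(name, temperature_histogram(temps), exceedance_fraction(temps))
```

Comparing the exceedance fraction across sites and seasons is one simple way to consider space and time together, as the description suggests.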
Data from: Deep learning four decades of human migration: datasets
zenodo.org
csv, nc
Updated Oct 13, 2025
Thomas Gaskin; Guy Abel (2025). Deep learning four decades of human migration: datasets [Dataset]. http://doi.org/10.5281/zenodo.17344747
Available download formats: csv, nc
Unique identifier
https://doi.org/10.5281/zenodo.17344747
Dataset updated
Oct 13, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Thomas Gaskin; Guy Abel
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This Zenodo repository contains all migration flow estimates associated with the paper "Deep learning four decades of human migration." Evaluation code, training data, trained neural networks, and smaller flow datasets are available in the main GitHub repository, which also provides detailed instructions on data sourcing. Due to file size limits, the larger datasets are archived here.

Data is available in both NetCDF (.nc) and CSV (.csv) formats. The NetCDF format is more compact and pre-indexed, making it suitable for large files. In Python, datasets can be opened as xarray.Dataset objects, enabling coordinate-based data selection.

Each dataset uses the following coordinate conventions:

Year: 1990–2023

Birth ISO: Country of birth (UN ISO3)

Origin ISO: Country of origin (UN ISO3)

Destination ISO: Destination country (UN ISO3)

Country ISO: Used for net migration data (UN ISO3)

The following data files are provided:

T.nc: Full table of flows disaggregated by country of birth. Dimensions: Year, Birth ISO, Origin ISO, Destination ISO

flows.nc: Total origin-destination flows (equivalent to T summed over Birth ISO). Dimensions: Year, Origin ISO, Destination ISO

net_migration.nc: Net migration data by country. Dimensions: Year, Country ISO

stocks.nc: Stock estimates for each country pair. Dimensions: Year, Origin ISO (corresponding to Birth ISO), Destination ISO

test_flows.nc: Flow estimates on a randomly selected set of test edges, used for model validation

Additionally, two CSV files are provided for convenience:

mig_unilateral.csv: Unilateral migration estimates per country, comprising:

imm: Total immigration flows

emi: Total emigration flows

net: Net migration

imm_pop: Total immigrant population (non-native-born)

emi_pop: Total emigrant population (living abroad)

mig_bilateral.csv: Bilateral flow data, comprising:

mig_prev: Total origin-destination flows

mig_brth: Total birth-destination flows, where Origin ISO reflects place of birth

Each dataset includes a mean variable (mean estimate) and a std variable (standard deviation of the estimate).

An ISO3 conversion table is also provided.
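
The relation between the files listed above (flows being T summed over Birth ISO) can be illustrated with a small sketch. It uses a synthetic, in-memory stand-in for the tables rather than the real NetCDF files; the numbers are invented, the tuple layout is an assumption made for illustration, and net migration is computed with the conventional definition (immigration minus emigration), which the description does not spell out.

```python
from collections import defaultdict

# Synthetic stand-in for T:
# (year, birth_iso, origin_iso, destination_iso) -> mean flow estimate.
# All values are invented for illustration.
T = {
    (2020, "MEX", "MEX", "USA"): 100.0,
    (2020, "GTM", "MEX", "USA"): 15.0,
    (2020, "MEX", "USA", "MEX"): 40.0,
    (2021, "MEX", "MEX", "USA"): 90.0,
}


def sum_over_birth(table):
    """Collapse the birth dimension to get origin-destination flows."""
    flows = defaultdict(float)
    for (year, _birth, origin, destination), value in table.items():
        flows[(year, origin, destination)] += value
    return dict(flows)


def net_migration(flows, year, country):
    """Immigration minus emigration for one country and year."""
    imm = sum(v for (y, _o, d), v in flows.items() if y == year and d == country)
    emi = sum(v for (y, o, _d), v in flows.items() if y == year and o == country)
    return imm - emi


flows = sum_over_birth(T)
print(flows[(2020, "MEX", "USA")])
print(net_migration(flows, 2020, "USA"))
```

Only mean estimates are aggregated here. The std variable cannot be combined by simple addition, so uncertainty for aggregated flows should be taken from the provided files.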
Data from: Satellite remote sensing dataset of Sentinel-2 for phenology...
zenodo.org
producciocientifica.uv.es
txt
Updated Apr 28, 2023
Dessislava Ganeva; Lukas Graf Valentin; Egor Prikaziuk; Gerbrand Koren; Enrico Tomelleri; Jochem Verrelst; Katja Berger; Santiago Belda; Zhanzhang Cai; Cláudio Silva Figueira (2023). Satellite remote sensing dataset of Sentinel-2 for phenology metrics extraction from sites in Bulgaria and France [Dataset]. http://doi.org/10.5281/zenodo.7825727
Available download formats: txt
Unique identifier
https://doi.org/10.5281/zenodo.7825727
Dataset updated
Apr 28, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Dessislava Ganeva; Lukas Graf Valentin; Egor Prikaziuk; Gerbrand Koren; Enrico Tomelleri; Jochem Verrelst; Katja Berger; Santiago Belda; Zhanzhang Cai; Cláudio Silva Figueira
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Site Description:

This dataset covers seventeen production crop fields in Bulgaria, where winter rapeseed and wheat were grown, and two research fields in France, where winter wheat – rapeseed – barley – sunflower and winter wheat – irrigated maize crop rotations are used. The full description of those fields is in the database "In-situ crop phenology dataset from sites in Bulgaria and France" (doi.org/10.5281/zenodo.7875440).

Methodology and Data Description:

Remote sensing data is extracted from Sentinel-2 tiles 35TNJ for Bulgarian sites and 31TCJ for French sites on the day of the overpass since September 2015 for Sentinel-2 derived vegetation indices and since October 2016 for HR-VPP products. To suppress spectral mixing effects at the parcel boundaries, as highlighted by Meier et al., 2020, the values from all datasets were subgrouped per field and then aggregated to a single median value for further analysis.

Sentinel-2 data was downloaded for all test sites from CREODIAS (https://creodias.eu/) in L2A processing level using a maximum scene-wide cloud cover threshold of 75%. Scenes before 2017 were available in L1C processing level only. Scenes in L1C processing level were corrected for atmospheric effects after downloading using Sen2Cor (v2.9) with default settings. This was the same version used for the L2A scenes obtained intermediately from CREODIAS.

Next, the data was extracted from the Sentinel-2 scenes for each field parcel where only SCL classes 4 (vegetation) and 5 (bare soil) pixels were kept. We resampled the 20m band B8A to match the spatial resolution of the green and red band (10m) using nearest neighbor interpolation. The entire image processing chain was carried out using the open-source Python Earth Observation Data Analysis Library (EOdal) (Graf et al., 2022).
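
The pixel selection and per-field aggregation described above can be sketched as follows. This is a plain-Python illustration, not the EOdal workflow: the record layout (field id, SCL class, value) and the sample values are hypothetical, and only the kept SCL classes (4 and 5) and the median aggregation come from the description.

```python
from collections import defaultdict
from statistics import median

# SCL classes kept in the description: 4 (vegetation) and 5 (bare soil).
KEPT_SCL_CLASSES = {4, 5}


def median_per_field(pixels):
    """Drop pixels outside the kept SCL classes, then take one median per field.

    pixels: iterable of (field_id, scl_class, value) tuples (hypothetical layout).
    """
    grouped = defaultdict(list)
    for field_id, scl_class, value in pixels:
        if scl_class in KEPT_SCL_CLASSES:
            grouped[field_id].append(value)
    return {field_id: median(values) for field_id, values in grouped.items()}


# Invented index values; the 0.30 pixel mimics a mixed pixel at a parcel
# boundary, and the class 9 pixel (cloud) is discarded.
pixels = [
    ("field_A", 4, 0.80),
    ("field_A", 4, 0.82),
    ("field_A", 5, 0.30),
    ("field_A", 9, 0.10),
    ("field_B", 4, 0.55),
    ("field_B", 4, 0.57),
    ("field_B", 4, 0.60),
]

print(median_per_field(pixels))
```

The median is less sensitive than the mean to a few boundary pixels, which is consistent with the stated goal of suppressing spectral mixing effects.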

Apart from the widely used Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI), we included two recently proposed indices that were reported to have a higher correlation with photosynthesis and drought response of vegetation: the Near-Infrared Reflectance of Vegetation (NIRv) (Badgley et al., 2017) and Kernel NDVI (kNDVI) (Camps-Valls et al., 2021). We calculated the vegetation indices in two different ways:

First, we used B08 as near-infrared (NIR) band which comes in a native spatial resolution of 10 m. B08 (central wavelength 833 nm) has a relatively coarse spectral resolution with a bandwidth of 106 nm.

Second, we used B8A which is available at 20 m spatial resolution. B8A differs from B08 in its central wavelength (864 nm) and has a narrower bandwidth (21 nm or 22 nm in the case of Sentinel-2A and 2B, respectively) compared to B08.
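
The four indices named above can be written out as a short sketch using their commonly published formulas. The description does not state which exact variants or coefficients the authors used, so the choices here are assumptions: the standard EVI coefficients, NIRv as NDVI times NIR, and the simplified kNDVI form tanh(NDVI^2). Inputs are assumed to be surface reflectances scaled to the range 0 to 1, and the nir argument can be either B08 or B8A.

```python
import math


def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients (assumed)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def nirv(nir, red):
    """Near-infrared reflectance of vegetation, taken here as NDVI * NIR."""
    return ndvi(nir, red) * nir


def kndvi(nir, red):
    """Kernel NDVI in its simplified form, tanh(NDVI^2) (assumed variant)."""
    return math.tanh(ndvi(nir, red) ** 2)


# Invented reflectance values for a single vegetated pixel.
nir, red, blue = 0.5, 0.1, 0.05
print(ndvi(nir, red), evi(nir, red, blue), nirv(nir, red), kndvi(nir, red))
```

Because B08 and B8A differ in central wavelength and bandwidth, the same functions will generally return slightly different values for the two NIR inputs, which is why the dataset provides both.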

The High Resolution Vegetation Phenology and Productivity (HR-VPP) dataset from the Copernicus Land Monitoring Service (CLMS) comprises three sets of 10 m products derived from Sentinel-2: vegetation indices, vegetation phenology and productivity parameters, and seasonal trajectories (Tian et al., 2021). Both the vegetation indices, Normalized Difference Vegetation Index (NDVI) and Plant Phenology Index (PPI), and the plant parameters, Fraction of Absorbed Photosynthetically Active Radiation (FAPAR) and Leaf Area Index (LAI), were computed for the time of Sentinel-2 overpass by the data provider.

NDVI is computed directly from B04 and B08, and PPI is computed using the Difference Vegetation Index (DVI = B08 - B04) and its seasonal maximum value per pixel. FAPAR and LAI are retrieved from B03, B04, and B08 with neural networks trained on PROSAIL model simulations. The dataset has a quality flag product (QFLAG2), a 16-bit flag that extends the scene classification band (SCL) of the Sentinel-2 Level-2 products. A “medium” filter was used to mask out QFLAG2 values from 2 to 1022, leaving land pixels (bit 1) within or outside cloud proximity (bits 11 and 13) or cloud shadow proximity (bits 12 and 14).
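
A literal reading of the DVI definition and the “medium” QFLAG2 filter above can be sketched as follows. This is an illustration only: it applies the stated numeric range (2 to 1022) directly and does not decode individual bits, so the HR-VPP user manual remains the authority on what each flag value means. The sample values are invented.

```python
def dvi(b08, b04):
    """Difference Vegetation Index as defined in the description."""
    return b08 - b04


def passes_medium_filter(qflag2):
    """Keep a pixel unless its QFLAG2 value lies in the masked range 2..1022."""
    return not (2 <= qflag2 <= 1022)


# Invented QFLAG2 values for five pixels.
qflag_values = [1, 2, 512, 1022, 1025]
kept = [q for q in qflag_values if passes_medium_filter(q)]
print(kept)
print(dvi(0.5, 0.1))
```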

The HR-VPP daily raw vegetation indices products are described in detail in the user manual (Smets et al., 2022) and the computational details of PPI are given by Jin and Eklundh (2014). Seasonal trajectories refer to the 10-daily smoothed time-series of PPI used for vegetation phenology and productivity parameters retrieval with TIMESAT (Jönsson and Eklundh 2002, 2004).

HR-VPP data was downloaded through the WEkEO Copernicus Data and Information Access Services (DIAS) system using Python 3.8.10 and the harmonized data access (HDA) API 0.2.1. Zonal statistics [’min’, ’max’, ’mean’, ’median’, ’count’, ’std’, ’majority’] were computed on non-masked pixel values within field boundaries with the rasterstats Python package 0.17.00.
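
To make the listed zonal statistics concrete, here is a minimal standard-library sketch of what each one means for the pixels of a single field. It does not call rasterstats; masked pixels are represented by None as a hypothetical convention, and the population standard deviation is an assumption about how 'std' is defined.

```python
from collections import Counter
from statistics import mean, median, pstdev


def zonal_stats_for_field(values):
    """Summarize the non-masked pixel values of one field.

    values: list of numbers, with None marking a masked pixel (hypothetical).
    """
    valid = [v for v in values if v is not None]
    if not valid:
        return {"count": 0}
    return {
        "min": min(valid),
        "max": max(valid),
        "mean": mean(valid),
        "median": median(valid),
        "count": len(valid),
        "std": pstdev(valid),  # population standard deviation (assumed)
        "majority": Counter(valid).most_common(1)[0][0],  # most frequent value
    }


# Invented pixel values for one field; one pixel is masked.
print(zonal_stats_for_field([2, 4, 4, None, 6]))
```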

The start of season date (SOSD), end of season date (EOSD) and length of season (LENGTH) were extracted from the annual Vegetation Phenology and Productivity Parameters (VPP) dataset as an additional source for comparison. These data are a product of the Vegetation Phenology and Productivity Parameters; see https://land.copernicus.eu/pan-european/biophysical-parameters/high-resolution-vegetation-phenology-and-productivity/vegetation-phenology-and-productivity for detailed information.

File Description:

4 datasets:

1_senseco_data_S2_B08_Bulgaria_France; 1_senseco_data_S2_B8A_Bulgaria_France; 1_senseco_data_HR_VPP_Bulgaria_France; 1_senseco_data_phenology_VPP_Bulgaria_France

3 metadata:

2_senseco_metadata_S2_B08_B8A_Bulgaria_France; 2_senseco_metadata_HR_VPP_Bulgaria_France; 2_senseco_metadata_phenology_VPP_Bulgaria_France

The dataset files “1_senseco_data_S2_B08_Bulgaria_France” and “1_senseco_data_S2_B8A_Bulgaria_France” contain all vegetation index (EVI, NDVI, kNDVI, NIRv) data values and related information, and the metadata file “2_senseco_metadata_S2_B08_B8A_Bulgaria_France” describes all the existing variables. Because both data files have the same column variable names, they share this metadata file.

The dataset file “1_senseco_data_HR_VPP_Bulgaria_France” contains vegetation index (NDVI, PPI) and plant parameter (LAI, FAPAR) data values and related information, and the metadata file “2_senseco_metadata_HR_VPP_Bulgaria_France” describes all the existing variables.

The dataset file “1_senseco_data_phenology_VPP_Bulgaria_France” contains the vegetation phenology and productivity parameter (LENGTH, SOSD, EOSD) values and related information, and the metadata file “2_senseco_metadata_phenology_VPP_Bulgaria_France” describes all the existing variables.

Bibliography

G. Badgley, C.B. Field, J.A. Berry, Canopy near-infrared reflectance and terrestrial photosynthesis, Sci. Adv. 3 (2017) e1602244. https://doi.org/10.1126/sciadv.1602244.

G. Camps-Valls, M. Campos-Taberner, Á. Moreno-Martínez, S. Walther, G. Duveiller, A. Cescatti, M.D. Mahecha, J. Muñoz-Marí, F.J. García-Haro, L. Guanter, M. Jung, J.A. Gamon, M. Reichstein, S.W. Running, A unified vegetation index for quantifying the terrestrial biosphere, Sci. Adv. 7 (2021) eabc7447. https://doi.org/10.1126/sciadv.abc7447.

L.V. Graf, G. Perich, H. Aasen, EOdal: An open-source Python package for large-scale agroecological research using Earth Observation and gridded environmental data, Comput. Electron. Agric. 203 (2022) 107487. https://doi.org/10.1016/j.compag.2022.107487.

H. Jin, L. Eklundh, A physically based vegetation index for improved monitoring of plant phenology, Remote Sens. Environ. 152 (2014) 512–525. https://doi.org/10.1016/j.rse.2014.07.010.

P. Jönsson, L. Eklundh, Seasonality extraction by function fitting to time-series of satellite sensor data, IEEE Trans. Geosci. Remote Sens. 40 (2002) 1824–1832. https://doi.org/10.1109/TGRS.2002.802519.

P. Jönsson, L. Eklundh, TIMESAT—a program for analyzing time-series of satellite sensor data, Comput. Geosci. 30 (2004) 833–845. https://doi.org/10.1016/j.cageo.2004.05.006.

J. Meier, W. Mauser, T. Hank, H. Bach, Assessments on the impact of high-resolution-sensor pixel sizes for common agricultural policy and smart farming services in European regions, Comput. Electron. Agric. 169 (2020) 105205. https://doi.org/10.1016/j.compag.2019.105205.

B. Smets, Z. Cai, L. Eklundh, F. Tian, K. Bonte, R. Van Hoolst, R. Van De Kerchove, S. Adriaensen, B. De Roo, T. Jacobs, F. Camacho, J. Sánchez-Zapero, S. Else, H. Scheifinger, K. Hufkens, P. Jönsson, HR-VPP Product User Manual Vegetation Indices, 2022.

F. Tian, Z. Cai, H. Jin, K. Hufkens, H. Scheifinger, T. Tagesson, B. Smets, R. Van Hoolst, K. Bonte, E. Ivits, X. Tong, J. Ardö, L. Eklundh, Calibrating vegetation phenology from Sentinel-2 using eddy covariance, PhenoCam, and PEP725 networks across Europe, Remote Sens. Environ. 260 (2021) 112456. https://doi.org/10.1016/j.rse.2021.112456.
Genome-wide SNP datasets for the non-native pink salmon in Norway
data.niaid.nih.gov
dataone.org
zip
Updated Feb 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Simo Njabulo Maduna; Paul Eric Aspholm; Ane-Sofie Bednarczyk Hansen; Cornelya Klütsch; Snorre Hagen (2024). Genome-wide SNP datasets for the non-native pink salmon in Norway [Dataset]. http://doi.org/10.5061/dryad.zw3r228f2
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.zw3r228f2
Dataset updated
Feb 5, 2024
Dataset provided by
Norwegian Institute of Bioeconomy Research
Authors
Simo Njabulo Maduna; Paul Eric Aspholm; Ane-Sofie Bednarczyk Hansen; Cornelya Klütsch; Snorre Hagen
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Norway
Description
Effective management of non-indigenous species requires knowledge of their dispersal factors and founder events. We aim to identify the main environmental drivers favouring dispersal events along the invasion gradient and to characterize the spatial patterns of genetic diversity in feral populations of the non-native pink salmon within its epicentre of invasion in Norway. We first conducted species distribution modelling (SDM) using four modelling techniques with varying levels of complexity, which encompassed both regression-based and tree-based machine-learning algorithms, using climatic data from the present to 2050. Then we used the triple-enzyme restriction-site associated DNA sequencing (3RADseq) approach to genotype over 30,000 high-quality single-nucleotide polymorphisms to elucidate patterns of genetic diversity and gene flow within the pink salmon putative invasion hotspot. We discovered that temperature- and precipitation-related variables drove pink salmon distributional shifts across its non-native ranges, and that climate-induced favourable areas will remain stable for the next 30 years. In addition, all SDMs identified north-eastern Norway as the epicentre of the pink salmon invasion, and genomic data revealed that there was minimal variation in genetic diversity across the sampled populations at a genome-wide level in this region. However, using a specific group of ‘diagnostic’ SNPs, we observed a significant degree of genetic differentiation, ranging from moderate to substantial, and detected four hierarchical genetic clusters concordant with geography. Our findings suggest that fluctuations of climate extreme events associated with ongoing climate change will likely maintain environmental favourability for the pink salmon outside its ‘native’/introduced ranges. Local invaded rivers are themselves a potential source population of invaders in the ongoing secondary spread of pink salmon in Northern Norway.
Our study shows that SDMs and genomic data can reveal species distribution determinants and provide indicators to aid in post-control measures and potential inferences of their success.
Methods
3RAD library preparation and sequencing: We prepared RADseq libraries using the Adapterama III library preparation protocol of Bayona-Vásquez et al., (2019; their Supplemental File SI). For each sample, ~40-100 ng of genomic DNA were digested for 1 h at 37 °C in a solution with 1.5 µl of 10x Cutsmart® buffer (NEB®), 0.25 µl of Read 1 enzyme (MspI) at 20 U/µl, 0.25 µl of Read 2 enzyme (BamHI-HF) at 20 U/µl, 0.25 µl of Read 1 adapter dimer-cutting enzyme (ClaI) at 20 U/µl, 1 µl of i5Tru adapter at 2.5 µM, 1 µl of i7Tru adapter at 2.5 µM and 0.75 µl of dH2O. After digestion/ligation, samples were pooled and cleaned with 1.2x Sera-Mag SpeedBeads (Fisher Scientific™) in a 1.2:1 (SpeedBeads:DNA) ratio, and we eluted cleaned DNA in 60 µL of TLE. An enrichment PCR of each sample was carried out with 10 µl of 5x Kapa Long Range Buffer (Kapa Biosystems, Inc.), 0.25 µl of KAPA LongRange DNA Polymerase at 5 U/µl, 1.5 µl of dNTPs mix (10 mM each dNTP), 3.5 µl of MgCl2 at 25 mM, 2.5 µl of iTru5 primer at 5 µM, 2.5 µl of iTru7 primer at 5 µM and 5 µl of pooled DNA. The i5 and i7 adapters were ligated to each sample using a unique combination (2 i5 X 1 i7 indexes). The temperature conditions for PCR enrichment were 94 °C for 2 min of initial denaturation, followed by 10 cycles of 94 °C for 20 sec, 57 °C for 15 sec and 72 °C for 30 sec, and a final cycle of 72 °C for 5 min. The enriched samples were each cleaned and quantified with a Quantus™ Fluorometer.
Cleaned, indexed and quantified libraries were pooled to equimolar concentrations and sent to the Norwegian Sequencing Centre (NSC) for quality control, final size selection using a one-sided bead clean-up (0.7:1 ratio) to capture 550 bp +/- 10% fragments, and paired-end (PE) 150 bp sequencing on one lane each of the Illumina HiSeq 4000 platform.
Data filtering: We filtered genotype data and characterized singleton SNP loci and multi-site variants (MSVs) using filtering procedures and custom scripts available in STACKS Workflow v.2 (https://github.com/enormandeau/stacks_workflow). First, we filtered the ‘raw’ VCF file, keeping only SNPs that (i) showed a minimum depth of four (-m 4), (ii) were called in at least 80% of the samples in each site (-p 80) and (iii) for which at least two samples had the rare allele, i.e., Minor Allele Sample (MAS; -S 2), using the python script 05_filter_vcf_fast.py. Second, we excluded samples with more than 20% missing genotypes from the data set. Third, we calculated pairwise relatedness between samples with the Yang et al., (2010) algorithm and individual-level heterozygosity in vcftools v.0.1.17 (Danecek et al., 2010). Additionally, we calculated pairwise kinship coefficients among individuals using the KING-robust method (Manichaikul et al., 2010) with the R package SNPRelate v.1.28.0 (Zheng et al., 2012). Then, we estimated genotyping error rates between technical replicates using the software tiger v1.0 (Bresadola et al., 2020). Finally, from each pair of closely related individuals we removed the one exhibiting the higher level of missing data, along with samples that showed extremely low heterozygosity (< -0.2) from graphical observation of individual-level heterozygosity per sampling population. Fourth, we conducted a secondary dataset filtering step using 05_filter_vcf_fast.py, keeping the above-mentioned data filtering cut-off parameters (i.e., -m = 4; -p = 80; -S = 3).
Fifth, we calculated a suite of four summary statistics to discriminate high-confidence SNPs (singleton SNPs) from SNPs exhibiting a duplication pattern (duplicated SNPs; MSVs): (i) median of allele ratio in heterozygotes (MedRatio), (ii) proportion of heterozygotes (PropHet), (iii) proportion of rare homozygotes (PropHomRare) and (iv) inbreeding coefficient (FIS). We calculated each parameter from the filtered VCF file using the python script 08_extract_snp_duplication_info.py. The four parameters calculated for each locus were plotted against each other to visualize their distribution across all loci using the R script 09_classify_snps.R. Based on the methodology of McKinney et al. (2017) and by plotting different combinations of each parameter, we graphically fixed cut-offs for each parameter. Sixth, we then used the python script 10_split_vcf_in_categories.py to classify SNPs and generate two separate datasets: the “SNP dataset,” based on SNP singletons only, and the “MSV dataset,” based on duplicated SNPs only, which we excluded from further analyses. Seventh, we postfiltered the SNP dataset by keeping all unlinked SNPs within each 3RAD locus using the 11_extract_unlinked_snps.py script with a minimum difference of 0.5 (-diff_threshold 0.5) and a maximum distance of 1,000 bp (-max_distance 1,000). Then, for the SNP dataset, we filtered out SNPs that were located in unplaced scaffolds, i.e., contigs that were not part of the 26 chromosomes of the pink salmon genome.
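The three criteria of the first filtering step (minimum depth, per-site call rate, and Minor Allele Sample count) can be illustrated with a small sketch. This is not the code of 05_filter_vcf_fast.py: the in-memory data layout is hypothetical, treating low-depth genotypes as missing is an assumption, and the function only mirrors the thresholds named in the description.

```python
from collections import Counter


def passes_filters(locus, min_depth=4, min_call_pct=80.0, min_mas=2):
    """Decide whether one biallelic SNP passes the three stated criteria.

    locus: dict mapping a sampling site to a list of (genotype, depth) pairs,
    where genotype is a tuple of allele indices such as (0, 1), or None when
    the genotype is missing (hypothetical layout).
    """
    allele_copies = Counter()   # copies of each allele across called samples
    allele_carriers = Counter() # number of samples carrying each allele
    for samples in locus.values():
        if not samples:
            return False
        # Genotypes below the depth threshold are treated as missing (assumed).
        called = [gt for gt, depth in samples if gt is not None and depth >= min_depth]
        if 100.0 * len(called) / len(samples) < min_call_pct:
            return False
        for gt in called:
            allele_copies.update(gt)
            allele_carriers.update(set(gt))
    if len(allele_copies) < 2:
        return False  # monomorphic after filtering
    minor_allele = min(allele_copies, key=allele_copies.get)
    return allele_carriers[minor_allele] >= min_mas


# Invented genotypes for two sites with five samples each.
locus_ok = {
    "site_A": [((0, 0), 10), ((0, 1), 8), ((0, 1), 6), ((0, 0), 5), ((0, 0), 3)],
    "site_B": [((0, 0), 9), ((0, 0), 7), ((1, 1), 12), ((0, 0), 4), ((0, 0), 6)],
}
# Only three of five genotypes reach the depth threshold at this site.
locus_low_call_rate = {
    "site_A": [((0, 1), 2), ((0, 1), 3), ((0, 0), 10), ((0, 0), 9), ((0, 0), 8)],
}

print(passes_filters(locus_ok))
print(passes_filters(locus_low_call_rate))
```

Raising min_mas makes the filter stricter about rare variants, which is the role of the -S parameter in the description.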
freeCodeCamp Headlines 5 Jan 2025
kaggle.com
zip
Updated Jan 5, 2025
Pravin M D (2025). freeCodeCamp Headlines 5 Jan 2025 [Dataset]. https://www.kaggle.com/datasets/pravinmd/freecodecamp-headlines-5-jan-2025/code
Available download formats: zip (3083 bytes)
Dataset updated
Jan 5, 2025
Authors
Pravin M D
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset

This dataset was created by Pravin M D

Released under MIT

Contents


Data from: Indian Car Dataset

Indian Car Dataset scraped from a car website using the Selenium module of Python
