License: https://spdx.org/licenses/CC0-1.0.html
Home-range estimation is an important application of animal tracking data that is frequently complicated by autocorrelation, sampling irregularity, and small effective sample sizes. We introduce a novel, optimal weighting method that accounts for temporal sampling bias in autocorrelated tracking data. This method corrects for irregular and missing data, such that oversampled times are downweighted and undersampled times are upweighted to minimize error in the home-range estimate. We also introduce computationally efficient algorithms that make this method feasible with large datasets. Generally speaking, there are three situations where weight optimization improves the accuracy of home-range estimates: with marine data, where the sampling schedule is highly irregular, with duty cycled data, where the sampling schedule changes during the observation period, and when a small number of home-range crossings are observed, making the beginning and end times more independent and informative than the intermediate times. Using both simulated data and empirical examples including reef manta ray, Mongolian gazelle, and African buffalo, optimal weighting is shown to reduce the error and increase the spatial resolution of home-range estimates. With a conveniently packaged and computationally efficient software implementation, this method broadens the array of datasets with which accurate space-use assessments can be made.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description of the Credit Card Eligibility Dataset: Determining Factors
The Credit Card Eligibility Dataset: Determining Factors is a comprehensive collection of variables aimed at understanding the factors that influence an individual's eligibility for a credit card. This dataset encompasses a wide range of demographic, financial, and personal attributes that are commonly considered by financial institutions when assessing an individual's suitability for credit.
Each row in the dataset represents a unique individual, identified by a unique ID, with associated attributes ranging from basic demographic information such as gender and age, to financial indicators like total income and employment status. Additionally, the dataset includes variables related to familial status, housing, education, and occupation, providing a holistic view of the individual's background and circumstances.
| Variable | Description |
|---|---|
| ID | An identifier for each individual (customer). |
| Gender | The gender of the individual. |
| Own_car | A binary feature indicating whether the individual owns a car. |
| Own_property | A binary feature indicating whether the individual owns a property. |
| Work_phone | A binary feature indicating whether the individual has a work phone. |
| Phone | A binary feature indicating whether the individual has a phone. |
| Email | A binary feature indicating whether the individual has provided an email address. |
| Unemployed | A binary feature indicating whether the individual is unemployed. |
| Num_children | The number of children the individual has. |
| Num_family | The total number of family members. |
| Account_length | The length of the individual's account with a bank or financial institution. |
| Total_income | The total income of the individual. |
| Age | The age of the individual. |
| Years_employed | The number of years the individual has been employed. |
| Income_type | The type of income (e.g., employed, self-employed, etc.). |
| Education_type | The education level of the individual. |
| Family_status | The family status of the individual. |
| Housing_type | The type of housing the individual lives in. |
| Occupation_type | The type of occupation the individual is engaged in. |
| Target | The target variable for the classification task, indicating whether the individual is eligible for a credit card or not (e.g., Yes/No, 1/0). |
Researchers, analysts, and financial institutions can leverage this dataset to gain insights into the key factors influencing credit card eligibility and to develop predictive models that assist in automating the credit assessment process. By understanding the relationship between various attributes and credit card eligibility, stakeholders can make more informed decisions, improve risk assessment strategies, and enhance customer targeting and segmentation efforts.
This dataset is valuable for a wide range of applications within the financial industry, including credit risk management, customer relationship management, and marketing analytics. Furthermore, it provides a valuable resource for academic research and educational purposes, enabling students and researchers to explore the intricate dynamics of credit card eligibility determination.
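For orientation, here is a minimal baseline sketch of how such a classification model might be trained on this dataset. The file name credit_card_eligibility.csv is hypothetical, and the column names follow the table above; this is an illustration, not part of the dataset release.

```python
# Minimal baseline sketch (not part of the dataset release): assumes the data are in
# a CSV named "credit_card_eligibility.csv" with the columns listed in the table above.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("credit_card_eligibility.csv")  # hypothetical file name

X = df.drop(columns=["ID", "Target"])
y = df["Target"]

categorical = ["Gender", "Income_type", "Education_type",
               "Family_status", "Housing_type", "Occupation_type"]
numeric = [c for c in X.columns if c not in categorical]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])

model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
model.fit(X_train, y_train)
print(f"Hold-out accuracy: {model.score(X_test, y_test):.3f}")
```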
The USDA Agricultural Research Service (ARS) recently established SCINet, which consists of a shared high-performance computing resource, Ceres, and the dedicated high-speed Internet2 network used to access Ceres. Current and potential SCINet users are using and generating very large datasets, so SCINet needs to be provisioned with adequate data storage for their active computing. It is not designed to hold data beyond active research phases. At the same time, the National Agricultural Library has been developing the Ag Data Commons, a research data catalog and repository designed for public data release and professional data curation. Ag Data Commons needs to anticipate the size and nature of data it will be tasked with handling. The ARS Web-enabled Databases Working Group, organized under the SCINet initiative, conducted a study to establish baseline data storage needs and practices, and to make projections that could inform future infrastructure design, purchases, and policies. The SCINet Web-enabled Databases Working Group helped develop the survey, which is the basis for an internal report. While the report was for internal use, the survey and resulting data may be generally useful and are being released publicly.
From October 24 to November 8, 2016, we administered a 17-question survey (Appendix A) by emailing a Survey Monkey link to all ARS Research Leaders, intending to cover the data storage needs of all 1,675 SY (Category 1 and Category 4) scientists. We designed the survey to accommodate either individual researcher responses or group responses. Research Leaders could decide, based on their unit's practices or their management preferences, whether to delegate response to a data management expert in their unit, to all members of their unit, or to collate responses from their unit themselves before reporting in the survey. Larger storage ranges cover vastly different amounts of data, so the implications here could be significant depending on whether the true amount is at the lower or higher end of the range. Therefore, we requested more detail from "Big Data users," those 47 respondents who indicated they had more than 10 to 100 TB or over 100 TB of total current data (Q5). All other respondents are called "Small Data users." Because not all of these follow-up requests were successful, we used actual follow-up responses to estimate likely responses for those who did not respond. We defined active data as data that would be used within the next six months; all other data would be considered inactive, or archival. To calculate per-person storage needs, we used the high end of the reported range divided by 1 for an individual response, or by G, the number of individuals in a group response. For Big Data users we used the actual reported values or estimated likely values.
Resources in this dataset:
Resource Title: Appendix A: ARS data storage survey questions. File Name: Appendix A.pdf. Resource Description: The full list of questions asked with the possible responses. The survey was not administered using this PDF, but the PDF was generated directly from the administered survey using the Print option under Design Survey. Asterisked questions were required. A list of Research Units and their associated codes was provided in a drop-down not shown here. Resource Software Recommended: Adobe Acrobat, url: https://get.adobe.com/reader/
Resource Title: CSV of Responses from ARS Researcher Data Storage Survey. File Name: Machine-readable survey response data.csv. Resource Description: CSV file that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. This is the same data as in the Excel spreadsheet (also provided).
Resource Title: Responses from ARS Researcher Data Storage Survey. File Name: Data Storage Survey Data for public release.xlsx. Resource Description: MS Excel worksheet that includes raw responses from the administered survey, as downloaded unfiltered from Survey Monkey, including incomplete responses. Also includes additional classification and calculations to support analysis. Individual email addresses and IP addresses have been removed. Resource Software Recommended: Microsoft Excel, url: https://products.office.com/en-us/excel
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The “Fused Image dataset for convolutional neural Network-based crack Detection” (FIND) is a large-scale image dataset with pixel-level ground truth crack data for deep learning-based crack segmentation analysis. It features four types of image data including raw intensity image, raw range (i.e., elevation) image, filtered range image, and fused raw image. The FIND dataset consists of 2500 image patches (dimension: 256x256 pixels) and their ground truth crack maps for each of the four data types.
The images contained in this dataset were collected from multiple bridge decks and roadways under real-world conditions. A laser scanning device was adopted for data acquisition such that the captured raw intensity and raw range images have pixel-to-pixel location correspondence (i.e., spatial co-registration feature). The filtered range data were generated by applying frequency domain filtering to eliminate image disturbances (e.g., surface variations, and grooved patterns) from the raw range data [1]. The fused image data were obtained by combining the raw range and raw intensity data to achieve cross-domain feature correlation [2,3]. Please refer to [4] for a comprehensive benchmark study performed using the FIND dataset to investigate the impact from different types of image data on deep convolutional neural network (DCNN) performance.
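As a generic illustration of what co-registered intensity and range data make possible, the sketch below simply stacks a matched intensity/range pair into a two-channel array. This is not the specific fusion method used to build the FIND dataset (that is described in references [2] and [3]), and the file paths are hypothetical.

```python
# Generic illustration only: stacks a co-registered intensity/range pair into a
# two-channel array. The actual fusion used for FIND is described in refs [2, 3];
# the file paths below are hypothetical.
import numpy as np
from PIL import Image

def load_stacked_patch(intensity_path: str, range_path: str) -> np.ndarray:
    """Return a (H, W, 2) float array with intensity and range channels."""
    intensity = np.asarray(Image.open(intensity_path).convert("L"), dtype=np.float32)
    elevation = np.asarray(Image.open(range_path).convert("L"), dtype=np.float32)
    assert intensity.shape == elevation.shape  # pixel-to-pixel co-registration
    # Scale each channel to [0, 1] before stacking.
    return np.stack([intensity / 255.0, elevation / 255.0], axis=-1)

patch = load_stacked_patch("raw_intensity/0001.png", "raw_range/0001.png")
print(patch.shape)  # e.g. (256, 256, 2)
```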
If you share or use this dataset, please cite [4] and [5] in any relevant documentation.
In addition, an image dataset for crack classification has also been published at [6].
References:
[1] Shanglian Zhou, & Wei Song. (2020). Robust Image-Based Surface Crack Detection Using Range Data. Journal of Computing in Civil Engineering, 34(2), 04019054. https://doi.org/10.1061/(asce)cp.1943-5487.0000873
[2] Shanglian Zhou, & Wei Song. (2021). Crack segmentation through deep convolutional neural networks and heterogeneous image fusion. Automation in Construction, 125. https://doi.org/10.1016/j.autcon.2021.103605
[3] Shanglian Zhou, & Wei Song. (2020). Deep learning–based roadway crack classification with heterogeneous image data fusion. Structural Health Monitoring, 20(3), 1274-1293. https://doi.org/10.1177/1475921720948434
[4] Shanglian Zhou, Carlos Canchila, & Wei Song. (2023). Deep learning-based crack segmentation for civil infrastructure: data types, architectures, and benchmarked performance. Automation in Construction, 146. https://doi.org/10.1016/j.autcon.2022.104678
[5] Shanglian Zhou, Carlos Canchila, & Wei Song. (2022). Fused Image dataset for convolutional neural Network-based crack Detection (FIND) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6383044
[6] Wei Song, & Shanglian Zhou. (2020). Laser-scanned roadway range image dataset (LRRD). Laser-scanned Range Image Dataset from Asphalt and Concrete Roadways for DCNN-based Crack Classification, DesignSafe-CI. https://doi.org/10.17603/ds2-bzv3-nc78
This is the dataset released as a companion to the paper "Explaining the Product Range Effect in Purchase Data", presented at the BigData 2013 conference.
Pennacchioli, D., Coscia, M., Rinzivillo, S., Pedreschi, D. and Giannotti, F., Explaining the Product Range Effect in Purchase Data. In BigData, 2013.
Raw radio tracking data used to determine the precise distance to Venus (and improve knowledge of the Astronomical Unit) from the Galileo flyby on 10 February 1990.
The following exercise contains questions based on the housing dataset.
How many houses have a waterfront? a. 21000 b. 21450 c. 163 d. 173
How many houses have 2 floors? a. 2692 b. 8241 c. 10680 d. 161
How many houses built before 1960 have a waterfront? a. 80 b. 7309 c. 90 d. 92
What is the price of the most expensive house having more than 4 bathrooms? a. 7700000 b. 187000 c. 290000 d. 399000
If the ‘price’ column contains outliers, how can you clean the data and remove them (see the sketch after this list)? a. Calculate the IQR range and drop the values outside the range. b. Calculate the p-value and remove the values less than 0.05. c. Calculate the correlation coefficient of the price column and remove the values less than the correlation coefficient. d. Calculate the Z-score of the price column and remove the values less than the z-score.
Which of the following can be used to identify the variables in the housing data that determine the price of a house? a. Correlation coefficients b. Z-score c. IQR Range d. Range of the Features
If we get an R² score of 0.38, what inferences can we make about the model and its efficiency? a. The model is 38% accurate, and shows poor efficiency. b. The model is showing 0.38% discrepancies in the outcomes. c. Low difference between observed and fitted values. d. High difference between observed and fitted values.
If the metrics show that the p-value for the grade column is 0.092, what inferences can we make about the grade column? a. Significant in the presence of other variables. b. Highly significant in the presence of other variables. c. Insignificant in the presence of other variables. d. None of the above
If the Variance Inflation Factor value for a feature is considerably higher than the other features, what can we say about that column/feature? a. High multicollinearity b. Low multicollinearity c. Both A and B d. None of the above
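For the outlier question above, option (a) describes IQR-based filtering. A minimal sketch of that technique is shown below; it assumes the housing data are loaded from a hypothetical housing.csv with a 'price' column.

```python
# Sketch of IQR-based outlier removal for the 'price' column (option a above).
# Assumes the housing data are in a CSV named "housing.csv" (hypothetical file name).
import pandas as pd

housing = pd.read_csv("housing.csv")

q1 = housing["price"].quantile(0.25)
q3 = housing["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows whose price falls inside the IQR-based fences.
cleaned = housing[(housing["price"] >= lower) & (housing["price"] <= upper)]
print(len(housing) - len(cleaned), "rows dropped as outliers")
```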
GLAH05 Level-1B waveform parameterization data include output parameters from the waveform characterization procedure and other parameters required to calculate surface slope and relief characteristics. GLAH05 contains parameterizations of both the transmitted and received pulses and other characteristics from which elevation and footprint-scale roughness and slope are calculated. The received pulse characterization uses two implementations of the retracking algorithms: one tuned for ice sheets, called the standard parameterization, used to calculate surface elevation for ice sheets, oceans, and sea ice; and another for land (the alternative parameterization). Each data granule has an associated browse product.
Home range estimation is routine practice in ecological research. While advances in animal tracking technology have increased our capacity to collect data to support home range analysis, these same advances have also resulted in increasingly autocorrelated data. Consequently, the question of which home range estimator to use on modern, highly autocorrelated tracking data remains open. This question is particularly relevant given that most estimators assume independently sampled data. Here, we provide a comprehensive evaluation of the effects of autocorrelation on home range estimation. We base our study on an extensive dataset of GPS locations from 369 individuals representing 27 species distributed across 5 continents. We first assemble a broad array of home range estimators, including Kernel Density Estimation (KDE) with four bandwidth optimizers (Gaussian reference function, autocorrelated-Gaussian reference function (AKDE), Silverman's rule of thumb, and least squares cross-validation), Minimum Convex Polygon, and Local Convex Hull methods. Notably, all of these estimators except AKDE assume independent and identically distributed (IID) data. We then employ half-sample cross-validation to objectively quantify estimator performance, and the recently introduced effective sample size for home range area estimation ($\hat{N}_\mathrm{area}$) to quantify the information content of each dataset. We found that AKDE 95% area estimates were larger than conventional IID-based estimates by a mean factor of 2. The median number of cross-validated locations included in the holdout sets by AKDE 95% (or 50%) estimates was 95.3% (or 50.1%), confirming the larger AKDE ranges were appropriately selective at the specified quantile. Conversely, conventional estimates exhibited negative bias that increased with decreasing $\hat{N}_\mathrm{area}$. To contextualize our empirical results, we performed a detailed simulation study to tease apart how sampling frequency, sampling duration, and the focal animal's movement conspire to affect range estimates. Paralleling our empirical results, the simulation study demonstrated that AKDE was generally more accurate than conventional methods, particularly for small $\hat{N}_\mathrm{area}$. While 72% of the 369 empirical datasets had >1000 total observations, only 4% had an $\hat{N}_\mathrm{area}$ >1000, whereas 30% had an $\hat{N}_\mathrm{area}$ <30. In this frequently encountered scenario of small $\hat{N}_\mathrm{area}$, AKDE was the only estimator capable of producing an accurate home range estimate on autocorrelated data.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
This mmWave dataset is used for fitness activity identification. The dataset (FA Dataset) contains 14 common daily fitness activities. The data were captured with a TI AWR1642 mmWave radar. The dataset can be used by fellow researchers to reproduce the original work or to further explore other machine-learning problems in the domain of mmWave signals.
Format: .png
Section 1: Device Configuration
Section 2: Data Format
We provide our mmWave data as heatmaps for this dataset. The data files are in .png format. The details are shown in the following:
Section 3: Experimental Setup
Section 4: Data Description
14 common daily activities and their corresponding files
| File Name | Activity Type | File Name | Activity Type |
|---|---|---|---|
| FA1 | Crunches | FA8 | Squats |
| FA2 | Elbow plank and reach | FA9 | Burpees |
| FA3 | Leg raise | FA10 | Chest squeezes |
| FA4 | Lunges | FA11 | High knees |
| FA5 | Mountain climber | FA12 | Side leg raise |
| FA6 | Punches | FA13 | Side to side chops |
| FA7 | Push ups | FA14 | Turning kicks |
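A minimal sketch for loading the .png heatmaps into arrays with integer activity labels is given below. It assumes one folder per activity (FA1 … FA14) and that all heatmaps share the same resolution; both are assumptions about the on-disk layout rather than documented facts.

```python
# Sketch: load the .png heatmaps into arrays with 0-based class labels.
# Assumes one folder per activity (FA1 ... FA14) and equal image sizes (assumptions).
from pathlib import Path
import numpy as np
from PIL import Image

def load_heatmaps(root: str):
    samples, labels = [], []
    for label in range(1, 15):                       # FA1 ... FA14
        for png in sorted(Path(root, f"FA{label}").glob("*.png")):
            samples.append(np.asarray(Image.open(png).convert("L"), dtype=np.float32))
            labels.append(label - 1)                 # 0-based class index
    return np.stack(samples), np.array(labels)

X, y = load_heatmaps("FA_dataset")                   # hypothetical root folder
print(X.shape, y.shape)
```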
Section 5: Raw Data and Data Processing Algorithms
Section 6: Citations
If your paper is related to our work, please cite our papers as follows.
https://ieeexplore.ieee.org/document/9868878/
Xie, Yucheng, Ruizhe Jiang, Xiaonan Guo, Yan Wang, Jerry Cheng, and Yingying Chen. "mmFit: Low-Effort Personalized Fitness Monitoring Using Millimeter Wave." In 2022 International Conference on Computer Communications and Networks (ICCCN), pp. 1-10. IEEE, 2022.
Bibtex:
@inproceedings{xie2022mmfit,
title={mmFit: Low-Effort Personalized Fitness Monitoring Using Millimeter Wave},
author={Xie, Yucheng and Jiang, Ruizhe and Guo, Xiaonan and Wang, Yan and Cheng, Jerry and Chen, Yingying},
booktitle={2022 International Conference on Computer Communications and Networks (ICCCN)},
pages={1--10},
year={2022},
organization={IEEE}
}
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
File name definitions:
'...v_50_175_250_300...' - dataset for velocity ranges [50, 175] + [250, 300] m/s
'...v_175_250...' - dataset for velocity range [175, 250] m/s
'ANNdevelop...' - used to perform 9 parametric sub-analyses where, in each one, many ANNs are developed (trained, validated and tested) and the one yielding the best results is selected
'ANNtest...' - used to test the best ANN from each aforementioned parametric sub-analysis, aiming to find the best ANN model; this dataset includes the 'ANNdevelop...' counterpart
Where to find the input (independent) and target (dependent) variable values in each dataset/Excel file?
input values in 'IN' sheet
target values in 'TARGET' sheet
Where to find the results from the best ANN model (for each target/output variable and each velocity range)?
open the corresponding Excel file; the expected (target) vs. ANN (output) results are written in the 'TARGET vs OUTPUT' sheet
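A minimal sketch for pulling these sheets into pandas is shown below. The file name is a placeholder; the sheet names follow the description above.

```python
# Minimal sketch for reading the inputs, targets, and results from one Excel file.
# The file name below is hypothetical; sheet names follow the description above.
import pandas as pd

path = "ANNdevelop_v_175_250.xlsx"  # placeholder file name

inputs = pd.read_excel(path, sheet_name="IN")                  # independent variables
targets = pd.read_excel(path, sheet_name="TARGET")             # dependent variables
results = pd.read_excel(path, sheet_name="TARGET vs OUTPUT")   # expected vs ANN output

print(inputs.shape, targets.shape)
print(results.head())
```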
Check reference below (to be added when the paper is published)
https://www.researchgate.net/publication/328849817_11_Neural_Networks_-_Max_Disp_-_Railway_Beams
License: https://creativecommons.org/publicdomain/zero/1.0/
Business Context
A research institute conducts a Talent Hunt Examination every year to hire people who can work on various research projects in the field of Mathematics and Computer Science. A2Z institute provides a preparatory program to help the aspirants prepare for the Talent Hunt Exam. The institute has a good record of helping many students clear the exam. Before the application for the next batch starts, the institute wants to attract more aspirants to their program. For this, the institute wants to assure the aspiring students of the quality of results obtained by students enrolled in their program in recent years.
However, one challenge in estimating an average score is that every year the exam’s difficulty level varies a little, and the distribution of scores also changes accordingly. The institute keeps track of the final scores of its alumni who attempted the exam previously. The institute has prepared a dataset consisting of a simple random sample of the final scores of 600 aspirants from the last three years.
Objective
The institute wants to provide an estimate of the average score obtained by aspirants who enroll in their program. Keeping in mind the variation in scores every year, the institute wants to provide a more reliable estimate of the average score using a range of scores instead of a single estimate. It is known from previous records that the standard deviation of the scores is 10 and the cut-off score in the most recent year was 84.
A recent social media post from A2Z institute received feedback from a reputed critic, mentioning that the students from A2Z institute score less than last year's cut-off on average. The institute wants to test if the claim by the critic is valid.
Solution Approach
To provide a more reliable estimate of the average score using a range of scores instead of a single estimate, we will construct a 95% confidence interval for the mean score that an aspirant has scored after enrolling in the institute’s program. To test the validity of the critic's claim (that the mean score of the students from the A2Z institute is less than last year’s cut-off score of 84), we will perform a hypothesis test (taking alpha = 5%).
Data
The dataset provided (Talent_hunt.csv) contains the final scores of 600 aspirants enrolled in the institute’s program in the last three years.
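A minimal sketch of the confidence interval and hypothesis test described above is given below. It assumes Talent_hunt.csv has a single score column named "Score" (the column name is an assumption), and uses the known population standard deviation of 10 and cut-off of 84 from the description.

```python
# Sketch of the 95% confidence interval and one-sided z-test described above.
# Assumes a score column named "Score" in Talent_hunt.csv (column name is an assumption);
# population standard deviation taken as 10 and cut-off as 84, per the description.
import numpy as np
import pandas as pd
from scipy import stats

scores = pd.read_csv("Talent_hunt.csv")["Score"]
n, sigma, cutoff = len(scores), 10, 84
xbar = scores.mean()
se = sigma / np.sqrt(n)

# 95% confidence interval for the mean (sigma known -> z-interval).
z_crit = stats.norm.ppf(0.975)
print(f"95% CI for mean score: ({xbar - z_crit * se:.2f}, {xbar + z_crit * se:.2f})")

# One-sided z-test of H0: mu >= 84 against Ha: mu < 84 (the critic's claim), alpha = 0.05.
z_stat = (xbar - cutoff) / se
p_value = stats.norm.cdf(z_stat)
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```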
License: https://www.elsevier.com/about/policies/open-access-licenses/elsevier-user-license/cpc-license/
Title of program: MARS-1-FOR-EFR-DWBA Catalogue Id: ABPB_v1_0
Nature of problem: The package SATURN-MARS-1 consists of two programs, SATURN and MARS, for calculating cross sections of reactions transferring nucleon(s) primarily between two heavy ions. The calculations are made within the framework of the finite-range distorted wave Born approximation (DWBA). The first part, SATURN, prepares the form factor(s) either for the exact finite-range (EFR) or for the no-recoil (NR) approach. The prepared form factor is then used by the second part, MARS, to calculate either EFR-DWBA or NR-DWBA cross-s ...
Versions of this program held in the CPC repository in Mendeley Data abpb_v1_0; MARS-1-FOR-EFR-DWBA; 10.1016/0010-4655(74)90012-5
This program has been imported from the CPC Program Library held at Queen's University Belfast (1969-2019)
License: https://spdx.org/licenses/CC0-1.0.html
Technological advances have steadily increased the detail of animal tracking datasets, yet fundamental data limitations exist for many species that cause substantial biases in home-range estimation. Specifically, the effective sample size of a range estimate is proportional to the number of observed range crossings, not the number of sampled locations. Currently, the most accurate home-range estimators condition on an autocorrelation model, for which the standard estimation frameworks are based on likelihood functions, even though these methods are known to underestimate variance—and therefore ranging area—when effective sample sizes are small. Residual maximum likelihood (REML) is a widely used method for reducing bias in maximum-likelihood (ML) variance estimation at small sample sizes. Unfortunately, we find that REML is too unstable for practical application to continuous-time movement models. When the effective sample size N is decreased to N ≤ O(10), which is common in tracking applications, REML undergoes a sudden divergence in variance estimation. To avoid this issue, while retaining REML's first-order bias correction, we derive a family of estimators that leverage REML to make a perturbative correction to ML. We also derive AIC values for REML and our estimators, including cases where model structures differ, which is not generally understood to be possible. Using both simulated data and GPS data from lowland tapir (Tapirus terrestris), we show how our perturbative estimators are more accurate than traditional ML and REML methods. Specifically, when O(5) home-range crossings are observed, REML is unreliable by orders of magnitude, ML home ranges are ~30% underestimated, and our perturbative estimators yield home ranges that are only ~10% underestimated. A parametric bootstrap can then reduce the ML and perturbative home-range underestimation to ~10% and ~3%, respectively. Home-range estimation is one of the primary reasons for collecting animal tracking data, and small effective sample sizes are a more common problem than is currently realized. The methods introduced here allow for more accurate movement-model and home-range estimation at small effective sample sizes, and thus fill an important role for animal movement analysis. Given REML's widespread use, our methods may also be useful in other contexts where effective sample sizes are small.
License: https://cdla.io/sharing-1-0/
This dataset is synthetically generated to simulate realistic operational data from electric two-wheelers. It includes common parameters such as vehicle speed, battery voltage, current, state of charge, motor and ambient temperatures, regenerative braking status, and trip duration.
The feature distributions are designed to mimic real-world driving behavior and environmental conditions typically observed in urban commuting scenarios for electric scooters and bikes.
The dataset is primarily intended for machine learning tasks such as EV range prediction, regression modeling, exploratory data analysis (EDA), and educational purposes.
This can serve as a baseline dataset for practicing predictive modeling workflows, building dashboards, or testing data visualization techniques.
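As one possible starting point for the range-prediction task mentioned above, here is a baseline regression sketch. The column names and the target column "range_km" are assumptions based on the feature list in the description, not a documented schema.

```python
# Baseline regression sketch for EV range prediction. Column names below are
# assumptions based on the feature list in the description, not a documented schema.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("ev_two_wheeler_data.csv")  # hypothetical file name

features = ["speed_kmh", "battery_voltage", "battery_current", "state_of_charge",
            "motor_temp_c", "ambient_temp_c", "regen_braking", "trip_duration_min"]
X, y = df[features], df["range_km"]          # "range_km" is an assumed target column

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print(f"MAE on hold-out set: {mean_absolute_error(y_test, model.predict(X_test)):.2f} km")
```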
Our mission with this project is to provide an always up-to-date and freely accessible map of the cloud landscape for every major cloud service provider.
We've decided to kick things off with collecting SSL certificate data of AWS EC2 machines, considering the value of this data to security researchers. However, we plan to expand the project to include more data and providers in the near future. Your input and suggestions are incredibly valuable to us, so please don't hesitate to reach out on Twitter or Discord and let us know what areas you think we should prioritize next!
For example, to find the origin IP for instacart.com, just search the dataset for instacart.com.
If you are on Linux, you can use the command line as well: download the dataset with curl or wget, cd into the extracted folder, and run: find . -type f -iname "*.csv" -print0 | xargs -0 grep "word"
For example: find . -type f -iname "*.csv" -print0 | xargs -0 grep "instacart.com"
You will then see the matching output.
How can SSL certificate data benefit you? The SSL data is organized into CSV files, with the following properties collected for every found certificate:
IP Address, Common Name, Organization, Country, Locality, Province, Subject Alternative DNS Name, Subject Alternative IP address, Self-signed (boolean)

| IP Address | Common Name | Organization | Country | Locality | Province | Subject Alternative DNS Name | Subject Alternative IP address | Self-signed |
|---|---|---|---|---|---|---|---|---|
| 1.2.3.4 | example.com | Example, Inc. | US | San Francisco | California | example.com | 1.2.3.4 | false |
| 5.6.7.8 | acme.net | Acme, Inc. | US | Seattle | Washington | *.acme.net | 5.6.7.8 | false |

So what can you do with this data?
Enumerate subdomains of your target domains Search for your target's domain names (e.g. example.com) and find hits in the Common Name and Subject Alternative Name fields of the collected certificates. All IP ranges are scanned daily and the dataset gets updated accordingly, so you are very likely to find ephemeral hosts before they are taken down.
Enumerate domains of your target companies Search for your target's company name (e.g. Example, Inc.), find hits in the Organization field, and explore the associated Common Name and Subject Alternative Name fields. The results will probably include subdomains of the domains you're familiar with and if you're in luck you might find new root domains expanding the scope.
Enumerate possible sub-subdomain enumeration targets If the certificate is issued for a wildcard (e.g. *.foo.example.com), chances are there are other subdomains you can find by brute-forcing there. And you know how effective this technique can be. Here are some wordlists to help you with that!
💡 Note: Remember to monitor the dataset for daily updates to get notified whenever a new asset comes up!
Perform IP lookups Search for an IP address (e.g. 3.122.37.147) to find host names associated with it, and explore the Common Name, Subject Alternative Name, and Organization fields to find more information about that address.
Discover origin IP addresses to bypass proxy services When a website is hidden behind security proxy services like Cloudflare, Akamai, Incapsula, and others, it is possible to search for the host name (e.g., example.com) in the dataset. This search may uncover the origin IP address, allowing you to bypass the proxy. We've discussed a similar technique on our blog which you can find here!
Get a fresh dataset of live web servers Each IP address in the dataset corresponds to an HTTPS server running on port 443. You can use this data for large-scale research without needing to spend time collecting it yourself.
Whatever else you can think of If you use this data for a cool project or research, we would love to hear about it!
Additionally, below you will find a detailed explanation of our data collection process and how you can implement the same technique to gather information from your own IP ranges.
TB; DZ (Too big; didn't zoom):
We kick off the workflow with a simple bash script that retrieves AWS's IP ranges. Using a JQ query, we extract the IP ranges of EC2 machines by filtering for .prefixes[] | select(.service=="EC2") | .ip_prefix. Other services are excluded from this workflow since they don't support custom SSL certificates, making their data irrelevant for our dataset.
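The workflow described above uses bash and jq; for readers who prefer Python, the sketch below does the equivalent: it downloads AWS's published IP ranges, keeps only EC2 prefixes, and splits them into smaller chunks (a rough stand-in for mapcidr, which the actual pipeline uses).

```python
# Equivalent of the bash + jq step in Python: download AWS's published IP ranges
# and keep only EC2 prefixes, then split them into smaller chunks for scanning.
# (The original workflow uses jq and mapcidr; this is only a sketch of the same idea.)
import ipaddress
import json
import urllib.request

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

with urllib.request.urlopen(AWS_RANGES_URL) as resp:
    data = json.load(resp)

# jq equivalent: .prefixes[] | select(.service=="EC2") | .ip_prefix
ec2_prefixes = [p["ip_prefix"] for p in data["prefixes"] if p["service"] == "EC2"]
print(f"{len(ec2_prefixes)} EC2 CIDR blocks")

# Rough stand-in for mapcidr: split each block into /18 subnets (~16k hosts each).
def split_block(cidr: str, new_prefix: int = 18):
    net = ipaddress.ip_network(cidr)
    if net.prefixlen >= new_prefix:
        return [net]
    return list(net.subnets(new_prefix=new_prefix))

chunks = [sub for cidr in ec2_prefixes for sub in split_block(cidr)]
print(f"{len(chunks)} scan chunks")
```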
Then, we use mapcidr to divide the IP ranges obtained in step 1 into smaller ranges, each containing up to 100k hosts (thanks, ProjectDiscovery team!). This will come in handy in the next step, when we run the parallel scanning process.
At the time of writing, the EC2 IP ranges include over 57 million IP addresses, so scanning them all on a single machine would be impractical, which is where our file-splitter node comes into play.
This node iterates through the input from mapcidr and triggers individual jobs for each range. When executing this w...
License: https://spdx.org/licenses/CC0-1.0.html
Aim: Despite the wide distribution of many parasites around the globe, the range of individual species varies significantly even among phylogenetically related taxa. Since parasites need suitable hosts to complete their development, parasite geographical and environmental ranges should be limited to communities where their hosts are found. Parasites may also suffer from a trade-off between being locally abundant or widely dispersed. We hypothesize that the geographical and environmental ranges of parasites are negatively associated with their host specificity and their local abundance.
Location: Worldwide
Time period: 2009 to 2021
Major taxa studied: Avian haemosporidian parasites
Methods: We tested these hypotheses using a global database which comprises data on avian haemosporidian parasites from across the world. For each parasite lineage, we computed five metrics: phylogenetic host-range, environmental range, geographical range, and their mean local and total number of observations in the database. Phylogenetic generalized least squares models were run to evaluate the influence of phylogenetic host-range and total and local abundances on geographical and environmental range. In addition, we analysed separately the two regions with the largest amount of available data: Europe and South America.
Results: We evaluated 401 lineages from 757 localities and observed that generalism (i.e. phylogenetic host range) associates positively with both the parasites’ geographical and environmental ranges at the global and European scales. For South America, generalism only associates with geographical range. Finally, mean local abundance (mean local number of parasite occurrences) was negatively related to geographical and environmental range. This pattern was detected worldwide and in South America, but not in Europe.
Main Conclusions: We demonstrate that parasite specificity is linked to both their geographical and environmental ranges. The fact that locally abundant parasites present restricted ranges indicates a trade-off between these two traits. This trade-off, however, only becomes evident when sufficiently heterogeneous host communities are considered.
Methods: We compiled data on haemosporidian lineages from the MalAvi database (http://130.235.244.92/Malavi/, Bensch et al. 2009), including all the data available from the “Grand Lineage Summary” representing the Plasmodium and Haemoproteus genera from wild birds that contained information regarding location. After checking for duplicated sequences, this dataset comprised a total of ~6200 sequenced parasites representing 1602 distinct lineages (775 Plasmodium and 827 Haemoproteus) collected from 1139 different host species and 757 localities from all continents except Antarctica (Supplementary Figure 1, Supplementary Table 1). The parasite lineages deposited in MalAvi are based on a cyt b fragment of 478 bp. This dataset was used to calculate the parasites’ geographical, environmental and phylogenetic ranges.
Geographical range: All analyses in this study were performed using R version 4.02. In order to estimate the geographical range of each parasite lineage, we applied the R package “GeoRange” (Boyle, 2017) and chose the variable minimum spanning tree distance (i.e., the shortest total distance of all lines connecting each locality where a particular lineage has been found).
Using the function “create.matrix” from the “fossil” package, we created a matrix of lineages and coordinates and employed the function “GeoRange_MultiTaxa” to calculate the minimum spanning tree distance for each parasite lineage (i.e. the shortest total distance in kilometers of all lines connecting each locality). Because at least two distinct sites are necessary to calculate this distance, parasites observed in a single locality could not have their geographical range estimated. For this reason, only parasites observed in two or more localities were considered in our phylogenetically controlled least squares (PGLS) models.
Host and environmental diversity: Traditionally, ecologists use Shannon entropy to measure diversity in ecological assemblages (Pielou, 1966). The Shannon entropy of a set of elements is related to the degree of uncertainty someone would have about the identity of a randomly selected element of that set (Jost, 2006). Thus, Shannon entropy matches our intuitive notion of biodiversity, as the more diverse an assemblage is, the more uncertainty there is regarding which species a randomly selected individual belongs to. Shannon diversity increases with both the assemblage richness (e.g., the number of species) and evenness (e.g., uniformity in abundance among species). To compare the diversity of assemblages that vary in richness and evenness in a more intuitive manner, we can normalize diversities by Hill numbers (Chao et al., 2014b). The Hill number of an assemblage represents the effective number of species in the assemblage, i.e., the number of equally abundant species that are needed to give the same value of the diversity metric in that assemblage. Hill numbers can be extended to incorporate phylogenetic information, in which case, instead of species, we are measuring the effective number of phylogenetic entities in the assemblage. Here, we computed phylogenetic host-range as the phylogenetic Hill number associated with the assemblage of hosts found infected by a given parasite. Analyses were performed using the function “hill_phylo” from the “hillr” package (Chao et al., 2014a). Hill numbers are parameterized by a parameter “q” that determines the sensitivity of the metric to relative species abundance. Different “q” values produce Hill numbers associated with different diversity metrics. We set q = 1 to compute the Hill number associated with Shannon diversity. Here, low Hill numbers indicate specialization on a narrow phylogenetic range of hosts, whereas higher Hill numbers indicate generalism across a broader phylogenetic spectrum of hosts. We also used Hill numbers to compute the environmental range of sites occupied by each parasite lineage. First, we collected the 19 bioclimatic variables from WorldClim version 2 (http://www.worldclim.com/version2) for all sites used in this study (N = 713). Then, we standardized the 19 variables by centering and scaling them by their respective mean and standard deviation. Thereafter, we computed the pairwise Euclidean environmental distance among all sites and used this distance to compute a dissimilarity cluster. Finally, as for the phylogenetic Hill number, we used this dissimilarity cluster to compute the environmental Hill number of the assemblage of sites occupied by each parasite lineage. The environmental Hill number for each parasite can be interpreted as the effective number of environmental conditions in which a parasite lineage occurs.
Thus, the higher the environmental Hill number, the more generalist the parasite is regarding the environmental conditions in which it can occur.
Parasite phylogenetic tree: A Bayesian phylogenetic reconstruction was performed. We built a tree for all parasite sequences for which we were able to estimate the parasite’s geographical, environmental and phylogenetic ranges (see above); this represented 401 distinct parasite lineages. This inference was produced using MrBayes 3.2.2 (Ronquist & Huelsenbeck, 2003) with the GTR + I + G model of nucleotide evolution, as recommended by ModelTest (Posada & Crandall, 1998), which selects the best-fit nucleotide substitution model for a set of genetic sequences. We ran four Markov chains simultaneously for a total of 7.5 million generations, sampled every 1000 generations. The first 25% of sampled trees were discarded as a burn-in step and the remaining trees were used to calculate the posterior probabilities of each estimated node in the final consensus tree. Our final tree obtained a cumulative posterior probability of 0.999. Leucocytozoon caulleryi was used as the outgroup to root the phylogenetic tree, as Leucocytozoon spp. represents a basal group within avian haemosporidians (Pacheco et al., 2020).
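The methods above compute Hill numbers with q = 1, i.e. the exponential of Shannon entropy, via hillr's phylogenetic implementation. As a minimal, non-phylogenetic sketch of the underlying quantity (ignoring the phylogenetic weighting that hill_phylo adds), one could write:

```python
# Minimal sketch of the Hill number of order q = 1 (exponential of Shannon entropy),
# ignoring the phylogenetic weighting that hillr's hill_phylo() adds.
import numpy as np

def hill_q1(abundances):
    """Effective number of equally abundant species giving the same Shannon diversity."""
    p = np.asarray(abundances, dtype=float)
    p = p[p > 0] / p.sum()
    shannon_entropy = -np.sum(p * np.log(p))
    return np.exp(shannon_entropy)

# A parasite recorded 10 times on one host species and once on two others:
print(hill_q1([10, 1, 1]))   # ~1.8 effective hosts -> fairly specialized
print(hill_q1([4, 4, 4]))    # 3.0 effective hosts  -> even use of three hosts
```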
Five chemicals [2-ethylhexyl 4-hydroxybenzoate (2-EHHB), 4-nonylphenol-branched (4-NP), 4-tert-octylphenol (4-OP), benzyl butyl phthalate (BBP) and dibutyl phthalate (DBP)] were subjected to a 21-day Amphibian Metamorphosis Assay (AMA) following OCSPP 890.1100 test guidelines. The selected chemicals exhibited estrogenic or androgenic bioactivity in high throughput screening data obtained from US EPA ToxCast models. Xenopus laevis larvae were exposed nominally to each chemical at 3.6, 10.9, 33.0 and 100 µg/L, except 4-NP for which concentrations were 1.8, 5.5, 16.5 and 50 µg/L. Endpoint data (daily or given study day (SD)) collected included: mortality (daily), developmental stage (SD 7 and 21), hind limb length (HLL) (SD 7 and 21), snout-vent length (SVL) (SD 7 and 21), wet body weight (BW) (SD 7 and 21), and thyroid histopathology (SD 21). 4-OP and BBP caused accelerated development compared to controls at the mean measured concentration of 39.8 and 3.5 µg/L, respectively. Normalized HLL was increased on SD 21 for all chemicals except 4-NP. Histopathology revealed mild thyroid follicular cell hypertrophy at all BBP concentrations, while moderate thyroid follicular cell hypertrophy occurred at the 105 µg/L BBP concentration. Evidence of accelerated metamorphic development was also observed histopathologically in BBP-treated frogs at concentrations as low as 3.5 µg/L. Increased BW relative to control occurred for all chemicals except 4-OP. Increase in SVL was observed in larvae exposed to 4-NP, BBP and DBP on SD 21. With the exception of 4-NP, four of the chemicals tested appeared to alter thyroid axis-driven metamorphosis, albeit through different lines of evidence, with BBP and DBP providing the strongest evidence of effects on the thyroid axis. Citation information for this dataset can be found in Data.gov's References section.