Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset presents the mean household income for each of the five quintiles in Amherst, New York, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.
Key observations
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Income Levels:
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Amherst town median household income. You can refer to the same here.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
The credit card dataset comprises various attributes that capture essential information about individual transactions. Each entry in the dataset is uniquely identified by an 'ID', which aids in precise record-keeping and analysis. The 'V1-V28' features encompass a wide range of transaction-related details, including time, location, type, and several other parameters. These attributes collectively provide a comprehensive snapshot of each transaction. 'Amount' denotes the monetary value involved in the transaction, indicating the specific charge or credit associated with the card. Lastly, the 'Class' attribute plays a pivotal role in fraud detection, categorizing transactions into distinct classes like 'legitimate' and 'fraudulent'. This classification is instrumental in identifying potentially suspicious activities, helping financial institutions safeguard against fraudulent transactions. Together, these attributes form a crucial dataset for studying and mitigating risks associated with credit card transactions.
ID: This is likely a unique identifier for a specific credit card transaction. It helps in keeping track of individual transactions and distinguishing them from one another.
V1-V28: These are possibly features or attributes associated with the credit card transaction. They might include information such as time, amount, location, type of transaction, and various other details that can be used for analysis and fraud detection.
Amount: This refers to the monetary value involved in the credit card transaction. It indicates how much money was either charged or credited to the card during that particular transaction.
Class: This is an important attribute indicating the category or type of the transaction. It typically classifies transactions into different groups, like 'fraudulent' or 'legitimate'. This classification is crucial for identifying potentially suspicious or fraudulent activities.
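For orientation, here is a minimal sketch of loading and inspecting such a table with pandas; the filename creditcard.csv is an assumption, and only the 'Class' and 'Amount' columns named above are referenced.

import pandas as pd

# Load the transaction table (the filename is an assumption; adjust to the actual file).
df = pd.read_csv("creditcard.csv")

# Confirm the columns described above (ID, V1-V28, Amount, Class).
print(df.columns.tolist())

# Fraud datasets of this kind are usually highly imbalanced, so check the
# class distribution before any modeling.
print(df["Class"].value_counts(normalize=True))

# Summary statistics of the transaction amounts.
print(df["Amount"].describe())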
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data files are provided in .pkl or .npz format. Some example classes include:
- Animals: beaver, dolphin, otter, elephant, snake.
- Plants: apple, orange, mushroom, palm tree, pine tree.
- Vehicles: bicycle, bus, motorcycle, train, rocket.
- Everyday Objects: clock, keyboard, lamp, table, chair.
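As a rough sketch, either format can be opened with standard Python tooling; the filenames and array keys below are placeholders, not the dataset's actual names.

import pickle
import numpy as np

# Open an .npz archive and list the arrays it contains (filename is a placeholder).
archive = np.load("dataset.npz", allow_pickle=True)
print(archive.files)  # e.g. image and label arrays, whatever the dataset stores

# Or open a .pkl file, if that is the format provided.
with open("dataset.pkl", "rb") as f:
    data = pickle.load(f)
print(type(data))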
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Higher education plays a critical role in driving an innovative economy by equipping students with knowledge and skills demanded by the workforce. While researchers and practitioners have developed data systems to track detailed occupational skills, such as those established by the U.S. Department of Labor (DOL), much less effort has been made to document which of these skills are being developed in higher education at a similar granularity. Here, we fill this gap by presenting Course-Skill Atlas, a longitudinal dataset of skills inferred from over three million course syllabi taught at nearly three thousand U.S. higher education institutions. To construct Course-Skill Atlas, we apply natural language processing to quantify the alignment between course syllabi and detailed workplace activities (DWAs) used by the DOL to describe occupations. We then aggregate these alignment scores to create skill profiles for institutions and academic majors. Our dataset offers a large-scale representation of college education's role in preparing students for the labor market. Overall, Course-Skill Atlas can enable new research on the source of skills in the context of workforce development and provide actionable insights for shaping the future of higher education to meet evolving labor demands, especially in the face of new technologies.
For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the layer's data dictionary.
What is the Tucson Equity Priority Index (TEPI)?
The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent the differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), High (80-100). Each class represents 20% of the dataset's features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.
How is social vulnerability measured?
The Tucson Equity Priority Index (TEPI) examines the proportion of vulnerability per feature using 11 demographic indicators:
- Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
- Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
- Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
- Renter Cost Burdened: Renters who spend more than 30% of their income on rent
- No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
- No Vehicle Access: Households without automobile, van, or truck access
- High School Education or Less: Those whose highest level of educational attainment is a High School diploma, equivalency, or less
- Limited English Ability: Those whose ability to speak English is "Less Than Well."
- People of Color: Those who identify as anything other than Non-Hispanic White
- Disability: Households with one or more physical or cognitive disabilities
- Age: Groups that tend to have higher levels of vulnerability, including children (those below 18) and seniors (those 65 and older)
An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.
How are the variables combined?
These indicators are divided into two main categories that we call Thematic Indices: Economic and Personal Characteristics. The two thematic indices are further divided into five sub-indices called Tier-2 Sub-Indices. Each Tier-2 Sub-Index contains 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to the same scale using values between 0 to 100. The variables are then combined first into each of the five Tier-2 Sub-Indices, then the Thematic Indices, then the overall TEPI using the mean aggregation method and equal weighting. The resulting dataset is then divided into the five classes, where:
High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions that are the most socially vulnerable. These areas require the most focused attention.
Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability compared to the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
Moderate Vulnerability (40-60%): Representing the middle or median quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability. Detailed examination of specific indicators is recommended to understand the nuanced needs of these areas.
Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
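The aggregation described above (percentile normalization, equal-weight mean aggregation, and division into five classes) can be sketched roughly as follows; the indicator columns and the direct collapse to an overall score are illustrative assumptions rather than the published TEPI workflow.

import pandas as pd

# Illustrative indicator values; the real TEPI uses the 11 indicators listed above,
# grouped into Tier-2 Sub-Indices and two Thematic Indices before the overall index.
df = pd.DataFrame({
    "poverty_rate": [0.10, 0.25, 0.05, 0.40, 0.18],
    "unemployment_rate": [0.04, 0.09, 0.03, 0.12, 0.06],
    "no_vehicle_rate": [0.02, 0.15, 0.01, 0.20, 0.08],
})

# Percentile normalization: re-scale each indicator to a 0-100 percentile rank.
normalized = df.rank(pct=True) * 100

# Mean aggregation with equal weighting (collapsed to one step here for brevity).
overall = normalized.mean(axis=1)

# Divide the resulting scores into the five TEPI classes.
labels = ["Low (0-20)", "Low-Moderate (20-40)", "Moderate (40-60)",
          "Moderate-High (60-80)", "High (80-100)"]
classes = pd.cut(overall, bins=[0, 20, 40, 60, 80, 100],
                 labels=labels, include_lowest=True)
print(pd.DataFrame({"TEPI": overall.round(1), "class": classes}))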
Consumptive use (CU) of water is an important factor for determining water availability and groundwater storage. Many regional stakeholders and water-supply managers in the Upper Rio Grande Basin have indicated CU is of primary concern in their water-management strategies, yet CU data is sparse for this area. This polygon feature class, which represents irrigated acres for 2015, is a geospatial component of the U.S. Geological Survey National Water Census Upper Rio Grande Basin (URGB) focus area study's effort to improve quantification of CU in parts of New Mexico, west Texas, and northern Chihuahua. These digital data accompany Ivahnenko, T.I., Flickinger, A.K., Galanter, A.E., Douglas-Mankin, K.R., Pedraza, D.E., and Senay, G.B., 2021, Estimates of public-supply, domestic, and irrigation water withdrawal, use, and trends in the Upper Rio Grande Basin, 1985 to 2015: U.S. Geological Survey Scientific Investigations Report 2021–5036, 31 p., https://doi.org/10.3133/sir20215036.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.
We release our dataset as a set of folders indicating the patch target label (e.g., banana), each containing 1000 subfolders corresponding to the ImageNet output classes.
An example showing how to use the dataset is shown below.
import os

import torch.utils.data
from torchvision import datasets, transforms, models


class ImageFolderWithEmptyDirs(datasets.ImageFolder):
    """This is required for handling empty folders from the ImageFolder class."""

    def find_classes(self, directory):
        # List all class subfolders, but only map the non-empty ones to indices.
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
            raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)
                        if len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx


# Root folder of the released dataset and the available patch target labels.
dataset_folder = 'data/ImageNet-Patch'
available_labels = {487: 'cellular telephone', 513: 'cornet', 546: 'electric guitar',
                    585: 'hair spray', 804: 'soap dispenser', 806: 'sock',
                    878: 'typewriter keyboard', 923: 'plate', 954: 'banana', 968: 'cup'}

# Select the patch target label (here: banana) and point to its folder.
target_label = 954
dataset_folder = os.path.join(dataset_folder, str(target_label))

# Standard ImageNet normalization and preprocessing.
normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
transforms = transforms.Compose([transforms.ToTensor(), normalizer])

dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transforms)
model = models.resnet50(pretrained=True)
loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
model.eval()

# Evaluate robust accuracy and attack success rate over a few batches.
batches = 10
correct, attack_success, total = 0, 0, 0
for batch_idx, (images, labels) in enumerate(loader):
    if batch_idx == batches:
        break
    pred = model(images).argmax(dim=1)
    correct += (pred == labels).sum()
    attack_success += sum(pred == target_label)
    total += pred.shape[0]

accuracy = correct / total
attack_sr = attack_success / total

print("Robust Accuracy: ", accuracy)
print("Attack Success: ", attack_sr)
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Algerian Arabic Scripted Monologue Speech Dataset for the Travel domain, a carefully constructed resource created to support the development of Arabic speech recognition technologies, particularly for applications in travel, tourism, and customer service automation.
This training dataset features 6,000+ high-quality scripted prompt recordings in Algerian Arabic, crafted to simulate real-world Travel industry conversations. It’s ideal for building robust ASR systems, virtual assistants, and customer interaction tools.
The dataset includes a wide spectrum of travel-related interactions to reflect diverse real-world scenarios:
To boost contextual realism, the scripted prompts integrate frequently encountered travel terms and variables:
Every audio file is paired with a verbatim transcription in .TXT format.
Each audio file is enriched with detailed metadata to support advanced analytics and filtering:
Database is provided by ASL Marketing and covers the United States of America. With ASL Marketing, reaching Gen Z has never been easier. Current high school student data can be customized by: Class Year, Date of Birth, Gender, GPA, Geo, Household Income, Ethnicity, Hobbies, College-Bound Interests, College Intent, and Email.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
High-throughput sequencing has become ubiquitous in biomedical sciences. As new technologies emerge and sequencing costs decline, the diversity and volume of available data increases exponentially, and successfully navigating the data becomes more challenging. Though datasets are often hosted by public repositories, scientists must rely on inconsistent annotation to identify and interpret meaningful data. Moreover, the experimental heterogeneity and wide-ranging quality of high-throughput biological data means that even data with desired cell lines, tissue types, or molecular targets may not be readily interpretable or integrated. We have developed ORSO (Online Resource for Social Omics) as an easy-to-use web application to connect life scientists with genomics data. In ORSO, users interact within a data-driven social network, where they can favorite datasets and follow other users. In addition to more than 30,000 datasets hosted from major biomedical consortia, users may contribute their own data to ORSO, facilitating its discovery by other users. Leveraging user interactions, ORSO provides a novel recommendation system to automatically connect users with hosted data. In addition to social interactions, the recommendation system considers primary read coverage information and annotated metadata. Similarities used by the recommendation system are presented by ORSO in a graph display, allowing exploration of dataset associations. The topology of the network graph reflects established biology, with samples from related systems grouped together. We tested the recommendation system using an RNA-seq time course dataset from differentiation of embryonic stem cells to cardiomyocytes. The ORSO recommendation system correctly predicted early data point sources as embryonic stem cells and late data point sources as heart and muscle samples, resulting in recommendation of related datasets. By connecting scientists with relevant data, ORSO provides a critical new service that facilitates wide-ranging research interests.
With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data as it can be found in non-text (e.g., numeric) data as well. However, in text data, it is most prevalent. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class-labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us in performing better text classification and under what kind of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although here we are proposing and evaluating a text classification technique, our main focus is on the handling of the multi-labelity of text data while utilizing the correlation among multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide us insight into how the multi-labelity is handled in our classification process and show the effectiveness of our approach.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General Information
Total instances: 117,008
Instances in the Jacobaea vulgaris class: 58,504
Instances in the Meadow class: 58,504
Image size: 224x224 pixels with three color channels (RGB)
Performance increase of training a ResNet50 on the augmented dataset shared here versus the same architecture on the base dataset: +3.79 percentage points in ROC AUC on an independent test set with 240 instances.
Data Generation and Source
The initial images in this dataset were taken as part of the project “UAV-basiertes Grünlandmonitoring auf Bestands- und Einzelpflanzenebene” (engl. “UAV-based Grassland Monitoring at Population and Individual Plant Level”), financed by the Authority for Economy, Transport, and Innovation of Hamburg. In September 2018, flights with an octocopter were conducted over two extensively used grassland areas in the urban area of Hamburg.
In my master's thesis at DAMS Lab at TU Berlin, I evaluated the effect of different augmentation strategies for Jacobaea vulgaris image classification on several performance metrics (most importantly the ROC AUC score). The identified augmentation strategies were selected not only through performance-based selection but also based on domain knowledge, which I acquired during the research for my master's thesis.
Additional information about the initial image generation process can be found here [p. 45–53] and here.
Augmentations applied
Gaussian Noise: For the Gaussian noise augmentation, the mean of the added noise is set to zero. The lower and upper bounds for the random variance of the noise are 20.4663 and 54.0395 respectively. The bounds were identified by hyperparameter tuning. The search space for the lower bound was set from 5 to 30 and for the upper bound from 31 to 100. Those two search spaces were defined by visual inspection of the effects of applying Gaussian noise with different variance values to images of both classes. The Gaussian noise is sampled for each color channel individually.
Random Brightness and Contrast: The brightness will randomly be increased or decreased by a factor ranging from 0.7010 to 1.2990. The contrast will also be randomly increased or decreased by a factor ranging from 0.5775 to 1.4225. These two ranges were identified using hyperparameter tuning. The search space for the maximal percentage increase or decrease of brightness and contrast was individually set from 1% up to 50%.
Cutout Dropout: In this augmentation method, a certain percentage of the input image is covered by black patches. The patches have a certain size in pixels; the implementation of this technique in this thesis uses square patches. The black patches are randomly introduced into the image by randomly allocating the patches across the image and then setting the corresponding pixel values to zero. The image is covered with patches until the cover percentage is reached. We set the percentage of the image to be randomly covered by black patches to 56.76%. The size of the patches, which randomly cover the image, is set to 4 pixels. A good illustration of this is found in figure 4.2. The augmentation technique is inspired by the research proposed by Devries et al. [8]. Both values were identified by hyperparameter tuning. The search space for the patch size in pixels is categorical and includes the values [1, 2, 4, 7, 8, 14, 16, 28]. These values are all divisors of 224, which is the image width and height in pixels; the patch size needs to divide the width and height evenly in order to be suitable for the algorithm implementation. The search space for the cover percentage of the image had been set from 1% to 60%. This search space narrows the search down to a space where a big part of the image still remains uncovered. The algorithm rearranges the image into a two-dimensional grid and randomly masks rows of this grid by setting the pixel values in those rows to zero. Then the image is rearranged again, now with the randomly generated patches included.
Random Saturation: The saturation of each pixel is randomly shifted. The upper bound for randomly shifting the saturation value of each pixel is set to 231.689%. This value was identified using hyperparameter tuning. For hyperparameter tuning, an upper limit of 40% shift in either direction had been set for the maximal saturation shift.
Horizontal Flip: The image gets flipped along the horizontal axis.
Vertical Flip: The image gets flipped along the vertical axis.
Random Rotation 90 degrees: Randomly rotates the image by k x 90 degrees, where k ∈ {0, 1, 2, 3}.
All augmentation methods, with their tuned augmentation hyperparameters (where applicable), are applied to an image from the test set in figure 4.2. With the seven identified augmentation techniques, a dataset of 800% the size of the original dataset is created. The Augment model is trained on exactly this dataset. Of course, next to the augmented images, the dataset still includes the original, unaugmented images. TensorFlow, along with additional libraries including Optuna for hyperparameter optimization and Albumentations for image augmentation, was used for the implementation of this project.
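For illustration only, the tuned values above could be wired into an Albumentations pipeline roughly as sketched below; the parameter names assume the Albumentations 1.x API, the cutout step is approximated with CoarseDropout rather than the thesis's grid-based routine, and the saturation shift is omitted because its mapping to a library parameter is not specified here.

import albumentations as A

# Illustrative pipeline using the tuned values reported above (Albumentations 1.x API).
augment = A.Compose([
    # Gaussian noise with variance drawn from the tuned bounds, sampled per color channel.
    A.GaussNoise(var_limit=(20.4663, 54.0395), mean=0, per_channel=True, p=1.0),
    # Brightness factor 0.7010-1.2990 and contrast factor 0.5775-1.4225.
    A.RandomBrightnessContrast(brightness_limit=0.299, contrast_limit=0.4225, p=1.0),
    # Approximation of the cutout step: enough 4x4 black patches to cover roughly
    # 56.76% of a 224x224 image if they did not overlap (the thesis uses a custom
    # grid-based masking routine instead).
    A.CoarseDropout(max_holes=1780, max_height=4, max_width=4, fill_value=0, p=1.0),
    # Orientation augmentations.
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
])

# Usage: augmented = augment(image=image)["image"], where image is an HxWxC uint8 array.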
Rationale behind the augmentations applied
Random Rotation, Vertical and Horizontal Flip: These three augmentation strategies were chosen to make the classifier less sensitive to the orientation of the plant. The goal is to train a model that can classify plants regardless of their orientation. In order to achieve this effectively across different orientations, vertical flips, horizontal flips, and random 90-degree rotations are chosen for evaluation.
Random Saturation: The varying saturation of the images simulates different levels of chlorophyll in the leaves, which is responsible for the green color of the leaves and the intensity of this color. The color of the plant parts (leaves, stems, and flowers) is also influenced by factors such as soil, sun, weed density and pressure, location, and water availability. Varying the saturation of the images simulates changes in these factors.
Gaussian Noise: By adding noise, in this case Gaussian noise, different lighting conditions are simulated when capturing the images. We specifically chose Gaussian noise because it is common in many real-world scenarios and is based on the Central Limit Theorem, which states that the sum of many independent random variables tends to be normally distributed. This makes Gaussian noise a logical choice for simulating real-world random noise.
Random Brightness and Contrast: This augmentation uses brightness to mimic varying lighting conditions and contrast to highlight differences between plants by contrasting them more strongly, thereby highlighting their edges. The random contrast is a much softer approach for highlighting edges of plants than the Canny edge detection augmentation, and the other features in the images are changed far less than they would be by edge detection. The method thus combines a weak focus on edges with variations in lighting conditions in one approach.
Cutout Dropout: The cutout augmentation simulates random occlusion by other plants. These occlusions are common and expected. Jacobaea vulgaris plants may be partially or completely obscured by other plants during image capturing. This augmentation technique makes the models more robust to random occlusion.
Data License
The dataset is licensed under the license CC BY 4.0. The attributor of the data is the Chair of Geodesy and Geoinformatics at the University of Rostock. The data was created within the scope of the project 'UAV-based Grassland Monitoring at Population and Individual Plant Level', financed by the Authority for Economy, Transport, and Innovation of Hamburg.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset contains historical price data for Bitcoin (BTC) against the U.S. Dollar (USD), spanning from June 2010 to November 2024. The data is organized on a daily basis and includes key market metrics such as the opening price, closing price, high, low, volume, and market capitalization for each day.
Columns: The dataset consists of the following columns:
Date: The date of the recorded data point (format: YYYY-MM-DD).
Open: The opening price of Bitcoin on that day.
High: The highest price Bitcoin reached on that day.
Low: The lowest price Bitcoin reached on that day.
Close: The closing price of Bitcoin on that day.
Volume: The total trading volume of Bitcoin during that day.
Market Cap: The total market capitalization of Bitcoin on that day (calculated by multiplying the closing price by the circulating supply of Bitcoin at the time).
Source: The data is sourced from Yahoo Finance.
Time Period: The data spans from June 2010, when Bitcoin first began trading, to November 2024. This provides a comprehensive view of Bitcoin’s historical price movements, from its early days of trading at a fraction of a cent to its more recent valuation in the thousands of dollars.
Use Cases:
This dataset is valuable for a variety of purposes, including:
Time Series Analysis: Analyze Bitcoin price movements, identify trends, and develop predictive models for future prices.
Financial Modeling: Use the dataset to assess Bitcoin as an asset class, model its volatility, or simulate investment strategies.
Machine Learning: Train machine learning algorithms to forecast Bitcoin's future price or predict market trends based on historical data.
Economic Research: Study the impact of global events on Bitcoin's price, such as regulatory changes, technological developments, or macroeconomic factors.
Visualization: Generate visualizations of Bitcoin price trends, trading volume, and market capitalization over time.
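As a starting point for the analyses listed above, a short pandas sketch might look like this; the filename btc_usd_daily.csv is an assumption, while the column names follow the description above.

import pandas as pd

# Load the daily BTC-USD table (filename is an assumption; columns follow the description).
btc = pd.read_csv("btc_usd_daily.csv", parse_dates=["Date"])
btc = btc.set_index("Date").sort_index()

# Simple time-series features: daily returns, a 30-day moving average, and rolling volatility.
btc["daily_return"] = btc["Close"].pct_change()
btc["sma_30"] = btc["Close"].rolling(window=30).mean()
btc["volatility_30"] = btc["daily_return"].rolling(window=30).std()

print(btc[["Close", "daily_return", "sma_30", "volatility_30"]].tail())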
Data files containing detailed information about vehicles in the UK are also available, including make and model data.
Some tables have been withdrawn and replaced. The table index for this statistical series has been updated to provide a full map between the old and new numbering systems used in this page.
The Department for Transport is committed to continuously improving the quality and transparency of our outputs, in line with the Code of Practice for Statistics. In line with this, we have recently concluded a planned review of the processes and methodologies used in the production of Vehicle licensing statistics data. The review sought to identify and introduce further improvements and efficiencies in the coding technologies we use to produce our data, and as part of that we have identified several historical errors across the published data tables affecting different historical periods. These errors are the result of mistakes in past production processes that we have now identified, corrected, and taken steps to eliminate going forward.
Most of the revisions to our published figures are small, typically changing values by less than 1% to 3%. The key revisions are:
Licensed Vehicles (2014 Q3 to 2016 Q3)
We found that some unlicensed vehicles during this period were mistakenly counted as licensed. This caused a slight overstatement, about 0.54% on average, in the number of licensed vehicles during this period.
3.5 - 4.25 tonnes Zero Emission Vehicles (ZEVs) Classification
Since 2023, ZEVs weighing between 3.5 and 4.25 tonnes have been classified as light goods vehicles (LGVs) instead of heavy goods vehicles (HGVs). We have now applied this change to earlier data and corrected an error in table VEH0150. As a result, the number of newly registered HGVs has been reduced by:
3.1% in 2024
2.3% in 2023
1.4% in 2022
Table VEH0156 (2018 to 2023)
Table VEH0156, which reports average CO₂ emissions for newly registered vehicles, has been updated for the years 2018 to 2023. Most changes are minor (under 3%), but the e-NEDC measure saw a larger correction, up to 15.8%, due to a calculation error. Revisions to the other measures (WLTP and Reported) were less notable, except for April 2020, when COVID-19 led to very few new registrations, which resulted in greater volatility in the resultant percentages.
Neither these specific revisions, nor any of the others introduced, have had a material impact on the overall statistics, the direction of trends, or the key messages that they previously conveyed.
Specific details of each revision made have been included in the relevant data table notes to ensure transparency and clarity. Users are advised to review these notes as part of their regular use of the data to ensure their analysis accounts for these changes accordingly.
If you have questions regarding any of these changes, please contact the Vehicle statistics team.
Overview
VEH0101: Vehicles at the end of the quarter by licence status and body type: Great Britain and United Kingdom (ODS, 99.7 KB), available at https://assets.publishing.service.gov.uk/media/68ecf5acf159f887526bbd7c/veh0101.ods
Detailed breakdowns
VEH0103: Licensed vehicles at the end of the year by tax class: Great Britain and United Kingdom (ODS, 23.8 KB), available at https://assets.publishing.service.gov.uk/media/68ecf5abf159f887526bbd7b/veh0103.ods
VEH0105: Licensed vehicles at ..., available at https://assets.publishing.service.gov.uk/media/68ecf5ac2adc28a81b4acfc8/veh0105.ods
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset typically includes the following columns:
PassengerId: A unique identifier for each passenger.
Survived: This column indicates whether a passenger survived (1) or did not survive (0).
Pclass (Ticket class): A proxy for socio-economic status, with 1 being the highest class and 3 the lowest.
Name: The name of the passenger.
Sex: The gender of the passenger.
Age: The age of the passenger. (Note: There might be missing values in this column.)
SibSp: The number of siblings or spouses the passenger had aboard the Titanic.
Parch: The number of parents or children the passenger had aboard the Titanic.
Ticket: The ticket number.
Fare: The amount of money the passenger paid for the ticket.
The main goal of using this dataset is to predict whether a passenger survived or not based on various features. It serves as a popular introductory dataset for those learning data analysis, machine learning, and predictive modeling. Keep in mind that the dataset may be subject to variations and updates, so it's always a good idea to check the Kaggle website or dataset documentation for the most recent information.
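A minimal baseline sketch for that prediction task is shown below; it assumes the usual Kaggle file layout (train.csv) and uses an illustrative feature subset, so treat it as a starting point rather than a reference solution.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the training split (assumes the usual Kaggle file layout).
df = pd.read_csv("train.csv")

# Minimal preparation: encode Sex numerically and fill missing Age values.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Age"] = df["Age"].fillna(df["Age"].median())

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["Survived"], test_size=0.2, random_state=42)

# Simple logistic-regression baseline for survival prediction.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))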
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
The English Real Estate Chat Dataset is a high-quality collection of over 12,000 text-based conversations between customers and call center agents. These conversations reflect real-world scenarios within the Real Estate sector, offering rich linguistic data for training conversational AI, chatbots, and NLP systems focused on property-related interactions in English-speaking regions.
The dataset spans a broad range of Real Estate service conversations, covering various customer intents and agent support tasks:
This topic variety enables realistic model training for both lead generation and post-sale engagement scenarios.
Conversations are reflective of natural English used in the Real Estate domain, incorporating:
This level of linguistic realism supports model generalization across dialects and user demographics.
Conversations include a mix of short inquiries and detailed advisory sessions, capturing full customer journeys:
For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the Data Dictionary.
What is the Tucson Equity Priority Index (TEPI)?
The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent the differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), High (80-100). Each class represents 20% of the dataset's features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.
How is social vulnerability measured?
The Tucson Equity Priority Index (TEPI) examines the proportion of vulnerability per feature using 11 demographic indicators:
- Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
- Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
- Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
- Renter Cost Burdened: Renters who spend more than 30% of their income on rent
- No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
- No Vehicle Access: Households without automobile, van, or truck access
- High School Education or Less: Those whose highest level of educational attainment is a High School diploma, equivalency, or less
- Limited English Ability: Those whose ability to speak English is "Less Than Well."
- People of Color: Those who identify as anything other than Non-Hispanic White
- Disability: Households with one or more physical or cognitive disabilities
- Age: Groups that tend to have higher levels of vulnerability, including children (those below 18) and seniors (those 65 and older)
An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.
How are the variables combined?
These indicators are divided into two main categories that we call Thematic Indices: Economic and Personal Characteristics. The two thematic indices are further divided into five sub-indices called Tier-2 Sub-Indices. Each Tier-2 Sub-Index contains 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to the same scale using values between 0 to 100. The variables are then combined first into each of the five Tier-2 Sub-Indices, then the Thematic Indices, then the overall TEPI using the mean aggregation method and equal weighting. The resulting dataset is then divided into the five classes, where:
High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions that are the most socially vulnerable. These areas require the most focused attention.
Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability compared to the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
Moderate Vulnerability (40-60%): Representing the middle or median quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability. Detailed examination of specific indicators is recommended to understand the nuanced needs of these areas.
Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains real passenger information from the Titanic’s tragic voyage in 1912. It includes details like age, gender, ticket class, fare, and whether each passenger survived. The data is commonly used for learning data analysis and building beginner machine-learning models. It helps us explore patterns such as who had higher chances of survival and why.
Passenger ages range from infants to elderly adults, with most travelers falling between 20 and 40 years old. There are some missing values, especially among older passengers and children.
Fares vary widely — lower-class passengers paid small amounts, while first-class travelers paid much higher fares. The distribution is skewed because a few people paid very high ticket prices.
Most passengers were in 3rd class, fewer in 2nd, and the smallest group in 1st class. This shows the ship had many lower-class travelers.
The dataset shows that more people did not survive than survived. Survival rates differ by gender, age, and class, with higher survival among women, children, and first-class passengers.
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Polish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Polish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Polish communication.
Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Polish speech models that understand and respond to authentic Polish accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Polish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Polish speech and language AI applications:
AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.
Curated by FutureBeeAI, this 30 hours dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Spanish speech and language AI applications: