53 datasets found
  1. Income Distribution by Quintile: Mean Household Income in Amherst, New York // 2025 Edition

    • neilsberg.com
    csv, json
    Updated Mar 3, 2025
    Cite
    Neilsberg Research (2025). Income Distribution by Quintile: Mean Household Income in Amherst, New York // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/amherst-ny-median-household-income/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Mar 3, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Amherst, New York
    Variables measured
    Income Level, Mean Household Income
    Measurement technique
    The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. It delineates income distributions across the income quintiles (mentioned above) following an initial analysis and categorization. Subsequently, we adjusted these figures for inflation using the Consumer Price Index Retroactive Series Using Current Methods (R-CPI-U-RS). For additional information about these estimations, please contact us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset presents the mean household income for each of the five quintiles in Amherst, New York, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.

    Key observations

    • Income disparities: The mean income of the lowest quintile (the 20% of households with the lowest income) is $18,852, while the mean income of the highest quintile (the 20% of households with the highest income) is $296,153. This indicates that the top earners earn nearly 16 times as much as the lowest earners.
    • Top 5%: The mean household income for the wealthiest population (top 5%) is $495,426, which is 167.29% of the mean income of the highest quintile and 2,627.98% of the mean income of the lowest quintile.
    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Income Levels:

    • Lowest Quintile
    • Second Quintile
    • Third Quintile
    • Fourth Quintile
    • Highest Quintile
    • Top 5 Percent

    Variables / Data Columns

    • Income Level: This column showcases the income levels (as mentioned above).
    • Mean Household Income: Mean household income, in 2023 inflation-adjusted dollars for the specific income level.
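
    A minimal pandas sketch of working with the two columns above (the filename amherst_income_quintiles.csv is a hypothetical placeholder for the downloaded CSV):

    import pandas as pd

    # Load the downloaded CSV; the filename here is a placeholder.
    df = pd.read_csv("amherst_income_quintiles.csv")

    # Index the mean incomes by income level for easy lookup.
    income = df.set_index("Income Level")["Mean Household Income"]

    # Reproduce the quintile ratio from the key observations (~16x).
    ratio = income["Highest Quintile"] / income["Lowest Quintile"]
    print(f"Highest-to-lowest quintile ratio: {ratio:.1f}x")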

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are thus subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to evaluate the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Amherst town median household income. You can refer to the same here.

  2. High-fidelity Fraudulent Activity Dataset 2023

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Shahzad Aslam (2023). High-fidelity Fraudulent Activity Dataset 2023 [Dataset]. https://www.kaggle.com/datasets/zeesolver/credit-card
    Explore at:
    Available download formats: zip (149,953,614 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Shahzad Aslam
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description
    Context

    The credit card dataset comprises various attributes that capture essential information about individual transactions. Each entry in the dataset is uniquely identified by an 'ID', which aids in precise record-keeping and analysis. The 'V1-V28' features encompass a wide range of transaction-related details, including time, location, type, and several other parameters. These attributes collectively provide a comprehensive snapshot of each transaction. 'Amount' denotes the monetary value involved in the transaction, indicating the specific charge or credit associated with the card. Lastly, the 'Class' attribute plays a pivotal role in fraud detection, categorizing transactions into distinct classes like 'legitimate' and 'fraudulent'. This classification is instrumental in identifying potentially suspicious activities, helping financial institutions safeguard against fraudulent transactions. Together, these attributes form a crucial dataset for studying and mitigating risks associated with credit card transactions.

    Column Details

    ID:

    This is likely a unique identifier for a specific credit card transaction. It helps in keeping track of individual transactions and distinguishing them from one another.

    V1-V28:

    These are possibly features or attributes associated with the credit card transaction. They might include information such as time, amount, location, type of transaction, and various other details that can be used for analysis and fraud detection.

    Amount:

    This refers to the monetary value involved in the credit card transaction. It indicates how much money was either charged or credited to the card during that particular transaction.

    Class:

    This is an important attribute indicating the category or type of the transaction. It typically classifies transactions into different groups, like 'fraudulent' or 'legitimate'. This classification is crucial for identifying potentially suspicious or fraudulent activities.
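
    Because 'Class' is the fraud label, a sensible first step is checking how imbalanced it is. A minimal pandas sketch (the filename creditcard.csv is an assumption about the contents of the zip archive):

    import pandas as pd

    # The filename is a placeholder for the CSV inside the downloaded zip.
    df = pd.read_csv("creditcard.csv")

    # Fraud datasets of this kind are typically heavily imbalanced, so
    # inspect the class distribution before training any model.
    print(df["Class"].value_counts(normalize=True))

    # Quick sanity check on the transaction amounts.
    print(df["Amount"].describe())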

  3. cifar-100-python

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    ThanhTan (2024). cifar-100-python [Dataset]. https://www.kaggle.com/datasets/duongthanhtan/cifar-100-python
    Explore at:
    Available download formats: zip (168,517,675 bytes)
    Dataset updated
    Dec 26, 2024
    Authors
    ThanhTan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CIFAR-100 Dataset

    1. Overview

    • CIFAR-100 is an extension of the CIFAR-10 dataset, with more classes and finer-grained categorization.
    • It contains 100 classes, making it more challenging than CIFAR-10, which has only 10 classes.
    • Each image in CIFAR-100 is labeled with both a fine label (specific category) and a coarse label (broader category, such as animals or vehicles).

    2. Dataset Details

    • Number of Images: 60,000 color images in total.
      • 50,000 for training.
      • 10,000 for testing.
    • Image Size: Each image is a small 32x32 pixel RGB (color) image.
    • Classes: 100 classes, grouped into 20 superclasses.
      • Each superclass contains 5 related classes.

    3. Fine and Coarse Labels

    • Fine Labels: The dataset has specific categories, such as 'apple', 'bicycle', 'rose', etc.
    • Coarse Labels: These are broader categories, like 'fruit', 'flower', 'vehicle', etc.

    4. Applications

    • Image Classification: Used for training models to classify images into their respective categories.
    • Feature Extraction: Useful for benchmarking feature extraction techniques in computer vision.
    • Transfer Learning: Often used to pre-train models for other similar tasks.
    • Deep Learning Research: Commonly used to test architectures like CNNs (Convolutional Neural Networks).

    5. Challenges

    • The images are very small (32x32 pixels), making it harder for models to learn intricate details.
    • High class count (100) increases classification complexity.
    • Intra-class variability and inter-class similarity make it a challenging dataset for classification.

    6. File Format

    • The dataset is usually available in Python-friendly formats like .pkl or .npz.
    • It can also be downloaded and loaded using frameworks like TensorFlow or PyTorch.
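
    For the PyTorch route, a minimal torchvision sketch (this downloads the canonical CIFAR-100 release rather than this specific Kaggle mirror):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # torchvision fetches the standard Python-pickled CIFAR-100 release.
    transform = transforms.ToTensor()
    train_set = datasets.CIFAR100(root="data", train=True, download=True, transform=transform)
    test_set = datasets.CIFAR100(root="data", train=False, download=True, transform=transform)

    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    images, fine_labels = next(iter(loader))
    print(images.shape)            # torch.Size([64, 3, 32, 32])
    print(len(train_set.classes))  # 100 fine-grained classes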

    7. Example Classes

    Some example classes include:

    • Animals: beaver, dolphin, otter, elephant, snake.
    • Plants: apple, orange, mushroom, palm tree, pine tree.
    • Vehicles: bicycle, bus, motorcycle, train, rocket.
    • Everyday Objects: clock, keyboard, lamp, table, chair.

  4. Course-Skill Atlas: A national longitudinal dataset of skills taught in U.S....

    • figshare.com
    application/gzip
    Updated Oct 8, 2024
    Cite
    Alireza Javadian Sabet; Sarah H. Bana; Renzhe Yu; Morgan Frank (2024). Course-Skill Atlas: A national longitudinal dataset of skills taught in U.S. higher education curricula [Dataset]. http://doi.org/10.6084/m9.figshare.25632429.v7
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alireza Javadian Sabet; Sarah H. Bana; Renzhe Yu; Morgan Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Higher education plays a critical role in driving an innovative economy by equipping students with knowledge and skills demanded by the workforce. While researchers and practitioners have developed data systems to track detailed occupational skills, such as those established by the U.S. Department of Labor (DOL), much less effort has been made to document which of these skills are being developed in higher education at a similar granularity. Here, we fill this gap by presenting Course-Skill Atlas -- a longitudinal dataset of skills inferred from over three million course syllabi taught at nearly three thousand U.S. higher education institutions. To construct Course-Skill Atlas, we apply natural language processing to quantify the alignment between course syllabi and detailed workplace activities (DWAs) used by the DOL to describe occupations. We then aggregate these alignment scores to create skill profiles for institutions and academic majors. Our dataset offers a large-scale representation of college education's role in preparing students for the labor market. Overall, Course-Skill Atlas can enable new research on the source of skills in the context of workforce development and provide actionable insights for shaping the future of higher education to meet evolving labor demands, especially in the face of new technologies.
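
    The alignment method is only summarized above; as a hedged illustration of the general idea (not the authors' actual pipeline), one could score syllabus-to-DWA alignment with sentence embeddings and cosine similarity. The model name and both example texts below are illustrative assumptions:

    from sentence_transformers import SentenceTransformer, util

    # Illustrative inputs; real syllabi come from the dataset and DWA
    # statements from the DOL's occupational data.
    syllabus_text = "Students will learn to fit and interpret linear regression models."
    dwa_text = "Analyze data to identify trends or relationships among variables."

    model = SentenceTransformer("all-MiniLM-L6-v2")
    syllabus_vec = model.encode(syllabus_text, convert_to_tensor=True)
    dwa_vec = model.encode(dwa_text, convert_to_tensor=True)

    # Cosine similarity as a stand-in for an alignment score.
    print(util.cos_sim(syllabus_vec, dwa_vec).item())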

  5. Tucson Equity Priority Index (TEPI): Citywide Census Tracts

    • hub.arcgis.com
    • teds.tucsonaz.gov
    Updated Jun 27, 2024
    Cite
    City of Tucson (2024). Tucson Equity Priority Index (TEPI): Citywide Census Tracts [Dataset]. https://hub.arcgis.com/datasets/1ec436c7358c47739872078ecb1d0c44
    Explore at:
    Dataset updated
    Jun 27, 2024
    Dataset authored and provided by
    City of Tucson
    Description

    For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the layer's data dictionary.

    What is the Tucson Equity Priority Index (TEPI)?

    The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent the differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), High (80-100). Each class represents 20% of the dataset's features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.

    How is social vulnerability measured?

    The TEPI examines the proportion of vulnerability per feature using 11 demographic indicators:

    • Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
    • Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
    • Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
    • Renter Cost Burdened: Renters who spend more than 30% of their income on rent
    • No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
    • No Vehicle Access: Households without automobile, van, or truck access
    • High School Education or Less: Those whose highest level of educational attainment is a high school diploma, equivalency, or less
    • Limited English Ability: Those whose ability to speak English is "less than well"
    • People of Color: Those who identify as anything other than Non-Hispanic White
    • Disability: Households with one or more physical or cognitive disabilities
    • Age: Groups that tend to have higher levels of vulnerability, including children (those below 18) and seniors (those 65 and older)

    An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.

    How are the variables combined?

    These indicators are divided into two main categories that we call Thematic Indices: Economic and Personal Characteristics. The two Thematic Indices are further divided into five sub-indices called Tier-2 Sub-Indices. Each Tier-2 Sub-Index contains 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to the same scale using values between 0 and 100. The variables are then combined first into each of the five Tier-2 Sub-Indices, then the Thematic Indices, then the overall TEPI, using the mean aggregation method and equal weighting. The resulting dataset is then divided into the five classes, where:

    • High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions that are the most socially vulnerable. These areas require the most focused attention.
    • Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability compared to the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
    • Moderate Vulnerability (40-60%): Representing the middle or median quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability. Detailed examination of specific indicators is recommended to understand the nuanced needs of these areas.
    • Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
    • Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
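
    A minimal pandas sketch of the scoring recipe described above (percentile normalization, equal-weight mean aggregation, then five classes of 20% each); the two indicator columns and their values are hypothetical, and the real index aggregates through Tier-2 Sub-Indices and Thematic Indices first:

    import pandas as pd

    # Hypothetical indicator values for five census tracts.
    df = pd.DataFrame({
        "pct_poverty": [5.0, 12.0, 30.0, 8.0, 21.0],
        "pct_unemployed": [3.0, 6.0, 11.0, 4.0, 9.0],
    })

    # Percentile normalization: rescale each indicator to 0-100 by rank.
    normalized = df.rank(pct=True) * 100

    # Equal-weight mean aggregation into one score per feature.
    score = normalized.mean(axis=1)

    # Five classes, each holding 20% of the features in score order.
    labels = ["Low", "Low-Moderate", "Moderate", "Moderate-High", "High"]
    classes = pd.qcut(score, q=5, labels=labels)
    print(pd.DataFrame({"score": score, "class": classes}))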

  6. Data from: 2015 Irrigated acres feature class for the Upper Rio Grande Basin, New Mexico and Texas, United States and Chihuahua, Mexico

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). 2015 Irrigated acres feature class for the Upper Rio Grande Basin, New Mexico and Texas, United States and Chihuahua, Mexico [Dataset]. https://catalog.data.gov/dataset/2015-irrigated-acres-feature-class-for-the-upper-rio-grande-basin-new-mexico-and-texas-uni
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Chihuahua, Rio Grande, Texas, New Mexico, Mexico, United States
    Description

    Consumptive use (CU) of water is an important factor for determining water availability and groundwater storage. Many regional stakeholders and water-supply managers in the Upper Rio Grande Basin have indicated CU is of primary concern in their water-management strategies, yet CU data is sparse for this area. This polygon feature class, which represents irrigated acres for 2015, is a geospatial component of the U.S. Geological Survey National Water Census Upper Rio Grande Basin (URGB) focus area study's effort to improve quantification of CU in parts of New Mexico, west Texas, and northern Chihuahua. These digital data accompany Ivahnenko, T.I., Flickinger, A.K., Galanter, A.E., Douglas-Mankin, K.R., Pedraza, D.E., and Senay, G.B., 2021, Estimates of public-supply, domestic, and irrigation water withdrawal, use, and trends in the Upper Rio Grande Basin, 1985 to 2015: U.S. Geological Survey Scientific Investigations Report 2021–5036, 31 p., https://doi.org/10.3133/sir20215036.

  7. Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 30, 2022
    Cite
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6568777
    Explore at:
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    University of Genoa, Italy
    University of Cagliari, Italy
    Authors
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

    We release our dataset as a set of folders indicating the patch target label (e.g., banana), each containing 1,000 subfolders, one for each of the ImageNet output classes.

    An example showing how to use the dataset is shown below.

    Code for testing the robustness of a model:

    import os
    import os.path

    import torch.utils.data
    from torchvision import datasets, transforms, models

    class ImageFolderWithEmptyDirs(datasets.ImageFolder):
      """
      This is required for handling empty folders from the ImageFolder class.
      """

      def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
          raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx

    # extract and unzip the dataset, then write the top folder here
    dataset_folder = 'data/ImageNet-Patch'

    available_labels = {
      487: 'cellular telephone',
      513: 'cornet',
      546: 'electric guitar',
      585: 'hair spray',
      804: 'soap dispenser',
      806: 'sock',
      878: 'typewriter keyboard',
      923: 'plate',
      954: 'banana',
      968: 'cup',
    }

    # select the folder with a specific target label
    target_label = 954
    dataset_folder = os.path.join(dataset_folder, str(target_label))

    # standard ImageNet normalization; 'preprocess' avoids shadowing the
    # torchvision.transforms module
    normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                      std=[0.229, 0.224, 0.225])
    preprocess = transforms.Compose([transforms.ToTensor(), normalizer])

    dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=preprocess)
    model = models.resnet50(pretrained=True)
    loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
    model.eval()

    # count correct predictions and successful attacks over a few batches
    batches = 10
    correct, attack_success, total = 0, 0, 0
    for batch_idx, (images, labels) in enumerate(loader):
      if batch_idx == batches:
        break
      pred = model(images).argmax(dim=1)
      correct += (pred == labels).sum()
      attack_success += (pred == target_label).sum()
      total += pred.shape[0]

    accuracy = correct / total
    attack_sr = attack_success / total

    print("Robust Accuracy: ", accuracy)
    print("Attack Success: ", attack_sr)

  8. French Scripted Monologue Speech Data in Travel Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). French Scripted Monologue Speech Data in Travel Domain [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/travel-scripted-speech-monologues-french-france
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Algerian Arabic Scripted Monologue Speech Dataset for the Travel domain, a carefully constructed resource created to support the development of Arabic speech recognition technologies, particularly for applications in travel, tourism, and customer service automation.

    Speech Data

    This training dataset features 6,000+ high-quality scripted prompt recordings in Algerian Arabic, crafted to simulate real-world Travel industry conversations. It’s ideal for building robust ASR systems, virtual assistants, and customer interaction tools.

    Participant Diversity
    Speakers: 60 native Algerian Arabic speakers.
    Geographic Coverage: Participants from multiple regions across Algeria to ensure rich diversity in dialects and accents.
    Demographics: Age range from 18 to 70 years, with a gender ratio of approximately 60% male and 40% female.
    Recording Details
    Prompt Type: Scripted monologue-style prompts.
    Duration: Each audio sample ranges from 5 to 30 seconds.
    Audio Format: WAV files with mono channels, 16-bit depth, and 8 kHz / 16 kHz sample rates.
    Environment: Clean, quiet, echo-free spaces to ensure high-quality recordings.
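
    A small sketch using only the Python standard library (the filename is hypothetical) can verify that a downloaded recording matches the stated format:

    import wave

    # Hypothetical path to one of the downloaded recordings.
    with wave.open("sample_0001.wav", "rb") as wav:
        assert wav.getnchannels() == 1               # mono
        assert wav.getsampwidth() == 2               # 16-bit = 2 bytes/sample
        assert wav.getframerate() in (8000, 16000)   # 8 kHz or 16 kHz
        duration = wav.getnframes() / wav.getframerate()
        print(f"{duration:.1f} s")                   # expected range: 5-30 s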

    Topic Coverage

    The dataset includes a wide spectrum of travel-related interactions to reflect diverse real-world scenarios:

    Booking and reservation dialogues
    Customer support and general inquiries
    Destination-specific guidance
    Technical and login help
    Promotional offers and travel deals
    Service availability and policy information
    Domain-specific statements

    Context Elements

    To boost contextual realism, the scripted prompts integrate frequently encountered travel terms and variables:

    Names: Common Algerian male and female names
    Addresses: Regional address formats and locality names
    Dates & Times: Booking dates, travel periods, and time-based interactions
    Destinations: Mention of cities, countries, airports, and tourist landmarks
    Prices & Numbers: Cost of flights, hotel rates, promotional discounts, etc.
    Booking & Confirmation Codes: Typical ticketing and travel identifiers

    Transcription

    Every audio file is paired with a verbatim transcription in .TXT format.

    Consistency: Each transcript matches its corresponding audio file exactly.
    Accuracy: Transcriptions are reviewed and verified by native Algerian Arabic speakers.
    Usability: File names are synced across audio and text for easy integration.

    Metadata

    Each audio file is enriched with detailed metadata to support advanced analytics and filtering:

    Participant Metadata: Unique ID, age, gender, region/state,

  9. USA High School Student Marketing Database by ASL Marketing

    • datarade.ai
    Updated Dec 19, 2019
    Cite
    ASL Marketing (2019). USA High School Student Marketing Database by ASL Marketing [Dataset]. https://datarade.ai/data-products/high-school-student-data
    Explore at:
    Dataset updated
    Dec 19, 2019
    Dataset authored and provided by
    ASL Marketing
    Area covered
    United States
    Description

    Database is provided by ASL Marketing and covers the United States of America. With ASL Marketing, reaching Gen Z has never been easier. Current high school student data can be customized by:

    • Class year
    • Date of birth
    • Gender
    • GPA
    • Geo
    • Household income
    • Ethnicity
    • Hobbies
    • College-bound
    • Interests
    • College intent
    • Email

  10. ORSO (Online Resource for Social Omics): A data-driven social network connecting scientists to genomics datasets

    • figshare.com
    • plos.figshare.com
    tiff
    Updated Feb 5, 2020
    Cite
    Christopher A. Lavender; Andrew J. Shapiro; Frank S. Day; David C. Fargo (2020). ORSO (Online Resource for Social Omics): A data-driven social network connecting scientists to genomics datasets [Dataset]. http://doi.org/10.1371/journal.pcbi.1007571
    Explore at:
    Available download formats: tiff
    Dataset updated
    Feb 5, 2020
    Dataset provided by
    PLOS Computational Biology
    Authors
    Christopher A. Lavender; Andrew J. Shapiro; Frank S. Day; David C. Fargo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-throughput sequencing has become ubiquitous in biomedical sciences. As new technologies emerge and sequencing costs decline, the diversity and volume of available data increases exponentially, and successfully navigating the data becomes more challenging. Though datasets are often hosted by public repositories, scientists must rely on inconsistent annotation to identify and interpret meaningful data. Moreover, the experimental heterogeneity and wide-ranging quality of high-throughput biological data means that even data with desired cell lines, tissue types, or molecular targets may not be readily interpretable or integrated. We have developed ORSO (Online Resource for Social Omics) as an easy-to-use web application to connect life scientists with genomics data. In ORSO, users interact within a data-driven social network, where they can favorite datasets and follow other users. In addition to more than 30,000 datasets hosted from major biomedical consortia, users may contribute their own data to ORSO, facilitating its discovery by other users. Leveraging user interactions, ORSO provides a novel recommendation system to automatically connect users with hosted data. In addition to social interactions, the recommendation system considers primary read coverage information and annotated metadata. Similarities used by the recommendation system are presented by ORSO in a graph display, allowing exploration of dataset associations. The topology of the network graph reflects established biology, with samples from related systems grouped together. We tested the recommendation system using an RNA-seq time course dataset from differentiation of embryonic stem cells to cardiomyocytes. The ORSO recommendation system correctly predicted early data point sources as embryonic stem cells and late data point sources as heart and muscle samples, resulting in recommendation of related datasets. By connecting scientists with relevant data, ORSO provides a critical new service that facilitates wide-ranging research interests.

  11. Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what kinds of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although we are proposing and evaluating a text classification technique here, our main focus is on the handling of the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
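
    As a minimal sketch of the pseudo-label idea (treating each distinct combination of class labels as one new class; this illustrates the concept, not necessarily the exact construction used in pseudo-LSC):

    # Each instance carries a subset of class labels (multi-label data).
    y = [("sports",), ("sports", "politics"), ("tech",), ("sports", "politics")]

    # Map each distinct label combination to a single pseudo label.
    pseudo_ids = {}
    pseudo_labels = [pseudo_ids.setdefault(frozenset(labels), len(pseudo_ids))
                     for labels in y]

    print(pseudo_labels)  # [0, 1, 2, 1]: label combinations become single classes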

  12. Jacobaea vulgaris and meadow Augmented image classification dataset (binary)

    • data.niaid.nih.gov
    Updated Jun 27, 2024
    Cite
    University of Rostock; Technische Universität Berlin (2024). Jacobaea vulgaris and meadow Augmented image classification dataset (binary) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12547684
    Explore at:
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    Chair of Geodesy and Geoinformatics
    DAMS Lab
    Authors
    University of Rostock; Technische Universität Berlin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    Total instances: 117,008
    Instances in the Jacobaea vulgaris class: 58,504
    Instances in the Meadow class: 58,504
    Image sizes: 224x224 pixels on three color channels (RGB)

    Performance increase from training a ResNet50 on the base dataset versus the same architecture on the augmented dataset shared here: +3.79 percentage points in ROC AUC on an independent test set with 240 instances.

    Data Generation and Source

    The initial images in this dataset were taken as part of the project “UAV-basiertes Grünlandmonitoring auf Bestands- und Einzelpflanzenebene” (engl. “UAV-based Grassland Monitoring at Population and Individual Plant Level”), financed by the Authority for Economy, Transport, and Innovation of Hamburg. In September 2018, flights with an octocopter were conducted over two extensively used grassland areas in the urban area of Hamburg.

    In my master's thesis at the DAMS Lab at TU Berlin, I evaluated the effect of different augmentation strategies for Jacobaea vulgaris image classification on several performance metrics (most importantly the ROC AUC score). The identified augmentation strategies were selected based on performance as well as on domain knowledge, which I acquired during the research for my master's thesis.

    Additional information about the initial image generation process is to be found here [p. 45–53] and here.

    Augmentations applied

    Gaussian Noise: For the Gaussian noise augmentation, the mean of the added noise is set to zero. The lower and upper bounds for the random variance of the noise are 20.4663 and 54.0395 respectively. The bounds were identified by hyperparameter tuning. The search space for the lower bound was set from 5 to 30 and for the upper bound from 31 to 100. Those two search spaces were defined by visual inspection of the effects of applying Gaussian noise with different variance values to images of both classes. The Gaussian noise is sampled for each color channel individually.

    Random Brightness and Contrast: The brightness is randomly increased or decreased by a factor ranging from 0.7010 to 1.2990. The contrast is likewise randomly varied by a factor ranging from 0.5775 to 1.4225. Those two ranges were identified using hyperparameter tuning. The search space for the maximum percentage increase or decrease of brightness and contrast was individually set from 1% to at most a 50% increase or decrease.

    Cutout Dropout: In this augmentation method, a certain percentage of the input image is covered by black patches. The patches have a certain size in pixels; the implementation in this thesis uses square patches. The black patches are randomly introduced into the image by randomly allocating the patches across the image and setting the corresponding pixel values to zero. The image is covered with patches until the cover percentage is reached. We set the percentage of the image to be randomly covered by black patches to 56.76%. The size of the patches, which randomly cover the image, is set to 4 pixels. A good illustration of this is found in figure 4.2. The augmentation technique is inspired by the research proposed by Devries et al. [8]. Both values were identified by hyperparameter tuning. The search space for the patch size in pixels is categorical and includes the values [1, 2, 4, 7, 8, 14, 16, 28]. Those values are all divisors of 224, the image width and height in pixels; the patch size needs to divide the width and height evenly in order to be suitable for the algorithm implementation. The search space for the cover percentage of the image was set from 1% to 60%, which narrows the search down to a space where a large part of the image still remains uncovered. The algorithm rearranges the image into a two-dimensional grid and randomly masks rows of this grid by setting the pixel values in each masked row to zero. Then the image gets rearranged, now with the randomly generated patches included.

    Random Saturation: The saturation of each pixel is randomly shifted. The upper bound for the random saturation shift of each pixel is set to 231.689%. This value was identified using hyperparameter tuning, for which the maximum saturation shift had been limited to 40% in either direction.

    Horizontal Flip: The image gets flipped along the horizontal axis.

    Vertical Flip: The image gets flipped along the vertical axis.

    Random Rotation 90 Degrees: Randomly rotates the image by k × 90 degrees, where k ∈ {0, 1, 2, 3}.

    All augmentation methods, with their tuned augmentation hyperparameters (where applicable), are applied to an image from the test set in figure 4.2. With the seven identified augmentation techniques, a dataset of 800% the size of the original dataset is created. The Augment model is trained on exactly this dataset. Next to the augmented images, the dataset of course still includes the original, unaugmented images. TensorFlow, along with additional libraries including Optuna for hyperparameter optimization and Albumentations for image augmentation, was used for the implementation of this project.
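
    Since the text names Albumentations, here is a hedged sketch of how the described pipeline could be expressed with that library. The parameter values come from the text; the hole count for the cutout is derived from the stated 56.76% coverage of a 224x224 image with 4-pixel patches, and the probabilities and exact wiring are assumptions, not the author's code:

    import albumentations as A

    # One transform per augmentation technique described above.
    augment = A.Compose([
        A.GaussNoise(var_limit=(20.4663, 54.0395), per_channel=True, p=1.0),
        # brightness factor 0.7010-1.2990, contrast factor 0.5775-1.4225
        A.RandomBrightnessContrast(brightness_limit=0.2990, contrast_limit=0.4225, p=1.0),
        # ~56.76% of a 224x224 image covered by 4x4 patches -> about 1780 holes
        A.CoarseDropout(max_holes=1780, max_height=4, max_width=4, fill_value=0, p=1.0),
        A.HueSaturationValue(hue_shift_limit=0, sat_shift_limit=40, val_shift_limit=0, p=1.0),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomRotate90(p=1.0),
    ])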

    Rationale behind the augmentations applied

    Random Rotation, Vertical and Horizontal Flip: These three augmentation strategies were chosen to make the classifier less sensitive to the orientation of the plant. The goal is to train a model that can classify plants regardless of their orientation. In order to achieve this effectively across different orientations, vertical flips, horizontal flips, and random 90-degree rotations are chosen for evaluation.

    Random Saturation: The varying saturation of the images simulates different levels of chlorophyll in the leaves, which is responsible for the green color of the leaves and the intensity of this color. The color of the plant parts (leaves, stems, and flowers) is also influenced by factors such as soil, sun, weed density and pressure, location, and water availability. Varying the saturation of the images simulates changes in these factors.

    Gaussian Noise: By adding noise, in this case Gaussian noise, different lighting conditions are simulated when capturing the images. We specifically chose Gaussian noise because it is common in many real-world scenarios and is grounded in the Central Limit Theorem, which states that the sum of many independent random variables tends to be normally distributed. This makes Gaussian noise a logical choice for simulating real-world random noise.

    Random Brightness Contrast: The Random Brightness and Random Contrast augmentation uses brightness to mimic varying lighting conditions and contrast to accentuate differences between plants, thereby highlighting their edges. This is a much softer approach to highlighting plant edges than a Canny edge detection augmentation, combining a weak focus on edges with variations in lighting conditions in one method; the other features in the images change far less than they would under an edge detection augmentation.

    Cutout Dropout: The cutout augmentation simulates random occlusion by other plants. These occlusions are common and expected. Jacobaea vulgaris plants may be partially or completely obscured by other plants during image capturing. This augmentation technique makes the models more robust to random occlusion.

    Data License

    The dataset is licensed under the license CC BY 4.0. The attributor of the data is the Chair of Geodesy and Geoinformatics at the University of Rostock. The data was created within the scope of the project 'UAV-based Grassland Monitoring at Population and Individual Plant Level', financed by the Authority for Economy, Transport, and Innovation of Hamburg.

  13. BTC-USD Price Data (June 2010 - November 2024)

    • kaggle.com
    zip
    Updated Nov 30, 2024
    Cite
    Farhan Ali (2024). BTC-USD Price Data (June 2010 - November 2024) [Dataset]. https://www.kaggle.com/datasets/farhanali097/btc-usd-price-data-june-2010-november-2024
    Explore at:
    Available download formats: zip (107,769 bytes)
    Dataset updated
    Nov 30, 2024
    Authors
    Farhan Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains historical price data for Bitcoin (BTC) against the U.S. Dollar (USD), spanning from June 2010 to November 2024. The data is organized on a daily basis and includes key market metrics such as the opening price, closing price, high, low, volume, and market capitalization for each day.

    Columns: The dataset consists of the following columns:

    • Date: The date of the recorded data point (format: YYYY-MM-DD).
    • Open: The opening price of Bitcoin on that day.
    • High: The highest price Bitcoin reached on that day.
    • Low: The lowest price Bitcoin reached on that day.
    • Close: The closing price of Bitcoin on that day.
    • Volume: The total trading volume of Bitcoin during that day.
    • Market Cap: The total market capitalization of Bitcoin on that day (calculated by multiplying the closing price by the circulating supply of Bitcoin at the time).

    Source: The data is sourced from Yahoo Finance.

    Time Period: The data spans from June 2010, when Bitcoin first began trading, to November 2024. This provides a comprehensive view of Bitcoin’s historical price movements, from its early days of trading at a fraction of a cent to its more recent valuation in the thousands of dollars.

    Use Cases:

    This dataset is valuable for a variety of purposes, including:

    • Time Series Analysis: Analyze Bitcoin price movements, identify trends, and develop predictive models for future prices.
    • Financial Modeling: Use the dataset to assess Bitcoin as an asset class, model its volatility, or simulate investment strategies.
    • Machine Learning: Train machine learning algorithms to forecast Bitcoin's future price or predict market trends based on historical data.
    • Economic Research: Study the impact of global events on Bitcoin's price, such as regulatory changes, technological developments, or macroeconomic factors.
    • Visualization: Generate visualizations of Bitcoin price trends, trading volume, and market capitalization over time.
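
    As a starting point for the time-series use case, a minimal pandas sketch (btc_usd.csv is a hypothetical filename for the extracted CSV) computes daily returns and a rolling volatility estimate:

    import pandas as pd

    # Filename is a placeholder for the CSV inside the downloaded zip.
    df = pd.read_csv("btc_usd.csv", parse_dates=["Date"]).set_index("Date").sort_index()

    # Daily simple returns from the closing price.
    df["Return"] = df["Close"].pct_change()

    # 30-day rolling volatility (standard deviation of daily returns).
    df["Volatility30d"] = df["Return"].rolling(30).std()

    print(df[["Close", "Return", "Volatility30d"]].tail())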

  14. Vehicle licensing statistics data tables

    • gov.uk
    • s3.amazonaws.com
    Updated Oct 15, 2025
    Cite
    Department for Transport (2025). Vehicle licensing statistics data tables [Dataset]. https://www.gov.uk/government/statistical-data-sets/vehicle-licensing-statistics-data-tables
    Explore at:
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    GOV.UK
    Authors
    Department for Transport
    Description

    Data files containing detailed information about vehicles in the UK are also available, including make and model data.

    Some tables have been withdrawn and replaced. The table index for this statistical series has been updated to provide a full map between the old and new numbering systems used in this page.

    The Department for Transport is committed to continuously improving the quality and transparency of our outputs, in line with the Code of Practice for Statistics. In line with this, we have recently concluded a planned review of the processes and methodologies used in the production of Vehicle licensing statistics data. The review sought to identify and introduce further improvements and efficiencies in the coding technologies we use to produce our data, and as part of that, we identified several historical errors across the published data tables affecting different historical periods. These errors are the result of mistakes in past production processes that we have now identified, corrected, and taken steps to eliminate going forward.

    Most of the revisions to our published figures are small, typically changing values by less than 1% to 3%. The key revisions are:

    Licensed Vehicles (2014 Q3 to 2016 Q3)

    We found that some unlicensed vehicles during this period were mistakenly counted as licensed. This caused a slight overstatement, about 0.54% on average, in the number of licensed vehicles during this period.

    3.5 - 4.25 tonnes Zero Emission Vehicles (ZEVs) Classification

    Since 2023, ZEVs weighing between 3.5 and 4.25 tonnes have been classified as light goods vehicles (LGVs) instead of heavy goods vehicles (HGVs). We have now applied this change to earlier data and corrected an error in table VEH0150. As a result, the number of newly registered HGVs has been reduced by:

    • 3.1% in 2024

    • 2.3% in 2023

    • 1.4% in 2022

    Table VEH0156 (2018 to 2023)

    Table VEH0156, which reports average CO₂ emissions for newly registered vehicles, has been updated for the years 2018 to 2023. Most changes are minor (under 3%), but the e-NEDC measure saw a larger correction, up to 15.8%, due to a calculation error. The other measures (WLTP and Reported) were affected less notably, except for April 2020, when COVID-19 led to very few new registrations, which in turn produced greater volatility in the resulting percentages.

    Neither these specific revisions nor any of the others introduced have had a material impact on the overall statistics, the direction of trends, or the key messages they previously conveyed.

    Specific details of each revision made have been included in the relevant data table notes to ensure transparency and clarity. Users are advised to review these notes as part of their regular use of the data to ensure their analysis accounts for these changes accordingly.

    If you have questions regarding any of these changes, please contact the Vehicle statistics team.

    All vehicles

    Licensed vehicles

    Overview

    VEH0101 (https://assets.publishing.service.gov.uk/media/68ecf5acf159f887526bbd7c/veh0101.ods): Vehicles at the end of the quarter by licence status and body type: Great Britain and United Kingdom (ODS, 99.7 KB)

    Detailed breakdowns

    VEH0103 (https://assets.publishing.service.gov.uk/media/68ecf5abf159f887526bbd7b/veh0103.ods): Licensed vehicles at the end of the year by tax class: Great Britain and United Kingdom (ODS, 23.8 KB)

    VEH0105 (https://assets.publishing.service.gov.uk/media/68ecf5ac2adc28a81b4acfc8/veh0105.ods): Licensed vehicles at

  15. titanic_dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    mahmoud shogaa (2023). titanic_dataset [Dataset]. https://www.kaggle.com/datasets/mahmoudshogaa/titanic-dataset
    Explore at:
    Available download formats: zip (22,491 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    mahmoud shogaa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset typically includes the following columns:

    • PassengerId: A unique identifier for each passenger.
    • Survived: Indicates whether a passenger survived (1) or did not survive (0).
    • Pclass (Ticket class): A proxy for socio-economic status, with 1 being the highest class and 3 the lowest.
    • Name: The name of the passenger.
    • Sex: The gender of the passenger.
    • Age: The age of the passenger. (Note: there may be missing values in this column.)
    • SibSp: The number of siblings or spouses the passenger had aboard the Titanic.
    • Parch: The number of parents or children the passenger had aboard the Titanic.
    • Ticket: The ticket number.
    • Fare: The amount of money the passenger paid for the ticket.

    The main goal of using this dataset is to predict whether a passenger survived or not based on various features. It serves as a popular introductory dataset for those learning data analysis, machine learning, and predictive modeling. Keep in mind that the dataset may be subject to variations and updates, so it's always a good idea to check the Kaggle website or dataset documentation for the most recent information.
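
    For a quick first look at the prediction target, a minimal pandas sketch (train.csv is the conventional Kaggle filename and an assumption here) computes survival rates by sex and class:

    import pandas as pd

    # Conventional Kaggle filename; adjust to the actual file in this archive.
    df = pd.read_csv("train.csv")

    # Overall survival rate, then broken down by two strong predictors.
    print(df["Survived"].mean())
    print(df.groupby("Sex")["Survived"].mean())
    print(df.groupby("Pclass")["Survived"].mean())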

  16. English Agent-Customer Chat Dataset for Real Estate

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English Agent-Customer Chat Dataset for Real Estate [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-realestate-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English Real Estate Chat Dataset is a high-quality collection of over 12,000 text-based conversations between customers and call center agents. These conversations reflect real-world scenarios within the Real Estate sector, offering rich linguistic data for training conversational AI, chatbots, and NLP systems focused on property-related interactions in English-speaking regions.

    Participant & Chat Overview

    Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both speakers
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative interactions included

    Topic Diversity

    The dataset spans a broad range of Real Estate service conversations, covering various customer intents and agent support tasks:

    Inbound Chats (Customer-Initiated)
    Property inquiries (buy/rent)
    Rental property availability
    Renovation and maintenance inquiries
    Property features and amenities
    Investment advice and ROI analysis
    Property ownership and legal history
    Outbound Chats (Agent-Initiated)
    New property listing announcements
    Post-purchase follow-ups
    Investment opportunity alerts
    Property valuation updates
    Customer satisfaction and feedback surveys

    This topic variety enables realistic model training for both lead generation and post-sale engagement scenarios.

    Language Nuance & Authenticity

    Conversations are reflective of natural English used in the Real Estate domain, incorporating:

    Cultural Naming Patterns: Personal names, agency names, and developer brands
    Localized Contact Info: Phone numbers, email addresses, and geographic locations across English-speaking regions
    Numeric and Temporal Language: Dates, prices, unit sizes, and time references formatted in English conventions
    Informal and Domain-Specific Language: Real estate slang, idioms, and casual tone used in property discussions

    This level of linguistic realism supports model generalization across dialects and user demographics.

    Conversational Structure & Flow

    Conversations include a mix of short inquiries and detailed advisory sessions, capturing full customer journeys:

    Dialogue Types
    General inquiries
    Sales consultations
    Investment advisory
    Follow-up coordination
    Complaint handling and support
    Flow Components
    Greetings and identity verification
    Intent identification and context gathering

  17. Tucson Equity Priority Index (TEPI): Pima County Block Groups

    • teds.tucsonaz.gov
    Updated Jul 23, 2024
    Cite
    City of Tucson (2024). Tucson Equity Priority Index (TEPI): Pima County Block Groups [Dataset]. https://teds.tucsonaz.gov/maps/cotgis::tucson-equity-priority-index-tepi-pima-county-block-groups
    Explore at:
    Dataset updated
    Jul 23, 2024
    Dataset authored and provided by
    City of Tucson
    Description

    For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the Data Dictionary.

    What is the Tucson Equity Priority Index (TEPI)?

    The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), High (80-100). Each class represents 20% of the dataset’s features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.

    How is social vulnerability measured?

    The TEPI examines the proportion of vulnerability per feature using 11 demographic indicators:

    Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
    Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
    Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
    Renter Cost Burdened: Renters who spend more than 30% of their income on rent
    No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
    No Vehicle Access: Households without automobile, van, or truck access
    High School Education or Less: Those whose highest level of educational attainment is a high school diploma, equivalency, or less
    Limited English Ability: Those whose ability to speak English is "Less Than Well"
    People of Color: Those who identify as anything other than Non-Hispanic White
    Disability: Households with one or more physical or cognitive disabilities
    Age: Groups that tend to have higher levels of vulnerability, including children (those below 18) and seniors (those 65 and older)

    An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.

    How are the variables combined?

    These indicators are divided into two main categories called Thematic Indices: Economic and Personal Characteristics. The two thematic indices are further divided into five sub-indices called Tier-2 Sub-Indices, each containing 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to a common scale with values between 0 and 100. The variables are then combined first into each of the five Tier-2 Sub-Indices, then into the Thematic Indices, then into the overall TEPI, using the mean aggregation method and equal weighting (a minimal code sketch of this pipeline follows the class definitions below). The resulting dataset is then divided into five classes, where:

    High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions, which are the most socially vulnerable. These areas require the most focused attention.
    Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability than the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
    Moderate Vulnerability (40-60%): Representing the middle quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability; detailed examination of specific indicators is recommended to understand their nuanced needs.
    Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
    Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
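    The normalize-then-average pipeline described above is straightforward to sketch in code. The following is a minimal Python illustration, not the City of Tucson's implementation: the indicator column names and the sub-index grouping are placeholders, since the listing names the 11 indicators but not the exact Tier-2 grouping.

```python
import pandas as pd

# Hypothetical indicator and sub-index names. The real TEPI uses 11 indicators
# grouped into 5 Tier-2 Sub-Indices; the grouping below is illustrative only.
SUB_INDICES = {
    "income": ["pct_below_poverty", "pct_unemployed"],
    "housing": ["pct_owner_cost_burdened", "pct_renter_cost_burdened"],
    "access": ["pct_no_insurance", "pct_no_vehicle"],
}

CLASSES = ["Low", "Low-Moderate", "Moderate", "Moderate-High", "High"]

def tepi(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Percentile normalization: re-scale each raw indicator to 0-100.
    indicator_cols = [c for cols in SUB_INDICES.values() for c in cols]
    ranked = df[indicator_cols].rank(pct=True) * 100
    # Mean aggregation with equal weighting:
    # indicators -> Tier-2 sub-indices -> overall index.
    subs = pd.DataFrame({name: ranked[cols].mean(axis=1)
                         for name, cols in SUB_INDICES.items()})
    out["tepi"] = subs.mean(axis=1)
    # Five classes, each holding 20% of the dataset's features by value order.
    out["tepi_class"] = pd.qcut(out["tepi"], q=5, labels=CLASSES)
    return out
```

    Note that pd.qcut splits on quantiles of the computed index, so each class holds roughly 20% of the features, matching the quintile-based classification described above.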

  18. EDA: Unlocking the Story Behind the Numbers

    • kaggle.com
    zip
    Updated Nov 17, 2025
    Cite
    Coding expert G.N (2025). EDA: Unlocking the Story Behind the Numbers [Dataset]. https://www.kaggle.com/datasets/ranaghulamnabi/eda-unlocking-the-story-behind-the-numbers/discussion
    Explore at:
    zip (9163 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Coding expert G.N
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context:

    This dataset contains real passenger information from the Titanic’s tragic voyage in 1912. It includes details like age, gender, ticket class, fare, and whether each passenger survived. The data is commonly used for learning data analysis and building beginner machine-learning models. It helps us explore patterns such as who had higher chances of survival and why.

    Feature distributions:

    1. Age Distribution

    Passenger ages range from infants to elderly adults, with most travelers falling between 20 and 40 years old. There are some missing values, especially among older passengers and children.

    2. Fare Distribution

    Fares vary widely — lower-class passengers paid small amounts, while first-class travelers paid much higher fares. The distribution is skewed because a few people paid very high ticket prices.

    3. Passenger Class (Pclass) Distribution

    Most passengers were in 3rd class, fewer in 2nd, and the smallest group in 1st class. This shows the ship had many lower-class travelers.

    4. Survival Distribution

    The dataset shows that more people did not survive than survived. Survival rates differ by gender, age, and class, with higher survival among women, children, and first-class passengers.
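    The four distribution checks above are quick to reproduce. Below is a minimal Python sketch, assuming the standard Kaggle Titanic train.csv with its usual column names (Age, Fare, Pclass, Sex, Survived); the file path is an assumption.

```python
import pandas as pd

# Standard Kaggle Titanic training file assumed to be in the working directory.
df = pd.read_csv("train.csv")

# Age: central tendency and missingness.
print(df["Age"].describe())
print("missing ages:", df["Age"].isna().sum())

# Fare: a right-skewed distribution shows up as mean well above median.
print("fare mean / median:", df["Fare"].mean(), "/", df["Fare"].median())

# Class sizes: 3rd class should dominate.
print(df["Pclass"].value_counts().sort_index())

# Survival rates by gender and class.
print(df.groupby(["Sex", "Pclass"])["Survived"].mean().round(2))
```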

  19. Polish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Polish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-polish-poland
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Polish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Polish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Polish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Polish speech models that understand and respond to authentic Polish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Polish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Polish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Poland to ensure dialectal diversity and demographic balance.
    Demographics: A 60% male, 40% female gender split, with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
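    The listing does not publish the transcription schema, so the loader below is only a sketch of consuming speaker-segmented, time-coded utterances; the field names (utterances, speaker, start, end, text) and the file name are assumptions, not FutureBeeAI's documented format.

```python
import json

def load_utterances(path: str) -> list[tuple]:
    """Flatten a speaker-segmented, time-coded transcription into
    (speaker, start, end, text) tuples. All field names are assumed."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    return [(u["speaker"], u["start"], u["end"], u["text"])
            for u in doc["utterances"]]

# Example: print each utterance with its time span, e.g. to align audio
# segments with text for ASR training.
for speaker, start, end, text in load_utterances("conversation_001.json"):
    print(f"[{speaker}] {start:.2f}-{end:.2f}s: {text}")
```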

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Polish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Polish.
    Voice Assistants: Build smart assistants capable of understanding natural Polish conversations.

  20. Mexican Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of Mexico to ensure dialectal diversity and demographic balance.
    Demographics: A 60% male, 40% female gender split, with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
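    As an illustration of that filtering, here is a minimal sketch that assumes the speaker and recording metadata have been merged into one CSV; the file name and the columns used (gender, age, duration_minutes) are hypothetical, not FutureBeeAI's documented delivery format.

```python
import pandas as pd

# Hypothetical merged speaker + recording metadata table.
meta = pd.read_csv("metadata.csv")

# Select a demographically targeted subset, e.g. female speakers aged 18-40,
# to balance training data or probe ASR accuracy on a specific slice.
subset = meta[(meta["gender"] == "female") & (meta["age"].between(18, 40))]

# Check how many audio hours the slice covers before committing to training.
print(subset["duration_minutes"].sum() / 60, "hours selected")
```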

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Mexican Spanish conversations.