53 datasets found
  1. Income Distribution by Quintile: Mean Household Income in Amherst, New York // 2025 Edition

    • neilsberg.com
    csv, json
    Updated Mar 3, 2025
    Cite
    Neilsberg Research (2025). Income Distribution by Quintile: Mean Household Income in Amherst, New York // 2025 Edition [Dataset]. https://www.neilsberg.com/insights/amherst-ny-median-household-income/
    Explore at:
    Available download formats: json, csv
    Dataset updated
    Mar 3, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Amherst, New York
    Variables measured
    Income Level, Mean Household Income
    Measurement technique
    The data presented in this dataset is derived from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. It delineates income distributions across the income quintiles (mentioned above) following an initial analysis and categorization. Subsequently, we adjusted these figures for inflation using the Consumer Price Index Retroactive Series Using Current Methods (R-CPI-U-RS). For additional information about these estimations, please contact us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset presents the mean household income for each of the five quintiles in Amherst, New York, as reported by the U.S. Census Bureau. The dataset highlights the variation in mean household income across quintiles, offering valuable insights into income distribution and inequality.

    Key observations

    • Income disparities: The mean income of the lowest quintile (the 20% of households with the lowest income) is $18,852, while the mean income of the highest quintile (the 20% of households with the highest income) is $296,153. This indicates that the top earners earn nearly 16 times as much as the lowest earners.
    • Top 5%: The mean household income for the wealthiest population (top 5%) is $495,426, which is 167.29% of the mean income of the highest quintile and 2,627.98% of the mean income of the lowest quintile.
    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Income Levels:

    • Lowest Quintile
    • Second Quintile
    • Third Quintile
    • Fourth Quintile
    • Highest Quintile
    • Top 5 Percent

    Variables / Data Columns

    • Income Level: This column showcases the income levels (as mentioned above).
    • Mean Household Income: Mean household income, in 2023 inflation-adjusted dollars for the specific income level.
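
    A minimal pandas sketch of working with the two columns above (the filename amherst_income_quintiles.csv is a hypothetical placeholder for the downloaded CSV):

    import pandas as pd

    # Load the downloaded CSV; the filename here is a placeholder.
    df = pd.read_csv("amherst_income_quintiles.csv")

    # Index the mean incomes by income level for easy lookup.
    income = df.set_index("Income Level")["Mean Household Income"]

    # Reproduce the quintile ratio from the key observations (~16x).
    ratio = income["Highest Quintile"] / income["Lowest Quintile"]
    print(f"Highest-to-lowest quintile ratio: {ratio:.1f}x")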

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are thus subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to evaluate the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for Amherst town median household income. You can refer to the same here.

  2. High-fidelity Fraudulent Activity Dataset 2023

    • kaggle.com
    zip
    Updated Oct 5, 2023
    Cite
    Shahzad Aslam (2023). High-fidelity Fraudulent Activity Dataset 2023 [Dataset]. https://www.kaggle.com/datasets/zeesolver/credit-card
    Explore at:
    Available download formats: zip (149,953,614 bytes)
    Dataset updated
    Oct 5, 2023
    Authors
    Shahzad Aslam
    License

    Public Domain (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description
    Context

    The credit card dataset comprises various attributes that capture essential information about individual transactions. Each entry in the dataset is uniquely identified by an 'ID', which aids in precise record-keeping and analysis. The 'V1-V28' features encompass a wide range of transaction-related details, including time, location, type, and several other parameters. These attributes collectively provide a comprehensive snapshot of each transaction. 'Amount' denotes the monetary value involved in the transaction, indicating the specific charge or credit associated with the card. Lastly, the 'Class' attribute plays a pivotal role in fraud detection, categorizing transactions into distinct classes like 'legitimate' and 'fraudulent'. This classification is instrumental in identifying potentially suspicious activities, helping financial institutions safeguard against fraudulent transactions. Together, these attributes form a crucial dataset for studying and mitigating risks associated with credit card transactions.

    Column Details

    ID:

    This is likely a unique identifier for a specific credit card transaction. It helps in keeping track of individual transactions and distinguishing them from one another.

    V1-V28:

    These are possibly features or attributes associated with the credit card transaction. They might include information such as time, amount, location, type of transaction, and various other details that can be used for analysis and fraud detection.

    Amount:

    This refers to the monetary value involved in the credit card transaction. It indicates how much money was either charged or credited to the card during that particular transaction.

    Class:

    This is an important attribute indicating the category or type of the transaction. It typically classifies transactions into different groups, like 'fraudulent' or 'legitimate'. This classification is crucial for identifying potentially suspicious or fraudulent activities.
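
    Because 'Class' is the fraud label, a sensible first step is checking how imbalanced it is. A minimal pandas sketch (the filename creditcard.csv is an assumption about the contents of the zip archive):

    import pandas as pd

    # The filename is a placeholder for the CSV inside the downloaded zip.
    df = pd.read_csv("creditcard.csv")

    # Fraud datasets of this kind are typically heavily imbalanced, so
    # inspect the class distribution before training any model.
    print(df["Class"].value_counts(normalize=True))

    # Quick sanity check on the transaction amounts.
    print(df["Amount"].describe())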

  3. cifar-100-python

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    ThanhTan (2024). cifar-100-python [Dataset]. https://www.kaggle.com/datasets/duongthanhtan/cifar-100-python
    Explore at:
    Available download formats: zip (168,517,675 bytes)
    Dataset updated
    Dec 26, 2024
    Authors
    ThanhTan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CIFAR-100 Dataset

    1. Overview

    • CIFAR-100 is an extension of the CIFAR-10 dataset, with more classes and finer-grained categorization.
    • It contains 100 classes, making it more challenging than CIFAR-10, which has only 10 classes.
    • Each image in CIFAR-100 is labeled with both a fine label (specific category) and a coarse label (broader category, such as animals or vehicles).

    2. Dataset Details

    • Number of Images: 60,000 color images in total.
      • 50,000 for training.
      • 10,000 for testing.
    • Image Size: Each image is a small 32x32 pixel RGB (color) image.
    • Classes: 100 classes, grouped into 20 superclasses.
      • Each superclass contains 5 related classes.

    3. Fine and Coarse Labels

    • Fine Labels: The dataset has specific categories, such as 'apple', 'bicycle', 'rose', etc.
    • Coarse Labels: These are broader categories, like 'fruit', 'flower', 'vehicle', etc.

    4. Applications

    • Image Classification: Used for training models to classify images into their respective categories.
    • Feature Extraction: Useful for benchmarking feature extraction techniques in computer vision.
    • Transfer Learning: Often used to pre-train models for other similar tasks.
    • Deep Learning Research: Commonly used to test architectures like CNNs (Convolutional Neural Networks).

    5. Challenges

    • The images are very small (32x32 pixels), making it harder for models to learn intricate details.
    • High class count (100) increases classification complexity.
    • Intra-class variability and inter-class similarity make it a challenging dataset for classification.

    6. File Format

    • The dataset is usually available in Python-friendly formats like .pkl or .npz.
    • It can also be downloaded and loaded using frameworks like TensorFlow or PyTorch.
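
    For the PyTorch route, a minimal torchvision sketch (this downloads the canonical CIFAR-100 release rather than this specific Kaggle mirror):

    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # torchvision fetches the standard Python-pickled CIFAR-100 release.
    transform = transforms.ToTensor()
    train_set = datasets.CIFAR100(root="data", train=True, download=True, transform=transform)
    test_set = datasets.CIFAR100(root="data", train=False, download=True, transform=transform)

    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    images, fine_labels = next(iter(loader))
    print(images.shape)            # torch.Size([64, 3, 32, 32])
    print(len(train_set.classes))  # 100 fine-grained classes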

    7. Example Classes

    Some example classes include:

    • Animals: beaver, dolphin, otter, elephant, snake.
    • Plants: apple, orange, mushroom, palm tree, pine tree.
    • Vehicles: bicycle, bus, motorcycle, train, rocket.
    • Everyday Objects: clock, keyboard, lamp, table, chair.

  4. Course-Skill Atlas: A national longitudinal dataset of skills taught in U.S....

    • figshare.com
    application/gzip
    Updated Oct 8, 2024
    Cite
    Alireza Javadian Sabet; Sarah H. Bana; Renzhe Yu; Morgan Frank (2024). Course-Skill Atlas: A national longitudinal dataset of skills taught in U.S. higher education curricula [Dataset]. http://doi.org/10.6084/m9.figshare.25632429.v7
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Oct 8, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alireza Javadian Sabet; Sarah H. Bana; Renzhe Yu; Morgan Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Higher education plays a critical role in driving an innovative economy by equipping students with knowledge and skills demanded by the workforce. While researchers and practitioners have developed data systems to track detailed occupational skills, such as those established by the U.S. Department of Labor (DOL), much less effort has been made to document which of these skills are being developed in higher education at a similar granularity. Here, we fill this gap by presenting Course-Skill Atlas -- a longitudinal dataset of skills inferred from over three million course syllabi taught at nearly three thousand U.S. higher education institutions. To construct Course-Skill Atlas, we apply natural language processing to quantify the alignment between course syllabi and detailed workplace activities (DWAs) used by the DOL to describe occupations. We then aggregate these alignment scores to create skill profiles for institutions and academic majors. Our dataset offers a large-scale representation of college education's role in preparing students for the labor market. Overall, Course-Skill Atlas can enable new research on the source of skills in the context of workforce development and provide actionable insights for shaping the future of higher education to meet evolving labor demands, especially in the face of new technologies.
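
    The alignment method is only summarized above; as a hedged illustration of the general idea (not the authors' actual pipeline), one could score syllabus-to-DWA alignment with sentence embeddings and cosine similarity. The model name and both example texts below are illustrative assumptions:

    from sentence_transformers import SentenceTransformer, util

    # Illustrative inputs; real syllabi come from the dataset and DWA
    # statements from the DOL's occupational data.
    syllabus_text = "Students will learn to fit and interpret linear regression models."
    dwa_text = "Analyze data to identify trends or relationships among variables."

    model = SentenceTransformer("all-MiniLM-L6-v2")
    syllabus_vec = model.encode(syllabus_text, convert_to_tensor=True)
    dwa_vec = model.encode(dwa_text, convert_to_tensor=True)

    # Cosine similarity as a stand-in for an alignment score.
    print(util.cos_sim(syllabus_vec, dwa_vec).item())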

  5. Tucson Equity Priority Index (TEPI): Citywide Census Tracts

    • hub.arcgis.com
    • teds.tucsonaz.gov
    Updated Jun 27, 2024
    Cite
    City of Tucson (2024). Tucson Equity Priority Index (TEPI): Citywide Census Tracts [Dataset]. https://hub.arcgis.com/datasets/1ec436c7358c47739872078ecb1d0c44
    Explore at:
    Dataset updated
    Jun 27, 2024
    Dataset authored and provided by
    City of Tucson
    Description

    For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the layer's data dictionary.

    What is the Tucson Equity Priority Index (TEPI)?

    The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent the differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), High (80-100). Each class represents 20% of the dataset's features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.

    How is social vulnerability measured?

    The TEPI examines the proportion of vulnerability per feature using 11 demographic indicators:

    • Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
    • Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
    • Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
    • Renter Cost Burdened: Renters who spend more than 30% of their income on rent
    • No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
    • No Vehicle Access: Households without automobile, van, or truck access
    • High School Education or Less: Those whose highest level of educational attainment is a high school diploma, equivalency, or less
    • Limited English Ability: Those whose ability to speak English is "less than well"
    • People of Color: Those who identify as anything other than Non-Hispanic White
    • Disability: Households with one or more physical or cognitive disabilities
    • Age: Groups that tend to have higher levels of vulnerability, including children (those below 18) and seniors (those 65 and older)

    An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.

    How are the variables combined?

    These indicators are divided into two main categories that we call Thematic Indices: Economic and Personal Characteristics. The two Thematic Indices are further divided into five sub-indices called Tier-2 Sub-Indices. Each Tier-2 Sub-Index contains 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to the same scale using values between 0 and 100. The variables are then combined first into each of the five Tier-2 Sub-Indices, then the Thematic Indices, then the overall TEPI, using the mean aggregation method and equal weighting. The resulting dataset is then divided into the five classes, where:

    • High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions that are the most socially vulnerable. These areas require the most focused attention.
    • Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability compared to the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
    • Moderate Vulnerability (40-60%): Representing the middle or median quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability. Detailed examination of specific indicators is recommended to understand the nuanced needs of these areas.
    • Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
    • Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
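
    A minimal pandas sketch of the scoring recipe described above (percentile normalization, equal-weight mean aggregation, then five classes of 20% each); the two indicator columns and their values are hypothetical, and the real index aggregates through Tier-2 Sub-Indices and Thematic Indices first:

    import pandas as pd

    # Hypothetical indicator values for five census tracts.
    df = pd.DataFrame({
        "pct_poverty": [5.0, 12.0, 30.0, 8.0, 21.0],
        "pct_unemployed": [3.0, 6.0, 11.0, 4.0, 9.0],
    })

    # Percentile normalization: rescale each indicator to 0-100 by rank.
    normalized = df.rank(pct=True) * 100

    # Equal-weight mean aggregation into one score per feature.
    score = normalized.mean(axis=1)

    # Five classes, each holding 20% of the features in score order.
    labels = ["Low", "Low-Moderate", "Moderate", "Moderate-High", "High"]
    classes = pd.qcut(score, q=5, labels=labels)
    print(pd.DataFrame({"score": score, "class": classes}))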

  6. Data from: 2015 Irrigated acres feature class for the Upper Rio Grande Basin, New Mexico and Texas, United States and Chihuahua, Mexico

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 26, 2025
    Cite
    U.S. Geological Survey (2025). 2015 Irrigated acres feature class for the Upper Rio Grande Basin, New Mexico and Texas, United States and Chihuahua, Mexico [Dataset]. https://catalog.data.gov/dataset/2015-irrigated-acres-feature-class-for-the-upper-rio-grande-basin-new-mexico-and-texas-uni
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    U.S. Geological Survey
    Area covered
    Chihuahua, Rio Grande, Texas, New Mexico, Mexico, United States
    Description

    Consumptive use (CU) of water is an important factor for determining water availability and groundwater storage. Many regional stakeholders and water-supply managers in the Upper Rio Grande Basin have indicated CU is of primary concern in their water-management strategies, yet CU data is sparse for this area. This polygon feature class, which represents irrigated acres for 2015, is a geospatial component of the U.S. Geological Survey National Water Census Upper Rio Grande Basin (URGB) focus area study's effort to improve quantification of CU in parts of New Mexico, west Texas, and northern Chihuahua. These digital data accompany Ivahnenko, T.I., Flickinger, A.K., Galanter, A.E., Douglas-Mankin, K.R., Pedraza, D.E., and Senay, G.B., 2021, Estimates of public-supply, domestic, and irrigation water withdrawal, use, and trends in the Upper Rio Grande Basin, 1985 to 2015: U.S. Geological Survey Scientific Investigations Report 2021–5036, 31 p., https://doi.org/10.3133/sir20215036.

  7. Data from: ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 30, 2022
    Cite
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli (2022). ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6568777
    Explore at:
    Dataset updated
    Jun 30, 2022
    Dataset provided by
    University of Genoa, Italy
    University of Cagliari, Italy
    Authors
    Maura Pintor; Daniele Angioni; Angelo Sotgiu; Luca Demetrio; Ambra Demontis; Battista Biggio; Fabio Roli
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

    We release our dataset as a set of folders indicating the patch target label (e.g., banana), each containing 1,000 subfolders, one for each of the ImageNet output classes.

    An example showing how to use the dataset is shown below.

    Code for testing the robustness of a model:

    import os
    import os.path

    import torch.utils.data
    from torchvision import datasets, transforms, models

    class ImageFolderWithEmptyDirs(datasets.ImageFolder):
      """
      This is required for handling empty folders from the ImageFolder class.
      """

      def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
          raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx

    # extract and unzip the dataset, then write the top folder here
    dataset_folder = 'data/ImageNet-Patch'

    available_labels = {
      487: 'cellular telephone',
      513: 'cornet',
      546: 'electric guitar',
      585: 'hair spray',
      804: 'soap dispenser',
      806: 'sock',
      878: 'typewriter keyboard',
      923: 'plate',
      954: 'banana',
      968: 'cup',
    }

    # select the folder with a specific target label
    target_label = 954
    dataset_folder = os.path.join(dataset_folder, str(target_label))

    # standard ImageNet normalization; 'preprocess' avoids shadowing the
    # torchvision.transforms module
    normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                      std=[0.229, 0.224, 0.225])
    preprocess = transforms.Compose([transforms.ToTensor(), normalizer])

    dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=preprocess)
    model = models.resnet50(pretrained=True)
    loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
    model.eval()

    # count correct predictions and successful attacks over a few batches
    batches = 10
    correct, attack_success, total = 0, 0, 0
    for batch_idx, (images, labels) in enumerate(loader):
      if batch_idx == batches:
        break
      pred = model(images).argmax(dim=1)
      correct += (pred == labels).sum()
      attack_success += (pred == target_label).sum()
      total += pred.shape[0]

    accuracy = correct / total
    attack_sr = attack_success / total

    print("Robust Accuracy: ", accuracy)
    print("Attack Success: ", attack_sr)

  8. French Scripted Monologue Speech Data in Travel Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). French Scripted Monologue Speech Data in Travel Domain [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/travel-scripted-speech-monologues-french-france
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Algerian Arabic Scripted Monologue Speech Dataset for the Travel domain, a carefully constructed resource created to support the development of Arabic speech recognition technologies, particularly for applications in travel, tourism, and customer service automation.

    Speech Data

    This training dataset features 6,000+ high-quality scripted prompt recordings in Algerian Arabic, crafted to simulate real-world Travel industry conversations. It’s ideal for building robust ASR systems, virtual assistants, and customer interaction tools.

    Participant Diversity
    Speakers: 60 native Algerian Arabic speakers.
    Geographic Coverage: Participants from multiple regions across Algeria to ensure rich diversity in dialects and accents.
    Demographics: Age range from 18 to 70 years, with a gender ratio of approximately 60% male and 40% female.
    Recording Details
    Prompt Type: Scripted monologue-style prompts.
    Duration: Each audio sample ranges from 5 to 30 seconds.
    Audio Format: WAV files with mono channels, 16-bit depth, and 8 kHz / 16 kHz sample rates.
    Environment: Clean, quiet, echo-free spaces to ensure high-quality recordings.
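
    A small sketch using only the Python standard library (the filename is hypothetical) can verify that a downloaded recording matches the stated format:

    import wave

    # Hypothetical path to one of the downloaded recordings.
    with wave.open("sample_0001.wav", "rb") as wav:
        assert wav.getnchannels() == 1               # mono
        assert wav.getsampwidth() == 2               # 16-bit = 2 bytes/sample
        assert wav.getframerate() in (8000, 16000)   # 8 kHz or 16 kHz
        duration = wav.getnframes() / wav.getframerate()
        print(f"{duration:.1f} s")                   # expected range: 5-30 s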

    Topic Coverage

    The dataset includes a wide spectrum of travel-related interactions to reflect diverse real-world scenarios:

    Booking and reservation dialogues
    Customer support and general inquiries
    Destination-specific guidance
    Technical and login help
    Promotional offers and travel deals
    Service availability and policy information
    Domain-specific statements

    Context Elements

    To boost contextual realism, the scripted prompts integrate frequently encountered travel terms and variables:

    Names: Common Algerian male and female names
    Addresses: Regional address formats and locality names
    Dates & Times: Booking dates, travel periods, and time-based interactions
    Destinations: Mention of cities, countries, airports, and tourist landmarks
    Prices & Numbers: Cost of flights, hotel rates, promotional discounts, etc.
    Booking & Confirmation Codes: Typical ticketing and travel identifiers

    Transcription

    Every audio file is paired with a verbatim transcription in .TXT format.

    Consistency: Each transcript matches its corresponding audio file exactly.
    Accuracy: Transcriptions are reviewed and verified by native Algerian Arabic speakers.
    Usability: File names are synced across audio and text for easy integration.

    Metadata

    Each audio file is enriched with detailed metadata to support advanced analytics and filtering:

    Participant Metadata: Unique ID, age, gender, region/state,

  9. USA High School Student Marketing Database by ASL Marketing

    • datarade.ai
    Updated Dec 19, 2019
    Cite
    ASL Marketing (2019). USA High School Student Marketing Database by ASL Marketing [Dataset]. https://datarade.ai/data-products/high-school-student-data
    Explore at:
    Dataset updated
    Dec 19, 2019
    Dataset authored and provided by
    ASL Marketing
    Area covered
    United States
    Description

    Database is provided by ASL Marketing and covers the United States of America. With ASL Marketing, reaching Gen Z has never been easier. Current high school student data can be customized by:

    • Class year
    • Date of birth
    • Gender
    • GPA
    • Geo
    • Household income
    • Ethnicity
    • Hobbies
    • College-bound
    • Interests
    • College intent
    • Email

  10. ORSO (Online Resource for Social Omics): A data-driven social network connecting scientists to genomics datasets

    • figshare.com
    • plos.figshare.com
    tiff
    Updated Feb 5, 2020
    Cite
    Christopher A. Lavender; Andrew J. Shapiro; Frank S. Day; David C. Fargo (2020). ORSO (Online Resource for Social Omics): A data-driven social network connecting scientists to genomics datasets [Dataset]. http://doi.org/10.1371/journal.pcbi.1007571
    Explore at:
    Available download formats: tiff
    Dataset updated
    Feb 5, 2020
    Dataset provided by
    PLOS Computational Biology
    Authors
    Christopher A. Lavender; Andrew J. Shapiro; Frank S. Day; David C. Fargo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-throughput sequencing has become ubiquitous in biomedical sciences. As new technologies emerge and sequencing costs decline, the diversity and volume of available data increases exponentially, and successfully navigating the data becomes more challenging. Though datasets are often hosted by public repositories, scientists must rely on inconsistent annotation to identify and interpret meaningful data. Moreover, the experimental heterogeneity and wide-ranging quality of high-throughput biological data means that even data with desired cell lines, tissue types, or molecular targets may not be readily interpretable or integrated. We have developed ORSO (Online Resource for Social Omics) as an easy-to-use web application to connect life scientists with genomics data. In ORSO, users interact within a data-driven social network, where they can favorite datasets and follow other users. In addition to more than 30,000 datasets hosted from major biomedical consortia, users may contribute their own data to ORSO, facilitating its discovery by other users. Leveraging user interactions, ORSO provides a novel recommendation system to automatically connect users with hosted data. In addition to social interactions, the recommendation system considers primary read coverage information and annotated metadata. Similarities used by the recommendation system are presented by ORSO in a graph display, allowing exploration of dataset associations. The topology of the network graph reflects established biology, with samples from related systems grouped together. We tested the recommendation system using an RNA-seq time course dataset from differentiation of embryonic stem cells to cardiomyocytes. The ORSO recommendation system correctly predicted early data point sources as embryonic stem cells and late data point sources as heart and muscle samples, resulting in recommendation of related datasets. By connecting scientists with relevant data, ORSO provides a critical new service that facilitates wide-ranging research interests.

  11. Pseudo-Label Generation for Multi-Label Text Classification

    • catalog.data.gov
    • datasets.ai
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Pseudo-Label Generation for Multi-Label Text Classification [Dataset]. https://catalog.data.gov/dataset/pseudo-label-generation-for-multi-label-text-classification
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    With the advent and expansion of social networking, the amount of generated text data has seen a sharp increase. In order to handle such a huge volume of text data, new and improved text mining techniques are a necessity. One of the characteristics of text data that makes text mining difficult is multi-labelity. In order to build a robust and effective text classification method, which is an integral part of text mining research, we must consider this property more closely. This kind of property is not unique to text data, as it can be found in non-text (e.g., numeric) data as well; however, it is most prevalent in text data. This property also puts the text classification problem in the domain of multi-label classification (MLC), where each instance is associated with a subset of class labels instead of a single class, as in conventional classification. In this paper, we explore how the generation of pseudo labels (i.e., combinations of existing class labels) can help us perform better text classification, and under what kinds of circumstances. During the classification, the high and sparse dimensionality of text data has also been considered. Although we are proposing and evaluating a text classification technique here, our main focus is on the handling of the multi-labelity of text data while utilizing the correlation among the multiple labels existing in the data set. Our text classification technique is called pseudo-LSC (pseudo-Label Based Subspace Clustering). It is a subspace clustering algorithm that considers the high and sparse dimensionality as well as the correlation among different class labels during the classification process to provide better performance than existing approaches. Results on three real-world multi-label data sets provide insight into how multi-labelity is handled in our classification process and show the effectiveness of our approach.
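
    As a minimal sketch of the pseudo-label idea (treating each distinct combination of class labels as one new class; this illustrates the concept, not necessarily the exact construction used in pseudo-LSC):

    # Each instance carries a subset of class labels (multi-label data).
    y = [("sports",), ("sports", "politics"), ("tech",), ("sports", "politics")]

    # Map each distinct label combination to a single pseudo label.
    pseudo_ids = {}
    pseudo_labels = [pseudo_ids.setdefault(frozenset(labels), len(pseudo_ids))
                     for labels in y]

    print(pseudo_labels)  # [0, 1, 2, 1]: label combinations become single classes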

  12. Jacobaea vulgaris and meadow Augmented image classification dataset (binary)

    • data.niaid.nih.gov
    Updated Jun 27, 2024
    Cite
    University of Rostock; Technische Universität Berlin (2024). Jacobaea vulgaris and meadow Augmented image classification dataset (binary) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12547684
    Explore at:
    Dataset updated
    Jun 27, 2024
    Dataset provided by
    Chair of Geodesy and Geoinformatics
    DAMS Lab
    Authors
    University of Rostock; Technische Universität Berlin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General Information

    Total instances: 117,008
    Instances in the Jacobaea vulgaris class: 58,504
    Instances in the Meadow class: 58,504
    Image sizes: 224x224 pixels on three color channels (RGB)

    Performance increase from training a ResNet50 on the base dataset versus the same architecture on the augmented dataset shared here: +3.79 percentage points in ROC AUC on an independent test set with 240 instances.

    Data Generation and Source

    The initial images in this dataset were taken as part of the project “UAV-basiertes Grünlandmonitoring auf Bestands- und Einzelpflanzenebene” (engl. “UAV-based Grassland Monitoring at Population and Individual Plant Level”), financed by the Authority for Economy, Transport, and Innovation of Hamburg. In September 2018, flights with an octocopter were conducted over two extensively used grassland areas in the urban area of Hamburg.

    In my master's thesis at the DAMS Lab at TU Berlin, I evaluated the effect of different augmentation strategies for Jacobaea vulgaris image classification on several performance metrics (most importantly the ROC AUC score). The identified augmentation strategies were selected based on performance as well as on domain knowledge, which I acquired during the research for my master's thesis.

    Additional information about the initial image generation process is to be found here [p. 45–53] and here.

    Augmentations applied

    Gaussian Noise: For the Gaussian noise augmentation, the mean of the added noise is set to zero. The lower and upper bounds for the random variance of the noise are 20.4663 and 54.0395 respectively. The bounds were identified by hyperparameter tuning. The search space for the lower bound was set from 5 to 30 and for the upper bound from 31 to 100. Those two search spaces were defined by visual inspection of the effects of applying Gaussian noise with different variance values to images of both classes. The Gaussian noise is sampled for each color channel individually.

    Random Brightness and Contrast: The brightness is randomly increased or decreased by a factor ranging from 0.7010 to 1.2990. The contrast is likewise randomly varied by a factor ranging from 0.5775 to 1.4225. Those two ranges were identified using hyperparameter tuning. The search space for the maximum percentage increase or decrease of brightness and contrast was individually set from 1% to at most a 50% increase or decrease.

    Cutout Dropout: In this augmentation method, a certain percentage of the input image is covered by black patches. The patches have a certain size in pixels; the implementation in this thesis uses square patches. The black patches are randomly introduced into the image by randomly allocating the patches across the image and setting the corresponding pixel values to zero. The image is covered with patches until the cover percentage is reached. We set the percentage of the image to be randomly covered by black patches to 56.76%. The size of the patches, which randomly cover the image, is set to 4 pixels. A good illustration of this is found in figure 4.2. The augmentation technique is inspired by the research proposed by Devries et al. [8]. Both values were identified by hyperparameter tuning. The search space for the patch size in pixels is categorical and includes the values [1, 2, 4, 7, 8, 14, 16, 28]. Those values are all divisors of 224, the image width and height in pixels; the patch size needs to divide the width and height evenly in order to be suitable for the algorithm implementation. The search space for the cover percentage of the image was set from 1% to 60%, which narrows the search down to a space where a large part of the image still remains uncovered. The algorithm rearranges the image into a two-dimensional grid and randomly masks rows of this grid by setting the pixel values in each masked row to zero. Then the image gets rearranged, now with the randomly generated patches included.

    Random Saturation: The saturation of each pixel is randomly shifted. The upper bound for the random saturation shift of each pixel is set to 231.689%. This value was identified using hyperparameter tuning, for which the maximum saturation shift had been limited to 40% in either direction.

    Horizontal Flip: The image gets flipped along the horizontal axis.

    Vertical Flip: The image gets flipped along the vertical axis.

    Random Rotation 90 Degrees: Randomly rotates the image by k × 90 degrees, where k ∈ {0, 1, 2, 3}.

    All augmentation methods, with their tuned augmentation hyperparameters (where applicable), are applied to an image from the test set in figure 4.2. With the seven identified augmentation techniques, a dataset of 800% the size of the original dataset is created. The Augment model is trained on exactly this dataset. Next to the augmented images, the dataset of course still includes the original, unaugmented images. TensorFlow, along with additional libraries including Optuna for hyperparameter optimization and Albumentations for image augmentation, was used for the implementation of this project.
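
    Since the text names Albumentations, here is a hedged sketch of how the described pipeline could be expressed with that library. The parameter values come from the text; the hole count for the cutout is derived from the stated 56.76% coverage of a 224x224 image with 4-pixel patches, and the probabilities and exact wiring are assumptions, not the author's code:

    import albumentations as A

    # One transform per augmentation technique described above.
    augment = A.Compose([
        A.GaussNoise(var_limit=(20.4663, 54.0395), per_channel=True, p=1.0),
        # brightness factor 0.7010-1.2990, contrast factor 0.5775-1.4225
        A.RandomBrightnessContrast(brightness_limit=0.2990, contrast_limit=0.4225, p=1.0),
        # ~56.76% of a 224x224 image covered by 4x4 patches -> about 1780 holes
        A.CoarseDropout(max_holes=1780, max_height=4, max_width=4, fill_value=0, p=1.0),
        A.HueSaturationValue(hue_shift_limit=0, sat_shift_limit=40, val_shift_limit=0, p=1.0),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomRotate90(p=1.0),
    ])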

    Rationale behind the augmentations applied

    Random Rotation, Vertical and Horizontal Flip: These three augmentation strategies were chosen to make the classifier less sensitive to the orientation of the plant. The goal is to train a model that can classify plants regardless of their orientation. In order to achieve this effectively across different orientations, vertical flips, horizontal flips, and random 90-degree rotations are chosen for evaluation.

    Random Saturation: The varying saturation of the images simulates different levels of chlorophyll in the leaves, which is responsible for the green color of the leaves and the intensity of this color. The color of the plant parts (leaves, stems, and flowers) is also influenced by factors such as soil, sun, weed density and pressure, location, and water availability. Varying the saturation of the images simulates changes in these factors.

    Gaussian Noise: By adding noise, in this case Gaussian noise, different lighting conditions are simulated when capturing the images. We specifically chose Gaussian noise because it is common in many real-world scenarios and is grounded in the Central Limit Theorem, which states that the sum of many independent random variables tends to be normally distributed. This makes Gaussian noise a logical choice for simulating real-world random noise.

    Random Brightness Contrast: The Random Brightness and Random Contrast augmentation uses brightness to mimic varying lighting conditions and contrast to accentuate differences between plants, thereby highlighting their edges. This is a much softer approach to highlighting plant edges than a Canny edge detection augmentation, combining a weak focus on edges with variations in lighting conditions in one method; the other features in the images change far less than they would under an edge detection augmentation.

    Cutout Dropout: The cutout augmentation simulates random occlusion by other plants. These occlusions are common and expected. Jacobaea vulgaris plants may be partially or completely obscured by other plants during image capturing. This augmentation technique makes the models more robust to random occlusion.

    Data License

    The dataset is licensed under the license CC BY 4.0. The attributor of the data is the Chair of Geodesy and Geoinformatics at the University of Rostock. The data was created within the scope of the project 'UAV-based Grassland Monitoring at Population and Individual Plant Level', financed by the Authority for Economy, Transport, and Innovation of Hamburg.

  13. BTC-USD Price Data (June 2010 - November 2024)

    • kaggle.com
    zip
    Updated Nov 30, 2024
    Cite
    Farhan Ali (2024). BTC-USD Price Data (June 2010 - November 2024) [Dataset]. https://www.kaggle.com/datasets/farhanali097/btc-usd-price-data-june-2010-november-2024
    Explore at:
    Available download formats: zip (107,769 bytes)
    Dataset updated
    Nov 30, 2024
    Authors
    Farhan Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains historical price data for Bitcoin (BTC) against the U.S. Dollar (USD), spanning from June 2010 to November 2024. The data is organized on a daily basis and includes key market metrics such as the opening price, closing price, high, low, volume, and market capitalization for each day.

    Columns: The dataset consists of the following columns:

    • Date: The date of the recorded data point (format: YYYY-MM-DD).
    • Open: The opening price of Bitcoin on that day.
    • High: The highest price Bitcoin reached on that day.
    • Low: The lowest price Bitcoin reached on that day.
    • Close: The closing price of Bitcoin on that day.
    • Volume: The total trading volume of Bitcoin during that day.
    • Market Cap: The total market capitalization of Bitcoin on that day (calculated by multiplying the closing price by the circulating supply of Bitcoin at the time).

    Source: The data is sourced from Yahoo Finance.

    Time Period: The data spans from June 2010, when Bitcoin first began trading, to November 2024. This provides a comprehensive view of Bitcoin’s historical price movements, from its early days of trading at a fraction of a cent to its more recent valuation in the thousands of dollars.

    Use Cases:

    This dataset is valuable for a variety of purposes, including:

    • Time Series Analysis: Analyze Bitcoin price movements, identify trends, and develop predictive models for future prices.
    • Financial Modeling: Use the dataset to assess Bitcoin as an asset class, model its volatility, or simulate investment strategies.
    • Machine Learning: Train machine learning algorithms to forecast Bitcoin's future price or predict market trends based on historical data.
    • Economic Research: Study the impact of global events on Bitcoin's price, such as regulatory changes, technological developments, or macroeconomic factors.
    • Visualization: Generate visualizations of Bitcoin price trends, trading volume, and market capitalization over time.
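
    As a starting point for the time-series use case, a minimal pandas sketch (btc_usd.csv is a hypothetical filename for the extracted CSV) computes daily returns and a rolling volatility estimate:

    import pandas as pd

    # Filename is a placeholder for the CSV inside the downloaded zip.
    df = pd.read_csv("btc_usd.csv", parse_dates=["Date"]).set_index("Date").sort_index()

    # Daily simple returns from the closing price.
    df["Return"] = df["Close"].pct_change()

    # 30-day rolling volatility (standard deviation of daily returns).
    df["Volatility30d"] = df["Return"].rolling(30).std()

    print(df[["Close", "Return", "Volatility30d"]].tail())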

  14. Vehicle licensing statistics data tables

    • gov.uk
    • s3.amazonaws.com
    Updated Oct 15, 2025
    Cite
    Department for Transport (2025). Vehicle licensing statistics data tables [Dataset]. https://www.gov.uk/government/statistical-data-sets/vehicle-licensing-statistics-data-tables
    Explore at:
    Dataset updated
    Oct 15, 2025
    Dataset provided by
    GOV.UK
    Authors
    Department for Transport
    Description

    Data files containing detailed information about vehicles in the UK are also available, including make and model data.

    Some tables have been withdrawn and replaced. The table index for this statistical series has been updated to provide a full map between the old and new numbering systems used in this page.

    The Department for Transport is committed to continuously improving the quality and transparency of our outputs, in line with the Code of Practice for Statistics. In line with this, we have recently concluded a planned review of the processes and methodologies used in the production of Vehicle licensing statistics data. The review sought to identify and introduce further improvements and efficiencies in the coding technologies we use to produce our data, and as part of that, we identified several historical errors across the published data tables affecting different historical periods. These errors are the result of mistakes in past production processes that we have now identified, corrected, and taken steps to eliminate going forward.

    Most of the revisions to our published figures are small, typically changing values by less than 1% to 3%. The key revisions are:

    Licensed Vehicles (2014 Q3 to 2016 Q3)

    We found that some unlicensed vehicles during this period were mistakenly counted as licensed. This caused a slight overstatement, about 0.54% on average, in the number of licensed vehicles during this period.

    3.5 - 4.25 tonnes Zero Emission Vehicles (ZEVs) Classification

    Since 2023, ZEVs weighing between 3.5 and 4.25 tonnes have been classified as light goods vehicles (LGVs) instead of heavy goods vehicles (HGVs). We have now applied this change to earlier data and corrected an error in table VEH0150. As a result, the number of newly registered HGVs has been reduced by:

    • 3.1% in 2024

    • 2.3% in 2023

    • 1.4% in 2022

    Table VEH0156 (2018 to 2023)

    Table VEH0156, which reports average CO₂ emissions for newly registered vehicles, has been updated for the years 2018 to 2023. Most changes are minor (under 3%), but the e-NEDC measure saw a larger correction, up to 15.8%, due to a calculation error. The other measures (WLTP and Reported) were affected less notably, except for April 2020, when COVID-19 led to very few new registrations, which in turn produced greater volatility in the resulting percentages.

    Neither these specific revisions nor any of the others introduced have had a material impact on the overall statistics, the direction of trends, or the key messages they previously conveyed.

    Specific details of each revision made have been included in the relevant data table notes to ensure transparency and clarity. Users are advised to review these notes as part of their regular use of the data to ensure their analysis accounts for these changes accordingly.

    If you have questions regarding any of these changes, please contact the Vehicle statistics team.

    All vehicles

    Licensed vehicles

    Overview

    VEH0101 (https://assets.publishing.service.gov.uk/media/68ecf5acf159f887526bbd7c/veh0101.ods): Vehicles at the end of the quarter by licence status and body type: Great Britain and United Kingdom (ODS, 99.7 KB)

    Detailed breakdowns

    VEH0103 (https://assets.publishing.service.gov.uk/media/68ecf5abf159f887526bbd7b/veh0103.ods): Licensed vehicles at the end of the year by tax class: Great Britain and United Kingdom (ODS, 23.8 KB)

    VEH0105 (https://assets.publishing.service.gov.uk/media/68ecf5ac2adc28a81b4acfc8/veh0105.ods): Licensed vehicles at

  15. titanic_dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    mahmoud shogaa (2023). titanic_dataset [Dataset]. https://www.kaggle.com/datasets/mahmoudshogaa/titanic-dataset
    Explore at:
    Available download formats: zip (22,491 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    mahmoud shogaa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset typically includes the following columns:

    • PassengerId: A unique identifier for each passenger.
    • Survived: Indicates whether a passenger survived (1) or did not survive (0).
    • Pclass (Ticket class): A proxy for socio-economic status, with 1 being the highest class and 3 the lowest.
    • Name: The name of the passenger.
    • Sex: The gender of the passenger.
    • Age: The age of the passenger. (Note: there may be missing values in this column.)
    • SibSp: The number of siblings or spouses the passenger had aboard the Titanic.
    • Parch: The number of parents or children the passenger had aboard the Titanic.
    • Ticket: The ticket number.
    • Fare: The amount of money the passenger paid for the ticket.

    The main goal of using this dataset is to predict whether a passenger survived or not based on various features. It serves as a popular introductory dataset for those learning data analysis, machine learning, and predictive modeling. Keep in mind that the dataset may be subject to variations and updates, so it's always a good idea to check the Kaggle website or dataset documentation for the most recent information.
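
    For a quick first look at the prediction target, a minimal pandas sketch (train.csv is the conventional Kaggle filename and an assumption here) computes survival rates by sex and class:

    import pandas as pd

    # Conventional Kaggle filename; adjust to the actual file in this archive.
    df = pd.read_csv("train.csv")

    # Overall survival rate, then broken down by two strong predictors.
    print(df["Survived"].mean())
    print(df.groupby("Sex")["Survived"].mean())
    print(df.groupby("Pclass")["Survived"].mean())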

  16. English Agent-Customer Chat Dataset for Real Estate

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English Agent-Customer Chat Dataset for Real Estate [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-realestate-domain-conversation-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    FutureBeeAI AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English Real Estate Chat Dataset is a high-quality collection of over 12,000 text-based conversations between customers and call center agents. These conversations reflect real-world scenarios within the Real Estate sector, offering rich linguistic data for training conversational AI, chatbots, and NLP systems focused on property-related interactions in English-speaking regions.

    Participant & Chat Overview

    Participants: 200+ native English speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both speakers
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative interactions included

    Topic Diversity

    The dataset spans a broad range of Real Estate service conversations, covering various customer intents and agent support tasks:

    Inbound Chats (Customer-Initiated)
    Property inquiries (buy/rent)
    Rental property availability
    Renovation and maintenance inquiries
    Property features and amenities
    Investment advice and ROI analysis
    Property ownership and legal history
    Outbound Chats (Agent-Initiated)
    New property listing announcements
    Post-purchase follow-ups
    Investment opportunity alerts
    Property valuation updates
    Customer satisfaction and feedback surveys

    This topic variety enables realistic model training for both lead generation and post-sale engagement scenarios.

    Language Nuance & Authenticity

    Conversations are reflective of natural English used in the Real Estate domain, incorporating:

    Cultural Naming Patterns: Personal names, agency names, and developer brands
    Localized Contact Info: Phone numbers, email addresses, and geographic locations across English-speaking regions
    Numeric and Temporal Language: Dates, prices, unit sizes, and time references formatted in English conventions
    Informal and Domain-Specific Language: Real estate slang, idioms, and casual tone used in property discussions

    This level of linguistic realism supports model generalization across dialects and user demographics.

    Conversational Structure & Flow

    Conversations include a mix of short inquiries and detailed advisory sessions, capturing full customer journeys:

    Dialogue Types
    General inquiries
    Sales consultations
    Investment advisory
    Follow-up coordination
    Complaint handling and support
    Flow Components
    Greetings and identity verification
    Intent identification and context gathering

  17. Tucson Equity Priority Index (TEPI): Pima County Block Groups

    • teds.tucsonaz.gov
    Updated Jul 23, 2024
    Cite
    City of Tucson (2024). Tucson Equity Priority Index (TEPI): Pima County Block Groups [Dataset]. https://teds.tucsonaz.gov/maps/cotgis::tucson-equity-priority-index-tepi-pima-county-block-groups
    Explore at:
    Dataset updated
    Jul 23, 2024
    Dataset authored and provided by
    City of Tucson
    Description

    For detailed information, visit the Tucson Equity Priority Index StoryMap. Download the Data Dictionary.

    What is the Tucson Equity Priority Index (TEPI)?

    The Tucson Equity Priority Index (TEPI) is a tool that describes the distribution of socially vulnerable demographics. It categorizes the dataset into 5 classes that represent differing prioritization needs based on the presence of social vulnerability: Low (0-20), Low-Moderate (20-40), Moderate (40-60), Moderate-High (60-80), High (80-100). Each class represents 20% of the dataset’s features in order of their values. The features within the Low (0-20) classification represent the areas that, when compared to all other locations in the study area, have the lowest need for prioritization, as they tend to have less socially vulnerable demographics. The features that fall into the High (80-100) classification represent the 20% of locations in the dataset that have the greatest need for prioritization, as they tend to have the highest proportions of socially vulnerable demographics.

    How is social vulnerability measured?

    The TEPI examines the proportion of vulnerability per feature using 11 demographic indicators:

    Income Below Poverty: Households with income at or below the federal poverty level (FPL), which in 2023 was $14,500 for an individual and $30,000 for a family of four
    Unemployment: Measured as the percentage of unemployed persons in the civilian labor force
    Housing Cost Burdened: Homeowners who spend more than 30% of their income on housing expenses, including mortgage, maintenance, and taxes
    Renter Cost Burdened: Renters who spend more than 30% of their income on rent
    No Health Insurance: Those without private health insurance, Medicare, Medicaid, or any other plan or program
    No Vehicle Access: Households without automobile, van, or truck access
    High School Education or Less: Those whose highest level of educational attainment is a high school diploma, equivalency, or less
    Limited English Ability: Those whose ability to speak English is "Less Than Well"
    People of Color: Those who identify as anything other than Non-Hispanic White
    Disability: Households with one or more physical or cognitive disabilities
    Age: Groups that tend to have higher levels of vulnerability, including children (those below 18) and seniors (those 65 and older)

    An overall percentile value is calculated for each feature based on the total proportion of the above indicators in each area.

    How are the variables combined?

    These indicators are divided into two main categories called Thematic Indices: Economic and Personal Characteristics. The two thematic indices are further divided into five sub-indices called Tier-2 Sub-Indices, each containing 2-3 indicators. Indicators are the datasets used to measure vulnerability within each sub-index. The variables for each feature are re-scaled using the percentile normalization method, which converts them to a common scale with values between 0 and 100. The variables are then combined first into each of the five Tier-2 Sub-Indices, then into the Thematic Indices, then into the overall TEPI, using the mean aggregation method and equal weighting (a minimal code sketch of this pipeline follows the class definitions below). The resulting dataset is then divided into five classes, where:

    High Vulnerability (80-100%): Representing the top classification, this category includes the highest 20% of regions, which are the most socially vulnerable. These areas require the most focused attention.
    Moderate-High Vulnerability (60-80%): This upper-middle classification includes areas with higher levels of vulnerability than the median. While not the highest, these areas are more vulnerable than a majority of the dataset and should be considered for targeted interventions.
    Moderate Vulnerability (40-60%): Representing the middle quintile, this category includes areas of average vulnerability. These areas may show a balanced mix of high and low vulnerability; detailed examination of specific indicators is recommended to understand their nuanced needs.
    Low-Moderate Vulnerability (20-40%): Falling into the lower-middle classification, this range includes areas that are less vulnerable than most but may still exhibit certain vulnerable characteristics. These areas typically have a mix of lower and higher indicators, with the lower values predominating.
    Low Vulnerability (0-20%): This category represents the bottom classification, encompassing the lowest 20% of data points. Areas in this range are the least vulnerable, making them the most resilient compared to all other features in the dataset.
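    The normalize-then-average pipeline described above is straightforward to sketch in code. The following is a minimal Python illustration, not the City of Tucson's implementation: the indicator column names and the sub-index grouping are placeholders, since the listing names the 11 indicators but not the exact Tier-2 grouping.

```python
import pandas as pd

# Hypothetical indicator and sub-index names. The real TEPI uses 11 indicators
# grouped into 5 Tier-2 Sub-Indices; the grouping below is illustrative only.
SUB_INDICES = {
    "income": ["pct_below_poverty", "pct_unemployed"],
    "housing": ["pct_owner_cost_burdened", "pct_renter_cost_burdened"],
    "access": ["pct_no_insurance", "pct_no_vehicle"],
}

CLASSES = ["Low", "Low-Moderate", "Moderate", "Moderate-High", "High"]

def tepi(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Percentile normalization: re-scale each raw indicator to 0-100.
    indicator_cols = [c for cols in SUB_INDICES.values() for c in cols]
    ranked = df[indicator_cols].rank(pct=True) * 100
    # Mean aggregation with equal weighting:
    # indicators -> Tier-2 sub-indices -> overall index.
    subs = pd.DataFrame({name: ranked[cols].mean(axis=1)
                         for name, cols in SUB_INDICES.items()})
    out["tepi"] = subs.mean(axis=1)
    # Five classes, each holding 20% of the dataset's features by value order.
    out["tepi_class"] = pd.qcut(out["tepi"], q=5, labels=CLASSES)
    return out
```

    Note that pd.qcut splits on quantiles of the computed index, so each class holds roughly 20% of the features, matching the quintile-based classification described above.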

  18. EDA: Unlocking the Story Behind the Numbers

    • kaggle.com
    zip
    Updated Nov 17, 2025
    Cite
    Coding expert G.N (2025). EDA: Unlocking the Story Behind the Numbers [Dataset]. https://www.kaggle.com/datasets/ranaghulamnabi/eda-unlocking-the-story-behind-the-numbers/discussion
    Explore at:
    zip (9163 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Coding expert G.N
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context:

    This dataset contains real passenger information from the Titanic’s tragic voyage in 1912. It includes details like age, gender, ticket class, fare, and whether each passenger survived. The data is commonly used for learning data analysis and building beginner machine-learning models. It helps us explore patterns such as who had higher chances of survival and why.

    Feature distributions:

    1. Age Distribution

    Passenger ages range from infants to elderly adults, with most travelers falling between 20 and 40 years old. There are some missing values, especially among older passengers and children.

    2. Fare Distribution

    Fares vary widely — lower-class passengers paid small amounts, while first-class travelers paid much higher fares. The distribution is skewed because a few people paid very high ticket prices.

    3. Passenger Class (Pclass) Distribution

    Most passengers were in 3rd class, fewer in 2nd, and the smallest group in 1st class. This shows the ship had many lower-class travelers.

    4. Survival Distribution

    The dataset shows that more people did not survive than survived. Survival rates differ by gender, age, and class, with higher survival among women, children, and first-class passengers.
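    The four distribution checks above are quick to reproduce. Below is a minimal Python sketch, assuming the standard Kaggle Titanic train.csv with its usual column names (Age, Fare, Pclass, Sex, Survived); the file path is an assumption.

```python
import pandas as pd

# Standard Kaggle Titanic training file assumed to be in the working directory.
df = pd.read_csv("train.csv")

# Age: central tendency and missingness.
print(df["Age"].describe())
print("missing ages:", df["Age"].isna().sum())

# Fare: a right-skewed distribution shows up as mean well above median.
print("fare mean / median:", df["Fare"].mean(), "/", df["Fare"].median())

# Class sizes: 3rd class should dominate.
print(df["Pclass"].value_counts().sort_index())

# Survival rates by gender and class.
print(df.groupby(["Sex", "Pclass"])["Survived"].mean().round(2))
```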

  19. Polish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Polish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-polish-poland
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Polish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Polish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Polish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Polish speech models that understand and respond to authentic Polish accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Polish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Polish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Poland to ensure dialectal diversity and demographic balance.
    Demographics: A 60% male, 40% female gender split, with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
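    The listing does not publish the transcription schema, so the loader below is only a sketch of consuming speaker-segmented, time-coded utterances; the field names (utterances, speaker, start, end, text) and the file name are assumptions, not FutureBeeAI's documented format.

```python
import json

def load_utterances(path: str) -> list[tuple]:
    """Flatten a speaker-segmented, time-coded transcription into
    (speaker, start, end, text) tuples. All field names are assumed."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    return [(u["speaker"], u["start"], u["end"], u["text"])
            for u in doc["utterances"]]

# Example: print each utterance with its time span, e.g. to align audio
# segments with text for ASR training.
for speaker, start, end, text in load_utterances("conversation_001.json"):
    print(f"[{speaker}] {start:.2f}-{end:.2f}s: {text}")
```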

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Polish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Polish.
    Voice Assistants: Build smart assistants capable of understanding natural Polish conversations.

  20. Mexican Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of Mexico to ensure dialectal diversity and demographic balance.
    Demographics: A 60% male, 40% female gender split, with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass (average WER < 5%)

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
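    As an illustration of that filtering, here is a minimal sketch that assumes the speaker and recording metadata have been merged into one CSV; the file name and the columns used (gender, age, duration_minutes) are hypothetical, not FutureBeeAI's documented delivery format.

```python
import pandas as pd

# Hypothetical merged speaker + recording metadata table.
meta = pd.read_csv("metadata.csv")

# Select a demographically targeted subset, e.g. female speakers aged 18-40,
# to balance training data or probe ASR accuracy on a specific slice.
subset = meta[(meta["gender"] == "female") & (meta["age"].between(18, 40))]

# Check how many audio hours the slice covers before committing to training.
print(subset["duration_minutes"].sum() / 60, "hours selected")
```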

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Mexican Spanish conversations.