47 datasets found
  1. Mathematics Dataset

    • github.com
    • opendatalab.com
    Updated Apr 3, 2019
    Cite
    DeepMind (2019). Mathematics Dataset [Dataset]. https://github.com/Wikidepia/mathematics_dataset_id
    Explore at:
    Dataset updated
    Apr 3, 2019
    Dataset provided by
DeepMind (http://deepmind.com/)
    Description

    This dataset consists of mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This is designed to test the mathematical learning and algebraic reasoning skills of learning models.

    ## Example questions

     Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
     Answer: 4
     
     Question: Calculate -841880142.544 + 411127.
     Answer: -841469015.544
     
     Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
     Answer: 54*a - 30
    

    It contains 2 million (question, answer) pairs per module, with questions limited to 160 characters in length, and answers to 30 characters in length. Note the training data for each question type is split into "train-easy", "train-medium", and "train-hard". This allows training models via a curriculum. The data can also be mixed together uniformly from these training datasets to obtain the results reported in the paper. Categories:

    • algebra (linear equations, polynomial roots, sequences)
    • arithmetic (pairwise operations and mixed expressions, surds)
    • calculus (differentiation)
    • comparison (closest numbers, pairwise comparisons, sorting)
    • measurement (conversion, working with time)
    • numbers (base conversion, remainders, common divisors and multiples, primality, place value, rounding numbers)
    • polynomials (addition, simplification, composition, evaluating, expansion)
    • probability (sampling without replacement)
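For quick experiments with the released files, the sketch below parses one module, assuming the text layout of the original mathematics_dataset release (one question line followed by one answer line per pair); the extraction path and module name are illustrative.

```python
from pathlib import Path

def load_module(path):
    """Read one module file, assuming alternating question/answer lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return list(zip(lines[0::2], lines[1::2]))  # [(question, answer), ...]

# Illustrative path; point this at your extracted copy of the dataset.
pairs = load_module("mathematics_dataset-v1.0/train-easy/algebra__linear_1d.txt")
print(len(pairs))
print(pairs[0])
```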
  2. Saudi License Plate Characters Dataset

    • paperswithcode.com
    Updated Apr 18, 2025
    Cite
    (2025). Saudi License Plate Characters Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/saudi-license-plate-characters-dataset
    Explore at:
    Dataset updated
    Apr 18, 2025
    Area covered
    Saudi Arabia
    Description


    The Saudi License Plate Characters Dataset consists of 593 annotated images of Saudi Arabian vehicle license plates, meticulously designed to aid in character detection and recognition tasks. This dataset spans 27 distinct classes, incorporating a diverse set of characters found on Saudi license plates, including Arabic and Latin letters as well as Eastern and Western Arabic numerals. The dataset is ideal for training and evaluating machine learning models, particularly in optical character recognition (OCR) applications for license plates.


    Data Composition

    Each image in this dataset is provided with two corresponding annotation files: an XML file and a TXT file formatted for YOLO (You Only Look Once) model training. These annotations include bounding boxes around the characters on the license plates, taking into account the paired nature of Arabic and Latin characters or Eastern and Western numerals. Bounding boxes ensure precise localization of the characters, making this dataset highly suitable for character-level recognition.
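As a rough illustration of consuming the YOLO-formatted TXT annotations, the sketch below converts normalized (class, x_center, y_center, width, height) rows into pixel-space boxes; the file name and image size are placeholders, and the field order should be checked against the dataset's own files.

```python
def yolo_to_pixel_boxes(txt_path, img_w, img_h):
    """Convert YOLO lines (class xc yc w h, normalized to [0, 1]) to pixel boxes."""
    boxes = []
    with open(txt_path) as f:
        for line in f:
            if not line.strip():
                continue
            cls, xc, yc, w, h = line.split()
            xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
            x1, y1 = (xc - w / 2) * img_w, (yc - h / 2) * img_h
            boxes.append((int(cls), x1, y1, x1 + w * img_w, y1 + h * img_h))
    return boxes

# Placeholder file name and image size for illustration.
print(yolo_to_pixel_boxes("plate_0001.txt", img_w=640, img_h=480))
```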

    The images were sourced from a combination of publicly available data on the internet and original photographs taken by mobile phones. Each image was manually annotated to ensure high accuracy in labeling. The dataset covers a wide range of real-world scenarios, including varying lighting conditions, plate orientations, and image resolutions, making it versatile for robust OCR model development.

    Applications

    This dataset can be utilized for a wide array of applications:

    Automatic License Plate Recognition (ALPR): Enhancing the recognition of vehicle license plates in real-time applications such as traffic monitoring, toll collection, parking management, and law enforcement.

    Multilingual OCR: Developing and testing OCR systems that can handle multilingual characters, particularly those using Arabic script.

    Deep Learning Models: Training deep learning models for object detection, particularly in scenarios requiring precise recognition of small and complex character sets.

    Smart Cities and Surveillance Systems: Automating traffic management and surveillance systems by integrating license plate recognition for vehicle tracking.

    File Formats

    Images: High-quality images in various formats (e.g., JPG or PNG) that capture license plates under different conditions.

    Annotations: XML and YOLO-formatted TXT files with detailed bounding boxes, providing structured data for easy integration with various machine learning frameworks.

    Class Labels

The dataset includes 27 character classes, encompassing both Arabic and Latin characters, as well as numerals. This dual representation allows for flexibility in training models that can recognize characters in different formats across multiple languages.

    Key Features

    593 High-Quality Images: Covering diverse real-world conditions to ensure robustness and generalization of models.

    Multi-Class Annotations: 27 classes of license plate characters, supporting multilingual character detection.

    Bounding Boxes for Dual Characters: Carefully annotated bounding boxes for both Arabic and Latin representations of characters on license plates.

    Manually Labeled: All annotations are manually verified to ensure high accuracy.

    This dataset is sourced from Kaggle.

  3. MotionMiners Missplacement Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2024
    Cite
    Dönnebrink, Robin (2024). MotionMiners Missplacement Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8272090
    Explore at:
    Dataset updated
    Jan 24, 2024
    Dataset provided by
    Moya Rueda, Fernando
    Dönnebrink, Robin
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

The MotionMiners Miss-placement Dataset (MP1) is composed of recordings of seven subjects carrying out different activities in intralogistics, using a sensor set-up of On-Body Devices (OBDs) for industrial applications. Here, the position and orientation of the OBDs change with respect to the recording-and-usage guidelines. The OBDs are labeled with respect to their expected location on the human body, namely OBD_R, OBD_L and OBD_T on the right arm, left arm, and frontal torso. Tab. 1 (see manuscript) presents the different miss-placement classes of the dataset. This dataset considers miss-placement as a classification problem; in addition, the MP dataset also considers rotation miss-placements, which commonly appear in deployment according to practitioners' experience. The MP dataset contains recordings of seven subjects performing six activities: Standing, Walking, Handling Centred, Handling Upwards, Handling Downwards, and an additional Synchronisation. Each subject carried out each activity under up to 15 different miss-placement situations (soon to be updated to 20 different miss-placement situations), including a correct set-up of the devices. The MP dataset is divided into two subsets, MP_A and MP_B. Each recording of a subject contains:

- raw data of Acc, Gyr, and Mag in 3D for a certain number of samples, making a matrix of size [Samples x 27]
- annotated data of Acc, Gyr, and Mag in 3D for a certain number of samples, making a matrix of size [Samples, Act class, [27 channels]]
- for MP_B, the synchronized recording of the correct sensor set-up, so the matrix becomes [Samples, class, [27 channels of the miss-placed set-up], [27 channels of the correct set-up]]
- the miss-placement annotations [Samples, Miss-placement class]
- the activity annotations [Samples, activity class, [19 semantic attributes]]
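As a minimal sketch of handling the raw [Samples x 27] matrices, the snippet below reshapes the 27 channels into a (device, sensor, axis) view; the assumed ordering of three OBDs x Acc/Gyr/Mag x three axes is an illustration only and should be verified against the dataset documentation.

```python
import numpy as np

# Stand-in for one raw recording of shape [Samples x 27]; the channel
# ordering (3 OBDs x 3 sensors x 3 axes) is an assumption.
raw = np.random.randn(1000, 27)

channels = raw.reshape(-1, 3, 3, 3)   # (samples, OBD, sensor, axis)
obd_r_acc = channels[:, 0, 0, :]      # e.g. right-arm accelerometer
print(obd_r_acc.shape)                # (1000, 3)
```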

The semantic attributes are given following the paper: "LARa: Creating a Dataset for Human Activity Recognition in Logistics Using Semantic Attributes", Sensors 2020, DOI: 10.3390/s20154083. If you use this dataset for research, please cite the following paper: "Miss-placement Prediction of Multiple On-body Devices for Human Activity Recognition", Sensors 2020, DOI: 10.1145/3615834.3615838. For any questions about the dataset, please contact Fernando Moya Rueda at fernando.moya@motionminers.com.

  4. Data from: DISFA Dataset

    • paperswithcode.com
    Updated Mar 18, 2021
    Cite
    Seyed Mohammad Mavadati; Mohammad H. Mahoor; Kevin Bartlett; Philip Trinh; Jeffrey F. Cohn (2021). DISFA Dataset [Dataset]. https://paperswithcode.com/dataset/disfa
    Explore at:
    Dataset updated
    Mar 18, 2021
    Authors
    Seyed Mohammad Mavadati; Mohammad H. Mahoor; Kevin Bartlett; Philip Trinh; Jeffrey F. Cohn
    Description

The Denver Intensity of Spontaneous Facial Action (DISFA) dataset consists of 27 videos of 4,844 frames each, with 130,788 images in total. Action unit annotations are given at different levels of intensity, which are ignored in the following experiments, where action units are treated as either set or unset. DISFA was selected from a wider range of databases popular in the field of facial expression recognition because of the high number of smiles, i.e. action unit 12. In detail, 30,792 images have this action unit set, 82,176 images have some action unit(s) set, and 48,612 images have no action unit(s) set at all.

  5. EOAD (Egocentric Outdoor Activity Dataset)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Mehmet Ali Arabacı (2024). EOAD (Egocentric Outdoor Activity Dataset) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7742659
    Explore at:
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Alptekin Temizel
    Elif Surer
    Mehmet Ali Arabacı
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    EOAD is a collection of videos captured by wearable cameras, mostly of sports activities. It contains both visual and audio modalities.

It was initially built from the HUJI and FPVSum egocentric activity datasets. However, the number of samples and the diversity of activities in HUJI and FPVSum were insufficient, so we combined these datasets and extended them with new YouTube videos.

    The selection of videos was based on the following criteria:

    The videos should not include text overlays.

    The videos should contain natural sound (no external music)

    The actions in videos should be continuous (no cutting the scene or jumping in time)

    Video samples were trimmed depending on scene changes for long videos (such as driving, scuba diving, and cycling). As a result, a video may have several clips depicting egocentric actions. Hence, video clips were extracted from carefully defined time intervals within videos. The final dataset includes video clips with a single action and natural audio information.

    Statistics for EOAD:

    30 activities

    303 distinct videos

    1392 video clips

2243 minutes of labeled video clips

The detailed statistics for the selected datasets and the video clips crawled from YouTube are given below:

    HUJI: 49 distinct videos - 148 video clips for 9 activities (driving, biking, motorcycle, walking, boxing, horse riding, running, skiing, stair climbing)

    FPVSum: 39 distinct videos - 124 video segments for 8 activities (biking, horse riding, skiing, longboarding, rock climbing, scuba, skateboarding, surfing)

    YouTube: 216 distinct videos - 1120 video clips for 27 activities (american football, basketball, bungee jumping, driving, go-kart, horse riding, ice hockey, jet ski, kayaking, kitesurfing, longboarding, motorcycle, paintball, paragliding, rafting, rock climbing, rowing, running, sailing, scuba diving, skateboarding, soccer, stair climbing, surfing, tennis, volleyball, walking)

The video clips used for the training, validation and test sets for each activity are listed in Table 1. Multiple video clips may belong to a single video because videos were trimmed for several reasons (e.g., scene cuts, temporarily overlaid text, or parts unrelated to the activity).

While splitting the dataset, the minimum number of videos per activity was set to 8. The video samples were divided 50%, 25%, and 25% for training (minimum four videos), validation (minimum two videos), and testing (minimum two videos), respectively. In addition, videos were split according to the raw video footage to prevent similar video clips (with the same actors and scenes) from mixing across training, validation, and test sets: video clips trimmed from the same video always went into the same split, to ensure a fair comparison. A group-aware split along these lines is sketched below.
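The sketch uses scikit-learn's GroupShuffleSplit, with the source video as the group so that clips trimmed from the same raw footage never end up in different partitions; the clip list is illustrative, not taken from the dataset.

```python
from sklearn.model_selection import GroupShuffleSplit

# Illustrative (clip_id, activity, source_video) records.
clips = [("c1", "Soccer", "v1"), ("c2", "Soccer", "v1"),
         ("c3", "Soccer", "v2"), ("c4", "Kayaking", "v3"),
         ("c5", "Kayaking", "v4"), ("c6", "Kayaking", "v4"),
         ("c7", "Rafting", "v5"), ("c8", "Rafting", "v6")]
groups = [video for _, _, video in clips]

# Half of the source videos go to training ...
gss = GroupShuffleSplit(n_splits=1, train_size=0.5, test_size=0.5, random_state=0)
train_idx, rest_idx = next(gss.split(clips, groups=groups))

# ... and the remaining videos are split evenly into validation and test.
rest_groups = [groups[i] for i in rest_idx]
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, test_size=0.5, random_state=0)
val_rel, test_rel = next(gss2.split(rest_idx, groups=rest_groups))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]
print(train_idx, val_idx, test_idx)
```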

Some activities have continuity throughout the video, such as scuba, longboarding, or horse riding, so their number of video segments equals their number of videos. Other activities, such as skating, occur in short bursts, which makes their number of video segments higher. As a result, the number of video clips in the training, validation, and test sets is highly imbalanced across activities (e.g., jet ski and rafting have 4 training clips, while soccer has 99).

Table 1 - Dataset splitting for EOAD

| Action Label | Train Clips | Train Duration | Validation Clips | Validation Duration | Test Clips | Test Duration |
| --- | --- | --- | --- | --- | --- | --- |
| AmericanFootball | 34 | 00:06:09 | 36 | 00:05:03 | 9 | 00:01:20 |
| Basketball | 43 | 01:13:22 | 19 | 00:08:13 | 10 | 00:28:46 |
| Biking | 9 | 01:58:01 | 6 | 00:32:22 | 11 | 00:36:16 |
| Boxing | 7 | 00:24:54 | 11 | 00:14:14 | 5 | 00:17:30 |
| BungeeJumping | 7 | 00:02:22 | 4 | 00:01:36 | 4 | 00:01:31 |
| Driving | 19 | 00:37:23 | 9 | 00:24:46 | 9 | 00:29:23 |
| GoKart | 5 | 00:40:00 | 3 | 00:11:46 | 3 | 00:19:46 |
| Horseback | 5 | 01:15:14 | 5 | 01:02:26 | 2 | 00:20:38 |
| IceHockey | 52 | 00:19:22 | 46 | 00:20:34 | 10 | 00:36:59 |
| Jetski | 4 | 00:23:35 | 5 | 00:18:42 | 6 | 00:02:43 |
| Kayaking | 28 | 00:43:11 | 22 | 00:14:23 | 4 | 00:11:05 |
| Kitesurfing | 30 | 00:21:51 | 17 | 00:05:38 | 6 | 00:01:32 |
| Longboarding | 5 | 00:15:40 | 4 | 00:18:03 | 4 | 00:09:11 |
| Motorcycle | 20 | 00:49:38 | 21 | 00:13:53 | 8 | 00:20:30 |
| Paintball | 7 | 00:33:52 | 4 | 00:12:08 | 4 | 00:08:52 |
| Paragliding | 11 | 00:28:42 | 4 | 00:10:16 | 4 | 00:19:50 |
| Rafting | 4 | 00:15:41 | 3 | 00:07:27 | 3 | 00:06:13 |
| RockClimbing | 6 | 00:49:38 | 2 | 00:21:59 | 2 | 00:18:50 |
| Rowing | 5 | 00:47:05 | 3 | 00:13:21 | 3 | 00:03:26 |
| Running | 21 | 01:21:56 | 19 | 00:46:29 | 11 | 00:42:59 |
| Sailing | 7 | 00:39:30 | 4 | 00:14:39 | 6 | 00:15:43 |
| Scuba | 5 | 00:35:02 | 3 | 00:23:43 | 2 | 00:18:52 |
| Skate | 91 | 00:15:53 | 30 | 00:07:01 | 10 | 00:02:03 |
| Ski | 14 | 01:48:15 | 17 | 01:01:59 | 7 | 00:39:15 |
| Soccer | 102 | 00:48:39 | 52 | 00:13:17 | 16 | 00:06:54 |
| StairClimbing | 6 | 01:05:32 | 6 | 00:17:18 | 5 | 00:20:22 |
| Surfing | 23 | 00:12:51 | 17 | 00:06:52 | 10 | 00:07:04 |
| Tennis | 34 | 00:27:04 | 9 | 00:06:03 | 9 | 00:03:14 |
| Volleyball | 87 | 00:19:14 | 35 | 00:07:46 | 7 | 00:18:58 |
| Walking | 49 | 00:43:02 | 36 | 00:38:25 | 10 | 00:10:23 |
| Total (30 activities) | 740 | 20:22:37 | 452 | 09:20:23 | 200 | 08:00:08 |

    EOAD Code Repository

Scripts for downloading the raw videos and trimming them into video clips are provided in this GitHub repository.

For any questions, please contact mali.arabaci@gmail.com.

  6. Reddit Submissions

    • kaggle.com
    Updated Oct 30, 2023
    Cite
    Ahmad (2023). Reddit Submissions [Dataset]. https://www.kaggle.com/datasets/pypiahmad/reddit-submissions/data
    Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 30, 2023
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    Ahmad
    License

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Reddit Submissions dataset encompasses submissions of Reddit posts, particularly focusing on resubmissions of the same content, along with pertinent metadata. This dataset covers a timespan from July 2008 to January 2013 and provides an insightful view into the dynamics of content sharing and engagement within the Reddit community.

Basic Statistics:
- Number of Submissions (images): 132,308
- Number of Unique Images: 16,736
- Timespan: July 2008 - January 2013

Metadata:
- Timestamps: The time when a post was submitted.
- Upvotes/Downvotes: The number of upvotes and downvotes a post received.
- Post Title: The title of the submitted post.
- Subreddit: The subreddit to which the post was submitted.
- Additional metadata such as total votes, Reddit ID, number of comments, and username of the submitter.

Examples:
```plaintext
image_id, unixtime, rawtime, title, total_votes, reddit_id,..., number_of_downvotes, localtime, score, number_of_comments, username
1005, 1335861624, 2012-05-01T15:40:24.968266-07:00, I immediately regret this decision, 27, t296r, 20, pics, 7, 1335886824, 13, 0, ninjaroflmaster
1005, 1336470481, 2012-05-08T16:48:01.418140-07:00, "Pushing your friend into the water, Level: 99", 18, tds4i, 16, funny, 2, 1336495681, 14, 0, hme4
1005, 1339566752, 2012-06-13T12:52:32.371941-07:00, I told him. He Didn't Listen, 6, v0cma, 4, funny, 2, 1339591952, 2, 0, HeyPatWhatsUp
1005, 1342200476, 2012-07-14T00:27:56.857805-07:00, Don't end up as this guy., 16, wjivx, 7, funny, 9, 1342225676, -2, 2, catalyst24
```
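For the resubmission-analysis use cases listed below, a hedged pandas starting point is shown here; the file name is illustrative and the column names are the ones visible in the header and metadata above.

```python
import pandas as pd

# Illustrative file name; column names are read from the file's header row.
df = pd.read_csv("redditSubmissions.csv")
print(df.columns.tolist())

# How often was each image (re)submitted?
counts = df.groupby("image_id").size().sort_values(ascending=False)
print(counts.describe())

# A quick engagement view: average score per subreddit.
print(df.groupby("subreddit")["score"].mean().sort_values(ascending=False).head())
```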

Download Links:
- Resubmissions Data (7.3 MB)
- Raw HTML of Resubmissions (1.8 GB)

    Citation: - Understanding the interplay between titles, content, and communities in social media, Himabindu Lakkaraju, Julian McAuley, Jure Leskovec, ICWSM, 2013. pdf

Use Cases:
1. Content Resubmission Analysis: Analyzing the pattern and impact of content resubmissions across different subreddits.
2. Community Engagement: Studying how different titles, content, and subreddits influence user engagement in terms of upvotes, downvotes, and comments.
3. Temporal Analysis: Investigating how the popularity of certain content changes over time and how resubmissions are accepted by the community at different time intervals.
4. Subreddit Analysis: Understanding the characteristics of different subreddits in terms of content sharing and resubmissions.
5. User Behavior Analysis: Examining user behavior in terms of content submission, resubmission, and interaction.
6. Social Media Marketing: For marketers, understanding the dynamics of content resubmission could help in optimizing the content sharing strategy on Reddit.
7. Machine Learning: Utilizing the dataset to build models that can predict the success of a post or resubmission based on various factors.
8. NLP Applications: Analyzing text data for sentiment analysis, topic modeling, and other Natural Language Processing (NLP) applications.
9. Spam Detection: Identifying spam or redundant content through the analysis of resubmissions and user behaviors.

    This dataset is valuable for researchers, social media analysts, marketers, and data scientists interested in studying social media dynamics, especially on a platform like Reddit where content resubmission is common.

  7. Dataset of book subjects that contain 27

    • workwithdata.com
    Updated Nov 7, 2024
    Cite
    Work With Data (2024). Dataset of book subjects that contain 27 [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=27&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

This dataset is about book subjects. It has 4 rows and is filtered to book subjects where the number of books is 27. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  8. Rural statistics local level data sets

    • gov.uk
    Updated Jul 21, 2016
    Cite
    Department for Environment, Food & Rural Affairs (2016). Rural statistics local level data sets [Dataset]. https://www.gov.uk/government/statistical-data-sets/rural-statistics-local-level-data-sets
    Explore at:
    Dataset updated
    Jul 21, 2016
    Dataset provided by
GOV.UK (http://gov.uk/)
    Authors
    Department for Environment, Food & Rural Affairs
    Description

    Local authority and Local Enterprise Partnership data sets for key economic data by rural and urban breakdown.

    Additional information:

Local authority level data on population, claimant count, insolvencies, business numbers and house prices (MS Excel Spreadsheet, 211 KB): https://assets.publishing.service.gov.uk/media/5a7f09bfed915d74e62280b0/local-data-12-13_LU.xlsx
    
    
    
    
    

  9. Bellmont, IL annual income distribution by work experience and gender...

    • neilsberg.com
    csv, json
    Updated Feb 27, 2025
    Cite
    Neilsberg Research (2025). Bellmont, IL annual income distribution by work experience and gender dataset: Number of individuals ages 15+ with income, 2023 // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/ba96cafd-f4ce-11ef-8577-3860777c1fe6/
    Explore at:
Available download formats: json, csv
    Dataset updated
    Feb 27, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bellmont, Illinois
    Variables measured
    Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time, Number of males working full time for a given income bracket, Number of males working part time for a given income bracket, Number of females working full time for a given income bracket, Number of females working part time for a given income bracket
    Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To portray the number of individuals of each gender (male and female) within each income bracket, we conducted an initial analysis and categorization of the American Community Survey data. Households are categorized, and median incomes are reported, based on the self-identified gender of the head of the household. For additional information about these estimations, please contact us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

The dataset presents the detailed breakdown of the count of individuals within distinct income brackets, categorizing them by gender (men and women) and employment type - full-time (FT) and part-time (PT), offering valuable insights into the diverse income landscapes within Bellmont. The dataset can be utilized to gain insights into gender-based income distribution within the Bellmont population, aiding in data analysis and decision-making.

    Key observations

    • Employment patterns: Within Bellmont, among individuals aged 15 years and older with income, there were 70 men and 101 women in the workforce. Among them, 27 men were engaged in full-time, year-round employment, while 27 women were in full-time, year-round roles.
    • Annual income under $24,999: Of the male population working full-time, 3.70% fell within the income range of under $24,999, while 40.74% of the female population working full-time was represented in the same income bracket.
• Annual income above $100,000: 7.41% of men in full-time roles earned incomes exceeding $100,000, while none of the women in full-time positions earned within this income bracket.
• Refer to the research insights for more key observations on more income brackets (Annual income under $24,999, Annual income between $25,000 and $49,999, Annual income between $50,000 and $74,999, Annual income between $75,000 and $99,999, and Annual income above $100,000) and employment types (full-time year-round and part-time).
    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Income brackets:

    • $1 to $2,499 or loss
    • $2,500 to $4,999
    • $5,000 to $7,499
    • $7,500 to $9,999
    • $10,000 to $12,499
    • $12,500 to $14,999
    • $15,000 to $17,499
    • $17,500 to $19,999
    • $20,000 to $22,499
    • $22,500 to $24,999
    • $25,000 to $29,999
    • $30,000 to $34,999
    • $35,000 to $39,999
    • $40,000 to $44,999
    • $45,000 to $49,999
    • $50,000 to $54,999
    • $55,000 to $64,999
    • $65,000 to $74,999
    • $75,000 to $99,999
    • $100,000 or more

    Variables / Data Columns

• Income Bracket: This column showcases 20 income brackets ranging from $1 to $100,000+.
    • Full-Time Males: The count of males employed full-time year-round and earning within a specified income bracket
    • Part-Time Males: The count of males employed part-time and earning within a specified income bracket
    • Full-Time Females: The count of females employed full-time year-round and earning within a specified income bracket
    • Part-Time Females: The count of females employed part-time and earning within a specified income bracket

    Employment type classifications include:

    • Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.
    • Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.
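As a small illustration of reproducing the bracket-share observations above, the sketch below assumes a CSV export with the columns named in the "Variables / Data Columns" list; the file name is illustrative.

```python
import pandas as pd

# Illustrative file name; columns follow the "Variables / Data Columns" list.
df = pd.read_csv("bellmont-income-by-gender.csv")

# Share of full-time workers in each income bracket, by gender.
for col in ["Full-Time Males", "Full-Time Females"]:
    share = (100 * df[col] / df[col].sum()).round(2)
    print(col)
    print(pd.DataFrame({"Income Bracket": df["Income Bracket"], "Share (%)": share}))
```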

    Good to know

    Margin of Error

Data in the dataset are based on the estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

If you need custom data for your research project, report or presentation, you can contact our research staff at research@neilsberg.com to check the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

The Neilsberg Research team curates, analyzes and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

This dataset is a part of the main dataset for Bellmont median household income by race, which you can refer to here.

  10. Dataset for paper: Decoding multi-joint hand movements from brain signals by...

    • zenodo.org
    Updated Apr 28, 2025
    Cite
    Huaqin Sun; Huaqin Sun (2025). Dataset for paper: Decoding multi-joint hand movements from brain signals by learning a synergy-based neural manifold [Dataset]. http://doi.org/10.5281/zenodo.15295424
    Explore at:
    Dataset updated
    Apr 28, 2025
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Huaqin Sun; Huaqin Sun
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    ## Dataset and File Structure Description


    The dataset is organized into three folders: NeuralData, KinData, and Info.

    - NeuralData: Contains invasive brain signal recordings for each session.
    - KinData: Includes joint angle trajectories for 11 hand movements.
    - Info: Holds session metadata.

    The dataset consists of 3 sessions, each collected on a separate experimental day. Each session contains the following:

    1. A NeuralData file in MATLAB .mat format.
    2. An Info file in YAML format.

    The filenames indicate both the collection date and session number. For example, ‘**2022-06-27_Session_1.mat**’ refers to the first session, collected on June 27, 2022. Each session includes 11 different hand movements, with joint angle trajectories stored in the KinData folder. Filenames specify the gesture class and movement duration. For instance, ‘**gesture_data_901_2400.mat**’ represents the 901 gesture class, with an execution time of 2400 ms (2.4 seconds).

    ### Key Metadata Description

    ### NeuralData Files:

    - N: Number of trials
    - C: Number of neural channels
    - T: Number of time bins
    - data: A 3D matrix (N-T-C), representing neural activity across all trials
    - target_label: A vector of size N, indicating the hand target index for each trial (coded from 901 to 911)
    - target_no: A vector of size N, indicating the hand target index for each trial (coded from 0 to 10)
    - len_trial: A vector of size N, specifying the number of time bins for each trial

    ### Info Files:

    - start_bin: Bin index for the movement start
    - end_bin: Bin index for the movement end
    - kin_data: Length of the movement duration (e.g., 2400 denotes 2.4 seconds)
    - num_block: Number of blocks in the session
    - num_movement: Total number of movements

    ### KinData Files:

    - D: Number of motion dimensions (hand joints)
    - M: Number of target gestures (hand movements)
    - T: Number of time bins
    - angle_base: A vector of size M, representing the initial joint angle for each gesture, used as the base angle for all movements
- rotation_angle: A matrix of size T × D, showing the joint rotation angle trajectory during movement
- rotation_axis: A matrix of size T × D × 3, indicating the rotation axis for each joint in three-dimensional space

    Note: We recorded 16 joint angles during movement, consisting of one wrist joint and 15 hand joints. The analysis focuses on the kinematics of the 15 hand joints, using only the last 15 dimensions of the rotation angle data.
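A minimal loading sketch, assuming standard .mat and YAML parsing: the key names follow the metadata lists above, the NeuralData and KinData file names come from the examples in the description, and the Info file name is an assumption based on the session naming scheme.

```python
import scipy.io as sio
import yaml

session = sio.loadmat("NeuralData/2022-06-27_Session_1.mat")
data = session["data"]                          # (N, T, C) neural activity
target_label = session["target_label"].ravel()  # hand target index, 901..911
len_trial = session["len_trial"].ravel()        # time bins per trial
print(data.shape, target_label[:5], len_trial[:5])

with open("Info/2022-06-27_Session_1.yaml") as f:  # assumed Info file name
    info = yaml.safe_load(f)
print(info["start_bin"], info["end_bin"], info["kin_data"])

kin = sio.loadmat("KinData/gesture_data_901_2400.mat")
print(kin["rotation_angle"].shape)              # (T, D) joint rotation trajectory
```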

  11. GoEmotions Dataset

    • paperswithcode.com
    • tensorflow.org
    Cite
    Dorottya Demszky; Dana Movshovitz-Attias; Jeongwoo Ko; Alan Cowen; Gaurav Nemade; Sujith Ravi, GoEmotions Dataset [Dataset]. https://paperswithcode.com/dataset/goemotions
    Explore at:
    Authors
    Dorottya Demszky; Dana Movshovitz-Attias; Jeongwoo Ko; Alan Cowen; Gaurav Nemade; Sujith Ravi
    Description

    GoEmotions is a corpus of 58k carefully curated comments extracted from Reddit, with human annotations to 27 emotion categories or Neutral.

    Number of examples: 58,009. Number of labels: 27 + Neutral. Maximum sequence length in training and evaluation datasets: 30.

On top of the raw data, the dataset also includes a version filtered based on rater agreement, which contains a train/test/validation split:

    Size of training dataset: 43,410. Size of test dataset: 5,427. Size of validation dataset: 5,426.

    The emotion categories are: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise.
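A hedged loading sketch via TensorFlow Datasets (tensorflow.org is listed as a source above); it assumes the TFDS registration name "goemotions", so check the catalog if the name or the feature layout differs.

```python
import tensorflow_datasets as tfds

# Assumes the TFDS name "goemotions"; split sizes should match the figures
# above (43,410 train / 5,426 validation / 5,427 test).
ds, info = tfds.load("goemotions", split="train", with_info=True)
print(info.splits)
print(info.features)   # comment text plus one boolean field per emotion

for example in tfds.as_numpy(ds.take(1)):
    print(example)
```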

  12. A fMRI dataset in response to large number of short natural dynamic facial...

    • openneuro.org
    Updated Oct 10, 2024
    Cite
    Panpan Chen; Chi Zhang; Bao Li; Li Tong; Linyuan Wang; Shuxiao Ma; Long Cao; Ziya Yu; Bin Yan (2024). A fMRI dataset in response to large number of short natural dynamic facial expression videos [Dataset]. http://doi.org/10.18112/openneuro.ds005047.v1.0.4
    Explore at:
    Dataset updated
    Oct 10, 2024
    Dataset provided by
OpenNeuro (https://openneuro.org/)
    Authors
    Panpan Chen; Chi Zhang; Bao Li; Li Tong; Linyuan Wang; Shuxiao Ma; Long Cao; Ziya Yu; Bin Yan
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Summary

Facial expression is among the most natural means for human beings to convey emotional information in daily life. Although the neural mechanism of facial expression has been extensively studied using lab-controlled images and a small number of lab-controlled video stimuli, how the human brain processes natural facial expressions still needs to be investigated. To our knowledge, fMRI data collected specifically for a large number of natural facial expression videos are currently missing. We describe here the Natural Facial Expressions Dataset (NFED), an fMRI dataset including responses to 1,320 short (3-second) natural facial expression video clips. These video clips are annotated with three types of labels: emotion, gender, and ethnicity, along with accompanying metadata. We validate that the dataset has good quality within and across participants and, notably, can capture temporal and spatial stimulus features. NFED provides researchers with fMRI data for understanding the visual processing of a large number of natural facial expression videos.

    Data Records

The data, which were structured following the BIDS format, were accessible at https://openneuro.org/datasets/ds005047. The “sub-

Stimulus. Distinct folders store the stimuli for distinct fMRI experiments: "stimuli/face-video", "stimuli/floc", and "stimuli/prf" (Fig. 2b). The category labels and metadata corresponding to the video stimuli are stored in “videos-stimuli_category_metadata.tsv”. The “videos-stimuli_description.json” file describes the category and metadata information of the video stimuli (Fig. 2b).

    Raw MRI data. Each participant's folder is comprised of 11 session folders: “sub-

    Volume data from pre-processing. The pre-processed volume-based fMRI data were in the folder named “pre-processed_volume_data/sub-

    Surface data from pre-processing. The pre-processed surface-based data were stored in a file named “volumetosurface/sub-

    FreeSurfer recon-all. The results of reconstructing the cortical surface were saved as “recon-all-FreeSurfer/sub-

    Surface-based GLM analysis data. We have conducted GLMsingle on the data of the main experiment. There is a file named “sub--

    Validation. The code of technical validation was saved in the “derivatives/validation/code” folder. The results of technical validation were saved in the “derivatives/validation/results” folder (Fig. 2h). “README.md” describes the detailed information of code and results.

  13. Vocational qualifications dataset

    • gov.uk
    • s3.amazonaws.com
    Updated Jun 12, 2025
    Cite
    Ofqual (2025). Vocational qualifications dataset [Dataset]. https://www.gov.uk/government/statistical-data-sets/vocational-qualifications-dataset
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset provided by
GOV.UK (http://gov.uk/)
    Authors
    Ofqual
    Description

This dataset covers vocational qualifications in England from 2012 to the present.

    It is updated every quarter.

In the dataset, the number of certificates issued is rounded to the nearest 5, and values less than 5 appear as ‘Fewer than 5’ to preserve confidentiality (a 0 represents no certificates).

    Where a qualification has been owned by more than one awarding organisation at different points in time, a separate row is given for each organisation.

    Background information as well as commentary accompanying this dataset is available separately.

    For any queries contact us at data.analytics@ofqual.gov.uk.

  14. Data from: ImageNet Dataset

    • paperswithcode.com
    Updated Jun 23, 2022
    Cite
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li (2022). ImageNet Dataset [Dataset]. https://paperswithcode.com/dataset/imagenet
    Explore at:
    Dataset updated
    Jun 23, 2022
    Authors
    Jia Deng; Wei Dong; Richard Socher; Li-Jia Li; Kai Li; Fei-Fei Li
    Description

The ImageNet dataset contains 14,197,122 annotated images according to the WordNet hierarchy. Since 2010, the dataset has been used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a benchmark in image classification and object detection. The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld. ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”. The ImageNet project does not own the copyright of the images, therefore only thumbnails and URLs of images are provided.

Total number of non-empty WordNet synsets: 21,841
Total number of images: 14,197,122
Number of images with bounding box annotations: 1,034,908
Number of synsets with SIFT features: 1,000
Number of images with SIFT features: 1.2 million

  15. AIT Log Data Set V2.0

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jun 28, 2024
    Cite
    Rauber, Andreas (2024). AIT Log Data Set V2.0 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5789063
    Explore at:
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Rauber, Andreas
    Wurzenberger, Markus
    Frank, Maximilian
    Landauer, Max
    Skopik, Florian
    Hotwagner, Wolfgang
    License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    AIT Log Data Sets

    This repository contains synthetic log data suitable for evaluation of intrusion detection systems, federated learning, and alert aggregation. A detailed description of the dataset is available in [1]. The logs were collected from eight testbeds that were built at the Austrian Institute of Technology (AIT) following the approach by [2]. Please cite these papers if the data is used for academic publications.

    In brief, each of the datasets corresponds to a testbed representing a small enterprise network including mail server, file share, WordPress server, VPN, firewall, etc. Normal user behavior is simulated to generate background noise over a time span of 4-6 days. At some point, a sequence of attack steps is launched against the network. Log data is collected from all hosts and includes Apache access and error logs, authentication logs, DNS logs, VPN logs, audit logs, Suricata logs, network traffic packet captures, horde logs, exim logs, syslog, and system monitoring logs. Separate ground truth files are used to label events that are related to the attacks. Compared to the AIT-LDSv1.1, a more complex network and diverse user behavior is simulated, and logs are collected from all hosts in the network. If you are only interested in network traffic analysis, we also provide the AIT-NDS containing the labeled netflows of the testbed networks. We also provide the AIT-ADS, an alert data set derived by forensically applying open-source intrusion detection systems on the log data.

    The datasets in this repository have the following structure:

    The gather directory contains all logs collected from the testbed. Logs collected from each host are located in gather//logs/.

The labels directory contains the ground truth of the dataset that indicates which events are related to attacks. The directory mirrors the structure of the gather directory so that each label file is located at the same path and has the same name as the corresponding log file. Each line in the label files references the log event corresponding to an attack by the line number counted from the beginning of the file ("line"), the labels assigned to the line that state the respective attack step ("labels"), and the labeling rules that assigned the labels ("rules"). An example is provided below.

    The processing directory contains the source code that was used to generate the labels.

    The rules directory contains the labeling rules.

    The environment directory contains the source code that was used to deploy the testbed and run the simulation using the Kyoushi Testbed Environment.

    The dataset.yml file specifies the start and end time of the simulation.

    The following table summarizes relevant properties of the datasets:

| Dataset | Simulation time | Attack time | Scan volume | Unpacked size | Notes |
| --- | --- | --- | --- | --- | --- |
| fox | 2022-01-15 00:00 - 2022-01-20 00:00 | 2022-01-18 11:59 - 2022-01-18 13:15 | High | 26 GB | |
| harrison | 2022-02-04 00:00 - 2022-02-09 00:00 | 2022-02-08 07:07 - 2022-02-08 08:38 | High | 27 GB | |
| russellmitchell | 2022-01-21 00:00 - 2022-01-25 00:00 | 2022-01-24 03:01 - 2022-01-24 04:39 | Low | 14 GB | |
| santos | 2022-01-14 00:00 - 2022-01-18 00:00 | 2022-01-17 11:15 - 2022-01-17 11:59 | Low | 17 GB | |
| shaw | 2022-01-25 00:00 - 2022-01-31 00:00 | 2022-01-29 14:37 - 2022-01-29 15:21 | Low | 27 GB | Data exfiltration is not visible in DNS logs |
| wardbeck | 2022-01-19 00:00 - 2022-01-24 00:00 | 2022-01-23 12:10 - 2022-01-23 12:56 | Low | 26 GB | |
| wheeler | 2022-01-26 00:00 - 2022-01-31 00:00 | 2022-01-30 07:35 - 2022-01-30 17:53 | High | 30 GB | No password cracking in attack chain |
| wilson | 2022-02-03 00:00 - 2022-02-09 00:00 | 2022-02-07 10:57 - 2022-02-07 11:49 | High | 39 GB | |

    The following attacks are launched in the network:

    Scans (nmap, WPScan, dirb)

    Webshell upload (CVE-2020-24186)

    Password cracking (John the Ripper)

    Privilege escalation

    Remote command execution

    Data exfiltration (DNSteal)

    Note that attack parameters and their execution orders vary in each dataset. Labeled log files are trimmed to the simulation time to ensure that their labels (which reference the related event by the line number in the file) are not misleading. Other log files, however, also contain log events generated before or after the simulation time and may therefore be affected by testbed setup or data collection. It is therefore recommended to only consider logs with timestamps within the simulation time for analysis.

    The structure of labels is explained using the audit logs from the intranet server in the russellmitchell data set as an example in the following. The first four labels in the labels/intranet_server/logs/audit/audit.log file are as follows:

    {"line": 1860, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1861, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1862, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

    {"line": 1863, "labels": ["attacker_change_user", "escalate"], "rules": {"attacker_change_user": ["attacker.escalate.audit.su.login"], "escalate": ["attacker.escalate.audit.su.login"]}}

Each JSON object in this file assigns a label to one specific log line in the corresponding log file located at gather/intranet_server/logs/audit/audit.log. The field "line" in the JSON objects specifies the line number of the respective event in the original log file, while the field "labels" comprises the corresponding labels. For example, the lines in the sample above provide the information that lines 1860-1863 in the gather/intranet_server/logs/audit/audit.log file are labeled with "attacker_change_user" and "escalate" corresponding to the attack step where the attacker receives escalated privileges. Inspecting these lines shows that they indeed correspond to the user authenticating as root:

    type=USER_AUTH msg=audit(1642999060.603:2226): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_ACCT msg=audit(1642999060.603:2227): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:accounting acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=CRED_ACQ msg=audit(1642999060.615:2228): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:setcred acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    type=USER_START msg=audit(1642999060.627:2229): pid=27950 uid=33 auid=4294967295 ses=4294967295 msg='op=PAM:session_open acct="jhall" exe="/bin/su" hostname=? addr=? terminal=/dev/pts/1 res=success'

    The same applies to all other labels for this log file and all other log files. There are no labels for logs generated by "normal" (i.e., non-attack) behavior; instead, all log events that have no corresponding JSON object in one of the files from the labels directory, such as the lines 1-1859 in the example above, can be considered to be labeled as "normal". This means that in order to figure out the labels for the log data it is necessary to store the line numbers when processing the original logs from the gather directory and see if these line numbers also appear in the corresponding file in the labels directory.
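A small sketch of the line-number matching described above: read the JSON label records, then walk the corresponding gather log and attach labels, treating unlabeled lines as normal. The paths are taken from the russellmitchell example.

```python
import json

labels_path = "labels/intranet_server/logs/audit/audit.log"
log_path = "gather/intranet_server/logs/audit/audit.log"

# Map 1-based line numbers to their attack labels.
line_labels = {}
with open(labels_path) as f:
    for record in map(json.loads, f):
        line_labels[record["line"]] = record["labels"]

# Lines without an entry in the labels file count as "normal".
with open(log_path, errors="replace") as f:
    for lineno, event in enumerate(f, start=1):
        labels = line_labels.get(lineno, ["normal"])
        if labels != ["normal"]:
            print(lineno, labels, event.rstrip()[:80])
```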

    Beside the attack labels, a general overview of the exact times when specific attack steps are launched are available in gather/attacker_0/logs/attacks.log. An enumeration of all hosts and their IP addresses is stated in processing/config/servers.yml. Moreover, configurations of each host are provided in gather//configs/ and gather//facts.json.

    Version history:

    AIT-LDS-v1.x: Four datasets, logs from single host, fine-granular audit logs, mail/CMS.

    AIT-LDS-v2.0: Eight datasets, logs from all hosts, system logs and network traffic, mail/CMS/cloud/web.

    Acknowledgements: Partially funded by the FFG projects INDICAETING (868306) and DECEPT (873980), and the EU projects GUARD (833456) and PANDORA (SI2.835928).

    If you use the dataset, please cite the following publications:

    [1] M. Landauer, F. Skopik, M. Frank, W. Hotwagner, M. Wurzenberger, and A. Rauber. "Maintainable Log Datasets for Evaluation of Intrusion Detection Systems". IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 4, pp. 3466-3482, doi: 10.1109/TDSC.2022.3201582. [PDF]

    [2] M. Landauer, F. Skopik, M. Wurzenberger, W. Hotwagner and A. Rauber, "Have it Your Way: Generating Customized Log Datasets With a Model-Driven Simulation Testbed," in IEEE Transactions on Reliability, vol. 70, no. 1, pp. 402-415, March 2021, doi: 10.1109/TR.2020.3031317. [PDF]

  16. Jingju a Cappella Recordings Collection

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 13, 2022
    Cite
    Yile Yang (2022). Jingju a Cappella Recordings Collection [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3251760
    Explore at:
    Dataset updated
    May 13, 2022
    Dataset provided by
    Xavier Serra
    Yile Yang
    Rong Gong
    Rafael Caro
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Jingju a Cappella Recordings Collection (JaCRC) is part of the Jingju Music Corpus created in the CompMusic project at the Music Technology Group, Universitat Pompeu Fabra, Barcelona (MTG). The JaCRC was created for different research tasks, mostly concerning melodic characteristics of jingju arias and pronunciation in jingju, and parts of the collection have been used in several publications. The JaCRC contains 314 recordings of jingju a cappella singing, plus 76 recordings of the jinghu accompaniment for their corresponding vocal tracks. Except for 53 of them (see CONTENT below), all of the recordings were newly created for this collection. The JaCRC also contains the manual segmentation of 217 vocal recordings and lyrics files for 156, 67 of which include annotations for start and end of each lyrics line in a related music score (see the README file). The dataset is released under a Creative Commons license (see LICENSE below).

The content of the JaCRC was previously published in three different parts (part 1, part 2, part 3). This new release puts all the data together under a unified structure in order to ease its usability.

    CONTENT

The main body of the JaCRC consists of 239 a cappella recordings of jingju arias. Among those, the main contribution of the collection is the set of 186 newly created a cappella recordings by professional or semi-professional actors. Some of the recordings contain incomplete arias because the performer decided to stop of their own accord; the aria is then completed in subsequent recording(s). On a few occasions, the performer decided to record a second version of the same aria. Both versions are included in the collection.

The performers for 76 of these recordings sang over a jinghu accompaniment played live in a different room. These accompaniments were also recorded and added to the JaCRC.

To complement the collection, recordings from existing sources were also integrated into the JaCRC. 15 a cappella recordings were obtained from commercial releases by subtracting the instrumental accompaniment, published in separate tracks to be used as accompaniment by amateur singers, from the mixed track. These recordings are not included in the JaCRC for copyright issues, but can be shared for research purposes only (see CONTACT below). However, the metadata and the segmentation files for these 15 recordings have been included in the JaCRC. Besides, 53 a cappella jingju recordings from the Singing Voice Audio Dataset were included here with permission of their authors (see LICENSE and USE below).

With the goal of developing technologies to aid the learning of jingju singing, 75 recordings were created from amateur performers, both children and adults. These amateur performers, considered as ‘students,’ sang trying to imitate a reference model, considered as the ‘teacher.’ The ‘teacher’ was either present in the session, in which case their performances were also recorded, or an existing recording of the JaCRC was played as the model. The 16 recordings of the teachers are part of the JaCRC, and the anonymized recordings of the students are included in the JaCRC.

    All the artists recorded for the JaCRC manifested their written consent to the MTG for the public release of these recordings under Creative Common license.

    In order to be used for different research tasks, 142 recordings were manually segmented to the phrase and syllable level. Among these, 81 recordings, including those 16 ones used as ‘teacher’ recordings, were further segmented to the phoneme level. All ‘student’ recordings were also segmented to the phrase, syllable and phoneme level. These segmentations are included in the JaCRC as Praat TextGrid files.

    For 156 recordings there are corresponding csv files containing the lyrics performed in the recording, one line per row. Among these, 67 csv files also contain annotations for the boundaries of each lyrics line in a related music score. The boundaries are annotated as offset according to the music21 toolkit. The related music scores can be found in the Jingju Music Scores Collection with the same name as the one annotated in the csv files.

    COVERAGE

As part of the Jingju Music Corpus, the JaCRC was gathered with the purpose of studying the most representative characteristics of jingju vocal music, and therefore the most representative instances of the main elements of jingju vocal music, that is, role type, shengqiang and banshi, are well covered in the collection. Below some statistics about the coverage of these elements in the JaCRC are given. The numbers in brackets correspond to the number of recordings that include (not always exclusively) that element and its percentage with respect to the total 254 recordings in the collection. The numbers include the 15 recordings from commercial releases not available in the collection (see CONTENT above).

    Regarding role types, the JaCRC includes 5 different ones. The two most extensively covered are dan (127, 50.0%), including male dan (27) and huadan (2), and laosheng (108, 42.5%), including female laosheng (8). The other role types included in the JaCRC are jing (17, 6.7%), most of them female jing (16), xiaosheng (1, 0.4%) and chou (1, 0.4%).

    The two main shengqiang in jingju are extensively covered in the JaCRC, namely xipi (153, 60.2%) and erhuang (62, 24.4%). In addition, 7 other shengqiang are also present in the collection, namely sipingdiao (14, 5.5%), nanbangzi (11, 4.3%), fan’erhuang (8, 3.1%), fansipingdiao (2, 0.8%), fanxipi (4, 1.6%), gaobozi (1, 0.4%), and handiao (1, 0.4%).

    As for banshi, there are instances of 18 different ones included in the JaCRC. The 7 most extensively represented banshi are yuanban (76, 29.9%), liushui (63, 24.8%), manban (46, 18.1%), erliu (40, 15.7%), sanban (34, 13.4%), yaoban (34, 13.4%), and daoban (27, 10.6%). Other banshi also included in the collection are kuaiban (17, 6.7%), huilong (8, 3.1%), sanyan (7, 2.8%), kuaisanyan (7, 2.8%), mansanyan (3, 1.2%), zhongsanyan (3, 1.2%), pengban (2, 0.8%), gunban (1, 0.4%), duoban (1, 0.4%), shuban (1, 0.4%), and kuaisanban (1, 0.4%).

    In terms of content, the JaCRC contains recordings of 142 arias from 74 different plays.

    Finally, the recordings in the JaCRC are performed by 23 artists: 8 professional actors, 2 graduated jingju students, 3 undergraduate jingju students in their 4th year, and 10 amateur performers. In terms of role types, there are 10 laosheng performers (one of whom also performs the xiaosheng and jing recordings, and another the chou recording), 8 dan (one of whom also performs the huadan recordings), 3 male dan, 1 female laosheng and 1 female jing.

    ANNOTATIONS

    All the annotation files are named exactly as their corresponding recordings, so that they can be easily matched. In addition, the metadata and information csv files indicate which annotations are available for which recordings.

    There are two types of annotations: segmentation and lyrics.

    The segmentation annotations were done manually and in three phases, corresponding to the subfolders in the “JaCRC-annotations” folder numbered ‘1,’ ‘2’ and ‘3.’ All the segmentations were done using the software Praat and are available in the JaCRC as TextGrid files. The phoneme annotations follow the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA). Below is a description of the annotations contained in each of the subfolders:

    “1-phrase-syllable-phoneme” folder: all the recordings whose annotations are contained in this folder were segmented at least to the phrase (lyrics line), syllable and phoneme levels. Since the annotations were done for different research tasks, the TextGrid files might contain different numbers of tiers, but all of them have a tier named ‘line’ for the phrase level segmentation with lyrics line in Chinese characters as labels, a tier named ‘pinyin’ for the syllable level segmentation with syllables in the pinyin romanization system as labels, and a ‘details’ tier for phoneme segmentation and labels in X-SAMPA. In order to ease access to these annotations, tab-separated values files were generated from the TextGrid files and also included as txt files in this folder. The files that add “_phrase” to the recording’s name contain the phrase level annotations in pinyin. Those that add “_phrase_char” contain the same phrase level annotations, but in Chinese characters. Those that add “_syllable” contain the syllable level annotations in pinyin. And those that add “_phoneme” contain the phoneme level annotations in X-SAMPA.
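    As an illustration, the tab-separated exports can be read with the standard library alone. This sketch assumes each row holds a start time, an end time and a label, which is a common layout for such exports but is not confirmed above; the file name is hypothetical.

```python
import csv

# Hypothetical file name; the (start, end, label) column order is an
# assumption about how the TextGrid tiers were exported, not a documented fact.
syllables = []
with open("recording_name_syllable.txt", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 3:
            continue  # skip empty or malformed lines
        start, end, label = float(row[0]), float(row[1]), row[2]
        syllables.append((start, end, label))

# Print the first few syllable segments.
for start, end, label in syllables[:10]:
    print(f"{start:8.3f} {end:8.3f}  {label}")
```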

    “2-phrase-syllable” folder: same case as in the previous folder, but without phoneme level annotations. In these TextGrid files, the phrase level annotations are still in tiers named ‘line,’ and the syllable level ones are in tiers named ‘dianSilence.’

    “3-students” folder: same case as in “1-phrase-syllable-phoneme” folder. In these TextGrid files, the phrase level annotations are still in tiers named ‘line,’ the syllable level ones are in tiers named ‘dianSilence,’ and the phoneme level ones in tiers named ‘details.’

    The lyrics annotations consist of csv files (with semicolon as separator) containing the lyrics of their corresponding recordings in their original Chinese script. Each row corresponds to a lyrics line. The first three columns contain information for “Role type,” “Shengqiang” and “Banshi” (see the README file). In the fourth one, under the heading “Couplet line,” “s” (from shangju) indicates that the corresponding lyrics line is an opening line, “x” (from xiaju) indicates that it is a closing line, and “k” indicates that it is a kutou line. The fifth column, “Lyrics line,” contains the lyrics. If there is a matching music score in the Jingju Music Scores Collection (JMSC) for the aria performed in the recording, the csv file also annotates the name of that score and the boundaries of each lyrics line in it as music21 offsets (see above).
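    A minimal sketch for reading one of these lyrics csv files is given below. The file name is hypothetical, and the column headings are taken from the description above (“Role type,” “Shengqiang,” “Banshi,” “Couplet line,” “Lyrics line”); any additional columns, such as score boundary offsets, are ignored here.

```python
import csv

# Hypothetical file name; headings follow the description of the lyrics csv
# files (semicolon-separated, one lyrics line per row).
with open("lyrics_annotation.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f, delimiter=";"):
        couplet = {"s": "opening line", "x": "closing line",
                   "k": "kutou line"}.get(row["Couplet line"], "unknown")
        print(f'[{row["Role type"]}/{row["Shengqiang"]}/{row["Banshi"]}] '
              f'{couplet}: {row["Lyrics line"]}')
```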

  17. New driving test trial: numbers of instructors and learner drivers

    • gov.uk
    Updated Dec 1, 2016
    Cite
    Driver and Vehicle Standards Agency (2016). New driving test trial: numbers of instructors and learner drivers [Dataset]. https://www.gov.uk/government/statistical-data-sets/new-driving-test-trial-numbers-of-instructors-and-learner-drivers
    Explore at:
    Dataset updated
    Dec 1, 2016
    Dataset provided by
    GOV.UK
    Authors
    Driver and Vehicle Standards Agency
    Description

    About this data set

    This data set comes from data held by the Driver and Vehicle Standards Agency (DVSA).

    It isn’t classed as an ‘official statistic’. This means it’s not subject to scrutiny and assessment by the UK Statistics Authority.

    The government is trialling driving test changes in 2015 and 2016 to make it a better test of the driver’s ability to drive safely on their own.

    This data shows the numbers of approved driving instructors and learner drivers taking part in the trial, and the number of tests booked.

    Data tables

    Numbers of driving instructors, learner drivers and driving tests booked: https://assets.publishing.service.gov.uk/media/5a80e9de40f0b6230269636f/new-driving-test-trial-statistics.csv

    CSV, 206 Bytes
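    As a quick usage note, the table can be loaded directly from the URL above, for example with pandas (a sketch, not part of the original documentation):

```python
import pandas as pd

# pandas accepts http(s) URLs directly, so the small csv can be read in place.
url = ("https://assets.publishing.service.gov.uk/media/"
       "5a80e9de40f0b6230269636f/new-driving-test-trial-statistics.csv")
df = pd.read_csv(url)
print(df.head())
```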

    Data you cannot find

    Data that you cannot find in this data set may be published elsewhere.

    You can send an FOI request if you still cannot find the information you need.

    Data that cannot be released

    By law, DVSA cannot send you information that’s part of an official statistic that hasn’t yet been published.

  18. Fruits 360 Dataset

    • paperswithcode.com
    • data.mendeley.com
    • +1more
    Updated Nov 22, 2023
    Cite
    (2023). Fruits 360 Dataset [Dataset]. https://paperswithcode.com/dataset/fruits-360-1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Description

    Fruits-360 dataset: A dataset of images containing fruits, vegetables, nuts and seeds

    Version: 2025.03.24.0

    Content

    The following fruits, vegetables, nuts and seeds are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).

    Branches

    The dataset has four major branches (a minimal loading sketch for the 100x100 branch follows the list below):

    -The 100x100 branch, where all images have 100x100 pixels. See fruits-360_100x100 folder.

    -The original-size branch, where all images are at their original (captured) size. See fruits-360_original-size folder.

    -The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See fruits-360_dataset_meta folder.

    -The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See fruits-360_multi folder.
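    As a loading sketch for the 100x100 branch, a generic folder-per-class loader is enough. The 'Training' and 'Test' folder names below are assumptions about the layout inside the fruits-360_100x100 folder, not something stated here.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed layout: fruits-360_100x100/Training/<class name>/*.jpg, plus a
# matching Test folder with the same class subfolders.
transform = transforms.ToTensor()  # images in this branch are already 100x100
train_set = datasets.ImageFolder("fruits-360_100x100/Training", transform=transform)
test_set = datasets.ImageFolder("fruits-360_100x100/Test", transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
print(len(train_set.classes), "classes,", len(train_set), "training images")
```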

    How to cite

    Mihai Oltean, Fruits-360 dataset, 2017-

    Dataset properties

    For the 100x100 branch

    Total number of images: 111589.

    Training set size: 83616 images.

    Test set size: 27973 images.

    Number of classes: 166 (fruits, vegetables, nuts and seeds).

    Image size: 100x100 pixels.

    For the original-size branch

    Total number of images: 29440.

    Training set size: 14731 images.

    Validation set size: 7370 images.

    Test set size: 7339 images.

    Number of classes: 48 (fruits, vegetables, nuts and seeds).

    Image size: various (the original captured size).

    For the meta branch

    Number of classes: 26 (fruits, vegetables, nuts and seeds).

    For the multi branch

    Number of images: 150.

    Filename format

    For the 100x100 branch: image_index_100.jpg (e.g. 31_100.jpg) or

    r_image_index_100.jpg (e.g. r_31_100.jpg) or

    r?_image_index_100.jpg (e.g. r2_31_100.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis. "100" comes from image size (100x100 pixels).

    Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.

    For the original-size branch: r?_image_index.jpg (e.g. r2_31.jpg)

    where "r" stands for rotated fruit. "r2" means that the fruit was rotated around the 3rd axis.

    The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.

    For the multi branch: the file's name is the concatenation of the names of the fruits inside that picture.
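    The 100x100 naming scheme described above can be parsed with a small regular expression; the helper below is only a sketch of that scheme.

```python
import re

# Matches names such as "31_100.jpg", "r_31_100.jpg" and "r2_31_100.jpg";
# group 1 is the optional rotation-axis digit, group 2 the image index.
PATTERN = re.compile(r"^(?:r(\d?)_)?(\d+)_100\.jpg$")

for name in ("31_100.jpg", "r_31_100.jpg", "r2_31_100.jpg"):
    m = PATTERN.match(name)
    if m:
        rotation, index = m.group(1), int(m.group(2))
        print(name, "-> index", index, ", rotation suffix:", rotation or "none")
```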

    Alternate download

    The Fruits-360 dataset can be downloaded from:

    Kaggle https://www.kaggle.com/moltean/fruits

    GitHub https://github.com/fruits-360

    How fruits were filmed

    Fruits and vegetables were mounted on the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.

    A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.

    Behind the fruits, we placed a white sheet of paper as a background.

    Here is a movie showing how the fruits and vegetables are filmed: https://youtu.be/_HFKJ144JuU

    How fruits were extracted from the background

    Due to the variations in the lighting conditions, the background was not uniform, so we wrote a dedicated algorithm that extracts the fruit from the background. This algorithm is of the flood fill type: we start from each edge of the image and mark all pixels there, then we mark all pixels found in the neighborhood of the already marked pixels for which the distance between colors is less than a prescribed value. We repeat the previous step until no more pixels can be marked.

    All marked pixels are considered as being background (which is then filled with white) and the rest of the pixels are considered as belonging to the object.

    The maximum value for the distance between 2 neighbor pixels is a parameter of the algorithm and is set (by trial and error) for each movie.
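    The sketch below illustrates this kind of edge-seeded flood fill. The Euclidean RGB distance and the default threshold are assumptions for illustration, not the exact parameters used for the dataset.

```python
from collections import deque

import numpy as np


def background_mask(img: np.ndarray, threshold: float = 30.0) -> np.ndarray:
    """Return a boolean mask that is True for background pixels.

    img is an H x W x 3 RGB array. The fill is seeded with every border pixel
    and grows to 4-connected neighbours whose colour distance to the current
    pixel is below `threshold`.
    """
    h, w, _ = img.shape
    background = np.zeros((h, w), dtype=bool)
    queue = deque()

    # Seed the fill with all pixels on the image border.
    for x in range(w):
        queue.append((0, x))
        queue.append((h - 1, x))
    for y in range(h):
        queue.append((y, 0))
        queue.append((y, w - 1))
    for y, x in queue:
        background[y, x] = True

    img_f = img.astype(np.float64)
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not background[ny, nx]:
                if np.linalg.norm(img_f[ny, nx] - img_f[y, x]) < threshold:
                    background[ny, nx] = True
                    queue.append((ny, nx))
    return background


# Usage: white out the background of one frame loaded as an RGB numpy array.
# frame[background_mask(frame)] = 255
```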

    Pictures from the test-multiple_fruits folder were taken with a Nexus 5X phone or an iPhone 11.

    History

    Fruits were filmed at the dates given below (YYYY.MM.DD):

    2017.02.25 - Apple (golden).

    2017.02.28 - Apple (red-yellow, red, golden2), Kiwi, Pear, Grapefruit, Lemon, Orange, Strawberry, Banana.

    2017.03.05 - Apple (golden3, Braeburn, Granny Smith, red2).

    2017.03.07 - Apple (red3).

    2017.05.10 - Plum, Peach, Peach flat, Apricot, Nectarine, Pomegranate.

    2017.05.27 - Avocado, Papaya, Grape, Cherrie.

    2017.12.25 - Carambula, Cactus fruit, Granadilla, Kaki, Kumsquats, Passion fruit, Avocado ripe, Quince.

    2017.12.28 - Clementine, Cocos, Mango, Lime, Lychee.

    2017.12.31 - Apple Red Delicious, Pear Monster, Grape White.

    2018.01.14 - Ananas, Grapefruit Pink, Mandarine, Pineapple, Tangelo.

    2018.01.19 - Huckleberry, Raspberry.

    2018.01.26 - Dates, Maracuja, Plum 2, Salak, Tamarillo.

    2018.02.05 - Guava, Grape White 2, Lemon Meyer

    2018.02.07 - Banana Red, Pepino, Pitahaya Red.

    2018.02.08 - Pear Abate, Pear Williams.

    2018.05.22 - Lemon rotated, Pomegranate rotated.

    2018.05.24 - Cherry Rainier, Cherry 2, Strawberry Wedge.

    2018.05.26 - Cantaloupe (2 varieties).

    2018.05.31 - Melon Piel de Sapo.

    2018.06.05 - Pineapple Mini, Physalis, Physalis with Husk, Rambutan.

    2018.06.08 - Mulberry, Redcurrant.

    2018.06.16 - Hazelnut, Walnut, Tomato, Cherry Red.

    2018.06.17 - Cherry Wax (Yellow, Red, Black).

    2018.08.19 - Apple Red Yellow 2, Grape Blue, Grape White 3-4, Peach 2, Plum 3, Tomato Maroon, Tomato 1-4 .

    2018.12.20 - Nut Pecan, Pear Kaiser, Tomato Yellow.

    2018.12.21 - Banana Lady Finger, Chesnut, Mangostan.

    2018.12.22 - Pomelo Sweetie.

    2019.04.21 - Apple Crimson Snow, Apple Pink Lady, Blueberry, Kohlrabi, Mango Red, Pear Red, Pepper (Red, Yellow, Green).

    2019.06.18 - Beetroot Red, Corn, Ginger Root, Nectarine Flat, Nut Forest, Onion Red, Onion Red Peeled, Onion White, Potato Red, Potato Red Washed, Potato Sweet, Potato White.

    2019.07.07 - Cauliflower, Eggplant, Pear Forelle, Pepper Orange, Tomato Heart.

    2019.09.22 - Corn Husk, Cucumber Ripe, Fig, Pear 2, Pear Stone, Tomato not Ripened, Watermelon.

    2021.06.07 - Eggplant long 1.

    2021.08.09 - Apple hit 1, Cucumber 1.

    2021.09.03 - Pear 3.

    2021.09.22 - Apple 6, Cucumber 3.

    2023.12.30 - Official Github repository is now https://github.com/fruits-360

    License

    CC BY-SA 4.0

    Copyright (c) 2017-, Mihai Oltean

    You are free to:

    Share — copy and redistribute the material in any medium or format for any purpose, even commercially.

    Adapt — remix, transform, and build upon the material for any purpose, even commercially.

    Under the following terms:

    Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

    ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

  19. Dataset of artists who created Headpiece (after André Mare, page 27) from...

    • workwithdata.com
    Updated May 8, 2025
    Cite
    Work With Data (2025). Dataset of artists who created Headpiece (after André Mare, page 27) from ARCHITECTURES [Dataset]. https://www.workwithdata.com/datasets/artists?f=1&fcol0=j0-artwork&fop0=%3D&fval0=Headpiece+(after+Andr%C3%A9+Mare%2C+page+27)+from+ARCHITECTURES&j=1&j0=artworks
    Explore at:
    Dataset updated
    May 8, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about artists. It has 1 row and is filtered where the artwork is Headpiece (after André Mare, page 27) from ARCHITECTURES. It features 9 columns including birth date, death date, country, and gender.

  20. ‘Major Basin Lines’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Oct 7, 2009
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2009). ‘Major Basin Lines’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/data-gov-major-basin-lines-233f/425cc37a/?iid=002-572&v=presentation
    Explore at:
    Dataset updated
    Oct 7, 2009
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Major Basin Lines’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://catalog.data.gov/dataset/019385ae-c846-4a0e-98f7-2de8dfcbb993 on 27 January 2022.

    --- Dataset description provided by original source is as follows ---

    Major Drainage Basin Set:

    Connecticut Major Drainage Basins is 1:24,000-scale, polygon and line feature data that define Major drainage basin areas in Connecticut. These large basins mostly range from 70 to 2,000 square miles in size. Connecticut Major Drainage Basins includes drainage areas for all Connecticut rivers, streams, brooks, lakes, reservoirs and ponds published on 1:24,000-scale 7.5 minute topographic quadrangle maps prepared by the USGS between 1969 and 1984. Data is compiled at 1:24,000 scale (1 inch = 2,000 feet). This information is not updated. Polygon and line features represent drainage basin areas and boundaries, respectively. Each basin area (polygon) feature is outlined by one or more major basin boundary (line) feature. These data include 10 major basin area (polygon) features and 284 major basin boundary (line) features. Major Basin area (polygon) attributes include major basin number and feature size in acres and square miles. The major basin number (MBAS_NO) uniquely identifies individual basins and is 1 character in length. There are 8 unique major basin numbers. Examples include 1, 4, and 6. Note there are more major basin polygon features (10) than unique major basin numbers (8) because two polygon features are necessary to represent both the entire South East Coast and Hudson Major basins in Connecticut. Major basin boundary (line) attributes include a drainage divide type attribute (DIVIDE) used to cartographically represent the hierarchical drainage basin system. This divide type attribute is used to assign different line symbology to different levels of drainage divides. For example, major basin drainage divides are more pronounced and shown with a wider line symbol than regional basin drainage divides. Connecticut Major Drainage Basin polygon and line feature data are derived from the geometry and attributes of the Connecticut Drainage Basins data.
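    The MBAS_NO and DIVIDE attributes described above lend themselves to simple filtering and symbolization. The sketch below assumes the data has been downloaded as shapefiles with those field names and uses geopandas, which is not mentioned by the source; file names are hypothetical.

```python
import geopandas as gpd

# Hypothetical file names; MBAS_NO and DIVIDE are the field names described above.
basins = gpd.read_file("major_drainage_basins_polygons.shp")
boundaries = gpd.read_file("major_drainage_basins_lines.shp")

# Select one major basin by its 1-character number and draw divide lines by type.
basin_4 = basins[basins["MBAS_NO"] == "4"]
ax = basin_4.plot(color="lightblue", edgecolor="none")
boundaries.plot(ax=ax, column="DIVIDE", linewidth=0.8, legend=True)
```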

    --- Original source retains full ownership of the source dataset ---
