10,109 People - Face Images Dataset: this dataset includes people from many countries. Multiple photos of each person's daily life were collected, and each subject's gender, race, age, and other attributes are labelled. The dataset provides a rich resource for artificial intelligence applications. It has been validated by multiple AI companies and has proven beneficial for achieving outstanding performance in real-world applications. Throughout the process of data collection, storage, and usage, we have consistently adhered to data protection and privacy regulations to preserve user privacy and legal rights. All data comply with regulations such as GDPR, CCPA, PIPL, and other applicable laws.
https://www.factori.ai/privacy-policy
Our proprietary People Data is a mobile user dataset that connects anonymous IDs to a wide range of attributes, including demographics, device ownership, audience segments, key locations, and more. This rich dataset allows our partner brands to gain a comprehensive view of consumers based on their personas, enabling them to derive actionable insights swiftly.
Reach: Our extensive data reach covers a variety of categories, encompassing user demographics, Mobile Advertising IDs (MAID), device details, locations, affluence, interests, traveled countries, and more.
Data Export Methodology: We dynamically collect and provide the most updated data and insights through the best-suited method at appropriate intervals, whether daily, weekly, monthly, or quarterly.
Our People Data caters to various business needs, offering valuable insights for consumer analysis, data enrichment, sales forecasting, and retail analytics, empowering brands to make informed decisions and optimize their strategies.
This is the Extended Golf Play Dataset, a rich and detailed collection designed to expand upon the classic golf dataset [1]. It incorporates a wide array of features suitable for various data science applications and is especially valuable for teaching purposes [1]. The dataset is organised in a long format, where each row represents a single observation and often includes textual data, such as player reviews or comments [2]. It contains a special set of mini datasets, each tailored to a specific teaching point, for example, demonstrating data cleaning or combining datasets [1]. These are ideal for beginners to practise with real examples and are complemented by notebooks with step-by-step guides [1].
The dataset features a variety of columns, including core, extra, and text-based attributes:
* ID: A unique identifying number for each player [1].
* Date: The specific day the data was recorded or the golf session took place [1, 2].
* Weekday: The day of the week, represented numerically (e.g., 0 for Sunday, 1 for Monday) [1, 3].
* Holiday: Indicates whether the day was a special holiday, specifically noted for holidays in Japan (1 for yes, 0 for no) [1, 3].
* Month: The month in which golf was played [3].
* Season: The time of year, such as spring, summer, autumn, or winter [1, 3].
* Outlook: The weather conditions during the session (e.g., sunny, cloudy, rainy, snowy) [1, 3].
* Temperature: The ambient temperature during the golf session, recorded in Celsius [1, 3].
* Humidity: The percentage of moisture in the air [1, 3].
* Windy: A boolean indicator of whether it was windy (True/False or 1 for yes, 0 for no) [1, 3].
* Crowded-ness: A measure of how busy the golf course was, ranging from 0 to 1 [1, 4].
* PlayTime-Hour: The duration for which people played golf, in hours [1].
* Play: Indicates whether golf was played or not (Yes/No) [1].
* Review: Textual feedback from players about their day at golf [1].
* EmailCampaign: Text content of emails sent daily by the golf venue [1].
* MaintenanceTasks: Descriptions of work carried out to maintain the golf course [1].
This dataset is organised in a long format, meaning each row represents a single observation [2]. Data files are typically in CSV format, with sample files updated separately to the platform [5]. Specific numbers for rows or records are not currently available within the provided sources. The dataset also includes a special collection of mini datasets within its structure [1].
This dataset is highly versatile and ideal for learning and applying various data science skills (a short loading sketch follows this list):
* Data Visualisation: Learn to create graphs and identify patterns within the data [1].
* Predictive Modelling: Discover which data points are useful for predicting whether golf will be played [1].
* Data Cleaning: Practise spotting and managing data that appears incorrect or inconsistent [1].
* Time Series Analysis: Understand how various factors change over time, such as daily or monthly trends [1, 2].
* Data Grouping: Learn to combine similar days or observations together [1].
* Text Analysis: Extract insights from textual features like player reviews, potentially for sentiment analysis or thematic extraction [1, 2].
* Recommendation Systems: Develop models to suggest optimal times to play golf based on historical data [1].
* Data Management: Gain experience in managing and analysing data structured in a long format, which is common for repeated measures [2].
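As a concrete starting point, the hedged sketch below loads the data with pandas and aggregates the long-format observations by date. The file name golf_play_extended.csv is a hypothetical placeholder; the Date, Temperature, and Play column names come from the column list above.

```python
# A minimal, hedged sketch for exploring the long-format CSV.
# The file name is a placeholder; column names follow the listing.
import pandas as pd

df = pd.read_csv("golf_play_extended.csv", parse_dates=["Date"])

# Long format: each row is one observation, so a typical first step
# is aggregating per date, e.g. mean temperature and play rate.
daily = df.groupby("Date").agg(
    temperature=("Temperature", "mean"),
    play_rate=("Play", lambda s: (s == "Yes").mean()),
)
print(daily.head())
```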
The dataset's regional coverage is global [6]. While the Date column records the day the data was captured or the session occurred, no specific time range for the collected data is stated beyond the listing date of 11/06/2025 [1, 6]. The demographic scope includes unique player IDs [1], but no specific demographic details or data-availability notes for particular groups or years are provided.
CC-BY
This dataset is designed for a broad audience:
* New Learners: Easy to understand, with guides to aid the learning process [1].
* Teachers: An excellent resource for conducting classes on data visualisation and interpretation [1].
* Researchers: Suitable for testing novel data analysis methodologies [1].
* Students: Can acquire a wide range of skills, from making graphs to understanding textual data and building recommendation systems [1].
Original Data Source: ⛳️ Golf Play Dataset Extended
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Time Series Forecasting with Yahoo Stock Price’ provided by Analyst-2 (analyst-2.ai), based on the source dataset retrieved from https://www.kaggle.com/arashnic/time-series-forecasting-with-yahoo-stock-price on 28 January 2022.
--- Dataset description provided by original source is as follows ---
Stocks and financial instrument trading is a lucrative proposition. Stock markets across the world facilitate such trades, and thus wealth exchanges hands. Stock prices move up and down all the time, and the ability to predict their movement has immense potential to make one rich. Stock price prediction has kept people interested for a long time. There are hypotheses, such as the Efficient Market Hypothesis, which says it is almost impossible to beat the market consistently, and there are others which disagree with it.
There are a number of known approaches, and new research is ongoing, to find the magic formula to make you rich. One of the traditional methods is time series forecasting. Fundamental analysis is another method, in which numerous performance ratios are analyzed to assess a given stock. On the emerging front, there are neural networks, genetic algorithms, and ensembling techniques.
Another challenging problem in stock price prediction is the black swan event: events that occur from time to time, are unpredictable, and often come with little or no warning, causing stock market turbulence.
A black swan event is an event that is completely unexpected and cannot be predicted. Unexpected events are generally referred to as black swans when they have significant consequences, though an event with few consequences might also be a black swan event. It may or may not be possible to provide explanations for the occurrence after the fact – but not before. In complex systems, like economies, markets and weather systems, there are often several causes. After such an event, many of the explanations for its occurrence will be overly simplistic.
[Image: black swan events infographic — https://www.visualcapitalist.com/wp-content/uploads/2020/03/mm3_black_swan_events_shareable.jpg]
New bleeding-edge, state-of-the-art deep learning models for stock prediction are overcoming such obstacles, e.g. "Transformer and Time Embeddings". One objective is to apply these novel models to forecast the stock price.
Stock price prediction is the task of forecasting the future value of a given stock. Given the historical daily close price for the S&P 500 Index, prepare and compare forecasting solutions. The S&P 500, or Standard and Poor's 500, is an index comprising 500 stocks from different sectors of the US economy and is an indicator of US equities. Other such indices are the Dow 30, NIFTY 50, Nikkei 225, etc. For the purpose of understanding, we utilize the S&P 500 index; the concepts and knowledge can be applied to other stocks as well.
The historical stock price information is also publicly available. For our current use case, we will utilize the pandas_datareader library to get the required S&P 500 index history using Yahoo Finance databases. We utilize the closing price information from the dataset, though other information, such as the opening price and adjusted closing price, is also available. We prepare a utility function get_raw_data() to extract the required information into a pandas dataframe. The function takes the index ticker name as input; for the S&P 500 index, the ticker name is ^GSPC. The following snippet uses the utility function to get the required data. (See Simple LSTM Regression)
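The snippet's body is not reproduced in this listing, so the following is a minimal sketch of what get_raw_data() could look like under the assumptions stated above (pandas_datareader with its Yahoo Finance backend and the ^GSPC ticker). The date range is illustrative, and the Yahoo backend's availability varies across pandas_datareader versions.

```python
# A hedged sketch of the get_raw_data() utility described above.
# Assumes pandas_datareader's Yahoo Finance backend; dates are
# illustrative and not taken from the source.
import pandas as pd
import pandas_datareader.data as web

def get_raw_data(ticker: str = "^GSPC") -> pd.DataFrame:
    """Fetch historical daily prices for `ticker` as a DataFrame."""
    df = web.DataReader(ticker, data_source="yahoo",
                        start="2015-01-01", end="2021-12-31")
    # Columns include Open, High, Low, Close, Adj Close, Volume.
    return df

sp500 = get_raw_data("^GSPC")
close = sp500["Close"]  # the series used for forecasting
```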
Features and Terminology: In stock trading, the high and low refer to the maximum and minimum prices in a given time period. Open and close are the prices at which a stock began and ended trading in the same period. Volume is the total amount of trading activity. Adjusted values factor in corporate actions such as dividends, stock splits, and new share issuance.
Mining and updating of this dataset will depend upon Yahoo Finance.
Variations of sequence modeling, including bleeding-edge approaches such as attention, can be applied for research and forecasting.
--- Original source retains full ownership of the source dataset ---
https://academictorrents.com/nolicensespecified
The MPII Human Pose dataset is a state-of-the-art benchmark for the evaluation of articulated human pose estimation. The dataset includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall, the dataset covers 410 human activities, and each image is provided with an activity label. Each image was extracted from a YouTube video and is provided with preceding and following un-annotated frames. In addition, for the test set we obtained richer annotations, including body part occlusions and 3D torso and head orientations. Following best practices for performance evaluation benchmarks in the literature, we withhold the test annotations to prevent overfitting and tuning on the test set. We are working on an automatic evaluation server and performance analysis tools based on the rich test set annotations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This table shows statistics about inherited wealth of deceased people.
Inheritances are compiled for all deceased people who were registered in the Dutch population register on January 1st.
Because of the revision of the income statistics, there are differences between 2010 and 2011. From 2007 until 2010, background characteristics of persons represent the situation on the 31st of December of that year. From 2011 onwards, characteristics represent the situation on the 1st of January of the given year.
Data available from: 2007.
Status of the figures: The figures for 2007 to 2021 are final. The figures for 2022 are preliminary.
Changes as of March 2025: Figures for 2021 are finalized. Preliminary figures for 2022 are added.
When will new figures be published? New figures will be published in the first quarter of 2026.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This training dataset comprises more than 10,000 conversational text chats between two native Spanish speakers in the general domain. We have a collection of chats on a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., which makes the dataset diverse.
These chats consist of language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Apart from each chat being specific to its topic, the chats contain various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The licence for this training dataset belongs to FutureBeeAI.
https://www.futurebeeai.com/policies/ai-data-license-agreement
This training dataset comprises more than 10,000 conversational text chats between two native Bahasa speakers in the general domain. We have a collection of chats on a variety of topics, services, and issues of daily life, such as music, books, festivals, health, kids, family, environment, study, childhood, cuisine, internet, movies, etc., which makes the dataset diverse.
These chats consist of language-specific words and phrases and follow the native way of talking, which makes them more information-rich for your NLP model. Apart from each chat being specific to its topic, the chats contain various attributes, such as people's names, addresses, contact information, email addresses, times, dates, local currency, telephone numbers, and local slang, in various formats to make the text data unbiased.
These chat scripts have between 300 and 700 words and up to 50 turns. 150 people from the FutureBeeAI crowd community contributed to this dataset. You will also receive chat metadata, such as participant age, gender, and country, along with the chats. Dataset applications include conversational AI, natural language processing (NLP), smart assistants, text recognition, text analytics, and text prediction.
This dataset is being expanded with new chats all the time. We are able to produce text data in a variety of languages to meet your unique requirements. Check out the FutureBeeAI community for a custom collection.
The licence for this training dataset belongs to FutureBeeAI.
South African policymakers are endeavouring to ensure that the poor have better access to financial services. However, a lack of understanding of the financial needs of poor households impedes a broad strategy to attend to this need. The Financial Diaries study addresses this knowledge gap by examining financial management in rural and urban households. The study is a year-long household survey based on fortnightly interviews in Diepsloot (Gauteng), Langa (Western Cape) and Lugangeni (Eastern Cape). In total, 160 households were involved in this pioneering study which promises to offer important insights into how poor people manage their money as well as the context in which poor people make financial decisions. The study paints a rich picture of the texture of financial markets in townships, highlighting the prevalence of informal financial products, the role of survivalist business and the contribution made by social grants. The Financial Diaries dataset includes highly detailed, daily cash flow data on income, expenditure and financial flows on both a household and individual basis.
Langa in Cape Town, Diepsloot in Johannesburg and Lugangeni, a rural village in the Eastern Cape.
Households and individuals
The survey covered households in the three geographic areas.
Sample survey data
To create the sampling frame for the Financial Diaries, the researchers echoed the method used in Rutherford (2002) and Ruthven (2002): a participatory wealth ranking (PWR). Within South Africa, the participatory wealth ranking method is used by the Small Enterprise Foundation (SEF), a prominent NGO microlender based in the rural Limpopo Province. Simanowitz (1999) compared the PWR method to the Visual Indicator of Poverty (VIP) and found that the VIP test was at best 70% consistent with the PWR tests. At times, one third of the households defined as the poorest by the VIP test were actually among the richest according to the PWR. The PWR method was also implicitly assessed in van der Ruit, May and Roberts (2001) by comparing it to the Principal Components Analysis (PCA) used by CGAP as a means to assess client poverty. They found that three quarters of those defined as poor by the PCA were also defined as poor by the PWR. We closely followed the SEF manual to conduct our wealth rankings, and consulted with SEF on adapting the method to urban areas.
The first step is to consult with community leaders and ask how they would divide their community. Within each type of area, representative neighbourhoods of about 100 households each were randomly chosen. Townships in South Africa are organised by street, with each street or zone having its own street committee. The street committees are meant to know everyone on their street and to serve as stewards of all activity within the street. Each street committee in each area was invited to a central meeting and asked to map their area and give a roster of household names. Following the mapping, each area was visited, and the maps and rosters were checked by going door to door with the street committee.
Two reference groups were then selected from the street committee and senior members of the community, with between four and eight people in each reference group. Each reference group was first asked to indicate how they define a poor household versus one that is well off. This discussion had a dual purpose. First, it relayed information about what each community believes is rich or poor. Second, it started the reference group thinking about which households belong under which heading.
Following this discussion, each reference group then ranked each household in the neighbourhood according to their perceived wealth. The SEF methodology of wealth ranking is de-normalised in that reference groups are invited to put households into as many different wealth piles as they feel is appropriate. Only households that were known by both reference groups were kept in the sample.
The SEF guidelines were used to assign a score to each household in a particular pile. Scores were created by dividing 100 by the number of piles and multiplying by the pile's level, counted from the wealthy end. This means that if the poorest pile was number 1, then every household in that pile was assigned a score of 100, representing 100% poverty. If the wealthiest pile was pile number 6, then every household in that pile received a score of 16.7, and every household in pile 5 received a score of 33.3. An average of the two reference groups' scores was taken for the distribution.
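A minimal sketch of this scoring rule, assuming piles are numbered 1 (poorest) to n (wealthiest) as in the worked example above:

```python
# A hedged sketch of the SEF pile-scoring rule described above.
# Assumption: pile 1 is the poorest and pile n_piles the wealthiest,
# so the score counts a household's pile position from the wealthy end.
def pile_score(pile: int, n_piles: int) -> float:
    """Poverty score in (0, 100] for a household in `pile`."""
    return (100.0 / n_piles) * (n_piles - pile + 1)

# Reproduces the worked six-pile example from the text:
assert round(pile_score(1, 6), 1) == 100.0  # poorest pile
assert round(pile_score(5, 6), 1) == 33.3
assert round(pile_score(6, 6), 1) == 16.7   # wealthiest pile
```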
One way of assessing how good the results are is to analyse how consistent the rankings were between the two reference groups. According to the SEF methodology, a result is consistent if the scores from the two reference groups differ by no more than 25 points. A result is inconsistent if the difference is between 26 and 50 points, while a result is unreliable if the difference is above 50 points. SEF uses both consistent and inconsistent rankings, as long as the average across two reference groups is used; this would mean that 91% of the sample could be used. However, because only two reference groups were used, only consistent households were considered for the final sample selection.
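The thresholds translate directly into a small classifier; this is a sketch of the rule as stated, not SEF's own code:

```python
# A minimal sketch of the SEF consistency rule described above.
def ranking_consistency(score_a: float, score_b: float) -> str:
    """Classify agreement between the two reference groups' scores."""
    diff = abs(score_a - score_b)
    if diff <= 25:
        return "consistent"    # kept for the final sample
    elif diff <= 50:
        return "inconsistent"
    return "unreliable"
```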
To test this further, the number of times that the reference groups put a household in the exact same category was counted. The extent of agreement at either end of the wealth spectrum between the two reference groups was also assessed. This result would be unbiased by how many categories the reference groups put households into.
Following the example used in India and Bangladesh, the sample was divided into three wealth categories depending on the household's overall score. Distinguishing three categories of wealth allowed a ranking of wealth similar to that used in Bangladesh and India, while keeping the sample from being over-stratified. A sample of 60 households was then drawn randomly from each area. Drawing the sample in proportion to each wealth ranking's representation within the population would likely have left too few wealthier households in some rankings to draw conclusions; therefore the researchers drew equally from each ranking.
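A hedged sketch of this equal-allocation draw follows; the two score cutoffs are illustrative assumptions, since the source does not state where the three category boundaries fall (recall that a higher PWR score means a poorer household):

```python
# A hedged sketch of the equal-allocation sample draw described above.
# The 33.3/66.7 cutoffs are assumptions; the source does not give them.
import random

def draw_sample(scores: dict, per_area: int = 60, seed: int = 0) -> list:
    """Draw `per_area` households, one third from each wealth band."""
    rng = random.Random(seed)
    poor = [h for h, s in scores.items() if s > 66.7]       # highest poverty scores
    middle = [h for h, s in scores.items() if 33.3 < s <= 66.7]
    rich = [h for h, s in scores.items() if s <= 33.3]      # lowest poverty scores
    k = per_area // 3
    # Assumes each band contains at least k households.
    return rng.sample(poor, k) + rng.sample(middle, k) + rng.sample(rich, k)
```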
Face-to-face [f2f]
https://www.kappasignal.com/p/legal-disclaimer.html
This analysis presents a rigorous exploration of financial data, incorporating a diverse range of statistical features. By providing a robust foundation, it facilitates advanced research and innovative modeling techniques within the field of finance.
Historical daily stock prices (open, high, low, close, volume)
Fundamental data (e.g., market capitalization, price to earnings P/E ratio, dividend yield, earnings per share EPS, price to earnings growth, debt-to-equity ratio, price-to-book ratio, current ratio, free cash flow, projected earnings growth, return on equity, dividend payout ratio, price to sales ratio, credit rating)
Technical indicators (e.g., moving averages, RSI, MACD, average directional index, Aroon oscillator, stochastic oscillator, on-balance volume, accumulation/distribution (A/D) line, parabolic SAR, Bollinger Bands, Fibonacci retracements, Williams %R, commodity channel index); see the sketch after this list
Feature engineering based on financial data and technical indicators
Sentiment analysis data from social media and news articles
Macroeconomic data (e.g., GDP, unemployment rate, interest rates, consumer spending, building permits, consumer confidence, inflation, producer price index, money supply, home sales, retail sales, bond yields)
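To make the indicator list above concrete, here is a minimal pandas sketch computing two of the listed indicators, a simple moving average and RSI, from a daily close series. The "Close" column name and the window lengths are illustrative assumptions, not a specification of this dataset:

```python
# Hedged sketches of two listed indicators: SMA and RSI.
# Column name "Close" and the window lengths are assumptions.
import pandas as pd

def sma(close: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average over `window` trading days."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index from average gains and losses."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gain / loss)
```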
Stock price prediction
Portfolio optimization
Algorithmic trading
Market sentiment analysis
Risk management
Researchers investigating the effectiveness of machine learning in stock market prediction
Analysts developing quantitative trading Buy/Sell strategies
Individuals interested in building their own stock market prediction models
Students learning about machine learning and financial applications
The dataset may include different levels of granularity (e.g., daily, hourly)
Data cleaning and preprocessing are essential before model training
Regular updates are recommended to maintain the accuracy and relevance of the data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Possible channels of influence of income inequality.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Finnish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Finnish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Finnish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Finnish speech models that understand and respond to authentic Finnish accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Finnish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Finnish speech and language AI applications.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Spanish speech and language AI applications.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Norwegian General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Norwegian speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Norwegian communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Norwegian speech models that understand and respond to authentic Norwegian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Norwegian. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Norwegian speech and language AI applications.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Colombian Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Colombian Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Colombian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Colombian Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Spanish speech and language AI applications.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Canadian English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Canadian English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic Canadian accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Canadian English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple English speech and language AI applications.