88 datasets found
  1. Survey of Consumer Finances

    • federalreserve.gov
    Updated Oct 18, 2023
    Cite
    Board of Governors of the Federal Reserve System (2023). Survey of Consumer Finances [Dataset]. http://doi.org/10.17016/8799
    Explore at:
    Dataset updated
    Oct 18, 2023
    Dataset provided by
    Federal Reserve Board of Governors
    Federal Reserve System (http://www.federalreserve.gov/)
    Authors
    Board of Governors of the Federal Reserve System
    Time period covered
    1962 - 2023
    Description

    The Survey of Consumer Finances (SCF) is normally a triennial cross-sectional survey of U.S. families. The survey data include information on families' balance sheets, pensions, income, and demographic characteristics.

  2. NYC Open Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    NYC Open Data (2019). NYC Open Data [Dataset]. https://www.kaggle.com/datasets/nycopendata/new-york
    Explore at:
    zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    NYC Open Data
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/

    Content

    Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:

    • Over 8 million 311 service requests from 2012-2016

    • More than 1 million motor vehicle collisions 2012-present

    • Citi Bike stations and 30 million Citi Bike trips 2013-present

    • Over 1 billion Yellow and Green Taxi rides from 2009-present

    • Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015

    This dataset is deprecated and not being updated.

    Fork this kernel to get started with this dataset.
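
    Since these tables live in BigQuery Public Datasets, a query sketch may also help. This is a minimal example rather than the kernel above; the table path is an assumption based on the linked blog post and may have changed, so verify the current dataset name in the BigQuery console first.

    ```python
    # Sketch: top 311 complaint types from the NYC public dataset on BigQuery.
    # The table path is assumed from the 2017 blog post and may have moved since.
    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()  # requires Google Cloud credentials and a project
    sql = """
        SELECT complaint_type, COUNT(*) AS n
        FROM `bigquery-public-data.new_york.311_service_requests`
        GROUP BY complaint_type
        ORDER BY n DESC
        LIMIT 10
    """
    for row in client.query(sql).result():
        print(row.complaint_type, row.n)
    ```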

    Acknowledgements

    https://opendata.cityofnewyork.us/

    https://cloud.google.com/blog/big-data/2017/01/new-york-city-public-datasets-now-available-on-google-bigquery

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

    The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

    Banner Photo by @bicadmedia from Unsplash.

    Inspiration

    On which New York City streets are you most likely to find a loud party?

    Can you find the Virginia Pines in New York City?

    Where was the only collision caused by an animal that injured a cyclist?

    What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?

    Image: https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png

  3. Data from: Robotic manipulation datasets for offline compositional...

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jun 6, 2024
    Cite
    Marcel Hussing; Jorge Mendez; Anisha Singrodia; Cassandra Kent; Eric Eaton (2024). Robotic manipulation datasets for offline compositional reinforcement learning [Dataset]. http://doi.org/10.5061/dryad.9cnp5hqps
    Explore at:
    zip
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    University of Pennsylvania
    Massachusetts Institute of Technology
    Authors
    Marcel Hussing; Jorge Mendez; Anisha Singrodia; Cassandra Kent; Eric Eaton
    License

    https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html

    Description

    Offline reinforcement learning (RL) is a promising direction that allows RL agents to be pre-trained on large datasets, avoiding the recurring cost of expensive data collection. To advance the field, it is crucial to generate large-scale datasets. Compositional RL is particularly appealing for generating such datasets, since 1) it permits creating many tasks from few components, and 2) the task structure may enable trained agents to solve new tasks by combining relevant learned components. This submission provides four offline RL datasets for simulated robotic manipulation, created using the 256 tasks from CompoSuite (Mendez et al., 2022). In every CompoSuite task, a robot arm is used to manipulate an object to achieve an objective, all while trying to avoid an obstacle. There are four components for each of these four axes, which can be combined arbitrarily, leading to a total of 256 tasks. The component choices are:

    • Robot: IIWA, Jaco, Kinova3, Panda

    • Object: Hollow box, box, dumbbell, plate

    • Objective: Push, pick and place, put in shelf, put in trashcan

    • Obstacle: None, wall between robot and object, wall between goal and object, door between goal and object

    The four included datasets were collected using separate agents, each trained to a different degree of performance, and each dataset consists of 256 million transitions. The degrees of performance are expert, medium, warmstart, and replay:

    • Expert dataset: transitions from an expert agent trained to achieve 90% success on every task.

    • Medium dataset: transitions from a medium agent trained to achieve 30% success on every task.

    • Warmstart dataset: transitions from a Soft Actor-Critic (SAC) agent trained for a fixed duration of one million steps.

    • Medium-replay-subsampled dataset: transitions stored during the training of a medium agent, up to 30% success.

    These datasets are intended for the combined study of compositional generalization and offline reinforcement learning.

    Methods: The datasets were collected using several deep reinforcement learning agents, trained to the degrees of performance described above, on the CompoSuite benchmark (https://github.com/Lifelong-ML/CompoSuite), which builds on top of robosuite (https://github.com/ARISE-Initiative/robosuite) and uses the MuJoCo simulator (https://github.com/deepmind/mujoco). During training, the data collected by each agent was stored in a separate buffer for post-processing. Then, to collect the expert and medium datasets, the trained agents were run online in the CompoSuite benchmark for 2,000 trajectories of length 500, and the trajectories were stored. These add up to 1 million state-transition tuples per task, for a full 256 million datapoints per dataset. The warmstart and medium-replay-subsampled datasets contain trajectories from the stored training buffers of the fixed-duration SAC agent and of the medium agent, respectively. For the medium-replay-subsampled data, trajectories are uniformly sampled from the training buffer until more than 1 million transitions are reached. Since some tasks have termination conditions, some trajectories are truncated and not of length 500, which sometimes yields more than 1 million sampled transitions; after sub-sampling, the last trajectory is therefore artificially truncated and a timeout placed at its final position. This can in rare cases lead to one incorrect trajectory if the datasets are used for finite-horizon experimentation, but the truncation is required to ensure consistent dataset sizes, easy data readability, and compatibility with standard code implementations.

    Each of the four datasets is split into four tar.gz archives, one per robot arm, yielding a total of 16 compressed archives. Every archive contains the 64 tasks that use one robot arm for that dataset, so four tar.gz files form a full dataset. This allows users to download only part of a dataset if they do not need all 256 tasks. For every task, the data is stored in a separate hdf5 file, allowing arbitrary task combinations and the mixing of data qualities across the four datasets. Every task is contained in a folder named after the CompoSuite elements it uses.
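
    For a first look at one task file, a minimal sketch with h5py may help; the file path below is hypothetical (the exact per-task folder naming is not fully shown here), so list the real key layout before indexing anything.

    ```python
    # Minimal sketch: inspect one CompoSuite task file from the expert dataset.
    # The path below is hypothetical; check the actual folder/file names first.
    import h5py

    path = "expert/IIWA_Box_Push_None/data.hdf5"  # assumed naming, for illustration
    with h5py.File(path, "r") as f:
        f.visit(print)  # print every group/dataset name in the file
        # Offline-RL files typically expose arrays such as observations, actions,
        # and rewards; index only keys confirmed by the listing above, e.g.:
        # obs = f["observations"][:]
    ```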

  4. Data Use in Academia Dataset

    • datacatalog.worldbank.org
    csv, utf-8
    Updated Nov 27, 2023
    Cite
    Semantic Scholar Open Research Corpus (S2ORC) (2023). Data Use in Academia Dataset [Dataset]. https://datacatalog.worldbank.org/search/dataset/0065200/data_use_in_academia_dataset
    Explore at:
    utf-8, csv
    Dataset updated
    Nov 27, 2023
    Dataset provided by
    Semantic Scholar Open Research Corpus (S2ORC)
    Brian William Stacy
    License

    https://datacatalog.worldbank.org/public-licenses?fragment=cc

    Description

    This dataset contains metadata (title, abstract, date of publication, field, etc.) for around 1 million academic articles. Each record contains additional information on the country of study and whether the article makes use of data. Machine learning tools were used to classify the country of study and data use.


    Our data source of academic articles is the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al. 2020). The corpus contains more than 130 million English language academic papers across multiple disciplines. The papers included in the Semantic Scholar corpus are gathered directly from publishers, from open archives such as arXiv or PubMed, and crawled from the internet.


    We placed some restrictions on the articles to make them usable and relevant for our purposes. First, only articles with an abstract and a parsed PDF or LaTeX file are included in the analysis. The full text of the abstract is necessary to classify the country of study and whether the article uses data. The parsed PDF and LaTeX file are important for extracting key information like the date of publication and field of study. This restriction eliminated a large number of articles in the original corpus. Around 30 million articles remain after keeping only articles with a parsable (i.e., suitable for digital processing) PDF, and around 26% of those 30 million are eliminated when removing articles without an abstract. Second, only articles from the year 2000 to 2020 were considered. This restriction eliminated an additional 9% of the remaining articles. Finally, articles from the following fields of study were excluded, as we aim to focus on fields that are likely to use data produced by countries’ national statistical systems: Biology, Chemistry, Engineering, Physics, Materials Science, Environmental Science, Geology, History, Philosophy, Math, Computer Science, and Art. Fields that are included are: Economics, Political Science, Business, Sociology, Medicine, and Psychology. This third restriction eliminated around 34% of the remaining articles. From an initial corpus of 136 million articles, this resulted in a final corpus of around 10 million articles.


    Due to the intensive computer resources required, a set of 1,037,748 articles were randomly selected from the 10 million articles in our restricted corpus as a convenience sample.


    The empirical approach employed in this project utilizes text mining with Natural Language Processing (NLP). The goal of NLP is to extract structured information from raw, unstructured text. In this project, NLP is used to extract the country of study and whether the paper makes use of data. We will discuss each of these in turn.


    To determine the country or countries of study in each academic article, two approaches are employed based on information found in the title, abstract, or topic fields. The first approach uses regular expression searches based on the presence of ISO3166 country names. A defined set of country names is compiled, and the presence of these names is checked in the relevant fields. This approach is transparent, widely used in social science research, and easily extended to other languages. However, there is a potential for exclusion errors if a country’s name is spelled non-standardly.


    The second approach is based on Named Entity Recognition (NER), which uses machine learning to identify objects from text, utilizing the spaCy Python library. The Named Entity Recognition algorithm splits text into named entities, and NER is used in this project to identify countries of study in the academic articles. SpaCy supports multiple languages and has been trained on multiple spellings of countries, overcoming some of the limitations of the regular expression approach. If a country is identified by either the regular expression search or NER, it is linked to the article. Note that one article can be linked to more than one country.
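
    As a rough sketch of these two passes (not the authors' code), combining a name-list regex with spaCy's geopolitical entities might look like the following. The three-country list is a stand-in for the full ISO 3166 name set, and note that spaCy's GPE label also matches cities and states, so its output would still need filtering against a country list.

    ```python
    # Illustrative two-pass country detection: regex on country names plus spaCy NER.
    import re
    import spacy  # pip install spacy && python -m spacy download en_core_web_sm

    COUNTRIES = ["Kenya", "Brazil", "Viet Nam"]  # stand-in for the full ISO 3166 list
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b", re.IGNORECASE)

    nlp = spacy.load("en_core_web_sm")

    def countries_of_study(text: str) -> set[str]:
        found = {m.group(1).title() for m in pattern.finditer(text)}  # pass 1: name-list regex
        # pass 2: NER; GPE entities include cities/states, so filter downstream
        found |= {ent.text for ent in nlp(text).ents if ent.label_ == "GPE"}
        return found

    print(countries_of_study("We analyze household survey data from Kenya and Brazil."))
    ```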


    The second task is to classify whether the paper uses data. A supervised machine learning approach is employed, where 3,500 publications were first randomly selected and manually labeled by human raters using the Mechanical Turk service (Paszke et al. 2019).[1] To make sure the human raters had a similar and appropriate definition of data in mind, they were given the following instructions before seeing their first paper:


    Each of these documents is an academic article. The goal of this study is to measure whether a specific academic article is using data and from which country the data came.

    There are two classification tasks in this exercise:

    1. Identifying whether an academic article is using data from any country

    2. Identifying from which country that data came.

    For task 1, we are looking specifically at the use of data. Data is any information that has been collected, observed, generated, or created to produce research findings. As an example, a study that reports findings or analysis using survey data uses data. Clues that a study does use data include whether a survey or census is described, a statistical model is estimated, or a table of means or summary statistics is reported.

    After an article is classified as using data, please note the type of data used. The options are population or business census, survey data, administrative data, geospatial data, private sector data, and other data. If no data is used, then mark "Not applicable". In cases where multiple data types are used, please click multiple options.[2]

    For task 2, we are looking at the country or countries that are studied in the article. In some cases, no country may be applicable, for instance if the research is theoretical and has no specific country application. In other cases, the research article may involve multiple countries; in these cases, select all countries that are discussed in the paper.

    We expect between 10 and 35 percent of all articles to use data.


    The median amount of time a worker spent on an article, measured as the time between when the article was accepted for classification by the worker and when the classification was submitted, was 25.4 minutes. If human raters were used exclusively rather than machine learning tools, the corpus of 1,037,748 articles examined in this study would take around 50 years of human work time to review, at a cost of $3,113,244 (assuming a cost of $3 per article, as was paid to MTurk workers).


    A model is next trained on the 3,500 labelled articles. We use a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) model to encode raw text into a numeric format suitable for predictions (Devlin et al. 2018). BERT is pre-trained on a large corpus comprising the Toronto Book Corpus and Wikipedia. The distilled version (DistilBERT) is a compressed model that is 60% the size of BERT, retains 97% of its language-understanding capability, and is 60% faster (Sanh, Debut, Chaumond, and Wolf 2019). We use PyTorch to produce a model that classifies articles based on the labeled data. Of the 3,500 articles hand-coded by the MTurk workers, 900 were fed to the machine learning model; 900 articles were selected because of computational limitations in training the NLP model. A classification of “uses data” was assigned if the model predicted an article used data with at least 90% confidence.
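
    A sketch of that classification step follows, using the public DistilBERT base checkpoint as a stand-in for the authors' fine-tuned weights (which this listing does not link); with the untuned head the probabilities are meaningless, so this only illustrates the mechanics of the 90% rule.

    ```python
    # Sketch: encode an abstract with DistilBERT and apply the 90% confidence rule.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "distilbert-base-uncased"  # placeholder for the paper's fine-tuned model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    def uses_data(abstract: str, threshold: float = 0.9) -> bool:
        inputs = tokenizer(abstract, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)
        return probs[0, 1].item() >= threshold  # assume label 1 means "uses data"
    ```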


    The performance of the models classifying articles to countries and as using data or not can be compared to the classification by the human raters. We consider the human raters as giving us the ground truth. This may underestimate model performance if the workers at times got the allocation wrong in a way that would not apply to the model. For instance, a human rater could mistake the Republic of Korea for the Democratic People’s Republic of Korea. If both humans and the model make the same kinds of errors, then the performance reported here will be overestimated.


    The model was able to predict whether an article made use of data with 87% accuracy evaluated on the set of articles held out of the model training. The correlation between the number of articles written about each country using data estimated under the two approaches is given in the figure below. The number of articles represents an aggregate total of

  5. Replication Data for: Education Policies and Systems across Modern History:...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated May 22, 2024
    Cite
    Adrián del Río; Carl Henrik Knutsen; Philipp Lutscher (2024). Replication Data for: Education Policies and Systems across Modern History: A Global Dataset [Dataset]. http://doi.org/10.7910/DVN/MNM5Q5
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 22, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Adrián del Río; Carl Henrik Knutsen; Philipp Lutscher
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    We introduce a global dataset on education policies and systems across modern history (EPSM), which includes measures on compulsory education, ideological guidance and content of education, governmental intervention and level of education centralization, and teacher training. EPSM covers 157 countries with populations exceeding 1 million people, and the time series extends from 1789 to the present. EPSM opens up the study of several questions concerning political control and the politicized nature of education systems. In addition to describing the measures, we detail how the data were collected and discuss validity and reliability issues. Thereafter, we describe historical trends in various characteristics of education systems. Finally, we illustrate how our data can be used to address key questions about education and politics, replicating and extending recent analyses on the (reciprocal) relationship between education and democratization, the impact of education on political attitudes, and how rural inequality interacts with regime type in influencing education systems.

  6. YouTube RPM by Niche (2025)

    • learningrevolution.net
    html
    Cite
    Jawad Khan, YouTube RPM by Niche (2025) [Dataset]. https://www.learningrevolution.net/how-much-money-does-youtube-pay-for-1-million-views/
    Explore at:
    html
    Dataset provided by
    Learning Revolution
    Authors
    Jawad Khan
    Area covered
    YouTube
    Variables measured
    Gaming, Travel, Finance, Education, Technology, Memes/Vlogs
    Description

    This dataset provides estimated YouTube RPM (Revenue Per Mille) ranges for different niches in 2025, based on ad revenue earned per 1,000 monetized views.
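
    RPM converts directly into an earnings estimate: earnings = monetized views / 1,000 × RPM. A toy calculator follows; the RPM values are illustrative placeholders, not the dataset's actual figures.

    ```python
    # Back-of-the-envelope earnings from RPM (revenue per 1,000 monetized views).
    RPM_BY_NICHE = {"Finance": 12.0, "Education": 5.0, "Gaming": 2.5}  # assumed USD values

    def estimated_earnings(monetized_views: int, niche: str) -> float:
        return monetized_views / 1000 * RPM_BY_NICHE[niche]

    print(f"${estimated_earnings(1_000_000, 'Finance'):,.2f}")  # $12,000.00 for 1M views
    ```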

  7. S&P 500 stock data

    • kaggle.com
    zip
    Updated Feb 10, 2018
    Cite
    Cam Nugent (2018). S&P 500 stock data [Dataset]. https://www.kaggle.com/camnugent/sandp500
    Explore at:
    zip (20283917 bytes)
    Dataset updated
    Feb 10, 2018
    Authors
    Cam Nugent
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Stock market data can be interesting to analyze and, as a further incentive, strong predictive models can have a large financial payoff. The amount of financial data on the web is seemingly endless, but a large and well-structured dataset on a wide array of companies can be hard to come by. Here I provide a dataset with historical stock prices (last 5 years) for all companies currently found on the S&P 500 index.

    The script I used to acquire all of these .csv files can be found in this GitHub repository. In the future, if you wish for a more up-to-date dataset, it can be used to acquire new versions of the .csv files.

    Feb 2018 note: I have just updated the dataset to include data up to Feb 2018. I have also accounted for changes in the stocks on the S&P 500 index (RIP whole foods etc. etc.).

    Content

    The data is presented in a couple of formats to suit different individuals' needs or computational limitations. I have included files containing 5 years of stock data (in the all_stocks_5yr.csv and corresponding folder).

    The folder individual_stocks_5yr contains files of data for individual stocks, labelled by their stock ticker name. The all_stocks_5yr.csv contains the same data, presented in a merged .csv file. Depending on the intended use (graphing, modelling etc.) the user may prefer one of these given formats.

    All the files have the following columns: Date - in format: yy-mm-dd

    Open - price of the stock at market open (this is NYSE data so all in USD)

    High - Highest price reached in the day

    Low - Lowest price reached in the day

    Close - Price of the stock at market close

    Volume - Number of shares traded

    Name - the stock's ticker name

    Acknowledgements

    Due to volatility in Google Finance, for the newest version I have switched over to acquiring the data from The Investor's Exchange API; the simple script I use to do this is found here. Special thanks to Kaggle, GitHub, pandas_datareader, and The Market.

    Inspiration

    This dataset lends itself to some very interesting visualizations. One can look at simple things like how prices change over time, graph and compare multiple stocks at once, or generate and graph new metrics from the data provided. From these data, informative stock stats such as volatility and moving averages can be easily calculated. The million dollar question is: can you develop a model that can beat the market and allow you to make statistically informed trades?
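
    As a quick-start sketch for those rolling stats, assuming the column names listed above (the actual file may use lowercase headers, so adjust accordingly):

    ```python
    # Sketch: 20-day moving average and return volatility per ticker from the CSV.
    import pandas as pd

    df = pd.read_csv("all_stocks_5yr.csv", parse_dates=["Date"])
    df = df.sort_values(["Name", "Date"])

    g = df.groupby("Name")["Close"]
    df["ma_20"] = g.transform(lambda s: s.rolling(20).mean())  # 20-day moving average
    df["vol_20"] = g.transform(lambda s: s.pct_change().rolling(20).std())  # 20-day volatility

    print(df[df["Name"] == "AAPL"][["Date", "Close", "ma_20", "vol_20"]].tail())
    ```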

  8. ICLUS v2.1 land use projections for the Fourth National Climate Assessment...

    • catalog.data.gov
    • datasets.ai
    Updated Jun 25, 2025
    Cite
    U.S. Environmental Protection Agency, Office of Research and Development-National Center for Environmental Assessment (Publisher) (2025). ICLUS v2.1 land use projections for the Fourth National Climate Assessment (SSP2) [Dataset]. https://catalog.data.gov/dataset/iclus-v2-1-land-use-projections-for-the-fourth-national-climate-assessment-ssp213
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    U.S. Environmental Protection Agency, Office of Research and Development-National Center for Environmental Assessment (Publisher)
    Description

    SSP2 is a “middle-of-the-road” projection, where social, economic and technological trends do not shift markedly from historical patterns, resulting in a U.S. population of 455 million people by 2100. Domestic migration trends remain consistent with the recent past. This version of the ICLUS model does not include climate change projections to dynamically update location-specific amenities when calculating migration. These projections will include the “nocc” label in the file name to indicate this difference.

  9. Global News Dataset

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    Kumar Saksham (2023). Global News Dataset [Dataset]. https://www.kaggle.com/datasets/everydaycodings/global-news-dataset/code
    Explore at:
    zip (419253490 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    Kumar Saksham
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    News Dataset

    Context

    This dataset comprises news articles collected over the past few months using the NewsAPI. The primary motivation behind curating this dataset was to develop and experiment with various natural language processing (NLP) models. The dataset aims to support the creation of text summarization models, sentiment analysis models, and other NLP applications.

    Sources

    The data is sourced from the NewsAPI, a comprehensive and up-to-date news aggregation service. The API provides access to a wide range of news articles from various reputable sources, making it a valuable resource for constructing a diverse and informative dataset.

    Data Fetching Script

    The data for this dataset was collected using a custom Python script; you can find the script used for data retrieval in dailyWorker.py. This script leverages the NewsAPI to gather news articles over a specified period.

    Feel free to explore and modify the script to suit your data collection needs. If you have any questions or suggestions for improvement, please don't hesitate to reach out.

    Labeling Details

    The file ratings.csv in this dataset has been labeled using the NLP model cardiffnlp/twitter-roberta-base-sentiment for sentiment classification.
    This labeling was applied to facilitate sentiment-based research and analysis tasks.
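
    The labeling model is public on the Hugging Face Hub, so re-running it on new articles is straightforward; a minimal sketch with the transformers pipeline:

    ```python
    # Sketch: apply the same public sentiment model used to label ratings.csv.
    from transformers import pipeline

    sentiment = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
    print(sentiment("Markets rallied today on strong earnings reports."))
    # e.g. [{'label': 'LABEL_2', 'score': ...}]; LABEL_0=negative, LABEL_1=neutral, LABEL_2=positive
    ```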

    Inspiration

    The inspiration behind collecting this dataset stems from the growing interest in NLP applications and the need for high-quality, real-world data to train and evaluate these models effectively. By leveraging the NewsAPI, we aim to contribute to the development of robust text summarization and sentiment analysis models that can better understand and process news content.

    Dataset Features

    • Text of news articles
    • Publication date and time
    • Source information
    • Sentiment labels (from ratings.csv)
    • Any additional metadata available through the NewsAPI

    Potential Use Cases

    1. Text Summarization: Develop models to generate concise and informative summaries of news articles.
    2. Sentiment Analysis: Analyze the sentiment expressed in news articles to understand public opinion.
    3. Topic Modeling: Explore trends and topics within the news data.

    Note:
    Please refer to the NewsAPI documentation for terms of use and ensure compliance with their policies when using this dataset.

  10. Dataset for the Article "A Predictive Method to Improve the Effectiveness of...

    • data.niaid.nih.gov
    Updated May 24, 2021
    Cite
    Marco Furini; Federica Mandreoli; Riccardo Martoglia; Manuela Montangero (2021). Dataset for the Article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4782983
    Explore at:
    Dataset updated
    May 24, 2021
    Dataset provided by
    University of Modena and Reggio Emilia, Italy
    Authors
    Marco Furini; Federica Mandreoli; Riccardo Martoglia; Manuela Montangero
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the dataset for the article "A Predictive Method to Improve the Effectiveness of Twitter Communication in a Cultural Heritage Scenario".

    Abstract:

    Museums are embracing social technologies in an attempt to broaden their audience and engage people. Although social communication seems an easy task, media managers know how hard it is to reach millions of people with a simple message. Indeed, millions of posts compete every day for visibility in terms of likes and shares, and very little research has focused on museum communication to identify best practices. In this paper, we focus on Twitter and propose a novel method that exploits interpretable machine learning techniques to: (a) predict whether a tweet will likely be appreciated by Twitter users or not; (b) present simple suggestions that help enhance the message and increase the probability of its success. Using a real-world dataset of around 40,000 tweets written by 23 world-famous museums, we show that our proposed method allows identifying tweet features that are more likely to influence tweet success.

    Code to run a selection of experiments is available at https://github.com/rmartoglia/predict-twitter-ch

    Dataset structure

    This contains the data used in the experiments of the above research paper. Only the extracted features for the museum tweet threads (not the full message text) are provided; these are all that is needed for the analyses.

    We selected 23 well-known art museums from around the world and grouped them into five groups: G1 (museums with at least three million followers); G2 (museums with more than one million followers); G3 (museums with more than 400,000 followers); G4 (museums with more than 200,000 followers); G5 (Italian museums). From these museums, we analyzed ca. 40,000 tweets, with a number varying from ca. 5k to ca. 11k per museum group, depending on the number of museums in each group.

    Content features: these are the features that can be drawn from the content of the tweet itself. We further divide such features into the following two categories:

    – Countable: these features take values in different ranges. We take into consideration: the number of hashtags (i.e., words preceded by #) in the tweet, the number of URLs (i.e., links to external resources), the number of images (e.g., photos and graphical emoticons), the number of mentions (i.e., Twitter accounts preceded by @), and the length of the tweet;

    – On-Off: these features have binary values in {0, 1}. We observe whether the tweet has exclamation marks, question marks, person names, place names, organization names, or other names. Moreover, we also take into consideration the tweet topic density: assuming that the involved topics correspond to the hashtags mentioned in the text, we define a tweet as dense in topics if the number of hashtags it contains is greater than a given threshold, set to 5. Finally, we observe the tweet sentiment, which may be present (positive or negative) or not (neutral).

    Context features: these features are not drawn from the content of the tweet itself and may give a larger picture of the context in which the tweet was sent. Namely, we take into consideration the part of the day in which the tweet was sent (morning, afternoon, evening, and night: respectively from 5:00am to 11:59am, from 12:00pm to 5:59pm, from 6:00pm to 10:59pm, and from 11:00pm to 4:59am), and a boolean feature indicating whether the tweet is a retweet.

    User features: these features are properties of the user that sent the tweet, and are the same for all tweets by that user. Namely, we consider the name of the museum and the user's number of followers.
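
    A sketch of how the countable content features could be re-extracted from raw tweet text (image counts and the named-entity On-Off features would need extra tooling and are omitted here):

    ```python
    # Illustrative extraction of the "countable" content features described above.
    import re

    def countable_features(tweet: str) -> dict:
        return {
            "n_hashtags": len(re.findall(r"#\w+", tweet)),
            "n_mentions": len(re.findall(r"@\w+", tweet)),
            "n_urls": len(re.findall(r"https?://\S+", tweet)),
            "length": len(tweet),
        }

    f = countable_features("New exhibit! #art #history details at https://example.org @museum")
    print(f)                    # {'n_hashtags': 2, 'n_mentions': 1, 'n_urls': 1, 'length': 65}
    print(f["n_hashtags"] > 5)  # topic-density flag, using the paper's threshold of 5
    ```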

  11. kvqa

    • huggingface.co
    Updated Nov 2, 2023
    Cite
    Korea Electronics Technology Institute Artificial Intelligence Research Center (2023). kvqa [Dataset]. https://huggingface.co/datasets/KETI-AIR/kvqa
    Explore at:
    Dataset updated
    Nov 2, 2023
    Dataset authored and provided by
    Korea Electronics Technology Institute Artificial Intelligence Research Center
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Visual question answering

    VQA understands a provided image and, when a person asks a question about it, provides a natural-language answer after analyzing (or reasoning about) the image.

    KVQA dataset

    As part of T-Brain’s projects on social value, the KVQA dataset, a Korean version of the VQA dataset, was created. The KVQA dataset consists of photos taken by Korean visually impaired people, questions about the photos, and 10 answers from 10 distinct annotators for each question. Currently, it consists of 30,000 sets of images and questions, and 300,000 answers, but by the end of this year we will increase the dataset size to 100,000 sets of images and questions, and 1 million answers. This dataset can be used only for educational and research purposes. Please refer to the attached license for more details. We hope that the KVQA dataset can simultaneously provide opportunities for the development of Korean VQA technology as well as the creation of meaningful social value in Korean society.

    You can download the KVQA dataset via this link.

    Evaluation

    We measure the model's accuracy by using answers collected from 10 different people for each question. If the answer provided by a VQA model matches 3 or more of the 10 annotators' answers, it scores 100%; with fewer than 3 matches, it receives a proportional partial score. To be consistent with ‘human accuracies’, measured accuracies are averaged over all 10-choose-9 sets of human annotators. Please refer to VQA Evaluation, which we follow.
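
    A sketch of that metric, following the standard VQA evaluation it cites: an answer scores min(matches/3, 1) within each leave-one-out subset of 9 annotators, averaged over all 10 subsets.

    ```python
    # Sketch of the VQA-style accuracy described above.
    from itertools import combinations

    def vqa_accuracy(model_answer: str, human_answers: list[str]) -> float:
        scores = []
        for subset in combinations(human_answers, 9):  # all "10 choose 9" annotator subsets
            matches = sum(a == model_answer for a in subset)
            scores.append(min(matches / 3, 1.0))
        return sum(scores) / len(scores)

    print(vqa_accuracy("피아노", ["피아노"] * 2 + ["게임"] * 8))  # 0.6: partial credit
    ```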

    Usage

    from datasets import load_dataset
    
    # Load KVQA through its loading script; paths are relative to the working directory.
    raw_datasets = load_dataset(
            "kvqa.py",                         # dataset loading script
            "default",                         # configuration name
            cache_dir="huggingface_datasets",  # cache directory for processed data
            data_dir="data",                   # directory holding the downloaded KVQA files
            ignore_verifications=True,         # skip checksum/split verification
          )
    
    dataset_train = raw_datasets["train"]
    
    # Print the first training example, then stop iterating.
    for item in dataset_train:
      print(item)
      break
    

    Data statistics

    v1.0 (Jan. 2020)

                   Overall (%)      Yes/no (%)     Number (%)     Etc (%)          Unanswerable (%)
    # images       100,445 (100)    6,124 (6.10)   9,332 (9.29)   69,069 (68.76)   15,920 (15.85)
    # questions    100,445 (100)    6,124 (6.10)   9,332 (9.29)   69,069 (68.76)   15,920 (15.85)
    # answers      1,004,450 (100)  61,240 (6.10)  93,320 (9.29)  690,690 (68.76)  159,200 (15.85)

    Data

    Data field description

    Name                    Type    Description
    VQA                     [dict]  list of dicts holding VQA data
    +- image                str     filename of the image
    +- source               str     data source, `["kvqa", "vizwiz"]` (both appear in the examples below)
    +- answers              [dict]  list of dicts holding the 10 answers
    +--- answer             str     answer as a string
    +--- answer_confidence  str     `["yes", "maybe"]`
    +- question             str     question about the image
    +- answerable           int     whether the question is answerable, `[0, 1]`
    +- answer_type          str     answer type, e.g. `["number", …]`

    Data example

    [{
        "image": "KVQA_190712_00143.jpg",
        "source": "kvqa",
        "answers": [{
          "answer": "피아노",
          "answer_confidence": "yes"
        }, {
          "answer": "피아노",
          "answer_confidence": "yes"
        }, {
          "answer": "피아노 치고있다",
          "answer_confidence": "maybe"
        }, {
          "answer": "unanswerable",
          "answer_confidence": "maybe"
        }, {
          "answer": "게임",
          "answer_confidence": "maybe"
        }, {
          "answer": "피아노 앞에서 무언가를 보고 있음",
          "answer_confidence": "maybe"
        }, {
          "answer": "피아노치고있어",
          "answer_confidence": "maybe"
        }, {
          "answer": "피아노치고있어요",
          "answer_confidence": "maybe"
        }, {
          "answer": "피아노 연주",
          "answer_confidence": "maybe"
        }, {
          "answer": "피아노 치기",
          "answer_confidence": "yes"
        }],
        "question": "방에 있는 사람은 지금 뭘하고 있지?",
        "answerable": 1,
        "answer_type": "other"
      },
      {
        "image": "VizWiz_train_000000008148.jpg",
        "source": "vizwiz",
        "answers": [{
          "answer": "리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "티비 리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "리모컨",
          "answer_confidence": "maybe"
        }, {
          "answer": "리모컨",
          "answer_confidence": "yes"
        }, {
          "answer": "리모컨",
          "answer_confidence": "yes"
        }],
        "question": "이것은 무엇인가요?",
        "answerable": 1,
        "answer_type": "other"
      }
    ]
    
  12. Instagram accounts with the most followers worldwide 2024

    • statista.com
    • de.statista.com
    Cite
    Stacy Jo Dixon, Instagram accounts with the most followers worldwide 2024 [Dataset]. https://www.statista.com/topics/1164/social-networks/
    Explore at:
    Dataset provided by
    Statista (http://statista.com/)
    Authors
    Stacy Jo Dixon
    Description

    Cristiano Ronaldo has one of the most popular Instagram accounts as of April 2024.

    The Portuguese footballer is the most-followed person on the photo-sharing platform, with 628 million followers; Instagram's own account ranked first overall, with roughly 672 million followers.

    How popular is Instagram?

    Instagram is a photo-sharing social networking service that enables users to take pictures and edit them with filters. The platform allows users to post and share their images online and directly with their friends and followers on the social network. The cross-platform app reached one billion monthly active users in mid-2018. In 2020, there were over 114 million Instagram users in the United States, and experts projected this figure to surpass 127 million users in 2023.

    Who uses Instagram?

    Instagram audiences are predominantly young: recent data states that almost 60 percent of U.S. Instagram users are aged 34 years or younger. Fall 2020 data reveals that Instagram is also one of the most popular social media platforms for teens and one of the social networks with the biggest reach among teens in the United States.

    Celebrity influencers on Instagram

    Many celebrities and athletes are brand spokespeople and generate additional income with social media advertising and sponsored content. Unsurprisingly, Ronaldo ranked first again, as the average media value of one of his Instagram posts was 985,441 U.S. dollars.
    
  13. TREAM: Time series of freshwater macroinvertebrate abundances and site...

    • knb.ecoinformatics.org
    • search.dataone.org
    Updated May 24, 2024
    Cite
    Ellen Welti; Peter Haase (2024). TREAM: Time series of freshwater macroinvertebrate abundances and site characteristics of European streams and rivers [Dataset]. http://doi.org/10.5063/F1NG4P4R
    Explore at:
    Dataset updated
    May 24, 2024
    Dataset provided by
    Knowledge Network for Biocomplexity
    Authors
    Ellen Welti; Peter Haase
    Time period covered
    Jan 1, 1968 - Jan 1, 2020
    Area covered
    Variables measured
    E10, day, Date, FDis, FDiv, FEve, FRed, FRic, F_to, RaoQ, and 71 more
    Description

    Freshwater macroinvertebrates are a diverse group that play many key ecological roles, including accelerating nutrient cycling, filtering water, controlling aquatic primary producers, and providing food for predators. Since they rapidly respond to environmental changes, macroinvertebrate community composition is a commonly used indicator of water quality. In Europe, efforts to improve water quality following environmental legislation, primarily starting in the 1980s, may have driven a recovery of macroinvertebrate communities. Towards understanding temporal changes in these organisms, we compiled the TREAM database (Time seRies of European freshwAter Macroinvertebrates). The TREAM database consists of whole macroinvertebrate community time series from 1,816 river and stream sites (mean length of 19.2 years with 14.9 sampling years) in 22 European countries, sampled between 1968 and 2020. In total, the data include >93 million sampled individuals of 2,648 taxa from 959 genera and 212 families. These data can be used to ask a wide range of questions, from identifying drivers of the population dynamics of specific taxa to assessing the success of legislative and management restoration efforts.

    Usage Notes: Several key characteristics of these data should be noted by future users. First, and most importantly, sampling methods, effort, and seasonality are standardised within individual time series but can vary across time series; raw data are therefore not directly comparable across the 41 independent projects included in the TREAM dataset. Second, while we use the taxonomic backbone of freshwaterecology.info, a common tool among European freshwater ecologists, we are aware that it does not capture all recent changes to taxonomic names. Third, two pairs of time series overlap in sampling locations: 1) Site ID = 100000001 (SVD) and 100000309 (Bugey_SVD) refer to the same location; 2) Site ID = 100000002 (SVG) and 100000308 (Bugey_SVG) refer to the same location. The data from these sites were collected by the Institut national de la recherche agronomique (referent: Maxence Forcellini) between 1980 and 2014, and by Électricité de France (referent: Anthony Maire) between 2000 and 2019. Fourth, although standardised taxonomic resolution within time series was a criterion for data inclusion, some datasets switch taxonomic resolution (e.g. from genus to species level) for a given taxon part-way through the time series. This is particularly the case for data from Denmark within the Baetidae, Brachycentridae, Chironomidae, Gammaridae, Oligochaeta and Simuliidae. We did not alter these names because they represent the original information provided for and published in Haase et al. (2023), and standardisation methods may vary depending on intended future use. In the time series provided, these issues could affect analyses of shifts in community composition, which could reflect a shift in identification level rather than compositional change. They do not affect analyses of total abundance and have little influence on taxa richness or diversity, including their temporal trends, as they are typically substitutions of one unique taxon for another. As with all large datasets of ecological time series, data users should carefully check the data in light of their intended use and, when questions arise, contact the data providers (listed in TREAM_siteLevel.csv).

    Finally, since each record was assigned a Hydrography90m stream network subcatchment, users can directly interact with the hydrographr R package, which facilitates subsequent network and distance analyses using the data records.

  14. synthetic-dataset-1m-dalle3-high-quality-captions

    • huggingface.co
    Updated May 3, 2024
    Cite
    Ben (2024). synthetic-dataset-1m-dalle3-high-quality-captions [Dataset]. https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 3, 2024
    Authors
    Ben
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Dalle3 1 Million+ High Quality Captions

    Alt name: Human Preference Synthetic Dataset

    Example grids for landscapes, cats, creatures, and fantasy are also available.

      Description:
    

    This dataset comprises AI-generated images sourced from various websites and individuals, primarily focusing on Dalle 3 content, along with contributions from other AI systems of sufficient quality, such as Stable Diffusion and Midjourney (MJ v5 and above). As users typically… See the full description on the dataset page: https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions.

  15. Stocks Data- Individual stock 5 years

    • kaggle.com
    zip
    Updated Sep 7, 2022
    Cite
    singole (2022). Stocks Data- Individual stock 5 years [Dataset]. https://www.kaggle.com/datasets/singole/stocks-data-individual-stock-5-years
    Explore at:
    zip (10270219 bytes)
    Dataset updated
    Sep 7, 2022
    Authors
    singole
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    About Dataset

    Context

    Stock market data can be interesting to analyze and, as a further incentive, strong predictive models can have a large financial payoff. The amount of financial data on the web is seemingly endless, but a large and well-structured dataset on a wide array of companies can be hard to come by. Here I provide a dataset with historical stock prices (last 5 years) for all companies currently found on the S&P 500 index.

    The script I used to acquire all of these .csv files can be found in this GitHub repository. In the future, if you wish for a more up-to-date dataset, it can be used to acquire new versions of the .csv files.

    Feb 2018 note: I have just updated the dataset to include data up to Feb 2018. I have also accounted for changes in the stocks on the S&P 500 index (RIP whole foods etc. etc.).

    Content

    The data is presented in a couple of formats to suit different individuals' needs or computational limitations. I have included files containing 5 years of stock data (in the all_stocks_5yr.csv and corresponding folder).

    The folder individual_stocks_5yr contains files of data for individual stocks, labelled by their stock ticker name. The all_stocks_5yr.csv contains the same data, presented in a merged .csv file. Depending on the intended use (graphing, modelling etc.) the user may prefer one of these given formats.

    All the files have the following columns: Date - in format: yy-mm-dd

    Open - price of the stock at market open (this is NYSE data so all in USD)

    High - Highest price reached in the day

    Low - Lowest price reached in the day

    Close - Price of the stock at market close

    Volume - Number of shares traded

    Name - the stock's ticker name

    Acknowledgements

    Due to volatility in Google Finance, for the newest version I have switched over to acquiring the data from The Investor's Exchange API; the simple script I use to do this is found here. Special thanks to Kaggle, GitHub, pandas_datareader, and The Market.

    Inspiration

    This dataset lends itself to some very interesting visualizations. One can look at simple things like how prices change over time, graph and compare multiple stocks at once, or generate and graph new metrics from the data provided. From these data, informative stock stats such as volatility and moving averages can be easily calculated. The million dollar question is: can you develop a model that can beat the market and allow you to make statistically informed trades?

  16. 50Million Rows Turkish Market Sales Dataset(MSSQL)

    • kaggle.com
    Updated Aug 31, 2023
    Cite
    Omer Colakoglu (2023). 50Million Rows Turkish Market Sales Dataset(MSSQL) [Dataset]. https://www.kaggle.com/datasets/omercolakoglu/50million-rows-turkish-market-sales-datasetmssql
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Omer Colakoglu
    Description

    50 Million Rows MSSQL Backup File with Clustered Columnstore Index.

    This dataset contains:

    • 27K categorized Turkish supermarket items

    • 81 stores (every city of Turkey has a store)

    • 100K customers with real Turkish names and addresses

    • 10M rows of randomly generated sales data

    • Prices near real levels, with an inflation factor applied over time

    All of the data was generated randomly, so the usernames use real Turkish names and surnames but do not correspond to real people. The sales data was generated randomly, but follows some rules: for example, every order can contain 1-9 kinds of items, and every order line can have 1-9 pieces. The randomization works according to city population, so the number of orders for Istanbul (the biggest city in Turkey) is about 20% of all data, while orders for Gaziantep (whose population is about 2.5% of Turkey's) are about 2.5% of all data.
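
    The population-proportional randomization described above can be reproduced with weighted sampling; a sketch with illustrative population figures (not the generator actually used):

    ```python
    # Sketch: assign orders to cities in proportion to population.
    import random

    city_population = {  # illustrative figures, not the dataset's exact values
        "Istanbul": 15_900_000,
        "Ankara": 5_800_000,
        "Gaziantep": 2_100_000,
    }

    def draw_order_cities(n_orders: int) -> list[str]:
        cities = list(city_population)
        weights = list(city_population.values())
        return random.choices(cities, weights=weights, k=n_orders)

    sample = draw_order_cities(100_000)
    print(sample.count("Istanbul") / len(sample))  # ≈ Istanbul's share of the listed populations
    ```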

  17. Stock Portfolio - financial Risk Analytics

    • kaggle.com
    zip
    Updated Feb 15, 2022
    Cite
    Ankur (2022). Stock Portfolio - financial Risk Analytics [Dataset]. https://www.kaggle.com/ankurnapa/stock-portfolio-financial-risk-analytics
    Explore at:
    zip (1325783 bytes)
    Dataset updated
    Feb 15, 2022
    Authors
    Ankur
    Description

    INTRODUCTION

    This case study is the Capstone Project of the upGrad PG Diploma in Data Science. The 6 steps of Data Analysis are used to present this analysis.

    Title: Financial & Risk Analytics

    Author: Ankur Napa

    Date: 14 Feb 2022

    Portfolio Manager: How do you identify the right investment opportunity and recommend a portfolio that matches the client's exact needs?

    STEP 1: ASK

    1.1 Background

    We have 2 investors here:

    1. Patrick Jyengar - He is a conservative investor who wants to invest 1 million dollars and expects to double his capital, with low risk, in the coming 5 years. He also wants to invest 500K dollars in a magazine (Naturo) and later wants to buy a minority stake in it.

    2. Peter Jyengar - He is an aggressive investor who wants to invest 1 million dollars in high-margin stocks and expects returns within 5 years.

    1.2 Business Task:

    Analysing the portfolio of stocks to provide consultation on investment management based on the client’s requirements.

    1.3 Business Objectives:

    1. What are the trends identified?
    2. How could these trends apply to customers?
    3. How could these trends help influence investment strategy?

    1.4 Deliverables:

    1. A clean version of the final dataset.
    2. A well commented Jupyter notebook containing the entire work.
    3. A file containing a dashboard with all the important visualisations used in this project.
    4. A PPT file with an executive summary containing your understanding of the investor, insights and recommended steps of action for the investors.
    5. A video explaining the presentation: As the portfolio manager, you are expected to share a video presentation that you will share with the investors.

    1.5 Key Stakeholders:

    1. Patrick Jyengar - A successful entrepreneur - Jyengar Waterworks
    2. Peter Jyengar - Inheritor of Patrick Jyengar

  18. 1-million Instances - Vehicle Orientation Dataset

    • kaggle.com
    zip
    Updated Nov 10, 2023
    Cite
    Chuka J. Uzo (2023). 1-million Instances - Vehicle Orientation Dataset [Dataset]. https://www.kaggle.com/datasets/chukajuzo/high-resolution-imu-car-trajectory-dataset
    Explore at:
    zip (66612061 bytes)
    Dataset updated
    Nov 10, 2023
    Authors
    Chuka J. Uzo
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    This dataset contains 1 million logs of a vehicle's directional orientation, derived from sensor data obtained from an inertial measurement unit (IMU) mounted on a moving car and recorded over a distance of 76 kilometers. It includes yaw, pitch, and other variables.

    The table includes the following fields:

    ```
    time: a timestamp for each data point, likely in Unix epoch time.

    seconds_elapsed: seconds elapsed since the start of data collection.

    qz, qy, qx, qw: components of a quaternion representing the car's orientation
    in three-dimensional space. Quaternions are used to avoid the gimbal-lock
    problem that can occur with Euler angles.

    roll, pitch, yaw: rotation angles around the x, y, and z axes, respectively.
    These represent the car's orientation as Euler angles, which are more
    intuitive but susceptible to gimbal lock.
    ```

    Each row in the dataset represents a unique instance of data collected at a different time point, with over 1 million such instances. This kind of data is typically used for navigation, stabilization, and tracking the vehicle's trajectory over time, and it is essential for autonomous vehicle systems, robotics, and various simulations and analyses related to vehicle dynamics.
  19. California annual income distribution by work experience and gender dataset...

    • neilsberg.com
    csv, json
    Updated Jan 9, 2024
    + more versions
    Cite
    Neilsberg Research (2024). California annual income distribution by work experience and gender dataset (Number of individuals ages 15+ with income, 2022) [Dataset]. https://www.neilsberg.com/research/datasets/237966af-981b-11ee-99cf-3860777c1fe6/
    Explore at:
    csv, json. Available download formats
    Dataset updated
    Jan 9, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    California
    Variables measured
    Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time, Number of males working full time for a given income bracket, Number of males working part time for a given income bracket, Number of females working full time for a given income bracket, Number of females working part time for a given income bracket
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2022 1-Year Estimates. To portray the number of individuals of each gender (male and female) within each income bracket, we conducted an initial analysis and categorization of the American Community Survey data. Households are categorized, and median incomes are reported, based on the self-identified gender of the head of the household. For additional information about these estimations, please contact us via email at research@neilsberg.com
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset presents a detailed breakdown of the count of individuals within distinct income brackets, categorized by gender (men and women) and employment type: full-time (FT) and part-time (PT). It offers valuable insight into the income landscape within California and can be used to analyze gender-based income distribution within the California population, aiding in data analysis and decision-making.

    Key observations

    • Employment patterns: Within California, among individuals aged 15 years and older with income, there were 13.97 million men and 13.04 million women in the workforce. Among them, 7.85 million men were engaged in full-time, year-round employment, while 5.61 million women were in full-time, year-round roles.
    • Annual income under $24,999: Of the male population working full-time, 7.46% fell within the income range of under $24,999, while 9.87% of the female population working full-time was represented in the same income bracket.
    • Annual income above $100,000: 33.14% of men in full-time roles earned incomes exceeding $100,000, while 25.33% of women in full-time positions earned within this income bracket.
    • Refer to the research insights for more key observations on the other income brackets (annual income under $24,999, $25,000 to $49,999, $50,000 to $74,999, $75,000 to $99,999, and above $100,000) and employment types (full-time year-round and part-time).

    [Chart: California gender and employment-based income distribution analysis (Ages 15+) - https://i.neilsberg.com/ch/california-income-distribution-by-gender-and-employment-type.jpeg]

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2022 1-Year Estimates.

    Income brackets:

    • $1 to $2,499 or loss
    • $2,500 to $4,999
    • $5,000 to $7,499
    • $7,500 to $9,999
    • $10,000 to $12,499
    • $12,500 to $14,999
    • $15,000 to $17,499
    • $17,500 to $19,999
    • $20,000 to $22,499
    • $22,500 to $24,999
    • $25,000 to $29,999
    • $30,000 to $34,999
    • $35,000 to $39,999
    • $40,000 to $44,999
    • $45,000 to $49,999
    • $50,000 to $54,999
    • $55,000 to $64,999
    • $65,000 to $74,999
    • $75,000 to $99,999
    • $100,000 or more

    Variables / Data Columns

    • Income Bracket: This column showcases 20 income brackets ranging from $1 to $100,000+.
    • Full-Time Males: The count of males employed full-time year-round and earning within a specified income bracket
    • Part-Time Males: The count of males employed part-time and earning within a specified income bracket
    • Full-Time Females: The count of females employed full-time year-round and earning within a specified income bracket
    • Part-Time Females: The count of females employed part-time and earning within a specified income bracket

    Employment type classifications include:

    • Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.
    • Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.
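
    The bracket counts above are what the percentage figures in the key observations are computed from: each gender-and-employment column is normalized by its own total. A minimal sketch in pandas, assuming the CSV has been downloaded locally as california_income_by_gender.csv and that the column headers match the data dictionary (both the filename and the exact headers are our assumptions):

    ```
    import pandas as pd

    # Hypothetical local filename; columns follow the data dictionary above
    df = pd.read_csv("california_income_by_gender.csv")

    # Share of each full-time group that falls in the top bracket
    # ("$100,000 or more" is the last of the 20 brackets)
    for col in ["Full-Time Males", "Full-Time Females"]:
        share = 100 * df[col].iloc[-1] / df[col].sum()
        print(f"{col}: {share:.2f}% earn $100,000 or more")
    ```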

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for California median household income by gender. You can refer to the same here.

  20. Florida annual income distribution by work experience and gender dataset...

    • neilsberg.com
    csv, json
    Updated Jan 9, 2024
    + more versions
    Cite
    Neilsberg Research (2024). Florida annual income distribution by work experience and gender dataset (Number of individuals ages 15+ with income, 2022) [Dataset]. https://www.neilsberg.com/research/datasets/23a94c92-981b-11ee-99cf-3860777c1fe6/
    Explore at:
    json, csv. Available download formats
    Dataset updated
    Jan 9, 2024
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Florida
    Variables measured
    Income for Male Population, Income for Female Population, Income for Male Population working full time, Income for Male Population working part time, Income for Female Population working full time, Income for Female Population working part time, Number of males working full time for a given income bracket, Number of males working part time for a given income bracket, Number of females working full time for a given income bracket, Number of females working part time for a given income bracket
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2022 1-Year Estimates. To portray the number of individuals of each gender (male and female) within each income bracket, we conducted an initial analysis and categorization of the American Community Survey data. Households are categorized, and median incomes are reported, based on the self-identified gender of the head of the household. For additional information about these estimations, please contact us via email at research@neilsberg.com
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset presents a detailed breakdown of the count of individuals within distinct income brackets, categorized by gender (men and women) and employment type: full-time (FT) and part-time (PT). It offers valuable insight into the income landscape within Florida and can be used to analyze gender-based income distribution within the Florida population, aiding in data analysis and decision-making.

    Key observations

    • Employment patterns: Within Florida, among individuals aged 15 years and older with income, there were 8.14 million men and 8.10 million women in the workforce. Among them, 4.38 million men were engaged in full-time, year-round employment, while 3.43 million women were in full-time, year-round roles.
    • Annual income under $24,999: Of the male population working full-time, 9.81% fell within the income range of under $24,999, while 14.09% of the female population working full-time was represented in the same income bracket.
    • Annual income above $100,000: 22.97% of men in full-time roles earned incomes exceeding $100,000, while 12.96% of women in full-time positions earned within this income bracket.
    • Refer to the research insights for more key observations on the other income brackets (annual income under $24,999, $25,000 to $49,999, $50,000 to $74,999, $75,000 to $99,999, and above $100,000) and employment types (full-time year-round and part-time).

    [Chart: Florida gender and employment-based income distribution analysis (Ages 15+) - https://i.neilsberg.com/ch/florida-income-distribution-by-gender-and-employment-type.jpeg]

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2022 1-Year Estimates.

    Income brackets:

    • $1 to $2,499 or loss
    • $2,500 to $4,999
    • $5,000 to $7,499
    • $7,500 to $9,999
    • $10,000 to $12,499
    • $12,500 to $14,999
    • $15,000 to $17,499
    • $17,500 to $19,999
    • $20,000 to $22,499
    • $22,500 to $24,999
    • $25,000 to $29,999
    • $30,000 to $34,999
    • $35,000 to $39,999
    • $40,000 to $44,999
    • $45,000 to $49,999
    • $50,000 to $54,999
    • $55,000 to $64,999
    • $65,000 to $74,999
    • $75,000 to $99,999
    • $100,000 or more

    Variables / Data Columns

    • Income Bracket: This column showcases 20 income brackets ranging from $1 to $100,000+.
    • Full-Time Males: The count of males employed full-time year-round and earning within a specified income bracket
    • Part-Time Males: The count of males employed part-time and earning within a specified income bracket
    • Full-Time Females: The count of females employed full-time year-round and earning within a specified income bracket
    • Part-Time Females: The count of females employed part-time and earning within a specified income bracket

    Employment type classifications include:

    • Full-time, year-round: A full-time, year-round worker is a person who worked full time (35 or more hours per week) and 50 or more weeks during the previous calendar year.
    • Part-time: A part-time worker is a person who worked less than 35 hours per week during the previous calendar year.
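
    This dataset shares its 20-bracket schema with the California dataset above, so the two states can be compared column for column. A minimal sketch, again assuming hypothetical local filenames and the column headers from the data dictionary:

    ```
    import pandas as pd

    # Hypothetical local filenames; both files share the schema described above
    ca = pd.read_csv("california_income_by_gender.csv")
    fl = pd.read_csv("florida_income_by_gender.csv")

    def full_time_gender_ratio(df):
        """Full-time females per full-time male, per income bracket."""
        return df["Full-Time Females"] / df["Full-Time Males"]

    comparison = pd.DataFrame({
        "Income Bracket": ca["Income Bracket"],
        "CA F/M": full_time_gender_ratio(ca).round(2),
        "FL F/M": full_time_gender_ratio(fl).round(2),
    })
    # Per the key observations above, the ratio drops in the top bracket in both states
    print(comparison.tail())
    ```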

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability, and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is a part of the main dataset for Florida median household income by gender. You can refer to the same here.
