Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Consumer Spending in the United States increased to 16445.70 USD Billion in the second quarter of 2025 from 16345.80 USD Billion in the first quarter of 2025. This dataset provides the latest reported value for - United States Consumer Spending - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
The San Francisco Controller's Office maintains a database of budgetary data that appears in summarized form in each Annual Appropriation Ordinance (AAO). This data is presented on the Budget report hosted at http://openbook.sfgov.org, and is also available in this dataset in CSV format. New data is added on an annual basis when the AAO is published for each new fiscal year. Data is available from fiscal year 2010 forward.
The City and County of San Francisco's budget is a two-year plan for how the City government will spend money with available resources. In the budget process, a budget is proposed by the Mayor, and then modified and approved by the Board of Supervisors as the Appropriation Ordinance. Each year, the City updates the budget for the upcoming fiscal year and also sets a budget for the subsequent fiscal year, which is updated and approved in the following year. Enterprise departments do not submit a budget for the second year of the two-year budget; rather, estimates of enterprise department budgets in the second year are incorporated into high-level spending and revenue figures.
This dataset and the Appropriation Ordinance departmental views answer the question "How much does each department spend?". To show how much is spent by departments from the General Fund, we make the following adjustments to the regular revenues and fund balance & reserves:
+ Transfers from one department to another (leaving out transfers within the same department)
+ Recoveries from one department to another (leaving out recoveries within the same department)
- GF spent in other funds (this is deducted from GF Sources and added to the other fund's Sources)
This is the gross total. By removing the transfers and recoveries that go from one department to another, we see the same net total that is in the Appropriation Ordinance Consolidated Schedule of Sources and Uses. Note that the amount added for transfers into the General Fund that move from one department to another is different from the amount deducted to eliminate the double counting caused by transfers.
Transfer Adjustments: To meet accounting needs, money can be moved from one fund or department to another. For example, Public Works provides building maintenance services for the Fire Department, for which the Fire Department pays Public Works. To solve this double-counting problem, this dataset shows a reduction of $100,000 called Transfer Adjustments (Citywide) to the budgeted spending & revenue for the department providing the service. This lets the dataset display both the gross total of activity for both departments and the net total use of City and County revenues. In the example above, the money is moving both between departments, from Fire to Public Works, and between funds, from General Fund Operating to General Fund Work Orders/Overhead.
Transfer Adjustments (Citywide):
- Transfer Adjustments (Citywide) are used when money is moved from one department to another. These are deducted from the gross total to create the net total.
- Transfer Adjustments are included in the gross total when they are within the same department. A separate sub-object is used to distinguish departmental Transfer Adjustments from Transfer Adjustments (Citywide).
- Transfer Adjustments (Citywide) may differ from transfer adjustment lines in other public reports as a result of different approaches used to report transfers; however, the net total will remain the same across this dataset, the Mayor's Budget Book, and the Appropriations Ordinance, with limited exceptions due to error corrections and different methodologies used to present net totals. For more information, contact us.
An example of a Transfer Adjustment within a department would be Public Works overhead allocations. Overhead costs cannot easily be isolated to a direct service or unit and so are allocated across those units using accepted accounting methods. Central m
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Personal Spending in the United States increased 0.60 percent in August of 2025 over the previous month. This dataset provides the latest reported value for - United States Personal Spending - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
What We Eat in America (WWEIA) is the dietary intake interview component of the National Health and Nutrition Examination Survey (NHANES). WWEIA is conducted as a partnership between the U.S. Department of Agriculture (USDA) and the U.S. Department of Health and Human Services (DHHS). Two days of 24-hour dietary recall data are collected through an initial in-person interview, and a second interview conducted over the telephone within three to 10 days. Participants are given three-dimensional models (measuring cups and spoons, a ruler, and two household spoons) and/or USDA's Food Model Booklet (containing drawings of various sizes of glasses, mugs, bowls, mounds, circles, and other measures) to estimate food amounts. WWEIA data are collected using USDA's dietary data collection instrument, the Automated Multiple-Pass Method (AMPM). The AMPM is a fully computerized method for collecting 24-hour dietary recalls either in person or by telephone. For each 2-year data release cycle, the following dietary intake data files are available: Individual Foods File - contains one record per food for each survey participant. Foods are identified by USDA food codes. Each record contains information about when and where the food was consumed, whether the food was eaten in combination with other foods, the amount eaten, and the amounts of nutrients provided by the food. Total Nutrient Intakes File - contains one record per day for each survey participant. Each record contains daily totals of food energy and nutrient intakes, daily intake of water, intake day of week, total number of foods reported, and whether intake was usual, much more than usual, or much less than usual. The Day 1 file also includes salt use in cooking and at the table; whether on a diet to lose weight or for other health-related reasons and type of diet; and frequency of fish and shellfish consumption (examinees one year or older, Day 1 file only). DHHS is responsible for the sample design and data collection, and USDA is responsible for the survey's dietary data collection methodology, maintenance of the databases used to code and process the data, and data review and processing. USDA also funds the collection and processing of Day 2 dietary intake data, which are used to develop variance estimates and calculate usual nutrient intakes. Resources in this dataset: Resource Title: What We Eat In America (WWEIA) main web page. File Name: Web Page, url: https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-human-nutrition-research-center/food-surveys-research-group/docs/wweianhanes-overview/ Contains data tables, research articles, documentation, data sets, and more information about the WWEIA program. (Link updated 05/13/2020)
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Government spending in the United States was last recorded at 39.7 percent of GDP in 2024. This dataset provides - United States Government Spending to GDP - actual values, historical data, forecast, chart, statistics, economic calendar and news.
More details about each file are in the individual file descriptions.
This is a dataset from the U.S. Census Bureau hosted by the Federal Reserve Economic Database (FRED). FRED maintains a data platform and updates its information according to the amount of data that is brought in. Explore the U.S. Census Bureau using Kaggle and all of the data sources available through the U.S. Census Bureau organization page!
This dataset is maintained using FRED's API and Kaggle's API.
http://opendatacommons.org/licenses/dbcl/1.0/
Per capita total expenditure on health at average exchange rate (US$)
Per capita total expenditure on health expressed at average exchange rate for that year in US$. Current prices.
It is downloaded from WHO, Gapminder.
It seems well suited to forecasting total health spending around the world; it can serve as a good use case for predicting health expenditure.
The City of Boston reports on our discretionary spending as part of our commitment to transparency and to growing pathways to equitable procurement for diverse suppliers. This dataset contains all discretionary spending data from Fiscal Year 2019 - Fiscal Year 2025, Quarter 3.
Discretionary spending refers to the total value of payments made by the City to a supplier through a contract ($10,000 or over) or a purchase order (under $10,000). This includes procurements where there is competition and discretion over choosing a supplier. Discretionary spending does not include mandatory or fixed obligations, such as payments of employee benefits.
You can view an interactive dashboard of our discretionary spending at https://www.boston.gov/equitable-procurement.
To learn more about the City’s procurement and supplier diversity efforts, visit:
** Notes on the Data **
The Discretionary Spending dataset differs from our Checkbook Explorer (https://data.boston.gov/dataset/checkbook-explorer) dataset in two key ways.
If you have questions about this data or would like to provide feedback, please use this form: https://docs.google.com/forms/d/10_VGn3OEaa-JA5VJynZ9JKw-I3t0BAzw3ckv8rxVMLc/viewform?edit_requested=true
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Government Spending in the United States decreased to 3993 USD Billion in the second quarter of 2025 from 3993.90 USD Billion in the first quarter of 2025. This dataset provides the latest reported value for - United States Government Spending - plus previous releases, historical high and low, short-term forecast and long-term prediction, economic calendar, survey consensus and news.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Action recognition has received increasing attention from the computer vision and machine learning communities in recent decades. Since then, the recognition task has evolved from single-view recordings under controlled laboratory environments to unconstrained environments (i.e., surveillance environments or user-generated videos). Furthermore, recent work has focused on other aspects of the action recognition problem, such as cross-view classification, cross-domain learning, multi-modality learning, and action localization. Despite the large variety of studies, we observed limited work that explores the open-set and open-view classification problem, which is a genuinely inherent property of action recognition. In other words, a well-designed algorithm should robustly identify an unfamiliar action as "unknown" and achieve similar performance across sensors with similar fields of view. The Multi-Camera Action Dataset (MCAD) is designed to evaluate the open-view classification problem under a surveillance environment.
In our multi-camera action dataset, unlike common action datasets, we use a total of five cameras, which can be divided into two types (Static and PTZ), to record actions. In particular, there are three Static cameras (Cam04, Cam05 & Cam06) with a fish-eye effect and two Pan-Tilt-Zoom (PTZ) cameras (PTZ04 & PTZ06). The Static cameras have a resolution of 1280×960 pixels, while the PTZ cameras have a resolution of 704×576 pixels and a smaller field of view than the Static cameras. Moreover, we did not control the illumination environment; we even set two contrasting conditions (daytime and nighttime), which makes our dataset more challenging than many datasets recorded under strongly controlled illumination. The distribution of the cameras is shown in the picture on the right.
We identified 18 single-person daily actions, with and without objects, inherited from the KTH, IXMAS, and TRECVID datasets, among others. The list and definitions of the actions are shown in the table. These actions can also be divided into four types: micro actions without an object (action IDs 01, 02, 05) and with an object (action IDs 10, 11, 12, 13); intense actions without an object (action IDs 03, 04, 06, 07, 08, 09) and with an object (action IDs 14, 15, 16, 17, 18). We recruited a total of 20 human subjects. Each candidate repeated each action 8 times (4 times during the day and 4 times in the evening) under one camera. In the recording process, we used five cameras to record each action sample separately. During the recording stage we only told candidates the action name; they could then perform the action freely according to their own habits, provided they performed it within the field of view of the current camera. This makes our dataset much closer to reality. As a result, there is high intra-class variation among different action samples, as shown in the picture of action samples.
URL: http://mmas.comp.nus.edu.sg/MCAD/MCAD.html
Resources:
IDXXXX.mp4.tar.gz contains video data for each individual
boundingbox.tar.gz contains person bounding box for all videos
protocol.json contains the evaluation protocol
img_list.txt contains the download URLs for the images version of the video data
idt_list.txt contains the download URLs for the improved Dense Trajectory feature
stip_list.txt contains the download URLs for the STIP feature
Manual annotated 2D joints for selected camera view and action class (available via http://zju-capg.org/heightmap/)
How to Cite:
Please cite the following paper if you use the MCAD dataset in your work (papers, articles, reports, books, software, etc):
Wenhui Liu, Yongkang Wong, An-An Liu, Yang Li, Yu-Ting Su, Mohan Kankanhalli, "Multi-Camera Action Dataset for Cross-Camera Action Recognition Benchmarking," IEEE Winter Conference on Applications of Computer Vision (WACV), 2017. http://doi.org/10.1109/WACV.2017.28
This dataset shows the total amount of State prison expenditures for medical care, food, and utilities in the year 2001. Over a quarter of prison operating costs are for basic living expenses. Prisoner medical care, food service, utilities, and contract housing totaled $7.3 billion, or about 26% of State prison current operating expenses. Inmate medical care totaled $3.3 billion, or about 12% of operating expenditures. Supplies and services of government staff and full-time and part-time managed care and fee-for-service providers averaged $2,625 per inmate, or $7.19 per day. By comparison, the average annual health care expenditure of U.S. residents, including all sources in FY 2001, was $4,370, or $11.97 per day. Factors beyond the scope of this report contributed to the variation in spending levels for prisoner medical care. Lacking economies of scale, some States had significantly higher than average medical costs for everyone, and some had higher proportions of inmates whose abuse of drugs or alcohol had led to disease. Also influencing variations in expenditures were staffing and funding of prisoner health care and distribution of specialized medical equipment for prisoner treatment. Food service in FY 2001 cost $1.2 billion, or approximately 4% of State prison operating expenditures. On average nationwide, State departments of correction spent $2.62 to feed inmates each day. Utility services for electricity, natural gas, heating oil, water, sewerage, trash removal, and telephone in State prisons totaled $996 million in FY 2001. Utilities accounted for about 3.5% of State prison operating expenditure. For more information see the URL source of this dataset.
This dataset shows the amount of money that each state spent on its Corrections program, both as a percentage of overall state spending and as a total amount. This data was brought to our attention by the Pew Charitable Trusts in their report titled One in 100: Behind Bars in America 2008. The article emphasizes that in 2007, 1 in every 100 Americans was in prison. To note: the District of Columbia is not included; D.C. prisoners were transferred to federal custody in 2001.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the ongoing energy transition, power grids are evolving fast. They operate more and more often close to their technical limit, under more and more volatile conditions. Fast, essentially real-time computational approaches to evaluate their operational safety, stability and reliability are therefore highly desirable. Machine Learning methods have been advocated to solve this challenge, however they are heavy consumers of training and testing data, while historical operational data for real-world power grids are hard if not impossible to access.
This dataset contains long time series for production, consumption, and line flows, amounting to 20 years of data with a time resolution of one hour, for several thousand loads and several hundred generators of various types representing the ultra-high-voltage transmission grid of continental Europe. The synthetic time series have been statistically validated against real-world data.
The algorithm is described in a Nature Scientific Data paper. It relies on the PanTaGruEl model of the European transmission network -- the admittance of its lines as well as the location, type and capacity of its power generators -- and aggregated data gathered from the ENTSO-E transparency platform, such as power consumption aggregated at the national level.
The network information is encoded in the file europe_network.json. It is given in PowerModels format, which is itself derived from MatPower and compatible with PandaPower. The network features 7822 power lines and 553 transformers connecting 4097 buses, to which are attached 815 generators of various types.
The time series forming the core of this dataset are given in CSV format. Each CSV file is a table with 8736 rows, one for each hourly time step of a 364-day year. All years are truncated to exactly 52 weeks of 7 days, and start on a Monday (the load profiles are typically different during weekdays and weekends). The number of columns depends on the type of table: there are 4097 columns in load files, 815 for generators, and 8375 for lines (including transformers). Each column is described by a header corresponding to the element identifier in the network file. All values are given in per-unit, both in the model file and in the tables, i.e. they are multiples of a base unit taken to be 100 MW.
There are 20 tables of each type, labeled with a reference year (2016 to 2020) and an index (1 to 4), zipped into archive files arranged by year. This amounts to a total of 20 years of synthetic data. When using load, generator, and line profiles together, it is important to use the same label: for instance, the files loads_2020_1.csv, gens_2020_1.csv, and lines_2020_1.csv represent the same year of the dataset, whereas gens_2020_2.csv is unrelated (it actually shares some features, such as nuclear profiles, but it is based on a dispatch with distinct loads).
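As a quick illustration of this layout, here is a minimal sketch of loading one matched year of tables and converting per-unit values to MW using the 100 MW base stated above. The load_year helper and the local file paths are assumptions for illustration only; the file names follow the labeling scheme described above.
import pandas as pd
BASE_MW = 100  # per-unit base stated in the dataset description (100 MW)
def load_year(label="2020_1", path="."):
    # Read the matched load, generator, and line tables for one synthetic year.
    loads = pd.read_csv(f"{path}/loads_{label}.csv")
    gens = pd.read_csv(f"{path}/gens_{label}.csv")
    lines = pd.read_csv(f"{path}/lines_{label}.csv")
    return loads, gens, lines
loads, gens, lines = load_year("2020_1")
loads_mw = loads * BASE_MW   # convert per-unit values to MW
print(loads_mw.shape)        # expected: (8736, 4097) -- one row per hour, one column per load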
The time series can be used without a reference to the network file, simply using all or a selection of columns of the CSV files, depending on the needs. We show below how to select series from a particular country, or how to aggregate hourly time steps into days or weeks. These examples use Python and the data analysis library pandas, but other frameworks can be used as well (Matlab, Julia). Since all the yearly time series are periodic, it is always possible to define a coherent time window modulo the length of the series.
This example illustrates how to select generation data for Switzerland in Python. This can be done without parsing the network file, but using instead gens_by_country.csv, which contains a list of all generators for any country in the network. We start by importing the pandas library, and read the column of the file corresponding to Switzerland (country code CH):
import pandas as pd
CH_gens = pd.read_csv('gens_by_country.csv', usecols=['CH'], dtype=str)
The object created in this way is a DataFrame with some null values (not all countries have the same number of generators). It can be turned into a list with:
CH_gens_list = CH_gens.dropna().squeeze().to_list()
Finally, we can import all the time series of Swiss generators from a given data table with
pd.read_csv('gens_2016_1.csv', usecols=CH_gens_list)
The same procedure can be applied to loads using the list contained in the file loads_by_country.csv.
This second example shows how to change the time resolution of the series. Suppose that we are interested in all the loads from a given table, which are given by default with a one-hour resolution:
hourly_loads = pd.read_csv('loads_2018_3.csv')
To get a daily average of the loads, we can use:
daily_loads = hourly_loads.groupby([t // 24 for t in range(24 * 364)]).mean()
This results in series of length 364. To average further over entire weeks and get series of length 52, we use:
weekly_loads = hourly_loads.groupby([t // (24 * 7) for t in range(24 * 364)]).mean()
The code used to generate the dataset is freely available at https://github.com/GeeeHesso/PowerData. It consists of two packages and several documentation notebooks. The first package, written in Python, provides functions to handle the data and to generate synthetic series based on historical data. The second package, written in Julia, is used to perform the optimal power flow. The documentation in the form of Jupyter notebooks contains numerous examples of how to use both packages. The entire workflow used to create this dataset is also provided, starting from raw ENTSO-E data files and ending with the synthetic dataset given in the repository.
This work was supported by the Cyber-Defence Campus of armasuisse and by an internal research grant of the Engineering and Architecture domain of HES-SO.
We conducted a cross-sectional study of the publicly available 2022 Open Payments data to characterize and quantify sponsored events (available for download at: https://www.cms.gov/priorities/key-initiatives/open-payments/data/dataset-downloads).
Data sources: We downloaded the 2022 dataset ZIP files from the Open Payments website on June 30th, 2023. We included all records for nurse practitioners, clinical nurse specialists, certified registered nurse anesthetists, and certified nurse-midwives (hereafter advanced practice registered nurses (APRNs)); and allopathic and osteopathic physicians (hereafter 'physicians'). To ensure consistency in provider classification, we linked Payments data to the National Plan and Provider Enumeration System data (June 2023) by National Provider Identifier (NPI) and the National Uniform Claim Committee (NUCC), and excluded individuals with an ambiguous provider type.
Event-centric analysis of Open Payments records (creating an event typology): We included only payments classified as "food and beverage" to reliably identify distinct sponsored events. We reasoned that food and beverage would be consumed on the same day in the same place, and thus assumed that records for food and beverage associated with the same event would share the date of payment and location. We also assumed that the reported value of a food and beverage payment is the total cost of the hospitality divided by the number of attendees, and thus grouped payment records with the same amount, rounded to the nearest dollar. Inferring which Open Payments records relate to the same sponsored event requires analytic decisions regarding the selection and representation of variables that define an event. To understand the impact of these choices, we undertook a sensitivity analysis to explore alternative ways to group Open Payments records for food and beverage, to determine how combinations of variables, including date (specific date or within the same calendar week), amount (rounded to nearest dollar), and recipient's state, affected the identification of sponsored events in the Open Payments dataset.
We chose to define a sponsored event as a cluster of three or more individual payment records for food and beverage (nature of payment) with the following matching Open Payments record variables:
• Submitting applicable manufacturer (name)
• Product category or therapeutic area
• Name of drug or biological or device or medical supply
• Recipient state
• Total amount of payment (USD, rounded to nearest dollar)
• Date of payment (exact)
After examining the distribution of the data, we classified events in terms of size (≥20 attendees as "large" and 3-<20 as "small") and amount per person. We categorized events <$10 as "coffee", $10-<$30 as "lunch", $30-<$150 as "dinner", and ≥$150 as "banquet".
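The grouping logic described above can be expressed compactly in pandas. The sketch below is an illustration only: the file name and the simplified column names (manufacturer, product_category, product_name, recipient_state, amount_usd, payment_date) are hypothetical stand-ins for the official Open Payments field names.
import pandas as pd
import numpy as np
KEYS = ["manufacturer", "product_category", "product_name", "recipient_state", "amount_rounded", "payment_date"]
payments = pd.read_csv("food_beverage_payments.csv")          # food & beverage records only
payments["amount_rounded"] = payments["amount_usd"].round(0)  # group on whole dollars
events = payments.groupby(KEYS).size().reset_index(name="n_attendees")
events = events[events["n_attendees"] >= 3]                   # a sponsored event = cluster of 3+ matching records
events["size"] = np.where(events["n_attendees"] >= 20, "large", "small")
events["type"] = pd.cut(events["amount_rounded"], bins=[0, 10, 30, 150, np.inf], labels=["coffee", "lunch", "dinner", "banquet"], right=False)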
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
by Vizonix
This dataset differentiates between 4 similar object classes: 4 types of canned goods. We built this dataset with cans of olives, beans, stewed tomatoes, and refried beans.
The dataset is pre-augmented. That is to say, all required augmentations are applied to the actual native dataset prior to inference. We have found that augmenting this way provides our users maximum visibility and flexibility in tuning their dataset (and classifier) to achieve their specific use-case goals. Augmentations are present and visible in the native dataset prior to the classifier - so it's never a mystery what augmentation tweaks produce a more positive or negative outcome during training. It also eliminates the risk of downsizing affecting annotations.
The training images in this dataset were created in our studio in Florida from actual physical objects to the following specifications:
The training images in this dataset were composited / augmented in this way:
1,600 (+) different images were uploaded for each class (out of the 25,000 total images created for each class).
Understanding our Dataset Insights File
As users train their classifiers, they often wish to enhance accuracy by experimenting with or tweaking their dataset. With our Dataset Insights documents, they can easily determine which images possess which augmentations. Dataset Insights allow users to easily add or remove images with specific augmentations as they wish. This also provides a detailed profile and inventory of each file in the dataset.
The Dataset Insights document enables the user to see exactly which source image, angle, augmentation(s), etc. were used to create each image in the dataset.
Dataset Insight Files:
About Vizonix
Vizonix (vizonix.com) creates from-scratch datasets from 100% in-house photography. Our images and backgrounds are generated in-house in our Florida studio. We typically image smaller items, deliver in 72 hours, and specialize in Manufacturer Quality Assurance (MQA) datasets.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Lending Club offers peer-to-peer (P2P) loans through a technological platform for various personal finance purposes and is today one of the companies that dominate the US P2P lending market. The original dataset is publicly available on Kaggle and corresponds to all the loans issued by Lending Club between 2007 and 2018. The present version of the dataset is for constructing a granting model, that is, a model designed to make decisions on whether to grant a loan based on information available at the time of the loan application. Consequently, our dataset only has a selection of variables from the original one: those known at the moment the loan request is made. Furthermore, the target variable of a granting model represents the final status of the loan, which is either "default" or "fully paid". Thus, we filtered out from the original dataset all the loans in transitory states. Our dataset comprises 1,347,681 records or obligations (approximately 60% of the original) and it was also cleaned for completeness and consistency (less than 1% of our dataset was filtered out).
TARGET VARIABLE
The dataset includes a target variable based on the final resolution of the credit: the default category corresponds to the event charged off and the non-default category to the event fully paid. It does not consider other values in the loan status variable since this variable represents the state of the loan at the end of the considered time window. Thus, there are no loans in transitory states. The original dataset includes the target variable “loan status”, which contains several categories ('Fully Paid', 'Current', 'Charged Off', 'In Grace Period', 'Late (31-120 days)', 'Late (16-30 days)', 'Default'). However, in our dataset, we just consider loans that are either “Fully Paid” or “Default” and transform this variable into a binary variable called “Default”, with a 0 for fully paid loans and a 1 for defaulted loans.
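As a minimal sketch of this transformation (the input file name is hypothetical; the loan_status values and the mapping of charged-off loans to the default category follow the description above):
import pandas as pd
loans = pd.read_csv("lending_club_loans.csv", low_memory=False)  # hypothetical file name
resolved = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()  # drop transitory states
resolved["Default"] = (resolved["loan_status"] == "Charged Off").astype(int)       # 1 = defaulted, 0 = fully paid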
EXPLANATORY VARIABLES
The explanatory variables that we use correspond only to the information available at the time of the application. Variables such as the interest rate, grade, or subgrade are generated by the company as a result of a credit risk assessment process, so they were filtered out from the dataset as they must not be considered in risk models to predict the default in granting of credit.
Loan identification variables:
id: Loan id (unique identifier).
issue_d: Month and year in which the loan was approved.
Quantitative variables:
revenue: Borrower's self-declared annual income during registration.
dti_n: Indebtedness ratio for obligations excluding mortgage. Monthly information. This ratio has been calculated considering the indebtedness of the whole group of applicants. It is estimated as the ratio calculated using the co-borrowers’ total payments on the total debt obligations divided by the co-borrowers’ combined monthly income.
loan_amnt: Amount of credit requested by the borrower.
fico_n: Defined between 300 and 850, reported by Fair Isaac Corporation as a risk measure based on historical credit information reported at the time of application. This value has been calculated as the average of the variables “fico_range_low” and “fico_range_high” in the original dataset.
experience_c: Binary variable that indicates whether the borrower is new to the entity. This variable is constructed from the credit date of the previous obligation in LC and the credit date of the current obligation; if the difference between dates is positive, it is not considered as a new experience with LC.
Categorical variables:
emp_length: Categorical variable with the employment length of the borrower (includes the no information category)
purpose: Credit purpose category for the loan request.
home_ownership_n: Homeownership status provided by the borrower in the registration process. Categories defined by LC: “mortgage”, “rent”, “own”, “other”, “any”, “none”. We merged the categories “other”, “any” and “none” as “other”.
addr_state: Borrower's residence state from the USA.
zip_code: Zip code of the borrower's residence.
Textual variables
title: Title of the credit request description provided by the borrower.
desc: Description of the credit request provided by the borrower.
We cleaned the textual variables. First, we removed all those descriptions that contained the default description provided by Lending Club on its web form ("Tell your story. What is your loan for?"). Moreover, we removed the prefix "Borrower added on DD/MM/YYYY >" from the descriptions to avoid any temporal background in them. Finally, as these descriptions came from a web form, we substituted all the HTML entities with their corresponding characters (e.g. "&amp;" was substituted by "&", "&lt;" was substituted by "<", etc.).
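A minimal sketch of this cleaning step is shown below. The function name and file handling are assumptions; only the default prompt text, the prefix pattern, and the HTML unescaping come from the description above.
import html
import re
import pandas as pd
def clean_description(text):
    if pd.isna(text):
        return text
    text = re.sub(r"Borrower added on \d{2}/\d{2}/\d{4} >", "", text)  # strip the web-form prefix
    text = html.unescape(text)  # "&amp;" -> "&", "&lt;" -> "<", etc.
    if text.strip() == "Tell your story. What is your loan for?":      # default prompt only
        return ""
    return text.strip()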
This dataset has been used in the following academic articles:
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Normative learning theories dictate that we should preferentially attend to informative sources, but only up to the point that our limited learning systems can process their content. Humans, including infants, show this predicted strategic deployment of attention. Here we demonstrate that rhesus monkeys, much like humans, attend to events of moderate surprisingness over both more and less surprising events. They do this in the absence of any specific goal or contingent reward, indicating that the behavioral pattern is spontaneous. We suggest this U-shaped attentional preference represents an evolutionarily preserved strategy for guiding intelligent organisms toward material that is maximally useful for learning.
Methods. How the data were collected: In this project, we collected gaze data from 5 macaques while they watched sequential visual displays designed to elicit probabilistic expectations. Gaze was recorded using the Eyelink Toolbox and sampled at 1000 Hz by an infrared eye-monitoring camera system.
Dataset:
"csv-combined.csv" is an aggregated dataset that includes one pop-up event per row for all original datasets for each trial. Here are descriptions of each column in the dataset:
subj: subject_ID = {"B":104, "C":102, "H":101, "J":103, "K":203}
trialtime: start time of current trial in seconds
trial: current trial number (each trial featured one of 80 possible visual-event sequences) (in order)
seq current: sequence number (one of 80 sequences)
seq_item: current item number in a seq (in order)
active_item: pop-up item (active box)
pre_active: prior pop-up item (active box) {-1: "the first active object in the sequence / no active object before the currently active object in the sequence"}
next_active: next pop-up item (active box) {-1: "the last active object in the sequence / no active object after the currently active object in the sequence"}
firstappear: {0: "not first", 1: "first appear in the seq"}
looks_blank: csv: total amount of time looking at blank space for current event (ms); csv_timestamp: {1: "look blank at timestamp", 0: "not look blank at timestamp"}
looks_offscreen: csv: total amount of time looking offscreen for current event (ms); csv_timestamp: {1: "look offscreen at timestamp", 0: "not look offscreen at timestamp"}
time till target: time spent to first start looking at the target object (ms) {-1: "never look at the target"}
looks target: csv: time spent looking at the target object (ms); csv_timestamp: look at the target or not at current timestamp (1 or 0)
look1,2,3: time spent looking at each object (ms)
location 123X, 123Y: location of each box (locations of the three boxes for a given sequence were chosen randomly, but remained static throughout the sequence)
item123id: pop-up item ID (remained static throughout a sequence)
event time: total time spent for the whole event (pop-up and go back) (ms)
eyeposX,Y: eye position at current timestamp
"csv-surprisal-prob.csv" is an output file from Monkilock_Data_Processing.ipynb. Surprisal values for each event were calculated and added to the "csv-combined.csv". Here are descriptions of each additional column:
rt: time till target {-1: "never look at the target"}. In data analysis, we included data that have rt > 0.
already_there: {NA: "never look at the target object"}. In data analysis, we included events that are not the first event in a sequence, are not repeats of the previous event, and whose already_there is not NA.
looks_away: {TRUE: "the subject was looking away from the currently active object at this time point", FALSE: "the subject was not looking away from the currently active object at this time point"}
prob: the probability of the occurrence of the object
surprisal: unigram surprisal value
bisurprisal: transitional surprisal value
std_surprisal: standardized unigram surprisal value
std_bisurprisal: standardized transitional surprisal value
binned_surprisal_means: the means of unigram surprisal values binned into three groups of evenly spaced intervals according to surprisal values.
binned_bisurprisal_means: the means of transitional surprisal values binned into three groups of evenly spaced intervals according to surprisal values.
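For readers unfamiliar with these columns, a minimal sketch of how surprisal values are conventionally derived from the prob column is shown below. The log base and the z-score standardization are assumptions about the standard definition, not documented choices of the authors.
import numpy as np
import pandas as pd
df = pd.read_csv("csv-surprisal-prob.csv")
unigram_surprisal = -np.log2(df["prob"])  # assumed convention: surprisal = -log2(probability)
std_surprisal = (unigram_surprisal - unigram_surprisal.mean()) / unigram_surprisal.std()  # assumed z-score standardization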
"csv-surprisal-prob_updated.csv" is a ready-for-analysis dataset generated by Analysis_Code_final.Rmd after standardizing controlled variables, changing data types for categorical variables for analysts, etc. "AllSeq.csv" includes event information of all 80 sequences
Empty Values in Datasets:
There are no missing values in the original dataset "csv-combined.csv". Missing values (marked as NA in the datasets) occur in the columns "prev_active", "next_active", "already_there", "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" in "csv-surprisal-prob.csv" and "csv-surprisal-prob_updated.csv". NAs in the columns "prev_active" and "next_active" mean that the currently active object is the first or the last active object in the sequence, i.e. there is no active object before or after it. When we analyzed the variable "already_there", we eliminated data whose "prev_active" variable is NA. NAs in the column "already_there" mean that the subject never looked at the target object in the current event; when we analyzed this variable, we eliminated data whose "already_there" variable is NA. Missing values occur in the columns "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" when the event is the first event in the sequence and its transitional probability cannot be computed because no event happened before it in the sequence. When we fitted models for transitional statistics, we eliminated data whose "bisurprisal", "std_bisurprisal", and "sq_std_bisurprisal" values are NAs.
Codes:
In "Monkilock_Data_Processing.ipynb", we processed raw fixation data of 5 macaques and explored the relationship between their fixation patterns and the "surprisal" of events in each sequence. We computed the following variables which are necessary for further analysis, modeling, and visualizations in this notebook (see above for details): active_item, pre_active, next_active, firstappear ,looks_blank, looks_offscreen, time till target, looks target, look1,2,3, prob, surprisal, bisurprisal, std_surprisal, std_bisurprisal, binned_surprisal_means, binned_bisurprisal_means. "Analysis_Code_final.Rmd" is the main scripts that we further processed the data, built models, and created visualizations for data. We evaluated the statistical significance of variables using mixed effect linear and logistic regressions with random intercepts. The raw regression models include standardized linear and quadratic surprisal terms as predictors. The controlled regression models include covariate factors, such as whether an object is a repeat, the distance between the current and previous pop up object, trial number. A generalized additive model (GAM) was used to visualize the relationship between the surprisal estimate from the computational model and the behavioral data. "helper-lib.R" includes helper functions used in Analysis_Code_final.Rmd
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Personal Protective Equipment Dataset (PPED)
This dataset serves as a benchmark for PPE in chemical plants. We provide datasets and experimental results.
We produced a data set based on the actual needs and relevant regulations in chemical plants. The standard GB 39800.1-2020 formulated by the Ministry of Emergency Management of the People’s Republic of China defines the protective requirements for plants and chemical laboratories. The complete dataset is contained in the folder PPED/data.
1.1. Image collection
We took more than 3300 pictures. We set the following different characteristics, including different environments, different distances, different lighting conditions, different angles, and the diversity of the number of people photographed.
Backgrounds: There are 4 backgrounds, including office, near machines, factory and regular outdoor scenes.
Scale: By taking pictures from different distances, the captured PPEs are classified in small, medium and large scales.
Light: Good lighting conditions and poor lighting conditions were studied.
Diversity: Some images contain a single person, and some contain multiple people.
Angle: The pictures we took can be divided into front and side.
A total of more than 3300 photos were taken in the raw data under all conditions. All images are located in the folder “PPED/data/JPEGImages”.
1.2. Label
We use LabelImg as the labeling tool and the PASCAL-VOC annotation format. YOLO uses the txt format; we can use trans_voc2yolo.py to convert the XML files in PASCAL-VOC format to txt files (a sketch of this conversion is shown below). Annotations are stored in the folder PPED/data/Annotations.
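The sketch below illustrates the standard PASCAL-VOC to YOLO conversion (one line per object: class x_center y_center width height, all normalized to [0, 1]). It is not the authors' trans_voc2yolo.py; the function name and structure are assumptions.
import xml.etree.ElementTree as ET
def voc_to_yolo_lines(xml_path, class_names):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # convert corner coordinates to normalized center/width/height
        lines.append(f"{cls_id} {(xmin + xmax) / 2 / w:.6f} {(ymin + ymax) / 2 / h:.6f} {(xmax - xmin) / w:.6f} {(ymax - ymin) / h:.6f}")
    return lines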
1.3. Dataset Features
The pictures are made by us according to the different conditions mentioned above. The file PPED/data/feature.csv is a CSV file which records the features of every image, including lighting conditions, angles, backgrounds, number of people, and scale.
1.4. Dataset Division
The dataset is divided into a training set and a test set at a 9:1 ratio (a minimal splitting sketch is shown below).
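A minimal sketch of such a split; the image folder comes from Section 1.1, while the random seed and file extension are assumptions.
import random
from pathlib import Path
images = sorted(Path("PPED/data/JPEGImages").glob("*.jpg"))
random.seed(0)
random.shuffle(images)
cut = int(0.9 * len(images))           # 9:1 split
train_set, test_set = images[:cut], images[cut:]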
We provide baseline results with five models, namely Faster R-CNN (R), Faster R-CNN (M), SSD, YOLOv3-spp, and YOLOv5. All code and results are given in folder PPED/experiment.
2.1. Environment and Configuration:
Intel Core i7-8700 CPU
NVIDIA GTX1060 GPU
16 GB of RAM
Python: 3.8.10
pytorch: 1.9.0
pycocotools: pycocotools-win
Windows 10
2.2. Applied Models
The source codes and results of the applied models are given in folder PPED/experiment with sub-folders corresponding to the model names.
2.2.1. Faster R-CNN
Faster R-CNN
backbone: resnet50+fpn
We downloaded the pre-training weights from https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth.
We modified the dataset path, training classes and training parameters including batch size.
We run train_res50_fpn.py to start training.
Then, the weights are trained by the training set.
Finally, we validate the results on the test set.
backbone: mobilenetv2
The same training method as resnet50+fpn is applied, but the results are not as good as with resnet50+fpn, so this backbone is directly discarded.
The Faster R-CNN source code used in our experiment is given in folder PPED/experiment/Faster R-CNN. The weights of the fully-trained Faster R-CNN (R) and Faster R-CNN (M) models are stored in files PPED/experiment/trained_models/resNetFpn-model-19.pth and mobile-model.pth. The performance measurements of Faster R-CNN (R) and Faster R-CNN (M) are stored in folders PPED/experiment/results/Faster RCNN(R) and Faster RCNN(M).
2.2.2. SSD
backbone: resnet50
We downloaded pre-training weights from https://download.pytorch.org/models/resnet50-19c8e357.pth.
The same training method as Faster R-CNN is applied.
The SSD source code used in our experiment is given in folder PPED/experiment/ssd. The weights of the fully-trained SSD model are stored in file PPED/experiment/trained_models/SSD_19.pth. The performance measurements of SSD are stored in folder PPED/experiment/results/SSD.
2.2.3. YOLOv3-spp
backbone: DarkNet53
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov3-spp-ultralytics-608.pt.
The YOLOv3-spp source code used in our experiment is given in folder PPED/experiment/YOLOv3-spp. The weights of the fully-trained YOLOv3-spp model are stored in file PPED/experiment/trained_models/YOLOvspp-19.pt. The performance measurements of YOLOv3-spp are stored in folder PPED/experiment/results/YOLOv3-spp.
2.2.4. YOLOv5
backbone: CSP_DarkNet
We modified the type information of the XML file to match our application.
We run trans_voc2yolo.py to convert the XML file in VOC format to a txt file.
The weights used are: yolov5s.
The YOLOv5 source code used in our experiment is given in folder PPED/experiment/yolov5. The weights of the fully-trained YOLOv5 model are stored in file PPED/experiment/trained_models/YOLOv5.pt. The performance measurements of YOLOv5 are stored in folder PPED/experiment/results/YOLOv5.
2.3. Evaluation
The computed evaluation metrics as well as the code needed to compute them from our dataset are provided in the folder PPED/experiment/eval.
Faster R-CNN (R and M)
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/faster_rcnn.py
SSD
official code: https://github.com/pytorch/vision/blob/main/torchvision/models/detection/ssd.py
YOLOv3-spp
YOLOv5
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Money Supply M2 in the United States increased to 21942 USD Billion in May from 21862.40 USD Billion in April of 2025. This dataset provides - United States Money Supply M2 - actual values, historical data, forecast, chart, statistics, economic calendar and news.
https://spdx.org/licenses/CC0-1.0.html
We are publishing a walking activity dataset including inertial and positioning information from 19 volunteers, together with reference distance measured using a trundle wheel. The dataset includes a total of 96.7 km walked by the volunteers, split into 203 separate tracks. Two types of trundle wheel were used: an analogue trundle wheel, which provides the total number of meters walked in a single track, or a sensorized trundle wheel, which measures every revolution of the wheel, therefore recording a continuous incremental distance.
Each track has data from the accelerometer and gyroscope embedded in the phones, location information from the Global Navigation Satellite System (GNSS), and the step count obtained by the device. The dataset can be used to implement walking distance estimation algorithms and to explore data quality in the context of walking activity and physical capacity tests, fitness, and pedestrian navigation.
Methods
The proposed dataset is a collection of walks where participants used their own smartphones to capture inertial and positioning information. The participants involved in the data collection come from two sites. The first site is the Oxford University Hospitals NHS Foundation Trust, United Kingdom, where 10 participants (7 affected by cardiovascular diseases and 3 healthy individuals) performed unsupervised 6MWTs in an outdoor environment of their choice (ethical approval obtained from the UK National Health Service Health Research Authority, protocol reference number: 17/WM/0355). All participants involved provided informed consent. The second site is at Malmö University, in Sweden, where a group of 9 healthy researchers collected data. This dataset can be used by researchers to develop distance estimation algorithms and to study how data quality impacts the estimation.
All walks were performed by holding a smartphone in one hand, with an app collecting inertial data, the GNSS signal, and the step count. In the other hand, participants held a trundle wheel to obtain the ground-truth distance. Two different trundle wheels were used: an analogue trundle wheel that allowed the registration of a single total value of walked distance, and a sensorized trundle wheel which collected timestamps and distance at every 1-meter revolution, resulting in continuous incremental distance information. The latter configuration is innovative and allows the use of temporal windows of the IMU data as input to machine learning algorithms to estimate walked distance. In the case of data collected by researchers, if the walks were done simultaneously and at a close distance from each other, only one person used the trundle wheel, and the reference distance was associated with all walks that were collected at the same time. The walked paths are of variable length, duration, and shape. Participants were instructed to walk paths of increasing curvature, from straight to rounded. Irregular paths are particularly useful in determining limitations in the accuracy of walked distance algorithms. Two smartphone applications were developed for collecting the information of interest from the participants' devices, both available for Android and iOS operating systems. The first is a web application that retrieves inertial data (acceleration, rotation rate, orientation) while connecting to the sensorized trundle wheel to record incremental reference distance [1]. The second app is the Timed Walk app [2], which guides the user in performing a walking test by signalling when to start and when to stop the walk while collecting both inertial and positioning data. All participants in the UK used the Timed Walk app.
The data collected during the walk come from the Inertial Measurement Unit (IMU) of the phone and, when available, the Global Navigation Satellite System (GNSS). In addition, the step count information is retrieved from the sensors embedded in each participant's smartphone. With the dataset, we provide a descriptive table with the characteristics of each recording, including the brand and model of the smartphone, duration, reference total distance, types of signals included, and additionally scores for some relevant parameters related to the quality of the various signals. The path curvature is one of the most relevant parameters. Previous literature from our team, in fact, confirmed the negative impact of curved-shaped paths with the use of multiple distance estimation algorithms [3]. We visually inspected the walked paths and clustered them into three groups: a) straight path, i.e. no turns wider than 90 degrees; b) gently curved path, i.e. between one and five turns wider than 90 degrees; and c) curved path, i.e. more than five turns wider than 90 degrees. Other features relevant to the quality of the collected signals are the total amount of time above a threshold (0.05 s and 6 s) where, respectively, inertial and GNSS data were missing due to technical issues or due to the app going into the background and thus losing access to the sensors, the sampling frequency of the different data streams, the average walking speed, and the smartphone position. The start of each walk is set to 0 ms, so no absolute time information is reported. Walk locations collected in the UK are anonymized using the following approach: the first position is fixed to a central location of the city of Oxford (latitude: 51.7520, longitude: -1.2577) and all other positions are reassigned by applying a translation along the longitudinal and latitudinal axes which maintains the original distance and angle between samples. This way, the exact geographical location is lost, but the path shape and distances between samples are maintained. The differences between consecutive points "as the crow flies" and the path curvature were numerically and visually inspected to confirm that they match the original walks. Computations were made possible by using the Haversine Python library.
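A minimal sketch of the anonymization step described above is shown below. The file and column names are assumptions, and the haversine library is used only to check that a consecutive "as the crow flies" distance is preserved after the shift; this is the kind of verification the authors describe, not their exact code.
import pandas as pd
from haversine import haversine
ANCHOR = (51.7520, -1.2577)  # central Oxford, as stated above
def anonymize(track):
    # Translate the whole track so the first GNSS fix lands on ANCHOR,
    # preserving the relative lat/lon offsets between samples.
    track = track.copy()
    track["lat"] += ANCHOR[0] - track["lat"].iloc[0]
    track["lon"] += ANCHOR[1] - track["lon"].iloc[0]
    return track
track = pd.read_csv("walk_gnss.csv")  # hypothetical file with lat/lon columns
anon = anonymize(track)
d_orig = haversine((track["lat"].iloc[0], track["lon"].iloc[0]), (track["lat"].iloc[1], track["lon"].iloc[1]))
d_anon = haversine((anon["lat"].iloc[0], anon["lon"].iloc[0]), (anon["lat"].iloc[1], anon["lon"].iloc[1]))
print(d_orig, d_anon)  # should be (approximately) equal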
Multiple datasets are available regarding walking activity recognition among other daily living tasks. However, few studies publish datasets that focus on walked distance for both indoor and outdoor environments and that provide relevant ground-truth information for it. Yan et al. [4] introduced an inertial walking dataset within indoor scenarios using a smartphone placed in 4 positions (on the leg, in a bag, in the hand, and on the body) by six healthy participants. The reference measurement used in this study is a Visual Odometry System embedded in a smartphone that has to be worn at the chest level, using a strap to hold it. While interesting and detailed, this dataset lacks GNSS data, which is likely to be used in outdoor scenarios, and the reference used for localization also suffers from accuracy issues, especially outdoors. Vezovcnik et al. [5] analysed estimation models for step length and provided an open-source dataset with a total of 22 km of inertial-only walking data from 15 healthy adults. While relevant, their dataset focuses on steps rather than total distance and was acquired on a treadmill, which limits its validity in real-world scenarios. Kang et al. [6] proposed a way to estimate travelled distance by using an Android app that matches outdoor walking patterns to indoor contexts for each participant. They collect data outdoors, including both inertial and positioning information, and use average speed values obtained from the GPS data as reference labels. Afterwards, they use deep learning models to estimate walked distance, obtaining high performance. They report that 3% to 11% of the data for each participant was discarded due to low quality. Unfortunately, the name of the app used is not reported and the paper does not mention whether the dataset can be made available.
This dataset is heterogeneous under multiple aspects. It includes a majority of healthy participants, therefore, it is not possible to generalize the outcomes from this dataset to all walking styles or physical conditions. The dataset is heterogeneous also from a technical perspective, given the difference in devices, acquired data, and used smartphone apps (i.e. some tests lack IMU or GNSS, sampling frequency in iPhone was particularly low). We suggest selecting the appropriate track based on desired characteristics to obtain reliable and consistent outcomes.
This dataset allows researchers to develop algorithms to compute walked distance and to explore data quality and reliability in the context of the walking activity. This dataset was initiated to investigate the digitalization of the 6MWT, however, the collected information can also be useful for other physical capacity tests that involve walking (distance- or duration-based), or for other purposes such as fitness, and pedestrian navigation.
The article related to this dataset will be published in the proceedings of the IEEE MetroXRAINE 2024 conference, held in St. Albans, UK, 21-23 October.
This research is partially funded by the Swedish Knowledge Foundation and the Internet of Things and People research center through the Synergy project Intelligent and Trustworthy IoT Systems.