100+ datasets found

Spending Habits by Category and Item

kaggle.com

Updated Jan 16, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Spending Habits by Category and Item [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/spending-habits

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jan 16, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Ahmed Mohamed

License

http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

Description

Spending Patterns Dataset

Dataset Overview

The Spending Patterns Dataset provides a synthetic representation of consumer spending behavior across various categories. This dataset is ideal for exploratory data analysis, statistical modeling, and machine learning applications related to financial forecasting, customer segmentation, or consumer behavior analysis.

Dataset Features

The dataset contains 10,000 transactions for 200 unique customers. Each transaction is associated with detailed information, including category, item, quantity, price, payment method, and transaction date.

Columns

Column Name	Description
`Customer ID`	Unique identifier for each customer (e.g., `CUST_0001`).
`Category`	The spending category (e.g., Groceries, Shopping, Travel).
`Item`	The specific item purchased within the category (e.g., Milk, Plane Ticket).
`Quantity`	Number of units purchased. For specific categories (e.g., Subscriptions, Housing and Utilities, Transportation, Medical/Dental, Travel), this is always `1`.
`Price Per Unit`	The price of one unit of the item (in USD).
`Total Spent`	Total expenditure for the transaction (`Quantity` × `Price Per Unit`).
`Payment Method`	The payment method used (e.g., Credit Card, Cash).
`Location`	Where the transaction occurred (e.g., Online, In-store, Mobile App).
`Transaction Date`	The date of the transaction (YYYY-MM-DD format).

Categories and Items

The dataset includes the following spending categories with example items:

Groceries: Milk, Bread, Fruits, Vegetables, Meat, etc.
Shopping: Clothes, Shoes, Electronics, Car.
Subscriptions: Streaming Service, Gym Membership (Quantity always 1).
Housing and Utilities: Rent, Electricity Bill, Gas Bill (Quantity always 1).
Transportation: Gas, Public Transit, Car Repair (Quantity always 1).
Food: Restaurant Meal, Fast Food, Coffee.
Medical/Dental: Doctor Visit, Dentist Visit, Medicine (Quantity always 1).
Personal Hygiene: Toothpaste, Shampoo, Soap.
Fitness: Yoga Class, Personal Trainer, Workout Equipment.
Travel: Plane Ticket, Hotel Stay, Taxi/Uber (Plane Ticket and Hotel Stay have Quantity always 1).
Hobbies: Books, Art Supplies, Video Games.
Friend Activities: Movie Tickets, Concert Tickets, Dinner with Friends.
Gifts: Flowers, Gift Cards, Jewelry.

Usage Examples

Example 1: Total Spending by Category

Example 2: Spending Habits of Each Customer

Example 13 Price Change Over Time

m
Dataset for The effects of a number line intervention on calculation skills
figshare.mq.edu.au
researchdata.edu.au
txt
Updated May 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Carola Ruiz Hornblas; Saskia Kohnen; Rebecca Bull (2023). Dataset for The effects of a number line intervention on calculation skills [Dataset]. http://doi.org/10.25949/22799717.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.25949/22799717.v1
Dataset updated
May 12, 2023
Dataset provided by
Macquarie University
Authors
Carola Ruiz Hornblas; Saskia Kohnen; Rebecca Bull
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Study information The sample included in this dataset represents five children who participated in a number line intervention study. Originally six children were included in the study, but one of them fulfilled the criterion for exclusion after missing several consecutive sessions. Thus, their data is not included in the dataset. All participants were currently attending Year 1 of primary school at an independent school in New South Wales, Australia. For children to be able to eligible to participate they had to present with low mathematics achievement by performing at or below the 25th percentile in the Maths Problem Solving and/or Numerical Operations subtests from the Wechsler Individual Achievement Test III (WIAT III A & NZ, Wechsler, 2016). Participants were excluded from participating if, as reported by their parents, they have any other diagnosed disorders such as attention deficit hyperactivity disorder, autism spectrum disorder, intellectual disability, developmental language disorder, cerebral palsy or uncorrected sensory disorders. The study followed a multiple baseline case series design, with a baseline phase, a treatment phase, and a post-treatment phase. The baseline phase varied between two and three measurement points, the treatment phase varied between four and seven measurement points, and all participants had 1 post-treatment measurement point. The number of measurement points were distributed across participants as follows: Participant 1 – 3 baseline, 6 treatment, 1 post-treatment Participant 3 – 2 baseline, 7 treatment, 1 post-treatment Participant 5 – 2 baseline, 5 treatment, 1 post-treatment Participant 6 – 3 baseline, 4 treatment, 1 post-treatment Participant 7 – 2 baseline, 5 treatment, 1 post-treatment In each session across all three phases children were assessed in their performance on a number line estimation task, a single-digit computation task, a multi-digit computation task, a dot comparison task and a number comparison task. Furthermore, during the treatment phase, all children completed the intervention task after these assessments. The order of the assessment tasks varied randomly between sessions.

Measures Number Line Estimation. Children completed a computerised bounded number line task (0-100). The number line is presented in the middle of the screen, and the target number is presented above the start point of the number line to avoid signalling the midpoint (Dackermann et al., 2018). Target numbers included two non-overlapping sets (trained and untrained) of 30 items each. Untrained items were assessed on all phases of the study. Trained items were assessed independent of the intervention during baseline and post-treatment phases, and performance on the intervention is used to index performance on the trained set during the treatment phase. Within each set, numbers were equally distributed throughout the number range, with three items within each ten (0-10, 11-20, 21-30, etc.). Target numbers were presented in random order. Participants did not receive performance-based feedback. Accuracy is indexed by percent absolute error (PAE) [(number estimated - target number)/ scale of number line] x100.

Single-Digit Computation. The task included ten additions with single-digit addends (1-9) and single-digit results (2-9). The order was counterbalanced so that half of the additions present the lowest addend first (e.g., 3 + 5) and half of the additions present the highest addend first (e.g., 6 + 3). This task also included ten subtractions with single-digit minuends (3-9), subtrahends (1-6) and differences (1-6). The items were presented horizontally on the screen accompanied by a sound and participants were required to give a verbal response. Participants did not receive performance-based feedback. Performance on this task was indexed by item-based accuracy.

Multi-digit computational estimation. The task included eight additions and eight subtractions presented with double-digit numbers and three response options. None of the response options represent the correct result. Participants were asked to select the option that was closest to the correct result. In half of the items the calculation involved two double-digit numbers, and in the other half one double and one single digit number. The distance between the correct response option and the exact result of the calculation was two for half of the trials and three for the other half. The calculation was presented vertically on the screen with the three options shown below. The calculations remained on the screen until participants responded by clicking on one of the options on the screen. Participants did not receive performance-based feedback. Performance on this task is measured by item-based accuracy.

Dot Comparison and Number Comparison. Both tasks included the same 20 items, which were presented twice, counterbalancing left and right presentation. Magnitudes to be compared were between 5 and 99, with four items for each of the following ratios: .91, .83, .77, .71, .67. Both quantities were presented horizontally side by side, and participants were instructed to press one of two keys (F or J), as quickly as possible, to indicate the largest one. Items were presented in random order and participants did not receive performance-based feedback. In the non-symbolic comparison task (dot comparison) the two sets of dots remained on the screen for a maximum of two seconds (to prevent counting). Overall area and convex hull for both sets of dots is kept constant following Guillaume et al. (2020). In the symbolic comparison task (Arabic numbers), the numbers remained on the screen until a response was given. Performance on both tasks was indexed by accuracy.

The Number Line Intervention During the intervention sessions, participants estimated the position of 30 Arabic numbers in a 0-100 bounded number line. As a form of feedback, within each item, the participants’ estimate remained visible, and the correct position of the target number appeared on the number line. When the estimate’s PAE was lower than 2.5, a message appeared on the screen that read “Excellent job”, when PAE was between 2.5 and 5 the message read “Well done, so close! and when PAE was higher than 5 the message read “Good try!” Numbers were presented in random order.

Variables in the dataset Age = age in ‘years, months’ at the start of the study Sex = female/male/non-binary or third gender/prefer not to say (as reported by parents) Math_Problem_Solving_raw = Raw score on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016). Math_Problem_Solving_Percentile = Percentile equivalent on the Math Problem Solving subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016). Num_Ops_Raw = Raw score on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016). Math_Problem_Solving_Percentile = Percentile equivalent on the Numerical Operations subtest from the WIAT III (WIAT III A & NZ, Wechsler, 2016).

The remaining variables refer to participants’ performance on the study tasks. Each variable name is composed by three sections. The first one refers to the phase and session. For example, Base1 refers to the first measurement point of the baseline phase, Treat1 to the first measurement point on the treatment phase, and post1 to the first measurement point on the post-treatment phase.

The second part of the variable name refers to the task, as follows: DC = dot comparison SDC = single-digit computation NLE_UT = number line estimation (untrained set) NLE_T= number line estimation (trained set) CE = multidigit computational estimation NC = number comparison The final part of the variable name refers to the type of measure being used (i.e., acc = total correct responses and pae = percent absolute error).

Thus, variable Base2_NC_acc corresponds to accuracy on the number comparison task during the second measurement point of the baseline phase and Treat3_NLE_UT_pae refers to the percent absolute error on the untrained set of the number line task during the third session of the Treatment phase.

Retail Store Sales: Dirty for Data Cleaning

kaggle.com

zip

Updated Jan 18, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Retail Store Sales: Dirty for Data Cleaning [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/retail-store-sales-dirty-for-data-cleaning

Explore at:

zip(226740 bytes)Available download formats

Dataset updated

Jan 18, 2025

Authors

Ahmed Mohamed

License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

Dirty Retail Store Sales Dataset

Overview

The Dirty Retail Store Sales dataset contains 12,575 rows of synthetic data representing sales transactions from a retail store. The dataset includes eight product categories with 25 items per category, each having static prices. It is designed to simulate real-world sales data, including intentional "dirtiness" such as missing or inconsistent values. This dataset is suitable for practicing data cleaning, exploratory data analysis (EDA), and feature engineering.

File Information

File Name: retail_store_sales.csv
Number of Rows: 12,575
Number of Columns: 11

Columns Description

Column Name	Description	Example Values
`Transaction ID`	A unique identifier for each transaction. Always present and unique.	`TXN_1234567`
`Customer ID`	A unique identifier for each customer. 25 unique customers.	`CUST_01`
`Category`	The category of the purchased item.	`Food`, `Furniture`
`Item`	The name of the purchased item. May contain missing values or `None`.	`Item_1_FOOD`, `None`
`Price Per Unit`	The static price of a single unit of the item. May contain missing or `None` values.	`4.00`, `None`
`Quantity`	The quantity of the item purchased. May contain missing or `None` values.	`1`, `None`
`Total Spent`	The total amount spent on the transaction. Calculated as `Quantity * Price Per Unit`.	`8.00`, `None`
`Payment Method`	The method of payment used. May contain missing or invalid values.	`Cash`, `Credit Card`
`Location`	The location where the transaction occurred. May contain missing or invalid values.	`In-store`, `Online`
`Transaction Date`	The date of the transaction. Always present and valid.	`2023-01-15`
`Discount Applied`	Indicates if a discount was applied to the transaction. May contain missing values.	`True`, `False`, `None`

Categories and Items

The dataset includes the following categories, each containing 25 items with corresponding codes, names, and static prices:

Electric Household Essentials

Item Code	Item Name	Price
Item_1_EHE	Blender	5.0
Item_2_EHE	Microwave	6.5
Item_3_EHE	Toaster	8.0
Item_4_EHE	Vacuum Cleaner	9.5
Item_5_EHE	Air Purifier	11.0
Item_6_EHE	Electric Kettle	12.5
Item_7_EHE	Rice Cooker	14.0
Item_8_EHE	Iron	15.5
Item_9_EHE	Ceiling Fan	17.0
Item_10_EHE	Table Fan	18.5
Item_11_EHE	Hair Dryer	20.0
Item_12_EHE	Heater	21.5
Item_13_EHE	Humidifier	23.0
Item_14_EHE	Dehumidifier	24.5
Item_15_EHE	Coffee Maker	26.0
Item_16_EHE	Portable AC	27.5
Item_17_EHE	Electric Stove	29.0
Item_18_EHE	Pressure Cooker	30.5
Item_19_EHE	Induction Cooktop	32.0
Item_20_EHE	Water Dispenser	33.5
Item_21_EHE	Hand Blender	35.0
Item_22_EHE	Mixer Grinder	36.5
Item_23_EHE	Sandwich Maker	38.0
Item_24_EHE	Air Fryer	39.5
Item_25_EHE	Juicer	41.0

Furniture

Item Code	Item Name	Price
Item_1_FUR	Office Chair	5.0
Item_2_FUR	Sofa	6.5
Item_3_FUR	Coffee Table	8.0
Item_4_FUR	Dining Table	9.5
Item_5_FUR	Bookshelf	11.0
Item_6_FUR	Bed F...

Risky Business: Factor Analysis of Survey Data – Assessing the Probability...
plos.figshare.com
txt
Updated May 31, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cees van der Eijk; Jonathan Rose (2023). Risky Business: Factor Analysis of Survey Data – Assessing the Probability of Incorrect Dimensionalisation [Dataset]. http://doi.org/10.1371/journal.pone.0118900
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0118900
Dataset updated
May 31, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Cees van der Eijk; Jonathan Rose
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper undertakes a systematic assessment of the extent to which factor analysis the correct number of latent dimensions (factors) when applied to ordered-categorical survey items (so-called Likert items). We simulate 2400 data sets of uni-dimensional Likert items that vary systematically over a range of conditions such as the underlying population distribution, the number of items, the level of random error, and characteristics of items and item-sets. Each of these datasets is factor analysed in a variety of ways that are frequently used in the extant literature, or that are recommended in current methodological texts. These include exploratory factor retention heuristics such as Kaiser’s criterion, Parallel Analysis and a non-graphical scree test, and (for exploratory and confirmatory analyses) evaluations of model fit. These analyses are conducted on the basis of Pearson and polychoric correlations. We find that, irrespective of the particular mode of analysis, factor analysis applied to ordered-categorical survey data very often leads to over-dimensionalisation. The magnitude of this risk depends on the specific way in which factor analysis is conducted, the number of items, the properties of the set of items, and the underlying population distribution. The paper concludes with a discussion of the consequences of over-dimensionalisation, and a brief mention of alternative modes of analysis that are much less prone to such problems.
E
Dataset for training classifiers of comparative sentences
live.european-language-grid.eu
csv
Updated Apr 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Dataset for training classifiers of comparative sentences [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7607
Explore at:
csvAvailable download formats
Dataset updated
Apr 19, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As there was no large publicly available cross-domain dataset for comparative argument mining, we create one composed of sentences, potentially annotated with BETTER / WORSE markers (the first object is better / worse than the second object) or NONE (the sentence does not contain a comparison of the target objects). The BETTER sentences stand for a pro-argument in favor of the first compared object and WORSE-sentences represent a con-argument and favor the second object. We aim for minimizing dataset domain-specific biases in order to capture the nature of comparison and not the nature of the particular domains, thus decided to control the specificity of domains by the selection of comparison targets. We hypothesized and could confirm in preliminary experiments that comparison targets usually have a common hypernym (i.e., are instances of the same class), which we utilized for selection of the compared objects pairs. The most specific domain we choose, is computer science with comparison targets like programming languages, database products and technology standards such as Bluetooth or Ethernet. Many computer science concepts can be compared objectively (e.g., on transmission speed or suitability for certain applications). The objects for this domain were manually extracted from List of-articles at Wikipedia. In the annotation process, annotators were asked to only label sentences from this domain if they had some basic knowledge in computer science. The second, broader domain is brands. It contains objects of different types (e.g., cars, electronics, and food). As brands are present in everyday life, anyone should be able to label the majority of sentences containing well-known brands such as Coca-Cola or Mercedes. Again, targets for this domain were manually extracted from `List of''-articles at Wikipedia.The third domain is not restricted to any topic: random. For each of 24~randomly selected seed words 10 similar words were collected based on the distributional similarity API of JoBimText (http://www.jobimtext.org). Seed words created using randomlists.com: book, car, carpenter, cellphone, Christmas, coffee, cork, Florida, hamster, hiking, Hoover, Metallica, NBC, Netflix, ninja, pencil, salad, soccer, Starbucks, sword, Tolkien, wine, wood, XBox, Yale.Especially for brands and computer science, the resulting object lists were large (4493 in brands and 1339 in computer science). In a manual inspection, low-frequency and ambiguous objects were removed from all object lists (e.g., RAID (a hardware concept) and Unity (a game engine) are also regularly used nouns). The remaining objects were combined to pairs. For each object type (seed Wikipedia list page or the seed word), all possible combinations were created. These pairs were then used to find sentences containing both objects. The aforementioned approaches to selecting compared objects pairs tend minimize inclusion of the domain specific data, but do not solve the problem fully though. We keep open a question of extending dataset with diverse object pairs including abstract concepts for future work. As for the sentence mining, we used the publicly available index of dependency-parsed sentences from the Common Crawl corpus containing over 14 billion English sentences filtered for duplicates. This index was queried for sentences containing both objects of each pair. For 90% of the pairs, we also added comparative cue words (better, easier, faster, nicer, wiser, cooler, decent, safer, superior, solid, terrific, worse, harder, slower, poorly, uglier, poorer, lousy, nastier, inferior, mediocre) to the query in order to bias the selection towards comparisons but at the same time admit comparisons that do not contain any of the anticipated cues. This was necessary as a random sampling would have resulted in only a very tiny fraction of comparisons. Note that even sentences containing a cue word do not necessarily express a comparison between the desired targets (dog vs. cat: He's the best pet that you can get, better than a dog or cat.). It is thus especially crucial to enable a classifier to learn not to rely on the existence of clue words only (very likely in a random sample of sentences with very few comparisons). For our corpus, we keep pairs with at least 100 retrieved sentences.From all sentences of those pairs, 2500 for each category were randomly sampled as candidates for a crowdsourced annotation that we conducted on figure-eight.com in several small batches. Each sentence was annotated by at least five trusted workers. We ranked annotations by confidence, which is the figure-eight internal measure of combining annotator trust and voting, and discarded annotations with a confidence below 50%. Of all annotated items, 71% received unanimous votes and for over 85% at least 4 out of 5 workers agreed -- rendering the collection procedure aimed at ease of annotation successful.The final dataset contains 7199 sentences with 271 distinct object pairs. The majority of sentences (over 72%) are non-comparative despite biasing the selection with cue words; in 70% of the comparative sentences, the favored target is named first.You can browse though the data here: https://docs.google.com/spreadsheets/d/1U8i6EU9GUKmHdPnfwXEuBxi0h3aiRCLPRC-3c9ROiOE/edit?usp=sharing Full description of the dataset is available in the workshop paper at ACL 2019 conference. Please cite this paper if you use the data: Franzek, Mirco, Alexander Panchenko, and Chris Biemann. ""Categorization of Comparative Sentences for Argument Mining."" arXiv preprint arXiv:1809.06152 (2018).@inproceedings{franzek2018categorization, title={Categorization of Comparative Sentences for Argument Mining}, author={Panchenko, Alexander and Bondarenko, and Franzek, Mirco and Hagen, Matthias and Biemann, Chris}, booktitle={Proceedings of the 6th Workshop on Argument Mining at ACL'2019}, year={2019}, address={Florence, Italy}}
Collections (from American Folklife Center)
zenodo.org
csv
Updated Nov 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Patrick Egan; Patrick Egan (2024). Collections (from American Folklife Center) [Dataset]. http://doi.org/10.5281/zenodo.14140570
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.14140570
Dataset updated
Nov 13, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Patrick Egan; Patrick Egan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
2019
Description
Dataset originally created 03/01/2019 UPDATE: Packaged on 04/18/2019 UPDATE: Edited README on 04/18/2019

I. About this Data Set This data set is a snapshot of work that is ongoing as a collaboration between Kluge Fellow in Digital Studies, Patrick Egan and an intern at the Library of Congress in the American Folklife Center. It contains a combination of metadata from various collections that contain audio recordings of Irish traditional music. The development of this dataset is iterative, and it integrates visualizations that follow the key principles of trust and approachability. The project, entitled, “Connections In Sound” invites you to use and re-use this data.

The text available in the Items dataset is generated from multiple collections of audio material that were discovered at the American Folklife Center. Each instance of a performance was listed and “sets” or medleys of tunes or songs were split into distinct instances in order to allow machines to read each title separately (whilst still noting that they were part of a group of tunes). The work of the intern was then reviewed before publication, and cross-referenced with the tune index at www.irishtune.info. The Items dataset consists of just over 1000 rows, with new data being added daily in a separate file.

The collections dataset contains at least 37 rows of collections that were located by a reference librarian at the American Folklife Center. This search was complemented by searches of the collections by the scholar both on the internet at https://catalog.loc.gov and by using card catalogs.

Updates to these datasets will be announced and published as the project progresses.

II. What’s included? This data set includes:

The Items Dataset – a .CSV containing Media Note, OriginalFormat, On Website, Collection Ref, Missing In Duplication, Collection, Outside Link, Performer, Solo/multiple, Sub-item, type of tune, Tune, Position, Location, State, Date, Notes/Composer, Potential Linked Data, Instrument, Additional Notes, Tune Cleanup. This .CSV is the direct export of the Items Google Spreadsheet

III. How Was It Created? These data were created by a Kluge Fellow in Digital Studies and an intern on this program over the course of three months. By listening, transcribing, reviewing, and tagging audio recordings, these scholars improve access and connect sounds in the American Folklife Collections by focusing on Irish traditional music. Once transcribed and tagged, information in these datasets is reviewed before publication.

IV. Data Set Field Descriptions

IV

a) Collections dataset field descriptions

ItemId – this is the identifier for the collection that was found at the AFC
Viewed – if the collection has been viewed, or accessed in any way by the researchers.
On LOC – whether or not there are audio recordings of this collection available on the Library of Congress website.
On Other Website – if any of the recordings in this collection are available elsewhere on the internet
Original Format – the format that was used during the creation of the recordings that were found within each collection
Search – this indicates the type of search that was performed in order that resulted in locating recordings and collections within the AFC
Collection – the official title for the collection as noted on the Library of Congress website
State – The primary state where recordings from the collection were located
Other States – The secondary states where recordings from the collection were located
Era / Date – The decade or year associated with each collection
Call Number – This is the official reference number that is used to locate the collections, both in the urls used on the Library website, and in the reference search for catalog cards (catalog cards can be searched at this address: https://memory.loc.gov/diglib/ihas/html/afccards/afccards-home.html)
Finding Aid Online? – Whether or not a finding aid is available for this collection on the internet

b) Items dataset field descriptions

id – the specific identification of the instance of a tune, song or dance within the dataset
Media Note – Any information that is included with the original format, such as identification, name of physical item, additional metadata written on the physical item
Original Format – The physical format that was used when recording each specific performance. Note: this field is used in order to calculate the number of physical items that were created in each collection such as 32 wax cylinders.
On Webste? – Whether or not each instance of a performance is available on the Library of Congress website
Collection Ref – The official reference number of the collection
Missing In Duplication – This column marks if parts of some recordings had been made available on other websites, but not all of the recordings were included in duplication (see recordings from Philadelphia Céilí Group on Villanova University website)
Collection – The official title of the collection given by the American Folklife Center
Outside Link – If recordings are available on other websites externally
Performer – The name of the contributor(s)
Solo/multiple – This field is used to calculate the amount of solo performers vs group performers in each collection
Sub-item – In some cases, physical recordings contained extra details, the sub-item column was used to denote these details
Type of item – This column describes each individual item type, as noted by performers and collectors
Item – The item title, as noted by performers and collectors. If an item was not described, it was entered as “unidentified”
Position – The position on the recording (in some cases during playback, audio cassette player counter markers were used)
Location – Local address of the recording
State – The state where the recording was made
Date – The date that the recording was made
Notes/Composer – The stated composer or source of the item recorded
Potential Linked Data – If items may be linked to other recordings or data, this column was used to provide examples of potential relationships between them
Instrument – The instrument(s) that was used during the performance
Additional Notes – Notes about the process of capturing, transcribing and tagging recordings (for researcher and intern collaboration purposes)
Tune Cleanup – This column was used to tidy each item so that it could be read by machines, but also so that spelling mistakes from the Item column could be corrected, and as an aid to preserving iterations of the editing process

V. Rights statement The text in this data set was created by the researcher and intern and can be used in many different ways under creative commons with attribution. All contributions to Connections In Sound are released into the public domain as they are created. Anyone is free to use and re-use this data set in any way they want, provided reference is given to the creators of these datasets.

VI. Creator and Contributor Information

Creator: Connections In Sound

Contributors: Library of Congress Labs

VII. Contact Information Please direct all questions and comments to Patrick Egan via www.twitter.com/drpatrickegan or via his website at www.patrickegan.org. You can also get in touch with the Library of Congress Labs team via LC-Labs@loc.gov.

Data Management Training Clearinghouse Metadata and Collection Statistics...

data-staging.niaid.nih.gov
data.niaid.nih.gov
+1more

Updated Jul 12, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Benedict, Karl; Hoebelheinrich, Nancy (2024). Data Management Training Clearinghouse Metadata and Collection Statistics Report [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7786963

Explore at:

Dataset updated

Jul 12, 2024

Dataset provided by

Knowledge Motifs LLC
University of New Mexico

Authors

Benedict, Karl; Hoebelheinrich, Nancy

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This collection contains a snapshot of the learning resource metadata from ESIP's Data management Training Clearinghouse (DMTC) associated with the closeout (March 30, 2023) of the Institute of Museum and Library Services funded (Award Number: LG-70-18-0092-18) Development of an Enhanced and Expanded Data Management Training Clearinghouse project. The shared metadata are a snapshot associated with the final reporting date for the project, and the associated data report is also based upon the same data snapshot on the same date.

The materials included in the collection consist of the following:

esip-dev-02.edacnm.org.json.zip - a zip archive containing the metadata for 587 published learning resources as of March 30, 2023. These metadata include all publicly available metadata elements for the published learning resources with the exception of the metadata elements containing individual email addresses (submitter and contact) to reduce the exposure of these data.

statistics.pdf - an automatically generated report summarizing information about the collection of materials in the DMTC Clearinghouse, including both published and unpublished learning resources. This report includes the numbers of published and unpublished resources through time; the number of learning resources within subject categories and detailed subject categories, the dates items assigned to each category were first added to the Clearinghouse, and the most recent data that items were added to that category; the distribution of learning resources across target audiences; and the frequency of keywords within the learning resource collection. This report is based on the metadata for published resourced included in this collection, and preliminary metadata for unpublished learning resources that are not included in the shared dataset.

The metadata fields consist of the following:

    Fieldname
    Description




    abstract_data
    A brief synopsis or abstract about the learning resource


    abstract_format
    Declaration for how the abstract description will be represented.


    access_conditions
    Conditions upon which the resource can be accessed beyond cost, e.g., login required.


    access_cost
    Yes or No choice stating whether othere is a fee for access to or use of the resource.


    accessibililty_features_name
    Content features of the resource, such as accessible media, alternatives and supported enhancements for accessibility.


    accessibililty_summary
    A human-readable summary of specific accessibility features or deficiencies.


    author_names
    List of authors for a resource derived from the given/first and family/last names of the personal author fields by the system


    author_org
    - name
    - name_identifier
    - name_identifier_type



    - Name of organization authoring the learning resource.
    - The unique identifier for the organization authoring the resource.
    - The identifier scheme associated with the unique identifier for the organization authoring the resource.

authors - givenName - familyName - name_identifier - name_identifier_type

    - Given or first name of person(s) authoring the resource.
    - Last or family name of person(s) authoring the resource.
    - The unique identifier for the person(s) authoring the resource.
    - The identifier scheme associated with the unique identifier for the person(s) authoring the resource, e.g., ORCID.



    citation
    Preferred Form of Citation.


    completion_time
    Intended Time to Complete

contact - name - org - email

    - Name of person(s) who has/have been asserted as the contact(s) for the resource in case of questions or follow-up by resource user.
    - Name of organization that has/have been asserted as the contact(s) for the resource in case of questions or follow-up by resource user.
    - (excluded) Contact email address.



    contributor_orgs
    - name
    - name_identifier
    - name_identifier_type
    - type
    - Name of organization that is a secondary contributor to the learningresource. A contributor can also be an individual person.
    - The unique identifier for the organization contributing to the resource.
    - The identifier scheme associated with the unique identifier for the organization contributing to the resource.
    - Type of contribution to the resource made by an organization.


    contributors
    - familyName
    - givenName
    - name_identifier
    - name_identifier_type

Last or family name of person(s) contributing to the resource. - Given or first name of person(s) contributing to the resource. - The unique identifier for the person(s) contributing to the resource. - The identifier scheme associated with the unique identifier for the person(s) contributing to the resource, e.g., ORCID.

contributors.type

Type of contribution to the resource made by a person.

    created
    The date on which the metadata record was first saved as part of the input workflow.


    creator
    The name of the person creating the MD record for a resource.


    credential_status
    Declaration of whether a credential is offered for comopletion of the resource.

ed_frameworks - name - description - nodes.name

    - The name of the educational framework to which the resource is aligned, if any. An educational framework is a structured description of educational concepts such as a shared curriculum, syllabus or set of learning objectives, or a vocabulary for describing some other aspect of education such as educational levels or reading ability.
    - A description of one or more subcategories of an educational framework to which a resource is associated.
    - The name of a subcategory of an educational framework to which a resource is associated.


    expertise_level
    The skill level targeted for the topic being taught.


    id
    Unique identifier for the MD record generated by the system in UUID format.


    keywords
    Important phrases or words used to describe the resource.


    language_primary
    Original language in which the learning resource being described is published or made available.


    languages_secondary
    Additional languages in which the resource is tranlated or made available, if any.


    license
    A license for use of that applies to the resource, typically indicated by URL.


    locator_data
    The identifier for the learning resource used as part of a citation, if available.


    locator_type
    Designation of citation locatorr type, e.g., DOI, ARK, Handle.


    lr_outcomes
    Descriptions of what knowledge, skills or abilities students should learn from the resource.


    lr_type
    A characteristic that describes the predominant type or kind of learning resource.


    media_type
    Media type of resource.


    modification_date
    System generated date and time when MD record is modified.


    notes
    MD Record Input Notes


    pub_status
    Status of metadata record within the system, i.e., in-process, in-review, pre-pub-review, deprecate-request, deprecated or published.


    published
    Date of first broadcast / publication.


    publisher
    The organization credited with publishing or broadcasting the resource.


    purpose
    The purpose of the resource in the context of education; e.g., instruction, professional education, assessment.


    rating
    The aggregation of input from all user assessments evaluating users' reaction to the learning resource following Kirkpatrick's model of training evaluation.


    ratings
    Inputs from users assessing each user's reaction to the learning resource following Kirkpatrick's model of training evaluation.


    resource_modification_date
    Date in which the resource has last been modified from the original published or broadcast version.


    status
    System generated publication status of the resource w/in the registry as a yes for published or no for not published.


    subject
    Subject domain(s) toward which the resource is targeted. There may be more than one value for this field.


    submitter_email
    (excluded) Email address of person who submitted the resource.


    submitter_name
    Submission Contact Person


    target_audience
    Audience(s) for which the resource is intended.


    title
    The name of the resource.


    url
    URL that resolves to a downloadable version of the learning resource or to a landing page for the resource that contains important contextual information including the direct resolvable link to the resource, if applicable.


    usage_info
    Descriptive information about using the resource, not addressed by the License information field.


    version
    The specific version of the resource, if declared.

WIC Participant and Program Characteristics 2020
agdatacommons.nal.usda.gov
docx
Updated Nov 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
USDA Food and Nutrition Service, Office of Policy Support (2025). WIC Participant and Program Characteristics 2020 [Dataset]. http://doi.org/10.15482/USDA.ADC/1527885
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.15482/USDA.ADC/1527885
Dataset updated
Nov 21, 2025
Dataset provided by
United States Department of Agriculturehttp://usda.gov/
Food and Nutrition Servicehttps://www.fns.usda.gov/
Authors
USDA Food and Nutrition Service, Office of Policy Support
License
U.S. Government Workshttps://www.usa.gov/government-works
License information was derived automatically
Description
Background: In 1986, the Congress enacted Public Laws 99-500 and 99-591, requiring a biennial report on the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC). In response to these requirements, FNS developed a prototype system that allowed for the routine acquisition of information on WIC participants from WIC State Agencies. Since 1992, State Agencies have provided electronic copies of these data to FNS on a biennial basis.FNS and the National WIC Association (formerly National Association of WIC Directors) agreed on a set of data elements for the transfer of information. In addition, FNS established a minimum standard dataset for reporting participation data. For each biennial reporting cycle, each State Agency is required to submit a participant-level dataset containing standardized information on persons enrolled at local agencies for the reference month of April. The 2020 Participant and Program Characteristics (PC2020) is the 17th to be completed using the prototype PC reporting system. In April 2020, there were 89 State agencies: the 50 States, American Samoa, the District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, the U.S. Virgin Islands, and 33 Indian Tribal Organizations (ITOs).Processing methods and equipment used: Specifications on formats (“Guidance for States Providing Participant Data”) were provided to all State agencies in January 2020. This guide specified 20 minimum dataset (MDS) elements and 11 supplemental dataset (SDS) elements to be reported on each WIC participant. Each State Agency was required to submit all 20 MDS items and any SDS items collected by the State agency. Study date(s) and duration The information for each participant was from the participants’ most current WIC certification as of April 2020.Study spatial scale (size of replicates and spatial scale of study area): In April 2020, there were 89 State agencies: the 50 States, American Samoa, the District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, the U.S. Virgin Islands, and 33 Indian Tribal Organizations (ITOs).Level of true replication: UnknownSampling precision (within-replicate sampling or pseudoreplication):State Agency Data Submissions. PC2020 is a participant dataset consisting of 7,036,867 active records. The records, submitted to USDA by the State Agencies, comprise a census of all WIC enrollees, so there is no sampling involved in the collection of this data.PII Analytic Datasets. State agency files were combined to create a national census participant file of approximately 7 million records. The census dataset contains potentially personally identifiable information (PII) and is therefore not made available to the public.National Sample Dataset. The public use SAS analytic dataset made available to the public has been constructed from a nationally representative sample drawn from the census of WIC participants, selected by participant category. The national sample consists of 1 percent of the total number of participants, or 70,368 records. The distribution by category is 5,469 pregnant women, 6,131 breastfeeding women, 4,373 postpartum women, 16,817 infants, and 37,578 children.Level of subsampling (number and repeat or within-replicate sampling): The proportionate (or self-weighting) sample was drawn by WIC participant category: pregnant women, breastfeeding women, postpartum women, infants, and children. In this type of sample design, each WIC participant has the same probability of selection across all strata. Sampling weights are not needed when the data are analyzed. In a proportionate stratified sample, the largest stratum accounts for the highest percentage of the analytic sample.Study design (before–after, control–impacts, time series, before–after-control–impacts): None – Non-experimentalDescription of any data manipulation, modeling, or statistical analysis undertaken: Each entry in the dataset contains all MDS and SDS information submitted by the State agency on the sampled WIC participant. In addition, the file contains constructed variables used for analytic purposes. To protect individual privacy, the public use file does not include State agency, local agency, or case identification numbers.Description of any gaps in the data or other limiting factors: All State agencies provided data on a census of their WIC participants.Resources in this dataset:Resource Title: WIC PC 2020 National Sample File Public Use Codebook.; File Name: PC2020 National Sample File Public Use Codebook.docx; Resource Description: WIC PC 2020 National Sample File Public Use CodebookResource Title: WIC PC 2020 Public Use CSV Data.; File Name: wicpc2020_public_use.csv; Resource Description: WIC PC 2020 Public Use CSV DataResource Title: WIC PC 2020 Data Set SAS, R, SPSS, Stata.; File Name: PC2020 Ag Data Commons.zipResource; Description: WIC PC 2020 Data Set SAS, R, SPSS, Stata One dataset in multiple formats
F
Native American Children Facial Image Dataset for Facial Recognition
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Native American Children Facial Image Dataset for Facial Recognition [Dataset]. https://www.futurebeeai.com/dataset/image-dataset/facial-images-minor-native-american
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The Native American Children Facial Image Dataset is a thoughtfully curated collection designed to support the development of advanced facial recognition systems, biometric identity verification, age estimation tools, and child-specific AI models. This dataset enables researchers and developers to build highly accurate, inclusive, and ethically sourced AI solutions for real-world applications.
Facial Image Data
The dataset includes over 1000 high-resolution image sets of children under the age of 18. Each participant contributes approximately 15 unique facial images, captured to reflect natural variations in appearance and context.
Diversity and Representation
•
Geographic Coverage: Children from USA, Canada, Mexico and more

•
Age Group: All participants are minors, with a wide age spread across childhood and adolescence.

•
Gender Balance: Includes both boys and girls, representing a balanced gender distribution.

•
File Formats: Images are available in JPEG and HEIC formats.

Quality and Image Conditions
To ensure robust model training and generalizability, images are captured under varied natural conditions:
•
Lighting: A mix of lighting setups, including indoor, outdoor, bright, and low-light scenarios.

•
Backgrounds: Diverse backgrounds—plain, natural, and everyday environments—are included to promote realism.

•
Capture Devices: All photos are taken using modern mobile devices, ensuring high resolution and sharp detail.

Metadata
Each child’s image set is paired with detailed, structured metadata, enabling granular control and filtering during model training:
•Unique Participant ID
•File Name
•Age
•Gender
•Country
•Demographic Attributes
•File Format
This metadata is essential for applications that require demographic awareness, such as region-specific facial recognition or bias mitigation in AI models.
Applications
This dataset is ideal for a wide range of computer vision use cases, including:
•
Facial Recognition: Improving identification accuracy across diverse child demographics.

•
KYC and Identity Verification: Enabling more inclusive onboarding processes for child-specific platforms.

•
Biometric Systems: Supporting child-focused identity verification in education, healthcare, or travel.

•
Age Estimation: Training AI models to estimate age ranges of children from facial features.

•
Child Safety Models: Assisting in missing child identification or online content moderation.

•
Generative AI Training: Creating more representative synthetic data using real-world diverse inputs.

Ethical Collection and Data Security
We maintain the highest ethical and security standards throughout the data lifecycle:
•
Guardian Consent: Every participant’s guardian provided informed, written consent, clearly outlining the dataset’s use cases.

•
Privacy-First Approach: Personally identifiable information is not shared. Only anonymized metadata is included.

•
Secure Storage: All data is
Data from: Composition of Foods Raw, Processed, Prepared USDA National...
agdatacommons.nal.usda.gov
datasetcatalog.nlm.nih.gov
+4more
pdf
Updated Nov 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David B. Haytowitz; Jaspreet K.C. Ahuja; Bethany Showell; Meena Somanchi; Melissa Nickle; Quynh Anh Nguyen; Juhi R. Williams; Janet M. Roseland; Mona Khan; Kristine Y. Patterson; Jacob Exler; Shirley Wasswa-Kintu; Robin Thomas; Pamela R. Pehrsson (2025). Composition of Foods Raw, Processed, Prepared USDA National Nutrient Database for Standard Reference, Release 28 [Dataset]. http://doi.org/10.15482/USDA.ADC/1324304
Explore at:
pdfAvailable download formats
Unique identifier
https://doi.org/10.15482/USDA.ADC/1324304
Dataset updated
Nov 21, 2025
Dataset provided by
Agricultural Research Servicehttps://www.ars.usda.gov/
Authors
David B. Haytowitz; Jaspreet K.C. Ahuja; Bethany Showell; Meena Somanchi; Melissa Nickle; Quynh Anh Nguyen; Juhi R. Williams; Janet M. Roseland; Mona Khan; Kristine Y. Patterson; Jacob Exler; Shirley Wasswa-Kintu; Robin Thomas; Pamela R. Pehrsson
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
[Note: Integrated as part of FoodData Central, April 2019.] The database consists of several sets of data: food descriptions, nutrients, weights and measures, footnotes, and sources of data. The Nutrient Data file contains mean nutrient values per 100 g of the edible portion of food, along with fields to further describe the mean value. Information is provided on household measures for food items. Weights are given for edible material without refuse. Footnotes are provided for a few items where information about food description, weights and measures, or nutrient values could not be accommodated in existing fields. Data have been compiled from published and unpublished sources. Published data sources include the scientific literature. Unpublished data include those obtained from the food industry, other government agencies, and research conducted under contracts initiated by USDA’s Agricultural Research Service (ARS). Updated data have been published electronically on the USDA Nutrient Data Laboratory (NDL) web site since 1992. Standard Reference (SR) 28 includes composition data for all the food groups and nutrients published in the 21 volumes of "Agriculture Handbook 8" (US Department of Agriculture 1976-92), and its four supplements (US Department of Agriculture 1990-93), which superseded the 1963 edition (Watt and Merrill, 1963). SR28 supersedes all previous releases, including the printed versions, in the event of any differences. Attribution for photos: Photo 1: k7246-9 Copyright free, public domain photo by Scott Bauer Photo 2: k8234-2 Copyright free, public domain photo by Scott Bauer Resources in this dataset:Resource Title: READ ME - Documentation and User Guide - Composition of Foods Raw, Processed, Prepared - USDA National Nutrient Database for Standard Reference, Release 28. File Name: sr28_doc.pdfResource Software Recommended: Adobe Acrobat Reader,url: http://www.adobe.com/prodindex/acrobat/readstep.html Resource Title: ASCII (6.0Mb; ISO/IEC 8859-1). File Name: sr28asc.zipResource Description: Delimited file suitable for importing into many programs. The tables are organized in a relational format, and can be used with a relational database management system (RDBMS), which will allow you to form your own queries and generate custom reports.Resource Title: ACCESS (25.2Mb). File Name: sr28db.zipResource Description: This file contains the SR28 data imported into a Microsoft Access (2007 or later) database. It includes relationships between files and a few sample queries and reports.Resource Title: ASCII (Abbreviated; 1.1Mb; ISO/IEC 8859-1). File Name: sr28abbr.zipResource Description: Delimited file suitable for importing into many programs. This file contains data for all food items in SR28, but not all nutrient values--starch, fluoride, betaine, vitamin D2 and D3, added vitamin E, added vitamin B12, alcohol, caffeine, theobromine, phytosterols, individual amino acids, individual fatty acids, or individual sugars are not included. These data are presented per 100 grams, edible portion. Up to two household measures are also provided, allowing the user to calculate the values per household measure, if desired.Resource Title: Excel (Abbreviated; 2.9Mb). File Name: sr28abxl.zipResource Description: For use with Microsoft Excel (2007 or later), but can also be used by many other spreadsheet programs. This file contains data for all food items in SR28, but not all nutrient values--starch, fluoride, betaine, vitamin D2 and D3, added vitamin E, added vitamin B12, alcohol, caffeine, theobromine, phytosterols, individual amino acids, individual fatty acids, or individual sugars are not included. These data are presented per 100 grams, edible portion. Up to two household measures are also provided, allowing the user to calculate the values per household measure, if desired.Resource Software Recommended: Microsoft Excel,url: https://www.microsoft.com/ Resource Title: ASCII (Update Files; 1.1Mb; ISO/IEC 8859-1). File Name: sr28upd.zipResource Description: Update Files - Contains updates for those users who have loaded Release 27 into their own programs and wish to do their own updates. These files contain the updates between SR27 and SR28. Delimited file suitable for import into many programs.
e
Posts Source and Distribution Items HTA/HTA
data.europa.eu
csv, json, pdf, zip
Updated Jan 25, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agence ORE (2024). Posts Source and Distribution Items HTA/HTA [Dataset]. https://data.europa.eu/data/datasets/https-opendata-agenceore-fr-explore-dataset-postes-source-
Explore at:
zip, pdf, json, csvAvailable download formats
Dataset updated
Jan 25, 2024
Dataset authored and provided by
Agence ORE
License
https://www.etalab.gouv.fr/licence-ouverte-open-licencehttps://www.etalab.gouv.fr/licence-ouverte-open-licence
Description
This dataset represents the position of the source stations (HTB/HTA) and distribution stations (HTA/HTA) of the electricity distribution system operators.

Source stations interface between the transmission system and distribution system operators, transforming high voltage (HTB) into medium voltage (HTA). HTA distribution stations make it possible to switch electricity to different lines in medium voltage.

_The data are made available for information only without guarantee as to their degree of reliability. ORE Agency cannot therefore be held liable in the event of a lack of reliability. The use of data is exclusive of compliance with any mandatory procedure and in particular with the procedures provided for in the so-called “DT-DICT” regulations relating to the execution of works in the vicinity of certain underground, aerial or underwater transport or distribution structures (see Article L. 554-1 et seq. and Article R. 554-1 et seq. of the Environmental Code).

_The use of data cannot, moreover, exempt the user from having to solicit distribution system operators for all operations falling within their public service tasks and in particular those aimed at assessing, by carrying out a pre-study or a connection study, the impact on the public electricity distribution network linked to the connection of a possible project.

A question about the dataset? A use case to share with other users? The Forum of open data experts electricity and gas is here for that! _ _

_
I
Data for Post-retraction citation: A review of scholarly research on the...
databank.illinois.edu
Updated Jul 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jodi Schneider; Susmita Das; Jacqueline Léveillé; Randi Proescholdt (2023). Data for Post-retraction citation: A review of scholarly research on the spread of retracted science [Dataset]. http://doi.org/10.13012/B2IDB-3254797_V1
Explore at:
Unique identifier
https://doi.org/10.13012/B2IDB-3254797_V1
Dataset updated
Jul 14, 2023
Authors
Jodi Schneider; Susmita Das; Jacqueline Léveillé; Randi Proescholdt
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset funded by
Alfred P. Sloan Foundation
The David F. Linowes Fellows Program
Description
Data for Post-retraction citation: A review of scholarly research on the spread of retracted science Schneider, Jodi; Das, Susmita; Léveillé, Jacqueline; Proescholdt, Randi Contact: Jodi Schneider jodi@illinois.edu & jschneider@pobox.com ********** OVERVIEW ********** This dataset provides further analysis for an ongoing literature review about post-retraction citation. This ongoing work extends a poster presented as: Jodi Schneider, Jacqueline Léveillé, Randi Proescholdt, Susmita Das, and The RISRS Team. Characterization of Publications on Post-Retraction Citation of Retracted Articles. Presented at the Ninth International Congress on Peer Review and Scientific Publication, September 8-10, 2022 hybrid in Chicago. https://hdl.handle.net/2142/114477 (now also in https://peerreviewcongress.org/abstract/characterization-of-publications-on-post-retraction-citation-of-retracted-articles/ ) Items as of the poster version are listed in the bibliography 92-PRC-items.pdf. Note that following the poster, we made several changes to the dataset (see changes-since-PRC-poster.txt). For both the poster dataset and the current dataset, 5 items have 2 categories (see 5-items-have-2-categories.txt). Articles were selected from the Empirical Retraction Lit bibliography (https://infoqualitylab.org/projects/risrs2020/bibliography/ and https://doi.org/10.5281/zenodo.5498474 ). The current dataset includes 92 items; 91 items were selected from the 386 total items in Empirical Retraction Lit bibliography version v.2.15.0 (July 2021); 1 item was added because it is the final form publication of a grouping of 2 items from the bibliography: Yang (2022) Do retraction practices work effectively? Evidence from citations of psychological retracted articles http://doi.org/10.1177/01655515221097623 Items were classified into 7 topics; 2 of the 7 topics have been analyzed to date. ********************** OVERVIEW OF ANALYSIS ********************** DATA ANALYZED: 2 of the 7 topics have been analyzed to date: field-based case studies (n = 20) author-focused case studies of 1 or several authors with many retracted publications (n = 15) FUTURE DATA TO BE ANALYZED, NOT YET COVERED: 5 of the 7 topics have not yet been analyzed as of this release: database-focused analyses (n = 33) paper-focused case studies of 1 to 125 selected papers (n = 15) studies of retracted publications cited in review literature (n = 8) geographic case studies (n = 4) studies selecting retracted publications by method (n = 2) ************** FILE LISTING ************** ------------------ BIBLIOGRAPHY ------------------ 92-PRC-items.pdf ------------------ TEXT FILES ------------------ README.txt 5-items-have-2-categories.txt changes-since-PRC-poster.txt ------------------ CODEBOOKS ------------------ Codebook for authors.docx Codebook for authors.pdf Codebook for field.docx Codebook for field.pdf Codebook for KEY.docx Codebook for KEY.pdf ------------------ SPREADSHEETS ------------------ field.csv field.xlsx multipleauthors.csv multipleauthors.xlsx multipleauthors-not-named.csv multipleauthors-not-named.xlsx singleauthors.csv singleauthors.xlsx *************************** DESCRIPTION OF FILE TYPES *************************** BIBLIOGRAPHY (92-PRC-items.pdf) presents the items, as of the poster version. This has minor differences from the current data set. Consult changes-since-PRC-poster.txt for details on the differences. TEXT FILES provide notes for additional context. These files end in .txt. CODEBOOKS describe the data we collected. The same data is provided in both Word (.docx) and PDF format. There is one general codebook that is referred to in the other codebooks: Codebook for KEY lists fields assigned (e.g., for a journal or conference). Note that this is distinct from the overall analysis in the Empirical Retraction Lit bibliography of fields analyzed; for that analysis see Proescholdt, Randi (2021): RISRS Retraction Review - Field Variation Data. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-2070560_V1 Other codebooks document specific information we entered on each column of a spreadsheet. SPREADSHEETS present the data collected. The same data is provided in both Excel (.xlsx) and CSV format. Each data row describes a publication or item (e.g., thesis, poster, preprint). For column header explainations, see the associated codebook. ***************************** DETAILS ON THE SPREADSHEETS ***************************** field-based case studies CODEBOOK: Codebook for field --REFERS TO: Codebook for KEY DATA SHEET: field REFERS TO: Codebook for KEY --NUMBER OF DATA ROWS: 20 NOTE: Each data row describes a publication/item. --NUMBER OF PUBLICATION GROUPINGS: 17 --GROUPED PUBLICATIONS: Rubbo (2019) - 2 items, Yang (2022) - 3 items author-focused case studies of 1 or several authors with many retracted publications CODEBOOK: Codebook for authors --REFERS TO: Codebook for KEY DATA SHEET 1: singleauthors (n = 9) --NUMBER OF DATA ROWS: 9 --NUMBER OF PUBLICATION GROUPINGS: 9 DATA SHEET 2: multipleauthors (n = 5 --NUMBER OF DATA ROWS: 5 --NUMBER OF PUBLICATION GROUPINGS: 5 DATA SHEET 3: multipleauthors-not-named (n = 1) --NUMBER OF DATA ROWS: 1 --NUMBER OF PUBLICATION GROUPINGS: 1 ********************************* CRediT http://credit.niso.org ********************************* Susmita Das: Conceptualization, Data curation, Investigation, Methodology Jaqueline Léveillé: Data curation, Investigation Randi Proescholdt: Conceptualization, Data curation, Investigation, Methodology Jodi Schneider: Conceptualization, Data curation, Funding acquisition, Investigation, Methodology, Project administration, Supervision
TODO: name of the dataset
openneuro.org
Updated Sep 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TODO:; First1 Last1; First2 Last2; ... (2024). TODO: name of the dataset [Dataset]. http://doi.org/10.18112/openneuro.ds005295.v1.0.1
Explore at:
Unique identifier
https://doi.org/10.18112/openneuro.ds005295.v1.0.1
Dataset updated
Sep 25, 2024
Dataset provided by
OpenNeurohttps://openneuro.org/
Authors
TODO:; First1 Last1; First2 Last2; ...
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Data for Reading reshapes stimulus selectivity in the visual word form area

This contains the raw and pre-processed fMRI data and structural images (T1) used in the article, "Reading reshapes stimulus selectivity in the visual word form area. The preprint is available here, and the article will be in press at eNeuro.

Additional processed data and analysis code are available in an OSF repository.

Details about the study are included here.

Participants

We recruited 17 participants (Age range 19 to 38, 21.12 ± 4.44, 4 self-identified as male, 1 left-handed) from the Barnard College and Columbia University student body. The study was approved by the Internal Review Board at Barnard College, Columbia University. All participants provided written informed consent, acquired digitally, and were monetarily compensated for their participation. All participants had learned English before the age of five.

To ensure high data quality, we used the following criteria for excluding functional runs and participants. If the participant moved by a distance greater than 2 voxels (4 mm) within a single run, that run was excluded from analysis. Additionally, if the participant responded in less than 50% of the trials in the main experiment, that run was removed. Finally, if half or more of the runs met any of these criteria for a single participant, that participant was dropped from the dataset. Using these constraints, the analysis reported here is based on data from 16 participants. They ranged in age from 19 to 38 years (mean = 21.12 ± 4.58,). 4 participants self-identified as male, and 1 was left-handed. A total of 6 runs were removed from three of the remaining participants due to excessive head motion.

Equipment

We collected MRI data at the Zuckerman Institute, Columbia University, a 3T Siemens Prisma scanner and a 64-channel head coil. In each MR session, we acquired a T1 weighted structural scan, with voxels measuring 1 mm isometrically. We acquired functional data with a T2* echo planar imaging sequences with multiband echo sequencing (SMS3) for whole brain coverage. The TR was 1.5s, TE was 30 ms and the flip angle was 62°. The voxel size was 2 mm isotropic.

Stimuli were presented on an LCD screen that the participants viewed through a mirror with a viewing distance of 142 cm. The display had a resolution of 1920 by 1080 pixels, and a refresh rate of 60 Hz. We presented the stimuli using custom code written in MATLAB and the Psychophysics Toolbox (Brainard, 1997; Pelli, 1997). Throughout the scan, we recorded monocular gaze position using an SR Research Eyelink 1000 tracker. Participants responded with their right hand via three buttons on an MR-safe response pad.

Tasks

Main Task

Participants performed three different tasks during different runs, two of which required attending to the character strings, and one that encouraged participants to ignore them. In the lexical decision task, participants reported whether the character string on each trial was a real word or not. In the stimulus color task, participants reported whether the color of the character string was red or gray. In the fixation color task, participants reported whether or not the fixation dot turned red.

On each trial, a single character string flashed for 150 ms at one of three locations: centered at fixation, 3 dva left, or 3 dva right). The stimulus was followed by a blank with only the fixation mark present for 3850 ms, during which the participant had the opportunity to respond with a button press. After every five trials, there was a rest period (no task except to fixation on the dot). The duration of the rest period was either 4, 6 or 8 s in duration (randomly and uniformly selected).

Localizer for visual category-selective ventral temporal regions

Participants viewed sequences of images, each of which contained 3 items of one category: words, pseudowords, false fonts, faces, and limbs. Participants performed a one-back repetition detection task. On 33% of the trials, the exact same images flashed twice in a row. The participant’s task was to push a button with their right index finger whenever they detected such a repetition. Each participant performed 4 runs of the localizer task. Each run consisted of 77 four-second trials, lasting for approximately 6 minutes. Each category was presented 56 times through the course of the experiment.

Localizer for language processing regions

The stimuli on each trial were a sequence of 12 written words or pronounceable pseudowords, presented one at a time. The words were presented as meaningful sentences, while pseudowords formed “Jabberwocky” phrases that served as a control condition. Participants were instructed to read the stimuli silently to themselves, and also to push a button upon seeing the icon of a hand that appeared between trials. Participants performed three runs of the language localizer. Each run included 16 trials and lasted for 6 minutes. Each trial lasted for 6s, beginning with a blank screen for 100ms, followed by the presentation of 12 words or pseudowords for 450ms each (5400s total), followed by a response prompt for 400ms and a final blank screen for 100ms. Each run also included 5 blank trials (6 seconds each).

Data organization

This repository contains three main folders, complying with BIDS specifications. - Inputs contain BIDS compliant raw data, with the only change being defacing the anatomicals using pydeface. Data was converted to BIDS format using heudiconv.
- Outputs contain preprocessed data obtained using fMRIPrep. In addition to subject specific folders, we also provide the freesurfer reconstructions obtained using fMRIPrep, with defaced anatomicals. Subject specific ROIs are also included in the label folder for each subject in the freesurfer directory. - Derivatives contain all additional whole brain analyses performed on this dataset.
F
East Asian Children Facial Image Dataset for Facial Recognition
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). East Asian Children Facial Image Dataset for Facial Recognition [Dataset]. https://www.futurebeeai.com/dataset/image-dataset/facial-images-minor-east-asian
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
East Asia
Dataset funded by
FutureBeeAI
Description
Introduction
The East Asian Children Facial Image Dataset is a thoughtfully curated collection designed to support the development of advanced facial recognition systems, biometric identity verification, age estimation tools, and child-specific AI models. This dataset enables researchers and developers to build highly accurate, inclusive, and ethically sourced AI solutions for real-world applications.
Facial Image Data
The dataset includes over 1500 high-resolution image sets of children under the age of 18. Each participant contributes approximately 15 unique facial images, captured to reflect natural variations in appearance and context.
Diversity and Representation
•
Geographic Coverage: Children from China, Japan, Philippines, Malaysia, Singapore, Thailand, Vietnam, Indonesia, and more

•
Age Group: All participants are minors, with a wide age spread across childhood and adolescence.

•
Gender Balance: Includes both boys and girls, representing a balanced gender distribution.

•
File Formats: Images are available in JPEG and HEIC formats.

Quality and Image Conditions
To ensure robust model training and generalizability, images are captured under varied natural conditions:
•
Lighting: A mix of lighting setups, including indoor, outdoor, bright, and low-light scenarios.

•
Backgrounds: Diverse backgrounds—plain, natural, and everyday environments—are included to promote realism.

•
Capture Devices: All photos are taken using modern mobile devices, ensuring high resolution and sharp detail.

Metadata
Each child’s image set is paired with detailed, structured metadata, enabling granular control and filtering during model training:
•Unique Participant ID
•File Name
•Age
•Gender
•Country
•Demographic Attributes
•File Format
This metadata is essential for applications that require demographic awareness, such as region-specific facial recognition or bias mitigation in AI models.
Applications
This dataset is ideal for a wide range of computer vision use cases, including:
•
Facial Recognition: Improving identification accuracy across diverse child demographics.

•
KYC and Identity Verification: Enabling more inclusive onboarding processes for child-specific platforms.

•
Biometric Systems: Supporting child-focused identity verification in education, healthcare, or travel.

•
Age Estimation: Training AI models to estimate age ranges of children from facial features.

•
Child Safety Models: Assisting in missing child identification or online content moderation.

•
Generative AI Training: Creating more representative synthetic data using real-world diverse inputs.

Ethical Collection and Data Security
We maintain the highest ethical and security standards throughout the data lifecycle:
•
Guardian Consent: Every participant’s guardian provided informed, written consent, clearly outlining the dataset’s use cases.

•
Privacy-First Approach: Personally identifiable information is not shared. Only anonymized metadata is included.

•
Secure Storage:
d
Panel Conditioning in Sensitive Items - Dataset - B2FIND
demo-b2find.dkrz.de
Updated Sep 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2025). Panel Conditioning in Sensitive Items - Dataset - B2FIND [Dataset]. http://demo-b2find.dkrz.de/dataset/0df58d57-0f86-5108-8ad0-7e8692ea3879
Explore at:
Dataset updated
Sep 21, 2025
Description
Standardized CAMA dataset based on: Burgard, T., Wedderhoff, N., & Bosnjak, M. (2020). Konditionierungseffekte in Panel-Untersuchungen: Systematische Übersichtsarbeit und Meta-Analyse am Beispiel sensitiver Fragen. Psychologische Rundschau, 71, 89-95. https://doi.org/10.1026/0033-3042/a000479. Panel data are indispensable for investigating causal relationships and answering longitudinal questions. However, it is controversial how the repeated survey of panel participants affects the quality of panel data. The expected learning effect of repeated participation is called panel conditioning and can have both positive and negative consequences for the validity of panel data. Sensitive items in particular are expected to have an impact on the social desirability of the information provided. The available evidence on conditioning effects for sensitive questions suggests different effects depending on the type of question and has so far only been processed in the form of narrative reviews. In the present meta-analysis, conditioning effects are examined on the basis of the available experimental evidence (154 effect strengths from 19 reports), depending on the type of question, as well as the frequency and intervals between surveys (dosage effects). Standardized mean differences between experienced and fresh participants are analyzed by multi-level meta-regressions. The effects of previous surveys on the response behaviour in subsequent waves are only minor. At present, it can therefore be assumed that the quality of panel data is not influenced to a relevant extent by conditioning effects. Limits of the present meta-analysis and relevant research gaps are discussed.
F
East Asian Multi-Year Facial Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). East Asian Multi-Year Facial Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/image-dataset/facial-images-historical-east-asian
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the East Asian Multi-Year Facial Image Dataset, thoughtfully curated to support the development of advanced facial recognition systems, biometric identification models, KYC verification tools, and other computer vision applications. This dataset is ideal for training AI models to recognize individuals over time, track facial changes, and enhance age progression capabilities.
Facial Image Data
This dataset includes over 10,000+ high-quality facial images, organized into individual participant sets, each containing:
•
Historical Images: 22 facial images per participant captured across a span of 10 years

•
Enrollment Image: One recent high-resolution facial image for reference or ground truth

Diversity & Representation
•
Geographic Coverage: Participants from China, Japan, Philippines, Malaysia, Singapore, Thailand, Vietnam, Indonesia, and more and other East Asian regions

•
Demographics: Individuals aged 18 to 70 years, with a gender distribution of 60% male and 40% female

•
File Formats: All images are available in JPEG and HEIC formats

Image Quality & Capture Conditions
To ensure model generalization and practical usability, images in this dataset reflect real-world diversity:
•
Lighting Conditions: Images captured under various natural and artificial lighting setups

•
Backgrounds: A wide range of indoor and outdoor backgrounds

•
Device Quality: Captured using modern, high-resolution mobile devices for consistency and clarity

Metadata
Each participant’s dataset is accompanied by rich metadata to support advanced model training and analysis, including:
•Unique participant ID
•File name
•Age at the time of image capture
•Gender
•Country of origin
•Demographic profile
•File format
Use Cases & Applications
This dataset is highly valuable for a wide range of AI and computer vision applications:
•
Facial Recognition Systems: Train models for high-accuracy face matching across time

•
KYC & Identity Verification: Improve time-spanning verification for banks, insurance, and government services

•
Biometric Security Solutions: Build reliable identity authentication models

•
Age Progression & Estimation Models: Train AI to predict aging patterns or estimate age from facial features

•
Generative AI: Support creation and validation of synthetic age progression or longitudinal face generation

Secure & Ethical Collection
•
Platform: All data was securely collected and processed through FutureBeeAI’s proprietary systems

•
Ethical Compliance: Full participant consent obtained with transparent communication of use cases

•
Privacy-Protected: No personally identifiable information is included; all data is anonymized and handled with care

Dataset Updates & Customization
To keep pace with evolving AI needs, this dataset is regularly updated and customizable. Custom data collection options include:
<div style="margin-top:10px;
F
African Occluded Facial Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). African Occluded Facial Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/image-dataset/facial-images-occlusion-african
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the African Human Face with Occlusion Dataset, carefully curated to support the development of robust facial recognition systems, occlusion detection models, biometric identification technologies, and KYC verification tools. This dataset provides real-world variability by including facial images with common occlusions, helping AI models perform reliably under challenging conditions.
Facial Image Data
The dataset comprises over 5,000 high-quality facial images, organized into participant-wise sets. Each set includes:
•
Occluded Images: 5 images per individual featuring different types of facial occlusions, masks, caps, sunglasses, or combinations of these accessories

•
Normal Image: 1 reference image of the same individual without any occlusion

Diversity & Representation
•
Geographic Coverage: Participants from across Kenya, Malawi, Nigeria, Ethiopia, Benin, Somalia, Uganda, and more African countries

•
Demographics: Individuals aged 18 to 70 years, with a 60:40 male-to-female ratio

•
File Formats: Images available in JPEG and HEIC formats

Image Quality & Capture Conditions
To ensure robustness and real-world utility, images were captured under diverse conditions:
•
Lighting Variations: Includes both natural and artificial lighting scenarios

•
Background Diversity: Indoor and outdoor backgrounds for model generalization

•
Device Quality: Captured using the latest smartphones to ensure high resolution and consistency

Metadata
Each image is paired with detailed metadata to enable advanced filtering, model tuning, and analysis:
•Unique Participant ID
•File Name
•Age
•Gender
•Country
•Demographic Profile
•Type of Occlusion
•File Format
This rich metadata helps train models that can recognize faces even when partially obscured.
Use Cases & Applications
This dataset is ideal for a wide range of real-world and research-focused applications, including:
•
Facial Recognition under Occlusion: Improve model performance when faces are partially hidden

•
Occlusion Detection: Train systems to detect and classify facial accessories like masks or sunglasses

•
Biometric Identity Systems: Enhance verification accuracy across varying conditions

•
KYC & Compliance: Support face matching even when the selfie includes common occlusions.

•
Security & Surveillance: Strengthen access control and monitoring systems in environments with mask usage

Secure & Ethical Collection
•
Data Security: Collected and processed securely on FutureBeeAI’s proprietary platform

•
Ethical Compliance: Follows strict guidelines for participant privacy and informed consent

•
Transparent Participation: All contributors provided written consent and were informed of the intended use

Dataset
Mapping of Goods and Services Identification Number to United Nations...
open.canada.ca
datasets.ai
csv, html, xml
Updated Jan 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Public Services and Procurement Canada (2025). Mapping of Goods and Services Identification Number to United Nations Standard Products and Services Code [Dataset]. https://open.canada.ca/data/en/dataset/588eab5b-7b16-4a26-b996-23b955965ffa
Explore at:
xml, csv, htmlAvailable download formats
Dataset updated
Jan 21, 2025
Dataset provided by
Public Services and Procurement Canadahttp://www.pwgsc.gc.ca/
License
Open Government Licence - Canada 2.0https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description
In 2021, an international goods and services classification for procurement called the United Nations Standard Products and Services Code (UNSPSC, v21) was implemented to replace the Government of Canada’s Goods and Services Identification Numbers (GSIN) codes for categorizing procurement activities undertaken by the Government of Canada. For the transition from GSIN to UNSPSC, a subset of the entire version 21 UNSPSC list was created. The Mapping of GSIN-UNSPSC file below provides a suggested linkage between the subset of UNSPSC and higher levels of the GSIN code list. As procurement needs evolve, this file may be updated to include other UNSPSC v21 codes that are deemed to be required. In the interim, if the lowest level values within the UNSPSC structure does not relate to a specific category of goods or services, the use of the higher (related) level code from within the UNSPSC structure is appropriate. --- >Please note: This dataset is offered as a means to assist the user in finding specific UNSPSC codes, based on high-level comparisons to the legacy GSIN codes. It should not be considered a direct one-to-one mapping of these two categorization systems. For some categories, the linkages were only assessed at higher levels of the two structures (and then simply carried through indiscriminately to the related lower categories beneath those values). But given that the two systems do not necessarily group items in the same way throughout their structures, this could result in confusing connections in some cases. Please always select the UNSPSC code that best describes the applicable goods or services, even if the associated GSIN value as shown in this file is not directly relevant. --- The data is available in Comma Separated Values (CSV) file format and can be downloaded to sort, filter, and search information. The United Nations Standard Products and Services Code (UNSPSC) page on CanadaBuys offers a comprehensive guide on how to use this reference file. The Finding and using UNSPSC Codes page from CanadaBuys also contains additional information which may be of use. This dataset was originally published on June 22, 2016. The format and contents of the CSV file were revised on May 12, 2021. A copy of the original file was archived as a secondary resource to this dataset at that time (labelled ARCHIVED - Mapping of GSIN-UNSPSC in the resource list below). --- As of March 23, 2023, the data dictionary linked below includes entries for both the current and archived versions of the datafile, as well as for the datafiles of Goods and Services Identification Number (GSIN) dataset and the archived United Nations Standard Products and Services Codes (v10, released 2007) dataset.
CLiP Vanuatu Marine Litter Abundance and Composition
cefas.co.uk
environment.data.gov.uk
+1more
Updated 2019
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Centre for Environment, Fisheries and Aquaculture Science (2019). CLiP Vanuatu Marine Litter Abundance and Composition [Dataset]. http://doi.org/10.14466/CefasDataHub.64
Explore at:
Unique identifier
https://doi.org/10.14466/CefasDataHub.64
Dataset updated
2019
Dataset authored and provided by
Centre for Environment, Fisheries and Aquaculture Science
License
Open Government Licence 3.0http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Time period covered
Nov 1, 2018 - Feb 28, 2019
Area covered
Vanuatu
Description
This dataset contains 3 csv files. One file contains data on marine litter abundance and composition expressed in number of items per category. One file contains location names and GPS coordinates in decimal degrees. One file contains metadata collected during the clean-ups. A READ-ME text file contains description of columns content for the three csv files. Marine litter items were collected during beach clean-ups in the Efate Island (Vanuatu) between November 2018 and February 2019. Category list was obtain merging OSPAR and Tangaroa Blue protocols with the addition of region-specific items. The dataset includes metadata collected during the clean-ups. The United Nations Environment Programme (UNEP) defines marine litter as 'any persistent, manufactured or processed solid material discarded, disposed of or abandoned in the marine and coastal environment'. The data were collected on seven sandy beaches in Efate Island in Vanuatu between November 2018 and February 2019. Marine litter items were removed from a 100 m linear transects (if not otherwise stated in the comments/metadata) between the high tide line and the back of the beach. The latter was identified by change in topography and/or vegetation. The items are divided in 11 material-categories (Plastic, Rubber, Textile, Paper, Wood, Metal, Glass, Ceramic, Sanitary, Medical, Other) and 168 sub categories (specific items). The list of categories was obtained by merging protocols from OSPAR and Tangaroa Blue and was expanded with country-specific items found more frequently during initial clean-ups (the number of items for these latter categories had been annotated even before their official inclusion in the list). The country-specific categories are marked by '*' after their name. Metadata were collected between the high tide line and the back of the beach as per OSPAR protocol. All data were collected by or under the supervision of Cefas personnel, but some trainees helped during some of the activities. GPS positions at the start and end of the transects are reported for each activity. Clean-ups undertaken on the same beach targeted the same transect with a higher accuracy than GPS instrumentation, so the GPS coordinates are the same. All items were removed from the beach and properly disposed of and recycled (where possible).
d
Louisville Metro KY - Expenditures Data For Fiscal Year 2017
catalog.data.gov
data.lojic.org
+4more
Updated Jul 30, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Louisville/Jefferson County Information Consortium (2025). Louisville Metro KY - Expenditures Data For Fiscal Year 2017 [Dataset]. https://catalog.data.gov/dataset/louisville-metro-ky-expenditures-data-for-fiscal-year-2017
Explore at:
Dataset updated
Jul 30, 2025
Dataset provided by
Louisville/Jefferson County Information Consortium
Area covered
Kentucky, Louisville
Description
This dataset includes all expenditures by agency, with additional data including funding source, vendor name, and amount invoiced. The dates for this dataset run from Fiscal Year 2008 to the present. A Fiscal Year starts July 1 and runs through June 30 of the given fiscal year. For example, Fiscal Year 2012 would include July 1, 2011 through June 30, 2012.Data Dictionary:Fiscal_Year - The fiscal year represents the timeframe in which goods are received or services are performed. Note: A fiscal year starts July 1 and runs through June 30 of the given year. For example, Fiscal Year 2012 would include July 1, 2011 through June 30, 2012.Budget_Type - Whether the expenditures were due to capital or operational expenses.Agency_Name - The agency group for which the expenditures belongs.Sub_Agency_Name - The sub group of the agency group for which the expenditures belongs.DepartmentName - The department group for which the expenditures belongs.Sub_DepartmentName - The sub group of the department group for which the expenditures belongs.Category - A breakdown of what the expenditures were for.Sub_Category - A further more detailed breakdown of what the expenditures were for.Stimulus_Type -Funding_Source - The group which provided the funding.Vendor_Name - The name of the vendor being paid.InvoiceID - The invoice ID from the vendor.InvoiceDt - The date of service as provided on the invoice from the vendor.InvoiceAmt - The amount invoiced for payment by the vendor.DistributionAmt - A breakdown of the accounting strings used to pay an invoice. Note: A single invoice amount could be split between more than one distribution.CheckID - The ID number for the check distributed to the vendor which paid all or part of the invoice.CheckDt - The date the check was issued to the vendor.CheckAmt - The amount issued on the check to the vendor. Note: Sometimes a single check can be issued for multiple invoices. The check amount may then be larger than the invoice amount but the invoice amounts from all included invoices will add up to the total check amount.CheckVoidDt - If applicable, the date a check was voided. Note: 1/1/1900 is the default value. If this is the value the check has not been voided.

Facebook

Twitter

Click to copy link

Link copied

Cite

Ahmed Mohamed (2025). Spending Habits by Category and Item [Dataset]. https://www.kaggle.com/datasets/ahmedmohamed2003/spending-habits

Spending Habits by Category and Item

Spending Habits with Category and Item set

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jan 16, 2025

Dataset provided by

Kagglehttp://kaggle.com/

Authors

Ahmed Mohamed

License

http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

Description

Spending Patterns Dataset

Dataset Overview

Dataset Features

Columns

Column Name	Description
`Customer ID`	Unique identifier for each customer (e.g., `CUST_0001`).
`Category`	The spending category (e.g., Groceries, Shopping, Travel).
`Item`	The specific item purchased within the category (e.g., Milk, Plane Ticket).
`Quantity`	Number of units purchased. For specific categories (e.g., Subscriptions, Housing and Utilities, Transportation, Medical/Dental, Travel), this is always `1`.
`Price Per Unit`	The price of one unit of the item (in USD).
`Total Spent`	Total expenditure for the transaction (`Quantity` × `Price Per Unit`).
`Payment Method`	The payment method used (e.g., Credit Card, Cash).
`Location`	Where the transaction occurred (e.g., Online, In-store, Mobile App).
`Transaction Date`	The date of the transaction (YYYY-MM-DD format).

Categories and Items

The dataset includes the following spending categories with example items:

Groceries: Milk, Bread, Fruits, Vegetables, Meat, etc.
Shopping: Clothes, Shoes, Electronics, Car.
Subscriptions: Streaming Service, Gym Membership (Quantity always 1).
Housing and Utilities: Rent, Electricity Bill, Gas Bill (Quantity always 1).
Transportation: Gas, Public Transit, Car Repair (Quantity always 1).
Food: Restaurant Meal, Fast Food, Coffee.
Medical/Dental: Doctor Visit, Dentist Visit, Medicine (Quantity always 1).
Personal Hygiene: Toothpaste, Shampoo, Soap.
Fitness: Yoga Class, Personal Trainer, Workout Equipment.
Travel: Plane Ticket, Hotel Stay, Taxi/Uber (Plane Ticket and Hotel Stay have Quantity always 1).
Hobbies: Books, Art Supplies, Video Games.
Friend Activities: Movie Tickets, Concert Tickets, Dinner with Friends.
Gifts: Flowers, Gift Cards, Jewelry.

Usage Examples

Example 1: Total Spending by Category

Example 2: Spending Habits of Each Customer

Example 13 Price Change Over Time

Clear search

Close search

Google apps

Main menu

Spending Habits by Category and Item

Spending Patterns Dataset

Dataset Overview

Dataset Features

Columns

Categories and Items

Usage Examples

Example 1: Total Spending by Category

Example 2: Spending Habits of Each Customer

Example 13 Price Change Over Time

Dataset for The effects of a number line intervention on calculation skills

Retail Store Sales: Dirty for Data Cleaning

Dirty Retail Store Sales Dataset

Overview

File Information

Columns Description

Categories and Items

Electric Household Essentials

Furniture

Risky Business: Factor Analysis of Survey Data – Assessing the Probability...

Dataset for training classifiers of comparative sentences

Collections (from American Folklife Center)

Data Management Training Clearinghouse Metadata and Collection Statistics...

WIC Participant and Program Characteristics 2020

Native American Children Facial Image Dataset for Facial Recognition

Introduction

Facial Image Data

Diversity and Representation

Quality and Image Conditions

Metadata

Applications

Ethical Collection and Data Security

Data from: Composition of Foods Raw, Processed, Prepared USDA National...

Posts Source and Distribution Items HTA/HTA

Data for Post-retraction citation: A review of scholarly research on the...

TODO: name of the dataset

Data for Reading reshapes stimulus selectivity in the visual word form area

Participants

Equipment

Tasks

Main Task

Localizer for visual category-selective ventral temporal regions

Localizer for language processing regions

Data organization

East Asian Children Facial Image Dataset for Facial Recognition

Introduction

Facial Image Data

Diversity and Representation

Quality and Image Conditions

Metadata

Applications

Ethical Collection and Data Security

Panel Conditioning in Sensitive Items - Dataset - B2FIND

East Asian Multi-Year Facial Image Dataset

Introduction

Facial Image Data

Diversity & Representation

Image Quality & Capture Conditions

Metadata

Use Cases & Applications

Secure & Ethical Collection

Dataset Updates & Customization

African Occluded Facial Image Dataset

Introduction

Facial Image Data

Diversity & Representation

Image Quality & Capture Conditions

Metadata

Use Cases & Applications

Secure & Ethical Collection

Dataset

Mapping of Goods and Services Identification Number to United Nations...

CLiP Vanuatu Marine Litter Abundance and Composition

Louisville Metro KY - Expenditures Data For Fiscal Year 2017

Spending Habits by Category and Item

Spending Habits with Category and Item set

Spending Patterns Dataset

Dataset Overview

Dataset Features

Columns