Creating a robust employee dataset for data analysis and visualization involves several key fields that capture different aspects of an employee's information. Here's a list of fields you might consider including:

- Employee ID: A unique identifier for each employee.
- Name: First and last name of the employee.
- Gender: Male, female, non-binary, etc.
- Date of Birth: Birthdate of the employee.
- Email Address: Contact email of the employee.
- Phone Number: Contact number of the employee.
- Address: Home or work address of the employee.
- Department: The department the employee belongs to (e.g., HR, Marketing, Engineering).
- Job Title: The specific job title of the employee.
- Manager ID: ID of the employee's manager.
- Hire Date: Date when the employee was hired.
- Salary: Employee's salary or compensation.
- Employment Status: Full-time, part-time, contractor, etc.
- Employee Type: Regular, temporary, contract, etc.
- Education Level: Highest level of education attained by the employee.
- Certifications: Any relevant certifications the employee holds.
- Skills: Specific skills or expertise possessed by the employee.
- Performance Ratings: Ratings or evaluations of employee performance.
- Work Experience: Previous work experience of the employee.
- Benefits Enrollment: Benefits chosen by the employee (e.g., healthcare plan, retirement plan).
- Work Location: Physical location where the employee works.
- Work Hours: Regular working hours or shifts of the employee.
- Employee Status: Active, on leave, terminated, etc.
- Emergency Contact: Contact information of the employee's emergency contact person.
- Employee Satisfaction Survey Responses: Data from employee satisfaction surveys, if applicable.

A minimal generation sketch follows the code URL below.
Code Url: https://github.com/intellisenseCodez/faker-data-generator
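Below is a minimal sketch of generating such a table with the Python Faker library. The field subset, value pools, and ranges are illustrative assumptions, not the linked repository's exact schema.

```python
# Sketch: synthetic employee table with Faker (field choices are illustrative).
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_rows = 1000

rows = [{
    "Employee_ID": fake.uuid4(),
    "First_Name": fake.first_name(),
    "Last_Name": fake.last_name(),
    "Gender": random.choice(["Male", "Female", "Non-binary"]),
    "Date_of_Birth": fake.date_of_birth(minimum_age=21, maximum_age=65),
    "Email": fake.email(),
    "Phone": fake.phone_number(),
    "Department": random.choice(["HR", "Marketing", "Engineering", "Finance"]),
    "Job_Title": fake.job(),
    "Hire_Date": fake.date_between(start_date="-10y", end_date="today"),
    "Salary": round(random.uniform(30_000, 150_000), 2),
    "Employment_Status": random.choice(["Full-time", "Part-time", "Contractor"]),
} for _ in range(num_rows)]

pd.DataFrame(rows).to_csv("employee_dataset.csv", index=False)
```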
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
* This field indicates a dummy variable was also included. If a data point for the row variable was a 0, the dummy took on a value of 1; otherwise the dummy was 0. Row variables with blank entries did not exhibit over-dispersion of zeros and so did not require dummy variables.
† Variable was log transformed to better meet generalized linear model assumptions.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Finance dataset with fake information such as transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. It can be used for educational purposes as well as for testing.
This script generates a dataset of fake transactions with fields such as transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. Adjust the num_rows variable to specify the number of rows you want in your dataset. Finally, the dataset is saved to a CSV file named finance_dataset.csv. You can modify the fields or add additional fields according to your requirements.
```python
# Generate a synthetic finance-transaction dataset with Faker.
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_rows = 15000

data = {
    'Transaction_ID': [fake.uuid4() for _ in range(num_rows)],
    'Date': [fake.date_time_this_year() for _ in range(num_rows)],
    'Amount': [round(random.uniform(10, 10000), 2) for _ in range(num_rows)],
    'Currency': [fake.currency_code() for _ in range(num_rows)],
    'Description': [fake.bs() for _ in range(num_rows)],
    'Category': [random.choice(['Food', 'Transport', 'Shopping', 'Entertainment', 'Utilities']) for _ in range(num_rows)],
    'Merchant': [fake.company() for _ in range(num_rows)],
    'Customer': [fake.name() for _ in range(num_rows)],
    'City': [fake.city() for _ in range(num_rows)],
    'Country': [fake.country() for _ in range(num_rows)],
}

df = pd.DataFrame(data)
df.to_csv('finance_dataset.csv', index=False)
df.head()
```
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is synthetically generated fake data designed to simulate a realistic e-commerce environment.
Its purpose is to provide large-scale relational datasets for practicing database operations, analytics, and testing tools like DuckDB, Pandas, and SQL engines. It is ideal for benchmarking, educational projects, and data engineering experiments.
The dataset comprises five related tables, with fields as follows:

Customers:
- (int) Unique identifier for each customer
- (string) Customer full name
- (string) Customer email address
- (string) Customer gender ('Male', 'Female', 'Other')
- (date) Date customer signed up
- (string) Customer country of residence

Products:
- (int) Unique identifier for each product
- (string) Name of the product
- (string) Product category (e.g., Electronics, Books)
- (float) Price per unit
- (int) Available stock count
- (string) Product brand name

Orders:
- (int) Unique identifier for each order
- (int) ID of the customer who placed the order (foreign key to Customers)
- (date) Date when the order was placed
- (float) Total amount for the order
- (string) Payment method used (Credit Card, PayPal, etc.)
- (string) Country where the order is shipped

Order Items:
- (int) Unique identifier for each order item
- (int) ID of the order this item belongs to (foreign key to Orders)
- (int) ID of the product ordered (foreign key to Products)
- (int) Number of units ordered
- (float) Price per unit at order time

Reviews:
- (int) Unique identifier for each review
- (int) ID of the reviewed product (foreign key to Products)
- (int) ID of the customer who wrote the review (foreign key to Customers)
- (int) Rating score (1 to 5)
- (string) Text content of the review
- (date) Date the review was written

(An entity-relationship diagram accompanies the original listing.)
The script saves two folders inside the specified output path:
csv/ # CSV files
parquet/ # Parquet files
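As one illustration of intended use, here is a minimal DuckDB sketch querying the generated Parquet files. The file names (customers.parquet, orders.parquet) and column names (customer_id, name, total_amount) are assumptions, since the generation script defines the actual schema.

```python
# Sketch: join two generated Parquet tables with DuckDB.
# File and column names below are assumptions; adjust to the real output.
import duckdb

top_customers = duckdb.sql("""
    SELECT c.name, COUNT(*) AS n_orders, SUM(o.total_amount) AS total_spent
    FROM 'parquet/customers.parquet' AS c
    JOIN 'parquet/orders.parquet'   AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY total_spent DESC
    LIMIT 10
""").df()
print(top_customers)
```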
MIT License
Simulated dataset GluGluToHToZZTo4L_M-550_7TeV-minloHJJ-pythia6-tauola in AODSIM format for 2011 collision data (SM Higgs).
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the collision data collected by the CMS experiment in 2011.
According to our latest research, the AI-Generated Product Naming market size reached USD 612.4 million in 2024, reflecting a robust adoption curve across industries worldwide. With a compound annual growth rate (CAGR) of 17.8% from 2025 to 2033, the market is forecasted to attain a value of USD 2,183.6 million by 2033. The principal growth factor driving this expansion is the increasing demand for rapid, creative, and data-driven branding solutions that can keep pace with product proliferation and global market entry.
The primary growth driver for the AI-Generated Product Naming market is the exponential rise in product launches across diverse sectors, especially in retail, FMCG, and technology. As businesses strive to differentiate themselves in saturated markets, the need for unique, memorable, and linguistically appropriate product names has intensified. AI-powered naming solutions leverage natural language processing, machine learning, and big data analytics to generate names that resonate with target audiences, are culturally sensitive, and are optimized for search engines. This capability not only accelerates time-to-market but also minimizes the risk of legal or cultural missteps, making AI-based naming indispensable for global enterprises and startups alike.
Another significant factor contributing to the market’s growth is the shift towards digitalization and automation in branding processes. Traditional product naming often involves lengthy brainstorming sessions, focus groups, and iterative testing, leading to time delays and increased costs. AI-Generated Product Naming tools streamline these workflows by instantly generating hundreds of name options that can be filtered by language, tone, industry relevance, and domain availability. The integration of AI solutions with branding agencies’ and enterprises’ existing marketing stacks further enhances efficiency and enables data-driven decision-making. This technological advancement is particularly valuable in highly competitive sectors such as pharmaceuticals and technology, where speed and compliance are critical.
Furthermore, the increasing investment in artificial intelligence and machine learning technologies by both established companies and innovative startups is fueling the development of more sophisticated and context-aware naming solutions. These platforms are becoming adept at understanding brand values, target demographics, and even emotional triggers, resulting in names that are not only creative but also strategically aligned with broader marketing goals. As AI algorithms evolve, their ability to generate names that pass linguistic, legal, and SEO checks will only improve, further solidifying their role in the product development lifecycle.
From a regional perspective, North America currently dominates the AI-Generated Product Naming market, accounting for the largest share due to its advanced technological infrastructure, high adoption rate of AI-powered marketing tools, and the presence of leading branding agencies and multinational companies. Europe follows closely, driven by its vibrant FMCG and e-commerce sectors, while Asia Pacific is emerging as the fastest-growing region, propelled by the rapid digital transformation of retail and consumer goods industries in China, India, and Southeast Asia. Latin America and the Middle East & Africa are also witnessing steady growth, supported by increasing entrepreneurial activity and digitalization efforts.
The Component segment of the AI-Generated Product Naming market is bifurcated into Software and Services. The software sub-segment encompasses AI-powered platforms and tools that autonomously generate product names based on user inputs, industry context, and linguistic guidelines. These solutions are increasingly leveraging advanced natural language generation and deep learning algorithms to produce names that are no
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Replication materials for "The Dynamics of Partisan Identification when Party Brands Change: The Case of the Workers Party in Brazil" and the "Two-City, Six-Wave Panel Survey, Brazil" (2002, 2004, 2006). Sample: representative samples of (1) Caxias do Sul, Rio Grande do Sul and (2) Juiz de Fora, Minas Gerais. Topic areas: neighborhood quality of life, worst problems, economic assessments, political participation, media and campaign attention, civil society and neighborhood involvement, political discussion frequency, trust in government and institutions, vote choice, core values, interpersonal persuasion, feeling thermometers of groups and politicians, party identification, ideology, candidate trait assessments, candidate ideological and issue placement, issue self-placement, evaluation of Lula's government, political knowledge, discussant name generator. Sample size: about 25,000 interviews. Special features: interviews with named political discussants, 100 interviews per neighborhood.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking duplicate detection or entity resolution algorithms. The dataset focuses on person-level fields typical in customer or citizen records. Since real-world person-level data is restricted due to Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule.

Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.

Enhancements in Version 2:
- New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
- Improved data realism with consistent field relationships: state and ZIP codes now match correctly; phone numbers are generated based on state codes; email addresses are logically related to name components.
- Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
- Improved data validation and formatting for address, email, and date fields.
- Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate rules (with real-world use cases):
1. Variation in the email address. Use case: the same person using multiple email accounts.
2. Variation in phone numbers. Use case: the same person using multiple contact numbers.
3. Last-name variation. Use case: name changes or data entry inconsistencies.
4. Address variation. Use case: the same person maintaining multiple addresses or moving residences.
5. Nickname. Use case: the same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output format: the dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data regeneration: the included Python script can be used to fully regenerate the dataset and supports addition of new duplication rules; regional, linguistic, or domain-specific variations; and volume scaling for large-scale testing scenarios.

Files included:
- spider_dataset_v2_6_20251027_022215.csv
- spider_dataset_v2_6_20251027_022215.json
- spider_readme_v2.md
- SPIDER_generation_script_v2.py
- SupportingDocuments/ folder containing:
  - benchmark_comparison_script.py – script used to derive the F1 score.
  - Public_census_data_surname.csv – sample U.S. Census name and demographic data used for comparison.
  - ssa_firstnames.csv – Social Security Administration names dataset.
  - simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
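A naive duplicate-detection baseline over SPIDER might look like the sketch below. The columns record_id, first_name, last_name, and dob are assumed names (only is_duplicate_of, duplication_rule, and cluster_id are documented above), and blocking on exact DOB deliberately misses DOB-varying duplicates; this is a starting point, not the benchmark script shipped with the dataset.

```python
# Sketch: naive blocking + fuzzy first-name matching, scored against
# is_duplicate_of. Column names other than is_duplicate_of are assumptions.
from difflib import SequenceMatcher

import pandas as pd

df = pd.read_csv("spider_dataset_v2_6_20251027_022215.csv")

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

# Block on (last-name initial, DOB) to keep the comparison space small.
predicted = set()
for _, block in df.groupby([df["last_name"].str[0], df["dob"]]):
    recs = block.to_dict("records")
    for i in range(len(recs)):
        for j in range(i + 1, len(recs)):
            if similar(recs[i]["first_name"], recs[j]["first_name"]) > 0.8:
                predicted.add(frozenset((recs[i]["record_id"], recs[j]["record_id"])))

truth = set(
    frozenset((row["record_id"], row["is_duplicate_of"]))
    for _, row in df.dropna(subset=["is_duplicate_of"]).iterrows()
)
tp = len(predicted & truth)
print(f"precision={tp / max(len(predicted), 1):.3f}  recall={tp / max(len(truth), 1):.3f}")
```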
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of DNS over HTTPS traffic from Firefox (Comcast, CZNIC, DNSForge, DNSSB, DOHli). The dataset contains DoH and HTTPS traffic that was captured in a virtualized environment (Docker) and generated automatically by the Firefox browser with DoH enabled towards 5 different DoH servers (Comcast, CZNIC, DNSForge, DNSSB, DOHli), with web-page loads towards a sample of web pages taken from the Majestic Million dataset. The data are provided in the form of PCAP files. However, we also provide TLS-enriched flow data generated with the open-source ipfixprobe flow exporter. Information other than TLS-related fields is not relevant, since the dataset comprises only encrypted TLS traffic. The TLS-enriched flow data are provided in the form of CSV files with the following columns:
| Column Name | Column Description |
|---|---|
| DST_IP | Destination IP address |
| SRC_IP | Source IP address |
| BYTES | The number of transmitted bytes from Source to Destination |
| BYTES_REV | The number of transmitted bytes from Destination to Source |
| TIME_FIRST | Timestamp of the first packet in the flow, in format YYYY-MM-DDTHH-MM-SS |
| TIME_LAST | Timestamp of the last packet in the flow, in format YYYY-MM-DDTHH-MM-SS |
| PACKETS | The number of packets transmitted from Source to Destination |
| PACKETS_REV | The number of packets transmitted from Destination to Source |
| DST_PORT | Destination port |
| SRC_PORT | Source port |
| PROTOCOL | The number of the transport protocol |
| TCP_FLAGS | Logical OR across all TCP flags in the packets transmitted from Source to Destination |
| TCP_FLAGS_REV | Logical OR across all TCP flags in the packets transmitted from Destination to Source |
| TLS_ALPN | The value of the Application-Layer Protocol Negotiation extension sent by the server |
| TLS_JA3 | The JA3 fingerprint |
| TLS_SNI | The value of the Server Name Indication extension sent by the client |
The DoH resolvers in the dataset can be identified by the IP addresses listed in the doh_resolver_ip.csv file.
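For example, the flow CSVs can be labeled as DoH or non-DoH by matching destination IPs against that file. The flow-file path and the "ip" header in doh_resolver_ip.csv are assumptions; the flow columns follow the table above.

```python
# Sketch: label flows as DoH by resolver IP.
# The CSV file name and the "ip" column header are assumptions.
import pandas as pd

resolvers = set(pd.read_csv("doh_resolver_ip.csv")["ip"])
flows = pd.read_csv("data/generated/tls-flow-csv/firefox/flows.csv")  # hypothetical name

flows["is_doh"] = flows["DST_IP"].isin(resolvers)
print(flows["is_doh"].value_counts())
```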
The main part of the dataset is located in DoH-Gen-F-CCDDD.tar.gz and has the following structure:
.
└── data                 | Main directory with data
    └── generated        | Directory with generated captures
        ├── pcap         | Generated PCAPs
        │   └── firefox
        └── tls-flow-csv | Generated CSV flow data
            └── firefox
Total stats of generated data:
| Name | Value |
|---|---|
| Total Data Size | 40.2 GB |
| Total files | 10 |
| DoH extracted TLS flows | ~100 K |
| Non-DoH extracted TLS flows | ~315 K |
DoH server information:

| Name | Provider | DoH query URL |
|---|---|---|
| Comcast | https://corporate.comcast.com | https://doh.xfinity.com/dns-query |
| CZNIC | https://www.nic.cz | https://odvr.nic.cz/doh |
| DNSForge | https://dnsforge.de | https://dnsforge.de/dns-query |
| DNSSB | https://dns.sb/doh/ | https://doh.dns.sb/dns-query |
| DOHli | https://doh.li | https://doh.li/dns-query |
Simulated dataset MinBias_TuneD6T_2760GeV_pythia6 in GEN-SIM-RECO format for 2013 collision data.
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the pp collision data, needed as reference data for heavy-ion data analysis, at an energy of 2.76 TeV, collected by the CMS experiment in 2013.
Excited fermions in the mass range 5-40 TeV; 10000 events per file, 100 files per mass. The compositeness scale (Lambda) is set to the mass of the fermion, so the width is expected to be small (see the log files for details). Cross sections are included in the log files (mass dependent). Note that data are slimmed (see the log file). How to decode the name: tev100_pythia8_qstar_m[MASS]_[NUMBER], where [MASS] is the generator-level mass as given below. Mass bins (in GeV): m[1]=5000, m[2]=10000, ......... How to use: to get a sample with a given mass, use "glob" regular expressions, as in the sketch below. Slimming: particle records keep all stable particles with pT > 0.3 GeV and particles with (PID==5 || PID==6) || (PID>22 && PID<38) || (PID>10 && PID<17).
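A minimal sketch of the glob-based selection, assuming files follow the tev100_pythia8_qstar_m[MASS]_[NUMBER] naming described above:

```python
# Sketch: collect all files for a single mass point by name pattern.
import glob

files_m5000 = sorted(glob.glob("tev100_pythia8_qstar_m5000_*"))
print(len(files_m5000), "files for the 5 TeV mass point")
```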
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The Albero study analyzes the personal transitions of a cohort of high school students at the end of their studies. The data consist of (a) the longitudinal social network of the students, before (n = 69) and after (n = 57) finishing their studies; and (b) the longitudinal study of the personal networks of each of the participants in the research. The two observations of the complete social network are presented in two matrices in Excel format. For each respondent, two square matrices of 45 alters of their personal networks are provided, also in Excel format. For each respondent, both psychological sense of community and frequency of commuting are provided in a SAV file (SPSS). The database allows the combined analysis of social networks and personal networks of the same set of individuals.
INTRODUCTION
Ecological transitions are key moments in the life of an individual that occur as a result of a change of role or context. This is the case, for example, of the completion of high school studies, when young people start their university studies or try to enter the labor market. These transitions are turning points that carry a risk or an opportunity (Seidman & French, 2004). That is why they have received special attention in research and psychological practice, both from a developmental point of view and in the situational analysis of stress or in the implementation of preventive strategies.
The data we present in this article describe the ecological transition of a group of young people from Alcala de Guadaira, a town located about 16 kilometers from Seville. Specifically, in the “Albero” study we monitored the transition of a cohort of secondary school students at the end of the last pre-university academic year. It is a turning point in which most of them began a metropolitan lifestyle, with more displacements to the capital and a slight decrease in identification with the place of residence (Maya-Jariego, Holgado & Lubbers, 2018).
Normative transitions, such as the completion of studies, affect a group of individuals simultaneously, so they can be analyzed both individually and collectively. From an individual point of view, each student stops attending the institute, which is replaced by new interaction contexts. Consequently, the structure and composition of their personal networks are transformed. From a collective point of view, the network of friendships of the cohort of high school students enters into a gradual process of disintegration and fragmentation into subgroups (Maya-Jariego, Lubbers & Molina, 2019).
These two levels, individual and collective, were evaluated in the “Albero” study. One of the peculiarities of this database is that we combine the analysis of a complete social network with a survey of personal networks in the same set of individuals, with a longitudinal design before and after finishing high school. This allows combining the study of the multiple contexts in which each individual participates, assessed through the analysis of a sample of personal networks (Maya-Jariego, 2018), with the in-depth analysis of a specific context (the relationships between a promotion of students in the institute), through the analysis of the complete network of interactions. This potentially allows us to examine the covariation of the social network with the individual differences in the structure of personal networks.
PARTICIPANTS
The social network and personal networks of the students of the last two years of high school of an institute of Alcala de Guadaira (Seville) were analyzed. The longitudinal follow-up covered approximately a year and a half. The first wave was composed of 31 men (44.9%) and 38 women (55.1%) who live in Alcala de Guadaira, and who mostly expect to live in Alcala (36.2%) or in Seville (37.7%) in the future. In the second wave, information was obtained from 27 men (47.4%) and 30 women (52.6%).
DATA STRUCTURE AND ARCHIVE FORMATS
The data is organized in two longitudinal observations, with information on the complete social network of the cohort of students of the last year, the personal networks of each individual and complementary information on the sense of community and frequency of metropolitan movements, among other variables.
Social network
The file “Red_Social_t1.xlsx” is a valued matrix of 69 actors that gathers the relations of knowledge and friendship between the cohort of students of the last year of high school in the first observation. The file “Red_Social_t2.xlsx” is a valued matrix of 57 actors obtained 17 months after the first observation.
In order to generate each complete social network, the list of 77 students enrolled in the last year of high school was passed to the respondents, asking that in each case they indicate the type of relationship, according to the following values: 1, “his/her name sounds familiar"; 2, "I know him/her"; 3, "we talk from time to time"; 4, "we have good relationship"; and 5, "we are friends." The two resulting complete networks are represented in Figure 2. In the second observation, it is a comparatively less dense network, reflecting the gradual disintegration process that the student group has initiated.
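A minimal loading sketch for comparing the two waves is given below. It assumes the Excel sheets hold square valued matrices with actor labels in the first row and column, and the tie threshold (keeping values of 3 or more) is an illustrative choice, not part of the original analysis.

```python
# Sketch: load the valued adjacency matrices and compare network densities.
# Sheet layout and the threshold of 3 ("we talk from time to time") are assumptions.
import networkx as nx
import pandas as pd

def load_network(path: str, threshold: int = 3) -> nx.Graph:
    m = pd.read_excel(path, index_col=0)
    g = nx.Graph()
    g.add_nodes_from(m.index)
    for i in m.index:
        for j in m.columns:
            if i != j and m.loc[i, j] >= threshold:
                g.add_edge(i, j)
    return g

t1 = load_network("Red_Social_t1.xlsx")
t2 = load_network("Red_Social_t2.xlsx")
print(nx.density(t1), nx.density(t2))  # density drops as the cohort disperses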
Personal networks
Also in this case the information is organized in two observations. The compressed file “Redes_Personales_t1.csv” includes 69 folders, corresponding to personal networks. Each folder includes a valued matrix of 45 alters in CSV format. Likewise, in each case a graphic representation of the network obtained with Visone (Brandes and Wagner, 2004) is included. Relationship values range from 0 (do not know each other) to 2 (know each other very well).
Second, the compressed file “Redes_Personales_t2.csv” includes 57 folders, with the information equivalent to each respondent referred to the second observation, that is, 17 months after the first interview. The structure of the data is the same as in the first observation.
Sense of community and metropolitan displacements
The SPSS file “Albero.sav” collects the survey data, together with some information-summary of the network data related to each respondent. The 69 rows correspond to the 69 individuals interviewed, and the 118 columns to the variables related to each of them in T1 and T2, according to the following list:
• Socio-economic data.
• Data on habitual residence.
• Information on intercity journeys.
• Identity and sense of community.
• Personal network indicators.
• Social network indicators.
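A minimal sketch for reading the survey file, assuming the pyreadstat package (the individual variable names inside Albero.sav are not reproduced here):

```python
# Sketch: read the SPSS file and inspect its shape and variable names.
import pyreadstat

df, meta = pyreadstat.read_sav("Albero.sav")
print(df.shape)              # expected: (69, 118)
print(meta.column_names[:10])
```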
DATA ACCESS
Social networks and personal networks are available in CSV format. This allows them to be used directly with UCINET, Visone, Pajek or Gephi, among others, and they can be exported as Excel or text files for use with other programs.
The visual representation of the personal networks of the respondents in both waves is available in the following album of the Graphic Gallery of Personal Networks on Flickr: .
In previous work we analyzed the effects of personal networks on the longitudinal evolution of the socio-centric network. It also includes additional details about the instruments applied. In case of using the data, please quote the following reference:
Maya-Jariego, I., Holgado, D. & Lubbers, M. J. (2018). Efectos de la estructura de las redes personales en la red sociocéntrica de una cohorte de estudiantes en transición de la enseñanza secundaria a la universidad. Universitas Psychologica, 17(1), 86-98. https://doi.org/10.11144/Javeriana.upsy17-1.eerp
The English version of this article can be downloaded from: https://tinyurl.com/yy9s2byl
CONCLUSION
The database of the “Albero” study allows us to explore the co-evolution of social networks and personal networks. In this way, we can examine the mutual dependence of individual trajectories and the structure of the relationships of the cohort of students as a whole. The complete social network corresponds to the same context of interaction: the secondary school. However, personal networks collect information from the different contexts in which the individual participates. The structural properties of personal networks may partly explain individual differences in the position of each student in the entire social network. In turn, the properties of the entire social network partly determine the structure of opportunities in which individual trajectories are displayed.
The longitudinal character and the combination of the personal networks of individuals with a common complete social network, make this database have unique characteristics. It may be of interest both for multi-level analysis and for the study of individual differences.
ACKNOWLEDGEMENTS
The fieldwork for this study was supported by the Complementary Actions of the Ministry of Education and Science (SEJ2005-25683), and was part of the project “Dynamics of actors and networks across levels: individuals, groups, organizations and social settings” (2006 -2009) of the European Science Foundation (ESF). The data was presented for the first time on June 30, 2009, at the European Research Collaborative Project Meeting on Dynamic Analysis of Networks and Behaviors, held at the Nuffield College of the University of Oxford.
REFERENCES
Brandes, U., & Wagner, D. (2004). Visone - Analysis and Visualization of Social Networks. In M. Jünger, & P. Mutzel (Eds.), Graph Drawing Software (pp. 321-340). New York: Springer-Verlag.
Maya-Jariego, I. (2018). Why name generators with a fixed number of alters may be a pragmatic option for personal network analysis. American Journal of
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
The invoice dataset provided is a mock dataset generated using the Python Faker library. It has been designed to mimic the format of data collected from an online store. The dataset contains various fields, including first name, last name, email, product ID, quantity, amount, invoice date, address, city, and stock code. All of the data in the dataset is randomly generated and does not represent actual individuals or products. The dataset can be used for various purposes, including testing algorithms or models related to invoice management, e-commerce, or customer behavior analysis. The data in this dataset can be used to identify trends, patterns, or anomalies in online shopping behavior, which can help businesses to optimize their online sales strategies.
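A minimal sketch of how such a mock invoice table could be generated with Faker; the value pools, ranges, and stock-code pattern are illustrative assumptions rather than the original generator's settings.

```python
# Sketch: mock invoice rows with the fields described above.
import random

import pandas as pd
from faker import Faker

fake = Faker()
rows = [{
    "first_name": fake.first_name(),
    "last_name": fake.last_name(),
    "email": fake.email(),
    "product_id": random.randint(1000, 9999),
    "quantity": random.randint(1, 10),
    "amount": round(random.uniform(5, 500), 2),
    "invoice_date": fake.date_this_year(),
    "address": fake.street_address(),
    "city": fake.city(),
    "stock_code": fake.bothify("??####"),  # pattern is an assumption
} for _ in range(1000)]

pd.DataFrame(rows).to_csv("invoice_dataset.csv", index=False)
```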
This data is artificially generated. It can be used for practicing data visualization and analysis skills. Please note that since the data is generated randomly, it may not reflect real-world sales data accurately. However, it should serve as a good starting point for practicing data analysis and visualization.
Description:

• Sales Date: the date of each sale. Dates are generated for a period of 120 days starting from January 1, 2023.
• Category: the category of the product sold ('Electronics', 'Clothing', or 'Home & Kitchen').
• Subcategory: the subcategory of the product sold. Each category has its own set of subcategories; for example, 'Electronics' includes 'Communication', 'Computers', and 'Wearables'.
• ProductName: the name of the product sold. Each subcategory has its own set of products; for example, 'Communication' includes 'Walkie Talkie', 'Cell Phone', and 'Smart Phone'.
• Salesperson: the name of the salesperson who made the sale. Different salespersons are assigned to each category.
• Gender: the gender of the salesperson, determined from the salesperson's name.
• Unit sold: the number of units sold in the sale, a random number between 1 and 100.
• Original Price: the original price of the product, a random number between 10 and 1000.
• Sales Price: the sales price of the product, derived from the original price by a random factor so that it is always slightly higher than the original price.
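A minimal generation sketch following these column rules; the category/product/salesperson pools are trimmed for brevity and the names are illustrative:

```python
# Sketch: generate sales rows with the columns described above.
import random
from datetime import date, timedelta

import pandas as pd

subcats = {
    "Electronics": {"Communication": ["Walkie Talkie", "Cell Phone", "Smart Phone"]},
    "Clothing": {"Outerwear": ["Jacket", "Coat"]},
    "Home & Kitchen": {"Cookware": ["Pan", "Pot"]},
}
# Illustrative (name, gender) pairs per category.
salespeople = {"Electronics": [("Alice", "Female")],
               "Clothing": [("Bob", "Male")],
               "Home & Kitchen": [("Carol", "Female")]}

rows = []
start = date(2023, 1, 1)
for day in range(120):
    cat = random.choice(list(subcats))
    sub = random.choice(list(subcats[cat]))
    name, gender = random.choice(salespeople[cat])
    original = round(random.uniform(10, 1000), 2)
    rows.append({
        "Sales Date": start + timedelta(days=day),
        "Category": cat,
        "Subcategory": sub,
        "ProductName": random.choice(subcats[cat][sub]),
        "Salesperson": name,
        "Gender": gender,
        "Unit sold": random.randint(1, 100),
        "Original Price": original,
        # Slightly above the original price, per the description.
        "Sales Price": round(original * random.uniform(1.01, 1.10), 2),
    })

df = pd.DataFrame(rows)
```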
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset features a curated list of 70 small business ideas that are relevant and potentially profitable for aspiring entrepreneurs in 2025. Each entry includes the business name, difficulty rating, and its corresponding category to help users analyze and choose ideas based on their interests, expertise, and available resources.
Filename: small_business_ideas_2025.csv
| Column Name | Description |
|---|---|
| Name | The name or title of the small business idea |
| Difficulty | Estimated difficulty level to start the business (Low, Medium, or High) |
| Category | The general type of service or industry (e.g., Financial Services, Creative Work, Manual Labor, etc.) |
The dataset covers diverse categories. Example rows:
| Name | Difficulty | Category |
|---|---|---|
| Accounting and Tax Services | High | Financial Services |
| Dog Walking | Low | Other Services |
| Web Development | High | Creative Work |
| Food Truck | High | Hospitality |
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and each file contains information on the phase, question, table, and SQL for each example.
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Files: train.csv, validation.csv, and test.csv share the same columns:

| Column name | Description |
|:---|:---|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |
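A minimal sketch for loading the splits and inspecting one question/SQL pair, assuming the column layout documented above:

```python
# Sketch: load the CSV splits and print one example.
import pandas as pd

splits = {name: pd.read_csv(f"{name}.csv") for name in ("train", "validation", "test")}
example = splits["train"].iloc[0]
print(example["question"])
print(example["sql"])
```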
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Simulated dataset Pyquen_DiJet_Pt30_TuneZ2_Unquenched_Hydjet1p8_2760GeV in GEN-SIM-RECO format for 2013 collision data.
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the PbPb collision data at an energy of 2.76 TeV collected by the CMS experiment during Run 1.
Simulated pile-up event dataset MinBias_TuneZ2star_8TeV-pythia6 in GEN-SIM format. Events were sampled from this dataset and added to simulated data to make them comparable with the 2012 collision data; see the guide to pile-up simulation.
See the description of the simulated dataset names in: About CMS simulated dataset names.
Simulated dataset QCD_Pt_460_TuneZ2_5p02TeV in GEN-SIM-RECO format for 2013 collision data.
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the pPb collision data at an energy of 5.02 TeV collected by the CMS experiment in 2013.