Creating a robust employee dataset for data analysis and visualization involves several key fields that capture different aspects of an employee's information. Here's a list of fields you might consider including:

- Employee ID: A unique identifier for each employee.
- Name: First and last name of the employee.
- Gender: Male, female, non-binary, etc.
- Date of Birth: Birthdate of the employee.
- Email Address: Contact email of the employee.
- Phone Number: Contact number of the employee.
- Address: Home or work address of the employee.
- Department: The department the employee belongs to (e.g., HR, Marketing, Engineering).
- Job Title: The specific job title of the employee.
- Manager ID: ID of the employee's manager.
- Hire Date: Date when the employee was hired.
- Salary: Employee's salary or compensation.
- Employment Status: Full-time, part-time, contractor, etc.
- Employee Type: Regular, temporary, contract, etc.
- Education Level: Highest level of education attained by the employee.
- Certifications: Any relevant certifications the employee holds.
- Skills: Specific skills or expertise possessed by the employee.
- Performance Ratings: Ratings or evaluations of employee performance.
- Work Experience: Previous work experience of the employee.
- Benefits Enrollment: Benefits chosen by the employee (e.g., healthcare plan, retirement plan).
- Work Location: Physical location where the employee works.
- Work Hours: Regular working hours or shifts of the employee.
- Employee Status: Active, on leave, terminated, etc.
- Emergency Contact: Contact information of the employee's emergency contact person.
- Employee Satisfaction Survey Responses: Data from employee satisfaction surveys, if applicable.

A minimal generation sketch follows the code URL below.
Code Url: https://github.com/intellisenseCodez/faker-data-generator
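Below is a minimal sketch of generating such a table with the Python Faker library. The field subset, value pools, and ranges are illustrative assumptions, not the linked repository's exact schema.

```python
# Sketch: synthetic employee table with Faker (field choices are illustrative).
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_rows = 1000

rows = [{
    "Employee_ID": fake.uuid4(),
    "First_Name": fake.first_name(),
    "Last_Name": fake.last_name(),
    "Gender": random.choice(["Male", "Female", "Non-binary"]),
    "Date_of_Birth": fake.date_of_birth(minimum_age=21, maximum_age=65),
    "Email": fake.email(),
    "Phone": fake.phone_number(),
    "Department": random.choice(["HR", "Marketing", "Engineering", "Finance"]),
    "Job_Title": fake.job(),
    "Hire_Date": fake.date_between(start_date="-10y", end_date="today"),
    "Salary": round(random.uniform(30_000, 150_000), 2),
    "Employment_Status": random.choice(["Full-time", "Part-time", "Contractor"]),
} for _ in range(num_rows)]

pd.DataFrame(rows).to_csv("employee_dataset.csv", index=False)
```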
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
* This field indicates a dummy variable was also included. If a data point for the row variable was a 0, the dummy took on a value of 1; otherwise the dummy was 0. Row variables with blank entries did not exhibit over-dispersion of zeros and so did not require dummy variables.
† Variable was log transformed to better meet generalized linear model assumptions.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Finance dataset with fake information such as transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. It can be used for educational purposes as well as for testing.
This script generates a dataset of fake transactions with fields such as transaction ID, date, amount, currency, description, category, merchant, customer, city, and country. Adjust the num_rows variable to specify the number of rows you want in your dataset. Finally, the dataset is saved to a CSV file named finance_dataset.csv. You can modify the fields or add additional fields according to your requirements.
```python
# Generate a synthetic finance-transaction dataset with Faker.
import random

import pandas as pd
from faker import Faker

fake = Faker()
num_rows = 15000

data = {
    'Transaction_ID': [fake.uuid4() for _ in range(num_rows)],
    'Date': [fake.date_time_this_year() for _ in range(num_rows)],
    'Amount': [round(random.uniform(10, 10000), 2) for _ in range(num_rows)],
    'Currency': [fake.currency_code() for _ in range(num_rows)],
    'Description': [fake.bs() for _ in range(num_rows)],
    'Category': [random.choice(['Food', 'Transport', 'Shopping', 'Entertainment', 'Utilities']) for _ in range(num_rows)],
    'Merchant': [fake.company() for _ in range(num_rows)],
    'Customer': [fake.name() for _ in range(num_rows)],
    'City': [fake.city() for _ in range(num_rows)],
    'Country': [fake.country() for _ in range(num_rows)],
}

df = pd.DataFrame(data)
df.to_csv('finance_dataset.csv', index=False)
df.head()
```
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is synthetically generated fake data designed to simulate a realistic e-commerce environment.
Its purpose is to provide large-scale relational datasets for practicing database operations, analytics, and testing tools like DuckDB, Pandas, and SQL engines. It is ideal for benchmarking, educational projects, and data engineering experiments.
The dataset comprises five related tables, with fields as follows:

Customers:
- (int) Unique identifier for each customer
- (string) Customer full name
- (string) Customer email address
- (string) Customer gender ('Male', 'Female', 'Other')
- (date) Date customer signed up
- (string) Customer country of residence

Products:
- (int) Unique identifier for each product
- (string) Name of the product
- (string) Product category (e.g., Electronics, Books)
- (float) Price per unit
- (int) Available stock count
- (string) Product brand name

Orders:
- (int) Unique identifier for each order
- (int) ID of the customer who placed the order (foreign key to Customers)
- (date) Date when the order was placed
- (float) Total amount for the order
- (string) Payment method used (Credit Card, PayPal, etc.)
- (string) Country where the order is shipped

Order Items:
- (int) Unique identifier for each order item
- (int) ID of the order this item belongs to (foreign key to Orders)
- (int) ID of the product ordered (foreign key to Products)
- (int) Number of units ordered
- (float) Price per unit at order time

Reviews:
- (int) Unique identifier for each review
- (int) ID of the reviewed product (foreign key to Products)
- (int) ID of the customer who wrote the review (foreign key to Customers)
- (int) Rating score (1 to 5)
- (string) Text content of the review
- (date) Date the review was written

(An entity-relationship diagram accompanies the original listing.)
The script saves two folders inside the specified output path:
csv/ # CSV files
parquet/ # Parquet files
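As one illustration of intended use, here is a minimal DuckDB sketch querying the generated Parquet files. The file names (customers.parquet, orders.parquet) and column names (customer_id, name, total_amount) are assumptions, since the generation script defines the actual schema.

```python
# Sketch: join two generated Parquet tables with DuckDB.
# File and column names below are assumptions; adjust to the real output.
import duckdb

top_customers = duckdb.sql("""
    SELECT c.name, COUNT(*) AS n_orders, SUM(o.total_amount) AS total_spent
    FROM 'parquet/customers.parquet' AS c
    JOIN 'parquet/orders.parquet'   AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY total_spent DESC
    LIMIT 10
""").df()
print(top_customers)
```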
MIT License
Simulated dataset GluGluToHToZZTo4L_M-550_7TeV-minloHJJ-pythia6-tauola in AODSIM format for 2011 collision data (SM Higgs).
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the collision data collected by the CMS experiment in 2011.
According to our latest research, the AI-Generated Product Naming market size reached USD 612.4 million in 2024, reflecting a robust adoption curve across industries worldwide. With a compound annual growth rate (CAGR) of 17.8% from 2025 to 2033, the market is forecasted to attain a value of USD 2,183.6 million by 2033. The principal growth factor driving this expansion is the increasing demand for rapid, creative, and data-driven branding solutions that can keep pace with product proliferation and global market entry.
The primary growth driver for the AI-Generated Product Naming market is the exponential rise in product launches across diverse sectors, especially in retail, FMCG, and technology. As businesses strive to differentiate themselves in saturated markets, the need for unique, memorable, and linguistically appropriate product names has intensified. AI-powered naming solutions leverage natural language processing, machine learning, and big data analytics to generate names that resonate with target audiences, are culturally sensitive, and are optimized for search engines. This capability not only accelerates time-to-market but also minimizes the risk of legal or cultural missteps, making AI-based naming indispensable for global enterprises and startups alike.
Another significant factor contributing to the market’s growth is the shift towards digitalization and automation in branding processes. Traditional product naming often involves lengthy brainstorming sessions, focus groups, and iterative testing, leading to time delays and increased costs. AI-Generated Product Naming tools streamline these workflows by instantly generating hundreds of name options that can be filtered by language, tone, industry relevance, and domain availability. The integration of AI solutions with branding agencies’ and enterprises’ existing marketing stacks further enhances efficiency and enables data-driven decision-making. This technological advancement is particularly valuable in highly competitive sectors such as pharmaceuticals and technology, where speed and compliance are critical.
Furthermore, the increasing investment in artificial intelligence and machine learning technologies by both established companies and innovative startups is fueling the development of more sophisticated and context-aware naming solutions. These platforms are becoming adept at understanding brand values, target demographics, and even emotional triggers, resulting in names that are not only creative but also strategically aligned with broader marketing goals. As AI algorithms evolve, their ability to generate names that pass linguistic, legal, and SEO checks will only improve, further solidifying their role in the product development lifecycle.
From a regional perspective, North America currently dominates the AI-Generated Product Naming market, accounting for the largest share due to its advanced technological infrastructure, high adoption rate of AI-powered marketing tools, and the presence of leading branding agencies and multinational companies. Europe follows closely, driven by its vibrant FMCG and e-commerce sectors, while Asia Pacific is emerging as the fastest-growing region, propelled by the rapid digital transformation of retail and consumer goods industries in China, India, and Southeast Asia. Latin America and the Middle East & Africa are also witnessing steady growth, supported by increasing entrepreneurial activity and digitalization efforts.
The Component segment of the AI-Generated Product Naming market is bifurcated into Software and Services. The software sub-segment encompasses AI-powered platforms and tools that autonomously generate product names based on user inputs, industry context, and linguistic guidelines. These solutions are increasingly leveraging advanced natural language generation and deep learning algorithms to produce names that are no
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Replication materials for "The Dynamics of Partisan Identification when Party Brands Change: The Case of the Workers Party in Brazil" and the "Two-City, Six-Wave Panel Survey, Brazil" (2002, 2004, 2006). Sample: representative samples of (1) Caxias do Sul, Rio Grande do Sul and (2) Juiz de Fora, Minas Gerais. Topic areas: neighborhood quality of life, worst problems, economic assessments, political participation, media and campaign attention, civil society and neighborhood involvement, political discussion frequency, trust in government and institutions, vote choice, core values, interpersonal persuasion, feeling thermometers of groups and politicians, party identification, ideology, candidate trait assessments, candidate ideological and issue placement, issue self-placement, evaluation of Lula's government, political knowledge, discussant name generator. Sample size: about 25,000 interviews. Special features: interviews with named political discussants, 100 interviews per neighborhood.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SPIDER (v2) – Synthetic Person Information Dataset for Entity Resolution provides researchers with ready-to-use data for benchmarking duplicate detection or entity resolution algorithms. The dataset focuses on person-level fields typical in customer or citizen records. Since real-world person-level data is restricted due to Personally Identifiable Information (PII) constraints, publicly available synthetic datasets are limited in scope, volume, or realism. SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule.

Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.

Enhancements in Version 2:
- New cluster_id column to group base and duplicate records for improved entity-level benchmarking.
- Improved data realism with consistent field relationships: state and ZIP codes now match correctly; phone numbers are generated based on state codes; email addresses are logically related to name components.
- Refined duplication logic: Rule 4 updated for realistic address variation; Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
- Improved data validation and formatting for address, email, and date fields.
- Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate rules (with real-world use cases):
1. Variation in the email address. Use case: the same person using multiple email accounts.
2. Variation in phone numbers. Use case: the same person using multiple contact numbers.
3. Last-name variation. Use case: name changes or data entry inconsistencies.
4. Address variation. Use case: the same person maintaining multiple addresses or moving residences.
5. Nickname. Use case: the same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output format: the dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data regeneration: the included Python script can be used to fully regenerate the dataset and supports addition of new duplication rules; regional, linguistic, or domain-specific variations; and volume scaling for large-scale testing scenarios.

Files included:
- spider_dataset_v2_6_20251027_022215.csv
- spider_dataset_v2_6_20251027_022215.json
- spider_readme_v2.md
- SPIDER_generation_script_v2.py
- SupportingDocuments/ folder containing:
  - benchmark_comparison_script.py – script used to derive the F1 score.
  - Public_census_data_surname.csv – sample U.S. Census name and demographic data used for comparison.
  - ssa_firstnames.csv – Social Security Administration names dataset.
  - simplemaps_uszips.csv – ZIP-to-state mapping data used for phone and address validation.
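A naive duplicate-detection baseline over SPIDER might look like the sketch below. The columns record_id, first_name, last_name, and dob are assumed names (only is_duplicate_of, duplication_rule, and cluster_id are documented above), and blocking on exact DOB deliberately misses DOB-varying duplicates; this is a starting point, not the benchmark script shipped with the dataset.

```python
# Sketch: naive blocking + fuzzy first-name matching, scored against
# is_duplicate_of. Column names other than is_duplicate_of are assumptions.
from difflib import SequenceMatcher

import pandas as pd

df = pd.read_csv("spider_dataset_v2_6_20251027_022215.csv")

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

# Block on (last-name initial, DOB) to keep the comparison space small.
predicted = set()
for _, block in df.groupby([df["last_name"].str[0], df["dob"]]):
    recs = block.to_dict("records")
    for i in range(len(recs)):
        for j in range(i + 1, len(recs)):
            if similar(recs[i]["first_name"], recs[j]["first_name"]) > 0.8:
                predicted.add(frozenset((recs[i]["record_id"], recs[j]["record_id"])))

truth = set(
    frozenset((row["record_id"], row["is_duplicate_of"]))
    for _, row in df.dropna(subset=["is_duplicate_of"]).iterrows()
)
tp = len(predicted & truth)
print(f"precision={tp / max(len(predicted), 1):.3f}  recall={tp / max(len(truth), 1):.3f}")
```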
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of DNS over HTTPS traffic from Firefox (Comcast, CZNIC, DNSForge, DNSSB, DOHli). The dataset contains DoH and HTTPS traffic that was captured in a virtualized environment (Docker) and generated automatically by the Firefox browser with DoH enabled towards 5 different DoH servers (Comcast, CZNIC, DNSForge, DNSSB, DOHli), with web-page loads towards a sample of web pages taken from the Majestic Million dataset. The data are provided in the form of PCAP files. However, we also provide TLS-enriched flow data generated with the open-source ipfixprobe flow exporter. Information other than TLS-related fields is not relevant, since the dataset comprises only encrypted TLS traffic. The TLS-enriched flow data are provided in the form of CSV files with the following columns:
| Column Name | Column Description |
|---|---|
| DST_IP | Destination IP address |
| SRC_IP | Source IP address |
| BYTES | The number of transmitted bytes from Source to Destination |
| BYTES_REV | The number of transmitted bytes from Destination to Source |
| TIME_FIRST | Timestamp of the first packet in the flow, in format YYYY-MM-DDTHH-MM-SS |
| TIME_LAST | Timestamp of the last packet in the flow, in format YYYY-MM-DDTHH-MM-SS |
| PACKETS | The number of packets transmitted from Source to Destination |
| PACKETS_REV | The number of packets transmitted from Destination to Source |
| DST_PORT | Destination port |
| SRC_PORT | Source port |
| PROTOCOL | The number of the transport protocol |
| TCP_FLAGS | Logical OR across all TCP flags in the packets transmitted from Source to Destination |
| TCP_FLAGS_REV | Logical OR across all TCP flags in the packets transmitted from Destination to Source |
| TLS_ALPN | The value of the Application-Layer Protocol Negotiation extension sent by the server |
| TLS_JA3 | The JA3 fingerprint |
| TLS_SNI | The value of the Server Name Indication extension sent by the client |
The DoH resolvers in the dataset can be identified by the IP addresses listed in the doh_resolver_ip.csv file.
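For example, the flow CSVs can be labeled as DoH or non-DoH by matching destination IPs against that file. The flow-file path and the "ip" header in doh_resolver_ip.csv are assumptions; the flow columns follow the table above.

```python
# Sketch: label flows as DoH by resolver IP.
# The CSV file name and the "ip" column header are assumptions.
import pandas as pd

resolvers = set(pd.read_csv("doh_resolver_ip.csv")["ip"])
flows = pd.read_csv("data/generated/tls-flow-csv/firefox/flows.csv")  # hypothetical name

flows["is_doh"] = flows["DST_IP"].isin(resolvers)
print(flows["is_doh"].value_counts())
```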
The main part of the dataset is located in DoH-Gen-F-CCDDD.tar.gz and has the following structure:
.
└── data                 | Main directory with data
    └── generated        | Directory with generated captures
        ├── pcap         | Generated PCAPs
        │   └── firefox
        └── tls-flow-csv | Generated CSV flow data
            └── firefox
Total stats of generated data:
| Name | Value |
|---|---|
| Total Data Size | 40.2 GB |
| Total files | 10 |
| DoH extracted TLS flows | ~100 K |
| Non-DoH extracted TLS flows | ~315 K |
DoH server information:

| Name | Provider | DoH query URL |
|---|---|---|
| Comcast | https://corporate.comcast.com | https://doh.xfinity.com/dns-query |
| CZNIC | https://www.nic.cz | https://odvr.nic.cz/doh |
| DNSForge | https://dnsforge.de | https://dnsforge.de/dns-query |
| DNSSB | https://dns.sb/doh/ | https://doh.dns.sb/dns-query |
| DOHli | https://doh.li | https://doh.li/dns-query |
Simulated dataset MinBias_TuneD6T_2760GeV_pythia6 in GEN-SIM-RECO format for 2013 collision data.
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the pp collision data, needed as reference data for heavy-ion data analysis, at an energy of 2.76 TeV, collected by the CMS experiment in 2013.
Excited fermions in the mass range 5-40 TeV; 10000 events per file, 100 files per mass. The compositeness scale (Lambda) is set to the mass of the fermion, so the width is expected to be small (see the log files for details). Cross sections are included in the log files (mass dependent). Note that data are slimmed (see the log file). How to decode the name: tev100_pythia8_qstar_m[MASS]_[NUMBER], where [MASS] is the generator-level mass as given below. Mass bins (in GeV): m[1]=5000, m[2]=10000, ......... How to use: to get a sample with a given mass, use "glob" regular expressions, as in the sketch below. Slimming: particle records keep all stable particles with pT > 0.3 GeV and particles with (PID==5 || PID==6) || (PID>22 && PID<38) || (PID>10 && PID<17).
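A minimal sketch of the glob-based selection, assuming files follow the tev100_pythia8_qstar_m[MASS]_[NUMBER] naming described above:

```python
# Sketch: collect all files for a single mass point by name pattern.
import glob

files_m5000 = sorted(glob.glob("tev100_pythia8_qstar_m5000_*"))
print(len(files_m5000), "files for the 5 TeV mass point")
```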
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The Albero study analyzes the personal transitions of a cohort of high school students at the end of their studies. The data consist of (a) the longitudinal social network of the students, before (n = 69) and after (n = 57) finishing their studies; and (b) the longitudinal study of the personal networks of each of the participants in the research. The two observations of the complete social network are presented in two matrices in Excel format. For each respondent, two square matrices of 45 alters of their personal networks are provided, also in Excel format. For each respondent, both psychological sense of community and frequency of commuting are provided in a SAV file (SPSS). The database allows the combined analysis of social networks and personal networks of the same set of individuals.
INTRODUCTION
Ecological transitions are key moments in the life of an individual that occur as a result of a change of role or context. This is the case, for example, of the completion of high school studies, when young people start their university studies or try to enter the labor market. These transitions are turning points that carry a risk or an opportunity (Seidman & French, 2004). That is why they have received special attention in research and psychological practice, both from a developmental point of view and in the situational analysis of stress or in the implementation of preventive strategies.
The data we present in this article describe the ecological transition of a group of young people from Alcala de Guadaira, a town located about 16 kilometers from Seville. Specifically, in the “Albero” study we monitored the transition of a cohort of secondary school students at the end of the last pre-university academic year. It is a turning point in which most of them began a metropolitan lifestyle, with more displacements to the capital and a slight decrease in identification with the place of residence (Maya-Jariego, Holgado & Lubbers, 2018).
Normative transitions, such as the completion of studies, affect a group of individuals simultaneously, so they can be analyzed both individually and collectively. From an individual point of view, each student stops attending the institute, which is replaced by new interaction contexts. Consequently, the structure and composition of their personal networks are transformed. From a collective point of view, the network of friendships of the cohort of high school students enters into a gradual process of disintegration and fragmentation into subgroups (Maya-Jariego, Lubbers & Molina, 2019).
These two levels, individual and collective, were evaluated in the “Albero” study. One of the peculiarities of this database is that we combine the analysis of a complete social network with a survey of personal networks in the same set of individuals, with a longitudinal design before and after finishing high school. This allows combining the study of the multiple contexts in which each individual participates, assessed through the analysis of a sample of personal networks (Maya-Jariego, 2018), with the in-depth analysis of a specific context (the relationships between a promotion of students in the institute), through the analysis of the complete network of interactions. This potentially allows us to examine the covariation of the social network with the individual differences in the structure of personal networks.
PARTICIPANTS
The social network and personal networks of the students of the last two years of high school of an institute of Alcala de Guadaira (Seville) were analyzed. The longitudinal follow-up covered approximately a year and a half. The first wave was composed of 31 men (44.9%) and 38 women (55.1%) who live in Alcala de Guadaira, and who mostly expect to live in Alcala (36.2%) or in Seville (37.7%) in the future. In the second wave, information was obtained from 27 men (47.4%) and 30 women (52.6%).
DATA STRUCTURE AND ARCHIVE FORMATS
The data is organized in two longitudinal observations, with information on the complete social network of the cohort of students of the last year, the personal networks of each individual and complementary information on the sense of community and frequency of metropolitan movements, among other variables.
Social network
The file “Red_Social_t1.xlsx” is a valued matrix of 69 actors that gathers the relations of knowledge and friendship between the cohort of students of the last year of high school in the first observation. The file “Red_Social_t2.xlsx” is a valued matrix of 57 actors obtained 17 months after the first observation.
In order to generate each complete social network, the list of 77 students enrolled in the last year of high school was passed to the respondents, asking that in each case they indicate the type of relationship, according to the following values: 1, “his/her name sounds familiar"; 2, "I know him/her"; 3, "we talk from time to time"; 4, "we have good relationship"; and 5, "we are friends." The two resulting complete networks are represented in Figure 2. In the second observation, it is a comparatively less dense network, reflecting the gradual disintegration process that the student group has initiated.
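A minimal loading sketch for comparing the two waves is given below. It assumes the Excel sheets hold square valued matrices with actor labels in the first row and column, and the tie threshold (keeping values of 3 or more) is an illustrative choice, not part of the original analysis.

```python
# Sketch: load the valued adjacency matrices and compare network densities.
# Sheet layout and the threshold of 3 ("we talk from time to time") are assumptions.
import networkx as nx
import pandas as pd

def load_network(path: str, threshold: int = 3) -> nx.Graph:
    m = pd.read_excel(path, index_col=0)
    g = nx.Graph()
    g.add_nodes_from(m.index)
    for i in m.index:
        for j in m.columns:
            if i != j and m.loc[i, j] >= threshold:
                g.add_edge(i, j)
    return g

t1 = load_network("Red_Social_t1.xlsx")
t2 = load_network("Red_Social_t2.xlsx")
print(nx.density(t1), nx.density(t2))  # density drops as the cohort disperses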
Personal networks
Also in this case the information is organized in two observations. The compressed file “Redes_Personales_t1.csv” includes 69 folders, corresponding to personal networks. Each folder includes a valued matrix of 45 alters in CSV format. Likewise, in each case a graphic representation of the network obtained with Visone (Brandes and Wagner, 2004) is included. Relationship values range from 0 (do not know each other) to 2 (know each other very well).
Second, the compressed file “Redes_Personales_t2.csv” includes 57 folders, with the information equivalent to each respondent referred to the second observation, that is, 17 months after the first interview. The structure of the data is the same as in the first observation.
Sense of community and metropolitan displacements
The SPSS file “Albero.sav” collects the survey data, together with some information-summary of the network data related to each respondent. The 69 rows correspond to the 69 individuals interviewed, and the 118 columns to the variables related to each of them in T1 and T2, according to the following list:
• Socio-economic data.
• Data on habitual residence.
• Information on intercity journeys.
• Identity and sense of community.
• Personal network indicators.
• Social network indicators.
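A minimal sketch for reading the survey file, assuming the pyreadstat package (the individual variable names inside Albero.sav are not reproduced here):

```python
# Sketch: read the SPSS file and inspect its shape and variable names.
import pyreadstat

df, meta = pyreadstat.read_sav("Albero.sav")
print(df.shape)              # expected: (69, 118)
print(meta.column_names[:10])
```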
DATA ACCESS
Social networks and personal networks are available in CSV format. This allows them to be used directly with UCINET, Visone, Pajek or Gephi, among others, and they can be exported as Excel or text files for use with other programs.
The visual representation of the personal networks of the respondents in both waves is available in the following album of the Graphic Gallery of Personal Networks on Flickr: .
In previous work we analyzed the effects of personal networks on the longitudinal evolution of the socio-centric network. It also includes additional details about the instruments applied. In case of using the data, please quote the following reference:
Maya-Jariego, I., Holgado, D. & Lubbers, M. J. (2018). Efectos de la estructura de las redes personales en la red sociocéntrica de una cohorte de estudiantes en transición de la enseñanza secundaria a la universidad. Universitas Psychologica, 17(1), 86-98. https://doi.org/10.11144/Javeriana.upsy17-1.eerp
The English version of this article can be downloaded from: https://tinyurl.com/yy9s2byl
CONCLUSION
The database of the “Albero” study allows us to explore the co-evolution of social networks and personal networks. In this way, we can examine the mutual dependence of individual trajectories and the structure of the relationships of the cohort of students as a whole. The complete social network corresponds to the same context of interaction: the secondary school. However, personal networks collect information from the different contexts in which the individual participates. The structural properties of personal networks may partly explain individual differences in the position of each student in the entire social network. In turn, the properties of the entire social network partly determine the structure of opportunities in which individual trajectories are displayed.
The longitudinal character and the combination of the personal networks of individuals with a common complete social network, make this database have unique characteristics. It may be of interest both for multi-level analysis and for the study of individual differences.
ACKNOWLEDGEMENTS
The fieldwork for this study was supported by the Complementary Actions of the Ministry of Education and Science (SEJ2005-25683), and was part of the project “Dynamics of actors and networks across levels: individuals, groups, organizations and social settings” (2006 -2009) of the European Science Foundation (ESF). The data was presented for the first time on June 30, 2009, at the European Research Collaborative Project Meeting on Dynamic Analysis of Networks and Behaviors, held at the Nuffield College of the University of Oxford.
REFERENCES
Brandes, U., & Wagner, D. (2004). Visone - Analysis and Visualization of Social Networks. In M. Jünger, & P. Mutzel (Eds.), Graph Drawing Software (pp. 321-340). New York: Springer-Verlag.
Maya-Jariego, I. (2018). Why name generators with a fixed number of alters may be a pragmatic option for personal network analysis. American Journal of
License: Database Contents License (DbCL) v1.0, http://opendatacommons.org/licenses/dbcl/1.0/
The invoice dataset provided is a mock dataset generated using the Python Faker library. It has been designed to mimic the format of data collected from an online store. The dataset contains various fields, including first name, last name, email, product ID, quantity, amount, invoice date, address, city, and stock code. All of the data in the dataset is randomly generated and does not represent actual individuals or products. The dataset can be used for various purposes, including testing algorithms or models related to invoice management, e-commerce, or customer behavior analysis. The data in this dataset can be used to identify trends, patterns, or anomalies in online shopping behavior, which can help businesses to optimize their online sales strategies.
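A minimal sketch of how such a mock invoice table could be generated with Faker; the value pools, ranges, and stock-code pattern are illustrative assumptions rather than the original generator's settings.

```python
# Sketch: mock invoice rows with the fields described above.
import random

import pandas as pd
from faker import Faker

fake = Faker()
rows = [{
    "first_name": fake.first_name(),
    "last_name": fake.last_name(),
    "email": fake.email(),
    "product_id": random.randint(1000, 9999),
    "quantity": random.randint(1, 10),
    "amount": round(random.uniform(5, 500), 2),
    "invoice_date": fake.date_this_year(),
    "address": fake.street_address(),
    "city": fake.city(),
    "stock_code": fake.bothify("??####"),  # pattern is an assumption
} for _ in range(1000)]

pd.DataFrame(rows).to_csv("invoice_dataset.csv", index=False)
```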
This data is artificially generated. It can be used for practicing data visualization and analysis skills. Please note that since the data is generated randomly, it may not reflect real-world sales data accurately. However, it should serve as a good starting point for practicing data analysis and visualization.
Description:

• Sales Date: the date of each sale. Dates are generated for a period of 120 days starting from January 1, 2023.
• Category: the category of the product sold ('Electronics', 'Clothing', or 'Home & Kitchen').
• Subcategory: the subcategory of the product sold. Each category has its own set of subcategories; for example, 'Electronics' includes 'Communication', 'Computers', and 'Wearables'.
• ProductName: the name of the product sold. Each subcategory has its own set of products; for example, 'Communication' includes 'Walkie Talkie', 'Cell Phone', and 'Smart Phone'.
• Salesperson: the name of the salesperson who made the sale. Different salespersons are assigned to each category.
• Gender: the gender of the salesperson, determined from the salesperson's name.
• Unit sold: the number of units sold in the sale, a random number between 1 and 100.
• Original Price: the original price of the product, a random number between 10 and 1000.
• Sales Price: the sales price of the product, derived from the original price by a random factor so that it is always slightly higher than the original price.
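A minimal generation sketch following these column rules; the category/product/salesperson pools are trimmed for brevity and the names are illustrative:

```python
# Sketch: generate sales rows with the columns described above.
import random
from datetime import date, timedelta

import pandas as pd

subcats = {
    "Electronics": {"Communication": ["Walkie Talkie", "Cell Phone", "Smart Phone"]},
    "Clothing": {"Outerwear": ["Jacket", "Coat"]},
    "Home & Kitchen": {"Cookware": ["Pan", "Pot"]},
}
# Illustrative (name, gender) pairs per category.
salespeople = {"Electronics": [("Alice", "Female")],
               "Clothing": [("Bob", "Male")],
               "Home & Kitchen": [("Carol", "Female")]}

rows = []
start = date(2023, 1, 1)
for day in range(120):
    cat = random.choice(list(subcats))
    sub = random.choice(list(subcats[cat]))
    name, gender = random.choice(salespeople[cat])
    original = round(random.uniform(10, 1000), 2)
    rows.append({
        "Sales Date": start + timedelta(days=day),
        "Category": cat,
        "Subcategory": sub,
        "ProductName": random.choice(subcats[cat][sub]),
        "Salesperson": name,
        "Gender": gender,
        "Unit sold": random.randint(1, 100),
        "Original Price": original,
        # Slightly above the original price, per the description.
        "Sales Price": round(original * random.uniform(1.01, 1.10), 2),
    })

df = pd.DataFrame(rows)
```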
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset features a curated list of 70 small business ideas that are relevant and potentially profitable for aspiring entrepreneurs in 2025. Each entry includes the business name, difficulty rating, and its corresponding category to help users analyze and choose ideas based on their interests, expertise, and available resources.
Filename: small_business_ideas_2025.csv
| Column Name | Description |
|---|---|
| Name | The name or title of the small business idea |
| Difficulty | Estimated difficulty level to start the business (Low, Medium, or High) |
| Category | The general type of service or industry (e.g., Financial Services, Creative Work, Manual Labor, etc.) |
The dataset covers diverse categories. Example rows:
| Name | Difficulty | Category |
|---|---|---|
| Accounting and Tax Services | High | Financial Services |
| Dog Walking | Low | Other Services |
| Web Development | High | Creative Work |
| Food Truck | High | Hospitality |
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
A large crowd-sourced dataset for developing natural language interfaces for relational databases. WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.
This dataset can be used to develop natural language interfaces for relational databases. The data fields are the same among all splits, and each file contains information on the phase, question, table, and SQL for each example.
- This dataset can be used to develop natural language interfaces for relational databases.
- This dataset can be used to develop a knowledge base of common SQL queries.
- This dataset can be used to generate a training set for a neural network that translates natural language into SQL queries
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
Files: train.csv, validation.csv, and test.csv share the same columns:

| Column name | Description |
|:---|:---|
| phase | The phase of the data collection. (String) |
| question | The question asked by the user. (String) |
| table | The table containing the data for the question. (String) |
| sql | The SQL query corresponding to the question. (String) |
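A minimal sketch for loading the splits and inspecting one question/SQL pair, assuming the column layout documented above:

```python
# Sketch: load the CSV splits and print one example.
import pandas as pd

splits = {name: pd.read_csv(f"{name}.csv") for name in ("train", "validation", "test")}
example = splits["train"].iloc[0]
print(example["question"])
print(example["sql"])
```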
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
Simulated dataset Pyquen_DiJet_Pt30_TuneZ2_Unquenched_Hydjet1p8_2760GeV in GEN-SIM-RECO format for 2013 collision data.
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the PbPb collision data at an energy of 2.76 TeV collected by the CMS experiment during Run 1.
Simulated pile-up event dataset MinBias_TuneZ2star_8TeV-pythia6 in GEN-SIM format. Events were sampled from this dataset and added to simulated data to make them comparable with the 2012 collision data; see the guide to pile-up simulation.
See the description of the simulated dataset names in: About CMS simulated dataset names.
Simulated dataset QCD_Pt_460_TuneZ2_5p02TeV in GEN-SIM-RECO format for 2013 collision data.
See the description of the simulated dataset names in: About CMS simulated dataset names.
These simulated datasets correspond to the pPb collision data at an energy of 5.02 TeV collected by the CMS experiment in 2013.