MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle.
datasetUrl: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl: The URL of the dataset owner's profile avatar on Kaggle.
ownerName: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl: A link to the Kaggle profile page of the dataset owner.
ownerUserId: The unique user ID of the dataset owner on Kaggle.
ownerTier: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName: The name of the dataset creator, which could be different from the owner.
creatorUrl: A link to the Kaggle profile page of the dataset creator.
creatorUserId: The unique user ID of the dataset creator.
scriptCount: The number of scripts (kernels) associated with this dataset.
scriptsUrl: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount: The number of views the dataset page has received on Kaggle.
downloadCount: The number of times the dataset has been downloaded by users.
dateCreated: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated: The date when the dataset was last updated or modified.
voteButton: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl: A direct link to download the dataset files.
newKernelNotebookUrl: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl: A link to a new script for running computations or processing data related to the dataset.
usabilityRating: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath: A reference to the path in Firestore where this dataset's metadata is stored.
datasetSlug: A URL-friendly version of the dataset name, typically used for URLs.
rank: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names: A comma-separated string of category names that represent the dataset's classification.
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way.
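Because category_names is described as a comma-separated string, it usually has to be split before categories can be counted or filtered. The following is a minimal sketch of that step; the helper name split_category_names and the sample row are hypothetical, and the assumption that names never contain commas themselves is not confirmed by the description.

```python
def split_category_names(value):
    """Split a comma-separated category string into a list of trimmed names.

    Assumes individual category names contain no commas (not confirmed
    by the dataset description).
    """
    if not value:
        return []
    return [name.strip() for name in value.split(",") if name.strip()]


# Hypothetical row, using only column names listed in the description.
row = {"datasetSlug": "example-dataset", "category_names": "Healthcare, Finance"}
categories = split_category_names(row["category_names"])
print(categories)
```

Empty or missing values yield an empty list, which keeps later counting code free of special cases.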
From the file data.csv
· Id: The user ID of the user
· Name: The actual name of the user
· screen_name: The name of the user as it appears on the online social network (OSN)
· favorite_no: The number of posts the user has favorited
· statuses_count: The number of statuses (tweets) the user has posted
· followers_count: The number of followers the account currently has
· friends_count: The number of friends (accounts the user follows) on the profile
· favourites_count: The number of posts the user has marked as favourites
· listed_count: The number of lists the account appears on
· created_at: Timestamp with the day, month, year and time the profile was created
· url: The profile URL of the user
· lang: The language the user has chosen
· time_zone: The time zone the profile is set to
· location: The location given on the profile
· default_profile: A flag stored as an integer
· default_profile_image: Whether the user still has the default profile image assigned at account creation
· geo_enabled: Whether the profile has geolocation enabled
· profile_image_url: The URL of the profile image
· profile_banner_url: The HTTPS-based URL pointing to the standard web representation of the user's uploaded profile banner
· profile_use_background_image: Whether the user's uploaded background image is used on the profile
· profile_background_image_url_https: The HTTPS URL of the user's background image
· profile_text_color: The colour code chosen by the user for the profile text
· profile_image_url_https: The HTTPS URL of the user's profile image
· profile_sidebar_border_color: The colour code of the sidebar border
· profile_background_tile: Binary flag indicating whether the background image is tiled
· profile_sidebar_fill_color: The colour code of the sidebar fill in the profile
· profile_background_image_url: The URL of the user's current profile background image
· profile_background_color: The colour code of the profile background
· profile_link_color: The colour code of the profile links
· utc_offset: The profile's offset from UTC
· protected: Usually empty in this file; indicates whether the user has chosen to protect their posts
· verified: Whether the profile is verified
· description: A short description of the profile user
· updated: The time and date when the profile was last updated
· Dataset: The label column indicating whether the account is fake: 1 indicates a fake account and 0 a real one.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the top 100 entries of the Kaggle competitions ranking. The dataset will be updated every month.
It has 100 rows and 13 columns. Column descriptions are listed below.
Data from Kaggle. Image from Smartcat.
Fake Instagram Profile Detection Dataset
This dataset is designed for training and evaluating machine learning models to detect fake Instagram profiles. It contains various features extracted from Instagram user profiles, such as username characteristics, profile descriptions, follower counts, and engagement metrics. The dataset can be used for classification tasks to identify whether a profile is real or fake.
Dataset Features
Profile Picture Availability: Whether the user has a profile picture.
Numerical Ratio in Username: The proportion of numbers in the username.
Full Name Analysis: Number of words in the full name and its numerical ratio.
Username and Name Matching: Whether the username and full name are identical.
Description Length: Number of characters in the user's bio/description.
External URL Presence: Whether the profile contains an external link.
Account Privacy: Indicates whether the profile is private or public.
Post Count: Total number of posts uploaded by the user.
Follower Count: Number of followers the user has.
Following Count: Number of accounts the user follows.
Label Information
The dataset is labeled to classify profiles as either fake (1) or real (0) based on the given features. Probability scores are also provided for classification.
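The username and full-name features above can be illustrated with a small sketch. The function name profile_features, the output keys, and the exact definitions (digit count divided by string length, whitespace-separated word count, exact string equality) are assumptions for illustration; the dataset description does not state how its columns were actually computed.

```python
def profile_features(username, full_name):
    """Compute illustrative name-based features for one profile.

    The definitions here are assumptions, not the dataset author's method.
    """
    digits_in_username = sum(ch.isdigit() for ch in username)
    digits_in_name = sum(ch.isdigit() for ch in full_name)
    return {
        # Share of characters in the username that are digits.
        "username_numeric_ratio": digits_in_username / len(username) if username else 0.0,
        # Number of whitespace-separated words in the full name.
        "fullname_word_count": len(full_name.split()),
        # Share of characters in the full name that are digits.
        "fullname_numeric_ratio": digits_in_name / len(full_name) if full_name else 0.0,
        # 1 if username and full name are identical strings, else 0.
        "name_equals_username": int(username == full_name),
    }


features = profile_features("user1234", "Jane Doe")
print(features)
```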
Use Cases
Machine learning classification models for fake profile detection.
Exploratory data analysis on Instagram profile patterns.
Feature engineering and data preprocessing for social media fraud detection.
This dataset may be used only by students and researchers working on cybersecurity, social media analysis, and artificial intelligence applications, not for production.
In today's dynamic business landscape, having access to actionable insights is paramount. The LinkedIn Professional Profiles Dataset is your gateway to a treasure trove of information, encompassing key details about professionals' careers, education, skills, and more. This dataset caters to a diverse range of business needs, including tracking talent movement, sourcing new talent, lead generation, and even serving as an unconventional investment data source.
Key Data Points:
Name: Full name of the professional.
Title: Current job title.
Position: Detailed job position within the company.
Current Company: Name of the current employing company.
Avatar: Profile picture URL for visual identification.
Experience: Chronological list of past job experiences.
Education: Educational background and institutions attended.
Location: Geographic location of the professional.
The dataset includes further fields beyond these.
With the LinkedIn dataset at your fingertips, you can generate investment signals, refine talent sourcing strategies, and improve lead generation tactics. Tailor your analyses to your specific business objectives and gain a competitive edge through data-driven decision-making.
https://creativecommons.org/publicdomain/zero/1.0/
This data set was used to train a CrowdFlower AI gender predictor. You can read all about the project here. Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.
Here are a few questions you might try to answer with this dataset:
Data was provided by the Data For Everyone Library on Crowdflower.
Our Data for Everyone library is a collection of our favorite open data jobs that have come through our platform. They're available free of charge for the community, forever.
The dataset contains the following fields:
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed while constructing the dataset are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.
Class Labels: Label 1 corresponds to a legitimate URL, label 0 to a phishing URL.
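The class counts quoted above translate directly into class shares, which is a useful check before choosing evaluation metrics. This is a small arithmetic sketch using only the two counts given in the description; the variable names are illustrative.

```python
# Counts taken from the dataset description.
legitimate = 134_850  # label 1
phishing = 100_945    # label 0

total = legitimate + phishing
legit_share = legitimate / total
phishing_share = phishing / total

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(legit_share, phishing_share)

print(f"total={total}")
print(f"legitimate={legit_share:.3f}, phishing={phishing_share:.3f}")
print(f"majority-class baseline accuracy={majority_baseline:.3f}")
```

Any classifier trained on the data should be compared against the majority-class baseline rather than against 50%.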
Citations: Prasad, A., & Chandra, S. (2023). PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning. Computers & Security, 103545. doi: https://doi.org/10.1016/j.cose.2023.103545
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The PhiUSIIL Phishing URL Dataset is a comprehensive resource designed for research and development in phishing detection systems. This dataset includes 134,850 legitimate URLs and 100,945 phishing URLs, carefully curated to provide a balanced representation for effective model training and evaluation.
The dataset primarily consists of recent and active URLs, ensuring relevance to current phishing and legitimate website characteristics. Features are extracted from both the URL structure and the source code of web pages, enabling detailed analysis and advanced feature engineering.
Key Features
CharContinuationRate: A measure of character continuation patterns in URLs.
URLTitleMatchScore: Evaluates the similarity between the URL and the webpage title.
URLCharProb: Calculates the likelihood of characters appearing in a URL based on observed probabilities.
TLDLegitimateProb: Assesses the legitimacy probability of the top-level domain (TLD).
This dataset is ideal for developing and testing machine learning models for phishing detection, feature analysis, and exploring the characteristics of malicious websites.
Applications
Training machine learning models to identify phishing URLs.
Analyzing the behavioral patterns of phishing websites.
Comparing legitimate and phishing website characteristics.
Feature engineering for cybersecurity-related tasks.
The PhiUSIIL dataset offers a rich source of information for researchers and practitioners in the domain of cybersecurity, machine learning, and web content analysis.
Reference: PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning https://www.sciencedirect.com/science/article/abs/pii/S0167404823004558?via%3Dihub
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains reviews from customers of Fashion Nova, a popular online clothing retailer. Each entry includes detailed information about the reviewer's experience with the brand, including ratings, review titles, review texts, and other relevant metadata. The data is useful for analyzing customer satisfaction, understanding sentiment, and identifying trends in user feedback over time.
Column Descriptions
Reviewer Name: The name or pseudonym of the reviewer. This helps in identifying different reviewers and analyzing their reviews individually.
Profile Link: A link to the reviewer's profile. This could be used to gather more information about the reviewer if publicly available, or for linking purposes.
Country: The country of the reviewer. This information is useful for understanding geographic distribution and regional preferences.
Review Count: The number of reviews submitted by the reviewer. It provides insight into whether the reviewer is a frequent user or a first-time reviewer.
Review Date: The timestamp indicating when the review was posted. This is important for time-series analysis and tracking changes in sentiment over time.
Rating: The rating given by the reviewer, expressed as a number of stars (e.g., "Rated 5 out of 5 stars"). It is a quantitative measure of satisfaction.
Review Title: A brief title summarizing the review. This can be used to quickly gauge the sentiment or topic of the review.
Review Text: The full text of the review. This field provides detailed feedback and opinions from the customer.
Date of Experience: The date when the customer experienced the service or product. It helps correlate the review date with the actual date of the customer experience.
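Since the Rating column is described as text of the form "Rated 5 out of 5 stars", it has to be converted to a number before any averaging or trend analysis. A minimal sketch of that conversion follows; the helper name parse_rating is hypothetical, and the pattern assumes every rating follows the one example format given above.

```python
import re

# Matches strings such as "Rated 5 out of 5 stars".
RATING_PATTERN = re.compile(r"Rated\s+(\d+)\s+out of\s+(\d+)\s+stars")


def parse_rating(text):
    """Return the star rating as an int, or None if the text does not match."""
    match = RATING_PATTERN.search(text or "")
    if match is None:
        return None
    return int(match.group(1))


print(parse_rating("Rated 5 out of 5 stars"))
print(parse_rating("no rating given"))
```

Returning None for unmatched text makes malformed rows easy to count and inspect instead of silently dropping them.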
Ways to Use the Data
Sentiment Analysis: By analyzing the Review Text and Review Title, you can perform sentiment analysis to understand the overall customer sentiment towards Fashion Nova.
Customer Satisfaction Tracking: Use the Rating column to monitor customer satisfaction levels and identify trends over time or across different countries.
Market Segmentation: The Country column can help in identifying trends and preferences in different regions, aiding targeted marketing strategies.
Customer Engagement Analysis: The Review Count can be used to identify frequent reviewers, potentially loyal customers, or brand advocates, which can help in developing customer engagement strategies.
Temporal Analysis: With the Review Date and Date of Experience, you can analyze how customer feedback varies over time, which is useful for understanding the impact of specific events or promotions.
Increasing Usability Score on Kaggle
Add Metadata: Include clear and detailed metadata with descriptions for each column, as provided above. This makes it easier for users to understand and utilize the dataset.
Provide a README File: A README file that explains the dataset, its origin, and possible use cases can significantly increase its usability.
Data Cleaning and Preprocessing: Ensure that the data is clean, consistent, and well-formatted. Handling missing values, standardizing date formats, and ensuring consistent text encoding will make the dataset more user-friendly.
Add Tags and Keywords: Use relevant tags and keywords on Kaggle to improve discoverability. Examples include "Fashion Nova," "customer reviews," "sentiment analysis," "e-commerce," and "retail analytics."
Sample Code or Notebook: Provide a Jupyter notebook with examples of how to use the dataset, including visualizations, analysis scripts, and initial findings. This can guide users on how to extract meaningful insights from the data.
Potential Use Cases
Product Improvement: Identify common issues or praises from reviews to provide feedback to product development teams.
Customer Service Analysis: Understand the quality of customer service and areas that require improvement.
Marketing Insights: Use the data to tailor marketing campaigns based on customer preferences and satisfaction levels.
Competitor Analysis: Compare reviews with competitors to identify strengths and weaknesses.
To gather customer reviews from Fashion Nova on Trustpilot, a web scraping approach was employed. The process involved the following steps:
Web Scraping Setup:
Libraries Used: The scraping was carried out using Python's requests library to fetch web pages and BeautifulSoup from the bs4 library to parse HTML content.
Scraping Process:
Page Iteration: The function scrape_pages(start_page, end_page) was used to iterate over a range of pages on Trustpilot, where each page contains multiple reviews.
URL Formation: For each page, the URL was dynamically constructed based on the page number using the format https://www.trustpilot.com/review/www.fashionnova.com?page={page_number}.
Request Handling: An HTTP GET request was sent to fetch the content of the page. The response was then parsed...
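The URL formation step can be sketched without making any network requests. The helper build_page_urls below is hypothetical and is not the author's scrape_pages function; treating end_page as inclusive is also an assumption, since the description does not say.

```python
# URL format quoted in the description above.
BASE_URL = "https://www.trustpilot.com/review/www.fashionnova.com"


def build_page_urls(start_page, end_page):
    """Return one review-page URL per page number.

    end_page is treated as inclusive here, which is an assumption.
    """
    return [
        f"{BASE_URL}?page={page_number}"
        for page_number in range(start_page, end_page + 1)
    ]


for url in build_page_urls(1, 3):
    print(url)
```

Each URL would then be fetched and parsed, as the description outlines; that part is omitted here because the original parsing details are cut off.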
https://creativecommons.org/publicdomain/zero/1.0/
The data contains ranks of Experts, Masters, and Grandmasters of the Discussion Tier. You can apply EDA to this data and see which country has the highest-ranked Kaggle users.
Name: User name.
Rank: User rank on the leaderboard.
Level: User level, whether a user is an expert, master, or grandmaster.
Link: Profile link.
Gold: Total number of gold medals a user got.
Silver: Total number of silver medals a user got.
Bronze: Total number of bronze medals a user got.
Points: Total number of points a user got.
Joined: Year/Month joined.
Total Competitions: Total number of competitions a user participated in.
Total Dataset: Total number of datasets a user uploaded.
Total Codes: Total number of codes/notebooks a user uploaded.
Total Discussion: Total number of discussions a user had.
Highest Rank: Highest rank hit by a user.
Current Rank: User's current rank.
Current Level: Whether a user's level is expert, master, or grandmaster in one of the four tiers.
City: User's city.
State: User's state.
Country: User's country.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Welcome to the Google Places Comprehensive Business Dataset! This dataset has been meticulously scraped from Google Maps and presents extensive information about businesses across several countries. Each entry in the dataset provides detailed insights into business operations, location specifics, customer interactions, and much more, making it an invaluable resource for data analysts and scientists looking to explore business trends, geographic data analysis, or consumer behaviour patterns.
This dataset is ideal for a variety of analytical projects, including:
- Market Analysis: Understand business distribution and popularity across different regions.
- Customer Sentiment Analysis: Explore relationships between customer ratings and business characteristics.
- Temporal Trend Analysis: Analyze patterns of business activity throughout the week.
- Geospatial Analysis: Integrate with mapping software to visualise business distribution or cluster businesses based on location.
The dataset contains 46 columns, providing a thorough profile for each listed business. Key columns include:
business_id: A unique Google Places identifier for each business, ensuring distinct entries.
phone_number: The contact number associated with the business. It provides a direct means of communication.
name: The official name of the business as listed on Google Maps.
full_address: The complete postal address of the business, including locality and geographic details.
latitude: The geographic latitude coordinate of the business location, useful for mapping and spatial analysis.
longitude: The geographic longitude coordinate of the business location.
review_count: The total number of reviews the business has received on Google Maps.
rating: The average user rating out of 5 for the business, reflecting customer satisfaction.
timezone: The world timezone the business is located in, important for temporal analysis.
website: The official website URL of the business, providing further information and contact options.
category: The category or type of service the business provides, such as restaurant, museum, etc.
claim_status: Indicates whether the business listing has been claimed by the owner on Google Maps.
plus_code: A sho...
Context
A Twitter dataset composed of 20,000 rows, Twitter User Data includes the following information: user name, random tweet, account profile, image, and location information.
Content
The dataset contains the following fields:
unit_id: a unique id for user
golden: whether the user was included in the gold standard for the model; TRUE or FALSE
unit_state: state of the observation; one of finalized (for contributor-judged) or golden (for gold standard observations)
trusted_judgments: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations
last_judgment_at: date and time of last contributor judgment; blank for gold standard observations
gender: one of male, female, or brand (for non-human profiles)
gender:confidence: a float representing confidence in the provided gender
profile_yn: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it
profile_yn:confidence: confidence in the existence/non-existence of the profile
created: date and time when the profile was created
description: the user's profile description
fav_number: number of tweets the user has favorited
gender_gold: if the profile is golden, what is the gender?
link_color: the link color on the profile, as a hex value
name: the user's name
profile_yn_gold: whether the profile y/n value is golden
profileimage: a link to the profile image
retweet_count: number of times the user has retweeted (or possibly, been retweeted)
sidebar_color: color of the profile sidebar, as a hex value
text: text of a random one of the user's tweets
tweet_coord: if the user has location turned on, the coordinates as a string with the format "[latitude, longitude]"
tweet_count: number of tweets that the user has posted
tweet_created: when the random tweet (in the text column) was created
tweet_id: the tweet id of the random tweet
tweet_location: location of the tweet; seems to not be particularly normalized
user_timezone: the timezone of the user
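The tweet_coord field is described as a string in the format "[latitude, longitude]", so it needs parsing before any mapping work. The sketch below assumes the string is valid JSON-style list syntax, which the description suggests but does not guarantee; the helper name parse_tweet_coord is hypothetical.

```python
import json


def parse_tweet_coord(value):
    """Parse a "[latitude, longitude]" string into a (lat, lon) tuple.

    Returns None for empty or unparseable values.
    """
    if not value:
        return None
    try:
        latitude, longitude = json.loads(value)
        return float(latitude), float(longitude)
    except (ValueError, TypeError):
        # Covers malformed text, wrong element counts, and non-numeric items.
        return None


print(parse_tweet_coord("[40.7, -74.0]"))
print(parse_tweet_coord(""))
```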
Acknowledgements
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Facebook is becoming an essential tool for more than just family and friends. Discover how Cheltenham Township (USA), a diverse community just outside of Philadelphia, deals with major issues such as the Bill Cosby trial, everyday traffic issues, sewer I/I problems and lost cats and dogs. And yes, theft.
Communities work when they're connected and exchanging information. What and who are the essential forces making a positive impact, and when and how do conversational threads get directed or misdirected?
Use Any Facebook Public Group
You can leverage the examples here for any public Facebook group. For an example of the source code used to collect this data, and a quick start docker image, take a look at the following project: facebook-group-scrape.
Data Sources
There are 4 csv files in the dataset, with data from the following 5 public Facebook groups:
post.csv
These are the main posts you will see on the page. It might help to take a quick look at the page. Commas in the msg field have been replaced with {COMMA}, and apostrophes have been replaced with {APOST}.
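The placeholder substitution described for the msg field can be reversed after loading the CSV. A minimal sketch follows; the helper name restore_msg and the sample message are hypothetical, while the {COMMA} and {APOST} tokens come from the description.

```python
def restore_msg(text):
    """Turn the {COMMA} and {APOST} placeholders back into punctuation."""
    return text.replace("{COMMA}", ",").replace("{APOST}", "'")


sample = "Lost cat{COMMA} it{APOST}s orange"  # made-up message for illustration
print(restore_msg(sample))
```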
comment.csv
These are comments to the main post. Note, Facebook postings have comments, and comments on comments.
like.csv
These are likes and responses. The two keys in this file (pid,cid) will join to post and comment respectively.
member.csv
These are all the members in the group. Some members never, or rarely, post or comment. You may find multiple entries in this table for the same person. The name of the individual never changes, but they change their profile picture. Each profile picture change is captured in this table. Facebook gives users a new id in this table when they change their profile picture.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Some of these records were flagged false by existing algorithms.
Further approaches could be used to feature engineer properties that could further strengthen the fraud detection algorithms as well as find out where the existing algorithm lacks.
CASH-IN: is the process of increasing the balance of account by paying in cash to a merchant.
CASH-OUT: is the opposite process of CASH-IN, it means to withdraw cash from a merchant which decreases the balance of the account.
DEBIT: is a similar process to CASH-OUT and involves sending the money from the mobile money service to a bank account.
PAYMENT: is the process of paying for goods or services to merchants which decreases the balance of the account and increases the balance of the receiver.
TRANSFER: is the process of sending money to another user of the service through the mobile money platform
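The effect of each transaction type on the originating account's balance can be summarised as a sign table. This is an illustrative sketch only: the directions for CASH-IN, CASH-OUT, and PAYMENT are stated above, while the negative sign for DEBIT and TRANSFER is inferred from "sending money" and is an assumption. The names BALANCE_DIRECTION and apply_transaction are hypothetical.

```python
# +1 means the originating account's balance increases, -1 that it decreases.
BALANCE_DIRECTION = {
    "CASH-IN": +1,   # stated: increases the balance
    "CASH-OUT": -1,  # stated: decreases the balance
    "PAYMENT": -1,   # stated: decreases the payer's balance
    "DEBIT": -1,     # assumption: money leaves for a bank account
    "TRANSFER": -1,  # assumption: money is sent to another user
}


def apply_transaction(balance, tx_type, amount):
    """Return the originating account's balance after one transaction."""
    return balance + BALANCE_DIRECTION[tx_type] * amount


print(apply_transaction(100.0, "CASH-IN", 25.0))
print(apply_transaction(100.0, "CASH-OUT", 40.0))
```

Comparing an expected balance computed this way with a recorded balance is one way to engineer a consistency feature for fraud detection.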
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset simulates realistic financial transaction patterns and was generated with Python code for the purpose of developing and testing fraud detection models. The dataset was generated to mimic a wide range of transactional scenarios across multiple categories, including retail, grocery, dining, travel, and more, making it ideal for exploring patterns that distinguish legitimate transactions from fraudulent ones.
Financial fraud is an increasingly prevalent issue, with organizations constantly seeking advanced solutions to detect and prevent suspicious activity. This dataset was inspired by real-world transaction data but was generated synthetically to avoid privacy concerns. It includes key features that play a critical role in fraud detection, such as transaction amounts, device types, geographic locations, currency, card type, and a "fraud" label indicating whether a transaction is suspicious.
Comprehensive Transaction Categories: Transactions span categories like retail (online and in-store), groceries, restaurants (fast food to premium), entertainment (streaming, gaming, events), healthcare, education, gas, and travel.
Geographic and Demographic Variety: The dataset includes diverse geographic data (countries, cities) and currency types, allowing for analysis on a global scale with varying risk profiles.
Detailed Customer Profiles: Each transaction is linked to a customer profile that includes characteristics like account age, preferred devices, typical spending range, and fraud-protection features.
Feature-Rich Data for ML and Fraud Analysis: Features like transaction velocity, merchant risk, card presence, and device fingerprints provide an enriched environment for machine learning models to detect anomalies and suspicious patterns.
Use Cases:
This dataset is designed for data scientists, analysts, and machine learning practitioners interested in: Building and training fraud detection models. Exploring financial transaction patterns and consumer behaviors. Developing and testing machine learning algorithms for anomaly detection. With this dataset, users can delve into advanced topics like feature engineering, model evaluation, and performance optimization, especially relevant to fraud detection applications in finance and e-commerce.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset is aggregated from sources such as
Entirely available in the public domain.
Resumes are usually in PDF format. OCR was used to convert each PDF into text, and LLMs were used to convert the text into a structured format.
This dataset contains structured information extracted from professional resumes, normalized into multiple related tables. The data includes personal information, educational background, work experience, professional skills, and abilities.
Primary table containing core information about each individual.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Unique identifier for each person | Primary Key, Not Null | 1 |
| name | VARCHAR(255) | Full name of the person | May be Null | "Database Administrator" |
| | VARCHAR(255) | Email address | May be Null | "john.doe@email.com" |
| phone | VARCHAR(50) | Contact number | May be Null | "+1-555-0123" |
| | VARCHAR(255) | LinkedIn profile URL | May be Null | "linkedin.com/in/johndoe" |
Detailed abilities and competencies listed by individuals.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| ability | TEXT | Description of ability | Not Null | "Installation and Building Server" |
Contains educational history for each person.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| institution | VARCHAR(255) | Name of educational institution | May be Null | "Lead City University" |
| program | VARCHAR(255) | Degree or program name | May be Null | "Bachelor of Science" |
| start_date | VARCHAR(7) | Start date of education | May be Null | "07/2013" |
| location | VARCHAR(255) | Location of institution | May be Null | "Atlanta, GA" |
Details of work experience entries.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| title | VARCHAR(255) | Job title | May be Null | "Database Administrator" |
| firm | VARCHAR(255) | Company name | May be Null | "Family Private Care LLC" |
| start_date | VARCHAR(7) | Employment start date | May be Null | "04/2017" |
| end_date | VARCHAR(7) | Employment end date | May be Null | "Present" |
| location | VARCHAR(255) | Job location | May be Null | "Roswell, GA" |
Mapping table connecting people to their skills.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| person_id | INTEGER | Reference to people table | Foreign Key, Not Null | 1 |
| skill | VARCHAR(255) | Reference to skills table | Foreign Key, Not Null | "SQL Server" |
Master list of unique skills mentioned across all resumes.
| Column Name | Data Type | Description | Constraints | Example |
|---|---|---|---|---|
| skill | VARCHAR(255) | Unique skill name | Primary Key, Not Null | "SQL Server" |
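As a rough illustration of how these tables relate, the sketch below loads a few rows into an in-memory SQLite database and queries them. The rows, the choice of SQLite, and the helper names `skill_counts` and `work_history` are assumptions made for this example and are not part of the dataset; only the table and column names follow this description.

```python
import sqlite3

# In-memory database with a reduced version of the schema (SQLite is an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, name VARCHAR(255));
    CREATE TABLE skills (skill VARCHAR(255) PRIMARY KEY);
    CREATE TABLE person_skills (
        person_id INTEGER NOT NULL REFERENCES people(person_id),
        skill VARCHAR(255) NOT NULL REFERENCES skills(skill)
    );
    CREATE TABLE experience (
        person_id INTEGER NOT NULL REFERENCES people(person_id),
        title VARCHAR(255),
        start_date VARCHAR(7)
    );
    """
)

# Hypothetical rows, invented purely for illustration.
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, "Person A"), (2, "Person B")])
conn.executemany("INSERT INTO skills VALUES (?)", [("SQL Server",), ("Python",)])
conn.executemany(
    "INSERT INTO person_skills VALUES (?, ?)",
    [(1, "SQL Server"), (1, "Python"), (2, "SQL Server")],
)
conn.executemany(
    "INSERT INTO experience VALUES (?, ?, ?)",
    [
        (1, "Intern", "12/2014"),
        (1, "Analyst", "11/2015"),
        (1, "Database Administrator", "04/2017"),
    ],
)


def skill_counts(connection):
    """Count how many people list each skill, most frequent first."""
    return connection.execute(
        "SELECT skill, COUNT(*) AS frequency FROM person_skills "
        "GROUP BY skill ORDER BY frequency DESC, skill"
    ).fetchall()


def work_history(connection, person_id):
    """Return a person's jobs newest first.

    start_date is 'MM/YYYY' text, so it is sorted by year, then month,
    rather than as a plain string.
    """
    return connection.execute(
        "SELECT title, start_date FROM experience WHERE person_id = ? "
        "ORDER BY substr(start_date, 4, 4) DESC, substr(start_date, 1, 2) DESC",
        (person_id,),
    ).fetchall()


print(skill_counts(conn))
print(work_history(conn, 1))
```

Sorting on the year and month substrings matters because a plain string sort of `MM/YYYY` values orders rows by month first.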
-- Get all skills for a person
SELECT s.skill
FROM person_skills ps
JOIN skills s ON ps.skill = s.skill
WHERE ps.person_id = 1;
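-- Get complete work history, newest first
-- (start_date is 'MM/YYYY' text, so sort by year, then month)
SELECT *
FROM experience
WHERE person_id = 1
ORDER BY SUBSTR(start_date, 4, 4) DESC, SUBSTR(start_date, 1, 2) DESC;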
-- Most common skills
SELECT s.skill, COUNT(*) as frequency
FROM person_skills ps
...
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Advanced OSINT Public Profiles Dataset (Synthetic)
Overview
This dataset contains 2,000 synthetic public profile records generated for open-source intelligence (OSINT) research, cybersecurity education, and red team simulation. It mimics realistic personal, professional, and breach-related information typically found through OSINT tools and techniques.
It is 100% synthetic: no real individuals or private data were used.
| Column Name | Description |
|---|---|
| Name | Full name of the synthetic individual |
| Username | Commonly used username |
| Email | Generated email address |
| Phone | Randomly formatted phone number |
| Twitter | Simulated Twitter profile link |
| LinkedIn | Simulated LinkedIn profile link |
| Domain | Domain name associated with the person |
| Location | City and country |
| Job_Title | Profession or role |
| Company | Employer or organization |
| IP_Address | Public IPv4 address |
| MAC_Address | Synthetic MAC address |
| Breached | Indicates whether their data was breached |
| Breach_Source | Known breach source (LinkedIn, Dropbox, etc.) |
| Breach_Year | Year of breach (if applicable) |
| Password_Strength | Simulated password strength: Weak, Moderate, or Strong |
| Public_Pastebin | Whether their data appeared on a pastebin (Yes/No) |
Use Cases
You can use this dataset for:
- OSINT Reconnaissance Practice
- Identity Risk Scoring Systems
- Cybersecurity Education & Red Team Simulations
- NLP & Fuzzy Matching for Entity Resolution
- Network Graphs of Breached Users
- Training AI models for fake profile detection
- Demonstrating recon tools and dashboards
License
This dataset is licensed under the Creative Commons CC0 1.0 Public Domain Dedication.
Feel free to use it in your academic projects, machine learning models, blogs, or demos, with or without attribution.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Welcome to the Enhanced Saudi Arabian Oil Company (Aramco) Stock Dataset! This dataset has been meticulously prepared from Yahoo Finance and further enriched with several engineered features to elevate your data analysis, machine learning, and financial forecasting projects. It captures the daily trading figures of Aramco stocks, presented in Saudi Riyal (SAR), providing a robust foundation for comprehensive market analysis.
- Date: The trading day for the data recorded (ISO 8601 format).
- Open: The price at which the stock first traded upon the opening of an exchange on a given trading day.
- High: The highest price at which the stock traded during the trading day.
- Low: The lowest price at which the stock traded during the trading day.
- Close: The price at which the stock last traded upon the close of an exchange on a given trading day.
- Volume: The total number of shares traded during the trading day.
- Dividends: The dividend value paid out per share on the trading day.
- Stock Splits: The number of stock splits occurring on the trading day.
- Lag Features (Lag_Close, Lag_High, Lag_Low): Previous day's closing, highest, and lowest prices.
- Rolling Window Statistics (e.g., Rolling_Mean_7, Rolling_Std_7): 7-day and 30-day moving averages and standard deviations of the Close price.
- Technical Indicators (RSI, MACD, Bollinger Bands): Key metrics used in trading to analyze short-term price movements.
- Change Features (Change_Close, Change_Volume): Day-over-day changes in Close price and trading volume.
- Date-Time Features (Weekday, Month, Year, Quarter): Extracted components of the trading day.
- Volume_Normalized: The standardized trading volume using z-score normalization to adjust for scale differences.
This dataset is tailored for a wide array of applications:
- Financial Analysis: Explore historical performance, volatility, and market trends.
- Forecasting Models: Utilize features like lagged prices and rolling statistics to predict future stock prices.
- Machine Learning: Develop regression models or classification frameworks to predict market movements.
- Deep Learning: Leverage LSTM networks for more sophisticated time-series forecasting.
- Time-Series Analysis: Dive deep into trend analysis, seasonality, and cyclical behavior of stock prices.
Whether you are a data scientist, a financial analyst, or a hobbyist interested in the stock market, this dataset provides a rich playground for analysis and model building. Its comprehensive feature set allows for the development of robust predictive models and offers unique insights into one of the world's most significant oil companies. Unlock the potential of financial data with this carefully crafted dataset.
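To make the engineered features listed above more concrete, here is a minimal sketch of how lag, rolling-window, change, and z-score columns could be derived from a list of closing prices. It is not the dataset author's code: the function names are hypothetical, the sample prices are made up, and the use of the population standard deviation for the z-score is an assumption.

```python
from statistics import mean, pstdev


def lag(values, k=1):
    """Shift a series forward by k days; the first k entries have no history."""
    return ([None] * k + list(values))[: len(values)]


def rolling(values, window, stat):
    """Apply stat to each trailing window; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(stat(values[i + 1 - window : i + 1]))
    return out


def day_over_day_change(values):
    """Difference between each value and the previous one."""
    if not values:
        return []
    return [None] + [today - prev for prev, today in zip(values, values[1:])]


def zscore(values):
    """Standardize a series (assumes the values are not all identical)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up closing prices, for illustration only.
close = [10.0, 11.0, 13.0, 12.0]
print(lag(close))
print(rolling(close, 2, mean))
print(day_over_day_change(close))
print(zscore(close))
```

A window of 2 keeps the example short; the dataset description mentions 7-day and 30-day windows, which would correspond to `rolling(close, 7, mean)` and `rolling(close, 30, mean)` on a longer series.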
License: Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
Data was scraped from GitHub's API.
- LOGIN: the user's GitHub login
- ID: the user's ID
- URL: API link to the user's profile
- NAME: full name of the user
- COMPANY: organization the user is affiliated with
- BLOG: link to the user's blog site
- LOCATION: location where the user resides
- EMAIL: the user's email address
- BIO: about the user
This dataset contains over 600 users from Lagos, Nigeria and Rwanda.
Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
- The dataset does not contain any sensitive information. The information has not been hacked. Do not believe the fake news.
- I do not have an account on RaidForums.
Clubhouse (joinclubhouse.com) is a social networking app that lets people gather in audio chat rooms to discuss various topics. Currently only the iOS version is available and membership is by invitation only. When you invite someone to Clubhouse, the profile of the user you invited shows "nominated by YOUR_NAME". As a data scientist, I was interested in extracting the hierarchical structure of invitations. In this link you can see an example of this tree structure. I slightly changed the code in this github repository in order to extract data. Clubhouse's rate limit is terrible! Even while you are using the app, if you refresh the page several times, you will be blocked for a few minutes. Therefore, I sent requests to the Clubhouse server every 5.65 seconds. Only a crazy data scientist can cut the mustard.
This version of the dataset (v1) contains 1,300,515 Clubhouse user profiles. You can see how to use this dataset in the code section. In summary, each row shows a user's profile information, including:
- user-id
- name
- photo-url
- username
- twitter
- Instagram
- num-followers
- num-following
- time-created
- invited-by-user-profile
This version of the dataset (v2) contains 3,469,520 Clubhouse user profiles.
This version of the dataset (v3) contains 4,838,345 Clubhouse user profiles. In summary, each row shows a user's profile information, including:
- user-id
- name
- photo-url
- username
- twitter
- Instagram
- num-followers
- num-following
- time-created
- invited-by-user-profile
- invited-by-club
In this version of the dataset, a new column called invited-by-club shows which users were invited by a club. Additionally, a new table called club has been added. Each row shows information about a club, including:
- club-id (same as invited-by-club in the user table)
- name
- description
- photo-url
- num-members
- num-followers
- enable-private
- is-follow-allowed
- is-membership-private
- is-community
- rules
- url
This version of the dataset (v4) contains 6,188,441 user profiles, as well as 4,974 club information records.
This version of the dataset (v5) contains 8,427,058 user profiles, as well as 8,520 club information records.
This version of the dataset (v6) contains 9,794,022 user profiles, as well as 12,375 club information records.
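The invitation hierarchy mentioned above can be rebuilt from the user-id and invited-by-user-profile columns. The sketch below is one possible way to do it, assuming invited-by-user-profile holds the inviter's user-id (or is empty for users without a recorded inviter); that reading of the column, the function names, and the sample rows are all assumptions, not something the dataset description confirms.

```python
from collections import defaultdict


def build_invite_tree(rows):
    """Map each inviter's user-id to the list of user-ids they invited.

    rows is an iterable of (user_id, invited_by) pairs; invited_by is None
    when no inviter is recorded.
    """
    children = defaultdict(list)
    for user_id, invited_by in rows:
        children[invited_by].append(user_id)
    return children


def count_descendants(children, user_id):
    """Count everyone reachable from user_id by following invitations."""
    total = 0
    stack = list(children.get(user_id, []))
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(children.get(node, []))
    return total


# Hypothetical (user-id, invited-by-user-profile) pairs, for illustration only.
rows = [(1, None), (2, 1), (3, 1), (4, 2)]
tree = build_invite_tree(rows)
print(count_descendants(tree, 1))
```

An explicit stack is used instead of recursion because invitation chains across millions of profiles could be deep.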
To-do list:
- Some user IDs do not exist because the Clubhouse server sometimes does not respond, and also because some users have not yet been invited. In the next update, the missing user IDs will be scanned again.
- Some users have been invited to Clubhouse by clubs. A new column called invited_by_club will be added in the next update.
I will update this dataset over time. Subscribe (https://t.me/Clubhouse_Dataset) to be informed of updates. Twitter: @VahidBaghi95
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle.
datasetUrl: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl: The URL of the dataset owner's profile avatar on Kaggle.
ownerName: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl: A link to the Kaggle profile page of the dataset owner.
ownerUserId: The unique user ID of the dataset owner on Kaggle.
ownerTier: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName: The name of the dataset creator, which could be different from the owner.
creatorUrl: A link to the Kaggle profile page of the dataset creator.
creatorUserId: The unique user ID of the dataset creator.
scriptCount: The number of scripts (kernels) associated with this dataset.
scriptsUrl: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount: The number of views the dataset page has received on Kaggle.
downloadCount: The number of times the dataset has been downloaded by users.
dateCreated: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated: The date when the dataset was last updated or modified.
voteButton: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl: A direct link to download the dataset files.
newKernelNotebookUrl: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl: A link to a new script for running computations or processing data related to the dataset.
usabilityRating: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath: A reference to the path in Firestore where this dataset's metadata is stored.
datasetSlug: A URL-friendly version of the dataset name, typically used for URLs.
rank: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names: A comma-separated string of category names that represent the dataset's classification.
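Since category_names is described as a comma-separated string, a small helper can turn it into a list before filtering or grouping. The sketch below also derives a downloads-per-view ratio from viewCount and downloadCount; both helper names are hypothetical, the sample values are made up, and the ratio is only one possible way to compare datasets, not a metric defined by this dataset.

```python
def split_categories(category_names):
    """Split a comma-separated category string into a clean list of names."""
    if not category_names:
        return []
    return [part.strip() for part in category_names.split(",") if part.strip()]


def download_rate(view_count, download_count):
    """Downloads per view, or None when there are no views to divide by."""
    if not view_count:
        return None
    return download_count / view_count


# Made-up values, for illustration only.
print(split_categories("Healthcare, Finance"))
print(download_rate(200, 50))
```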
This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way.