100+ datasets found

Kaggle Dataset Metadata Repository
kaggle.com
zip
Updated Nov 16, 2024
Cite
Ijaj Ahmed (2024). Kaggle Dataset Metadata Repository [Dataset]. https://www.kaggle.com/datasets/ijajdatanerd/kaggle-dataset-metadata-repository
Explore at:
zip (5122110 bytes)
Dataset updated
Nov 16, 2024
Authors
Ijaj Ahmed
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Kaggle Dataset Metadata Collection 📊

This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle. 📚

Dataset Overview:

Purpose: To provide detailed insights into Kaggle dataset metadata.

Content: Information related to the dataset's owner, creator, usage metrics, licensing, and more.

Target Audience: Data scientists, Kaggle competitors, and dataset curators.

Columns Description 📋

datasetUrl 🌐: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.

ownerAvatarUrl 🖼️: The URL of the dataset owner's profile avatar on Kaggle.

ownerName 👤: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.

ownerUrl 🌍: A link to the Kaggle profile page of the dataset owner.

ownerUserId 💼: The unique user ID of the dataset owner on Kaggle.

ownerTier 🎖️: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.

creatorName 👩‍💻: The name of the dataset creator, which could be different from the owner.

creatorUrl 🌍: A link to the Kaggle profile page of the dataset creator.

creatorUserId 💼: The unique user ID of the dataset creator.

scriptCount 📜: The number of scripts (kernels) associated with this dataset.

scriptsUrl 🔗: A link to the scripts (kernels) page for the dataset, where you can explore related code.

forumUrl 💬: The URL to the discussion forum for this dataset, where users can ask questions and share insights.

viewCount 👀: The number of views the dataset page has received on Kaggle.

downloadCount ⬇️: The number of times the dataset has been downloaded by users.

dateCreated 📅: The date when the dataset was first created and uploaded to Kaggle.

dateUpdated 🔄: The date when the dataset was last updated or modified.

voteButton 👍: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.

categories 🏷️: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").

licenseName 🛡️: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").

licenseShortName 🔑: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).

datasetSize 📦: The size of the dataset in terms of storage, typically measured in MB or GB.

commonFileTypes 📂: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).

downloadUrl ⬇️: A direct link to download the dataset files.

newKernelNotebookUrl 📝: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.

newKernelScriptUrl 💻: A link to a new script for running computations or processing data related to the dataset.

usabilityRating 🌟: A rating or score representing how usable the dataset is, based on user feedback.

firestorePath 🔍: A reference to the path in Firestore where this dataset’s metadata is stored.

datasetSlug 🏷️: A URL-friendly version of the dataset name, typically used for URLs.

rank 📈: The dataset's rank based on certain metrics (e.g., downloads, votes, views).

datasource 🌐: The source or origin of the dataset (e.g., government data, private organizations).

medalUrl 🏅: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.

hasHashLink 🔗: Indicates whether the dataset has a hash link for verifying data integrity.

ownerOrganizationId 🏢: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.

totalVotes 🗳️: The total number of votes the dataset has received from users, reflecting its popularity or quality.

category_names 📑: A comma-separated string of category names that represent the dataset’s classification.

This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way. 🌍📊
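
As a rough illustration of working with the columns described above, the sketch below parses the comma-separated category_names field and derives a downloads-per-view rate. The two sample rows are made up for the example, and the assumption that viewCount and downloadCount are stored as plain integers has not been checked against the actual file.

```python
import csv
import io

# Two invented rows using a few of the column names described above.
SAMPLE = """datasetSlug,viewCount,downloadCount,category_names
example-a,1000,250,"Healthcare, Finance"
example-b,400,20,Finance
"""

def load_rows(text):
    """Parse CSV text into dicts with numeric counts and a list of categories."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        views = int(raw["viewCount"])
        downloads = int(raw["downloadCount"])
        rows.append({
            "slug": raw["datasetSlug"],
            "views": views,
            "downloads": downloads,
            # Guard against zero views to avoid division by zero.
            "download_rate": downloads / views if views else 0.0,
            # Split the comma-separated string and drop surrounding spaces.
            "categories": [c.strip() for c in raw["category_names"].split(",") if c.strip()],
        })
    return rows

rows = load_rows(SAMPLE)
```

The same function could be pointed at the real CSV by reading the file contents into a string first; column types may need adjusting if the counts contain separators or missing values.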
Genuine/Fake User Profile Dataset
kaggle.com
zip
Updated Aug 9, 2020
Cite
Whose Aspects (2020). Genuine/Fake User Profile Dataset [Dataset]. https://www.kaggle.com/whoseaspects/genuinefake-user-profile-dataset
Explore at:
zip (436543 bytes)
Dataset updated
Aug 9, 2020
Authors
Whose Aspects
Description
From the file data.csv

· Id: The user's ID

· Name: The user's actual name

· screen_name: The user's name as it appears on the online social network (OSN)

· favorite_no: The number of posts the user has favorited

· statuses_count: The number of times the user has changed their status

· followers_count: The number of followers the account currently has

· friends_count: The number of friends the user has in the profile

· favourites_count: The number of favourite friends in the list

· listed_count: The count of the listed posts in the account

· created_at: The timestamp contains the day, month, year and time the profile was created

· url: Contains the profile URL of the user

· lang: Contains the language that the user has chosen

· time_zone: Contains the information of the time zone the profile is in

· location: Contains the location the profile was created at

· default_profile: Contains basic integer values

· default_profile_image: Contains information if the user still has the default profile image which was given during the account creation

· geo_enabled: Contains information if the profile is geographically enabled

· profile_image_url: Contains the information of the profile image URL

· profile_banner_url: The HTTPS-based URL pointing to the standard web representation of the user’s uploaded profile banner.

· profile_use_background_image: URL pointing at the user's background image

· profile_background_image_url_https: The HTTPS URL of the user's background image

· profile_text_color: The colour code chosen by the user for the profile information

· profile_image_url_https: The HTTPS-based URL of the user's profile image

· profile_sidebar_border_color: The colour code of the sidebar border

· profile_background_tile: Binary flag indicating whether the background is tiled

· profile_sidebar_fill_color: Contains the colour code of the sidebar in the profile

· profile_background_image_url: Contains the current profile background image URL of the user

· profile_background_color: Contains the colour code of the profile background

· profile_link_color: Contains the colour code of the profile link

· utc_offset: UTC offset mainly contains the geographical time and zone code

· protected: Indicates whether the user has chosen to protect their posts; ideally there is no data in this column

· verified: Whether the profile is verified or not

· description: A short description of the profile user

· updated: The date and time when the profile was last updated

· Dataset: The label column, which indicates whether the account is fake or not. 1 indicates that the account is fake and 0 indicates that the account is real.
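
The labelling convention above (1 for fake, 0 for real, stored in the Dataset column) can be sketched as a small feature/label split. The sample rows are invented, and only a few of the listed columns are included to keep the example short.

```python
import csv
import io

# Invented rows; only a handful of the documented columns are shown.
SAMPLE = """Id,followers_count,friends_count,Dataset
1,120,80,0
2,3,900,1
3,0,0,1
"""

def split_features_and_labels(text, label_column="Dataset"):
    """Separate the label column (1 = fake, 0 = real) from the other columns."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        labels.append(int(row.pop(label_column)))
        features.append(row)
    return features, labels

features, labels = split_features_and_labels(SAMPLE)
# Fraction of accounts labelled as fake in the sample.
fake_share = sum(labels) / len(labels)
```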
Kaggle Competitions Top 100
kaggle.com
zip
Updated May 1, 2022
Cite
Vivo Vinco (2022). Kaggle Competitions Top 100 [Dataset]. https://www.kaggle.com/vivovinco/kaggle-competitions-top-100
Explore at:
zip (15932 bytes)
Dataset updated
May 1, 2022
Authors
Vivo Vinco
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Context

This dataset contains the top 100 users of the Kaggle competitions ranking. The dataset will be updated every month.

Content

100 rows and 13 columns. Column descriptions are listed below.

User : Name of the user

Tier : Grandmaster, Master or Expert

Company/School : Company/School info of the user if mentioned

Country : Country info of the user if mentioned

Competitions_Num : Number of competitions joined

Competitions_Gold : Number of competitions gold medals won

Competitions_Silver : Number of competitions silver medals won

Competitions_Bronze : Number of competitions bronze medals won

Datasets_Num : Number of public datasets

Notebooks_Num : Number of public notebooks

Discussions_Num : Number of topics/comments posted

Points : Total points

Profile : Link of Kaggle profile

Acknowledgements

Data from Kaggle. Image from Smartcat.

If you're reading this, please upvote.
Fake Instagram Profile Dataset
kaggle.com
zip
Updated Mar 16, 2025
Cite
Raju Mavinmar (2025). Fake Instagram Profile Dataset [Dataset]. https://www.kaggle.com/datasets/rajumavinmar/fake-instagram-profile-dataset/code
Explore at:
zip (76866 bytes)
Dataset updated
Mar 16, 2025
Authors
Raju Mavinmar
Description
Fake Instagram Profile Detection Dataset

This dataset is designed for training and evaluating machine learning models to detect fake Instagram profiles. It contains various features extracted from Instagram user profiles, such as username characteristics, profile descriptions, follower counts, and engagement metrics. The dataset can be used for classification tasks to identify whether a profile is real or fake.

Dataset Features

Profile Picture Availability: Whether the user has a profile picture.

Numerical Ratio in Username: The proportion of numbers in the username.

Full Name Analysis: Number of words in the full name and its numerical ratio.

Username and Name Matching: Whether the username and full name are identical.

Description Length: Number of characters in the user’s bio/description.

External URL Presence: Whether the profile contains an external link.

Account Privacy: Indicates whether the profile is private or public.

Post Count: Total number of posts uploaded by the user.

Follower Count: Number of followers the user has.

Following Count: Number of accounts the user follows.

Label Information

The dataset is labeled to classify profiles as either fake (1) or real (0) based on the given features. Probability scores are also provided for classification.

Use Cases

Machine learning classification models for fake profile detection.

Exploratory data analysis on Instagram profile patterns.

Feature engineering and data preprocessing for social media fraud detection.

This dataset may only be used by students and researchers working on cybersecurity, social media analysis, and artificial intelligence applications, not for production use.
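
Several of the listed features can be derived directly from raw profile strings. The sketch below is one plausible way to compute them; the exact definitions used by the dataset author are not stated, so the choice of denominator (total string length) and the function and key names here are assumptions made for illustration.

```python
def digit_ratio(text):
    """Share of characters in `text` that are digits (0.0 for an empty string)."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)

def profile_features(username, full_name, bio):
    """Build a few of the described features from raw profile fields."""
    return {
        "username_digit_ratio": digit_ratio(username),
        "fullname_word_count": len(full_name.split()),
        "fullname_digit_ratio": digit_ratio(full_name),
        # 1 if the username and full name are identical, else 0.
        "name_equals_username": int(username == full_name),
        "description_length": len(bio),
    }

# Hypothetical profile used only to exercise the functions.
features = profile_features("user1234", "Jane Doe", "Coffee and code")
```

For this hypothetical profile, half of the username characters are digits, the full name has two words, and the description is 15 characters long.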
LinkedIn Professional Profiles Dataset
kaggle.com
zip
Updated Sep 9, 2023
Cite
Manish kumar (2023). LinkedIn Professional Profiles Dataset [Dataset]. https://www.kaggle.com/datasets/manishkumar7432698/linkedinuserprofiles/discussion
Explore at:
zip (3394094 bytes)
Dataset updated
Sep 9, 2023
Authors
Manish kumar
Description
In today's dynamic business landscape, having access to actionable insights is paramount. The LinkedIn Professional Profiles Dataset is your gateway to a treasure trove of information, encompassing key details about professionals' careers, education, skills, and more. This dataset caters to a diverse range of business needs, including tracking talent movement, sourcing new talent, lead generation, and even serving as an unconventional investment data source.

Key Data Points:

Name: Full name of the professional.

Title: Current job title.

Position: Detailed job position within the company.

Current Company: Name of the current employing company.

Avatar: Profile picture URL for visual identification.

Experience: Chronological list of past job experiences.

Education: Educational background and institutions attended.

Location: Geographic location of the professional.

The dataset includes further fields that are worth exploring.

With the LinkedIn dataset at your fingertips, you can generate investment signals, refine talent sourcing strategies, and improve lead generation tactics. Tailor your analyses to your specific business objectives and gain a competitive edge through data-driven decision-making.
Twitter User Gender Classification
kaggle.com
zip
Updated Nov 21, 2016
Cite
Figure Eight (2016). Twitter User Gender Classification [Dataset]. https://www.kaggle.com/datasets/crowdflower/twitter-user-gender-classification/code
Explore at:
zip (3163744 bytes)
Dataset updated
Nov 21, 2016
Dataset authored and provided by
Figure Eight
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
This data set was used to train a CrowdFlower AI gender predictor. You can read all about the project here. Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.

Inspiration

Here are a few questions you might try to answer with this dataset:

how well do words in tweets and profiles predict user gender?

what are the words that strongly predict male or female gender?

how well do stylistic factors (like link color and sidebar color) predict user gender?

Acknowledgments

Data was provided by the Data For Everyone Library on Crowdflower.

Our Data for Everyone library is a collection of our favorite open data jobs that have come through our platform. They're available free of charge for the community, forever.

The Data

The dataset contains the following fields:

_unit_id: a unique id for user

_golden: whether the user was included in the gold standard for the model; TRUE or FALSE

_unit_state: state of the observation; one of finalized (for contributor-judged) or golden (for gold standard observations)

_trusted_judgments: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations

_last_judgment_at: date and time of last contributor judgment; blank for gold standard observations

gender: one of male, female, or brand (for non-human profiles)

gender:confidence: a float representing confidence in the provided gender

profile_yn: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it

profile_yn:confidence: confidence in the existence/non-existence of the profile

created: date and time when the profile was created

description: the user's profile description

fav_number: number of tweets the user has favorited

gender_gold: if the profile is golden, what is the gender?

link_color: the link color on the profile, as a hex value

name: the user's name

profile_yn_gold: whether the profile y/n value is golden

profileimage: a link to the profile image

retweet_count: number of times the user has retweeted (or possibly, been retweeted)

sidebar_color: color of the profile sidebar, as a hex value

text: text of a random one of the user's tweets

tweet_coord: if the user has location turned on, the coordinates as a string with the format "[latitude, longitude]"

tweet_count: number of tweets that the user has posted

tweet_created: when the random tweet (in the text column) was created

tweet_id: the tweet id of the random tweet

tweet_location: location of the tweet; seems to not be particularly normalized

user_timezone: the timezone of the user
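
Two of the fields above are stored as strings that need parsing before analysis: the hex colour values (link_color, sidebar_color) and the "[latitude, longitude]" coordinates in tweet_coord. The following is a minimal sketch, assuming six-digit hex values and JSON-style coordinate lists; real rows may contain shortened or malformed values that need extra handling.

```python
import json

def hex_to_rgb(value):
    """Convert a 6-digit hex colour such as '08C2C2' or '#08C2C2' to (r, g, b)."""
    digits = value.strip().lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {value!r}")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

def parse_coord(value):
    """Parse a '[latitude, longitude]' string; return None for blank values."""
    if not value or not value.strip():
        return None
    lat, lon = json.loads(value)
    return float(lat), float(lon)

rgb = hex_to_rgb("08C2C2")
coord = parse_coord("[40.7, -74.0]")
```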
PhiUSIIL Phishing URL Dataset
kaggle.com
data.mendeley.com
zip
Updated Mar 8, 2024
Cite
Arvind Prasad (2024). PhiUSIIL Phishing URL Dataset [Dataset]. https://www.kaggle.com/datasets/ndarvind/phiusiil-phishing-url-dataset
Explore at:
zip (15400969 bytes)
Dataset updated
Mar 8, 2024
Authors
Arvind Prasad
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs we analyzed while constructing the dataset are the latest URLs. Features are extracted from the source code of the webpage and URL. Features such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb are derived from existing features.

Class Labels

Label 1 corresponds to a legitimate URL, label 0 to a phishing URL.
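
The class counts quoted above imply a split of roughly 57% legitimate to 43% phishing URLs. The short sketch below works through that arithmetic and the accuracy a trivial majority-class predictor would reach, which is a useful baseline when evaluating a model; the variable names are illustrative only.

```python
# Counts taken from the dataset description.
LEGITIMATE = 134_850  # label 1
PHISHING = 100_945    # label 0

total = LEGITIMATE + PHISHING
legit_share = LEGITIMATE / total
phishing_share = PHISHING / total

# Accuracy of always predicting the larger class.
majority_baseline = max(LEGITIMATE, PHISHING) / total

print(f"total={total}, legitimate={legit_share:.1%}, phishing={phishing_share:.1%}")
```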

Citations: Prasad, A., & Chandra, S. (2023). PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning. Computers & Security, 103545. doi: https://doi.org/10.1016/j.cose.2023.103545
Phishing URL Dataset
kaggle.com
zip
Updated Nov 26, 2024
Cite
Sharma Geetika (2024). Phishing URL Dataset [Dataset]. https://www.kaggle.com/datasets/sharmageetika/phishing-url-dataset
Explore at:
zip (16642226 bytes)
Dataset updated
Nov 26, 2024
Authors
Sharma Geetika
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The PhiUSIIL Phishing URL Dataset is a comprehensive resource designed for research and development in phishing detection systems. This dataset includes 134,850 legitimate URLs and 100,945 phishing URLs, carefully curated to provide a balanced representation for effective model training and evaluation.

The dataset primarily consists of recent and active URLs, ensuring relevance to current phishing and legitimate website characteristics. Features are extracted from both the URL structure and the source code of web pages, enabling detailed analysis and advanced feature engineering.

Key Features

CharContinuationRate: A measure of character continuation patterns in URLs.

URLTitleMatchScore: Evaluates the similarity between the URL and the webpage title.

URLCharProb: Calculates the likelihood of characters appearing in a URL based on observed probabilities.

TLDLegitimateProb: Assesses the legitimacy probability of the top-level domain (TLD).

This dataset is ideal for developing and testing machine learning models for phishing detection, feature analysis, and exploring the characteristics of malicious websites.

Applications

Training machine learning models to identify phishing URLs.

Analyzing the behavioral patterns of phishing websites.

Comparing legitimate and phishing website characteristics.

Feature engineering for cybersecurity-related tasks.

The PhiUSIIL dataset offers a rich source of information for researchers and practitioners in the domain of cybersecurity, machine learning, and web content analysis.

Reference: PhiUSIIL: A diverse security profile empowered phishing URL detection framework based on similarity index and incremental learning https://www.sciencedirect.com/science/article/abs/pii/S0167404823004558?via%3Dihub
Fashion Nova Reviews
kaggle.com
zip
Updated Aug 24, 2024
Cite
Syed Afroz (2024). Fashion Nova Reviews [Dataset]. https://www.kaggle.com/datasets/syedafroz6284/fashion-nova-reviews
Explore at:
zip (9516278 bytes)
Dataset updated
Aug 24, 2024
Authors
Syed Afroz
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains reviews from customers of Fashion Nova, a popular online clothing retailer. Each entry includes detailed information about the reviewer's experience with the brand, including ratings, review titles, review texts, and other relevant metadata. The data is useful for analyzing customer satisfaction, understanding sentiment, and identifying trends in user feedback over time.

Column Descriptions

Reviewer Name: The name or pseudonym of the reviewer. This helps in identifying different reviewers and analyzing their reviews individually.

Profile Link: A link to the reviewer's profile. This could be used to gather more information about the reviewer if publicly available, or for linking purposes.

Country: The country of the reviewer. This information is useful for understanding geographic distribution and regional preferences.

Review Count: The number of reviews submitted by the reviewer. It provides insight into whether the reviewer is a frequent user or a first-time reviewer.

Review Date: The timestamp indicating when the review was posted. This is important for time-series analysis and tracking changes in sentiment over time.

Rating: The rating given by the reviewer, expressed as a number of stars (e.g., "Rated 5 out of 5 stars"). It is a quantitative measure of satisfaction.

Review Title: A brief title summarizing the review. This can be used to quickly gauge the sentiment or topic of the review.

Review Text: The full text of the review. This field provides detailed feedback and opinions from the customer.

Date of Experience: The date when the customer experienced the service or product. It helps correlate the review date with the actual date of the customer experience.
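
Because the Rating column is described as text of the form "Rated 5 out of 5 stars", it usually needs converting to a number before any aggregation. A minimal sketch, assuming every rating follows that exact wording (other phrasings would return None here):

```python
import re

# Matches strings such as "Rated 5 out of 5 stars".
RATING_PATTERN = re.compile(r"Rated (\d+) out of (\d+) stars")

def parse_rating(text):
    """Return the star count as an int, or None if the text does not match."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1))

ratings = [parse_rating(t) for t in ["Rated 5 out of 5 stars", "Rated 1 out of 5 stars"]]
average = sum(ratings) / len(ratings)
```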

Ways to Use the Data

Sentiment Analysis: By analyzing the Review Text and Review Title, you can perform sentiment analysis to understand the overall customer sentiment towards Fashion Nova.

Customer Satisfaction Tracking: Use the Rating column to monitor customer satisfaction levels and identify trends over time or across different countries.

Market Segmentation: The Country column can help in identifying trends and preferences in different regions, aiding targeted marketing strategies.

Customer Engagement Analysis: The Review Count can be used to identify frequent reviewers, potentially loyal customers, or brand advocates, which can help in developing customer engagement strategies.

Temporal Analysis: With the Review Date and Date of Experience, you can analyze how customer feedback varies over time, which is useful for understanding the impact of specific events or promotions.

Increasing Usability Score on Kaggle

Add Metadata: Include clear and detailed metadata with descriptions for each column, as provided above. This makes it easier for users to understand and utilize the dataset.

Provide a README File: A README file that explains the dataset, its origin, and possible use cases can significantly increase its usability.

Data Cleaning and Preprocessing: Ensure that the data is clean, consistent, and well-formatted. Handling missing values, standardizing date formats, and ensuring consistent text encoding will make the dataset more user-friendly.

Add Tags and Keywords: Use relevant tags and keywords on Kaggle to improve discoverability. Examples include "Fashion Nova," "customer reviews," "sentiment analysis," "e-commerce," and "retail analytics."

Sample Code or Notebook: Provide a Jupyter notebook with examples of how to use the dataset, including visualizations, analysis scripts, and initial findings. This can guide users on how to extract meaningful insights from the data.

Potential Use Cases

Product Improvement: Identify common issues or praises from reviews to provide feedback to product development teams.

Customer Service Analysis: Understand the quality of customer service and areas that require improvement.

Marketing Insights: Use the data to tailor marketing campaigns based on customer preferences and satisfaction levels.

Competitor Analysis: Compare reviews with competitors to identify strengths and weaknesses.

To gather customer reviews from Fashion Nova on Trustpilot, a web scraping approach was employed. The process involved the following steps:

Web Scraping Setup:

Libraries Used: The scraping was carried out using Python's requests library to fetch web pages and BeautifulSoup from the bs4 library to parse HTML content.

Scraping Process:

Page Iteration: The function scrape_pages(start_page, end_page) was used to iterate over a range of pages on Trustpilot, where each page contains multiple reviews.

URL Formation: For each page, the URL was dynamically constructed based on the page number using the format https://www.trustpilot.com/review/www.fashionnova.com?page={page_number}.

Request Handling: An HTTP GET request was sent to fetch the content of the page. The response was then parsed...
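
The page-iteration and URL-formation steps can be sketched as follows. This is not the author's code: the extra fetch parameter, the inclusive end_page, and the returned dictionary are assumptions made so the example runs without network access. The parsing step is omitted because the original description is cut off before it names any HTML selectors.

```python
BASE_URL = "https://www.trustpilot.com/review/www.fashionnova.com?page={page_number}"

def build_page_url(page_number):
    """Fill the page number into the URL format quoted in the description."""
    return BASE_URL.format(page_number=page_number)

def scrape_pages(start_page, end_page, fetch):
    """Fetch pages start_page..end_page (inclusive) and return {page_number: html}.

    `fetch` is a hypothetical callable taking a URL and returning HTML text.
    """
    pages = {}
    for page_number in range(start_page, end_page + 1):
        pages[page_number] = fetch(build_page_url(page_number))
    return pages

# In real use, `fetch` could wrap requests.get, for example
#   fetch = lambda url: requests.get(url, timeout=10).text
# and each HTML string could be passed to BeautifulSoup(html, "html.parser").
fake_fetch = lambda url: f"<html>{url}</html>"
pages = scrape_pages(1, 3, fake_fetch)
```

Passing the fetcher in as an argument keeps the iteration logic testable offline and makes it easy to add delays or retries around the real HTTP call.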
💭 Discussion Tier Ranked Data By Location 👨‍🎤
kaggle.com
zip
Updated Dec 15, 2022
Cite
Bilal Haneef (2022). 💭 Discussion Tier Ranked Data By Location 👨‍🎤 [Dataset]. https://www.kaggle.com/datasets/muhammadbilalhaneef/discussion-tier-ranked-data-by-location
Explore at:
zip (119948 bytes)
Dataset updated
Dec 15, 2022
Authors
Bilal Haneef
License
CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/
Description
The data contains ranks of Experts, Masters, and Grandmasters of the Discussion Tier. You can apply EDA to this data and see which country has the highest-ranked Kaggle users.

Name: User name.

Rank: User rank on the leaderboard.

Level: User level, whether a user is an expert, master, or grandmaster.

Link: Profile link.

Gold: Total number of gold medals a user got.

Silver: Total number of silver medals a user got.

Bronze: Total number of bronze medals a user got.

Points: Total number of points a user got.

Joined: Year/Month joined.

Total Competitions: Total number of competitions a user participated in.

Total Dataset: Total number of datasets a user uploaded.

Total Codes: Total number of codes/notebooks a user uploaded.

Total Discussion: Total number of discussions a user had.

Highest Rank: Highest rank hit by a user.

Current Rank: User's current rank.

Current Level: Current Level represents whether a user's level is expert, master, or grandmaster in one of the four tiers.

City: User's city.

State: User's state.

Country: User's country.
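
The country-level EDA suggested above can be sketched with a simple count and a best-rank lookup per country. The rows below are invented placeholders, and treating a lower Rank value as better is an assumption about how the leaderboard is encoded.

```python
from collections import Counter

# Invented rows using the Name, Rank, and Country fields described above.
users = [
    {"Name": "a", "Rank": 1, "Country": "Country A"},
    {"Name": "b", "Rank": 2, "Country": "Country B"},
    {"Name": "c", "Rank": 3, "Country": "Country A"},
]

# Number of ranked users per country, skipping rows with no country.
country_counts = Counter(u["Country"] for u in users if u["Country"])
top_country, top_count = country_counts.most_common(1)[0]

# Best (lowest) rank reached by any user in each country.
best_rank = {}
for u in users:
    country = u["Country"]
    if country:
        best_rank[country] = min(best_rank.get(country, u["Rank"]), u["Rank"])
```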
🌎 Location Intelligence Data | From Google Map
kaggle.com
zip
Updated Apr 21, 2024
Cite
Azhar Saleem (2024). 🌎 Location Intelligence Data | From Google Map [Dataset]. https://www.kaggle.com/datasets/azharsaleem/location-intelligence-data-from-google-map
Explore at:
zip (1911275 bytes)
Dataset updated
Apr 21, 2024
Authors
Azhar Saleem
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
👨‍💻 Author: Azhar Saleem

Dataset Overview

Welcome to the Google Places Comprehensive Business Dataset! This dataset has been meticulously scraped from Google Maps and presents extensive information about businesses across several countries. Each entry in the dataset provides detailed insights into business operations, location specifics, customer interactions, and much more, making it an invaluable resource for data analysts and scientists looking to explore business trends, geographic data analysis, or consumer behaviour patterns.

Key Features

Business Details: Includes unique identifiers, names, and contact information.

Geolocation Data: Precise latitude and longitude for pinpointing business locations on a map.

Operational Timings: Detailed opening and closing hours for each day of the week, allowing analysis of business activity patterns.

Customer Engagement: Data on review counts and ratings, offering insights into customer satisfaction and business popularity.

Additional Attributes: Links to business websites, time zone information, and country-specific details enrich the dataset for comprehensive analysis.

Potential Use Cases

This dataset is ideal for a variety of analytical projects, including:

- Market Analysis: Understand business distribution and popularity across different regions.

- Customer Sentiment Analysis: Explore relationships between customer ratings and business characteristics.

- Temporal Trend Analysis: Analyze patterns of business activity throughout the week.

- Geospatial Analysis: Integrate with mapping software to visualise business distribution or cluster businesses based on location.
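
For the geospatial use case, a common first step is measuring the distance between two businesses from their latitude and longitude values. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; the function name is illustrative, and the result is an approximation that treats the Earth as a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude along the equator, roughly 111 km.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```

Pairwise distances computed this way can feed a clustering step or a simple "businesses within N km" filter.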

Dataset Structure

The dataset contains 46 columns, providing a thorough profile for each listed business. Key columns include:

business_id: A unique Google Places identifier for each business, ensuring distinct entries.

phone_number: The contact number associated with the business. It provides a direct means of communication.

name: The official name of the business as listed on Google Maps.

full_address: The complete postal address of the business, including locality and geographic details.

latitude: The geographic latitude coordinate of the business location, useful for mapping and spatial analysis.

longitude: The geographic longitude coordinate of the business location.

review_count: The total number of reviews the business has received on Google Maps.

rating: The average user rating out of 5 for the business, reflecting customer satisfaction.

timezone: The world timezone the business is located in, important for temporal analysis.

website: The official website URL of the business, providing further information and contact options.

category: The category or type of service the business provides, such as restaurant, museum, etc.

claim_status: Indicates whether the business listing has been claimed by the owner on Google Maps.

plus_code: A sho...
Twitter user data
kaggle.com
zip
Updated Aug 23, 2020
Cite
BARKHA VERMA (2020). Twitter user data [Dataset]. https://www.kaggle.com/barkhaverma/twitter-user-data
Explore at:
Available download formats: zip (3163744 bytes)
Dataset updated
Aug 23, 2020
Authors
BARKHA VERMA
Description
Context

Twitter User Data is a dataset of 20,000 rows. It includes the following information: user name, a random tweet, account profile and image, and location information.

Content

The dataset contains the following fields:

unit_id: a unique id for the user

golden: whether the user was included in the gold standard for the model; TRUE or FALSE

unit_state: state of the observation; one of finalized (for contributor-judged) or golden (for gold standard observations)

trusted_judgments: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations

last_judgment_at: date and time of last contributor judgment; blank for gold standard observations

gender: one of male, female, or brand (for non-human profiles)

gender:confidence: a float representing confidence in the provided gender

profile_yn: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it

profile_yn:confidence: confidence in the existence/non-existence of the profile

created: date and time when the profile was created

description: the user's profile description

fav_number: number of tweets the user has favorited

gender_gold: if the profile is golden, what is the gender?

link_color: the link color on the profile, as a hex value

name: the user's name

profile_yn_gold: whether the profile y/n value is golden

profileimage: a link to the profile image

retweet_count: number of times the user has retweeted (or possibly, been retweeted)

sidebar_color: color of the profile sidebar, as a hex value

text: text of a random one of the user's tweets

tweet_coord: if the user has location turned on, the coordinates as a string with the format "[latitude, longitude]"

tweet_count: number of tweets that the user has posted

tweet_created: when the random tweet (in the text column) was created

tweet_id: the tweet id of the random tweet

tweet_location: location of the tweet; seems to not be particularly normalized

user_timezone: the timezone of the user
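Because tweet_coord is stored as a string in the form "[latitude, longitude]", it has to be parsed before any mapping work. A minimal sketch follows, assuming the string is a bracketed pair of numbers and that users without location have a blank cell; the function name is hypothetical.

```python
import json


def parse_tweet_coord(raw):
    """Return (latitude, longitude) as floats, or None if the cell is blank or malformed."""
    if raw is None or not raw.strip():
        return None
    try:
        lat, lon = json.loads(raw)  # "[40.7, -74.0]" parses as a two-item list
        return float(lat), float(lon)
    except (ValueError, TypeError):
        # Covers invalid JSON, the wrong number of items, and non-numeric values.
        return None
```

Returning None instead of raising keeps a row-by-row conversion from stopping on the first irregular value.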

Acknowledgements

https://data.world/data-society/twitter-user-data
Cheltenham's Facebook Groups
kaggle.com
zip
Updated Apr 2, 2018
Cite
Mike Chirico (2018). Cheltenham's Facebook Groups [Dataset]. https://www.kaggle.com/datasets/mchirico/cheltenham-s-facebook-group
Explore at:
Available download formats: zip (0 bytes)
Dataset updated
Apr 2, 2018
Authors
Mike Chirico
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Facebook is becoming an essential tool for more than just family and friends. Discover how Cheltenham Township (USA), a diverse community just outside of Philadelphia, deals with major issues such as the Bill Cosby trial, everyday traffic issues, sewer I/I problems and lost cats and dogs. And yes, theft.

Communities work when they're connected and exchanging information. What and who are the essential forces making a positive impact, and when and how do conversational threads get directed or misdirected?

Use Any Facebook Public Group

You can leverage the examples here for any public Facebook group. For an example of the source code used to collect this data, and a quick-start Docker image, take a look at the following project: facebook-group-scrape.

Data Sources

There are 4 csv files in the dataset, with data from the following 5 public Facebook groups:

Unofficial Cheltenham Township

Elkins Park Happenings!

Free Speech Zone

Cheltenham Lateral Solutions

Cheltenham Township Residents

post.csv

These are the main posts you will see on the page. It might help to take a quick look at the page. Commas in the msg field have been replaced with {COMMA}, and apostrophes have been replaced with {APOST}.

gid: Group id (5 different Facebook groups)

pid: Main post id

id: Id of the user posting

name: User's name

timeStamp

shares

url

msg: Text of the message posted

likes: Number of likes
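Since commas and apostrophes in the msg field were replaced with {COMMA} and {APOST}, the text usually needs to be restored before analysis. A minimal sketch of that step, with a hypothetical function name:

```python
def decode_msg(msg):
    """Restore the commas and apostrophes that were replaced with placeholders in msg."""
    return msg.replace("{COMMA}", ",").replace("{APOST}", "'")
```

For example, decode_msg("Lost cat{COMMA} it{APOST}s grey") gives "Lost cat, it's grey".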

comment.csv

These are comments to the main post. Note, Facebook postings have comments, and comments on comments.

gid: Group id

pid: Matches main post identifier in post.csv

cid: Comment id

timeStamp

id: Id of user commenting

name: Name of user commenting

rid: Id of user responding to first comment

msg: Message

like.csv

These are likes and responses. The two keys in this file (pid,cid) will join to post and comment respectively.

gid: Group id

pid: Matches main post identifier in post.csv

cid: Matches comment id

response: Response such as LIKE, ANGRY etc.

id: The id of the user responding

name: Name of the user responding
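The join on pid described above can be sketched with the standard csv module. The rows below are invented stand-ins, not real data from the dataset, and the blank cid on some rows is only an assumption about how responses to a main post might appear.

```python
import csv
import io
from collections import Counter

# Tiny made-up stand-ins for post.csv and like.csv (not rows from the dataset).
POST_CSV = """gid,pid,id,name,timeStamp,shares,url,msg,likes
1,p1,u1,Ann,2017-01-01,0,,Hello,2
1,p2,u2,Bob,2017-01-02,0,,Traffic,1
"""
LIKE_CSV = """gid,pid,cid,response,id,name
1,p1,,LIKE,u2,Bob
1,p1,c1,ANGRY,u3,Cy
1,p2,,LIKE,u1,Ann
"""


def responses_per_post(post_file, like_file):
    """Count each response type per post by joining like rows to posts on pid."""
    posts = {row["pid"]: row for row in csv.DictReader(post_file)}
    counts = {pid: Counter() for pid in posts}
    for like in csv.DictReader(like_file):
        if like["pid"] in posts:  # inner join: skip responses whose post is missing
            counts[like["pid"]][like["response"]] += 1
    return counts


result = responses_per_post(io.StringIO(POST_CSV), io.StringIO(LIKE_CSV))
```

The same pattern applies to comment.csv by keying on cid instead of pid.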

member.csv

These are all the members in the group. Some members never, or rarely, post or comment. You may find multiple entries in this table for the same person. The name of the individual never changes, but they change their profile picture. Each profile picture change is captured in this table. Facebook gives users a new id in this table when they change their profile picture.

gid: Group id

id: Id of the member

name: Name of the member

url: URL of the member
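Because one person can appear under several ids after profile picture changes, one rough way to collapse the duplicates is to group ids by group and name. This is only a sketch: it assumes names identify a person within a group, which fails when two different members share a name, and the function name is hypothetical.

```python
from collections import defaultdict


def ids_by_member(rows):
    """Group member ids by (gid, name); rows are dicts with the member.csv columns."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[(row["gid"], row["name"])].add(row["id"])
    return dict(grouped)
```

The resulting mapping can be used to translate every id in post.csv or comment.csv back to a single person key.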
Fraud Detection Dataset
kaggle.com
zip
Updated Mar 28, 2025
Cite
Aman Ali Siddiqui (2025). Fraud Detection Dataset [Dataset]. https://www.kaggle.com/datasets/amanalisiddiqui/fraud-detection-dataset
Explore at:
Available download formats: zip (186385521 bytes)
Dataset updated
Mar 28, 2025
Authors
Aman Ali Siddiqui
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
The dataset contains 6.3 million records of financial transactions for fraud detection.

Some of these records were flagged false by existing algorithms.

Further feature engineering could strengthen the fraud detection algorithms and reveal where the existing algorithm falls short.

CASH-IN: the process of increasing the balance of an account by paying in cash to a merchant.

CASH-OUT: the opposite of CASH-IN; it means withdrawing cash from a merchant, which decreases the balance of the account.

DEBIT: a process similar to CASH-OUT that involves sending money from the mobile money service to a bank account.

PAYMENT: the process of paying merchants for goods or services, which decreases the balance of the account and increases the balance of the receiver.

TRANSFER: the process of sending money to another user of the service through the mobile money platform.

Citation for original work
Transactions
kaggle.com
zip
Updated Oct 30, 2024
Cite
Ismat Samadov (2024). Transactions [Dataset]. https://www.kaggle.com/datasets/ismetsemedov/transactions
Explore at:
Available download formats: zip (790290740 bytes)
Dataset updated
Oct 30, 2024
Authors
Ismat Samadov
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
This dataset simulates realistic financial transaction patterns and was generated with Python code for the purpose of developing and testing fraud detection models. The dataset was generated to mimic a wide range of transactional scenarios across multiple categories, including retail, grocery, dining, travel, and more, making it ideal for exploring patterns that distinguish legitimate transactions from fraudulent ones.

Financial fraud is an increasingly prevalent issue, with organizations constantly seeking advanced solutions to detect and prevent suspicious activity. This dataset was inspired by real-world transaction data but was generated synthetically to avoid privacy concerns. It includes key features that play a critical role in fraud detection, such as transaction amounts, device types, geographic locations, currency, card type, and a "fraud" label indicating whether a transaction is suspicious.

Comprehensive Transaction Categories: Transactions span categories like retail (online and in-store), groceries, restaurants (fast food to premium), entertainment (streaming, gaming, events), healthcare, education, gas, and travel.

Geographic and Demographic Variety: The dataset includes diverse geographic data (countries, cities) and currency types, allowing for analysis on a global scale with varying risk profiles.

Detailed Customer Profiles: Each transaction is linked to a customer profile that includes characteristics like account age, preferred devices, typical spending range, and fraud-protection features.

Feature-Rich Data for ML and Fraud Analysis: Features like transaction velocity, merchant risk, card presence, and device fingerprints provide an enriched environment for machine learning models to detect anomalies and suspicious patterns.

Use Cases:

This dataset is designed for data scientists, analysts, and machine learning practitioners interested in:
- Building and training fraud detection models.
- Exploring financial transaction patterns and consumer behaviors.
- Developing and testing machine learning algorithms for anomaly detection.

With this dataset, users can delve into advanced topics like feature engineering, model evaluation, and performance optimization, especially relevant to fraud detection applications in finance and e-commerce.

54k Resume dataset (structured)

kaggle.com

zip

Updated Nov 14, 2024

Cite

Suriya Ganesh (2024). 54k Resume dataset (structured) [Dataset]. https://www.kaggle.com/datasets/suriyaganesh/resume-dataset-structured

Explore at:

Available download formats: zip (39830263 bytes)

Dataset updated

Nov 14, 2024

Authors

Suriya Ganesh

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

This dataset is aggregated from sources such as

Entirely available in the public domain.

Resumes are usually in PDF format. OCR was used to convert the PDFs into text, and LLMs were used to convert the text into a structured format.

Dataset Overview

This dataset contains structured information extracted from professional resumes, normalized into multiple related tables. The data includes personal information, educational background, work experience, professional skills, and abilities.

Table Schemas

1. people.csv

Primary table containing core information about each individual.

Column Name	Data Type	Description	Constraints	Example
person_id	INTEGER	Unique identifier for each person	Primary Key, Not Null	1
name	VARCHAR(255)	Full name of the person	May be Null	"Database Administrator"
email	VARCHAR(255)	Email address	May be Null	"john.doe@email.com"
phone	VARCHAR(50)	Contact number	May be Null	"+1-555-0123"
linkedin	VARCHAR(255)	LinkedIn profile URL	May be Null	"linkedin.com/in/johndoe"

2. abilities.csv

Detailed abilities and competencies listed by individuals.

Column Name	Data Type	Description	Constraints	Example
person_id	INTEGER	Reference to people table	Foreign Key, Not Null	1
ability	TEXT	Description of ability	Not Null	"Installation and Building Server"

3. education.csv

Contains educational history for each person.

Column Name	Data Type	Description	Constraints	Example
person_id	INTEGER	Reference to people table	Foreign Key, Not Null	1
institution	VARCHAR(255)	Name of educational institution	May be Null	"Lead City University"
program	VARCHAR(255)	Degree or program name	May be Null	"Bachelor of Science"
start_date	VARCHAR(7)	Start date of education	May be Null	"07/2013"
location	VARCHAR(255)	Location of institution	May be Null	"Atlanta, GA"

4. experience.csv

Details of work experience entries.

Column Name	Data Type	Description	Constraints	Example
person_id	INTEGER	Reference to people table	Foreign Key, Not Null	1
title	VARCHAR(255)	Job title	May be Null	"Database Administrator"
firm	VARCHAR(255)	Company name	May be Null	"Family Private Care LLC"
start_date	VARCHAR(7)	Employment start date	May be Null	"04/2017"
end_date	VARCHAR(7)	Employment end date	May be Null	"Present"
location	VARCHAR(255)	Job location	May be Null	"Roswell, GA"

5. person_skills.csv

Mapping table connecting people to their skills.

Column Name	Data Type	Description	Constraints	Example
person_id	INTEGER	Reference to people table	Foreign Key, Not Null	1
skill	VARCHAR(255)	Reference to skills table	Foreign Key, Not Null	"SQL Server"

6. skills.csv

Master list of unique skills mentioned across all resumes.

Column Name	Data Type	Description	Constraints	Example
skill	VARCHAR(255)	Unique skill name	Primary Key, Not Null	"SQL Server"

Relationships

Each person (people.csv) can have:
- Multiple education entries (education.csv)
- Multiple experience entries (experience.csv)
- Multiple skills (person_skills.csv)
- Multiple abilities (abilities.csv)
Skills (skills.csv) can be associated with multiple people
All relationships are maintained through the person_id field

Data Characteristics

Date Formats

All dates are stored in MM/YYYY format
Current positions use "Present" for end_date
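Because dates are MM/YYYY strings and open-ended roles use "Present", the raw values do not sort chronologically as text ("12/2015" sorts after "04/2017"). A minimal parsing sketch follows; the function names are hypothetical, and it assumes every non-empty value other than "Present" really follows MM/YYYY (anything else raises ValueError).

```python
from datetime import date, datetime


def parse_month(value):
    """Parse an MM/YYYY string into a date on the first of that month.

    Returns None for "Present" or for missing values.
    """
    if not value or value == "Present":
        return None
    return datetime.strptime(value, "%m/%Y").date()


def newest_first(rows):
    """Sort experience rows by start_date, newest first; undated rows go last."""
    return sorted(rows, key=lambda r: parse_month(r["start_date"]) or date.min, reverse=True)
```

Treating "Present" as None lets callers decide whether an open-ended end_date should mean today's date or simply "still current".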

Text Fields

All text fields preserve original case
NULL values indicate missing information
No maximum length enforced for TEXT fields
VARCHAR fields have practical limits noted in schema

Identifiers

person_id starts at 1 and increments sequentially
No natural or composite keys used
All relationships maintained through person_id

Common Usage Patterns

Basic Queries

-- Get all skills for a person
SELECT s.skill 
FROM person_skills ps
JOIN skills s ON ps.skill = s.skill
WHERE ps.person_id = 1;

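-- Get complete work history, newest first.
-- start_date is an MM/YYYY string, so sort by year, then by month.
SELECT *
FROM experience
WHERE person_id = 1
ORDER BY SUBSTR(start_date, 4, 4) DESC, SUBSTR(start_date, 1, 2) DESC;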

Analytics Queries

-- Most common skills
SELECT s.skill, COUNT(*) as frequency
FROM person_skills ps
...

Osint_public_profiles_dataset

kaggle.com

zip

Updated May 27, 2025

Cite

Aniket kumar (2025). Osint_public_profiles_dataset [Dataset]. https://www.kaggle.com/datasets/alliot032/osint-public-profiles-dataset/code

Explore at:

Available download formats: zip (182999 bytes)

Dataset updated

May 27, 2025

Authors

Aniket kumar

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

🕵️‍♂️ Advanced OSINT Public Profiles Dataset (Synthetic)

📄 Overview

This dataset contains 2,000 synthetic public profile records generated for open-source intelligence (OSINT) research, cybersecurity education, and red team simulation. It mimics realistic personal, professional, and breach-related information typically found through OSINT tools and techniques.

It is 100% synthetic — no real individuals or private data were used.

Column Name	Description
Name	Full name of the synthetic individual
Username	Commonly used username
Email	Generated email address
Phone	Randomly formatted phone number
Twitter	Simulated Twitter profile link
LinkedIn	Simulated LinkedIn profile link
Domain	Domain name associated with the person
Location	City and country
Job_Title	Profession or role
Company	Employer or organization
IP_Address	Public IPv4 address
MAC_Address	Synthetic MAC address
Breached	Indicates whether their data was breached
Breach_Source	Known breach source (LinkedIn, Dropbox, etc.)
Breach_Year	Year of breach (if applicable)
Password_Strength	Simulated password strength: Weak, Moderate, or Strong
Public_Pastebin	Whether their data appeared on a pastebin (Yes/No)

🎯 Use Cases

You can use this dataset for:

✅ OSINT Reconnaissance Practice

✅ Identity Risk Scoring Systems

✅ Cybersecurity Education & Red Team Simulations

✅ NLP & Fuzzy Matching for Entity Resolution

✅ Network Graphs of Breached Users

✅ Training AI models for fake profile detection

✅ Demonstrating recon tools and dashboards

📌 License

This dataset is licensed under the Creative Commons CC0 1.0 Public Domain Dedication.

Feel free to use it in your academic projects, machine learning models, blogs, or demos — with or without attribution.

Advanced: Saudi Arabian Aramco Stocks Dataset 🐪
kaggle.com
zip
Updated May 3, 2024
Cite
Azhar Saleem (2024). Advanced: Saudi Arabian Aramco Stocks Dataset 🐪 [Dataset]. https://www.kaggle.com/datasets/azharsaleem/advanced-saudi-arabian-aramco-stocks-dataset
Explore at:
Available download formats: zip (156915 bytes)
Dataset updated
May 3, 2024
Authors
Azhar Saleem
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Area covered
Saudi Arabia
Description
Saudi Arabian Oil Company Aramco, Stocks

👨‍💻 Author: Azhar Saleem

Dataset Description

Welcome to the Enhanced Saudi Arabian Oil Company (Aramco) Stock Dataset! This dataset has been meticulously prepared from Yahoo Finance and further enriched with several engineered features to elevate your data analysis, machine learning, and financial forecasting projects. It captures the daily trading figures of Aramco stocks, presented in Saudi Riyal (SAR), providing a robust foundation for comprehensive market analysis.

Columns in the Dataset

Date: The trading day for the data recorded (ISO 8601 format).

Open: The price at which the stock first traded upon the opening of an exchange on a given trading day.

High: The highest price at which the stock traded during the trading day.

Low: The lowest price at which the stock traded during the trading day.

Close: The price at which the stock last traded upon the close of an exchange on a given trading day.

Volume: The total number of shares traded during the trading day.

Dividends: The dividend value paid out per share on the trading day.

Stock Splits: The number of stock splits occurring on the trading day.

Lag Features (Lag_Close, Lag_High, Lag_Low): Previous day's closing, highest, and lowest prices.

Rolling Window Statistics (e.g., Rolling_Mean_7, Rolling_Std_7): 7-day and 30-day moving averages and standard deviations of the Close price.

Technical Indicators (RSI, MACD, Bollinger Bands): Key metrics used in trading to analyze short-term price movements.

Change Features (Change_Close, Change_Volume): Day-over-day changes in Close price and trading volume.

Date-Time Features (Weekday, Month, Year, Quarter): Extracted components of the trading day.

Volume_Normalized: The standardized trading volume using z-score normalization to adjust for scale differences.
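The engineered columns above (lags, rolling statistics, day-over-day changes, z-score normalization) can be reproduced from a plain list of closing prices. The sketch below is one plausible reading of those definitions, not the author's code: the window alignment and the use of the population standard deviation are assumptions, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def lag(values, k=1):
    """Shift a series back by k steps; the first k entries have no history (None)."""
    return ([None] * k + list(values))[:len(values)]


def rolling(values, window, stat):
    """Apply stat to each trailing window; positions without a full window get None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(stat(values[i + 1 - window:i + 1]))
    return out


def change(values):
    """Day-over-day difference; the first entry has no previous day (None)."""
    return [None if p is None else v - p for v, p in zip(values, lag(values))]


def zscore(values):
    """Standardize with the population mean and standard deviation.

    Raises ZeroDivisionError if every value is identical.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]
```

Calling rolling(close, 7, mean) and rolling(close, 30, pstdev) would correspond to the 7-day and 30-day windows mentioned above, under the same assumptions.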

Potential Uses

This dataset is tailored for a wide array of applications:

Financial Analysis: Explore historical performance, volatility, and market trends.

Forecasting Models: Utilize features like lagged prices and rolling statistics to predict future stock prices.

Machine Learning: Develop regression models or classification frameworks to predict market movements.

Deep Learning: Leverage LSTM networks for more sophisticated time-series forecasting.

Time-Series Analysis: Dive deep into trend analysis, seasonality, and cyclical behavior of stock prices.

Whether you are a data scientist, a financial analyst, or a hobbyist interested in the stock market, this dataset provides a rich playground for analysis and model building. Its comprehensive feature set allows for the development of robust predictive models and offers unique insights into one of the world’s most significant oil companies. Unlock the potential of financial data with this carefully crafted dataset.
Machine Learning users on Github
kaggle.com
zip
Updated Jan 9, 2022
Cite
prosper chuks (2022). Machine Learning users on Github [Dataset]. https://www.kaggle.com/prosperchuks/machine-learning-users-on-github
Explore at:
Available download formats: zip (52282 bytes)
Dataset updated
Jan 9, 2022
Authors
prosper chuks
License
Open Data Commons Attribution License (ODC-By) v1.0 (https://www.opendatacommons.org/licenses/by/1.0/)
License information was derived automatically
Description
Data was scraped from GitHub's API.

Columns

LOGIN: the user's GitHub login

ID: the user's id

URL: API link to the user's profile

NAME: full name of the user

COMPANY: organization the user is affiliated with

BLOG: link to the user's blog site

LOCATION: location where the user resides

EMAIL: the user's email address

BIO: about the user

This dataset contains over 600 users from Lagos, Nigeria and Rwanda.

Source: https://github.com/ProsperChuks/Github-Data-Ingestion/tree/main/data
Clubhouse Dataset 9.7M
kaggle.com
zip
Updated Jun 22, 2021
Cite
Vahid (2021). Clubhouse Dataset 9.7M [Dataset]. https://www.kaggle.com/johntukey/clubhouse-dataset
Explore at:
Available download formats: zip (2779558253 bytes)
Dataset updated
Jun 22, 2021
Authors
Vahid
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
- The dataset does not contain any sensitive information. The information has not been hacked. Do not believe the fake news.
- I do not have an account on RaidForums.

Clubhouse (joinclubhouse.com) is a social networking app that lets people gather in audio chat rooms to discuss various topics. Currently only the iOS version is available, and membership is by invitation only. When you invite someone to Clubhouse, the invited user's profile shows "nominated by YOUR_NAME". As a data scientist, I found it interesting to extract the hierarchical structure of invitations 😃. In this link you can see an example of this tree structure. I slightly changed the code in this GitHub repository in order to extract the data. Clubhouse's rate limit is terrible! 🤕 Even while you are using the app, if you refresh the page several times, you will be blocked for a few minutes! Therefore, I sent requests to the Clubhouse server every 5.65 seconds! Only a crazy data scientist can cut the mustard 😂
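The fixed pacing described above can be sketched as a small throttle that enforces a minimum gap between requests. This is a generic illustration, not the code from the repository mentioned; the class name is hypothetical, and the injectable clock and sleep functions exist only so the behaviour can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # pause only for the time still owed
                now = self._clock()
        self._last = now
```

Calling wait() on a Throttle(5.65) instance before each request would reproduce the 5.65-second spacing.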

Version 1 (2021-04-05)

This version of the dataset (v1) contains 1,300,515 Clubhouse user profiles. You can see how to use this dataset in the code section. In summary, each row shows a user's profile information, including:
- user-id
- name
- photo-url
- username
- twitter
- Instagram
- num-followers
- num-following
- time-created
- invited-by-user-profile
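The invitation hierarchy can be rebuilt from these columns by linking each user to the user who invited them. The sketch below assumes that invited-by-user-profile holds the inviter's user-id, that it is empty for users with no recorded inviter, and that invitations contain no cycles; the function names are hypothetical.

```python
from collections import defaultdict


def build_invite_tree(rows):
    """Map each inviter's user-id to the list of user-ids they invited."""
    children = defaultdict(list)
    for row in rows:
        inviter = row.get("invited-by-user-profile")
        if inviter:  # users without a recorded inviter are treated as roots
            children[inviter].append(row["user-id"])
    return dict(children)


def descendants(children, root):
    """Count everyone reachable from root through chains of invitations."""
    total, stack = 0, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            total += 1
            stack.append(child)
    return total
```

An iterative traversal is used instead of recursion because invitation chains in millions of profiles could be deeper than Python's default recursion limit.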

Version 2 (2021-04-29)

This version of the dataset (v2) contains 3,469,520 Clubhouse user profiles.

Version 3 (2021-05-19)

This version of the dataset (v3) contains 4,838,345 Clubhouse user profiles. In summary, each row shows a user's profile information, including:
- user-id
- name
- photo-url
- username
- twitter
- Instagram
- num-followers
- num-following
- time-created
- invited-by-user-profile
- invited-by-club

In this version of the dataset, a new column called invited-by-club shows which users were invited by a club. Additionally, a new table called club has been added. Each row shows information about a club, including:
- club-id (same as invited-by-club in the user table)
- name
- description
- photo-url
- num-members
- num-followers
- enable-private
- is-follow-allowed
- is-membership-private
- is-community
- rules
- url

Version 4 (2021-06-01)

This version of the dataset (v4) contains 6,188,441 user profiles, as well as 4,974 club information records.

Version 5 (2021-06-10)

This version of the dataset (v5) contains 8,427,058 user profiles, as well as 8,520 club information records.

Version 6 (2021-06-22)

This version of the dataset (v6) contains 9,794,022 user profiles, as well as 12,375 club information records.

To-do list:
- Some user IDs do not exist because the Clubhouse server sometimes does not respond, and also because some users have not yet been invited. In the next update, the missing user IDs will be scanned again.
- Some users have been invited to Clubhouse by clubs. A new column called invited_by_club will be added in the next update.

I will update this dataset over time. Subscribe (https://t.me/Clubhouse_Dataset) to be informed of updates. Twitter: @VahidBaghi95

Cite

Ijaj Ahmed (2024). Kaggle Dataset Metadata Repository [Dataset]. https://www.kaggle.com/datasets/ijajdatanerd/kaggle-dataset-metadata-repository

Kaggle Dataset Metadata Repository

Comprehensive Metadata for Kaggle Datasets Including Owner, Usage, and Licensing

Explore at:

Available download formats: zip (5122110 bytes)

Dataset updated

Nov 16, 2024

Authors

Ijaj Ahmed

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Kaggle Dataset Metadata Collection 📊

This dataset provides comprehensive metadata on various Kaggle datasets, offering detailed information about the dataset owners, creators, usage statistics, licensing, and more. It can help researchers, data scientists, and Kaggle enthusiasts quickly analyze the key attributes of different datasets on Kaggle. 📚

Dataset Overview:

Purpose: To provide detailed insights into Kaggle dataset metadata.
Content: Information related to the dataset's owner, creator, usage metrics, licensing, and more.
Target Audience: Data scientists, Kaggle competitors, and dataset curators.

Columns Description 📋

datasetUrl 🌐: The URL of the Kaggle dataset page. This directs you to the specific dataset's page on Kaggle.
ownerAvatarUrl 🖼️: The URL of the dataset owner's profile avatar on Kaggle.
ownerName 👤: The name of the dataset owner. This can be the individual or organization that created and maintains the dataset.
ownerUrl 🌍: A link to the Kaggle profile page of the dataset owner.
ownerUserId 💼: The unique user ID of the dataset owner on Kaggle.
ownerTier 🎖️: The ownership tier, such as "Tier 1" or "Tier 2," indicating the owner's status or level on Kaggle.
creatorName 👩‍💻: The name of the dataset creator, which could be different from the owner.
creatorUrl 🌍: A link to the Kaggle profile page of the dataset creator.
creatorUserId 💼: The unique user ID of the dataset creator.
scriptCount 📜: The number of scripts (kernels) associated with this dataset.
scriptsUrl 🔗: A link to the scripts (kernels) page for the dataset, where you can explore related code.
forumUrl 💬: The URL to the discussion forum for this dataset, where users can ask questions and share insights.
viewCount 👀: The number of views the dataset page has received on Kaggle.
downloadCount ⬇️: The number of times the dataset has been downloaded by users.
dateCreated 📅: The date when the dataset was first created and uploaded to Kaggle.
dateUpdated 🔄: The date when the dataset was last updated or modified.
voteButton 👍: The metadata for the dataset's vote button, showing how users interact with the dataset's quality ratings.
categories 🏷️: The categories or tags associated with the dataset, helping users filter datasets based on topics of interest (e.g., "Healthcare," "Finance").
licenseName 🛡️: The name of the license under which the dataset is shared (e.g., "CC0," "MIT License").
licenseShortName 🔑: A short form or abbreviation of the dataset's license name (e.g., "CC0" for Creative Commons Zero).
datasetSize 📦: The size of the dataset in terms of storage, typically measured in MB or GB.
commonFileTypes 📂: A list of common file types included in the dataset (e.g., .csv, .json, .xlsx).
downloadUrl ⬇️: A direct link to download the dataset files.
newKernelNotebookUrl 📝: A link to a new kernel or notebook related to this dataset, for those who wish to explore it programmatically.
newKernelScriptUrl 💻: A link to a new script for running computations or processing data related to the dataset.
usabilityRating 🌟: A rating or score representing how usable the dataset is, based on user feedback.
firestorePath 🔍: A reference to the path in Firestore where this dataset’s metadata is stored.
datasetSlug 🏷️: A URL-friendly version of the dataset name, typically used for URLs.
rank 📈: The dataset's rank based on certain metrics (e.g., downloads, votes, views).
datasource 🌐: The source or origin of the dataset (e.g., government data, private organizations).
medalUrl 🏅: A URL pointing to the dataset's medal or badge, indicating the dataset's quality or relevance.
hasHashLink 🔗: Indicates whether the dataset has a hash link for verifying data integrity.
ownerOrganizationId 🏢: The unique organization ID of the dataset's owner if the owner is an organization rather than an individual.
totalVotes 🗳️: The total number of votes the dataset has received from users, reflecting its popularity or quality.
category_names 📑: A comma-separated string of category names that represent the dataset’s classification.

This dataset is a valuable resource for those who want to analyze Kaggle's ecosystem, discover high-quality datasets, and explore metadata in a structured way. 🌍📊
