6 datasets found
  1. Indeed - Data Science

    • kaggle.com
    zip
    Updated Aug 16, 2024
    Cite
    Cormac42 (2024). Indeed - Data Science [Dataset]. https://www.kaggle.com/datasets/cormac42/indeed-data-science
    Explore at:
    Available download formats: zip (6243501 bytes)
    Dataset updated
    Aug 16, 2024
    Authors
    Cormac42
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset was scraped from Indeed during the summer of 2024, focusing on the search term 'data scientist.' The data encompasses job listings from every state in the USA, including remote positions, providing a comprehensive snapshot of the data science job market during this period.

    Working with this dataset involves a variety of skills that can help students gain valuable experience in data analysis, visualization, and interpretation. Some skills that could be practiced using this data:

    1. Data Cleaning and Preprocessing
    2. Exploratory Data Analysis (EDA)
    3. Data Visualization
    4. Text Analysis and Natural Language Processing (NLP)
    5. SQL and Database Management
    6. Geospatial Analysis
    7. Machine Learning
  2. IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal...

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +2 more
    Updated Jan 24, 2020
    Cite
    Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Indiana University
    Authors
    Cains, Mariana; Anand, Srini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

    Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) were extracted from the project-specific SQL database server. The raw data were then cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

    The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

    The companion paper can be found here: doi.org/10.5281/zenodo.814979

    Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

    Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)

  3. NSText2SQL

    • huggingface.co
    • opendatalab.com
    Updated Feb 23, 2024
    Cite
    NumbersStation (2024). NSText2SQL [Dataset]. https://huggingface.co/datasets/NumbersStation/NSText2SQL
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 23, 2024
    Dataset authored and provided by
    NumbersStation
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Summary

    The NSText2SQL dataset was used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissive licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques, including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs. For more… See the full description on the dataset page: https://huggingface.co/datasets/NumbersStation/NSText2SQL.

  4. Data from: WikiReddit: Tracing Information and Attention Flows Between...

    • zenodo.org
    bin
    Updated May 4, 2025
    Cite
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms [Dataset]. http://doi.org/10.5281/zenodo.14653265
    Explore at:
    Available download formats: bin
    Dataset updated
    May 4, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Patrick Gildersleve; Anna Beers; Viviane Ito; Agustin Orozco; Francesca Tripodi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 15, 2025
    Description

    Preprint

    Gildersleve, P., Beers, A., Ito, V., Orozco, A., & Tripodi, F. (2025). WikiReddit: Tracing Information and Attention Flows Between Online Platforms. arXiv [Cs.CY]. https://doi.org/10.48550/arXiv.2502.04942
    Accepted at the International AAAI Conference on Web and Social Media (ICWSM) 2025

    Abstract

    The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.

    Datasheet

    Motivation

    The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.

    Composition

    WikiReddit is a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.

    Collection Process

    Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Reddit data is linked to Wikipedia via the hyperlinks and article titles appearing in Reddit posts.

    Preprocessing/cleaning/labeling

    Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
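
    The two preprocessing steps named above (regex extraction of Wikipedia URLs and SHA-256 hashing of Reddit IDs) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the pattern `WIKI_URL`, the helper names, and the sample ID `t3_abc123` are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical pattern (an assumption, not the authors' regex): matches links
# to any Wikipedia language subdomain, including mobile ".m." hosts.
# Limitation: it stops at ")" so titles containing parentheses are truncated.
WIKI_URL = re.compile(
    r"https?://[a-z\-]+(?:\.m)?\.wikipedia\.org/wiki/[^\s)\]>]+",
    re.IGNORECASE,
)


def extract_wikipedia_urls(text: str) -> list[str]:
    """Return Wikipedia article URLs found in a post or comment body."""
    # Strip trailing sentence punctuation that the pattern may have swallowed.
    return [match.rstrip(".,;:!?") for match in WIKI_URL.findall(text)]


def anonymize_id(raw_id: str) -> str:
    """Hash an identifier with SHA-256 and return the hex digest."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


sample = (
    "See https://en.wikipedia.org/wiki/Reddit and "
    "(https://de.m.wikipedia.org/wiki/Wikipedia)."
)
urls = extract_wikipedia_urls(sample)
hashed = anonymize_id("t3_abc123")  # made-up ID, for illustration only
print(urls)
print(hashed)
```

    A SHA-256 hex digest is always 64 characters long, so hashed IDs of this kind fit a TEXT column of fixed width.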

    Uses

    We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia. Our dataset can help extend that analysis into the disparities in what types of external communities Wikipedia is used in, and how it is used. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine if homogeneity within the Reddit and Wikipedia audiences shapes topic patterns and assess whether these relationships mitigate or amplify problematic engagement online.

    Distribution

    The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942

    Maintenance

    Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.


    SQL Database Schema

    Table: posts

    Column Name            Type        Description
    subreddit_id           TEXT        The unique identifier for the subreddit.
    crosspost_parent_id    TEXT        The ID of the original Reddit post if this post is a crosspost.
    post_id                TEXT        Unique identifier for the Reddit post.
    created_at             TIMESTAMP   The timestamp when the post was created.
    updated_at             TIMESTAMP   The timestamp when the post was last updated.
    language_code          TEXT        The language code of the post.
    score                  INTEGER     The score (upvotes minus downvotes) of the post.
    upvote_ratio           REAL        The ratio of upvotes to total votes.
    gildings               INTEGER     Number of awards (gildings) received by the post.
    num_comments           INTEGER     Number of comments on the post.

    Table: comments

    Column Name            Type        Description
    subreddit_id           TEXT        The unique identifier for the subreddit.
    post_id                TEXT        The ID of the Reddit post the comment belongs to.
    parent_id              TEXT        The ID of the parent comment (if a reply).
    comment_id             TEXT        Unique identifier for the comment.
    created_at             TIMESTAMP   The timestamp when the comment was created.
    last_modified_at       TIMESTAMP   The timestamp when the comment was last modified.
    score                  INTEGER     The score (upvotes minus downvotes) of the comment.
    upvote_ratio           REAL        The ratio of upvotes to total votes for the comment.
    gilded                 INTEGER     Number of awards (gildings) received by the comment.

    Table: postlinks

    Column Name            Type        Description
    post_id                TEXT        Unique identifier for the Reddit post.
    end_processed_valid    INTEGER     Whether the extracted URL from the post resolves to a valid URL.
    end_processed_url      TEXT        The extracted URL from the Reddit post.
    final_valid            INTEGER     Whether the final URL from the post resolves to a valid URL after redirections.
    final_status           INTEGER     HTTP status code of the final URL.
    final_url              TEXT        The final URL after redirections.
    redirected             INTEGER     Indicator of whether the posted URL was redirected (1) or not (0).
    in_title               INTEGER     Indicator of whether the link appears in the post title (1) or post body (0).
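
    As a rough sketch of how the posts and postlinks tables might be queried together, the example below builds a tiny in-memory SQLite database with a subset of the columns listed above and counts valid links per subreddit. The choice of SQLite, the join on post_id, and every row value are assumptions for illustration; the published IDs are SHA-256 hashes, not the short placeholders used here.

```python
import sqlite3

# In-memory database with a subset of the documented columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (
        subreddit_id TEXT,
        post_id      TEXT,
        created_at   TIMESTAMP,
        score        INTEGER
    );
    CREATE TABLE postlinks (
        post_id     TEXT,
        final_valid INTEGER,
        final_url   TEXT,
        in_title    INTEGER
    );
    """
)

# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        ("sub_a", "p1", "2021-03-01 12:00:00", 10),
        ("sub_a", "p2", "2021-03-02 08:30:00", 3),
        ("sub_b", "p3", "2022-07-15 19:45:00", 42),
    ],
)
conn.executemany(
    "INSERT INTO postlinks VALUES (?, ?, ?, ?)",
    [
        ("p1", 1, "https://en.wikipedia.org/wiki/Reddit", 0),
        ("p2", 0, "https://en.wikipedia.org/wiki/Missing_page", 0),
        ("p3", 1, "https://fr.wikipedia.org/wiki/Wikip%C3%A9dia", 1),
    ],
)

# Count links that resolved to a valid final URL, grouped by subreddit.
rows = conn.execute(
    """
    SELECT p.subreddit_id, COUNT(*) AS valid_links
    FROM posts AS p
    JOIN postlinks AS l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
    GROUP BY p.subreddit_id
    ORDER BY p.subreddit_id
    """
).fetchall()
print(rows)
```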

    Table: commentlinks

    Column Name            Type        Description
    comment_id             TEXT        Unique identifier for the Reddit comment.
    end_processed_valid    INTEGER     Whether the extracted URL from the comment resolves to a valid URL.
    end_processed_url      TEXT        The extracted URL from the comment.
    final_valid            INTEGER     Whether the final URL from the comment resolves to a valid URL after redirections.
    final_status           INTEGER     HTTP status code of the final

  5. Online_Retail_II

    • kaggle.com
    zip
    Updated Jul 2, 2025
    Cite
    Shah Nawaj (2025). Online_Retail_II [Dataset]. https://www.kaggle.com/datasets/shahnawaj9/online-retail
    Explore at:
    Available download formats: zip (71343848 bytes)
    Dataset updated
    Jul 2, 2025
    Authors
    Shah Nawaj
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Cleaned & Merged UCI Online Retail Dataset (Dec 2009 – Dec 2011)

    This dataset is a cleaned and merged version of the original UCI Online Retail and Online Retail II datasets. It contains transaction data from a UK-based online retailer, covering a period from December 2009 to December 2011.

    Description

    The original UCI Online Retail II dataset contains two separate sheets:

    • Year 2009–2010
    • Year 2010–2011

    These have been merged with the original UCI Online Retail dataset to create a unified and continuous dataset.

    Cleaning and Preprocessing Performed

    • Merged all sheets into a single dataset
    • Removed:
      • Rows with negative or zero quantity
      • Rows with negative or zero price
      • Rows with missing customer_id
    • Created:
      • total_price column (quantity × price)
      • is_cancelled column based on invoice format or return flag
    • Standardized:
      • invoicedate formatting
      • Column names and data types

    Column Definitions

    Column          Description
    invoice         Invoice number (returns start with 'C')
    stockcode       Product code
    description     Description of product
    quantity        Number of items purchased
    invoicedate     Date and time of invoice
    price           Unit price in GBP
    customer_id     Unique identifier for each customer
    country         Customer’s country
    is_cancelled    Boolean flag for cancelled transactions
    total_price     Computed total (quantity × price) for each line item

    Included Files and Descriptions

    File                                  Type     Description
    online_retail_cleaned.csv             Data     Cleaned and merged retail transactions from 2009–2011
    rfm_final_score.csv                   Output   Final RFM scores for each customer with segment labels
    Retail_Data_Analysis_Dashboard.xlsx   Excel    Interactive Excel dashboard with KPIs, CLV, monthly trends
    Retail_Data_Analysis_Dashboard.png    Image    Visual preview of the Excel dashboard
    RFM_Segmentation.sql                  SQL      SQL logic to calculate RFM scores and assign segments
    Cohort_Analysis_on_Customer.sql       SQL      Cohort analysis based on acquisition month
    Cohort_Analysis_on_Revenue.sql        SQL      Cohort revenue tracking over time

    Dataset Summary

    • Time range: December 2009 – December 2011
    • Data combined from all three sheets (original and Online Retail II)
    • Most customers are from the United Kingdom
    • Fully cleaned and ready for use in analysis or modeling

    Applications

    • Market basket analysis
    • RFM segmentation
    • Cohort and retention analysis
    • Customer lifetime value modeling
    • Time series forecasting

    Included Analysis & Dashboards

    In addition to the cleaned dataset, this dataset includes complete analysis artifacts:

    1. Excel Dashboard

    • Summary metrics: Total Revenue, Orders, Customers, AOV
    • Turnover by year
    • Customer Lifetime Value segmentation (High, Medium, Low)
    • Monthly customer acquisition and churn trend
    • Country-wise revenue
    • Key business recommendations

    2. SQL-Based RFM Segmentation

    • RFM scores (1–5 scale)
    • Segment grouping (e.g., Champions, At Risk, Loyal Customers)
    • Monetary value distributions
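
    To make the 1–5 RFM scale concrete, here is a minimal rank-based quintile scoring sketch in Python. It is not the logic in RFM_Segmentation.sql, which is not shown on this page; the snapshot date, the customer values, and the helper `quintile_scores` are hypothetical.

```python
from datetime import date


def quintile_scores(values, higher_is_better=True):
    """Map each key to a 1-5 score by rank-based quintile.

    Ties are broken by insertion order because sorted() is stable.
    """
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    n = len(ordered)
    return {key: 1 + (rank * 5) // n for rank, key in enumerate(ordered)}


snapshot = date(2011, 12, 10)  # assumed reference date

# Made-up customers: (last purchase, number of orders, total spend in GBP).
customers = {
    "A": (date(2011, 12, 1), 12, 900.0),
    "B": (date(2011, 6, 15), 3, 120.0),
    "C": (date(2010, 11, 2), 1, 35.0),
    "D": (date(2011, 10, 20), 7, 410.0),
    "E": (date(2011, 2, 5), 2, 80.0),
}

recency = {c: (snapshot - last).days for c, (last, _, _) in customers.items()}
frequency = {c: orders for c, (_, orders, _) in customers.items()}
monetary = {c: spend for c, (_, _, spend) in customers.items()}

# Fewer days since the last purchase is better, so recency is inverted.
r = quintile_scores(recency, higher_is_better=False)
f = quintile_scores(frequency)
m = quintile_scores(monetary)

rfm = {c: (r[c], f[c], m[c]) for c in customers}
print(rfm)
```

    Segment labels such as Champions or At Risk would then be assigned from these score triples by whatever thresholds the analyst chooses.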

    3. SQL-Based Cohort Analysis

    • Monthly cohorts based on acquisition date
    • Retention matrix for month-over-month analysis
    • Supports churn and lifecycle evaluation
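
    A retention matrix of the kind described above can be sketched in a few lines of Python. The orders below are made up, and the approach (cohort = month of first purchase, retention = share of the cohort active N months later) is a common convention assumed here, not necessarily what the included SQL files do.

```python
from collections import defaultdict
from datetime import date

# Made-up (customer_id, order date) pairs.
orders = [
    ("c1", date(2010, 1, 5)),
    ("c1", date(2010, 2, 9)),
    ("c1", date(2010, 3, 1)),
    ("c2", date(2010, 1, 20)),
    ("c2", date(2010, 3, 15)),
    ("c3", date(2010, 2, 2)),
    ("c3", date(2010, 3, 3)),
]


def month_index(d):
    """Number of months since year 0, so month differences are simple."""
    return d.year * 12 + (d.month - 1)


# Cohort = month of each customer's first order.
first_month = {}
for customer, day in orders:
    m = month_index(day)
    first_month[customer] = min(first_month.get(customer, m), m)

# (cohort month, months since acquisition) -> set of active customers.
active = defaultdict(set)
for customer, day in orders:
    cohort = first_month[customer]
    active[(cohort, month_index(day) - cohort)].add(customer)

cohort_size = {cohort: len(custs) for (cohort, offset), custs in active.items()
               if offset == 0}

# Retention = active customers / cohort size, per cohort and month offset.
retention = {
    (cohort, offset): len(custs) / cohort_size[cohort]
    for (cohort, offset), custs in active.items()
}

jan = month_index(date(2010, 1, 1))
feb = month_index(date(2010, 2, 1))
print(retention[(jan, 1)], retention[(feb, 1)])
```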

    These files are provided in .xlsx and .sql formats and can be used for further business analysis or modeling.

    Source

    Original datasets:

    • UCI Online Retail II: https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

    This version was cleaned and merged by: Md Shah Nawaj

    Tags

    retail, ecommerce, customer segmentation, transactions, time series, data cleaning, rfm, python, pandas, online retail

  6. Cleaned retail sales dataset

    • kaggle.com
    zip
    Updated Aug 18, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    S Joshi (2025). Cleaned retail sales dataset [Dataset]. https://www.kaggle.com/datasets/hghdhygf/cleaned-retail-sales-dataset
    Explore at:
    Available download formats: zip (13352 bytes)
    Dataset updated
    Aug 18, 2025
    Authors
    S Joshi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Overview

    This dataset contains 1,000 retail transaction records after cleaning and preprocessing.

    This synthetic dataset has been meticulously crafted to simulate a dynamic retail environment, providing an ideal playground for those eager to sharpen their data analysis skills through exploratory data analysis (EDA). With a focus on retail sales and customer characteristics, this dataset invites you to unravel intricate patterns, draw insights, and gain a deeper understanding of customer behaviour.

    It includes customer demographics, product categories, transaction details, and derived analytics, such as the daily percentage change in sales.

    Original dataset (uncleaned): https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset

    The dataset can be used for:

    • Sales trend analysis
    • Customer segmentation
    • Revenue forecasting
    • Data visualisation projects
    • Teaching SQL, Pandas, or AWS analytics pipelines

    File Information

    • Filename: cleaned_retail_sales_dataset.csv
    • Records (rows): 1,000
    • Columns (features): 10
    • Missing values: Minimal (only 1 missing in Daily Percent Change)

    Column Descriptions

    • Transaction ID – Unique identifier for each transaction (range: 1–1000).
    • Date – Purchase date in DD-MM-YYYY format (345 unique dates).
    • Customer ID – Unique identifier for each customer (1,000 unique customers).
    • Gender – Customer gender: Male / Female (~51% Female, ~49% Male).
    • Age – Customer’s age (range: 18–64, average ≈ 41 years).
    • Product Category – Purchased product category (Clothing, Electronics, Groceries).
    • Quantity – Number of items purchased per transaction (range: 1–4, average ≈ 2.5).
    • Price per Unit – Price of a single item (range: ₹25 – ₹500, average ≈ ₹180).
    • Total Amount – Transaction value = Quantity × Price per Unit (range: ₹25 – ₹2000, average ≈ ₹456).
    • Daily Percent Change – Day-over-day percentage change in sales (range: -98.75% to 7900%).
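
    The two derived columns above can be illustrated with a short Python sketch. The page does not say exactly how Daily Percent Change was computed, so this assumes it is the percentage change in total sales between consecutive dates present in the data; the transactions are made up. Under that assumption the earliest date has no predecessor, which would be one plausible source of a single missing value.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (Date in DD-MM-YYYY, Quantity, Price per Unit) records.
transactions = [
    ("01-01-2023", 2, 50),
    ("01-01-2023", 1, 300),
    ("02-01-2023", 4, 25),
    ("03-01-2023", 3, 500),
]

# Total Amount = Quantity x Price per Unit, summed per day.
daily = defaultdict(int)
for date_str, quantity, price in transactions:
    day = datetime.strptime(date_str, "%d-%m-%Y").date()
    daily[day] += quantity * price

# Day-over-day percentage change; the first day has no previous day.
days = sorted(daily)
pct_change = {days[0]: None}
for prev, cur in zip(days, days[1:]):
    pct_change[cur] = (daily[cur] - daily[prev]) / daily[prev] * 100

for day in days:
    print(day, daily[day], pct_change[day])
```

    Because the change is relative to the previous day's total, a very small sales day followed by a large one produces extreme values, which is consistent with the wide range reported above.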

    Features

    • Transaction ID: Unique identifier for each transaction.
    • Date: Purchase date in DD-MM-YYYY format.
    • Customer ID: Unique identifier for each customer.
    • Gender: Customer gender (Male / Female).
    • Age: Customer’s age.
    • Product Category: Purchased product category (Clothing, Electronics, Groceries).
    • Quantity: Number of items purchased in the transaction.
    • Price per Unit: Price of a single item.
    • Total Amount: Transaction value (Quantity × Price per Unit).
    • Daily Percent Change: Day-over-day percentage change in sales.

    💬 Feedback & Suggestions

    If you find this dataset helpful for your research or projects, feel free to upvote and share your feedback or suggestions. Your support is appreciated, thank you! 😉



Indeed - Data Science

a small scrape of indeed data science positions

Explore at:
45 scholarly articles cite this dataset (View in Google Scholar)