License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset was scraped from Indeed during the summer of 2024, focusing on the search term 'data scientist.' The data encompasses job listings from every state in the USA, including remote positions, providing a comprehensive snapshot of the data science job market during this period.
Working with this dataset can help students gain valuable experience in data analysis, visualization, and interpretation. Some skills that could be practiced using this data:
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations are intended to provide insights that may help enhance GloBI as a resource for research and education.
Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.
The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm
The companion paper can be found here: doi.org/10.5281/zenodo.814979
Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922
Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
License: other, https://choosealicense.com/licenses/other/
Dataset Summary
NSText2SQL is the dataset used to train NSQL models. The data is curated from more than 20 different public sources across the web with permissible licenses (listed below). All of these datasets come with existing text-to-SQL pairs. We apply various data cleaning and pre-processing techniques including table schema augmentation, SQL cleaning, and instruction generation using existing LLMs. The resulting dataset contains around 290,000 samples of text-to-SQL pairs. For more information, see the full description on the dataset page: https://huggingface.co/datasets/NumbersStation/NSText2SQL.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The World Wide Web is a complex interconnected digital ecosystem, where information and attention flow between platforms and communities throughout the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking which subsequently influences Wikipedia content, by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
The motivations for this dataset stem from the challenges researchers face in studying the flow of information across the web. While the World Wide Web enables global communication and collaboration, data silos, linguistic barriers, and platform-specific restrictions hinder our ability to understand how information circulates, evolves, and impacts public discourse. Wikipedia and Reddit, as major hubs of knowledge sharing and discussion, offer an invaluable lens into these processes. However, without comprehensive data capturing their interactions, researchers are unable to fully examine how platforms co-construct knowledge. This dataset bridges this gap, providing the tools needed to study the interconnectedness of social media and collaborative knowledge systems.
We present WikiReddit, a comprehensive dataset capturing all Wikipedia mentions (including links) shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW (not safe for work) subreddits. The SQL database comprises 336K total posts, 10.2M comments, 1.95M unique links, and 1.26M unique articles spanning 59 languages on Reddit and 276 Wikipedia language subdomains. Each linked Wikipedia article is enriched with its revision history and page view data within a ±10-day window of its posting, as well as article ID, redirects, and Wikidata identifiers. Supplementary anonymous metadata from Reddit posts and comments further contextualizes the links, offering a robust resource for analysing cross-platform information flows, collective attention dynamics, and the role of Wikipedia in online discourse.
Data was collected from the Reddit4Researchers and Wikipedia APIs. No personally identifiable information is published in the dataset. Data from Reddit to Wikipedia is linked via the hyperlink and article titles appearing in Reddit posts.
Extensive processing with tools such as regex was applied to the Reddit post/comment text to extract the Wikipedia URLs. Redirects for Wikipedia URLs and article titles were found through the API and mapped to the collected data. Reddit IDs are hashed with SHA-256 for post/comment/user/subreddit anonymity.
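A minimal sketch of the two processing steps described here, URL extraction with a regular expression and SHA-256 hashing of Reddit IDs; the regex pattern and function names below are illustrative assumptions, not the project's actual code:

```python
import hashlib
import re

# Hypothetical pattern for Wikipedia article links; the project's
# actual extraction regex is not published in this description.
WIKI_URL_RE = re.compile(r"https?://([a-z\-]+)\.wikipedia\.org/wiki/[^\s\)\]>\"']+")

def extract_wikipedia_urls(text):
    """Return all Wikipedia URLs found in a Reddit post or comment."""
    return [m.group(0) for m in WIKI_URL_RE.finditer(text)]

def anonymize_id(reddit_id):
    """Hash a Reddit identifier with SHA-256, as used for anonymity."""
    return hashlib.sha256(reddit_id.encode("utf-8")).hexdigest()

text = "See https://en.wikipedia.org/wiki/Data_science for details."
urls = extract_wikipedia_urls(text)
digest = anonymize_id("t3_abc123")
```

The hash is one-way, so post, comment, user, and subreddit IDs remain linkable within the dataset without being reversible to the originals.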
We foresee several applications of this dataset and preview four here. First, Reddit linking data can be used to understand how attention is driven from one platform to another. Second, Reddit linking data can shed light on how Wikipedia's archive of knowledge is used in the larger social web. Third, our dataset could provide insights into how external attention is topically distributed across Wikipedia, and can help extend that analysis to disparities in which external communities use Wikipedia, and how they use it. Fourth, relatedly, a topic analysis of our dataset could reveal how Wikipedia usage on Reddit contributes to societal benefits and harms. Our dataset could help examine whether homogeneity within the Reddit and Wikipedia audiences shapes topic patterns, and assess whether these relationships mitigate or amplify problematic engagement online.
The dataset is publicly shared with a Creative Commons Attribution 4.0 International license. The article describing this dataset should be cited: https://doi.org/10.48550/arXiv.2502.04942
Patrick Gildersleve will maintain this dataset, and add further years of content as and when available.
posts

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| crosspost_parent_id | TEXT | The ID of the original Reddit post if this post is a crosspost. |
| post_id | TEXT | Unique identifier for the Reddit post. |
| created_at | TIMESTAMP | The timestamp when the post was created. |
| updated_at | TIMESTAMP | The timestamp when the post was last updated. |
| language_code | TEXT | The language code of the post. |
| score | INTEGER | The score (upvotes minus downvotes) of the post. |
| upvote_ratio | REAL | The ratio of upvotes to total votes. |
| gildings | INTEGER | Number of awards (gildings) received by the post. |
| num_comments | INTEGER | Number of comments on the post. |
comments

| Column Name | Type | Description |
|---|---|---|
| subreddit_id | TEXT | The unique identifier for the subreddit. |
| post_id | TEXT | The ID of the Reddit post the comment belongs to. |
| parent_id | TEXT | The ID of the parent comment (if a reply). |
| comment_id | TEXT | Unique identifier for the comment. |
| created_at | TIMESTAMP | The timestamp when the comment was created. |
| last_modified_at | TIMESTAMP | The timestamp when the comment was last modified. |
| score | INTEGER | The score (upvotes minus downvotes) of the comment. |
| upvote_ratio | REAL | The ratio of upvotes to total votes for the comment. |
| gilded | INTEGER | Number of awards (gildings) received by the comment. |
postlinks

| Column Name | Type | Description |
|---|---|---|
| post_id | TEXT | Unique identifier for the Reddit post. |
| end_processed_valid | INTEGER | Whether the extracted URL from the post resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the Reddit post. |
| final_valid | INTEGER | Whether the final URL from the post resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
| final_url | TEXT | The final URL after redirections. |
| redirected | INTEGER | Indicator of whether the posted URL was redirected (1) or not (0). |
| in_title | INTEGER | Indicator of whether the link appears in the post title (1) or post body (0). |
commentlinks

| Column Name | Type | Description |
|---|---|---|
| comment_id | TEXT | Unique identifier for the Reddit comment. |
| end_processed_valid | INTEGER | Whether the extracted URL from the comment resolves to a valid URL. |
| end_processed_url | TEXT | The extracted URL from the comment. |
| final_valid | INTEGER | Whether the final URL from the comment resolves to a valid URL after redirections. |
| final_status | INTEGER | HTTP status code of the final URL. |
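Once the database is loaded, the tables above can be queried together. The sketch below builds a tiny in-memory SQLite stand-in with a subset of the documented columns (the published database's engine and full schema may differ) and joins posts to their resolved links:

```python
import sqlite3

# In-memory stand-in using a subset of the posts/postlinks schemas
# documented above; the rows are made-up examples.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (
    post_id TEXT, subreddit_id TEXT, created_at TIMESTAMP,
    score INTEGER, num_comments INTEGER
);
CREATE TABLE postlinks (
    post_id TEXT, final_url TEXT, final_valid INTEGER, in_title INTEGER
);
""")
con.execute("INSERT INTO posts VALUES ('p1', 's1', '2021-06-01', 42, 7)")
con.execute(
    "INSERT INTO postlinks VALUES "
    "('p1', 'https://en.wikipedia.org/wiki/Data_science', 1, 0)"
)

# Posts joined to the valid Wikipedia links they contain
rows = con.execute("""
    SELECT p.post_id, p.score, l.final_url
    FROM posts p
    JOIN postlinks l ON l.post_id = p.post_id
    WHERE l.final_valid = 1
""").fetchall()
```

The same join pattern extends to comments and commentlinks via comment_id.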
License: MIT License, https://opensource.org/licenses/MIT
This dataset is a cleaned and merged version of the original UCI Online Retail and Online Retail II datasets. It contains transaction data from a UK-based online retailer, covering a period from December 2009 to December 2011.
The original UCI Online Retail II dataset contains two separate sheets:
- Year 2009–2010
- Year 2010–2011
These have been merged with the original UCI Online Retail dataset to create a unified and continuous dataset.
Cleaning and preprocessing steps include:
- Handling of the quantity, price, and customer_id fields
- Added total_price column (quantity × price)
- Added is_cancelled column based on invoice format or return flag
- invoicedate formatting

| Column | Description |
|---|---|
| invoice | Invoice number (returns start with 'C') |
| stockcode | Product code |
| description | Description of product |
| quantity | Number of items purchased |
| invoicedate | Date and time of invoice |
| price | Unit price in GBP |
| customer_id | Unique identifier for each customer |
| country | Customer’s country |
| is_cancelled | Boolean flag for cancelled transactions |
| total_price | Computed total (quantity × price) for each line item |
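As a hedged illustration, the two derived columns described above (is_cancelled from the invoice format, total_price as quantity × price) could be reproduced in pandas roughly like this; the two-row DataFrame is a made-up example, not data from the release:

```python
import pandas as pd

# Made-up example rows mirroring the documented columns
df = pd.DataFrame({
    "invoice": ["536365", "C536379"],  # returns start with 'C'
    "quantity": [6, -1],
    "price": [2.5, 27.5],
})

# Derived columns as described in the column table above
df["is_cancelled"] = df["invoice"].str.startswith("C")
df["total_price"] = df["quantity"] * df["price"]
```

This mirrors the convention in the source data, where cancelled transactions carry negative quantities alongside the 'C'-prefixed invoice number.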
| File | Type | Description |
|---|---|---|
| online_retail_cleaned.csv | Data | Cleaned and merged retail transactions from 2009–2011 |
| rfm_final_score.csv | Output | Final RFM scores for each customer with segment labels |
| Retail_Data_Analysis_Dashboard.xlsx | Excel | Interactive Excel dashboard with KPIs, CLV, monthly trends |
| Retail_Data_Analysis_Dashboard.png | Image | Visual preview of the Excel dashboard |
| RFM_Segmentation.sql | SQL | SQL logic to calculate RFM scores and assign segments |
| Cohort_Analysis_on_Customer.sql | SQL | Cohort analysis based on acquisition month |
| Cohort_Analysis_on_Revenue.sql | SQL | Cohort revenue tracking over time |
In addition to the cleaned dataset, this release includes the complete analysis artifacts listed above. These files are provided in .xlsx and .sql formats and can be used for further business analysis or modeling.
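As a sketch of what the RFM logic in RFM_Segmentation.sql computes, the pandas version below is a hypothetical equivalent on made-up transactions (the actual scoring and segment labels live in that SQL file):

```python
import pandas as pd

# Made-up transactions with the documented column names
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "invoicedate": pd.to_datetime(
        ["2011-11-01", "2011-12-01", "2011-06-15",
         "2011-12-05", "2011-12-06", "2011-12-07"]),
    "total_price": [20.0, 35.0, 10.0, 5.0, 15.0, 25.0],
})
# Snapshot date: day after the last transaction
snapshot = tx["invoicedate"].max() + pd.Timedelta(days=1)

# Recency (days since last purchase), Frequency (transaction count),
# Monetary (total spend) per customer
rfm = tx.groupby("customer_id").agg(
    recency=("invoicedate", lambda d: (snapshot - d.max()).days),
    frequency=("invoicedate", "count"),
    monetary=("total_price", "sum"),
)
```

Quantile-based R/F/M scores and segment labels would then be assigned on top of these three measures.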
Original datasets:
- UCI Online Retail II: https://archive.ics.uci.edu/ml/datasets/Online+Retail+II
This version was cleaned and merged by: Md Shah Nawaj
retail, ecommerce, customer segmentation, transactions, time series, data cleaning, rfm, python, pandas, online retail
License: MIT License, https://opensource.org/licenses/MIT
This dataset contains 1,000 retail transaction records after cleaning and preprocessing.
This synthetic dataset has been meticulously crafted to simulate a dynamic retail environment, providing an ideal playground for those eager to sharpen their data analysis skills through exploratory data analysis (EDA). With a focus on retail sales and customer characteristics, this dataset invites you to unravel intricate patterns, draw insights, and gain a deeper understanding of customer behaviour.
It includes customer demographics, product categories, transaction details, and derived analytics, such as the daily percentage change in sales.
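The daily-percentage-change analytic mentioned above can be derived with pandas' pct_change; the series below is illustrative, not taken from the dataset:

```python
import pandas as pd

# Illustrative daily sales totals
daily_sales = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
)

# Day-over-day percentage change in sales; the first day has no
# previous value, so its change is NaN
pct = daily_sales.pct_change() * 100
```

In the released data this value is precomputed as a derived column rather than recalculated by the user.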
Original dataset (uncleaned): https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset
The dataset can be used for:
cleaned_retail_sales_dataset.csv

💬 Feedback & Suggestions
If you find this dataset helpful for your research or projects, feel free to upvote and share your feedback or suggestions. Your support is appreciated — thank you! 😉