100+ datasets found

Small Business Data
kaggle.com
zip
Updated Mar 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anne Ezeh (2024). Small Business Data [Dataset]. https://www.kaggle.com/datasets/anneezeh/small-business-data
Explore at:
zip(8544 bytes)Available download formats
Dataset updated
Mar 11, 2024
Authors
Anne Ezeh
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by Anne Ezeh

Released under Apache 2.0

Contents
Datasets for Sentiment Analysis
zenodo.org
csv
Updated Dec 10, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10157504
Dataset updated
Dec 10, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.
Below are the datasets specified, along with the details of their references, authors, and download sources.

----------- STS-Gold Dataset ----------------
The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.
Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.
File name: sts_gold_tweet.csv
----------- Amazon Sales Dataset ----------------
This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.
Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)
Features:
product_id - Product ID
product_name - Name of the Product
category - Category of the Product
discounted_price - Discounted Price of the Product
actual_price - Actual Price of the Product
discount_percentage - Percentage of Discount for the Product
rating - Rating of the Product
rating_count - Number of people who voted for the Amazon rating
about_product - Description about the Product
user_id - ID of the user who wrote review for the Product
user_name - Name of the user who wrote review for the Product
review_id - ID of the user review
review_title - Short review
review_content - Long review
img_link - Image Link of the Product
product_link - Official Website Link of the Product
License: CC BY-NC-SA 4.0
File name: amazon.csv
----------- Rotten Tomatoes Reviews Dataset ----------------
This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.
This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).
Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics
File name: data_rt.csv
----------- Preprocessed Dataset Sentiment Analysis ----------------
Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
Stemmed and lemmatized using nltk.
Sentiment labels are generated using TextBlob polarity scores.
The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).
DOI: 10.34740/kaggle/dsv/3877817
Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }
This dataset was used in the experimental phase of my research.
File name: EcoPreprocessed.csv
----------- Amazon Earphones Reviews ----------------
This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)
License: U.S. Government Works
Source: www.amazon.in
File name (original): AllProductReviews.csv (contains 14337 reviews)
File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)
----------- Amazon Musical Instruments Reviews ----------------
This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.
This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.
The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).
Source: http://jmcauley.ucsd.edu/data/amazon/
File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)
File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)
ETT-small
kaggle.com
zip
Updated Oct 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alaa ELmor (2022). ETT-small [Dataset]. https://www.kaggle.com/datasets/alaaelmor/ettsmall
Explore at:
zip(4083932 bytes)Available download formats
Dataset updated
Oct 22, 2022
Authors
Alaa ELmor
Description
Dataset

This dataset was created by Alaa ELmor

Contents
CSV file used in statistical analyses
data.csiro.au
researchdata.edu.au
+1more
Updated Oct 13, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
Explore at:
Unique identifier
https://doi.org/10.4225/08/543B4B4CA92E6
Dataset updated
Oct 13, 2014
Dataset authored and provided by
CSIROhttp://www.csiro.au/
License
https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/
Time period covered
Mar 14, 2008 - Jun 9, 2009
Dataset funded by
CSIROhttp://www.csiro.au/
Description
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
Company Datasets for Business Profiling
datarade.ai
Updated Feb 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Explore at:
.json, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 23, 2017
Dataset authored and provided by
Oxylabs
Area covered
Nepal, Tunisia, Canada, British Indian Ocean Territory, Andorra, Bangladesh, Isle of Man, Northern Mariana Islands, Moldova (Republic of), Taiwan
Description
Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

Company name;

Size;

Founding date;

Location;

Industry;

Revenue;

Employee count;

Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

Fresh and accurate data collected and parsed by our expert web scraping team.

Time and resource savings, allowing you to focus on data analysis and achieving your business goals.

A customized approach tailored to your specific business needs.

Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
Small_customer_data_csv
kaggle.com
zip
Updated Sep 27, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ayush11111111 (2023). Small_customer_data_csv [Dataset]. https://www.kaggle.com/datasets/ayush11111111/small-customer-data-csv
Explore at:
zip(746 bytes)Available download formats
Dataset updated
Sep 27, 2023
Authors
Ayush11111111
Description
Dataset

This dataset was created by Ayush11111111

Contents
Datasets used in Transitive prediction of small-molecule function through...
figshare.com
csv
Updated May 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Feng Bao (2025). Datasets used in Transitive prediction of small-molecule function through alignment of high-content screening resources [Dataset]. http://doi.org/10.6084/m9.figshare.29061038.v2
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29061038.v2
Dataset updated
May 14, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Feng Bao
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset supports the development of CLIPn, a contrastive-learning framework designed to align heterogeneous high-content screening (HCS) profile datasets.GitHub link: https://github.com/AltschulerWu-Lab/CLIPnDirectory StructureFoldersraw_profilesHCS13/Contains raw data from 13 high-content screening (HCS) datasets. Each dataset includes meta and feature files.L1000/CDRP_feature_exp.csv: Raw L1000 expression data from the CDRP dataset.CDRP_meta_exp.csv: Metadata associated with the CDRP expression data.LINCS_feature_exp.csv: Raw L1000 expression data from the LINCS dataset.LINCS_meta_exp.csv: Metadata associated with the LINCS expression data.RxRx3/RxRx3_feature_final.csv: Profile data from the RxRx3 dataset.RxRx3_meta_final.csv: Metadata from the RxRx3 dataset.Uncharacterized_compounds/NCI_cpnData.csv: Feature data for uncharacterized compounds from the NCI dataset.NCI_cpnInfo.csv: Information about uncharacterized compounds in the NCI dataset.Prestwick_UTSW_cpnData.csv: Feature data for uncharacterized compounds from the Prestwick UTSW dataset.Prestwick_UTSW_cpnInfo.csv: Information about uncharacterized compounds from the Prestwick UTSW dataset.Data ReferenceFor raw datasets from 13 HCS database, data and analysis pipeline for dataset 1 was obtained from https://www.science.org/doi/suppl/10.1126/science.1100709/suppl_file/perlman.som.zip; for datasets 2-3, data were shared by authors; For datasets 4-5, analysis code was downloaded from https://static-content.springer.com/esm/art:10.1038/nbt.3419/MediaObjects/41587_2016_BFnbt3419_MOESM21_ESM.zip and data were shared by authors; For datasets 6-7, processed dataset was downloaded from AWS following instructions from https://github.com/carpenter-singh-lab/2022_Haghighi_NatureMethods, and replicate_level_cp_normalized.csv.gz features were used. For project datasets 8-13, datasets and analysis results were downloaded from https://zenodo.org/records/7352487. For RxRx3, dataset was obtained from https://www.rxrx.ai/rxrx3. L1000 transcript datasets were downloaded using the same link as datasets 6-7 and the processed transcript data files (named “replicate_level_l1k.csv”) were used.
f
TP and NTP small dataset
datasetcatalog.nlm.nih.gov
figshare.arts.ac.uk
Updated Mar 31, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Velios, Athanasios (2022). TP and NTP small dataset [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000439502
Explore at:
Dataset updated
Mar 31, 2022
Authors
Velios, Athanasios
Description
This dataset includes statements about manuscripts from the library of St. Catherine Monastery in Sinai and specifically about the existence of leaf markers on each manuscript. The dataset is provided in three formats: CSV, OWL and RDF.Leaf markers are not individually identified. Only their existence and type is indicated. The dataset is used to demonstrate a method of describing numerous individuals and absence of types in Knowledge Bases. two-records.csv is part of the original data as collected at the Monastery.two-records-owlcro.owl holds part of the original data alongside fictional records of individual leaf markers for each book (these do not exist but they are necessary to demonstrate the applicable method)two-records-owlcrop.owl holds part of the original data onlyThe same logic is followed for the RDF files.The size of this dataset allows performing test reasoning in OWL. A full dataset is also available in this repository.
Annotated Benchmark of Real-World Data for Approximate Functional Dependency...
zenodo.org
data.niaid.nih.gov
csv
Updated Jul 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marcel Parciak; Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren (2023). Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery [Dataset]. http://doi.org/10.5281/zenodo.8098909
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8098909
Dataset updated
Jul 1, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Marcel Parciak; Marcel Parciak; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren; Sebastiaan Weytjens; Frank Neven; Niel Hens; Liesbet M. Peeters; Stijn Vansummeren
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.

The file ground_truth.csv is a comma separated file containing approximate functional dependencies. table describes the relation we refer to, lhs and rhs reference two columns of those relations where semantically we found that lhs implies rhs.

The file excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded or included in the manual annotation, respectively. We excluded a candidate if there was no tuple where both attributes had a value or if the g3_prime value was too small.

Dataset References

adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.

dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.

hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part the dataset collection to that paper.

t_biocase_... files: t_bioc_... files used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part the dataset collection to that paper.

tax.csv: Tax dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part the dataset collection to that paper.

Code4ML 2.0

zenodo.org

csv, txt

Updated May 19, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Anonimous authors; Anonimous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737

Explore at:

csv, txtAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.15465737

Dataset updated

May 19, 2025

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Anonimous authors; Anonimous authors

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

The original dataset is organized into multiple CSV files, each containing structured data on different entities:

code_blocks.csv: Contains raw code snippets extracted from Kaggle.
kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

Table 1. code_blocks.csv structure

Column	Description
code_blocks_index	Global index linking code blocks to markup_data.csv.
kernel_id	Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
code_block_id	Position of the code block within the notebook.
code_block	The actual machine learning code snippet.

Table 2. kernels_meta.csv structure

Column	Description
kernel_id	Identifier for the Kaggle Jupyter notebook.
kaggle_score	Performance metric of the notebook.
kaggle_comments	Number of comments on the notebook.
kaggle_upvotes	Number of upvotes the notebook received.
kernel_link	URL to the notebook.
comp_name	Name of the associated Kaggle competition.

Table 3. competitions_meta.csv structure

Column	Description
comp_name	Name of the Kaggle competition.
description	Overview of the competition task.
data_type	Type of data used in the competition.
comp_type	Classification of the competition.
subtitle	Short description of the task.
EvaluationAlgorithmAbbreviation	Metric used for assessing competition submissions.
data_sources	Links to datasets used.
metric type	Class label for the assessment metric.

Table 4. markup_data.csv structure

Column	Description
code_block	Machine learning code block.
too_long	Flag indicating whether the block spans multiple semantic types.
marks	Confidence level of the annotation.
graph_vertex_id	ID of the semantic type.

The dataset allows mapping between these tables. For example:

code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csvvia the code_blocks_index column.

Code4ML 2.0 Enhancements

The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to the kaggle competitions launched since 2020. The natural descriptions of the competitions are retrieved with the aim of LLM.

Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

competitions_meta_2.csv is enriched with data_cards, decsribing the data used in the competitions.

Applications

The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

Code generation
Code understanding
Natural language processing of code-related tasks

h
short-jokes-punchline
huggingface.co
Updated Nov 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Timxjl (2024). short-jokes-punchline [Dataset]. https://huggingface.co/datasets/Timxjl/short-jokes-punchline
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 25, 2024
Authors
Timxjl
License
https://choosealicense.com/licenses/gpl-2.0/https://choosealicense.com/licenses/gpl-2.0/
Description
Short Jokes Punchline

This dataset contains information about jokes, visitors, labels, and label segments used in a joke labeling application. The data is stored in four CSV files: joke.csv, visitor.csv, label.csv, and label_segment.csv.

Files joke.csv

This file contains 200 jokes randomly sampled from the Kaggle dataset "Short Jokes." Each row represents a joke with the following columns:

id: The unique identifier for the joke. text: The text content of the… See the full description on the dataset page: https://huggingface.co/datasets/Timxjl/short-jokes-punchline.
d
Data from: CSV file of names, times, and locations of images collected by an...
catalog.data.gov
Updated Nov 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). CSV file of names, times, and locations of images collected by an unmanned aerial system (UAS) flying over Black Beach, Falmouth, Massachusetts on 18 March 2016 [Dataset]. https://catalog.data.gov/dataset/csv-file-of-names-times-and-locations-of-images-collected-by-an-unmanned-aerial-system-uas
Explore at:
Dataset updated
Nov 26, 2025
Dataset provided by
U.S. Geological Survey
Area covered
Massachusetts, Black Beach, Falmouth
Description
Imagery acquired with unmanned aerial systems (UAS) and coupled with structure from motion (SfM) photogrammetry can produce high-resolution topographic and visual reflectance datasets that rival or exceed lidar and orthoimagery. These new techniques are particularly useful for data collection of coastal systems, which requires high temporal and spatial resolution datasets. The U.S. Geological Survey worked in collaboration with members of the Marine Biological Laboratory and Woods Hole Analytics at Black Beach, in Falmouth, Massachusetts to explore scientific research demands on UAS technology for topographic and habitat mapping applications. This project explored the application of consumer-grade UAS platforms as a cost-effective alternative to lidar and aerial/satellite imagery to support coastal studies requiring high-resolution elevation or remote sensing data. A small UAS was used to capture low-altitude photographs and GPS devices were used to survey reference points. These data were processed in an SfM workflow to create an elevation point cloud, an orthomosaic image, and a digital elevation model.
h
my-dataset
huggingface.co
Updated May 16, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
aman (2024). my-dataset [Dataset]. https://huggingface.co/datasets/ns-1/my-dataset
Explore at:
Dataset updated
May 16, 2024
Authors
aman
Description
This directory includes a few sample datasets to get you started.

california_housing_data*.csv is California housing data from the 1990 US Census; more information is available at: https://docs.google.com/document/d/e/2PACX-1vRhYtsvc5eOR2FWNCwaBiKL6suIOrxJig8LcSBbmCbyYsayia_DvPOOBlXZ4CAlQ5nlDD8kTaIDRwrN/pub

mnist_*.csv is a small sample of the MNIST database, which is described at: http://yann.lecun.com/exdb/mnist/

anscombe.json contains a copy of Anscombe's quartet; it was originally… See the full description on the dataset page: https://huggingface.co/datasets/ns-1/my-dataset.
Popular Quotes from GoodReads
kaggle.com
zip
Updated Jul 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Souradip Pal (2023). Popular Quotes from GoodReads [Dataset]. https://www.kaggle.com/datasets/souradippal/popular-quotes-from-goodreads
Explore at:
zip(191669 bytes)Available download formats
Dataset updated
Jul 1, 2023
Authors
Souradip Pal
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The DataSet is in csv format. Contains author name, quote, and popularity. Scraped from GoodReads. Drawbacks: I was only able to scrape 3k data.
SynSpeech Dataset (Small Version)
figshare.com
csv
Updated Nov 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yusuf Brima (2024). SynSpeech Dataset (Small Version) [Dataset]. http://doi.org/10.6084/m9.figshare.27627840.v1
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.27627840.v1
Dataset updated
Nov 7, 2024
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Yusuf Brima
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The SynSpeech Dataset (Small Version) is an English-language synthetic speech dataset created using OpenVoice and LibriSpeech-100 for bench-marking disentangled speech representation learning methods. It includes 50 unique speakers, each with 500 distinct sentences spoken in a “default” style at a 16kHz sampling rate. Data is organized by speaker ID, with a synspeech_Small_Metadata.csv file detailing speaker information, gender, speaking style, text, and file paths. This dataset is ideal for tasks in representation learning, speaker and content factorization, and TTS synthesis.
o
CLEAN_SmallRingTensileTest_StainlessSteel316L_ExtensionRate1_15mmMin
ordo.open.ac.uk
txt
Updated Mar 1, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aniket Joshi; Alexander Forsey; Richard Moat; Salih Gungor (2023). CLEAN_SmallRingTensileTest_StainlessSteel316L_ExtensionRate1_15mmMin [Dataset]. http://doi.org/10.21954/ou.rd.22114536.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.21954/ou.rd.22114536.v1
Dataset updated
Mar 1, 2023
Dataset provided by
The Open University
Authors
Aniket Joshi; Alexander Forsey; Richard Moat; Salih Gungor
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
This is a part of the test dataset (Digital Image Correlation Images and CSV data) that describes the experiments done at The Open University for the Small Ring Tensile Testing of SS316L performed at various displacement rates. Unprocessed images (.NEF format) start with the prefix 'RAW' while processed images (.TIF format) start with the prefix 'CLEAN'. The following letters after that describe the test type (Small Ring Test or Uniaxial Test), followed by the material. Lastly, after the material, the crosshead extension rate is described. For instance, 'Extension Rate0_3mmMin' refers to an extension rate of 0.3 mm/min and so on. The 'RAW' folders also contain the unprocessed CSV files. The 'CLEAN' folders contain the camera information (capture interval, ISO, etc) as well as the denoised CSV experimental files. The CSV files are denoised with the help of a Butterworth filter.
Podcast PR Contacts - Self-Service CSV Batch Export
datarade.ai
.csv, .xls
Updated May 27, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Listen Notes (2025). Podcast PR Contacts - Self-Service CSV Batch Export [Dataset]. https://datarade.ai/data-products/podcast-pr-contacts-self-service-csv-batch-export-listen-notes
Explore at:
.csv, .xlsAvailable download formats
Dataset updated
May 27, 2025
Dataset authored and provided by
Listen Notes
Area covered
Bulgaria, Algeria, Kuwait, Costa Rica, Dominican Republic, French Polynesia, Israel, Gibraltar, Congo, Benin
Description
== Quick starts ==

Batch export podcast metadata to CSV files:

1) Export by search keyword: https://www.listennotes.com/podcast-datasets/keyword/

2) Export by category: https://www.listennotes.com/podcast-datasets/category/

== Quick facts ==

The most up-to-date and comprehensive podcast database available All languages & All countries Includes over 3,500,000 podcasts Features 35+ data fields , such as basic metadata, global rank, RSS feed (with audio URLs), Spotify links, and more Delivered in CSV format

== Data Attributes ==

See the full list of data attributes on this page: https://www.listennotes.com/podcast-datasets/fields/?filter=podcast_only

How to access podcast audio files: Our dataset includes RSS feed URLs for all podcasts. You can retrieve audio for over 170 million episodes directly from these feeds. With access to the raw audio, you’ll have high-quality podcast speech data ideal for AI training and related applications.

== Custom Offers ==

We can provide custom datasets based on your needs, such as language-specific data, daily/weekly/monthly update frequency, or one-time purchases.

We also provide a RESTful API at PodcastAPI.com

Contact us: hello@listennotes.com

== Need Help? ==

If you have any questions about our products, feel free to reach out hello@listennotes.com

== About Listen Notes, Inc. ==

Since 2017, Listen Notes, Inc. has provided the leading podcast search engine and podcast database.
m
Phishing Websites Dataset
data.mendeley.com
Updated Sep 24, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Grega Vrbančič (2020). Phishing Websites Dataset [Dataset]. http://doi.org/10.17632/72ptz43s9v.1
Explore at:
Unique identifier
https://doi.org/10.17632/72ptz43s9v.1
Dataset updated
Sep 24, 2020
Authors
Grega Vrbančič
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These data consist of a collection of legitimate as well as phishing website instances. Each website is represented by the set of features which denote, whether website is legitimate or not. Data can serve as an input for machine learning process.

In this repository the two variants of the Phishing Dataset are presented.

Full variant - dataset_full.csv Short description of the full variant dataset: Total number of instances: 88,647 Number of legitimate website instances (labeled as 0): 58,000 Number of phishing website instances (labeled as 1): 30,647 Total number of features: 111

Small variant - dataset_small.csv Short description of the small variant dataset: Total number of instances: 58,645 Number of legitimate website instances (labeled as 0): 27,998 Number of phishing website instances (labeled as 1): 30,647 Total number of features: 111
Small Business Financial Dataset (2022–2023)
kaggle.com
zip
Updated Sep 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabrielle Charlton (2025). Small Business Financial Dataset (2022–2023) [Dataset]. https://www.kaggle.com/datasets/gabriellecharlton/coffee-shop-financial-dataset-synthetic-2022-2023
Explore at:
zip(22299 bytes)Available download formats
Dataset updated
Sep 2, 2025
Authors
Gabrielle Charlton
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
📊 Coffee Shop Financial Dataset (Synthetic, 2022–2023)

📝 Overview

This dataset simulates the financial records of a small-town coffee shop over a two-year period (Jan 2022 – Dec 2023).
It was designed for data science, bookkeeping, and analytics projects — including financial dashboards, revenue forecasting, and expense tracking.

The dataset contains 5 CSV files representing different business accounts:
1. checking_account_main.csv - Daily sales deposits (hot drinks, cold drinks, pastries, sandwiches) + operating expenses
2. checking_account_secondary.csv - Monthly transfers between accounts + payroll funding
3. credit_card_account.csv - Weekly credit card expenses (supplies, utilities, vendor charges) and payments
4. gusto_payroll.csv - Payroll data for 3 employees + 1 contractor
5. gusto_payroll_bc.csv - Payroll data for 3 full-time employees + 1 contractor + 1 seasonal employee, with actual tax breakdown for the province of British Columbia, Canada

📂 File Details

checking_account_main.csv

date

description

category (Sales, Utilities, Rent, Supplies, etc.)

amount (positive = inflow, negative = outflow)

balance

checking_account_secondary.csv

date

description

amount

balance

credit_card_account.csv

date

vendor

category (Supplies, Marketing, Utilities, etc.)

amount (negative = charge, positive = payment)

balance

gusto_payroll.csv

date

employee_id

employee_name (Owner, Barista 1, Barista 2, Contractor)

role (Owner, Barista, Manager, Contractor)

gross_pay

gusto_payroll_bc.csv

This file simulates bi-weekly payroll data for a small coffee shop in British Columbia, Canada, covering January 2022 – December 2023.
It reflects realistic Canadian payroll structure with federal and provincial tax breakdowns, CPP, EI, and additional factors.

Columns: - date → Pay date (bi-weekly schedule)
- employee_id → Unique identifier for each employee
- employee_name → Owner, Barista 1, Barista 2, Manager, Contractor, plus a seasonal Barista (June–Aug 2022)
- role → Role within the coffee shop (Owner, Barista, Manager, Contractor)
- gross_pay → Total earnings before deductions (wages + tips + reimbursements)
- federal_tax → Federal income tax withheld
- provincial_tax → British Columbia income tax withheld
- cpp_employee → Employee CPP contribution
- ei_employee → Employee EI contribution
- other_deductions → Placeholder for possible deductions (e.g., garnishments, union dues)
- net_pay → Take-home pay after deductions
- tips → Declared tips (taxable, included in gross pay)
- travel_reimbursement → Non-taxable reimbursement for travel expenses (if applicable)
- cpp_employer → Employer portion of CPP contributions
- ei_employer → Employer portion of EI contributions

Notes: - Payroll data is synthetic but modeled on Canadian payroll rules (2022–2023 rates).
- A seasonal barista employee is included (employed June 1 – Aug 31, 2022).
- Travel reimbursements are non-taxable and recorded separately.
- This file allows users to practice payroll accounting, deductions analysis, and tax reconciliation.

📈 Business Context

The coffee shop experiences higher sales September–February (holiday season & winter drinks).

Sales dip March–June due to seasonality in a small town.

Pastries are sourced from a local bakery, while sandwiches are made in-house.

Payroll includes 3 employees (baristas, manager) and 1 independent contractor.

🎯 Possible Use Cases

Build a financial health dashboard

Forecast revenue and expenses

Create a profit & loss statement

Test SQL queries for accounting workflows

Explore data visualization with Python, R, or BI tools

Educational projects for small business analytics

📜 License

This dataset is released under the MIT License, free to use for research, learning, or commercial purposes.

⭐ If you use this dataset in your project or notebook, please credit and share your work, it helps the community!

📷 Photo Credits: freepik
Tacit Knowledge management Dataset.csv
figshare.com
txt
Updated Jul 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Philip Adu Sarfo (2023). Tacit Knowledge management Dataset.csv [Dataset]. http://doi.org/10.6084/m9.figshare.23702121.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.23702121.v1
Dataset updated
Jul 18, 2023
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Philip Adu Sarfo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset is a study aimed to investigate the impact of tacit knowledge management systems (TKM) on organizational performance (OP) among Ghanaian small and medium enterprises (SMEs), addressing the function of employee performance (EP) and Job Satisfaction (JS) building on the knowledge-based viewpoint (KBV).

Facebook

Twitter

Click to copy link

Link copied

Cite

Anne Ezeh (2024). Small Business Data [Dataset]. https://www.kaggle.com/datasets/anneezeh/small-business-data

Small Business Data

Explore at:

zip(8544 bytes)Available download formats

Dataset updated

Mar 11, 2024

Authors

Anne Ezeh

License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

Dataset

This dataset was created by Anne Ezeh

Released under Apache 2.0

Clear search

Close search

Google apps

Main menu

Small Business Data

Dataset

Contents

Datasets for Sentiment Analysis

ETT-small

Dataset

Contents

CSV file used in statistical analyses

Company Datasets for Business Profiling

Small_customer_data_csv

Dataset

Contents

Datasets used in Transitive prediction of small-molecule function through...

TP and NTP small dataset

Annotated Benchmark of Real-World Data for Approximate Functional Dependency...

Code4ML 2.0

Code4ML 2.0 Enhancements

Applications

short-jokes-punchline

Data from: CSV file of names, times, and locations of images collected by an...

my-dataset

Popular Quotes from GoodReads

SynSpeech Dataset (Small Version)

CLEAN_SmallRingTensileTest_StainlessSteel316L_ExtensionRate1_15mmMin

Podcast PR Contacts - Self-Service CSV Batch Export

Phishing Websites Dataset

Small Business Financial Dataset (2022–2023)

📊 Coffee Shop Financial Dataset (Synthetic, 2022–2023)

📝 Overview

📂 File Details

checking_account_main.csv

checking_account_secondary.csv

credit_card_account.csv

gusto_payroll.csv

gusto_payroll_bc.csv

📈 Business Context

🎯 Possible Use Cases

📜 License

Tacit Knowledge management Dataset.csv

Small Business Data

Dataset

Contents

`checking_account_main.csv`

`checking_account_secondary.csv`

`credit_card_account.csv`

`gusto_payroll.csv`

`gusto_payroll_bc.csv`