100+ datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    csvAvailable download formats
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Edwin Carreño; Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the data sets published here contain actual data, they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv:contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 refers to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    UniProt IDstringprotein identification
    labelstringprotein label (type of node)
    propertiesstringa dictionary containing properties related to the protein.

    CSV edges

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    Relationship IDstringrelationship identification
    Source IDstringidentification of the source protein in the relationship
    Target IDstringidentification of the target protein in the relationship
    labelstringrelationship label (type of relationship)
    propertiesstringa dictionary containing properties related to the relationship.

    Metadata

    GraphNumber of NodesNumber of EdgesSparse graph

    dataset_30*

    30

    47

    Y

    dataset_60*

    60

    181

    Y

    dataset_120*

    120

    689

    Y

    dataset_240*

    240

    2819

    Y

    dataset_300*

    300

    4658

    Y

    dataset_600*

    600

    18004

    Y

    dataset_1200*

    1200

    71785

    Y

    dataset_2400*

    2400

    288600

    Y

    dataset_3000*

    3000

    449727

    Y

    dataset_6000*

    6000

    1799413

    Y

    dataset_12000*

    12000

    7199863

    Y

    dataset_24000*

    24000

    28792361

    Y

    dataset_30000*

    30000

    44991744

    Y

    This repository include two (2) additional tiny graph datasets to experiment before dealing with larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    IDstringnode identification
    labelstringnode label (type of node)
    propertiesstringa dictionary containing properties related to the node.

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the ColumnTypeDescription
    IDstringrelationship identification
    sourcestringidentification of the source node in the relationship
    targetstringidentification of the target node in the relationship
    labelstringrelationship label (type of relationship)
    propertiesstringa dictionary containing properties related to the relationship.

    Metadata (tiny graphs)

    GraphNumber of NodesNumber of EdgesSparse graph
    dataset_dummy*36N
    dataset_dummy2*36N
  2. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1more
    Updated Oct 13, 2014
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Explore at:
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIROhttp://www.csiro.au/
    License

    https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIROhttp://www.csiro.au/
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  3. GitTables 1M - CSV files

    • zenodo.org
    zip
    Updated Jun 6, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains >800K CSV files behind the GitTables 1M corpus.

    For more information about the GitTables corpus, visit:

    - our website for GitTables, or

    - the main GitTables download page on Zenodo.

  4. i

    Sample Dataset for Testing

    • ieee-dataport.org
    Updated Apr 28, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alex Outman (2025). Sample Dataset for Testing [Dataset]. https://ieee-dataport.org/documents/sample-dataset-testing
    Explore at:
    Dataset updated
    Apr 28, 2025
    Authors
    Alex Outman
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    10

  5. Sample csv data

    • kaggle.com
    Updated Oct 12, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Md Faisal Ahmed Arman (2021). Sample csv data [Dataset]. https://www.kaggle.com/mdfaisalahmedarman/sample-csv-data/tasks
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 12, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Md Faisal Ahmed Arman
    Description

    Dataset

    This dataset was created by Md Faisal Ahmed Arman

    Contents

  6. d

    Residential School Locations Dataset (CSV Format)

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Orlandini, Rosa (2023). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Orlandini, Rosa
    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRRSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconcilation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative, and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its’ original location to another property, then the school is considered to have two unique locations in this dataset,the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.

  7. h

    doc-formats-csv-3

    • huggingface.co
    Updated Nov 23, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Datasets examples (2023). doc-formats-csv-3 [Dataset]. https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 23, 2023
    Dataset authored and provided by
    Datasets examples
    Description

    [doc] formats - csv - 3

    This dataset contains one csv file at the root:

    data.csv

    ignored comment

    col1|col2 dog|woof cat|meow pokemon|pika human|hello

    We define the config name in the YAML config, as well as the exact location of the file, the separator as "|", the name of the columns, and the number of rows to ignore (the row #1 is a row of column headers, that will be replaced by the names option, and the row #0 is ignored). The reference for the options is the documentation… See the full description on the dataset page: https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3.

  8. r

    1000 Empirical Time series

    • researchdata.edu.au
    • figshare.com
    Updated May 5, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ben Fulcher (2022). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
    Explore at:
    Dataset updated
    May 5, 2022
    Dataset provided by
    Monash University
    Authors
    Ben Fulcher
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.


    The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat for use in Matlab using v1.06 of hctsa.

    The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv.

    These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.

    The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as
    >> TS_Init('INP_Empirical1000.mat');

    Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.

    See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.

  9. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    • data.europa.eu
    zip
    Updated Jul 25, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  10. Sample CSV files

    • kaggle.com
    Updated Mar 9, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Naman Kumar (2022). Sample CSV files [Dataset]. https://www.kaggle.com/matcauthon49/sample-csv-files/discussion
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 9, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Naman Kumar
    Description

    Dataset

    This dataset was created by Naman Kumar

    Contents

  11. h

    doc-formats-csv-1

    • huggingface.co
    Updated Nov 23, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Datasets examples (2023). doc-formats-csv-1 [Dataset]. https://huggingface.co/datasets/datasets-examples/doc-formats-csv-1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 23, 2023
    Dataset authored and provided by
    Datasets examples
    Description

    [doc] formats - csv - 1

    This dataset contains one csv file at the root:

    data.csv

    kind,sound dog,woof cat,meow pokemon,pika human,hello

    The YAML section of the README does not contain anything related to loading the data (only the size category metadata):

    size_categories:

    - n<1K

  12. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Explore at:
    .json, .xml, .csv, .xlsAvailable download formats
    Dataset updated
    Feb 23, 2017
    Dataset authored and provided by
    Oxylabs
    Area covered
    Moldova (Republic of), Bangladesh, British Indian Ocean Territory, Northern Mariana Islands, Andorra, Taiwan, Canada, Nepal, Isle of Man, Tunisia
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  13. Google Analytics Sample

    • kaggle.com
    zip
    Updated Sep 19, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/datasets/bigquery/google-analytics-sample
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Sep 19, 2019
    Dataset provided by
    Googlehttp://google.com/
    BigQueryhttps://cloud.google.com/bigquery
    Authors
    Google BigQuery
    License

    https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

    Content

    The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

    Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc. Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc. Transactional data: information about the transactions that occur on the Google Merchandise Store website.

    Fork this kernel to get started.

    Acknowledgements

    Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

    Banner Photo by Edho Pratama from Unsplash.

    Inspiration

    What is the total number of transactions generated per device browser in July 2017?

    The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

    What was the average number of product pageviews for users who made a purchase in July 2017?

    What was the average number of product pageviews for users who did not make a purchase in July 2017?

    What was the average total transactions per user that made a purchase in July 2017?

    What is the average amount of money spent per session in July 2017?

    What is the sequence of pages viewed?

  14. Yelp Dataset - Contains 1 million rows

    • kaggle.com
    Updated Jan 29, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abdul Majid (2022). Yelp Dataset - Contains 1 million rows [Dataset]. https://www.kaggle.com/datasets/abdulmajid115/yelp-dataset-contains-1-million-rows
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 29, 2022
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Abdul Majid
    Description

    Context

    The data has been acquired from yelp website.

    Content

    The data can help people find companies/organizations with respect to ratings and reviews. This can help people to choose or recommend best services out there.

  15. Walmart Dataset

    • crawlfeeds.com
    csv, zip
    Updated Apr 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Walmart Dataset [Dataset]. https://crawlfeeds.com/datasets/walmart-dataset
    Explore at:
    csv, zipAvailable download formats
    Dataset updated
    Apr 26, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    Walmart products sample dataset having 1000+ records in CSV format. Download monthly dataset for walmart data and it having around 100K+ records.

    Get 50% discount for all datasets. Link

  16. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    csvAvailable download formats
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Julie R. Repository creator - Campos Arias; Julie R. Repository creator - Campos Arias
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset is having the data of 1K+ Amazon Product's Ratings and Reviews as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the Official Website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5331 rows contains only negative samples and the last 5331 rows contain only positive samples, thus the data should be shuffled before usage.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed amazon product review data of Gen3EcoDot (Alexa) scrapped entirely from amazon.in
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of a 9930 Amazon reviews, star ratings, for 10 latest (as of mid-2019) bluetooth earphone devices for learning how to train Machine for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review (raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  17. _labels1.csv. This data set representss the label of the corresponding...

    • figshare.com
    txt
    Updated Oct 9, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    naillah gul (2023). _labels1.csv. This data set representss the label of the corresponding samples in data.csv file [Dataset]. http://doi.org/10.6084/m9.figshare.24270088.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    figshare
    Authors
    naillah gul
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The datasets contain pixel-level hyperspectral data of six snow and glacier classes. They have been extracted from a Hyperspectral image. The dataset "data.csv" has 5417 * 142 samples belonging to the classes: Clean snow, Dirty ice, Firn, Glacial ice, Ice mixed debris, and Water body. The dataset "_labels1.csv" has corresponding labels of the "data.csv" file. The dataset "RGB.csv" has only 5417 * 3 samples. There are only three band values in this file while "data.csv" has 142 band values.

  18. f

    Example of a csv file exported from the database.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Oct 24, 2014
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Caselle, Jennifer E.; Iles, Alison; Tinker, Martin T.; Black, August; Novak, Mark; Carr, Mark H.; Malone, Dan; Beas-Luna, Rodrigo; Hoban, Michael (2014). Example of a csv file exported from the database. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001227183
    Explore at:
    Dataset updated
    Oct 24, 2014
    Authors
    Caselle, Jennifer E.; Iles, Alison; Tinker, Martin T.; Black, August; Novak, Mark; Carr, Mark H.; Malone, Dan; Beas-Luna, Rodrigo; Hoban, Michael
    Description

    Example of a csv file exported from the database.

  19. w

    Randomized Hourly Load Data for use with Taxonomy Distribution Feeders

    • data.wu.ac.at
    application/unknown
    Updated Aug 29, 2017
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Department of Energy (2017). Randomized Hourly Load Data for use with Taxonomy Distribution Feeders [Dataset]. https://data.wu.ac.at/schema/data_gov/NWYwYmFmYTItOWRkMC00OWM0LTk3OGYtZDcyYzZiOWY5N2Ez
    Explore at:
    application/unknownAvailable download formats
    Dataset updated
    Aug 29, 2017
    Dataset provided by
    Department of Energy
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder’s geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].

    The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.

    This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.

    For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to written as an absolute time; subsequent times may be written in relative format (i.e. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.

    Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.

    For questions about this dataset, contact andy.hoke@nrel.gov.

    If you find this dataset useful, please mention NREL and cite [1] in your work.

    References:

    [1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, “Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders,” IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .

    [2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, “Modern Grid Initiative Distribution Taxonomy Final Report”, PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf

    [3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, “Distribution power flow for smart grid technologies”, IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.

  20. f

    Event Logs CSV

    • figshare.com
    rar
    Updated Dec 9, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dina Bayomie (2019). Event Logs CSV [Dataset]. http://doi.org/10.6084/m9.figshare.11342063.v1
    Explore at:
    rarAvailable download formats
    Dataset updated
    Dec 9, 2019
    Dataset provided by
    figshare
    Authors
    Dina Bayomie
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The event logs in CSV format. The dataset contains both correlated and uncorrelated logs

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
Organization logo

Sample Graph Datasets in CSV Format

Explore at:
csvAvailable download formats
Dataset updated
Dec 9, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Edwin Carreño; Edwin Carreño
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sample Graph Datasets in CSV Format

Note: none of the data sets published here contain actual data, they are for testing purposes only.

Description

This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

  • dataset_30_nodes_interactions.csv:contains 30 rows (nodes).
  • dataset_30_edges_interactions.csv: contains 47 rows (edges).
  • the common identifier dataset_30 refers to the same graph.

CSV nodes

Each dataset contains the following columns:

Name of the ColumnTypeDescription
UniProt IDstringprotein identification
labelstringprotein label (type of node)
propertiesstringa dictionary containing properties related to the protein.

CSV edges

Each dataset contains the following columns:

Name of the ColumnTypeDescription
Relationship IDstringrelationship identification
Source IDstringidentification of the source protein in the relationship
Target IDstringidentification of the target protein in the relationship
labelstringrelationship label (type of relationship)
propertiesstringa dictionary containing properties related to the relationship.

Metadata

GraphNumber of NodesNumber of EdgesSparse graph

dataset_30*

30

47

Y

dataset_60*

60

181

Y

dataset_120*

120

689

Y

dataset_240*

240

2819

Y

dataset_300*

300

4658

Y

dataset_600*

600

18004

Y

dataset_1200*

1200

71785

Y

dataset_2400*

2400

288600

Y

dataset_3000*

3000

449727

Y

dataset_6000*

6000

1799413

Y

dataset_12000*

12000

7199863

Y

dataset_24000*

24000

28792361

Y

dataset_30000*

30000

44991744

Y

This repository include two (2) additional tiny graph datasets to experiment before dealing with larger datasets.

CSV nodes (tiny graphs)

Each dataset contains the following columns:

Name of the ColumnTypeDescription
IDstringnode identification
labelstringnode label (type of node)
propertiesstringa dictionary containing properties related to the node.

CSV edges (tiny graphs)

Each dataset contains the following columns:

Name of the ColumnTypeDescription
IDstringrelationship identification
sourcestringidentification of the source node in the relationship
targetstringidentification of the target node in the relationship
labelstringrelationship label (type of relationship)
propertiesstringa dictionary containing properties related to the relationship.

Metadata (tiny graphs)

GraphNumber of NodesNumber of EdgesSparse graph
dataset_dummy*36N
dataset_dummy2*36N
Search
Clear search
Close search
Google apps
Main menu