100+ datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14330132
    Available download formats: csv
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contains actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • The common identifier dataset_30 indicates that both files belong to the same graph.

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    UniProt ID         | string | protein identification
    label              | string | protein label (type of node)
    properties         | string | a dictionary containing properties related to the protein

    CSV edges

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    Relationship ID    | string | relationship identification
    Source ID          | string | identification of the source protein in the relationship
    Target ID          | string | identification of the target protein in the relationship
    label              | string | relationship label (type of relationship)
    properties         | string | a dictionary containing properties related to the relationship

    Metadata

    Graph          | Number of Nodes | Number of Edges | Sparse graph
    dataset_30*    | 30              | 47              | Y
    dataset_60*    | 60              | 181             | Y
    dataset_120*   | 120             | 689             | Y
    dataset_240*   | 240             | 2819            | Y
    dataset_300*   | 300             | 4658            | Y
    dataset_600*   | 600             | 18004           | Y
    dataset_1200*  | 1200            | 71785           | Y
    dataset_2400*  | 2400            | 288600          | Y
    dataset_3000*  | 3000            | 449727          | Y
    dataset_6000*  | 6000            | 1799413         | Y
    dataset_12000* | 12000           | 7199863         | Y
    dataset_24000* | 24000           | 28792361        | Y

    This repository includes two (2) additional tiny graph datasets for experimenting before dealing with larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    ID                 | string | node identification
    label              | string | node label (type of node)
    properties         | string | a dictionary containing properties related to the node

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type   | Description
    ID                 | string | relationship identification
    source             | string | identification of the source node in the relationship
    target             | string | identification of the target node in the relationship
    label              | string | relationship label (type of relationship)
    properties         | string | a dictionary containing properties related to the relationship

    Metadata (tiny graphs)

    Graph           | Number of Nodes | Number of Edges | Sparse graph
    dataset_dummy*  | 3               | 6               | N
    dataset_dummy2* | 3               | 6               | N
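
    For illustration, a node/edge CSV pair like those described above could be loaded into a graph object as follows (a minimal sketch assuming pandas and networkx, using the column names from the tables above; the directed graph is an assumption, since edge direction is not specified):

        import pandas as pd
        import networkx as nx

        # The two files share the dataset_30 identifier, as described above.
        nodes = pd.read_csv("dataset_30_nodes_interactions.csv")  # 30 rows (nodes)
        edges = pd.read_csv("dataset_30_edges_interactions.csv")  # 47 rows (edges)

        g = nx.DiGraph()
        for _, n in nodes.iterrows():
            g.add_node(n["UniProt ID"], label=n["label"], properties=n["properties"])
        for _, e in edges.iterrows():
            g.add_edge(e["Source ID"], e["Target ID"],
                       label=e["label"], properties=e["properties"])

        print(g.number_of_nodes(), g.number_of_edges())  # expected: 30 47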
  2. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    Updated Oct 13, 2014
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIRO (http://www.csiro.au/)
    License

    CSIRO Data Licence: https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  3. Data from: Pre-compiled metrics data sets, links to yearly statistics files in CSV format

    • doi.pangaea.de
    html, tsv
    Updated Sep 8, 2017
    Cite
    Martin G Schultz; Sabine Schröder; Olga Lyapina; Owen R Cooper (2017). Pre-compiled metrics data sets, links to yearly statistics files in CSV format [Dataset]. http://doi.org/10.1594/PANGAEA.880505
    Available download formats: tsv, html
    Dataset updated
    Sep 8, 2017
    Dataset provided by
    PANGAEA
    Authors
    Martin G Schultz; Sabine Schröder; Olga Lyapina; Owen R Cooper
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1970 - Jan 1, 2015
    Variables measured
    DATE/TIME, File name, File size, Uniform resource locator/link to file
    Description

    Errata: On Dec 2nd, 2018, several yearly statistics files were replaced with new versions to correct an inconsistency related to the computation of the "dma8epax" statistics. As written in Schultz et al. (2017) [https://doi.org/10.1525/elementa.244], Supplement 1, Table 6: "When the aggregation period is “seasonal”, “summer”, or “annual”, the 4th highest daily 8-hour maximum of the aggregation period will be computed.". The data values for these aggregation periods are correct, however, the header information in the original files stated that the respective data column would contain "average daily maximum 8-hour ozone mixing ratio (nmol mol-1)". Therefore, the header of the seasonal, summer, and annual files has been corrected. Furthermore, the "dma8epax" column in the monthly files erroneously contained 4th highest daily maximum 8-hour average values, while it should have listed monthly average values instead. The data of this metric in the monthly files have therefore been replaced. The new column header reads "avgdma8epax". The updated files contain a version label "1.1" and a brief description of the error. If you have made use of previous TOAR data files with the "dma8epax" metric, please exchange your data files.

  4. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 30, 2024
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset and file-level metadata within and across Dataverse installations and improving understandings about how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well Dataverse software and the repositories using the software help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL in which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.6
    │   ├── astrophysics_v5.6.json
    │   ├── biomedical_v5.6.json
    │   ├── citation_v5.6.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, where there's a row for author names, affiliations, identifier types and identifiers.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download. Each zip file contains a CSV file and two sub-directories. The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports.

    The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
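
    The two-column CSV the script expects could be assembled as follows (a minimal sketch using Python's csv module; the file name accounts.csv and the example hostnames/tokens are illustrative, while the "hostname" and "apikey" column names are the ones described above):

        import csv

        # Placeholder rows; a real file lists the installations where an account exists.
        rows = [
            {"hostname": "https://dataverse.harvard.edu", "apikey": "xxxx-xxxx"},
            {"hostname": "https://demo.dataverse.org", "apikey": "yyyy-yyyy"},
        ]

        with open("accounts.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["hostname", "apikey"])
            writer.writeheader()
            writer.writerows(rows)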

  5. UK House Price Index: data downloads March 2025

    • gov.uk
    Updated May 21, 2025
    Cite
    HM Land Registry (2025). UK House Price Index: data downloads March 2025 [Dataset]. https://www.gov.uk/government/statistical-data-sets/uk-house-price-index-data-downloads-march-2025
    Dataset updated
    May 21, 2025
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    HM Land Registry
    Area covered
    United Kingdom
    Description

    The UK House Price Index is a National Statistic.

    Create your report

    Download the full UK House Price Index data below, or use our tool to create your own bespoke reports: https://landregistry.data.gov.uk/app/ukhpi?utm_medium=GOV.UK&utm_source=datadownload&utm_campaign=tool&utm_term=9.30_21_05_25

    Download the data

    Datasets are available as CSV files. Find out about republishing and making use of the data.

    Full file

    This file includes a derived back series for the new UK HPI. Under the UK HPI, data is available from 1995 for England and Wales, 2004 for Scotland and 2005 for Northern Ireland. A longer back series has been derived by using the historic path of the Office for National Statistics HPI to construct a series back to 1968.

    Download the full UK HPI background file:

    Individual attributes files

    If you are interested in a specific attribute, we have separated them into these CSV files:

  6. all csv files used for analysis of NCBI data

    • figshare.com
    txt
    Updated Oct 30, 2023
    Cite
    Cassandre Pyne (2023). all csv files used for analysis of NCBI data [Dataset]. http://doi.org/10.6084/m9.figshare.24461239.v1
    Available download formats: txt
    Dataset updated
    Oct 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Cassandre Pyne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    All CSV files used for analysis of NCBI data.

    • All files with "WOAH" in the name contain the diseases and disease agents from WOAH's list (see manuscript for link).
    • All breed files (with breed names in the name) are from web scraping.
    • MASTER_DATA_coordinates_FINAL_AUG_5: cleaned mined data from NCBI.

  7. url-classifications

    • huggingface.co
    Cite
    snats, url-classifications [Dataset]. https://huggingface.co/datasets/snats/url-classifications
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    snats
    Description

    Model Card: URL Classifications Dataset

      Dataset Summary
    

    The URL Classifications Dataset is a collection of URL classifications for PDF documents, primarily derived from the SafeDocs corpus. It contains multiple CSV files with different subsets of classifications, including both raw and processed data.

      Supported Tasks
    

    This dataset supports the following tasks:

    • Text Classification
    • URL-based Document Classification
    • PDF Content Inference

      Languages
    

    The… See the full description on the dataset page: https://huggingface.co/datasets/snats/url-classifications.

  8. CIFAR-10 Python in CSV

    • kaggle.com
    Updated Jun 22, 2021
    Cite
    fedesoriano (2021). CIFAR-10 Python in CSV [Dataset]. https://www.kaggle.com/fedesoriano/cifar10-python-in-csv
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    fedesoriano
    Description

    Context

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. The classes are completely mutually exclusive. There are 50000 training images and 10000 test images.

    The batches.meta file contains the label names of each class.

    The dataset was originally divided in 5 training batches with 10000 images per batch. The original dataset can be found here: https://www.cs.toronto.edu/~kriz/cifar.html. This dataset contains all the training data and test data in the same CSV file so it is easier to load.

    Content

    Here is the list of the 10 classes in the CIFAR-10:

    Classes:
    • 0: airplane
    • 1: automobile
    • 2: bird
    • 3: cat
    • 4: deer
    • 5: dog
    • 6: frog
    • 7: horse
    • 8: ship
    • 9: truck

    Acknowledgements

    • Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009. Link

    How to load the batches.meta file (Python)

    The function used to open the file:

        def unpickle(file):
            import pickle
            with open(file, 'rb') as fo:
                dict = pickle.load(fo, encoding='bytes')
            return dict

    Example of how to read the file:

        metadata_path = './cifar-10-python/batches.meta'  # change this path
        metadata = unpickle(metadata_path)
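
    Since this version merges the training and test data into CSV, it can also be read without pickle (a hedged sketch assuming pandas; the file and column names are illustrative, so check the actual CSV layout on the dataset page):

        import pandas as pd

        # Class indices as listed above.
        names = {0: "airplane", 1: "automobile", 2: "bird", 3: "cat", 4: "deer",
                 5: "dog", 6: "frog", 7: "horse", 8: "ship", 9: "truck"}

        df = pd.read_csv("cifar10.csv")            # illustrative file name
        df["class_name"] = df["label"].map(names)  # "label" column is an assumption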

  9. Alaska DCCED CBPL Active Business License CSV File Download

    • gis.data.alaska.gov
    • hub.arcgis.com
    Updated Nov 16, 2021
    Cite
    Dept. of Commerce, Community, & Economic Development (2021). Alaska DCCED CBPL Active Business License CSV File Download [Dataset]. https://gis.data.alaska.gov/documents/6070036058764b96a0d37d147088e70c
    Dataset updated
    Nov 16, 2021
    Dataset authored and provided by
    Dept. of Commerce, Community, & Economic Development
    Area covered
    Alaska
    Description

    Courtesy CSV download link location for the Alaska DCCED Division of Corporations, Business and Professional Licensing.

  10. UK House Price Index: data downloads September 2016

    • gov.uk
    Updated Nov 15, 2016
    Cite
    HM Land Registry (2016). UK House Price Index: data downloads September 2016 [Dataset]. https://www.gov.uk/government/statistical-data-sets/uk-house-price-index-data-downloads-september-2016
    Dataset updated
    Nov 15, 2016
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    HM Land Registry
    Area covered
    United Kingdom
    Description

    Datasets are available as CSV files. Find out about republishing and making use of the data.

    Download the data

    Historical back series

    This file includes a derived back series for the new UK HPI. Under the UK HPI, data is available from 1995 for England and Wales, 2004 for Scotland and 2005 for Northern Ireland. A longer back series has been derived by using the historic path of the ONS HPI to construct a series back to 1968:

    Release calendar

    The release calendar shows when the next month’s data will be published.

    Create your report

    Create your own reports based on the UK House Price Index data using our tool: http://landregistry.data.gov.uk/app/ukhpi

  11. Line Data (zipped csv file)

    • datadiscoverystudio.org
    Updated Jan 1, 2012
    Cite
    (2012). Line Data (zipped csv file) [Dataset]. http://datadiscoverystudio.org/geoportal/rest/metadata/item/313bd54aeb3c4606b18a819c824d4725/html
    Dataset updated
    Jan 1, 2012
    Description

    Link to the ScienceBase Item Summary page for the item described by this metadata record. Service Protocol: Link to the ScienceBase Item Summary page for the item described by this metadata record. Application Profile: Web Browser. Link Function: information

  12. Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Richard Ferrers; Speedtest Global Index (2023). Speedtest Open Data - Four International cities - MEL, BKK, SHG, LAX plus ALC - 2020, 2022 [Dataset]. http://doi.org/10.6084/m9.figshare.13621169.v24
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Richard Ferrers; Speedtest Global Index
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset compares FIXED-line broadband internet speeds in four international cities plus Alice Springs:
    • Melbourne, AU
    • Bangkok, TH
    • Shanghai, CN
    • Los Angeles, US
    • Alice Springs, AU

    ERRATA: 1. Data is for Q3 2020, but some files are labelled incorrectly as Q2-20 or June 20. They should all read Sept 20, or 09-20, as Q3 20 rather than Q2. Will rename and reload. Amended in v7.

    2. LAX file named 03-20, when it should be Q3-20. Amended in v8.

    Lines of data for each geojson file (a line equates to a 600m^2 location, including total tests, devices used, and average upload and download speed):
    • MEL 16181 locations/lines => 0.85M speedtests (16.7 tests per 100 people)
    • SHG 31745 lines => 0.65M speedtests (2.5/100pp)
    • BKK 29296 lines => 1.5M speedtests (14.3/100pp)
    • LAX 15899 lines => 1.3M speedtests (10.4/100pp)
    • ALC 76 lines => 500 speedtests (2/100pp)

    Geojsons of these 2° by 2° extracts for MEL, BKK, SHG now added, and LAX added v6. Alice Springs added v15.

    This dataset unpacks, geospatially, data summaries provided in Speedtest Global Index (linked below). See Jupyter Notebook (*.ipynb) to interrogate geo data. See link to install Jupyter.

    ** To Do: Will add Google Map versions so everyone can see without installing Jupyter.
    • Link to Google Map (BKK) added below. Key: Green > 100Mbps (Superfast), Black > 500Mbps (Ultrafast). CSV provided. Code in Speedtestv1.1.ipynb Jupyter Notebook.
    • Community (Whirlpool) surprised [Link: https://whrl.pl/RgAPTl] that Melb has 20% at or above 100Mbps. Suggest plot Top 20% on map for community. Google Map link now added (and tweet).

    ** Python

        melb = au_tiles.cx[144:146, -39:-37]  # Lat/Lon extract
        shg = tiles.cx[120:122, 30:32]        # Lat/Lon extract
        bkk = tiles.cx[100:102, 13:15]        # Lat/Lon extract
        lax = tiles.cx[-118:-120, 33:35]      # Lat/Lon extract
        ALC = tiles.cx[132:134, -22:-24]      # Lat/Lon extract
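
    For context, the au_tiles/tiles objects sliced above would be GeoDataFrames loaded from the geojson extracts (a sketch assuming geopandas; the file name is illustrative). The .cx indexer selects rows whose geometry falls within the given lon/lat bounding box, which is what the per-city extracts do:

        import geopandas as gpd

        tiles = gpd.read_file("speedtest_tiles.geojson")  # illustrative file name
        bkk = tiles.cx[100:102, 13:15]  # Bangkok bounding box, as above
        print(len(bkk), "tile locations in the Bangkok extract")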

    Histograms (v9) and data visualisations (v3, 5, 9, 11) will be provided. Data sourced from: an extract of Speedtest Open Data available at Amazon WS (link below - opendata.aws).

    ** VERSIONS
    v24. Add tweet and google map of Top 20% (over 100Mbps locations) in Mel Q322. Add v1.5 MEL-Superfast notebook, and CSV of results (now on Google Map; link below).
    v23. Add graph of 2022 Broadband distribution, and compare 2020-2022. Updated v1.4 Jupyter notebook.
    v22. Add Import ipynb; workflow-import-4cities.
    v21. Add Q3 2022 data; five cities inc ALC. Geojson files. (2020: 4.3M tests; 2022: 2.9M tests)

    City | Lines | Avg download speed | Tests
    Melb | 14784 | 69.4M              | 0.39M
    SHG  | 31207 | 233.7M             | 0.56M
    ALC  | 113   | 51.5M              | 1092
    BKK  | 29684 | 215.9M             | 1.2M
    LAX  | 15505 | 218.5M             | 0.74M

    v20. Speedtest - Five Cities inc ALC.
    v19. Add ALC2.ipynb.
    v18. Add ALC line graph.
    v17. Added ipynb for ALC. Added ALC to title.
    v16. Load Alice Springs Data Q221 - csv. Added Google Map link of ALC.
    v15. Load Melb Q1 2021 data - csv.
    v14. Added Melb Q1 2021 data - geojson.
    v13. Added Twitter link to pics.
    v12. Add Line-Compare pic (fastest 1000 locations) inc Jupyter (nbn-intl-v1.2.ipynb).
    v11. Add Line-Compare pic, plotting Four Cities on a graph.
    v10. Add Four Histograms in one pic.
    v9. Add Histogram for Four Cities. Add NBN-Intl.v1.1.ipynb (Jupyter Notebook).
    v8. Renamed LAX file to Q3, rather than 03.
    v7. Amended file names of BKK files to correctly label as Q3, not Q2 or 06.
    v6. Added LAX file.
    v5. Add screenshot of BKK Google Map.
    v4. Add BKK Google map (link below), and BKK csv mapping files.
    v3. Replaced MEL map with big key version. Prev key was very tiny in top right corner.
    v2. Uploaded MEL, SHG, BKK data and Jupyter Notebook.
    v1. Metadata record.

    ** LICENCE: AWS data licence on Speedtest data is "CC BY-NC-SA 4.0", so use of this data must be:
    • non-commercial (NC)
    • share-alike (SA): reuse must carry the same licence
    This restricts the standard CC-BY Figshare licence.

    ** Other uses of Speedtest Open Data: see link at Speedtest below.

  13. CSV file of all links in the summary network

    • figshare.com
    txt
    Updated May 30, 2023
    Cite
    Emily Griffiths (2023). CSV file of all links in the summary network [Dataset]. http://doi.org/10.6084/m9.figshare.978526.v2
    Available download formats: txt
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Emily Griffiths
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Semicolon delimited text file equivalent of the Rdata file. See the Rdata file for a description of the data in each column.

  14. OpenCitations Index CSV dataset of the provenance information of all the citation data

    • figshare.com
    zip
    Updated Mar 27, 2025
    Cite
    OpenCitations (2025). OpenCitations Index CSV dataset of the provenance information of all the citation data [Dataset]. http://doi.org/10.6084/m9.figshare.24417733.v4
    Available download formats: zip
    Dataset updated
    Mar 27, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    OpenCitations
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the provenance information (in CSV format) of all the citation data included in the OpenCitations Index, released on March 24, 2025. In particular, each line of the CSV file defines a citation and includes the following information:

    • [field "oci"] the Open Citation Identifier (OCI) for the citation;
    • [field "snapshot"] the identifier of the snapshot;
    • [field "agent"] the name of the agent that created the citation data;
    • [field "source"] the URL of the source dataset from where the citation data have been extracted;
    • [field "created"] the creation time of the citation data;
    • [field "invalidated"] the start of the destruction, cessation, or expiry of an existing entity by an activity;
    • [field "description"] a textual description of the activity made;
    • [field "update"] the UPDATE SPARQL query that keeps track of which metadata have been modified.

    The size of the zipped archive is 18 GB, while the size of the unzipped CSV files is 410 GB.
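
    Given that the unzipped CSV files total 410 GB, rows are best streamed one at a time rather than loaded whole (a minimal sketch using Python's csv module and the field names listed above; the file name is illustrative):

        import csv

        # Illustrative file name; each row carries the fields listed above.
        with open("provenance.csv", newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                oci, agent, created = row["oci"], row["agent"], row["created"]
                # ...process one citation's provenance record at a time...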

  15. The China Plant Trait Database, link to database in CSV format (zipped)

    • doi.pangaea.de
    • search.dataone.org
    zip
    Updated Feb 7, 2017
    Cite
    Han Wang; Sandy P Harrison; Iain Colin Prentice; Yanzheng Yang; Fan Bai; Henrique Furstenau Togashi; Meng Wang; Shuangxi Zhou; Jian Ni (2017). The China Plant Trait Database, link to database in CSV format (zipped) [Dataset]. http://doi.org/10.1594/PANGAEA.871818
    Available download formats: zip
    Dataset updated
    Feb 7, 2017
    Dataset provided by
    PANGAEA
    Authors
    Han Wang; Sandy P Harrison; Iain Colin Prentice; Yanzheng Yang; Fan Bai; Henrique Furstenau Togashi; Meng Wang; Shuangxi Zhou; Jian Ni
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset is about: The China Plant Trait Database, link to database in CSV format (zipped). Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.871819 for more information. Version: 2017-09-05, Character encoding: UTF-8

  16. Datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*"

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jul 30, 2024
    Cite
    Anonymous Anonymous (2024). Datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*" [Dataset]. http://doi.org/10.5281/zenodo.13134104
    Available download formats: zip
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    This replication package contains datasets and scripts related to the paper: "*Can Generative AI Help us in Qualitative Software Engineering?*"
    The replication package is organized into two directories:
    - `manual_analysis`: This directory contains all sheets used to perform the manual analysis for RQ1, RQ2, and RQ3.
    - `stats`: This directory contains all datasets, scripts, and results metrics used for the quantitative analyses of RQ1 and RQ2.
    In the following, we describe the content of each directory:
    ## manual_analysis
    - `manual_analysis_rq1`: This directory contains all sheets used to perform manual analysis for RQ1 (independent and incremental coding).
    - The sub-directory `incremental_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_incremental.csv`, `DL_Faults_ISSUE_incremental.csv`, `DL_Fault_SO_incremental.csv`, `DRL_Challenges_incremental.csv` and `Functional_incremental.csv`). All these .csv files contain the following columns:
    - *Link*: The link to the instances
    - *Prompt*: Prompt used as input to GPT-4-Turbo
    - *ID*: Instance ID
    - *FinalTag*: Tag assigned by the human in the original paper
    - *Chatgpt\_output\_memory*: Output of GPT-4-Turbo with incremental coding
    - *Chatgpt\_output\_memory\_clean*: (only for the DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text
    - *Author1*: Label assigned by the first author
    - *Author2*: Label assigned by the second author
    - *FinalOutput*: Label assigned after the resolution of the conflicts
    - The sub-directory `independent_coding` contains .csv files for all datasets (`DL_Faults_COMMIT_independent.csv`, `DL_Faults_ISSUE_independent.csv`, `DL_Fault_SO_independent.csv`, `DRL_Challenges_independent.csv` and `Functional_independent.csv`), containing the following columns:
    - *Link*: The link to the instances
    - *Prompt*: Prompt used as input to GPT-4-Turbo
    - *ID*: Specific ID for the instance
    - *FinalTag*: Tag assigned by the human in the original paper
    - *Chatgpt\_output*: Output of GPT-4-Turbo with independent coding
    - *Chatgpt\_output\_clean*: (only for DL Faults datasets) output of GPT-4-Turbo considering only the label assigned, excluding the text
    - *Author1*: Label assigned by the first author
    - *Author2*: Label assigned by the second author
    - *FinalOutput*: Label assigned after the resolution of the conflicts.
    - Also, the sub-directory contains sheets with inconsistencies after resolving conflicts. The directory `inconsistency_incremental_coding` contains .csv files with the following columns:
    - *Dataset*: The dataset considered
    - *Human*: The label assigned by the human in the original paper
    - *Machine*: The label assigned by GPT-4-Turbo
    - *Classification*: The final label assigned by the authors after resolving the conflicts. Multiple classifications for a single instance are separated by a comma “,”
    - *Final*: final label assigned after the resolution of the incompatibilities
    - Similarly, the sub-directory `inconsistency_independent_coding` contains a .csv file with the same columns as before, but this is for the case of independent coding.
    - `manual_analysis_rq2`: This directory contains .csv files for all datasets (`DL_Faults_redundant_tag.csv`, `DRL_Challenges_redundant_tag.csv`, `Functional_redundant_tag.csv`) to perform manual analysis for RQ2.
    - The `DL_Faults_redundant_tag.csv` file contains the following columns:
    - *Tags Redundant*: tags identified as redundant by GPT-4-Turbo
    - *Matched*: inspection by the authors to check whether the redundant tags match
    - *FinalTag*: final tag assigned by the authors after the resolution of the conflict
    - The `Functional_redundant_tag.csv` file contains the same columns as before
    - The `DRL_Challenges_redundant_tag.csv` file is organized as follows:
    - *Tags Suggested*: The final tag suggested by GPT-4-Turbo
    - *Tags Redundant*: tags identified as redundant by GPT-4-Turbo
    - *Matched*: inspection by the authors to check whether the redundant tags match the suggested tags
    - *FinalTag*: final tag assigned by the authors after the resolution of the conflict
    - The sub-directory `code_consolidation_mapping_overview` contains .csv files (`DL_Faults_rq2_overview.csv`, `DRL_Challenges_rq2_overview.csv`, `Functional_rq2_overview.csv`) organized as follows:
    - *Initial_Tags*: list of the unique initial tags assigned by GPT-4-Turbo for each dataset
    - *Mapped_tags*: list of tags mapped by GPT-4-Turbo
    - *Unmatched_tags*: list of unmatched tags by GPT-4-Turbo
    - *Aggregating_tags*: list of consolidated tags
    - *Final_tags*: list of final tags after the consolidation task
    ## stats
    - `RQ1`: contains script and datasets used to perform metrics for RQ1. The analysis calculates all possible combinations between Matched, More Abstract, More Specific, and Unmatched.
    - `RQ1_Stats.ipynb` is a Python Jupyter notebook to compute the RQ1 metrics. To use it, as explained in the notebook, it is necessary to change the values of variables contained in the first code block.
    - `independent-prompting`: Contains the datasets related to the independent prompting. Each line contains the following fields:
    - *Link*: Link to the artifact being tagged
    - *Prompt*: Prompt sent to GPT-4-Turbo
    - *FinalTag*: Artifact coding from the replicated study
    - *chatgpt\_output_text*: GPT-4-Turbo output
    - *chatgpt\_output*: Codes parsed from the GPT-4-Turbo output
    - *Author1*: Annotator 1 evaluation of the coding
    - *Author2*: Annotator 2 evaluation of the coding
    - *FinalOutput*: Consolidated evaluation
    - `incremental-prompting`: Contains the datasets related to the incremental prompting (same format as independent prompting)
    - `results`: contains files for the RQ1 quantitative results. The files are named `RQ1\_<
    - `RQ2`: contains the script used to perform metrics for RQ2, the datasets it uses, and its output.
    - `RQ2_SetStats.ipynb` is the Python Jupyter notebook used to perform the analyses. The script takes as input the following types of files:
    - RQ1 Data Files (`RQ1_DLFaults_Issues.csv`, `RQ1_DLFaults_Commits.csv`, and `RQ1_DLFaults_SO.csv`, joined in a single .csv `RQ1_DLFaults.csv`). These are the same files used in RQ1.
    - Mapping Files (`RQ2_Mappings_DRL.csv`, `RQ2_Mappings_Functional.csv`, `RQ2_Mappings_DLFaults.csv`). These contain the mappings between human tags (*HumanTags*) and GPT-4-Turbo tags (*Final Tags*), with the type of matching indicated (*MatchType*).
    - Additional codes created during the consolidation (`RQ2_newCodes_DRL.csv`, `RQ2_newCodes_Functional.csv`, `RQ2_newCodes_DLFaults.csv`), annotated with the matching: *new code*, *old code*, *human code*, *match type*
    - Set files (`RQ2_Sets_DRL.csv`, `RQ2_Sets_Functional.csv`, `RQ2_Sets_DLFaults.csv`). Each file contains the following columns:
    - *HumanTags*: List of tags from the original dataset
    - *InitialTags*: Set of tags from RQ1
    - *ConsolidatedTags*: Tags that have been consolidated
    - *FinalTags*: Final set of tags (results of RQ2, used in RQ3)
    - *NewTags*: New tags created during consolidation
    - `RQ2_Set_Metrics.csv`: Reports the RQ2 output metrics (Precision, Recall, F1-Score, Jaccard).
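
For reference, the set-based metrics reported in `RQ2_Set_Metrics.csv` (Precision, Recall, F1-Score, Jaccard) can be computed from a human tag set and a model tag set along these lines (an illustrative sketch, not the notebook's actual code):

    def set_metrics(human, predicted):
        # Tags appearing in both sets count as true positives.
        tp = len(human & predicted)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(human) if human else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        jaccard = tp / len(human | predicted) if (human | predicted) else 0.0
        return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}

    # Hypothetical tag sets, for illustration only.
    print(set_metrics({"api misuse", "gpu error"}, {"gpu error", "tensor shape"}))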
  17. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains the ratings and reviews of 1K+ Amazon products, as listed on the official website of Amazon. The data was scraped in January 2023 from the official Amazon website.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before usage (see the sketch after this section).

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
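
    Because the positive and negative rows are grouped as noted above, the file should be shuffled before any train/test split (a minimal sketch assuming pandas and the two columns described above):

        import pandas as pd

        df = pd.read_csv("data_rt.csv")  # columns: reviews, labels (1 fresh, 0 rotten)
        df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle rows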

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machines for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, Unix time), reviewTime (time of the review, raw) and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  18. Pre-compiled metrics data sets, links to yearly statistics files in CSV format

    • b2find.eudat.eu
    Updated Dec 2, 2018
    Cite
    The citation is currently not available for this dataset.
    Dataset updated
    Dec 2, 2018
    Description

    Errata: On Dec 2nd, 2018, several yearly statistics files were replaced with new versions to correct an inconsistency related to the computation of the "dma8epax" statistics. As written in Schultz et al. (2017) [https://doi.org/10.1525/elementa.244], Supplement 1, Table 6: "When the aggregation period is “seasonal”, “summer”, or “annual”, the 4th highest daily 8-hour maximum of the aggregation period will be computed.". The data values for these aggregation periods are correct, however, the header information in the original files stated that the respective data column would contain "average daily maximum 8-hour ozone mixing ratio (nmol mol-1)". Therefore, the header of the seasonal, summer, and annual files has been corrected. Furthermore, the "dma8epax" column in the monthly files erroneously contained 4th highest daily maximum 8-hour average values, while it should have listed monthly average values instead. The data of this metric in the monthly files have therefore been replaced. The new column header reads "avgdma8epax". The updated files contain a version label "1.1" and a brief description of the error. If you have made use of previous TOAR data files with the "dma8epax" metric, please exchange your data files.

  19. Our World In Data - Dataset - waterdata

    • wbwaterdata.org
    Updated Jul 12, 2020
    Cite
    (2020). Our World In Data - Dataset - waterdata [Dataset]. https://wbwaterdata.org/dataset/our-world-in-data
    Dataset updated
    Jul 12, 2020
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database collates 3552 development indicators from different studies, with data by country and year, including single-year and multiple-year time series. The data is presented as charts; it can be downloaded from the linked project pages/references for each set, and each presented graph is available as a CSV file as well as a visual download of the graph (both available via the download link under each chart).

  20. Data from: Pre-compiled metrics data sets, links to gridded files in NetCDF format

    • doi.pangaea.de
    • search.dataone.org
    html, tsv
    Updated Sep 8, 2017
    Cite
    Martin G Schultz; Sabine Schröder; Olga Lyapina; Owen R Cooper (2017). Pre-compiled metrics data sets, links to gridded files in NetCDF format [Dataset]. http://doi.org/10.1594/PANGAEA.880506
    Available download formats: html, tsv
    Dataset updated
    Sep 8, 2017
    Dataset provided by
    PANGAEA
    Authors
    Martin G Schultz; Sabine Schröder; Olga Lyapina; Owen R Cooper
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1990 - Dec 31, 2014
    Variables measured
    DATE/TIME, File name, File size, Uniform resource locator/link to file
    Description

    Errata: Due to a coding error, monthly files with "dma8epax" statistics were wrongly aggregated. This concerns all gridded files of this metric as well as the monthly aggregated csv files. All erroneous files were replaced with corrected versions on Jan 16th, 2018. Each updated file contains a version label "1.1" and a brief description of the error. If you have made use of previous TOAR data files with the "dma8epax" metric, please exchange your data files.
