100+ datasets found
  1. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Available download formats: csv
    Dataset updated: Dec 9, 2024
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Edwin Carreño
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: None of the datasets published here contain actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • The common identifier dataset_30 indicates that both files describe the same graph.
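
    The naming convention above can be used to pair node and edge files automatically. The following is a minimal Python sketch, assuming every file follows the pattern seen in the two example names (the regular expression and the function name `pair_graph_files` are illustrative and not part of the repository):

```python
import re
from collections import defaultdict

# Pattern inferred from the two example file names above; other files in the
# repository may follow a different scheme (assumption).
FILE_PATTERN = re.compile(r"^(dataset_[^_]+)_(nodes|edges)_.*\.csv$")


def pair_graph_files(filenames):
    """Group node and edge CSV files by their common graph identifier."""
    graphs = defaultdict(dict)
    for name in filenames:
        match = FILE_PATTERN.match(name)
        if match is None:
            continue  # not a recognised node/edge file
        graph_id, kind = match.groups()
        graphs[graph_id][kind] = name
    return dict(graphs)


pairs = pair_graph_files([
    "dataset_30_nodes_interactions.csv",
    "dataset_30_edges_interactions.csv",
])
print(pairs)
```

    Files that do not match the assumed pattern are skipped rather than guessed at, so an unexpected name shows up as a missing "nodes" or "edges" entry.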

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    UniProt ID | string | protein identification
    label | string | protein label (type of node)
    properties | string | a dictionary containing properties related to the protein.

    CSV edges

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    Relationship ID | string | relationship identification
    Source ID | string | identification of the source protein in the relationship
    Target ID | string | identification of the target protein in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.
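
    Because the properties column stores a dictionary as a string, it has to be parsed after the CSV is read. The sketch below is one way to do this in Python. The sample rows are hypothetical, and the encoding of the dictionary (JSON or a Python literal) is an assumption, since the description does not specify it:

```python
import ast
import csv
import io
import json


def parse_properties(raw):
    """Parse a stringified dictionary; tries JSON first, then a Python literal."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return ast.literal_eval(raw)


# Hypothetical two-row sample; the real files may quote or encode the column differently.
SAMPLE_EDGES = io.StringIO(
    'Relationship ID,Source ID,Target ID,label,properties\n'
    'r1,P1,P2,interacts_with,"{""score"": 0.9}"\n'
    "r2,P2,P3,interacts_with,\"{'score': 0.4}\"\n"
)

edges = []
for row in csv.DictReader(SAMPLE_EDGES):
    row["properties"] = parse_properties(row["properties"])
    edges.append(row)

print(edges[0]["Source ID"], "->", edges[0]["Target ID"], edges[0]["properties"])
```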

    Metadata

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_30* | 30 | 47 | Y
    dataset_60* | 60 | 181 | Y
    dataset_120* | 120 | 689 | Y
    dataset_240* | 240 | 2819 | Y
    dataset_300* | 300 | 4658 | Y
    dataset_600* | 600 | 18004 | Y
    dataset_1200* | 1200 | 71785 | Y
    dataset_2400* | 2400 | 288600 | Y
    dataset_3000* | 3000 | 449727 | Y
    dataset_6000* | 6000 | 1799413 | Y
    dataset_12000* | 12000 | 7199863 | Y
    dataset_24000* | 24000 | 28792361 | Y
    dataset_30000* | 30000 | 44991744 | Y
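
    The node and edge counts in the metadata table can be turned into an edge density, which gives a rough sense of how full each graph is. This is a small Python sketch. Whether the graphs are directed is not stated in the description, so the undirected default is an assumption, and the function name `density` is illustrative:

```python
def density(num_nodes, num_edges, directed=False):
    """Fraction of possible edges that are present (no self-loops, no multi-edges)."""
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        possible //= 2  # each undirected pair is counted once
    return num_edges / possible


# (nodes, edges) pairs copied from the metadata table above.
for nodes, edges in [(30, 47), (300, 4658), (3000, 449727), (30000, 44991744)]:
    print(f"dataset_{nodes}: density = {density(nodes, edges):.3f}")
```

    Under the undirected assumption, the four rows used here all come out at a density close to 0.1.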

    This repository includes two (2) additional tiny graph datasets for experimenting before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | node identification
    label | string | node label (type of node)
    properties | string | a dictionary containing properties related to the node.

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | relationship identification
    source | string | identification of the source node in the relationship
    target | string | identification of the target node in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship.

    Metadata (tiny graphs)

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_dummy* | 3 | 6 | N
    dataset_dummy2* | 3 | 6 | N
  2. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1more
    Updated Oct 13, 2014
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Dataset updated: Oct 13, 2014
    Dataset authored and provided by: CSIRO (http://www.csiro.au/)
    License: https://research.csiro.au/dap/licences/csiro-data-licence/
    Time period covered: Mar 14, 2008 - Jun 9, 2009
    Dataset funded by: CSIRO (http://www.csiro.au/)
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  3. GitTables 1M - CSV files

    • zenodo.org
    • explore.openaire.eu
    zip
    Updated Jun 6, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
    Available download formats: zip
    Dataset updated: Jun 6, 2022
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains >800K CSV files behind the GitTables 1M corpus.

    For more information about the GitTables corpus, visit:

    - our website for GitTables, or

    - the main GitTables download page on Zenodo.

  4. Company Datasets for Business Profiling

    • datarade.ai
    Updated Feb 23, 2017
    Cite
    Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
    Available download formats: .json, .xml, .csv, .xls
    Dataset updated: Feb 23, 2017
    Dataset authored and provided by: Oxylabs
    Area covered: British Indian Ocean Territory, Northern Mariana Islands, Bangladesh, Moldova (Republic of), Nepal, Isle of Man, Tunisia, Taiwan, Canada, Andorra
    Description

    Company Datasets for valuable business insights!

    Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

    These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

    • Owler: Gain valuable business insights and competitive intelligence.
    • AngelList: Receive fresh startup data transformed into actionable insights.
    • CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies.
    • Craft.co: Make data-informed business decisions with Craft.co's company datasets.
    • Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

    We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

    • Company name;
    • Size;
    • Founding date;
    • Location;
    • Industry;
    • Revenue;
    • Employee count;
    • Competitors.

    You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

    Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

    With Oxylabs Datasets, you can count on:

    • Fresh and accurate data collected and parsed by our expert web scraping team.
    • Time and resource savings, allowing you to focus on data analysis and achieving your business goals.
    • A customized approach tailored to your specific business needs.
    • Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

    Pricing Options:

    Standard Datasets: Choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!

  5. 1000 Empirical Time series

    • bridges.monash.edu
    • researchdata.edu.au
    • +1more
    png
    Updated May 30, 2023
    Cite
    Ben Fulcher (2023). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
    Available download formats: png
    Dataset updated: May 30, 2023
    Dataset provided by: figshare
    Authors: Ben Fulcher
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.

    The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat, for use in Matlab using v1.06 of hctsa.

    The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv. These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.

    The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as >> TS_Init('INP_Empirical1000.mat');

    Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.

    See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.

  6. sample csv

    • kaggle.com
    zip
    Updated Apr 7, 2024
    Cite
    DevanshArora7 (2024). sample csv [Dataset]. https://www.kaggle.com/datasets/devansharora7/sample-csv/code
    Available download formats: zip (134132 bytes)
    Dataset updated: Apr 7, 2024
    Authors: DevanshArora7
    License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by DevanshArora7

    Released under Apache 2.0

    Contents

  7. UCI and OpenML Data Sets for Ordinal Quantification

    • zenodo.org
    zip
    Updated Jul 25, 2023
    Cite
    Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
    Available download formats: zip
    Dataset updated: Jul 25, 2023
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Mirko Bunse; Alejandro Moreo; Fabrizio Sebastiani; Martin Senz
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
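
    The two protocols can be illustrated with a short Python sketch. It draws label distributions uniformly from the probability simplex (APP) and then keeps the smoothest 20% (APP-OQ). The smoothness measure used here, a sum of squared second differences, is an assumption for illustration, since the description does not define it, and all function names are hypothetical rather than taken from the authors' code:

```python
import random


def sample_prevalences(num_classes, rng):
    """Draw one label distribution uniformly from the probability simplex."""
    # Normalised exponential draws follow a flat Dirichlet distribution.
    draws = [rng.expovariate(1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]


def roughness(prevalences):
    """Sum of squared second differences (illustrative smoothness measure)."""
    return sum(
        (prevalences[i - 1] - 2 * prevalences[i] + prevalences[i + 1]) ** 2
        for i in range(1, len(prevalences) - 1)
    )


def app_oq(samples, keep_fraction=0.2):
    """Keep the smoothest fraction of APP samples."""
    keep = int(len(samples) * keep_fraction)
    return sorted(samples, key=roughness)[:keep]


rng = random.Random(0)
app_samples = [sample_prevalences(5, rng) for _ in range(1000)]
app_oq_samples = app_oq(app_samples)
print(len(app_samples), len(app_oq_samples))
```

    To replicate the published samples exactly, use the provided index files instead of drawing new distributions.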

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
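
    Given that layout, the label distribution that quantification aims to estimate can be computed directly from an extracted file. This is a minimal Python sketch using an in-memory stand-in for one file. The feature column name and the integer encoding of the class labels are assumptions:

```python
import csv
import io
from collections import Counter


def class_prevalences(csv_file):
    """Return the relative frequency of each ordinal class in an extracted CSV file."""
    reader = csv.DictReader(csv_file)  # the first row is the header
    # Assumes class labels are stored as integers.
    counts = Counter(int(row["class_label"]) for row in reader)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Hypothetical stand-in for one extracted file; the feature name is made up.
SAMPLE = io.StringIO("class_label,feature_1\n1,0.3\n2,0.1\n2,0.7\n3,0.9\n")
prevalences = class_prevalences(SAMPLE)
print(prevalences)
```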

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  8. Google Analytics Sample

    • kaggle.com
    zip
    Updated Sep 19, 2019
    Cite
    Google BigQuery (2019). Google Analytics Sample [Dataset]. https://www.kaggle.com/datasets/bigquery/google-analytics-sample
    Available download formats: zip (0 bytes)
    Dataset updated: Sep 19, 2019
    Dataset provided by: Google (http://google.com/); BigQuery (https://cloud.google.com/bigquery)
    Authors: Google BigQuery
    License: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website.

    Content

    The sample dataset contains Google Analytics 360 data from the Google Merchandise Store, a real ecommerce store. The Google Merchandise Store sells Google branded merchandise. The data is typical of what you would see for an ecommerce website. It includes the following kinds of information:

    • Traffic source data: information about where website visitors originate. This includes data about organic traffic, paid search traffic, display traffic, etc.
    • Content data: information about the behavior of users on the site. This includes the URLs of pages that visitors look at, how they interact with content, etc.
    • Transactional data: information about the transactions that occur on the Google Merchandise Store website.

    Fork this kernel to get started.

    Acknowledgements

    Data from: https://bigquery.cloud.google.com/table/bigquery-public-data:google_analytics_sample.ga_sessions_20170801

    Banner Photo by Edho Pratama from Unsplash.

    Inspiration

    What is the total number of transactions generated per device browser in July 2017?

    The real bounce rate is defined as the percentage of visits with a single pageview. What was the real bounce rate per traffic source?

    What was the average number of product pageviews for users who made a purchase in July 2017?

    What was the average number of product pageviews for users who did not make a purchase in July 2017?

    What was the average total transactions per user that made a purchase in July 2017?

    What is the average amount of money spent per session in July 2017?

    What is the sequence of pages viewed?
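
    The bounce-rate definition given in the questions above (the percentage of visits with a single pageview) can be made concrete with a short Python sketch. The session records and the field names `source` and `pageviews` are hypothetical and do not reflect the actual BigQuery schema:

```python
from collections import defaultdict


def bounce_rate_by_source(sessions):
    """Percentage of visits with exactly one pageview, per traffic source."""
    visits = defaultdict(int)
    bounces = defaultdict(int)
    for session in sessions:
        source = session["source"]
        visits[source] += 1
        if session["pageviews"] == 1:
            bounces[source] += 1
    return {source: 100.0 * bounces[source] / visits[source] for source in visits}


# Hypothetical session records, not taken from the dataset.
sessions = [
    {"source": "google", "pageviews": 1},
    {"source": "google", "pageviews": 4},
    {"source": "direct", "pageviews": 1},
    {"source": "direct", "pageviews": 1},
]
rates = bounce_rate_by_source(sessions)
print(rates)
```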

  9. Film Circulation dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2024
    Cite
    Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
    Dataset updated: Jul 12, 2024
    Dataset provided by: Samoilova, Evgenia (Zhenya); Loist, Skadi
    License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.

    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
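
    The relationship between the long and wide files can be sketched in Python: each film's rows are collapsed to one entry that keeps the first sample festival where it appeared. Only the variable name `fest` comes from the description. The other field names and the sample rows are hypothetical:

```python
def first_festival_per_film(long_rows):
    """Collapse long-format rows to one entry per film, keeping its earliest sample festival."""
    wide = {}
    for row in long_rows:
        film_id = row["film_id"]
        key = (row["year"], row["month"])  # orders festival appearances in time
        if film_id not in wide or key < wide[film_id][0]:
            wide[film_id] = (key, row["fest"])
    return {film_id: fest for film_id, (_, fest) in wide.items()}


# Hypothetical rows mirroring the Berlinale/Frameline example in the text.
long_rows = [
    {"film_id": 1, "fest": "Frameline", "year": 2017, "month": 6},
    {"film_id": 1, "fest": "Berlinale", "year": 2017, "month": 2},
    {"film_id": 2, "fest": "Frameline", "year": 2017, "month": 6},
]
wide = first_festival_per_film(long_rows)
print(wide)
```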

    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival releases (e.g., theatrical, digital, TV, DVD/Blu-ray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title. Titles were compared with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.
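
    The two string-comparison methods named above can be illustrated with a short Python sketch. It is not the authors' R code: it implements the optimal string alignment (OSA) distance and a cosine similarity over character bigram counts from scratch, and the choice of bigrams and the example titles are assumptions made for illustration:

```python
import math
from collections import Counter


def osa_distance(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Allow swapping two adjacent characters as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def cosine_similarity(a, b, q=2):
    """Cosine similarity between character q-gram count vectors."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    dot = sum(grams_a[g] * grams_b[g] for g in grams_a)
    norm = math.sqrt(sum(v * v for v in grams_a.values())) * math.sqrt(
        sum(v * v for v in grams_b.values())
    )
    return dot / norm if norm else 0.0


print(osa_distance("the lobster", "the lobtser"))
print(round(cosine_similarity("the lobster", "lobster, the"), 2))
```

    OSA stays small for typos such as swapped letters, while the q-gram cosine score tolerates reordered words, which is one plausible reason for combining the two.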

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and identifies them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions” to scrape the IMDb data for the identified matches. This script does that for the first 100 films to check that everything works. Scraping for the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

    The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This

  10. Data from: Datasets used to train the Generative Adversarial Networks used...

    • opendata.cern.ch
    Updated 2021
    Cite
    ATLAS collaboration (2021). Datasets used to train the Generative Adversarial Networks used in ATLFast3 [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
    Dataset updated: 2021
    Dataset provided by: CERN Open Data Portal
    Authors: ATLAS collaboration
    Description

    Three datasets are available, each consisting of 15 csv files. Each file contains the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range (0.2-0.25) simulated in the ATLAS detector. Two datasets contain photon events with different statistics; the larger sample has about 10 times as many events as the other. The other dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.

    The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.
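
    Since each row is an event and each column a voxel, summing a row gives the total energy recorded for that event. The Python sketch below shows this on an in-memory sample. It assumes comma-separated values with no header row, which the description does not confirm, and the numbers are made up:

```python
import csv
import io


def total_energy_per_event(csv_file):
    """Sum the voxel energies in each row (one row per event)."""
    totals = []
    for row in csv.reader(csv_file):
        totals.append(sum(float(value) for value in row))
    return totals


# Hypothetical 2-event, 4-voxel sample; real files have many more columns.
SAMPLE = io.StringIO("0.0,1.5,2.5,0.0\n3.0,0.0,0.0,1.0\n")
totals = total_energy_per_event(SAMPLE)
print(totals)
```

    Mapping a column index back to a detector layer and voxel position requires the binning.xml file described below.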

    The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.

    Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.

    Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.

  11. Dataset metadata of known Dataverse installations, August 2024

    • dataverse.harvard.edu
    Updated Jan 1, 2025
    Cite
    Julian Gautier (2025). Dataset metadata of known Dataverse installations, August 2024 [Dataset]. http://doi.org/10.7910/DVN/2SA6SN
    Explore at: Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated: Jan 1, 2025
    Dataset provided by: Harvard Dataverse
    Authors: Julian Gautier
    License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 101 Dataverse installations, information about the metadata blocks of 106 installations, and the lists of pre-defined licenses or dataset terms that depositors can apply to datasets in the 88 installations that were running versions of the Dataverse software that include the "multiple-license" feature. The data is useful for improving understandings about how certain Dataverse features and metadata fields are used and for learning about the quality of dataset and file-level metadata within and across Dataverse installations.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 25 and August 30, 2024 using a "get_dataverse_installations_metadata" function in a collection of Python functions at https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL for which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens in order to use certain API endpoints.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │ ├── author_2024.08.25-2024.08.30.csv
    │ ├── contributor_2024.08.25-2024.08.30.csv
    │ ├── data_source_2024.08.25-2024.08.30.csv
    │ ├── ...
    │ └── topic_classification_2024.08.25-2024.08.30.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │ ├── Abacus_2024.08.26_15.52.42.zip
    │ ├── dataset_pids_Abacus_2024.08.26_15.52.42.csv
    │ ├── Dataverse_JSON_metadata_2024.08.26_15.52.42
    │ ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │ ├── ...
    │ ├── metadatablocks_v5.9
    │ ├── astrophysics_v5.9.json
    │ ├── biomedical_v5.9.json
    │ ├── citation_v5.9.json
    │ ├── ...
    │ ├── socialscience_v5.6.json
    │ ├── ACSS_Dataverse_2024.08.26_00.02.51.zip
    │ ├── ...
    │ └── Yale_Dataverse_2024.08.25_03.52.57.zip
    └── dataverse_installations_summary_2024.08.30.csv
    └── dataset_pids_from_most_known_dataverse_installations_2024.08.csv
    └── license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv
    └── metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv

    This dataset contains two directories and four CSV files not in a directory.

    One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the "Citation" metadata block and "Geospatial" metadata block of datasets in the 101 Dataverse installations. For example, author_2024.08.25-2024.08.30.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in 101 installations, with a column for each of the four child fields: author name, affiliation, identifier type, and identifier.

    The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 106 zip files, one zip file for each of the 106 Dataverse installations whose sites were functioning when I attempted to collect their metadata. Each zip file contains a directory with JSON files that have information about the installation's metadata fields, such as the field names and how they're organized. For installations that had published datasets, and I was able to use Dataverse APIs to download the dataset metadata, the zip file also contains:

    • A CSV file listing information about the datasets published in the installation, including a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset.
    • A directory of JSON files that contain the metadata of the installation's published, non-deaccessioned dataset versions in the Dataverse JSON metadata schema.

    The dataverse_installations_summary_2024.08.30.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata included and not included in this dataset.

    The dataset_pids_from_most_known_dataverse_installations_2024.08.csv file contains the dataset PIDs of published datasets in 101 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all "dataset_pids_....csv" files in each of the 101 zip files in the dataverse_json_metadata_from_each_known_dataverse_installation directory.

    The license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv file contains information about the licenses and...
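
    The two-column token file mentioned in the description (columns "hostname" and "apikey") could be read roughly as follows. This is a minimal sketch and not the author's script: the function name load_api_tokens and the sample hostnames and token values are hypothetical, and only the two column names come from the description.

```python
import csv
import io


def load_api_tokens(csv_text):
    """Map each installation hostname to its API token.

    Assumes a header row with the columns "hostname" and "apikey".
    Rows with an empty hostname or token are skipped.
    """
    tokens = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        hostname = (row.get("hostname") or "").strip()
        apikey = (row.get("apikey") or "").strip()
        if hostname and apikey:
            tokens[hostname] = apikey
    return tokens


# Hypothetical example content; real hostnames and tokens will differ.
example = (
    "hostname,apikey\n"
    "https://demo.example.org,abc123\n"
    "https://other.example.org,\n"
)
tokens = load_api_tokens(example)
```

    Keeping the tokens in a dictionary keyed by hostname makes it easy to look up whether a token exists before calling an endpoint that requires one.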

  12. Residential School Locations Dataset (CSV Format)

    • borealisdata.ca
    • search.dataone.org
    Updated Jun 5, 2019
    Cite
    Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Borealis
    Authors
    Rosa Orlandini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Area covered
    Canada
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini.

    Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided.

    For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.

  13. mars data csv

    • kaggle.com
    zip
    Updated Oct 4, 2024
    + more versions
    Cite
    Shakib Absar (2024). mars data csv [Dataset]. https://www.kaggle.com/datasets/shakibabsar42/mars-data-csv
    Explore at:
    Available download formats: zip (9717343 bytes)
    Dataset updated
    Oct 4, 2024
    Authors
    Shakib Absar
    Description

    Dataset

    This dataset was created by Shakib Absar

    Contents

  14. Water-Quality Data for Discrete Samples and Continuous Monitoring on the...

    • catalog.data.gov
    • data.usgs.gov
    • +2 more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Water-Quality Data for Discrete Samples and Continuous Monitoring on the Merrimack River, Massachusetts, June to September 2020 [Dataset]. https://catalog.data.gov/dataset/water-quality-data-for-discrete-samples-and-continuous-monitoring-on-the-merrimack-river-m
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey: http://www.usgs.gov/
    Area covered
    Massachusetts, Merrimack River
    Description

    This data release includes water-quality data collected at up to thirteen locations along the Merrimack River and Merrimack River Estuary in Massachusetts. In this study, conducted by the U.S. Geological Survey (USGS) in cooperation with the Massachusetts Department of Environmental Protection, discrete samples were collected and continuous monitoring was completed from June to September 2020. The data include results of measured field properties (water temperature, specific conductivity, pH, dissolved oxygen) and laboratory concentrations of nitrogen and phosphorus species, total carbon, pheophytin-a, and chlorophyll-a. These data were collected to assess selected water-quality conditions (mainly nutrients) in the Merrimack River and Merrimack River Estuary at the thirteen locations and to identify areas where more water-quality monitoring is needed. The discrete samples and continuous-monitoring data are also available in the USGS National Water Information System at https://waterdata.usgs.gov/nwis.

    This data release consists of:
    (1) Table of the discrete water-quality data collected (Merrimack_DiscreteWQ_Data.csv);
    (2) Statistical summaries including the minimum, median, and maximum of the discrete water-quality data collected (Merrimack_DiscreteWQ_Statistical_Data.original.csv);
    (3) Statistical summaries including the minimum, median, and maximum of the continuous water-quality data collected (Merrimack_ContinuousWQ_Statistical_Data.csv);
    (4) Table of vertical profile data (Merrimack_VerticalWQ_Profiles_Data.csv);
    (5) Table of continuous monitor deployment location and dates (Merrimack_ContinuousWQ_Deployment_Dates.csv);
    (6) Time-series plots of continuous water-quality data (Continuous_QW_Plots_All.zip);
    (7) Vertical profile plots (Vertical Profiles_QW_Plots.zip).

  15. Developer Community and Code Datasets

    • datarade.ai
    Cite
    Oxylabs, Developer Community and Code Datasets [Dataset]. https://datarade.ai/data-products/developer-community-and-code-datasets-oxylabs
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset authored and provided by
    Oxylabs
    Area covered
    El Salvador, Tuvalu, Bahamas, Philippines, United Kingdom, South Sudan, Djibouti, Guyana, Saint Pierre and Miquelon, Marshall Islands
    Description

    Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.

    Data Sources:

    1. GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.

    2. StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.

    3. DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.

    Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.

    With our datasets, you'll receive:

    • Usernames;
    • Companies;
    • Locations;
    • Job Titles;
    • Follower Counts;
    • Contact Details;
    • Employability Statuses;
    • And More.

    Choose from various output formats, storage options, and delivery frequencies:

    • Get datasets in CSV, JSON, or other preferred formats.
    • Opt for data delivery via SFTP or directly to your cloud storage, such as AWS S3.
    • Receive datasets either once or as per your agreed-upon schedule.

    Why choose our Datasets?

    1. Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.

    2. Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.

    3. Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.

    4. Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.

    Pricing Options:

    Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

    Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

    Experience a seamless journey with Oxylabs:

    • Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.
    • Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.
    • Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.
    • Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

    Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!

  16. Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    zip
    Updated Dec 7, 2023
    + more versions
    Cite
    Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_w3r2280w0
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    HIV Prevention Trials Network
    HIV Vaccine Trials Network: http://www.hvtn.org/
    National Institute of Allergy and Infectious Diseases: http://www.niaid.nih.gov/
    PEPFAR
    Authors
    Dylan Westfall; Mullins James
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

    Methods

    This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

    Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.

    For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

    The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

    The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

    Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

    To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

    Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
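
    The step that flags matched sUMI and dUMI sequences that differ can be illustrated with a short sketch. This is not the authors' compare_seqs.py, whose contents are not shown here: the FASTA parsing, the function names parse_fasta and discordant_ids, and the assumption that matched sequences share the same header are all hypothetical.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may span several lines; join them in upper case.
            records[header] += line.upper()
    return records


def discordant_ids(sumi_fasta, dumi_fasta):
    """List headers present in both inputs whose sequences differ."""
    sumi = parse_fasta(sumi_fasta)
    dumi = parse_fasta(dumi_fasta)
    shared = sorted(set(sumi) & set(dumi))
    return [name for name in shared if sumi[name] != dumi[name]]


# Hypothetical toy records, not data from the study.
sumi_text = ">fam1\nACGT\n>fam2\nACGA\n"
dumi_text = ">fam1\nACGT\n>fam2\nACGT\n"
mismatches = discordant_ids(sumi_text, dumi_text)
```

    An exact string comparison like this only reports that two sequences differ; locating the site of a discordance still requires an alignment, as described above.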

  17. Data from: GreEn-ER - Electricity Consumption Data of a Tertiary Building

    • data.mendeley.com
    • search.datacite.org
    Updated Sep 20, 2020
    Cite
    Gustavo Martin Nascimento (2020). GreEn-ER - Electricity Consumption Data of a Tertiary Building [Dataset]. http://doi.org/10.17632/h8mmnthn5w.1
    Dataset updated
    Sep 20, 2020
    Authors
    Gustavo Martin Nascimento
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides electricity consumption data collected from the building management system of GreEn-ER. This building, located in Grenoble, hosts Grenoble-INP Ense³ Engineering School and the G2ELab (Grenoble Electrical Engineering Laboratory). It brings together in one place the teaching and research actors around new energy technologies.

    The electricity consumption of the building is closely monitored with more than 300 meters.

    The data from each meter is available in one csv file, which contains two columns.

    One contains the Timestamp and the other contains the electricity consumption in kWh.

    The sampling rate for all data is 10 min.

    There are data available for 2017 and 2018.

    The dataset also contains data of the external temperature for 2017 and 2018.

    The files are structured as follows:

    • The main folder called "Data" contains 2 sub-folders, each one corresponding to one year (2017 and 2018).

    • Each sub-folder contains 3 other sub-folders, each one corresponding to a sector of the building.

    • The main folder "Data" also contains the csv files with the electricity consumption data of the whole building and a file called "Temp.csv" with the temperature data.

    • The separator used in the csv files is ";".

    • The sampling rate is 10 min and the unit of consumption is kWh, so each sample is the energy consumed during those 10 minutes. To retrieve the mean power (in kW) over the period covered by a sample, multiply the value by 6.

    • Four Jupyter Notebook files, a format that allows combining text, graphics and code in Python, are also available. These files allow exploring all the data within the dataset.

    • These Jupyter Notebook files contain all the metadata necessary for understanding the system, like drawings of the system design, of the building etc.

    • Each file is named by the number of its meter. These numbers can be retrieved in tables and drawings available in the Jupyter Notebooks.

    • A couple of csv files with the system design are also available. They are called "TGBT1_n.csv", "TGBT2_n.csv" and "PREDIS-MHI_n.csv".
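
    The conversion from energy per 10-minute sample to mean power follows from P = E / (10/60 h) = 6 E. A minimal sketch of reading one meter file with the ";" separator and applying that factor is given here. The header row, the column order (timestamp first, energy second), the "." decimal mark, and the sample values are assumptions, and the function name mean_power_kw is hypothetical.

```python
import csv
import io

SAMPLES_PER_HOUR = 6  # one sample every 10 minutes


def mean_power_kw(csv_text):
    """Convert per-sample energy (kWh per 10 min) to mean power (kW).

    Assumes a header row, ";" as separator, the timestamp in the first
    column and the energy in the second, with "." as decimal mark.
    """
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    next(reader)  # skip the assumed header row
    result = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        timestamp, energy_kwh = row[0], float(row[1])
        result.append((timestamp, energy_kwh * SAMPLES_PER_HOUR))
    return result


# Hypothetical sample rows; the real column headers may differ.
example = (
    "Timestamp;kWh\n"
    "2017-01-01 00:00:00;1.5\n"
    "2017-01-01 00:10:00;2.0\n"
)
power = mean_power_kw(example)
```

    With these made-up rows, 1.5 kWh in 10 minutes corresponds to a mean power of 9 kW and 2.0 kWh to 12 kW.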

  18. Amazon Fine Food Reviews

    • kaggle.com
    zip
    Updated May 1, 2017
    Cite
    Stanford Network Analysis Project (2017). Amazon Fine Food Reviews [Dataset]. https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews
    Explore at:
    Available download formats: zip (253873708 bytes)
    Dataset updated
    May 1, 2017
    Dataset authored and provided by
    Stanford Network Analysis Project
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain text review. It also includes reviews from all other Amazon categories.

    Contents

    • Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite
    • database.sqlite: Contains the table 'Reviews'

    Data includes:
    - Reviews from Oct 1999 - Oct 2012
    - 568,454 reviews
    - 256,059 users
    - 74,258 products
    - 260 users with > 50 reviews


    Acknowledgements

    See this SQLite query for a quick sample of the dataset.
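
    A quick look at the Reviews table might be taken along these lines. This is a sketch using Python's built-in sqlite3 module, not the query linked above: the helper name sample_rows is hypothetical, and the in-memory table with two made-up columns only stands in for database.sqlite so the example runs without the download.

```python
import sqlite3


def sample_rows(conn, limit=5):
    """Return up to `limit` rows of the Reviews table."""
    cursor = conn.execute("SELECT * FROM Reviews LIMIT ?", (limit,))
    return cursor.fetchall()


# Stand-in database so the sketch runs without the download; with the
# real file you would use sqlite3.connect("database.sqlite") instead.
# The two columns here are hypothetical, not the dataset's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Reviews (Id INTEGER, Text TEXT)")
conn.executemany(
    "INSERT INTO Reviews VALUES (?, ?)",
    [(1, "Great"), (2, "Fine"), (3, "Poor")],
)
rows = sample_rows(conn, limit=2)
conn.close()
```

    Using SELECT * avoids assuming any column names; once the real schema is known, naming the columns explicitly is usually preferable.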

    If you publish articles based on this dataset, please cite the following paper:

  19. Data from: A comprehensive dataset for the accelerated development and...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 24, 2020
    Cite
    Carreira Pedro, Hugo (2020). A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2826938
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Carreira Pedro, Hugo
    Larson, David
    Coimbra, Carlos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a comprehensive solar irradiance, imaging, and forecasting dataset. The goal with this release is to provide standardized solar and meteorological datasets to the research community for the accelerated development and benchmarking of forecasting methods. The data consist of three years (2014–2016) of quality-controlled, 1-min resolution global horizontal irradiance and direct normal irradiance ground measurements in California. In addition, we provide overlapping data from commonly used exogenous variables, including sky images, satellite imagery, Numerical Weather Prediction forecasts, and weather data. We also include sample codes of baseline models for benchmarking of more elaborate models.

    Data usage: The usage of the datasets and sample codes presented here is intended for research and development purposes only and implies explicit reference to the paper: Pedro, H.T.C., Larson, D.P., Coimbra, C.F.M., 2019. A comprehensive dataset for the accelerated development and benchmarking of solar forecasting methods. Journal of Renewable and Sustainable Energy 11, 036102. https://doi.org/10.1063/1.5094494

    Although every effort was made to ensure the quality of the data, no guarantees or liabilities are implied by the authors or publishers of the data.

    Sample code: As part of the data release, we are also including the sample code written in Python 3. The preprocessed data used in the scripts are also provided. The code can be used to reproduce the results presented in this work and as a starting point for future studies. Besides the standard scientific Python packages (numpy, scipy, and matplotlib), the code depends on pandas for time-series operations, pvlib for common solar-related tasks, and scikit-learn for Machine Learning models. All required Python packages are readily available on Mac, Linux, and Windows and can be installed via, e.g., pip.

    Units: All time stamps are in UTC (YYYY-MM-DD HH:MM:SS). All irradiance and weather data are in SI units. Sky image features are derived from 8-bit RGB (256 color levels) data. Satellite images are derived from 8-bit gray-scale (256 color levels) data.

    Missing data: The string "NAN" indicates missing data.

    File formats: All time series data files are in CSV (comma separated values). Images are given in tar.bz2 files.
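
    The conventions above (UTC time stamps in YYYY-MM-DD HH:MM:SS, the string "NAN" for missing data, comma-separated files) suggest a reader along these lines. This is a minimal sketch using only the standard library rather than the sample code shipped with the dataset; the function name read_series and the header names "timeStamp" and "ghi" are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def read_series(csv_text, time_column, value_column):
    """Parse UTC timestamps and floats, mapping "NAN" to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        stamp = datetime.strptime(row[time_column], "%Y-%m-%d %H:%M:%S")
        stamp = stamp.replace(tzinfo=timezone.utc)  # stamps are in UTC
        raw = row[value_column].strip()
        # float("NAN") would silently yield nan, so test the marker first.
        value = None if raw == "NAN" else float(raw)
        rows.append((stamp, value))
    return rows


# Hypothetical header names and values, for illustration only.
example = (
    "timeStamp,ghi\n"
    "2014-01-02 16:00:00,12.5\n"
    "2014-01-02 16:01:00,NAN\n"
)
series = read_series(example, "timeStamp", "ghi")
```

    Mapping the marker to None instead of a floating-point nan is a design choice that makes missing samples easy to filter before computing statistics.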

    Files

    Folsom_irradiance.csv (Primary): One-minute GHI, DNI, and DHI data.

    Folsom_weather.csv (Primary): One-minute weather data.

    Folsom_sky_images_{YEAR}.tar.bz2 (Primary): Tar archives with daytime sky images captured at 1-min intervals for the years 2014, 2015, and 2016, compressed with bz2.

    Folsom_NAM_lat{LAT}_lon{LON}.csv (Primary): NAM forecasts for the four nodes nearest the target location. {LAT} and {LON} are replaced by the node’s coordinates listed in Table I in the paper.

    Folsom_sky_image_features.csv (Secondary): Features derived from the sky images.

    Folsom_satellite.csv (Secondary): 10 pixel by 10 pixel GOES-15 images centered in the target location.

    Irradiance_features_{horizon}.csv (Secondary): Irradiance features for the different forecasting horizons ({horizon} = {intra-hour, intra-day, day-ahead}).

    Sky_image_features_intra-hour.csv (Secondary): Sky image features for the intra-hour forecasting issuing times.

    Sat_image_features_intra-day.csv (Secondary): Satellite image features for the intra-day forecasting issuing times.

    NAM_nearest_node_day-ahead.csv (Secondary): NAM forecasts (GHI, DNI computed with the DISC algorithm, and total cloud cover) for the nearest node to the target location prepared for day-ahead forecasting.

    Target_{horizon}.csv (Secondary): Target data for the different forecasting horizons.

    Forecast_{horizon}.py (Code): Python script used to create the forecasts for the different horizons.

    Postprocess.py (Code): Python script used to compute the error metric for all the forecasts.

  20. Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2024
    Cite
    Mastropaolo, Antonio (2024). Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8200098
    Dataset updated
    Jan 16, 2024
    Dataset provided by
    BAVOTA, Gabriele
    Di Penta, Massimiliano
    Pepe, Federica
    Mastropaolo, Antonio
    Canfora, Gerardo
    Nardone, Vittoria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This replication package contains datasets and scripts related to the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study"

    Root directory

    • statistics.r: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
    • modelsInfo.zip: zip file containing all the downloaded model cards (in JSON format)
    • script: directory containing all the scripts used to collect and process data. For further details, see README file inside the script directory.

    Dataset

    • Dataset/Dataset_HF-models-list.csv: list of HF models analyzed
    • Dataset/Dataset_github-prj-list.txt: list of GitHub projects using the transformers library
    • Dataset/Dataset_github-Prj_model-Used.csv: contains usage pairs: project, model
    • Dataset/Dataset_prj-num-models-reused.csv: number of models used by each GitHub project
    • Dataset/Dataset_model-download_num-prj_correlation.csv contains, for each model used by GitHub projects: the name, the task, the number of reusing projects, and the number of downloads

    RQ1

    • RQ1/RQ1_dataset-list.txt: list of HF datasets
    • RQ1/RQ1_datasetSample.csv: sample set of models used for the manual analysis of datasets
    • RQ1/RQ1_analyzeDatasetTags.py: Python script to analyze model tags for the presence of datasets. It requires unzipping modelsInfo.zip into a directory with the same name (modelsInfo) at the root of the replication package folder. It writes its output to stdout; redirect it to a file to be analyzed by the RQ2/countDataset.py script
    • RQ1/RQ1_countDataset.py: given the output of RQ2/analyzeDatasetTags.py (passed as argument) produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
    • RQ1/RQ1_datasetTags.csv: output of RQ2/analyzeDatasetTags.py
    • RQ1/RQ1_dataset_usage_count.csv: output of RQ2/countDataset.py

    RQ2

    • RQ2/tableBias.pdf: table detailing the number of occurrences of different types of bias by model task
    • RQ2/RQ2_bias_classification_sheet.csv: results of the manual labeling
    • RQ2/RQ2_isBiased.csv: file used to compute the inter-rater agreement on whether or not a model documents bias
    • RQ2/RQ2_biasAgrLabels.csv: file to compute the inter-rater agreement related to bias categories
    • RQ2/RQ2_final_bias_categories_with_levels.csv: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

    RQ3

    • RQ3/RQ3_LicenseValidation.csv: manual validation of a sample of licenses
    • RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt: lists of licenses with different permissiveness
    • RQ3/RQ3_prjs_license.csv: for each project linked to models, indicates (among other fields) the license tag and name
    • RQ3/RQ3_models_license.csv: for each model, indicates (among other information) whether the model has a license and, if so, what kind of license
    • RQ3/RQ3_model-prj-license_contingency_table.csv: usage contingency table between projects' licenses (columns) and models' licenses (rows)
    • RQ3/RQ3_models_prjs_licenses_with_type.csv: pairs project-model, with their respective licenses and permissiveness level
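
A contingency table of the kind described above, with models' licenses as rows and projects' licenses as columns, can be built from project-model license pairs. The following is a minimal sketch and not the authors' script: the function name `contingency_table` and the example pairs are hypothetical, and the actual column names of the CSV files are not stated here.

```python
from collections import Counter


def contingency_table(pairs):
    """Count (model_license, project_license) pairs.

    Returns a nested dict: rows are model licenses, columns are
    project licenses, cells are the number of usage pairs.
    """
    counts = Counter(pairs)
    rows = sorted({model for model, _ in counts})
    cols = sorted({project for _, project in counts})
    return {m: {p: counts.get((m, p), 0) for p in cols} for m in rows}


if __name__ == "__main__":
    # Hypothetical usage pairs, invented for illustration only.
    example_pairs = [
        ("apache-2.0", "mit"),
        ("apache-2.0", "mit"),
        ("apache-2.0", "gpl-3.0"),
        ("mit", "mit"),
    ]
    for model_license, row in contingency_table(example_pairs).items():
        print(model_license, row)
```

Cells with no observed pair are filled with 0 so that every row has the same columns.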

    scripts

    Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.

Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015

Sample Graph Datasets in CSV Format

Available download formats: csv
Dataset updated
Dec 9, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Edwin Carreño; Edwin Carreño
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sample Graph Datasets in CSV Format

Note: none of the data sets published here contain actual data; they are for testing purposes only.

Description

This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

  • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
  • dataset_30_edges_interactions.csv: contains 47 rows (edges).
  • the common identifier dataset_30 refers to the same graph.
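
The naming convention above can be used to locate and load both files of a graph from its identifier. This is a minimal sketch using only the Python standard library; the helper names `graph_file_names` and `load_graph` are hypothetical, and it assumes that every dataset uses the same `_nodes_interactions.csv` / `_edges_interactions.csv` suffixes as the dataset_30 example and that each file starts with a header row.

```python
import csv
from pathlib import Path


def graph_file_names(identifier, suffix="interactions"):
    """Return the (nodes, edges) file names that share one identifier."""
    return (
        f"{identifier}_nodes_{suffix}.csv",
        f"{identifier}_edges_{suffix}.csv",
    )


def load_graph(directory, identifier):
    """Read both CSV files of one graph into lists of row dictionaries."""
    nodes_name, edges_name = graph_file_names(identifier)
    base = Path(directory)
    with open(base / nodes_name, newline="", encoding="utf-8") as f:
        nodes = list(csv.DictReader(f))
    with open(base / edges_name, newline="", encoding="utf-8") as f:
        edges = list(csv.DictReader(f))
    return nodes, edges
```

For example, `load_graph("data", "dataset_30")` would read data/dataset_30_nodes_interactions.csv and data/dataset_30_edges_interactions.csv, where "data" is a placeholder for the local download directory.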

CSV nodes

Each dataset contains the following columns:

Name of the Column   Type     Description
UniProt ID           string   protein identification
label                string   protein label (type of node)
properties           string   a dictionary containing properties related to the protein

CSV edges

Each dataset contains the following columns:

Name of the Column   Type     Description
Relationship ID      string   relationship identification
Source ID            string   identification of the source protein in the relationship
Target ID            string   identification of the target protein in the relationship
label                string   relationship label (type of relationship)
properties           string   a dictionary containing properties related to the relationship
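
Edge rows with these columns can be turned into an adjacency list keyed by source protein. The sketch below is illustrative only: the function names are hypothetical, it assumes the CSV header uses the column names exactly as listed in the table, and it assumes the properties string is either JSON or a Python dictionary literal, since the serialization is not specified here.

```python
import ast
import json
from collections import defaultdict


def parse_properties(raw):
    """Parse a 'properties' cell: try JSON first, then a Python literal."""
    if not raw:
        return {}
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: a non-JSON cell is a Python dict literal such as {'a': 1}.
        value = ast.literal_eval(raw)
    return value if isinstance(value, dict) else {}


def build_adjacency(edge_rows):
    """Map each Source ID to a list of (Target ID, label, properties) tuples."""
    adjacency = defaultdict(list)
    for row in edge_rows:
        adjacency[row["Source ID"]].append(
            (row["Target ID"], row["label"], parse_properties(row["properties"]))
        )
    return dict(adjacency)
```

The rows passed to `build_adjacency` can come from `csv.DictReader` applied to an edges file.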

Metadata

Graph            Number of Nodes   Number of Edges   Sparse graph
dataset_30*      30                47                Y
dataset_60*      60                181               Y
dataset_120*     120               689               Y
dataset_240*     240               2819              Y
dataset_300*     300               4658              Y
dataset_600*     600               18004             Y
dataset_1200*    1200              71785             Y
dataset_2400*    2400              288600            Y
dataset_3000*    3000              449727            Y
dataset_6000*    6000              1799413           Y
dataset_12000*   12000             7199863           Y
dataset_24000*   24000             28792361          Y
dataset_30000*   30000             44991744          Y
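
The node and edge counts in the metadata table can be related through graph density. The sketch below assumes the graphs are directed without self-loops (suggested by the Source ID / Target ID columns, but not stated explicitly), so the maximum number of edges is n(n - 1). The function name `directed_density` is hypothetical, and the criterion behind the "Sparse graph" column is not given in the description.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of possible directed edges (no self-loops) that are present."""
    if num_nodes < 2:
        return 0.0
    return num_edges / (num_nodes * (num_nodes - 1))


# (nodes, edges) pairs copied from the metadata table.
METADATA = {
    "dataset_30": (30, 47),
    "dataset_3000": (3000, 449727),
    "dataset_30000": (30000, 44991744),
}

for name, (n, m) in METADATA.items():
    print(f"{name}: density = {directed_density(n, m):.4f}")
```

Under this assumption, the listed graphs come out at a density of roughly 0.05, that is, about 5% of all possible directed edges are present.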

This repository includes two (2) additional tiny graph datasets to experiment with before dealing with the larger datasets.

CSV nodes (tiny graphs)

Each dataset contains the following columns:

Name of the Column   Type     Description
ID                   string   node identification
label                string   node label (type of node)
properties           string   a dictionary containing properties related to the node

CSV edges (tiny graphs)

Each dataset contains the following columns:

Name of the Column   Type     Description
ID                   string   relationship identification
source               string   identification of the source node in the relationship
target               string   identification of the target node in the relationship
label                string   relationship label (type of relationship)
properties           string   a dictionary containing properties related to the relationship

Metadata (tiny graphs)

Graph             Number of Nodes   Number of Edges   Sparse graph
dataset_dummy*    3                 6                 N
dataset_dummy2*   3                 6                 N