100+ datasets found

Sample Graph Datasets in CSV Format

zenodo.org

csv

Updated Dec 9, 2024

+ more versions

Facebook

Twitter

Click to copy link

Link copied

Cite

Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015

Explore at:

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14335015

Dataset updated

Dec 9, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Edwin Carreño; Edwin Carreño

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sample Graph Datasets in CSV Format

Note: none of the data sets published here contain actual data, they are for testing purposes only.

Description

This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:

dataset_30_nodes_interactions.csv:contains 30 rows (nodes).
dataset_30_edges_interactions.csv: contains 47 rows (edges).
the common identifier dataset_30 refers to the same graph.

CSV nodes

Each dataset contains the following columns:

Name of the Column	Type	Description
UniProt ID	string	protein identification
label	string	protein label (type of node)
properties	string	a dictionary containing properties related to the protein.

CSV edges

Each dataset contains the following columns:

Name of the Column	Type	Description
Relationship ID	string	relationship identification
Source ID	string	identification of the source protein in the relationship
Target ID	string	identification of the target protein in the relationship
label	string	relationship label (type of relationship)
properties	string	a dictionary containing properties related to the relationship.

Metadata

Graph	Number of Nodes	Number of Edges	Sparse graph
dataset_30*	30	47	Y
dataset_60*	60	181	Y
dataset_120*	120	689	Y
dataset_240*	240	2819	Y
dataset_300*	300	4658	Y
dataset_600*	600	18004	Y
dataset_1200*	1200	71785	Y
dataset_2400*	2400	288600	Y
dataset_3000*	3000	449727	Y
dataset_6000*	6000	1799413	Y
dataset_12000*	12000	7199863	Y
dataset_24000*	24000	28792361	Y
dataset_30000*	30000	44991744	Y

This repository include two (2) additional tiny graph datasets to experiment before dealing with larger datasets.

CSV nodes (tiny graphs)

Each dataset contains the following columns:

Name of the Column	Type	Description
ID	string	node identification
label	string	node label (type of node)
properties	string	a dictionary containing properties related to the node.

CSV edges (tiny graphs)

Each dataset contains the following columns:

Name of the Column	Type	Description
ID	string	relationship identification
source	string	identification of the source node in the relationship
target	string	identification of the target node in the relationship
label	string	relationship label (type of relationship)
properties	string	a dictionary containing properties related to the relationship.

Metadata (tiny graphs)

Graph	Number of Nodes	Number of Edges	Sparse graph
dataset_dummy*	3	6	N
dataset_dummy2*	3	6	N

CSV file used in statistical analyses
data.csiro.au
researchdata.edu.au
+1more
Updated Oct 13, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
Explore at:
Unique identifier
https://doi.org/10.4225/08/543B4B4CA92E6
Dataset updated
Oct 13, 2014
Dataset authored and provided by
CSIROhttp://www.csiro.au/
License
https://research.csiro.au/dap/licences/csiro-data-licence/https://research.csiro.au/dap/licences/csiro-data-licence/
Time period covered
Mar 14, 2008 - Jun 9, 2009
Dataset funded by
CSIROhttp://www.csiro.au/
Description
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
B
Residential School Locations Dataset (CSV Format)
borealisdata.ca
search.dataone.org
Updated Jun 5, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP2/RIYEMU
Dataset updated
Jun 5, 2019
Dataset provided by
Borealis
Authors
Rosa Orlandini
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1863 - Jun 30, 1998
Area covered
Canada
Description
The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several Industrial schools and residential schools that were not part of the IRRSA. This version of the dataset doesn’t include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission, and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconcilation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholar's Initiative, and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini. Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its’ original location to another property, then the school is considered to have two unique locations in this dataset,the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.
h
doc-formats-csv-1
huggingface.co
Updated Nov 23, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Datasets examples (2023). doc-formats-csv-1 [Dataset]. https://huggingface.co/datasets/datasets-examples/doc-formats-csv-1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 23, 2023
Dataset authored and provided by
Datasets examples
Description
[doc] formats - csv - 1

This dataset contains one csv file at the root:

data.csv

kind,sound dog,woof cat,meow pokemon,pika human,hello

The YAML section of the README does not contain anything related to loading the data (only the size category metadata):

size_categories:

- n<1K
1000 Empirical Time series
figshare.com
bridges.monash.edu
+1more
png
Updated May 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ben Fulcher (2023). 1000 Empirical Time series [Dataset]. http://doi.org/10.6084/m9.figshare.5436136.v10
Explore at:
pngAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5436136.v10
Dataset updated
May 30, 2023
Dataset provided by
figshare
Authors
Ben Fulcher
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A diverse selection of 1000 empirical time series, along with results of an hctsa feature extraction, using v1.06 of hctsa and Matlab 2019b, computed on a server at The University of Sydney.The results of the computation are in the hctsa file, HCTSA_Empirical1000.mat for use in Matlab using v1.06 of hctsa.The same data is also provided in .csv format for the hctsa_datamatrix.csv (results of feature computation), with information about rows (time series) in hctsa_timeseries-info.csv, information about columns (features) in hctsa_features.csv (and corresponding hctsa code used to compute each feature in hctsa_masterfeatures.csv), and the data of individual time series (each line a time series, for time series described in hctsa_timeseries-info.csv) is in hctsa_timeseries-data.csv. These .csv files were produced by running >>OutputToCSV(HCTSA_Empirical1000.mat,true,true); in hctsa.The input file, INP_Empirical1000.mat, is for use with hctsa, and contains the time-series data and metadata for the 1000 time series. For example, massive feature extraction from these data on the user's machine, using hctsa, can proceed as>> TS_Init('INP_Empirical1000.mat');Some visualizations of the dataset are in CarpetPlot.png (first 1000 samples of all time series as a carpet (color) plot) and 150TS-250samples.png (conventional time-series plots of the first 250 samples of a sample of 150 time series from the dataset). More visualizations can be performed by the user using TS_PlotTimeSeries from the hctsa package.See links in references for more comprehensive documentation for performing methodological comparison using this dataset, and on how to download and use v1.06 of hctsa.
GitTables 1M - CSV files
zenodo.org
explore.openaire.eu
zip
Updated Jun 6, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6515973
Dataset updated
Jun 6, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Madelon Hulsebos; Çağatay Demiralp; Paul Groth; Madelon Hulsebos; Çağatay Demiralp; Paul Groth
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains >800K CSV files behind the GitTables 1M corpus.

For more information about the GitTables corpus, visit:

- our website for GitTables, or

- the main GitTables download page on Zenodo.
sample csv
kaggle.com
zip
Updated Apr 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
DevanshArora7 (2024). sample csv [Dataset]. https://www.kaggle.com/datasets/devansharora7/sample-csv/code
Explore at:
zip(134132 bytes)Available download formats
Dataset updated
Apr 7, 2024
Authors
DevanshArora7
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset

This dataset was created by DevanshArora7

Released under Apache 2.0

Contents
Company Datasets for Business Profiling
datarade.ai
Updated Feb 23, 2017
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxylabs (2017). Company Datasets for Business Profiling [Dataset]. https://datarade.ai/data-products/company-datasets-for-business-profiling-oxylabs
Explore at:
.json, .xml, .csv, .xlsAvailable download formats
Dataset updated
Feb 23, 2017
Dataset authored and provided by
Oxylabs
Area covered
Isle of Man, Tunisia, Nepal, Moldova (Republic of), Taiwan, Canada, British Indian Ocean Territory, Bangladesh, Andorra, Northern Mariana Islands
Description
Company Datasets for valuable business insights!

Discover new business prospects, identify investment opportunities, track competitor performance, and streamline your sales efforts with comprehensive Company Datasets.

These datasets are sourced from top industry providers, ensuring you have access to high-quality information:

Owler: Gain valuable business insights and competitive intelligence. -AngelList: Receive fresh startup data transformed into actionable insights. -CrunchBase: Access clean, parsed, and ready-to-use business data from private and public companies. -Craft.co: Make data-informed business decisions with Craft.co's company datasets. -Product Hunt: Harness the Product Hunt dataset, a leader in curating the best new products.

We provide fresh and ready-to-use company data, eliminating the need for complex scraping and parsing. Our data includes crucial details such as:

Company name;

Size;

Founding date;

Location;

Industry;

Revenue;

Employee count;

Competitors.

You can choose your preferred data delivery method, including various storage options, delivery frequency, and input/output formats.

Receive datasets in CSV, JSON, and other formats, with storage options like AWS S3 and Google Cloud Storage. Opt for one-time, monthly, quarterly, or bi-annual data delivery.

With Oxylabs Datasets, you can count on:

Fresh and accurate data collected and parsed by our expert web scraping team.

Time and resource savings, allowing you to focus on data analysis and achieving your business goals.

A customized approach tailored to your specific business needs.

Legal compliance in line with GDPR and CCPA standards, thanks to our membership in the Ethical Web Data Collection Initiative.

Pricing Options:

Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.

Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.

Experience a seamless journey with Oxylabs:

Understanding your data needs: We work closely to understand your business nature and daily operations, defining your unique data requirements.

Developing a customized solution: Our experts create a custom framework to extract public data using our in-house web scraping infrastructure.

Delivering data sample: We provide a sample for your feedback on data quality and the entire delivery process.

Continuous data delivery: We continuously collect public data and deliver custom datasets per the agreed frequency.

Unlock the power of data with Oxylabs' Company Datasets and supercharge your business insights today!
H
Dataset metadata of known Dataverse installations, August 2024
dataverse.harvard.edu
Updated Jan 1, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Julian Gautier (2025). Dataset metadata of known Dataverse installations, August 2024 [Dataset]. http://doi.org/10.7910/DVN/2SA6SN
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/2SA6SN
Dataset updated
Jan 1, 2025
Dataset provided by
Harvard Dataverse
Authors
Julian Gautier
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This dataset contains the metadata of the datasets published in 101 Dataverse installations, information about the metadata blocks of 106 installations, and the lists of pre-defined licenses or dataset terms that depositors can apply to datasets in the 88 installations that were running versions of the Dataverse software that include the "multiple-license" feature. The data is useful for improving understandings about how certain Dataverse features and metadata fields are used and for learning about the quality of dataset and file-level metadata within and across Dataverse installations. How the metadata was downloaded The dataset metadata and metadata block JSON files were downloaded from each installation between August 25 and August 30, 2024 using a "get_dataverse_installations_metadata" function in a collection of Python functions at https://github.com/jggautier/dataverse-scripts/blob/main/dataverse_repository_curation_assistant/dataverse_repository_curation_assistant_functions.py. In order to get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one column named "hostname" listing each installation URL for which I was able to create an account and another column named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and the listed API tokens to get metadata and other information from installations that require API tokens in order to use certain API endpoints. How the files are organized ├── csv_files_with_metadata_from_most_known_dataverse_installations │ ├── author_2024.08.25-2024.08.30.csv │ ├── contributor_2024.08.25-2024.08.30.csv │ ├── data_source_2024.08.25-2024.08.30.csv │ ├── ... │ └── topic_classification_2024.08.25-2024.08.30.csv ├── dataverse_json_metadata_from_each_known_dataverse_installation │ ├── Abacus_2024.08.26_15.52.42.zip │ ├── dataset_pids_Abacus_2024.08.26_15.52.42.csv │ ├── Dataverse_JSON_metadata_2024.08.26_15.52.42 │ ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json │ ├── ... │ ├── metadatablocks_v5.9 │ ├── astrophysics_v5.9.json │ ├── biomedical_v5.9.json │ ├── citation_v5.9.json │ ├── ... │ ├── socialscience_v5.6.json │ ├── ACSS_Dataverse_2024.08.26_00.02.51.zip │ ├── ... │ └── Yale_Dataverse_2024.08.25_03.52.57.zip └── dataverse_installations_summary_2024.08.30.csv └── dataset_pids_from_most_known_dataverse_installations_2024.08.csv └── license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv └── metadatablocks_from_most_known_dataverse_installations_2024.08.30.csv This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the "Citation" metadata block and "Geospatial" metadata block of datasets in the 101 Dataverse installations. For example, author_2024.08.25-2024.08.30.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in 101 installations, with a column for each of the four child fields: author name, affiliation, identifier type, and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 106 zip files, one zip file for each of the 106 Dataverse installations whose sites were functioning when I attempted to collect their metadata. Each zip file contains a directory with JSON files that have information about the installation's metadata fields, such as the field names and how they're organized. For installations that had published datasets, and I was able to use Dataverse APIs to download the dataset metadata, the zip file also contains: A CSV file listing information about the datasets published in the installation, including a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. A directory of JSON files that contain the metadata of the installation's published, non-deaccessioned dataset versions in the Dataverse JSON metadata schema. The dataverse_installations_summary_2024.08.30.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata included and not included in this dataset. The dataset_pids_from_most_known_dataverse_installations_2024.08.csv file contains the dataset PIDs of published datasets in 101 Dataverse installations, with a column to indicate if the Python script was able to download the dataset's metadata. It's a union of all "dataset_pids_....csv" files in each of the 101 zip files in the dataverse_json_metadata_from_each_known_dataverse_installation directory. The license_options_for_each_dataverse_installation_2024.08.28_14.42.54.csv file contains information about the licenses and...
Raw Data - CSV Files
osf.io
Updated Apr 27, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Katelyn Conn (2020). Raw Data - CSV Files [Dataset]. https://osf.io/h5wbt
Explore at:
Dataset updated
Apr 27, 2020
Dataset provided by
Center for Open Sciencehttps://cos.io/
Authors
Katelyn Conn
Description
Raw Data in .csv format for use with the R data wrangling scripts.
Z
Film Circulation dataset
data.niaid.nih.gov
zenodo.org
Updated Jul 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samoilova, Evgenia (Zhenya) (2024). Film Circulation dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7887671
Explore at:
Dataset updated
Jul 12, 2024
Dataset provided by
Samoilova, Evgenia (Zhenya)
Loist, Skadi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories, units of measurement, data sources and coding and missing data.

The csv file “4_festival-library_dataset_imdb-and-survey” contains data on all unique festivals collected from both IMDb and survey sources. This dataset appears in wide format, all information for each festival is listed in one row. This
UCI and OpenML Data Sets for Ordinal Quantification
zenodo.org
zip
Updated Jul 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. http://doi.org/10.5281/zenodo.8177302
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8177302
Dataset updated
Jul 25, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Mirko Bunse; Mirko Bunse; Alejandro Moreo; Alejandro Moreo; Fabrizio Sebastiani; Fabrizio Sebastiani; Martin Senz; Martin Senz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.

With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.

Usage

You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

Data Extraction: In your terminal, you can call either

make

(recommended), or

julia --project="." --eval "using Pkg; Pkg.instantiate()" julia --project="." extract-oq.jl

Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

Further Reading

Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
ML Basics Data Files
kaggle.com
zip
Updated Dec 7, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Satish Gunjal (2020). ML Basics Data Files [Dataset]. https://www.kaggle.com/satishgunjal/ml-basics-data-files
Explore at:
zip(643601 bytes)Available download formats
Dataset updated
Dec 7, 2020
Authors
Satish Gunjal
Description
Dataset

This dataset was created by Satish Gunjal

Released under Other (specified in description)

Contents
c
Data from: Datasets used to train the Generative Adversarial Networks used...
opendata.cern.ch
Updated 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ATLAS collaboration (2021). Datasets used to train the Generative Adversarial Networks used in ATLFast3 [Dataset]. http://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
Explore at:
Unique identifier
https://doi.org/10.7483/OPENDATA.ATLAS.UXKX.TXBN
Dataset updated
2021
Dataset provided by
CERN Open Data Portal
Authors
ATLAS collaboration
Description
Three datasets are available, each consisting of 15 csv files. Each file containing the voxelised shower information obtained from single particles produced at the front of the calorimeter in the |η| range (0.2-0.25) simulated in the ATLAS detector. Two datasets contain photons events with different statistics; the larger sample has about 10 times the number of events as the other. The other dataset contains pions. The pion dataset and the photon dataset with the lower statistics were used to train the corresponding two GANs presented in the AtlFast3 paper SIMU-2018-04.

The information in each file is a table; the rows correspond to the events and the columns to the voxels. The voxelisation procedure is described in the AtlFast3 paper linked above and in the dedicated PUB note ATL-SOFT-PUB-2020-006. In summary, the detailed energy deposits produced by ATLAS were converted from x,y,z coordinates to local cylindrical coordinates defined around the particle 3-momentum at the entrance of the calorimeter. The energy deposits in each layer were then grouped in voxels and for each voxel the energy was stored in the csv file. For each particle, there are 15 files corresponding to the 15 energy points used to train the GAN. The name of the csv file defines both the particle and the energy of the sample used to create the file.

The size of the voxels is described in the binning.xml file. Software tools to read the XML file and manipulate the spatial information of voxels are provided in the FastCaloGAN repository.
Updated on February 10th 2022. A new dataset photons_samples_highStat.tgz was added to this record and the binning.xml file was updated accordingly.
Updated on April 18th 2023. A new dataset pions_samples_highStat.tgz was added to this record.
d
Data Management Plan Examples Database
search.dataone.org
borealisdata.ca
Updated Sep 4, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
Explore at:
Unique identifier
https://doi.org/10.5683/SP3/SDITUG
Dataset updated
Sep 4, 2024
Dataset provided by
Borealis
Authors
Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak
Time period covered
Jan 1, 2011 - Jan 1, 2023
Description
This dataset is comprised of a collection of example DMPs from a wide array of fields; obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data-type, description of project, link to the DMP, and where possible external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
Z
Brussel mobility Twitter sentiment analysis CSV Dataset
data.niaid.nih.gov
zenodo.org
Updated May 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Betancur Arenas, Juliana (2024). Brussel mobility Twitter sentiment analysis CSV Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11401123
Explore at:
Dataset updated
May 31, 2024
Dataset provided by
Ginis, Vincent
Betancur Arenas, Juliana
van Vessem, Charlotte
Tori, Floriano
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Brussels
Description
SSH CENTRE (Social Sciences and Humanities for Climate, Energy aNd Transport Research Excellence) is a Horizon Europe project, engaging directly with stakeholders across research, policy, and business (including citizens) to strengthen social innovation, SSH-STEM collaboration, transdisciplinary policy advice, inclusive engagement, and SSH communities across Europe, accelerating the EU’s transition to carbon neutrality. SSH CENTRE is based in a range of activities related to Open Science, inclusivity and diversity – especially with regards Southern and Eastern Europe and different career stages – including: development of novel SSH-STEM collaborations to facilitate the delivery of the EU Green Deal; SSH knowledge brokerage to support regions in transition; and the effective design of strategies for citizen engagement in EU R&I activities. Outputs include action-led agendas and building stakeholder synergies through regular Policy Insight events.This is captured in a high-profile virtual SSH CENTRE generating and sharing best practice for SSH policy advice, overcoming fragmentation to accelerate the EU’s journey to a sustainable future.The documents uploaded here are part of WP2 whereby novel, interdisciplinary teams were provided funding to undertake activities to develop a policy recommendation related to EU Green Deal policy. Each of these policy recommendations, and the activities that inform them, will be written-up as a chapter in an edited book collection. Three books will make up this edited collection - one on climate, one on energy and one on mobility. As part of writing a chapter for the SSH CENTRE book on ‘Mobility’, we set out to analyse the sentiment of users on Twitter regarding shared and active mobility modes in Brussels. This involved us collecting tweets between 2017-2022. A tweet was collected if it contained a previously defined mobility keyword (for example: metro) and either the name of a (local) politician, a neighbourhood or municipality, or a (shared) mobility provider. The files attached to this Zenodo webpage is a csv files containing the tweets collected.”.
7+ Million Company Dataset
kaggle.com
zip
Updated May 10, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
People Data Labs (2019). 7+ Million Company Dataset [Dataset]. https://www.kaggle.com/datasets/peopledatalabssf/free-7-million-company-dataset
Explore at:
zip(291957415 bytes)Available download formats
Dataset updated
May 10, 2019
Authors
People Data Labs
Description
Dataset

This dataset was created by People Data Labs

Contents
Bank Marketing Classification Dataset
kaggle.com
Updated Aug 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
BALAJI VARA PRASAD DEGA (2024). Bank Marketing Classification Dataset [Dataset]. http://doi.org/10.34740/kaggle/ds/5532086
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.34740/kaggle/ds/5532086
Dataset updated
Aug 26, 2024
Dataset provided by
Kagglehttp://kaggle.com/
Authors
BALAJI VARA PRASAD DEGA
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed.

There are four datasets: 1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014] 2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs. 3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). 4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).
Purchase Order Data
data.ca.gov
catalog.data.gov
csv, docx, pdf
Updated Oct 23, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
California Department of General Services (2019). Purchase Order Data [Dataset]. https://data.ca.gov/dataset/purchase-order-data
Explore at:
docx, pdf, csvAvailable download formats
Dataset updated
Oct 23, 2019
Dataset authored and provided by
California Department of General Services
Description
The State Contract and Procurement Registration System (SCPRS) was established in 2003, as a centralized database of information on State contracts and purchases over $5000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015

Data Limitations:
Some purchase orders have multiple UNSPSC numbers, however only first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event however this affects the formatting of the file. The source system Bidsync is being deprecated and these issues will be resolved in the future as state systems transition to Fi$cal.

Data Collection Methodology:

The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.

Secondary/Related Resources:

State Contract Manual (SCM) vol. 2 http://www.dgs.ca.gov/pd/Resources/publications/SCM2.aspx

State Contract Manual (SCM) vol. 3 http://www.dgs.ca.gov/pd/Resources/publications/SCM3.aspx

Buying Green http://www.dgs.ca.gov/buyinggreen/Home.aspx

United Nations Standard Products and Services Code, http://www.unspsc.org/
Gene expression csv files
figshare.com
txt
Updated Jun 12, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cristina Alvira (2023). Gene expression csv files [Dataset]. http://doi.org/10.6084/m9.figshare.21861975.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.21861975.v1
Dataset updated
Jun 12, 2023
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Cristina Alvira
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Csv files containing all detectable genes.

Facebook

Twitter

Click to copy link

Link copied

Cite

Edwin Carreño; Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015

Sample Graph Datasets in CSV Format

Explore at:

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.14335015

Dataset updated

Dec 9, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Edwin Carreño; Edwin Carreño

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Sample Graph Datasets in CSV Format

Note: none of the data sets published here contain actual data, they are for testing purposes only.