100+ datasets found
  1. Residential School Locations Dataset (CSV Format)

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Orlandini, Rosa (2023). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Orlandini, Rosa
    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels listed in the Indian Residential Schools Settlement Agreement are included in this dataset, as well as several industrial schools and residential schools that were not part of the IRSSA. This version of the dataset does not include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada, produced for the Tk'emlups First Nation and the Justice for Day Scholars Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed to either RSIM, Morgan Hite, the NCTR, or Rosa Orlandini.

    Many schools/hostels had several locations throughout the history of the institution. If a school/hostel moved from its original location to another property, the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it is not considered a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided; when the precise location of the building is not known, an approximate location is provided.

    For each residential school institution location, the following information is provided: official names, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.

  2. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset- and file-level metadata within and across Dataverse installations, and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well the Dataverse software, and the repositories using it, help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. To get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one named "hostname" listing each installation URL at which I was able to create an account, and another named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and uses the listed API tokens to get metadata and other information from installations that require them.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.6
    │   ├── astrophysics_v5.6.json
    │   ├── biomedical_v5.6.json
    │   ├── citation_v5.6.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, with a row for each author's name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download.
Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
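
    Since the latest-version exports are the ones marked "(latest_version)" in the file name, they can be picked out of each sub-directory with a simple filename filter. A minimal sketch (the directory path and the second file name are illustrative):

```python
from pathlib import Path

def latest_version_exports(metadata_dir):
    """Keep only the Dataverse JSON exports of each dataset's latest
    version, identified by the "(latest_version)" filename suffix."""
    return sorted(
        p.name for p in Path(metadata_dir).glob("*.json")
        if p.stem.endswith("(latest_version)")
    )

# The same filter applied to a list of example file names
# (the v0.9 entry is invented for illustration):
names = [
    "hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json",
    "hdl_11272.1_AB2_0AQZNT_v0.9.json",
]
latest = [n for n in names if n.endswith("(latest_version).json")]
```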

  3. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    csv (available download formats)
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset was collected as part of Prac1 of the course Typology and Data Life Cycle in the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted for the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in "Month Day Year" format for the rest.
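
    Given the split date formats, a loader has to try both. A sketch, assuming the second chunk spells dates with full month names (e.g. "November 10 2020"); adjust the format strings if the actual values differ:

```python
from datetime import datetime

def parse_publish_date(raw: str) -> datetime:
    """Parse publishDate/firstPublishDate values that appear either as
    mm/dd/yyyy (first 30000 records) or as 'Month Day Year' (the rest)."""
    for fmt in ("%m/%d/%Y", "%B %d %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

parse_publish_date("11/10/2020")        # first chunk style
parse_publish_date("November 10 2020")  # second chunk style
```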

    Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.

    The 25 fields of the dataset are:

    | Attribute | Definition | Completeness (%) |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (e.g., Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Publishing house | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |
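
    The completeness column can be recomputed from the CSV as the percentage of non-empty values per field. A standard-library sketch over made-up rows:

```python
import csv, io

def completeness(csv_text: str) -> dict:
    """Percentage of non-empty values per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n = len(rows)
    return {
        col: round(100 * sum(1 for r in rows if r[col].strip()) / n)
        for col in rows[0]
    }

# Toy sample: 'series' is filled for 2 of 4 books, matching its low completeness.
sample = "bookId,series\n1,The Hunger Games\n2,\n3,\n4,Harry Potter\n"
completeness(sample)  # {'bookId': 100, 'series': 50}
```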

  4. Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2023
    Cite
    Vansummeren, Stijn (2023). Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8098908
    Dataset updated
    Jul 1, 2023
    Dataset provided by
    Vansummeren, Stijn
    Weytjens, Sebastiaan
    Peeters, Liesbet M.
    Parciak, Marcel
    Hens, Niel
    Neven, Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

    This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.

    The file ground_truth.csv is a comma-separated file containing the approximate functional dependencies. The column table names the relation referred to, while lhs and rhs reference two columns of that relation where, semantically, we found that lhs implies rhs.

    The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple in which both attributes had a value, or if the g3_prime value was too small.
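
    For intuition, the classical g3 error of an approximate FD (the minimum fraction of tuples that must be removed for the FD to hold exactly) can be computed as below; note that the benchmark's g3_prime value may be normalised differently:

```python
from collections import Counter, defaultdict

def g3(rows, lhs, rhs):
    """Classical g3 error of the FD lhs -> rhs: 1 minus the fraction of
    tuples kept when, within each lhs group, only the most common rhs
    value is retained."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return 1 - kept / len(rows)

# Toy rows modelled on the claims.csv example: AirportCode -> AirportName.
rows = [
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy"},
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy"},
    {"AirportCode": "JFK", "AirportName": "Kennedy Intl"},  # one violating tuple
    {"AirportCode": "LAX", "AirportName": "Los Angeles"},
]
g3(rows, "AirportCode", "AirportName")  # 0.25
```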

    Dataset References

    adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

    claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.

    dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.

    hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection accompanying that paper.

    t_biocase_... files: t_bioc_... files used in the same paper (Birnick et al. 2020, see above). Made available as part of the dataset collection accompanying that paper.

    tax.csv: Tax dataset used in the same paper (Birnick et al. 2020, see above). Made available as part of the dataset collection accompanying that paper.

  5. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims"

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    Cite
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research on multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to access the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    To create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The way to report considerable mistakes in raw collected data or in manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.


    Annotations

    Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., article, source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. category of annotation (`annotation_category`). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of an AI method).
    2. type of annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.

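    Putting the four attributes together, a single annotation record might be processed like this (the field values are illustrative; the real value structure differs per annotation type):

```python
import json

# A hypothetical entity-annotation record with the four attributes above.
annotation = {
    "annotation_category": "label",  # "label" or "prediction"
    "annotation_type_id": 1,         # see annotation_types.csv
    "method_id": 2,                  # see methods.csv
    "value": '{"value": true}',      # JSON; structure depends on the type
}

value = json.loads(annotation["value"])          # decode the JSON payload
is_ground_truth = annotation["annotation_category"] == "label"
```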

    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation

  6. Warehouse and Retail Sales

    • catalog.data.gov
    • data.montgomerycountymd.gov
    • +4more
    Updated Jul 5, 2025
    Cite
    data.montgomerycountymd.gov (2025). Warehouse and Retail Sales [Dataset]. https://catalog.data.gov/dataset/warehouse-and-retail-sales
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    data.montgomerycountymd.gov
    Description

    This dataset contains a list of sales and movement data by item and department, appended monthly. Update frequency: monthly.

  7. ISO 639-1 Language Codes

    • kaggle.com
    Updated May 7, 2023
    Cite
    Mahesh Jadhav (2023). ISO 639-1 Language Codes [Dataset]. https://www.kaggle.com/datasets/ursmaheshj/iso-639-1-language-codes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 7, 2023
    Dataset provided by
    Kaggle
    Authors
    Mahesh Jadhav
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    ISO 639-1 is a standard for language codes that assigns a two-letter code to represent a language. This code is used to identify languages in computer systems, websites, and other applications that require language tagging. The codes are based on the English names of languages, and each language is assigned a unique code that consists of two letters. For example, "en" represents English, "fr" represents French, and "es" represents Spanish. ISO 639-1 language codes are commonly used in multilingual environments to ensure consistent and accurate representation of languages across different systems and platforms.

    You can easily use this dataset to replace the ISO 639-1 codes in your own data with the appropriate language names.
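
    Used as a lookup table, that replacement is a one-liner per record. A sketch with a hand-coded subset of the mapping (the full dataset provides all codes):

```python
# A small hand-coded subset of ISO 639-1; the real dataset covers all codes.
iso_639_1 = {"en": "English", "fr": "French", "es": "Spanish"}

# Illustrative records carrying language codes.
records = [{"title": "A", "language": "en"}, {"title": "B", "language": "fr"}]
for rec in records:
    # Replace the code with the language name; keep unknown codes as-is.
    rec["language"] = iso_639_1.get(rec["language"], rec["language"])
```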

    This data is collected from the sites below.

  8. Bulk data files for all years – releases, disposals, transfers and facility locations

    • open.canada.ca
    • ouvert.canada.ca
    csv, html
    Updated Jul 15, 2025
    + more versions
    Cite
    Environment and Climate Change Canada (2025). Bulk data files for all years – releases, disposals, transfers and facility locations [Dataset]. https://open.canada.ca/data/en/dataset/40e01423-7728-429c-ac9d-2954385ccdfb
    Explore at:
    csv, html (available download formats)
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Environment and Climate Change Canada: https://www.canada.ca/en/environment-climate-change.html
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Time period covered
    Jan 1, 1993 - Dec 31, 2023
    Description

    The National Pollutant Release Inventory (NPRI) is Canada's public inventory of pollutant releases (to air, water and land), disposals and transfers for recycling. Each file contains data from 1993 to the latest reporting year. These CSV format datasets are in normalized or 'list' format and are optimized for pivot table analyses. Here is a description of each file:

    • The RELEASES file contains all substance release quantities.
    • The DISPOSALS file contains all on-site and off-site disposal quantities, including tailings and waste rock (TWR).
    • The TRANSFERS file contains all quantities transferred for recycling or treatment prior to disposal.
    • The COMMENTS file contains all the comments provided by facilities about substances included in their report.
    • The GEO LOCATIONS file contains complete geographic information for all facilities that have reported to the NPRI.

    Please consult the following resources to enhance your analysis:

    • Guide on using and interpreting NPRI data: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/using-interpreting-data.html
    • Access additional data from the NPRI, including datasets and mapping products: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/exploredata.html

    Supplemental information: more NPRI datasets and mapping products are available here: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/access.html Supporting projects: National Pollutant Release Inventory (NPRI)
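
    Because the files are normalized 'list' data, a pivot-table-style summary reduces to a group-and-sum. A sketch over made-up rows (actual NPRI column names may differ):

```python
from collections import defaultdict

# Hypothetical normalized rows as they might appear in the RELEASES file.
release_rows = [
    {"year": 2022, "substance": "Lead", "quantity": 1.5},
    {"year": 2022, "substance": "Lead", "quantity": 0.5},
    {"year": 2023, "substance": "Lead", "quantity": 2.0},
]

# Pivot: total released quantity per (year, substance).
totals = defaultdict(float)
for r in release_rows:
    totals[(r["year"], r["substance"])] += r["quantity"]
# totals[(2022, "Lead")] == 2.0
```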

  9. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that happened over a period of time. The retailer will use the results to grow its business and to make itemset suggestions to customers, so that it can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rules are most useful when you are planning to find associations between different objects in a set, i.e. frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought mouse mat => bought computer mouse:

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(mouse mat) = 0.08/0.09 ≈ 0.89
    • lift = confidence / P(computer mouse) = 0.89/0.10 = 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
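The same numbers can be checked in a few lines; a minimal sketch (the helper name is illustrative, not from the original notebook):

```python
# Hedged sketch: support, confidence and lift for the worked example above
# (100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both).

def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Metrics for the rule antecedent => consequent."""
    support = n_both / n_total
    confidence = support / (n_antecedent / n_total)
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift

# Rule: bought mouse mat => bought computer mouse
support, confidence, lift = rule_metrics(100, 9, 10, 8)
print(round(support, 2), round(confidence, 2), round(lift, 1))  # 0.08 0.89 8.9
```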

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule
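As a minimal illustration of this strategy (not the notebook's actual arules code), the core of the association-rules step can be sketched in pure Python on a toy transaction list:

```python
from itertools import combinations
from collections import Counter

# Toy transactions standing in for the retail invoices (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def frequent_itemsets(transactions, min_support=0.4, max_size=2):
    """Count itemsets of size 1..max_size and keep those above min_support."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for items in combinations(sorted(t), size):
                counts[items] += 1
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

freq = frequent_itemsets(transactions)

# Derive rules A => B from the frequent 2-itemsets.
rules = []
for items, supp in freq.items():
    if len(items) == 2:
        a, b = items
        conf = supp / freq[(a,)]  # confidence = support(A,B) / support(A)
        rules.append((a, b, supp, conf))

print(len(rules))
```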

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522,065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next we will clean our data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply association rule mining, we need to convert the dataframe into transaction data so that all items that are bought together in one invoice will be in ...

  10. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability - and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
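The first-festival rule described above (the wide file keeps the first sample festival per unique film) can be illustrated in a few lines; a minimal Python sketch with made-up rows and column names, not the dataset's exact headers:

```python
# Collapse a long-format table to wide format, keeping the first festival
# appearance per unique film ID (illustrative rows and column names).
long_rows = [
    {"film_id": 1, "title": "Film A", "festival": "Berlinale"},
    {"film_id": 1, "title": "Film A", "festival": "Frameline"},
    {"film_id": 2, "title": "Film B", "festival": "Sundance"},
]

wide = {}
for row in long_rows:
    wide.setdefault(row["film_id"], row)  # first occurrence wins

print(len(wide), wide[1]["festival"])  # 2 Berlinale
```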


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. It then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity matches titles with a high degree of similarity, while the OSA (optimal string alignment) algorithm matches titles that may have typos or minor variations.

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and flags them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  11. THVD (Talking Head Video Dataset)

    • data.mendeley.com
    Updated Apr 29, 2025
    + more versions
    Cite
    Mario Peedor (2025). THVD (Talking Head Video Dataset) [Dataset]. http://doi.org/10.17632/ykhw8r7bfx.2
    Explore at:
    Dataset updated
    Apr 29, 2025
    Authors
    Mario Peedor
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    About

    We provide a comprehensive talking-head video dataset with over 50,000 videos, totaling more than 500 hours of footage and featuring 20,841 unique identities from around the world.

    Distribution

    Detailing the format, size, and structure of the dataset:

    Data Volume:

    • Total Size: 2.7 TB
    • Total Videos: 47,547
    • Identities Covered: 20,841
    • Resolution: 60% 4K (1980), 33% Full HD (1080)
    • Formats: MP4
    • Full-length videos with visible mouth movements in every frame.
    • Minimum face size of 400 pixels.
    • Video durations range from 20 seconds to 5 minutes.
    • Faces have not been cut out; full-screen videos including backgrounds.

    Usage

    This dataset is ideal for a variety of applications:

    Face Recognition & Verification: Training and benchmarking facial recognition models.

    Action Recognition: Identifying human activities and behaviors.

    Re-Identification (Re-ID): Tracking identities across different videos and environments.

    Deepfake Detection: Developing methods to detect manipulated videos.

    Generative AI: Training high-resolution video generation models.

    Lip Syncing Applications: Enhancing AI-driven lip-syncing models for dubbing and virtual avatars.

    Background AI Applications: Developing AI models for automated background replacement, segmentation, and enhancement.

    Coverage

    Explaining the scope and coverage of the dataset:

    Geographic Coverage: Worldwide

    Time Range: Time range and size of the videos have been noted in the CSV file.

    Demographics: Includes information about age, gender, ethnicity, format, resolution, and file size.

    Languages Covered (Videos):

    English: 23,038 videos

    Portuguese: 1,346 videos

    Spanish: 677 videos

    Norwegian: 1,266 videos

    Swedish: 1,056 videos

    Korean: 848 videos

    Polish: 1,807 videos

    Indonesian: 1,163 videos

    French: 1,102 videos

    German: 1,276 videos

    Japanese: 1,433 videos

    Dutch: 1,666 videos

    Indian: 1,163 videos

    Czech: 590 videos

    Chinese: 685 videos

    Italian: 975 videos

    Filipino: 920 videos

    Bulgarian: 340 videos

    Romanian: 1,144 videos

    Arabic: 1,691 videos

    Who Can Use It

    List examples of intended users and their use cases:

    Data Scientists: Training machine learning models for video-based AI applications.

    Researchers: Studying human behavior, facial analysis, or video AI advancements.

    Businesses: Developing facial recognition systems, video analytics, or AI-driven media applications.

    Additional Notes

    Ensure ethical usage and compliance with privacy regulations. The dataset's quality and scale make it valuable for high-performance AI training. Preprocessing (cropping, downsampling) may be needed for different use cases. The dataset is not yet complete and expands daily; please contact for the most up-to-date CSV file. The dataset has been divided into 100 GB zipped files and is hosted on a private server (with the option to upload to the cloud if needed). To verify the dataset's quality, please contact me for the full CSV file.

  12. 📊 Yahoo Answers 10 categories for NLP CSV

    • kaggle.com
    Updated Apr 7, 2023
    Cite
    Yassir Acharki (2023). 📊 Yahoo Answers 10 categories for NLP CSV [Dataset]. http://doi.org/10.34740/kaggle/dsv/5339321
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yassir Acharki
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and the total number of testing samples is 60,000. From all the answers and other meta-information, we only used the best answer content and the main category information.

    The file classes.txt contains a list of classes corresponding to each label.

    The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
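Given that escaping scheme, the files can be read with a standard CSV parser and the literal backslash-n sequences unescaped afterwards; a minimal sketch with a made-up sample row:

```python
import csv
import io

# One made-up row in the documented format: class index, title, content, answer.
# Fields are double-quoted, internal quotes doubled, newlines stored as literal \n.
sample = '"5","What is CSV?","A text format.\\nOne record per line.","See RFC 4180."\n'

reader = csv.reader(io.StringIO(sample), quotechar='"', doublequote=True)
# Convert the escaped "\n" sequences back into real newlines in every field.
rows = [[field.replace("\\n", "\n") for field in row] for row in reader]

label, title, content, answer = rows[0]
print(label, title)  # 5 What is CSV?
```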

  13. MaRV Scripts and Dataset

    • zenodo.org
    zip
    Updated Dec 15, 2024
    Cite
    Anonymous; Anonymous (2024). MaRV Scripts and Dataset [Dataset]. http://doi.org/10.5281/zenodo.14450098
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

    Our dataset is located at the path dataset/MaRV.json

    The guidelines for replicating the study are provided below:

    Requirements

    1. Software Dependencies:

    • Python 3.10+ with packages in requirements.txt
    • Git: Required to clone repositories.
    • Java 17: RefactoringMiner requires Java 17 to perform the analysis.
    • PHP 8.0: Required to host the Web tool.
    • MySQL 8: Required to store the Web tool data.

    2. Environment Variables:

    • Create a .env file based on .env.example in the src folder and set the variables:
      • CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
      • CLONE_DIR: Directory where repositories will be cloned.
      • JAVA_PATH: Path to the Java executable.
      • REFACTORING_MINER_PATH: Path to RefactoringMiner.
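A minimal .env sketch matching the variables above; all values are illustrative placeholders, not paths from the replication package:

```shell
# Illustrative .env values; adjust every path to your own machine.
CSV_PATH=./repositories.csv
CLONE_DIR=./cloned_repos
JAVA_PATH=/usr/bin/java
REFACTORING_MINER_PATH=/opt/RefactoringMiner/bin/RefactoringMiner
```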

    Refactoring Technique Selection

    1. Environment Setup:

    • Ensure all dependencies are installed. Install the required Python packages with:
      pip install -r requirements.txt
      

    2. Configuring the Repositories CSV:

    • The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).

    3. Executing the Script:

    • Configure the environment variables in the .env file and set up the repositories CSV, then run:
      python3 src/run_rm.py
      
    • The RefactoringMiner output from the 126 repositories of our study is available at:
      https://zenodo.org/records/14395034

    4. Script Behavior:

    • The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
    • Results and Logs:
      • Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
      • Logs for each repository, including error messages, are saved as .log files in the same directory.

    5. Count Refactorings:

    • To count instances for each refactoring technique, run:
      python3 src/count_refactorings.py
      
    • The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
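The counting step can be sketched as follows; note that the JSON shape shown is an assumption based on RefactoringMiner's documented output format, not taken from the replication package:

```python
import json
from collections import Counter

# Assumed shape of a RefactoringMiner result file (illustrative sample data).
sample = json.loads("""
{"commits": [
  {"sha1": "abc123", "refactorings": [
    {"type": "Extract Method"}, {"type": "Rename Method"}]},
  {"sha1": "def456", "refactorings": [{"type": "Extract Method"}]}
]}
""")

# Tally refactoring instances by technique across all commits.
counts = Counter(
    ref["type"]
    for commit in sample["commits"]
    for ref in commit["refactorings"]
)
print(counts["Extract Method"])  # 2
```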

    Data Gathering

    • To collect snippets before and after refactoring and their metadata, run:

      python3 src/diff.py '[refactoring technique]'
      

      Replace [refactoring technique] with the desired technique name (e.g., Extract Method).

    • The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

    • Dataset Availability:

      • The snippets and metadata from the 126 repositories of our study are available in the dataset directory.
    • To generate the SQL file for the Web tool, run:

      python3 src/generate_refactorings_sql.py
      

    Web Tool for Manual Evaluation

    • The Web tool scripts are available in the web directory.
    • Populate the data/output/snippets folder with the output of src/diff.py.
    • Run the sql/create_database.sql script in your database.
    • Import the SQL file generated by src/generate_refactorings_sql.py.
    • Run dataset.php to generate the MaRV dataset file.
    • The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.
  14. Example code list definition in csv format

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    David A. Springate; Rosa Parisi; Ivan Olier; David Reeves; Evangelos Kontopantelis (2023). Example code list definition in csv format. [Dataset]. http://doi.org/10.1371/journal.pone.0171784.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    David A. Springate; Rosa Parisi; Ivan Olier; David Reeves; Evangelos Kontopantelis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example code list definition in csv format.

  15. Mars rover dataset

    • kaggle.com
    Updated Mar 1, 2025
    Cite
    Gaurav Kumar (2025). Mars rover dataset [Dataset]. https://www.kaggle.com/datasets/gauravkumar2525/mars-rover-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Kumar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    1. Description

    The description section is crucial for helping users understand the purpose, context, and potential applications of your dataset. It should include the following details:

    • Dataset Overview: Provide a clear summary of what the dataset contains. For example, "This dataset includes images captured by NASA’s Curiosity Rover on Mars, along with metadata such as the camera used, the Martian sol (day) when the photo was taken, and the corresponding Earth date."
    • Source of the Data: Explain where the data comes from. If it’s obtained via an API, mention the source (e.g., "The images and metadata are retrieved from NASA's Mars Rover Photos API."). If the dataset is curated from multiple sources, list them.
    • Purpose and Use Cases: Describe why this dataset was created and how it can be used. For example:
      • Machine Learning: Train models for image classification, object detection, and anomaly detection.
      • Scientific Research: Analyze Martian surface patterns, study terrain features, or examine rover camera performance.
      • Space Exploration: Understand Mars' environmental conditions and assist in future exploration planning.
    • Data Format and Organization: Briefly mention the format of the files (e.g., CSV file for metadata, ZIP file containing images) and how they are structured.
    • Licensing and Permissions: Specify if the dataset has any restrictions on usage. Since NASA data is typically public domain, state that users are free to use it for research and development.
    • Limitations or Considerations: Mention any potential challenges, such as missing data, limited coverage, or resolution constraints.

    2. File Information

    This section provides details about the files included in your dataset, helping users navigate and use them efficiently. Key points to include:

    • List of Files: Clearly mention all the files and their formats, such as:
      • mars_rover_dataset.csv (CSV file containing metadata of images)
      • mars_images.zip (Compressed folder containing all images)
    • Purpose of Each File:
      • CSV File: Contains structured data, including image IDs, timestamps, camera details, and URLs.
      • ZIP File: Stores actual Mars images, which can be extracted and used for ML training or visualization.
    • File Dependencies: Explain how files relate to each other. For example, "The img_src column in mars_rover_dataset.csv corresponds to the images stored in mars_images.zip. Users should extract the images before using the dataset for model training."
    • How to Access the Files: Provide instructions on downloading and extracting files. Example:
      unzip mars_images.zip
      This ensures that users can quickly set up the dataset in their working environment.
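The file dependency described above (the img_src column mapping to images extracted from mars_images.zip) can be sketched in a few lines; the sample row and values below are made up for illustration:

```python
import csv
import io
import os

# Illustrative metadata rows matching the documented columns (made-up values).
metadata_csv = io.StringIO(
    "id,sol,camera_name,img_src,earth_date\n"
    "102693,1000,FHAZ,https://mars.nasa.gov/msl-raw-images/fhaz_1000.jpg,2015-05-30\n"
)

# Map each image URL to the local filename expected after extracting the zip.
rows = list(csv.DictReader(metadata_csv))
local_files = [os.path.basename(row["img_src"]) for row in rows]
print(local_files)  # ['fhaz_1000.jpg']
```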

    3. Column Descriptions

    This section explains the meaning of each column in the dataset, ensuring users can analyze and interpret the data correctly. A well-structured table format is often useful:

    • id: Unique identifier for each image.
    • sol: Martian sol (day) when the image was captured.
    • camera_name: Abbreviated name of the rover's camera (e.g., "FHAZ" for Front Hazard Camera).
    • camera_full_name: Full descriptive name of the camera.
    • img_src: URL link to the image. Users can download images using this link.
    • earth_date: The Earth date corresponding to the Martian sol.
    • rover_name: Name of the rover that captured the image (e.g., "Curiosity").
    • rover_status: Current operational status of the rover (e.g., "Active" or "Complete").
    • landing_date: Date when the rover landed on Mars.
    • launch_date: Date when the rover was launched from Earth.

    Additional Details:

    • Data Types: Indicate whether a column contains numbers, text, or dates.
    • Data Format: Example: earth_date is in YYYY-MM-DD format.
    • Special Notes: If any column has missing values or requires preprocessing, mention it.
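For instance, the YYYY-MM-DD layout noted above parses directly with Python's standard library; the sample value is illustrative:

```python
from datetime import date

# earth_date values follow the YYYY-MM-DD layout documented above.
d = date.fromisoformat("2015-05-30")
print(d.year)  # 2015
```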

    This section helps users quickly understand the dataset's structure, making it easier for them to work with the data effectively.

  16. Open Data Portal Of The City Of Mendoza

    • catalog.civicdataecosystem.org
    Updated May 5, 2025
    Cite
    (2025). Open Data Portal Of The City Of Mendoza [Dataset]. https://catalog.civicdataecosystem.org/dataset/open-data-portal-of-the-city-of-mendoza
    Explore at:
    Dataset updated
    May 5, 2025
    Area covered
    Mendoza
    Description

    Learn the step-by-step process for downloading the open data of the City of Mendoza. To access and download the open data, you do not need to register or create a user account. Access to the repository is free, and all datasets can be downloaded free of charge and without restrictions. The homepage has access buttons for 14 data categories and a search engine where you can directly enter the topic you want to access. Each data category corresponds to a section of the platform where you will find the various datasets available, grouped by theme.

    As an example, if we enter the Security section, we find different datasets within. Once you enter a dataset, you will find a list of resources; each resource is a file that contains the data. For example, the dataset Security Dependencies includes specific information about each of the dependencies and lets you access the published information in different formats and download it. In this case, if you want to open the file with Excel, you must click the download button of the second resource, which specifies that the format is CSV. Other sections likewise contain datasets in various formats, such as XLS and KMZ. Each dataset also includes a metadata sheet where you can see the last update date, the update frequency, and which government area generates the information, among other things.

    Translated from the Spanish original.
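Once a CSV resource has been downloaded from the portal (via the download button described above, or programmatically with urllib/requests), it can be parsed with Python's standard csv module. A minimal sketch; the column names and sample rows below are invented for illustration and are not taken from the actual Security Dependencies dataset:

```python
import csv
import io

def parse_csv_resource(text: str) -> list:
    """Parse the text of a downloaded CSV resource into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# Tiny invented stand-in for a downloaded "Security Dependencies" resource:
sample = ("nombre,direccion\n"
          "Comisaria 1,Calle San Martin 100\n"
          "Comisaria 2,Av. Colon 200\n")
rows = parse_csv_resource(sample)
```

Each row comes back as a dict keyed by the header line, which is convenient for the kind of per-record metadata these resources carry.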

  17. Cambridgeshire Insight Open Data inventory list

    • cloud.csiss.gmu.edu
    • data.europa.eu
    • +1more
    html, xml, xsd
    Updated Dec 30, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    United Kingdom (2019). Cambridgeshire Insight Open Data inventory list [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/cambridgeshire-insight-open-data-inventory-list
    Explore at:
    Available download formats: html, xsd, xml
    Dataset updated
    Dec 30, 2019
    Dataset provided by
    United Kingdom
    License

    http://reference.data.gov.uk/id/open-government-licence

    Area covered
    Cambridgeshire
    Description

    The Dataset Inventory is a list of all the electronic data held by the Cambridgeshire Insight: Open Data website (http://opendata.cambridgeshireinsight.org.uk).

    It contains data from a variety of information themes and organisations.

    Dataset

    Sets of records, however structured, and not just datasets according to standard database terminology.

    Resource

    The resource(s) for the dataset. Here they are either: "Data" (files or feeds/streams of data records available) or "Document" (files or web pages describing datasets).

    Rendition

    The format(s) in which a resource is available; a single resource can have multiple renditions. For data, renditions could be, for example, CSV, XML, or PDF, and they may represent physical files or API calls. For a document, they could be ODF, PDF, or HTML.
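The Dataset / Resource / Rendition terminology above can be sketched as plain data classes. This is a hypothetical illustration of the three-level model, not an API of the Cambridgeshire Insight site:

```python
from dataclasses import dataclass, field

@dataclass
class Rendition:
    fmt: str  # e.g. "csv", "xml", "pdf", "html"

@dataclass
class Resource:
    kind: str  # "Data" or "Document", per the definitions above
    renditions: list = field(default_factory=list)

@dataclass
class Dataset:
    title: str
    resources: list = field(default_factory=list)

# One inventory entry: a data resource available in two renditions.
entry = Dataset(
    title="Example dataset",
    resources=[Resource(kind="Data",
                        renditions=[Rendition("csv"), Rendition("xml")])],
)
formats = [r.fmt for res in entry.resources for r in res.renditions]
```

The nesting mirrors the inventory's structure: a dataset holds resources, and each resource lists the renditions it is published in.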

  18. Windows Malware Detection Dataset

    • figshare.com
    txt
    Updated Mar 15, 2023
    Cite
    Irfan Yousuf (2023). Windows Malware Detection Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21608262.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 15, 2023
    Dataset provided by
    figshare
    Authors
    Irfan Yousuf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of Windows Portable Executable samples with four feature sets, one CSV file per feature set:

    1. DLLs_Imported.csv: the DLLs imported by each malware family. The first column contains SHA256 hash values, the second column the label (family type) of the malware, and the remaining columns list the names of the imported DLLs.
    2. API_Functions.csv: the API functions called by each sample, along with their SHA256 hash values and labels.
    3. PE_Header.csv: values of 52 fields of the PE header. All fields are labelled in the CSV file.
    4. PE_Section.csv: 9 field values for 10 different PE sections. All fields are labelled in the CSV file.

    Malware Type / family Labels:

    0 = Benign, 1 = RedLineStealer, 2 = Downloader, 3 = RAT, 4 = BankingTrojan, 5 = SnakeKeyLogger, 6 = Spyware
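A hedged sketch of reading the first feature set's layout (SHA256, then the numeric label, then a variable number of DLL names per row). The sample rows, and the assumption that the file starts with a header row, are ours; check them against the actual DLLs_Imported.csv:

```python
import csv
import io

# Label mapping as documented for this dataset:
LABELS = {0: "Benign", 1: "RedLineStealer", 2: "Downloader", 3: "RAT",
          4: "BankingTrojan", 5: "SnakeKeyLogger", 6: "Spyware"}

# Invented sample standing in for DLLs_Imported.csv (header row assumed):
sample = ("sha256,label,dll1,dll2\n"
          "abc123,1,kernel32.dll,ws2_32.dll\n"
          "def456,0,kernel32.dll,\n")

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the assumed header row
records = []
for sha256, label, *dlls in reader:
    # Rows can have differing numbers of DLL columns; drop empty trailing cells.
    records.append((sha256, LABELS[int(label)], [d for d in dlls if d]))
```

Star-unpacking (`*dlls`) handles the variable-width tail of DLL names, which a fixed-column DictReader would not.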

  19. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 17, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present MNIST4OD, a dataset of large size (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/). We build MNIST4OD as follows: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample with uniform probability from the remaining images as outliers, such that their number equals 10% of the inliers. We repeat this generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors. Each file MNIST_x.csv.gz contains the dataset whose inlier class is x. The data contains one instance (vector) per line; the last column is the outlier label (yes/no) of the data point, and another column indicates the original image class (0-9). The statistics of each dataset are (Name | Instances | Dimensions | Outliers in %):

    MNIST_0 | 7594 | 784 | 10
    MNIST_1 | 8665 | 784 | 10
    MNIST_2 | 7689 | 784 | 10
    MNIST_3 | 7856 | 784 | 10
    MNIST_4 | 7507 | 784 | 10
    MNIST_5 | 6945 | 784 | 10
    MNIST_6 | 7564 | 784 | 10
    MNIST_7 | 8023 | 784 | 10
    MNIST_8 | 7508 | 784 | 10
    MNIST_9 | 7654 | 784 | 10
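A minimal sketch of loading one MNIST_x.csv.gz file with the standard library. We assume each row holds the 784 flattened pixel values followed by the original digit class and then the outlier label as the last column; that ordering is an assumption and should be verified against the actual files. The synthetic sample uses 4 "pixels" instead of 784 to stay small:

```python
import csv
import gzip
import io

def load_mnist4od(raw_gz: bytes):
    """Yield (pixels, original_class, is_outlier) tuples, one per row.

    Assumed row layout: pixel values, then original class, then yes/no label.
    """
    with gzip.open(io.BytesIO(raw_gz), mode="rt") as f:
        for row in csv.reader(f):
            *pixels, orig_class, label = row
            yield ([float(p) for p in pixels],
                   int(float(orig_class)),
                   label.strip().lower() == "yes")

# Synthetic stand-in for a gzipped CSV (4 pixels per row instead of 784):
raw = gzip.compress(b"0.0,0.1,0.2,0.3,7,no\n0.9,0.8,0.7,0.6,3,yes\n")
rows = list(load_mnist4od(raw))
```

Reading in text mode ("rt") lets csv.reader consume the decompressed stream directly, without materialising the whole file.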

  20. Aggregated Data: Environmental Monitoring and Observations Effort 2010-2023

    • data.csiro.au
    • researchdata.edu.au
    Updated Sep 26, 2023
    + more versions
    Cite
    Donald Hobern; Shandiya Balasubramaniam (2023). Aggregated Data: Environmental Monitoring and Observations Effort 2010-2023 [Dataset]. http://doi.org/10.25919/2y9j-jk11
    Explore at:
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Donald Hobern; Shandiya Balasubramaniam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2010 - May 31, 2023
    Area covered
    Dataset funded by
    TERN
    Atlas of Living Australia
    IMOS
    Australian Research Data Commons
    Description

    Aggregated metadata on environmental monitoring and observing activities from three Australian national research infrastructures (NRIs): biodiversity survey events from the Atlas of Living Australia (ALA, https://ala.org.au/), marine observations collected by the Integrated Marine Observing System (IMOS, https://imos.org.au/) and site-based monitoring and survey efforts by the Terrestrial Ecosystem Research Network (TERN, https://tern.org.au/). This dataset provides a summary breakdown of these efforts by survey topic, region and time period from 2010 to the present.

    Survey topics are mapped to an EcoAssets Earth Science Features vocabulary based on the Earth Science keywords from the Global Change Master Directory (GCMD, https://gcmd.earthdata.nasa.gov/KeywordViewer/) vocabulary, modified to use taxonomic concept URIs from the Australian National Species List (ANSL, https://biodiversity.org.au/nsl/) in place of the GCMD Earth Science > Biological Classification vocabulary. ANSL categories map more readily to biodiversity survey categories, since GCMD depends on a top-level division between vertebrates and invertebrates rather than offering an animal category. The EcoAssets Earth Science Features vocabulary, including alternative keywords used in ALA, IMOS or TERN datasets, is included in this collection.

    The primary asset is AggregatedData_EnvironmentalMonitoringAndObservationsEffort_1.1.2023-06-13.csv. This contains all faceted data records for the period and supported facets related to time, space and features observed.

    Two derived assets (SummaryData_MonitoringAndObservationsEffortByMarineEcoregion_1.1.2023-06-13.csv, SummaryData_MonitoringAndObservationsEffortByTerrestrialEcoregion_1.1.2023-06-13.csv) further summarise the faceted data. Each is a pivot of the aggregated dataset.

    Vocabulary_EcoAssetsEarthScienceFeatures_1.0.2023-03-23.csv contains the hierarchical terms used within this asset to categorise earth science features. TreeView_EcoAssetsEarthScienceFeatures_1.0.2023-03-23.txt provides a simpler, more readable view. KeywordMapping_EcoAssetsEarthScienceFeatures_1.0.2023-03-23.csv shows the mappings between these terms and the keywords used in source datasets.

    The data-sources.csv file includes information on the source datasets that contributed to this asset. README.txt documents the columns in each data file.

    Lineage: This dataset was created by the following pipeline:

    1. Metadata records were collected from the TERN linked data portal (https://linkeddata.tern.org.au/) for all TERN monitoring sites and survey activities. Feature terms follow the TERN Feature Type vocabulary, mapped to the EcoAssets Earth Science Features vocabulary. For features that have been measured continuously at the site, metadata records were created for each relevant year since commission of the site. For other sites and features, metadata records were generated only for years in which the site was visited. TERN metadata records are associated with site coordinates.

    2. Metadata records were harvested for datasets in the Australian Ocean Data Network (AODN, https://portal.aodn.org.au/) portal maintained by IMOS (iso19115-3.2018 format over OAI-PMH). Feature terms follow the GCMD keywords used in these metadata records. Metadata records were created for each year overlapping the data collection period for each dataset. Where the datasets were associated with a bounding box, records were created for each IMCRA region intersecting the bounding box.

    3. Metadata records were created for each biodiversity sample event published to the ALA and associated with a Darwin Core event ID and a named sampling protocol (see https://dwc.tdwg.org/terms/#event). Events were excluded if the set of sampled taxa included multiple kingdoms OR the sampling protocol was associated with <50 samples OR no sample included >1 species. The remaining samples were mapped to feature terms based on the taxonomic scope of all species recorded for the associated protocol. Year and coordinates were taken from the associated sample event.

    4. Metadata records from all sources were combined and include the following values. The feature facet values are offered as a convenience for grouping records without using the hierarchical structure of the EcoAssets Earth Science Features vocabulary:

    • Source National Research Institute (NRI – one of ALA, IMOS, TERN)
    • Dataset name (site name for TERN)
    • Dataset URI (site URI for TERN)
    • Original keyword from NRI (TERN feature type, IMOS GCMD keyword, ALA taxon)
    • Decimal latitude (where appropriate)
    • Decimal longitude (where appropriate)
    • Year
    • State or Territory
    • IBRA7 terrestrial region
    • IMCRA 4.0 mesoscale marine bioregion
    • Feature ID from EcoAssets Earth Science Features vocabulary
    • Feature name associated with feature ID
    • Feature facet 1 – high-level facet based on feature ID – a top-level GCMD Earth Science category (6 terms)
    • Feature facet 2 – intermediate-level facet based on feature ID – second-level GCMD/ANSL category (29 terms)
    • Feature facet 3 – lower-level facet with more fine-grained taxonomic structure based on feature ID – typically a third-level GCMD/ANSL category (36 terms)
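The faceted records listed above lend themselves to simple group-and-count summaries. A hypothetical illustration with collections.Counter; the record dicts mirror a few of the listed columns (source NRI, year, top-level feature facet, IBRA7 region), but the values are invented for the example:

```python
from collections import Counter

# Invented records shaped like the aggregated-effort rows described above:
records = [
    {"nri": "ALA",  "year": 2018, "facet1": "Biosphere", "ibra7": "Sydney Basin"},
    {"nri": "TERN", "year": 2018, "facet1": "Biosphere", "ibra7": "Sydney Basin"},
    {"nri": "IMOS", "year": 2020, "facet1": "Oceans",    "ibra7": None},
]

# Count monitoring/observation effort per (source NRI, top-level facet):
effort = Counter((r["nri"], r["facet1"]) for r in records)
```

The same pattern, keyed on region or year instead, reproduces the kind of pivots the two derived summary assets provide.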
Residential School Locations Dataset (CSV Format)

4 scholarly articles cite this dataset.
If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.
