100+ datasets found
  1. Residential School Locations Dataset (CSV Format)

    • search.dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Orlandini, Rosa (2023). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Orlandini, Rosa
    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels listed in the Indian Residential Schools Settlement Agreement are included in this dataset, as well as several industrial schools and residential schools that were not part of the IRSSA. This version of the dataset does not include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017. The dataset was created by Rosa Orlandini and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada, produced for the Tk'emlups First Nation and the Justice for Day Scholars Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed to either RSIM, Morgan Hite, the NCTR, or Rosa Orlandini.

    Many schools/hostels had several locations throughout the history of the institution. If a school/hostel moved from its original location to another property, the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it is not considered a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided; when the precise location of the building is not known, an approximate location is provided.

    For each residential school institution location, the following information is provided: official names, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.

  2. Dataset metadata of known Dataverse installations, August 2023

    • dataverse.harvard.edu
    Updated Aug 30, 2024
    + more versions
    Cite
    Julian Gautier (2024). Dataset metadata of known Dataverse installations, August 2023 [Dataset]. http://doi.org/10.7910/DVN/8FEGUV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 30, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Julian Gautier
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains the metadata of the datasets published in 85 Dataverse installations and information about each installation's metadata blocks. It also includes the lists of pre-defined licenses or terms of use that dataset depositors can apply to the datasets they publish in the 58 installations that were running versions of the Dataverse software that include that feature. The data is useful for reporting on the quality of dataset- and file-level metadata within and across Dataverse installations, and for improving understanding of how certain Dataverse features and metadata fields are used. Curators and other researchers can use this dataset to explore how well the Dataverse software, and the repositories using it, help depositors describe data.

    How the metadata was downloaded

    The dataset metadata and metadata block JSON files were downloaded from each installation between August 22 and August 28, 2023 using a Python script kept in a GitHub repo at https://github.com/jggautier/dataverse-scripts/blob/main/other_scripts/get_dataset_metadata_of_all_installations.py. To get the metadata from installations that require an installation account API token to use certain Dataverse software APIs, I created a CSV file with two columns: one named "hostname" listing each installation URL at which I was able to create an account, and another named "apikey" listing my accounts' API tokens. The Python script expects the CSV file and uses the listed API tokens to get metadata and other information from installations that require them.

    How the files are organized

    ├── csv_files_with_metadata_from_most_known_dataverse_installations
    │   ├── author(citation)_2023.08.22-2023.08.28.csv
    │   ├── contributor(citation)_2023.08.22-2023.08.28.csv
    │   ├── data_source(citation)_2023.08.22-2023.08.28.csv
    │   ├── ...
    │   └── topic_classification(citation)_2023.08.22-2023.08.28.csv
    ├── dataverse_json_metadata_from_each_known_dataverse_installation
    │   ├── Abacus_2023.08.27_12.59.59.zip
    │   ├── dataset_pids_Abacus_2023.08.27_12.59.59.csv
    │   ├── Dataverse_JSON_metadata_2023.08.27_12.59.59
    │   ├── hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json
    │   ├── ...
    │   ├── metadatablocks_v5.6
    │   ├── astrophysics_v5.6.json
    │   ├── biomedical_v5.6.json
    │   ├── citation_v5.6.json
    │   ├── ...
    │   ├── socialscience_v5.6.json
    │   ├── ACSS_Dataverse_2023.08.26_22.14.04.zip
    │   ├── ADA_Dataverse_2023.08.27_13.16.20.zip
    │   ├── Arca_Dados_2023.08.27_13.34.09.zip
    │   ├── ...
    │   └── World_Agroforestry_-_Research_Data_Repository_2023.08.27_19.24.15.zip
    ├── dataverse_installations_summary_2023.08.28.csv
    ├── dataset_pids_from_most_known_dataverse_installations_2023.08.csv
    ├── license_options_for_each_dataverse_installation_2023.09.05.csv
    └── metadatablocks_from_most_known_dataverse_installations_2023.09.05.csv

    This dataset contains two directories and four CSV files not in a directory. One directory, "csv_files_with_metadata_from_most_known_dataverse_installations", contains 20 CSV files that list the values of many of the metadata fields in the citation metadata block and geospatial metadata block of datasets in the 85 Dataverse installations. For example, author(citation)_2023.08.22-2023.08.28.csv contains the "Author" metadata for the latest versions of all published, non-deaccessioned datasets in the 85 installations, with a row for each author's name, affiliation, identifier type and identifier. The other directory, "dataverse_json_metadata_from_each_known_dataverse_installation", contains 85 zipped files, one for each of the 85 Dataverse installations whose dataset metadata I was able to download.
Each zip file contains a CSV file and two sub-directories: The CSV file contains the persistent IDs and URLs of each published dataset in the Dataverse installation as well as a column to indicate if the Python script was able to download the Dataverse JSON metadata for each dataset. It also includes the alias/identifier and category of the Dataverse collection that the dataset is in. One sub-directory contains a JSON file for each of the installation's published, non-deaccessioned dataset versions. The JSON files contain the metadata in the "Dataverse JSON" metadata schema. The Dataverse JSON export of the latest version of each dataset includes "(latest_version)" in the file name. This should help those who are interested in the metadata of only the latest version of each dataset. The other sub-directory contains information about the metadata models (the "metadata blocks" in JSON files) that the installation was using when the dataset metadata was downloaded. I included them so that they can be used when extracting metadata from the dataset's Dataverse JSON exports. The dataverse_installations_summary_2023.08.28.csv file contains information about each installation, including its name, URL, Dataverse software version, and counts of dataset metadata...
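
    Since the latest-version exports are the ones marked "(latest_version)" in the file name, they can be picked out of each sub-directory with a simple filename filter. A minimal sketch (the directory path and the second file name are illustrative):

```python
from pathlib import Path

def latest_version_exports(metadata_dir):
    """Keep only the Dataverse JSON exports of each dataset's latest
    version, identified by the "(latest_version)" filename suffix."""
    return sorted(
        p.name for p in Path(metadata_dir).glob("*.json")
        if p.stem.endswith("(latest_version)")
    )

# The same filter applied to a list of example file names
# (the v0.9 entry is invented for illustration):
names = [
    "hdl_11272.1_AB2_0AQZNT_v1.0(latest_version).json",
    "hdl_11272.1_AB2_0AQZNT_v0.9.json",
]
latest = [n for n in names if n.endswith("(latest_version).json")]
```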

  3. Best Books Ever Dataset

    • zenodo.org
    csv
    Updated Nov 10, 2020
    Cite
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells (2020). Best Books Ever Dataset [Dataset]. http://doi.org/10.5281/zenodo.4265096
    Explore at:
    csv (available download formats)
    Dataset updated
    Nov 10, 2020
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Lorena Casanova Lozano; Sergio Costa Planells; Lorena Casanova Lozano; Sergio Costa Planells
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset was collected as part of Prac1 of the course Typology and Data Life Cycle in the Master's Degree in Data Science at the Universitat Oberta de Catalunya (UOC).

    The dataset contains 25 variables and 52478 records corresponding to books on the GoodReads Best Books Ever list (the largest list on the site).

    The original code used to retrieve the dataset can be found in the GitHub repository: github.com/scostap/goodreads_bbe_dataset

    The data was retrieved in two sets: the first 30000 books and then the remaining 22478. Dates were not parsed and reformatted for the second chunk, so publishDate and firstPublishDate are represented in mm/dd/yyyy format for the first 30000 records and in "Month Day Year" format for the rest.
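
    Given the split date formats, a loader has to try both. A sketch, assuming the second chunk spells dates with full month names (e.g. "November 10 2020"); adjust the format strings if the actual values differ:

```python
from datetime import datetime

def parse_publish_date(raw: str) -> datetime:
    """Parse publishDate/firstPublishDate values that appear either as
    mm/dd/yyyy (first 30000 records) or as 'Month Day Year' (the rest)."""
    for fmt in ("%m/%d/%Y", "%B %d %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {raw!r}")

parse_publish_date("11/10/2020")        # first chunk style
parse_publish_date("November 10 2020")  # second chunk style
```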

    Book cover images can optionally be downloaded from the URL in the 'coverImg' field. Python code for doing so, along with an example, can be found in the GitHub repo.

    The 25 fields of the dataset are:

    | Attribute | Definition | Completeness (%) |
    | ------------- | ------------- | ------------- | 
    | bookId | Book Identifier as in goodreads.com | 100 |
    | title | Book title | 100 |
    | series | Series Name | 45 |
    | author | Book's Author | 100 |
    | rating | Global goodreads rating | 100 |
    | description | Book's description | 97 |
    | language | Book's language | 93 |
    | isbn | Book's ISBN | 92 |
    | genres | Book's genres | 91 |
    | characters | Main characters | 26 |
    | bookFormat | Type of binding | 97 |
    | edition | Type of edition (e.g., Anniversary Edition) | 9 |
    | pages | Number of pages | 96 |
    | publisher | Publishing house | 93 |
    | publishDate | publication date | 98 |
    | firstPublishDate | Publication date of first edition | 59 |
    | awards | List of awards | 20 |
    | numRatings | Number of total ratings | 100 |
    | ratingsByStars | Number of ratings by stars | 97 |
    | likedPercent | Derived field, percent of ratings over 2 stars (as in GoodReads) | 99 |
    | setting | Story setting | 22 |
    | coverImg | URL to cover image | 99 |
    | bbeScore | Score in Best Books Ever list | 100 |
    | bbeVotes | Number of votes in Best Books Ever list | 100 |
    | price | Book's price (extracted from Iberlibro) | 73 |
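
    The completeness column can be recomputed from the CSV as the percentage of non-empty values per field. A standard-library sketch over made-up rows:

```python
import csv, io

def completeness(csv_text: str) -> dict:
    """Percentage of non-empty values per column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    n = len(rows)
    return {
        col: round(100 * sum(1 for r in rows if r[col].strip()) / n)
        for col in rows[0]
    }

# Toy sample: 'series' is filled for 2 of 4 books, matching its low completeness.
sample = "bookId,series\n1,The Hunger Games\n2,\n3,\n4,Harry Potter\n"
completeness(sample)  # {'bookId': 100, 'series': 50}
```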

  4. Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 1, 2023
    Cite
    Vansummeren, Stijn (2023). Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8098908
    Dataset updated
    Jul 1, 2023
    Dataset provided by
    Vansummeren, Stijn
    Weytjens, Sebastiaan
    Peeters, Liesbet M.
    Parciak, Marcel
    Hens, Niel
    Neven, Frank
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated Benchmark of Real-World Data for Approximate Functional Dependency Discovery

    This collection consists of ten open access relations commonly used by the data management community. In addition to the relations themselves (please take note of the references to the original sources below), we added three lists in this collection that describe approximate functional dependencies found in the relations. These lists are the result of a manual annotation process performed by two independent individuals by consulting the respective schemas of the relations and identifying column combinations where one column implies another based on its semantics. As an example, in the claims.csv file, the AirportCode implies AirportName, as each code should be unique for a given airport.

    The file ground_truth.csv is a comma-separated file containing the approximate functional dependencies. The column table names the relation referred to, while lhs and rhs reference two columns of that relation where, semantically, we found that lhs implies rhs.

    The files excluded_candidates.csv and included_candidates.csv list all column combinations that were excluded from or included in the manual annotation, respectively. We excluded a candidate if there was no tuple in which both attributes had a value, or if the g3_prime value was too small.
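
    For intuition, the classical g3 error of an approximate FD (the minimum fraction of tuples that must be removed for the FD to hold exactly) can be computed as below; note that the benchmark's g3_prime value may be normalised differently:

```python
from collections import Counter, defaultdict

def g3(rows, lhs, rhs):
    """Classical g3 error of the FD lhs -> rhs: 1 minus the fraction of
    tuples kept when, within each lhs group, only the most common rhs
    value is retained."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return 1 - kept / len(rows)

# Toy rows modelled on the claims.csv example: AirportCode -> AirportName.
rows = [
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy"},
    {"AirportCode": "JFK", "AirportName": "John F. Kennedy"},
    {"AirportCode": "JFK", "AirportName": "Kennedy Intl"},  # one violating tuple
    {"AirportCode": "LAX", "AirportName": "Los Angeles"},
]
g3(rows, "AirportCode", "AirportName")  # 0.25
```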

    Dataset References

    adult.csv: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

    claims.csv: TSA Claims Data 2002 to 2006, published by the U.S. Department of Homeland Security.

    dblp10k.csv: Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248. Made available as DBLP Dataset 2.

    hospital.csv: Hospital dataset used in Johann Birnick, Thomas Bläsius, Tobias Friedrich, Felix Naumann, Thorsten Papenbrock, and Martin Schirneck. 2020. Hitting set enumeration with partial information for unique column combination discovery. Proc. VLDB Endow. 13, 12 (August 2020), 2270–2283. https://doi.org/10.14778/3407790.3407824. Made available as part of the dataset collection accompanying that paper.

    t_biocase_... files: t_bioc_... files used in the same paper (Birnick et al. 2020, see above). Made available as part of the dataset collection accompanying that paper.

    tax.csv: Tax dataset used in the same paper (Birnick et al. 2020, see above). Made available as part of the dataset collection accompanying that paper.

  5. Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims"

    • zenodo.org
    • data.niaid.nih.gov
    Updated Apr 22, 2022
    Cite
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova (2022). Dataset for the paper: "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" [Dataset]. http://doi.org/10.5281/zenodo.5996864
    Dataset updated
    Apr 22, 2022
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Ivan Srba; Ivan Srba; Branislav Pecher; Branislav Pecher; Matus Tomlein; Matus Tomlein; Robert Moro; Robert Moro; Elena Stefancova; Elena Stefancova; Jakub Simko; Jakub Simko; Maria Bielikova; Maria Bielikova
    Description

    Overview

    This dataset of medical misinformation was collected and is published by Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full-texts of the articles, their original source URL and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).

    The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.

    Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full-texts of the articles, there are also images and videos), thus enabling research on multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.

    The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).

    The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.

    Options to access the dataset

    There are two ways to access the dataset:

    1. Static dump of the dataset available in the CSV format
    2. Continuously updated dataset available via REST API

    To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.

    References

    If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:

    @inproceedings{SrbaMonantPlatform,
      author = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
      booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
      pages = {1--7},
      title = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
      year = {2019}
    }
    @inproceedings{SrbaMonantMedicalDataset,
      author = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
      booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
      numpages = {11},
      title = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
      year = {2022},
      doi = {10.1145/3477495.3531726},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3477495.3531726},
    }
    


    Dataset creation process

    To create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blogs from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, Wordpress sites, Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data is stored in a unified format in a central data storage.


    Ethical considerations

    The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.

    The main identified ethical issue related to the presented dataset lies in the risk of mislabelling of an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require an agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.

    As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.

    Lastly, the dataset also contains automatically predicted labels of claim presence and article stance using our baselines described in the next section. These methods have their limitations and work with certain accuracy as reported in this paper. This should be taken into account when interpreting them.


    Reporting mistakes in the dataset

    The way to report considerable mistakes in raw collected data or in manual annotations is to create a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.


    Dataset structure

    Raw data

    At first, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as it appears on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (numbers of likes, shares and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.

    Raw data are contained in these CSV files (and corresponding REST API endpoints):

    • sources.csv
    • articles.csv
    • article_media.csv
    • article_authors.csv
    • discussion_posts.csv
    • discussion_post_authors.csv
    • fact_checking_articles.csv
    • fact_checking_article_media.csv
    • claims.csv
    • feedback_facebook.csv

    Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.


    Annotations

    Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., article, source). Relation annotations describe a relation between two such entities.

    Each annotation is described by the following attributes:

    1. category of annotation (`annotation_category`). Possible values: label (annotation corresponds to ground truth, determined by human experts) and prediction (annotation was created by means of an AI method).
    2. type of annotation (`annotation_type_id`). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
    3. method which created the annotation (`method_id`). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
    4. its value (`value`). The value is stored in JSON format and its structure differs according to the particular annotation type.

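    Putting the four attributes together, a single annotation record might be processed like this (the field values are illustrative; the real value structure differs per annotation type):

```python
import json

# A hypothetical entity-annotation record with the four attributes above.
annotation = {
    "annotation_category": "label",  # "label" or "prediction"
    "annotation_type_id": 1,         # see annotation_types.csv
    "method_id": 2,                  # see methods.csv
    "value": '{"value": true}',      # JSON; structure depends on the type
}

value = json.loads(annotation["value"])          # decode the JSON payload
is_ground_truth = annotation["annotation_category"] == "label"
```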

    At the same time, annotations are associated with a particular object identified by:

    1. entity type (parameter entity_type in case of entity annotations, or source_entity_type and target_entity_type in case of relation annotations). Possible values: sources, articles, fact-checking-articles.
    2. entity id (parameter entity_id in case of entity annotations, or source_entity_id and target_entity_id in case of relation

  6. Warehouse and Retail Sales

    • catalog.data.gov
    • data.montgomerycountymd.gov
    • +4more
    Updated Jul 5, 2025
    Cite
    data.montgomerycountymd.gov (2025). Warehouse and Retail Sales [Dataset]. https://catalog.data.gov/dataset/warehouse-and-retail-sales
    Dataset updated
    Jul 5, 2025
    Dataset provided by
    data.montgomerycountymd.gov
    Description

    This dataset contains a list of sales and movement data by item and department, appended monthly. Update frequency: monthly.

  7. ISO 639-1 Language Codes

    • kaggle.com
    Updated May 7, 2023
    Cite
    Mahesh Jadhav (2023). ISO 639-1 Language Codes [Dataset]. https://www.kaggle.com/datasets/ursmaheshj/iso-639-1-language-codes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 7, 2023
    Dataset provided by
    Kaggle
    Authors
    Mahesh Jadhav
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    ISO 639-1 is a standard for language codes that assigns a two-letter code to represent a language. This code is used to identify languages in computer systems, websites, and other applications that require language tagging. The codes are based on the English names of languages, and each language is assigned a unique code that consists of two letters. For example, "en" represents English, "fr" represents French, and "es" represents Spanish. ISO 639-1 language codes are commonly used in multilingual environments to ensure consistent and accurate representation of languages across different systems and platforms.

    You can easily use this dataset to replace the ISO 639-1 codes in your own data with the appropriate language names.
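
    Used as a lookup table, that replacement is a one-liner per record. A sketch with a hand-coded subset of the mapping (the full dataset provides all codes):

```python
# A small hand-coded subset of ISO 639-1; the real dataset covers all codes.
iso_639_1 = {"en": "English", "fr": "French", "es": "Spanish"}

# Illustrative records carrying language codes.
records = [{"title": "A", "language": "en"}, {"title": "B", "language": "fr"}]
for rec in records:
    # Replace the code with the language name; keep unknown codes as-is.
    rec["language"] = iso_639_1.get(rec["language"], rec["language"])
```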

    This data is collected from the sites below.

  8. Bulk data files for all years – releases, disposals, transfers and facility locations

    • open.canada.ca
    • ouvert.canada.ca
    csv, html
    Updated Jul 15, 2025
    + more versions
    Cite
    Environment and Climate Change Canada (2025). Bulk data files for all years – releases, disposals, transfers and facility locations [Dataset]. https://open.canada.ca/data/en/dataset/40e01423-7728-429c-ac9d-2954385ccdfb
    Explore at:
    csv, html (available download formats)
    Dataset updated
    Jul 15, 2025
    Dataset provided by
    Environment and Climate Change Canada: https://www.canada.ca/en/environment-climate-change.html
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Time period covered
    Jan 1, 1993 - Dec 31, 2023
    Description

    The National Pollutant Release Inventory (NPRI) is Canada's public inventory of pollutant releases (to air, water and land), disposals and transfers for recycling. Each file contains data from 1993 to the latest reporting year. These CSV format datasets are in normalized or 'list' format and are optimized for pivot table analyses. Here is a description of each file:

    • The RELEASES file contains all substance release quantities.
    • The DISPOSALS file contains all on-site and off-site disposal quantities, including tailings and waste rock (TWR).
    • The TRANSFERS file contains all quantities transferred for recycling or treatment prior to disposal.
    • The COMMENTS file contains all the comments provided by facilities about substances included in their report.
    • The GEO LOCATIONS file contains complete geographic information for all facilities that have reported to the NPRI.

    Please consult the following resources to enhance your analysis:

    • Guide on using and interpreting NPRI data: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/using-interpreting-data.html
    • Access additional data from the NPRI, including datasets and mapping products: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/exploredata.html

    Supplemental information: more NPRI datasets and mapping products are available here: https://www.canada.ca/en/environment-climate-change/services/national-pollutant-release-inventory/tools-resources-data/access.html Supporting projects: National Pollutant Release Inventory (NPRI)
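
    Because the files are normalized 'list' data, a pivot-table-style summary reduces to a group-and-sum. A sketch over made-up rows (actual NPRI column names may differ):

```python
from collections import defaultdict

# Hypothetical normalized rows as they might appear in the RELEASES file.
release_rows = [
    {"year": 2022, "substance": "Lead", "quantity": 1.5},
    {"year": 2022, "substance": "Lead", "quantity": 0.5},
    {"year": 2023, "substance": "Lead", "quantity": 2.0},
]

# Pivot: total released quantity per (year, substance).
totals = defaultdict(float)
for r in release_rows:
    totals[(r["year"], r["substance"])] += r["quantity"]
# totals[(2022, "Lead")] == 2.0
```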

  9. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that happened over a period of time. The retailer will use the results to grow its business and to make itemset suggestions to customers, so that it can increase customer engagement, improve the customer experience, and identify customer behaviour. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rules are most useful when you are planning to find associations between different objects in a set, i.e. frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule bought mouse mat => bought computer mouse:

    • support = P(mouse & mat) = 8/100 = 0.08
    • confidence = support / P(mouse mat) = 0.08/0.09 ≈ 0.89
    • lift = confidence / P(computer mouse) = 0.89/0.10 = 8.9

    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
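The same numbers can be checked in a few lines; a minimal sketch (the helper name is illustrative, not from the original notebook):

```python
# Hedged sketch: support, confidence and lift for the worked example above
# (100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both).

def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Metrics for the rule antecedent => consequent."""
    support = n_both / n_total
    confidence = support / (n_antecedent / n_total)
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift

# Rule: bought mouse mat => bought computer mouse
support, confidence, lift = rule_metrics(100, 9, 10, 8)
print(round(support, 2), round(confidence, 2), round(lift, 1))  # 0.08 0.89 8.9
```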

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule
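As a minimal illustration of this strategy (not the notebook's actual arules code), the core of the association-rules step can be sketched in pure Python on a toy transaction list:

```python
from itertools import combinations
from collections import Counter

# Toy transactions standing in for the retail invoices (illustrative data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def frequent_itemsets(transactions, min_support=0.4, max_size=2):
    """Count itemsets of size 1..max_size and keep those above min_support."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for items in combinations(sorted(t), size):
                counts[items] += 1
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

freq = frequent_itemsets(transactions)

# Derive rules A => B from the frequent 2-itemsets.
rules = []
for items, supp in freq.items():
    if len(items) == 2:
        a, b = items
        conf = supp / freq[(a,)]  # confidence = support(A,B) / support(A)
        rules.append((a, b, supp, conf))

print(len(rules))
```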

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522,065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load required libraries. Shortly I describe all libraries.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next we will clean our data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply association rule mining, we need to convert the dataframe into transaction data so that all items that are bought together in one invoice will be in ...

  10. Film Circulation dataset

    • zenodo.org
    • data.niaid.nih.gov
    bin, csv, png
    Updated Jul 12, 2024
    Cite
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
    Explore at:
    Available download formats: csv, png, bin
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

    A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability - and will be available from https://necsus-ejms.org/ and https://mediarep.org

    Please cite this when using the dataset.


    Detailed description of the dataset:

    1 Film Dataset: Festival Programs

    The Film Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

    The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

    The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
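The first-festival rule described above (the wide file keeps the first sample festival per unique film) can be illustrated in a few lines; a minimal Python sketch with made-up rows and column names, not the dataset's exact headers:

```python
# Collapse a long-format table to wide format, keeping the first festival
# appearance per unique film ID (illustrative rows and column names).
long_rows = [
    {"film_id": 1, "title": "Film A", "festival": "Berlinale"},
    {"film_id": 1, "title": "Film A", "festival": "Frameline"},
    {"film_id": 2, "title": "Film B", "festival": "Sundance"},
]

wide = {}
for row in long_rows:
    wide.setdefault(row["film_id"], row)  # first occurrence wins

print(len(wide), wide[1]["festival"])  # 2 Berlinale
```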


    2 Survey Dataset

    The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

    The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

    The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

    The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.


    3 IMDb & Scripts

    The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

    The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

    The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

    The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

    The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

    The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

    The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

    The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

    The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

    The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

    The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

    The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

    The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of suggested records on the IMDb website. It then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done on directors, production year (+/- one year), and title, using a fuzzy matching approach with two methods, “cosine” and “osa”: cosine similarity matches titles with a high degree of similarity, while the OSA (optimal string alignment) algorithm matches titles that may have typos or minor variations.

    The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible doubles in the dataset and flags them for a manual check.

    The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

    The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, to check that everything works. Scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.

    The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

    The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure the errors were not caused by disruptions in the internet connection or other technical issues.

    The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the number of missing values and errors.


    4 Festival Library Dataset

    The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

    The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,

  11. THVD (Talking Head Video Dataset)

    • data.mendeley.com
    Updated Apr 29, 2025
    + more versions
    Cite
    Mario Peedor (2025). THVD (Talking Head Video Dataset) [Dataset]. http://doi.org/10.17632/ykhw8r7bfx.2
    Explore at:
    Dataset updated
    Apr 29, 2025
    Authors
    Mario Peedor
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    About

    We provide a comprehensive talking-head video dataset with over 50,000 videos, totaling more than 500 hours of footage and featuring 20,841 unique identities from around the world.

    Distribution

    Detailing the format, size, and structure of the dataset:

    Data Volume:

    • Total Size: 2.7 TB
    • Total Videos: 47,547
    • Identities Covered: 20,841
    • Resolution: 60% 4K (1980), 33% Full HD (1080)
    • Formats: MP4
    • Full-length videos with visible mouth movements in every frame.
    • Minimum face size of 400 pixels.
    • Video durations range from 20 seconds to 5 minutes.
    • Faces have not been cut out; full-screen videos including backgrounds.

    Usage

    This dataset is ideal for a variety of applications:

    Face Recognition & Verification: Training and benchmarking facial recognition models.

    Action Recognition: Identifying human activities and behaviors.

    Re-Identification (Re-ID): Tracking identities across different videos and environments.

    Deepfake Detection: Developing methods to detect manipulated videos.

    Generative AI: Training high-resolution video generation models.

    Lip Syncing Applications: Enhancing AI-driven lip-syncing models for dubbing and virtual avatars.

    Background AI Applications: Developing AI models for automated background replacement, segmentation, and enhancement.

    Coverage

    Explaining the scope and coverage of the dataset:

    Geographic Coverage: Worldwide

    Time Range: Time range and size of the videos have been noted in the CSV file.

    Demographics: Includes information about age, gender, ethnicity, format, resolution, and file size.

    Languages Covered (Videos):

    English: 23,038 videos

    Portuguese: 1,346 videos

    Spanish: 677 videos

    Norwegian: 1,266 videos

    Swedish: 1,056 videos

    Korean: 848 videos

    Polish: 1,807 videos

    Indonesian: 1,163 videos

    French: 1,102 videos

    German: 1,276 videos

    Japanese: 1,433 videos

    Dutch: 1,666 videos

    Indian: 1,163 videos

    Czech: 590 videos

    Chinese: 685 videos

    Italian: 975 videos

    Filipino: 920 videos

    Bulgarian: 340 videos

    Romanian: 1,144 videos

    Arabic: 1,691 videos

    Who Can Use It

    List examples of intended users and their use cases:

    Data Scientists: Training machine learning models for video-based AI applications.

    Researchers: Studying human behavior, facial analysis, or video AI advancements.

    Businesses: Developing facial recognition systems, video analytics, or AI-driven media applications.

    Additional Notes

    Ensure ethical usage and compliance with privacy regulations. The dataset's quality and scale make it valuable for high-performance AI training. Preprocessing (cropping, downsampling) may be needed for different use cases. The dataset is not yet complete and expands daily; please contact for the most up-to-date CSV file. The dataset has been divided into 100 GB zipped files and is hosted on a private server (with the option to upload to the cloud if needed). To verify the dataset's quality, please contact me for the full CSV file.

  12. 📊 Yahoo Answers 10 categories for NLP CSV

    • kaggle.com
    Updated Apr 7, 2023
    Cite
    Yassir Acharki (2023). 📊 Yahoo Answers 10 categories for NLP CSV [Dataset]. http://doi.org/10.34740/kaggle/dsv/5339321
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 7, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yassir Acharki
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    The Yahoo! Answers topic classification dataset is constructed using the 10 largest main categories. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and the total number of testing samples is 60,000. From all the answers and other meta-information, we only used the best answer content and the main category information.

    The file classes.txt contains a list of classes corresponding to each label.

    The files train.csv and test.csv contain all the training and testing samples as comma-separated values. There are 4 columns in them, corresponding to class index (1 to 10), question title, question content and best answer. The text fields are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
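Given that escaping scheme, the files can be read with a standard CSV parser and the literal backslash-n sequences unescaped afterwards; a minimal sketch with a made-up sample row:

```python
import csv
import io

# One made-up row in the documented format: class index, title, content, answer.
# Fields are double-quoted, internal quotes doubled, newlines stored as literal \n.
sample = '"5","What is CSV?","A text format.\\nOne record per line.","See RFC 4180."\n'

reader = csv.reader(io.StringIO(sample), quotechar='"', doublequote=True)
# Convert the escaped "\n" sequences back into real newlines in every field.
rows = [[field.replace("\\n", "\n") for field in row] for row in reader]

label, title, content, answer = rows[0]
print(label, title)  # 5 What is CSV?
```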

  13. MaRV Scripts and Dataset

    • zenodo.org
    zip
    Updated Dec 15, 2024
    Cite
    Anonymous; Anonymous (2024). MaRV Scripts and Dataset [Dataset]. http://doi.org/10.5281/zenodo.14450098
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MaRV dataset consists of 693 manually evaluated code pairs extracted from 126 GitHub Java repositories, covering four types of refactoring. The dataset also includes metadata describing the refactored elements. Each code pair was assessed by two reviewers selected from a pool of 40 participants. The MaRV dataset is continuously evolving and is supported by a web-based tool for evaluating refactoring representations. This dataset aims to enhance the accuracy and reliability of state-of-the-art models in refactoring tasks, such as refactoring candidate identification and code generation, by providing high-quality annotated data.

    Our dataset is located at the path dataset/MaRV.json

    The guidelines for replicating the study are provided below:

    Requirements

    1. Software Dependencies:

    • Python 3.10+ with packages in requirements.txt
    • Git: Required to clone repositories.
    • Java 17: RefactoringMiner requires Java 17 to perform the analysis.
    • PHP 8.0: Required to host the Web tool.
    • MySQL 8: Required to store the Web tool data.

    2. Environment Variables:

    • Create a .env file based on .env.example in the src folder and set the variables:
      • CSV_PATH: Path to the CSV file containing the list of repositories to be processed.
      • CLONE_DIR: Directory where repositories will be cloned.
      • JAVA_PATH: Path to the Java executable.
      • REFACTORING_MINER_PATH: Path to RefactoringMiner.
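A minimal .env sketch matching the variables above; all values are illustrative placeholders, not paths from the replication package:

```shell
# Illustrative .env values; adjust every path to your own machine.
CSV_PATH=./repositories.csv
CLONE_DIR=./cloned_repos
JAVA_PATH=/usr/bin/java
REFACTORING_MINER_PATH=/opt/RefactoringMiner/bin/RefactoringMiner
```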

    Refactoring Technique Selection

    1. Environment Setup:

    • Ensure all dependencies are installed. Install the required Python packages with:
      pip install -r requirements.txt
      

    2. Configuring the Repositories CSV:

    • The CSV file specified in CSV_PATH should contain a column named name with GitHub repository names (format: username/repo).

    3. Executing the Script:

    • Configure the environment variables in the .env file and set up the repositories CSV, then run:
      python3 src/run_rm.py
      
    • The RefactoringMiner output from the 126 repositories of our study is available at:
      https://zenodo.org/records/14395034

    4. Script Behavior:

    • The script clones each repository listed in the CSV file into the directory specified by CLONE_DIR, retrieves the default branch, and runs RefactoringMiner to analyze it.
    • Results and Logs:
      • Analysis results from RefactoringMiner are saved as .json files in CLONE_DIR.
      • Logs for each repository, including error messages, are saved as .log files in the same directory.

    5. Count Refactorings:

    • To count instances for each refactoring technique, run:
      python3 src/count_refactorings.py
      
    • The output CSV file, named refactoring_count_by_type_and_file, shows the number of refactorings for each technique, grouped by repository.
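The counting step can be sketched as follows; note that the JSON shape shown is an assumption based on RefactoringMiner's documented output format, not taken from the replication package:

```python
import json
from collections import Counter

# Assumed shape of a RefactoringMiner result file (illustrative sample data).
sample = json.loads("""
{"commits": [
  {"sha1": "abc123", "refactorings": [
    {"type": "Extract Method"}, {"type": "Rename Method"}]},
  {"sha1": "def456", "refactorings": [{"type": "Extract Method"}]}
]}
""")

# Tally refactoring instances by technique across all commits.
counts = Counter(
    ref["type"]
    for commit in sample["commits"]
    for ref in commit["refactorings"]
)
print(counts["Extract Method"])  # 2
```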

    Data Gathering

    • To collect snippets before and after refactoring and their metadata, run:

      python3 src/diff.py '[refactoring technique]'
      

      Replace [refactoring technique] with the desired technique name (e.g., Extract Method).

    • The script creates a directory for each repository and subdirectories named with the commit SHA. Each commit may have one or more refactorings.

    • Dataset Availability:

      • The snippets and metadata from the 126 repositories of our study are available in the dataset directory.
    • To generate the SQL file for the Web tool, run:

      python3 src/generate_refactorings_sql.py
      

    Web Tool for Manual Evaluation

    • The Web tool scripts are available in the web directory.
    • Populate the data/output/snippets folder with the output of src/diff.py.
    • Run the sql/create_database.sql script in your database.
    • Import the SQL file generated by src/generate_refactorings_sql.py.
    • Run dataset.php to generate the MaRV dataset file.
    • The MaRV dataset, generated by the Web tool, is available in the dataset directory of the replication package.
  14. Example code list definition in csv format

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    David A. Springate; Rosa Parisi; Ivan Olier; David Reeves; Evangelos Kontopantelis (2023). Example code list definition in csv format. [Dataset]. http://doi.org/10.1371/journal.pone.0171784.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    David A. Springate; Rosa Parisi; Ivan Olier; David Reeves; Evangelos Kontopantelis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example code list definition in csv format.

  15. Mars rover dataset

    • kaggle.com
    Updated Mar 1, 2025
    Cite
    Gaurav Kumar (2025). Mars rover dataset [Dataset]. https://www.kaggle.com/datasets/gauravkumar2525/mars-rover-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gaurav Kumar
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    1. Description

    The description section is crucial for helping users understand the purpose, context, and potential applications of your dataset. It should include the following details:

    • Dataset Overview: Provide a clear summary of what the dataset contains. For example, "This dataset includes images captured by NASA’s Curiosity Rover on Mars, along with metadata such as the camera used, the Martian sol (day) when the photo was taken, and the corresponding Earth date."
    • Source of the Data: Explain where the data comes from. If it’s obtained via an API, mention the source (e.g., "The images and metadata are retrieved from NASA's Mars Rover Photos API."). If the dataset is curated from multiple sources, list them.
    • Purpose and Use Cases: Describe why this dataset was created and how it can be used. For example:
      • Machine Learning: Train models for image classification, object detection, and anomaly detection.
      • Scientific Research: Analyze Martian surface patterns, study terrain features, or examine rover camera performance.
      • Space Exploration: Understand Mars' environmental conditions and assist in future exploration planning.
    • Data Format and Organization: Briefly mention the format of the files (e.g., CSV file for metadata, ZIP file containing images) and how they are structured.
    • Licensing and Permissions: Specify if the dataset has any restrictions on usage. Since NASA data is typically public domain, state that users are free to use it for research and development.
    • Limitations or Considerations: Mention any potential challenges, such as missing data, limited coverage, or resolution constraints.

    2. File Information

    This section provides details about the files included in your dataset, helping users navigate and use them efficiently. Key points to include:

    • List of Files: Clearly mention all the files and their formats, such as:
      • mars_rover_dataset.csv (CSV file containing metadata of images)
      • mars_images.zip (Compressed folder containing all images)
    • Purpose of Each File:
      • CSV File: Contains structured data, including image IDs, timestamps, camera details, and URLs.
      • ZIP File: Stores actual Mars images, which can be extracted and used for ML training or visualization.
    • File Dependencies: Explain how files relate to each other. For example, "The img_src column in mars_rover_dataset.csv corresponds to the images stored in mars_images.zip. Users should extract the images before using the dataset for model training."
    • How to Access the Files: Provide instructions on downloading and extracting files. Example:
      unzip mars_images.zip
      This ensures that users can quickly set up the dataset in their working environment.
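The file dependency described above (the img_src column mapping to images extracted from mars_images.zip) can be sketched in a few lines; the sample row and values below are made up for illustration:

```python
import csv
import io
import os

# Illustrative metadata rows matching the documented columns (made-up values).
metadata_csv = io.StringIO(
    "id,sol,camera_name,img_src,earth_date\n"
    "102693,1000,FHAZ,https://mars.nasa.gov/msl-raw-images/fhaz_1000.jpg,2015-05-30\n"
)

# Map each image URL to the local filename expected after extracting the zip.
rows = list(csv.DictReader(metadata_csv))
local_files = [os.path.basename(row["img_src"]) for row in rows]
print(local_files)  # ['fhaz_1000.jpg']
```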

    3. Column Descriptions

    This section explains the meaning of each column in the dataset, ensuring users can analyze and interpret the data correctly. A well-structured table format is often useful:

    • id: Unique identifier for each image.
    • sol: Martian sol (day) when the image was captured.
    • camera_name: Abbreviated name of the rover's camera (e.g., "FHAZ" for Front Hazard Camera).
    • camera_full_name: Full descriptive name of the camera.
    • img_src: URL link to the image. Users can download images using this link.
    • earth_date: The Earth date corresponding to the Martian sol.
    • rover_name: Name of the rover that captured the image (e.g., "Curiosity").
    • rover_status: Current operational status of the rover (e.g., "Active" or "Complete").
    • landing_date: Date when the rover landed on Mars.
    • launch_date: Date when the rover was launched from Earth.

    Additional Details:

    • Data Types: Indicate whether a column contains numbers, text, or dates.
    • Data Format: Example: earth_date is in YYYY-MM-DD format.
    • Special Notes: If any column has missing values or requires preprocessing, mention it.
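For instance, the YYYY-MM-DD layout noted above parses directly with Python's standard library; the sample value is illustrative:

```python
from datetime import date

# earth_date values follow the YYYY-MM-DD layout documented above.
d = date.fromisoformat("2015-05-30")
print(d.year)  # 2015
```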

    This section helps users quickly understand the dataset's structure, making it easier for them to work with the data effectively.

  16. Open Data Portal Of The City Of Mendoza

    • catalog.civicdataecosystem.org
    Updated May 5, 2025
    Cite
    (2025). Open Data Portal Of The City Of Mendoza [Dataset]. https://catalog.civicdataecosystem.org/dataset/open-data-portal-of-the-city-of-mendoza
    Explore at:
    Dataset updated
    May 5, 2025
    Area covered
    Mendoza
    Description

    Learn the step-by-step process for downloading the open data of the City of Mendoza. To access and download the open data, you do not need to register or create a user account. Access to the repository is free, and all datasets can be downloaded free of charge and without restrictions. The homepage has access buttons for 14 data categories and a search engine where you can directly enter the topic you want to access. Each data category corresponds to a section of the platform where you will find the various datasets available, grouped by theme.

    As an example, if we enter the Security section, we find different datasets within. Once you enter a dataset, you will find a list of resources; each resource is a file that contains the data. For example, the dataset Security Dependencies includes specific information about each of the dependencies and lets you access the published information in different formats and download it. In this case, if you want to open the file with Excel, you must click the download button of the second resource, which specifies that the format is CSV. Other sections likewise contain datasets in various formats, such as XLS and KMZ. Each dataset also includes a metadata sheet where you can see the last update date, the update frequency, and which government area generates the information, among other things.

    Translated from the Spanish original.
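Once a CSV resource has been downloaded from the portal (via the download button described above, or programmatically with urllib/requests), it can be parsed with Python's standard csv module. A minimal sketch; the column names and sample rows below are invented for illustration and are not taken from the actual Security Dependencies dataset:

```python
import csv
import io

def parse_csv_resource(text: str) -> list:
    """Parse the text of a downloaded CSV resource into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

# Tiny invented stand-in for a downloaded "Security Dependencies" resource:
sample = ("nombre,direccion\n"
          "Comisaria 1,Calle San Martin 100\n"
          "Comisaria 2,Av. Colon 200\n")
rows = parse_csv_resource(sample)
```

Each row comes back as a dict keyed by the header line, which is convenient for the kind of per-record metadata these resources carry.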

  17. Cambridgeshire Insight Open Data inventory list

    • cloud.csiss.gmu.edu
    • data.europa.eu
    • +1more
    html, xml, xsd
    Updated Dec 30, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    United Kingdom (2019). Cambridgeshire Insight Open Data inventory list [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/cambridgeshire-insight-open-data-inventory-list
    Explore at:
    Available download formats: html, xsd, xml
    Dataset updated
    Dec 30, 2019
    Dataset provided by
    United Kingdom
    License

    http://reference.data.gov.uk/id/open-government-licence

    Area covered
    Cambridgeshire
    Description

    The Dataset Inventory is a list of all the electronic data held by the Cambridgeshire Insight: Open Data website (http://opendata.cambridgeshireinsight.org.uk).

    It contains data from a variety of information themes and organisations.

    Dataset

    Sets of records, however structured, and not just datasets according to standard database terminology.

    Resource

    The resource(s) for the dataset. Here they are either: "Data" (files or feeds/streams of data records available) or "Document" (files or web pages describing datasets).

    Rendition

    The format(s) in which a resource is available; a single resource can have multiple renditions. For data, renditions could be, for example, CSV, XML, or PDF, and they may represent physical files or API calls. For a document, they could be ODF, PDF, or HTML.
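The Dataset / Resource / Rendition terminology above can be sketched as plain data classes. This is a hypothetical illustration of the three-level model, not an API of the Cambridgeshire Insight site:

```python
from dataclasses import dataclass, field

@dataclass
class Rendition:
    fmt: str  # e.g. "csv", "xml", "pdf", "html"

@dataclass
class Resource:
    kind: str  # "Data" or "Document", per the definitions above
    renditions: list = field(default_factory=list)

@dataclass
class Dataset:
    title: str
    resources: list = field(default_factory=list)

# One inventory entry: a data resource available in two renditions.
entry = Dataset(
    title="Example dataset",
    resources=[Resource(kind="Data",
                        renditions=[Rendition("csv"), Rendition("xml")])],
)
formats = [r.fmt for res in entry.resources for r in res.renditions]
```

The nesting mirrors the inventory's structure: a dataset holds resources, and each resource lists the renditions it is published in.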

  18. Windows Malware Detection Dataset

    • figshare.com
    txt
    Updated Mar 15, 2023
    Cite
    Irfan Yousuf (2023). Windows Malware Detection Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.21608262.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Mar 15, 2023
    Dataset provided by
    figshare
    Authors
    Irfan Yousuf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of Windows Portable Executable samples with four feature sets, one CSV file per feature set:

    1. DLLs_Imported.csv: the DLLs imported by each malware family. The first column contains SHA256 hash values, the second column the label (family type) of the malware, and the remaining columns list the names of the imported DLLs.
    2. API_Functions.csv: the API functions called by each sample, along with their SHA256 hash values and labels.
    3. PE_Header.csv: values of 52 fields of the PE header. All fields are labelled in the CSV file.
    4. PE_Section.csv: 9 field values for 10 different PE sections. All fields are labelled in the CSV file.

    Malware Type / family Labels:

    0 = Benign, 1 = RedLineStealer, 2 = Downloader, 3 = RAT, 4 = BankingTrojan, 5 = SnakeKeyLogger, 6 = Spyware
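A hedged sketch of reading the first feature set's layout (SHA256, then the numeric label, then a variable number of DLL names per row). The sample rows, and the assumption that the file starts with a header row, are ours; check them against the actual DLLs_Imported.csv:

```python
import csv
import io

# Label mapping as documented for this dataset:
LABELS = {0: "Benign", 1: "RedLineStealer", 2: "Downloader", 3: "RAT",
          4: "BankingTrojan", 5: "SnakeKeyLogger", 6: "Spyware"}

# Invented sample standing in for DLLs_Imported.csv (header row assumed):
sample = ("sha256,label,dll1,dll2\n"
          "abc123,1,kernel32.dll,ws2_32.dll\n"
          "def456,0,kernel32.dll,\n")

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the assumed header row
records = []
for sha256, label, *dlls in reader:
    # Rows can have differing numbers of DLL columns; drop empty trailing cells.
    records.append((sha256, LABELS[int(label)], [d for d in dlls if d]))
```

Star-unpacking (`*dlls`) handles the variable-width tail of DLL names, which a fixed-column DictReader would not.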

  19. MNIST dataset for Outliers Detection - [ MNIST4OD ]

    • figshare.com
    application/gzip
    Updated May 17, 2024
    Cite
    Giovanni Stilo; Bardh Prenkaj (2024). MNIST dataset for Outliers Detection - [ MNIST4OD ] [Dataset]. http://doi.org/10.6084/m9.figshare.9954986.v2
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    May 17, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Giovanni Stilo; Bardh Prenkaj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Here we present MNIST4OD, a dataset of large size (in both number of dimensions and number of instances) suitable for the outlier detection task. The dataset is based on the famous MNIST dataset (http://yann.lecun.com/exdb/mnist/). We build MNIST4OD as follows: to distinguish between outliers and inliers, we choose the images belonging to one digit as inliers (e.g. digit 1) and sample with uniform probability from the remaining images as outliers, such that their number equals 10% of the inliers. We repeat this generation process for all digits. For implementation simplicity we then flatten the images (28 x 28) into vectors. Each file MNIST_x.csv.gz contains the dataset whose inlier class is x. The data contains one instance (vector) per line; the last column is the outlier label (yes/no) of the data point, and another column indicates the original image class (0-9). The statistics of each dataset are (Name | Instances | Dimensions | Outliers in %):

    MNIST_0 | 7594 | 784 | 10
    MNIST_1 | 8665 | 784 | 10
    MNIST_2 | 7689 | 784 | 10
    MNIST_3 | 7856 | 784 | 10
    MNIST_4 | 7507 | 784 | 10
    MNIST_5 | 6945 | 784 | 10
    MNIST_6 | 7564 | 784 | 10
    MNIST_7 | 8023 | 784 | 10
    MNIST_8 | 7508 | 784 | 10
    MNIST_9 | 7654 | 784 | 10
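A minimal sketch of loading one MNIST_x.csv.gz file with the standard library. We assume each row holds the 784 flattened pixel values followed by the original digit class and then the outlier label as the last column; that ordering is an assumption and should be verified against the actual files. The synthetic sample uses 4 "pixels" instead of 784 to stay small:

```python
import csv
import gzip
import io

def load_mnist4od(raw_gz: bytes):
    """Yield (pixels, original_class, is_outlier) tuples, one per row.

    Assumed row layout: pixel values, then original class, then yes/no label.
    """
    with gzip.open(io.BytesIO(raw_gz), mode="rt") as f:
        for row in csv.reader(f):
            *pixels, orig_class, label = row
            yield ([float(p) for p in pixels],
                   int(float(orig_class)),
                   label.strip().lower() == "yes")

# Synthetic stand-in for a gzipped CSV (4 pixels per row instead of 784):
raw = gzip.compress(b"0.0,0.1,0.2,0.3,7,no\n0.9,0.8,0.7,0.6,3,yes\n")
rows = list(load_mnist4od(raw))
```

Reading in text mode ("rt") lets csv.reader consume the decompressed stream directly, without materialising the whole file.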

  20. Aggregated Data: Environmental Monitoring and Observations Effort 2010-2023

    • data.csiro.au
    • researchdata.edu.au
    Updated Sep 26, 2023
    + more versions
    Cite
    Donald Hobern; Shandiya Balasubramaniam (2023). Aggregated Data: Environmental Monitoring and Observations Effort 2010-2023 [Dataset]. http://doi.org/10.25919/2y9j-jk11
    Explore at:
    Dataset updated
    Sep 26, 2023
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Donald Hobern; Shandiya Balasubramaniam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2010 - May 31, 2023
    Area covered
    Dataset funded by
    TERN
    Atlas of Living Australia
    IMOS
    Australian Research Data Commons
    Description

    Aggregated metadata on environmental monitoring and observing activities from three Australian national research infrastructures (NRIs): biodiversity survey events from the Atlas of Living Australia (ALA, https://ala.org.au/), marine observations collected by the Integrated Marine Observing System (IMOS, https://imos.org.au/) and site-based monitoring and survey efforts by the Terrestrial Ecosystem Research Network (TERN, https://tern.org.au/). This dataset provides a summary breakdown of these efforts by survey topic, region and time period from 2010 to the present.

    Survey topics are mapped to an EcoAssets Earth Science Features vocabulary based on the Earth Science keywords from the Global Change Master Directory (GCMD, https://gcmd.earthdata.nasa.gov/KeywordViewer/) vocabulary, modified to use taxonomic concept URIs from the Australian National Species List (ANSL, https://biodiversity.org.au/nsl/) in place of the GCMD Earth Science > Biological Classification vocabulary. ANSL categories map more readily to biodiversity survey categories, since GCMD depends on a top-level division between vertebrates and invertebrates rather than offering an animal category. The EcoAssets Earth Science Features vocabulary, including alternative keywords used in ALA, IMOS or TERN datasets, is included in this collection.

    The primary asset is AggregatedData_EnvironmentalMonitoringAndObservationsEffort_1.1.2023-06-13.csv. This contains all faceted data records for the period and supported facets related to time, space and features observed.

    Two derived assets (SummaryData_MonitoringAndObservationsEffortByMarineEcoregion_1.1.2023-06-13.csv, SummaryData_MonitoringAndObservationsEffortByTerrestrialEcoregion_1.1.2023-06-13.csv) further summarise the faceted data. Each is a pivot of the aggregated dataset.

    Vocabulary_EcoAssetsEarthScienceFeatures_1.0.2023-03-23.csv contains the hierarchical terms used within this asset to categorise earth science features. TreeView_EcoAssetsEarthScienceFeatures_1.0.2023-03-23.txt provides a simpler, more readable view. KeywordMapping_EcoAssetsEarthScienceFeatures_1.0.2023-03-23.csv shows the mappings between these terms and the keywords used in source datasets.

    The data-sources.csv file includes information on the source datasets that contributed to this asset. README.txt documents the columns in each data file.

    Lineage: This dataset was created by the following pipeline:

    1. Metadata records were collected from the TERN linked data portal (https://linkeddata.tern.org.au/) for all TERN monitoring sites and survey activities. Feature terms follow the TERN Feature Type vocabulary, mapped to the EcoAssets Earth Science Features vocabulary. For features that have been measured continuously at the site, metadata records were created for each relevant year since commission of the site. For other sites and features, metadata records were generated only for years in which the site was visited. TERN metadata records are associated with site coordinates.

    2. Metadata records were harvested for datasets in the Australian Ocean Data Network (AODN, https://portal.aodn.org.au/) portal maintained by IMOS (iso19115-3.2018 format over OAI-PMH). Feature terms follow the GCMD keywords used in these metadata records. Metadata records were created for each year overlapping the data collection period for each dataset. Where the datasets were associated with a bounding box, records were created for each IMCRA region intersecting the bounding box.

    3. Metadata records were created for each biodiversity sample event published to the ALA and associated with a Darwin Core event ID and a named sampling protocol (see https://dwc.tdwg.org/terms/#event). Events were excluded if the set of sampled taxa included multiple kingdoms OR the sampling protocol was associated with <50 samples OR no sample included >1 species. The remaining samples were mapped to feature terms based on the taxonomic scope of all species recorded for the associated protocol. Year and coordinates were taken from the associated sample event.

    4. Metadata records from all sources were combined and include the following values. The feature facet values are offered as a convenience for grouping records without using the hierarchical structure of the EcoAssets Earth Science Features vocabulary:

    • Source National Research Institute (NRI – one of ALA, IMOS, TERN)
    • Dataset name (site name for TERN)
    • Dataset URI (site URI for TERN)
    • Original keyword from NRI (TERN feature type, IMOS GCMD keyword, ALA taxon)
    • Decimal latitude (where appropriate)
    • Decimal longitude (where appropriate)
    • Year
    • State or Territory
    • IBRA7 terrestrial region
    • IMCRA 4.0 mesoscale marine bioregion
    • Feature ID from EcoAssets Earth Science Features vocabulary
    • Feature name associated with feature ID
    • Feature facet 1 – high-level facet based on feature ID – a top-level GCMD Earth Science category (6 terms)
    • Feature facet 2 – intermediate-level facet based on feature ID – second-level GCMD/ANSL category (29 terms)
    • Feature facet 3 – lower-level facet with more fine-grained taxonomic structure based on feature ID – typically a third-level GCMD/ANSL category (36 terms)
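The faceted records listed above lend themselves to simple group-and-count summaries. A hypothetical illustration with collections.Counter; the record dicts mirror a few of the listed columns (source NRI, year, top-level feature facet, IBRA7 region), but the values are invented for the example:

```python
from collections import Counter

# Invented records shaped like the aggregated-effort rows described above:
records = [
    {"nri": "ALA",  "year": 2018, "facet1": "Biosphere", "ibra7": "Sydney Basin"},
    {"nri": "TERN", "year": 2018, "facet1": "Biosphere", "ibra7": "Sydney Basin"},
    {"nri": "IMOS", "year": 2020, "facet1": "Oceans",    "ibra7": None},
]

# Count monitoring/observation effort per (source NRI, top-level facet):
effort = Counter((r["nri"], r["facet1"]) for r in records)
```

The same pattern, keyed on region or year instead, reproduces the kind of pivots the two derived summary assets provide.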
Residential School Locations Dataset (CSV Format)

4 scholarly articles cite this dataset.
If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School.When the precise location is known, the coordinates of the main building are provided, and when the precise location of the building isn’t known, an approximate location is provided. For each residential school institution location, the following information is provided: official names, alternative name, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school) and list of references used to determine the location of the main buildings or sites.
