74 datasets found
  1. Test Data Dummy CSV

    • figshare.com
    txt
    Updated Nov 6, 2023
    Cite
    Tori Duckworth (2023). Test Data Dummy CSV [Dataset]. http://doi.org/10.6084/m9.figshare.24500965.v2
    Explore at:
    txt (available download formats)
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tori Duckworth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This CSV represents a dummy dataset to test the functionality of trusted repository search capabilities and of research data governance practices. The associated dummy dissertation is entitled Financial Econometrics Dummy Dissertation. The dummy file is a 7KB CSV containing 5000 rows of notional demographic tabular data.

  2. All CSV Files

    • kaggle.com
    Updated May 26, 2022
    Cite
    Dummy Mummy (2022). All CSV Files [Dataset]. https://www.kaggle.com/datasets/dummymummy/all-csv-files
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 26, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dummy Mummy
    Description

    Dataset

    This dataset was created by Dummy Mummy


  3. File S1 - Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files

    • figshare.com
    zip
    Updated May 31, 2023
    Cite
    Stanisław Adaszewski (2023). File S1 - Mynodbcsv: Lightweight Zero-Config Database Solution for Handling Very Large CSV Files [Dataset]. http://doi.org/10.1371/journal.pone.0103319.s001
    Explore at:
    zip (available download formats)
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Stanisław Adaszewski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Set of Python scripts to generate data for benchmarks: equivalents of ADNI_Clin_6800_geno.csv, PTDEMOG.csv, MicroarrayExpression_fixed.csv and Probes.csv files, the dummy.csv, dummy2.csv and the microbenchmark CSV files. (ZIP)
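    As a rough illustration of what such a generator script might look like, here is a minimal sketch in Python; the file name, header and row count are invented for illustration and are not taken from the archive:

      import csv
      import random

      # Write a small dummy CSV of notional tabular data (illustrative only).
      with open("dummy.csv", "w", newline="") as f:
          writer = csv.writer(f)
          writer.writerow(["id", "value_a", "value_b"])  # assumed header
          for i in range(10000):
              writer.writerow([i, round(random.random(), 6), random.randint(0, 100)])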

  4. Unreliable News Detection Dataset

    • kaggle.com
    Updated Mar 4, 2023
    Cite
    Farjana Kabir (2023). Unreliable News Detection Dataset [Dataset]. https://www.kaggle.com/datasets/farjanakabirsamanta/true-fake-news-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 4, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Farjana Kabir
    Description

    There exists one directory named Fake_True. The directory contains 2 csv files:

    • Fake.csv - contains unreliable news
    • True.csv - contains reliable news

    Each csv file has 4 columns: title, text, subject, date.

    Collected from MEJBAH AHAMMAD
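    A minimal loading sketch, assuming the directory layout described above; the label column is added here and is not part of the files:

      import pandas as pd

      fake = pd.read_csv("Fake_True/Fake.csv")  # columns: title, text, subject, date
      true = pd.read_csv("Fake_True/True.csv")
      fake["label"] = 0  # unreliable news
      true["label"] = 1  # reliable news
      news = pd.concat([fake, true], ignore_index=True)
      print(news.shape, list(news.columns))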

  5. WELFake dataset for fake news detection in text data

    • zenodo.org
    • data.europa.eu
    csv
    Updated Apr 9, 2021
    Cite
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan (2021). WELFake dataset for fake news detection in text data [Dataset]. http://doi.org/10.5281/zenodo.4561253
    Explore at:
    csv (available download formats)
    Dataset updated
    Apr 9, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pawan Kumar Verma; Prateek Agrawal; Radu Prodan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We designed a larger and more generic Word Embedding over Linguistic Features for Fake News Detection (WELFake) dataset of 72,134 news articles, 35,028 real and 37,106 fake. For this, we merged four popular news datasets (i.e. Kaggle, McIntire, Reuters, BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.

    The dataset contains four columns: Serial number (starting from 0); Title (the news heading); Text (the news content); and Label (0 = fake, 1 = real).

    There are 78,098 data entries in the csv file, of which only 72,134 are accessible as per the data frame.
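    A quick sanity check against the stated counts (the CSV file name is an assumption; the column names follow the description above):

      import pandas as pd

      df = pd.read_csv("WELFake_Dataset.csv")  # assumed file name
      print(len(df))                           # usable rows: 72,134
      print(df["Label"].value_counts())        # expect 0 (fake) = 37,106 and 1 (real) = 35,028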

    This dataset is a part of our ongoing research on "Fake News Prediction on Social Media Website" as a doctoral degree program of Mr. Pawan Kumar Verma and is partially supported by the ARTICONF project funded by the European Union’s Horizon 2020 research and innovation program.

  6. Students Test Data

    • kaggle.com
    Updated Sep 12, 2023
    Cite
    ATHARV BHARASKAR (2023). Students Test Data [Dataset]. https://www.kaggle.com/datasets/atharvbharaskar/students-test-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ATHARV BHARASKAR
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Dataset Overview: This dataset pertains to the examination results of students who participated in a series of academic assessments at a fictitious educational institution named "University of Exampleville." The assessments were administered across various courses and academic levels, with a focus on evaluating students' performance in general management and domain-specific topics.

    Columns: The dataset comprises 12 columns, each representing specific attributes and performance indicators of the students. These columns encompass information such as the students' names (which have been anonymized), their respective universities, academic program names (including BBA and MBA), specializations, the semester of the assessment, the type of examination domain (general management or domain-specific), general management scores (out of 50), domain-specific scores (out of 50), total scores (out of 100), student ranks, and percentiles.

    Data Collection: The examination data was collected during a standardized assessment process conducted by the University of Exampleville. The exams were designed to assess students' knowledge and skills in general management and their chosen domain-specific subjects. It involved students from both BBA and MBA programs who were in their final year of study.

    Data Format: The dataset is available in a structured format, typically as a CSV file. Each row represents a unique student's performance in the examination, while columns contain specific information about their results and academic details.

    Data Usage: This dataset is valuable for analyzing and gaining insights into the academic performance of students pursuing BBA and MBA degrees. It can be used for various purposes, including statistical analysis, performance trend identification, program assessment, and comparison of scores across domains and specializations. Furthermore, it can be employed in predictive modeling or decision-making related to curriculum development and student support.

    Data Quality: The dataset has undergone preprocessing and anonymization to protect the privacy of individual students. Nevertheless, it is essential to use the data responsibly and in compliance with relevant data protection regulations when conducting any analysis or research.

    The CSV format allows for easy import and analysis using various data analysis tools and programming languages like Python or R, or spreadsheet software like Microsoft Excel.

    Here's a column-wise description of the dataset:

    Name OF THE STUDENT: The full name of the student who took the exam. (Anonymized)

    UNIVERSITY: The university where the student is enrolled.

    PROGRAM NAME: The name of the academic program in which the student is enrolled (BBA or MBA).

    Specialization: If applicable, the specific area of specialization or major that the student has chosen within their program.

    Semester: The semester or academic term in which the student took the exam.

    Domain: The type of examination domain (general management or domain-specific).

    GENERAL MANAGEMENT SCORE (OUT of 50): The score obtained by the student in the general management part of the exam, out of a maximum possible score of 50.

    Domain-Specific Score (Out of 50): The score obtained by the student in the domain-specific part of the exam, also out of a maximum possible score of 50.

    TOTAL SCORE (OUT of 100): The total score obtained by adding the scores from the general management and domain-specific parts, out of a maximum possible score of 100.
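    A consistency check one might run on these columns (the column names are transcribed from the description and may differ slightly in the file):

      import pandas as pd

      df = pd.read_csv("students_test_data.csv")  # assumed file name
      gm = df["GENERAL MANAGEMENT SCORE (OUT of 50)"]
      ds = df["Domain-Specific Score (Out of 50)"]
      total = df["TOTAL SCORE (OUT of 100)"]
      print("totals consistent:", (gm + ds).equals(total))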

  7. Residential School Locations Dataset (CSV Format)

    • borealisdata.ca
    • search.dataone.org
    Updated Jun 5, 2019
    Cite
    Rosa Orlandini (2019). Residential School Locations Dataset (CSV Format) [Dataset]. http://doi.org/10.5683/SP2/RIYEMU
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 5, 2019
    Dataset provided by
    Borealis
    Authors
    Rosa Orlandini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1863 - Jun 30, 1998
    Area covered
    Canada
    Description

    The Residential School Locations Dataset [IRS_Locations.csv] contains the locations (latitude and longitude) of Residential Schools and student hostels operated by the federal government in Canada. All the residential schools and hostels that are listed in the Indian Residential School Settlement Agreement are included in this dataset, as well as several industrial schools and residential schools that were not part of the IRSSA. This version of the dataset doesn't include the five schools under the Newfoundland and Labrador Residential Schools Settlement Agreement. The original school location data was created by the Truth and Reconciliation Commission and was provided to the researcher (Rosa Orlandini) by the National Centre for Truth and Reconciliation in April 2017.

    The dataset was created by Rosa Orlandini, and builds upon and enhances the previous work of the Truth and Reconciliation Commission, Morgan Hite (creator of the Atlas of Indian Residential Schools in Canada that was produced for the Tk'emlups First Nation and Justice for Day Scholars Initiative), and Stephanie Pyne (project lead for the Residential Schools Interactive Map). Each individual school location in this dataset is attributed either to RSIM, Morgan Hite, NCTR or Rosa Orlandini.

    Many schools/hostels had several locations throughout the history of the institution. If the school/hostel moved from its original location to another property, then the school is considered to have two unique locations in this dataset: the original location and the new location. For example, Lejac Indian Residential School had two locations while it was operating, Stuart Lake and Fraser Lake. If a new school building was constructed on the same property as the original school building, it isn't considered to be a new location, as is the case of Girouard Indian Residential School. When the precise location is known, the coordinates of the main building are provided; when the precise location of the building isn't known, an approximate location is provided.

    For each residential school institution location, the following information is provided: official name, alternative names, dates of operation, religious affiliation, latitude and longitude coordinates, community location, Indigenous community name, contributor (of the location coordinates), school/institution photo (when available), location point precision, type of school (hostel or residential school), and a list of references used to determine the location of the main buildings or sites.

  8. Fake and True News Dataset

    • figshare.com
    txt
    Updated Dec 3, 2020
    Cite
    Abu Bakkar Siddik (2020). Fake and True News Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.13325198.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Dec 3, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Abu Bakkar Siddik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset combines two parts: fake news and true news. The fake news was collected from Kaggle, and some of the true news was collected from IEEE DataPort. Additional true news was needed to balance the fake news, so further true news was collected from various trusted online sites. Finally, the fake and true news were concatenated into a single dataset, to help researchers who want to pursue this topic further.

  9. Amazon product reviews (mock dataset)

    • zenodo.org
    csv
    Updated Jun 18, 2022
    Cite
    Yury Kashnitsky (2022). Amazon product reviews (mock dataset) [Dataset]. http://doi.org/10.5281/zenodo.6657410
    Explore at:
    csv (available download formats)
    Dataset updated
    Jun 18, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yury Kashnitsky
    License

    Attribution 1.0 (CC BY 1.0): https://creativecommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    About

    This is a mock dataset with Amazon product reviews. Classes are hierarchically structured: 6 "level 1" classes, 64 "level 2" classes, and 510 "level 3" classes.


    3 files are shared:

    • train_40k.csv - training 40k Amazon product reviews
    • valid_10k.csv - 10k reviews left for validation
    • unlabeled_150k.csv - raw 150k Amazon product reviews; these can be used for language model finetuning.

    Level 1 classes are: health personal care, toys games, beauty, pet supplies, baby products, and grocery gourmet food.

    Dataset originally from https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification
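    A minimal sketch for working with the three files listed above (the class-label column name is an assumption, since the description does not list the columns):

      import pandas as pd

      train = pd.read_csv("train_40k.csv")
      valid = pd.read_csv("valid_10k.csv")
      unlabeled = pd.read_csv("unlabeled_150k.csv")
      print(len(train), len(valid), len(unlabeled))  # expect 40000, 10000, 150000
      # Distribution over the six "level 1" classes (column name assumed):
      print(train["Cat1"].value_counts())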

  10. Geolocation Data [Longitude Latitude]

    • kaggle.com
    Updated Mar 12, 2022
    Cite
    You Sheng (2022). Geolocation Data [Longitude Latitude] [Dataset]. https://www.kaggle.com/datasets/liewyousheng/geolocation
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 12, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    You Sheng
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    Context

    Full database of countries, states, and cities, available in CSV format. All countries, states, and cities are covered, populated in different combinations and versions.

    Each CSV has the longitude and latitude of each location, alongside other miscellaneous country data such as currency, state code, and phone country code.

    Content

    Total countries: 250
    Total states/regions/municipalities: 4,963
    Total cities/towns/districts: 148,061

    Last updated on: 29th January 2022

    Source

    https://github.com/dr5hn/countries-states-cities-database
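    A hypothetical usage sketch; the file and column names follow the conventions of the source repository linked above and may differ in the Kaggle copy:

      import pandas as pd

      cities = pd.read_csv("cities.csv")
      # Keep one country's cities with just their coordinates:
      subset = cities[cities["country_name"] == "India"][["name", "latitude", "longitude"]]
      print(subset.head())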

  11. Amazon India products dataset in CSV format

    • crawlfeeds.com
    csv, zip
    Updated Mar 27, 2025
    Cite
    Crawl Feeds (2025). Amazon India products dataset in CSV format [Dataset]. https://crawlfeeds.com/datasets/amazon-india-products-dataset-in-csv-format
    Explore at:
    csv, zip (available download formats)
    Dataset updated
    Mar 27, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Area covered
    India
    Description

    Gain access to a structured dataset featuring thousands of products listed on Amazon India. This dataset is ideal for e-commerce analytics, competitor research, pricing strategies, and market trend analysis.

    Dataset Features:

    • Product Details: Name, Brand, Category, and Unique ID

    • Pricing Information: Current Price, Discounted Price, and Currency

    • Availability & Ratings: Stock Status, Customer Ratings, and Reviews

    • Seller Information: Seller Name and Fulfillment Details

    • Additional Attributes: Product Description, Specifications, and Images

    Dataset Specifications:

    • Format: CSV

    • Number of Records: 50,000+

    • Delivery Time: 3 Days

    • Price: $149.00

    • Availability: Immediate

    This dataset provides structured and actionable insights to support e-commerce businesses, pricing strategies, and product optimization. If you're looking for more datasets for e-commerce analysis, explore our E-commerce datasets for a broader selection.

  12. Network traffic for machine learning classification

    • data.mendeley.com
    Updated Feb 12, 2020
    + more versions
    Cite
    Víctor Labayen Guembe (2020). Network traffic for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.1
    Explore at:
    Dataset updated
    Feb 12, 2020
    Authors
    Víctor Labayen Guembe
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format, captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename, and the activity label.

    Activities:

    Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs or remote CLI sessions over SSH.

    Bulk data transfer: applications that transfer large data volume files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository.

    Web browsing: all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs, news sites and the university's Moodle.

    Video playback: traffic from applications that consume video in streaming or pseudo-streaming. The best-known services used are Twitch and YouTube, but the university's online classroom has also been used.

    Idle behaviour: the background traffic generated by the user's computer when the user is idle. This traffic was captured with every application closed but with some pages left open, such as Google Docs, YouTube and several web pages, always without user interaction.

    The capture is performed on a network probe attached to the router that forwards the user's network traffic, using a SPAN port. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every csv file.
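    A reading sketch based on the field description above (the exact CSV header spellings and the trace file name are assumptions):

      import pandas as pd

      mapping = pd.read_csv("mapping.csv")  # host IP, csv/pcap filename, activity label
      trace = pd.read_csv("video_01.csv")   # hypothetical per-activity trace file
      print(list(trace.columns))  # timestamp, protocol, payload size, src/dst IP, src/dst port
      print("total payload bytes:", trace["payload_size"].sum())  # assumed header name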

    The amount of data is stated as follows:

    Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    Video: 23 traces, 4496 s, 1405 MBytes
    Web: 23 traces, 4203 s, 148 MBytes
    Interactive: 42 traces, 8934 s, 30.5 MBytes
    Idle: 52 traces, 6341 s, 0.69 MBytes

  13. Colombian biodiversity is governed by a rich and diverse policy mix

    • springernature.figshare.com
    txt
    Updated May 30, 2023
    Cite
    Alejandra Echeverri (2023). Colombian biodiversity is governed by a rich and diverse policy mix [Dataset]. http://doi.org/10.6084/m9.figshare.20436162.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Alejandra Echeverri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This includes the following files:

    Policy_mix_database.xlsx: A database in Excel format with the 186 policies included in the study. Each row represents an individual policy, and columns give information about the unique policy ID (ranging from ID_1 to ID_186), the name of the policy, URL links to the original policy, policy goals, the years the policies were introduced, lead actors, scale of governance, among others.

    Policy_mix_DB.csv: A csv file of binary trait data. Each row represents a unique policy, with columns being the unique policy ID and dummy variables for the eight axial codes in the study. Columns B-G refer to the biodiversity scale subcategories; Columns H-P refer to ecosystem subcategories; Columns Q-U refer to policy instrument subcategories; Columns V-AA refer to conservation paradigm subcategories; Columns AB-AR refer to policy theme subcategories; Columns AS-AW refer to governance scale subcategories; Columns AX-BC refer to lead actor subcategories; Column BD refers to the year when the policy became effective; Columns BE-BN refer to biodiversity threat subcategories.

    Frequencies_BPD.csv: A csv file with the contingency table of dummy variables in the Policy_mix_DB.csv file and the count of policies for each variable (see the counting sketch after this file list).

    BPD_for_logistic_regression.csv: A csv file of the database in long format used for figures and files

    Sankey1.csv: Source data for the Figure 4 panel A

    Sankey2.csv: Source data for the Figure 4 panel B

    IPBES.csv: Source data for the Figure 4 panel C

    Echeverri_main_analysis_biodiversity_policy_mix: R script for all quantitative data analysis and for making figures 1,2,3 and all supplementary figures

    Echeverri_et_al_Figure4.jpynb: Script with Python code in a Jupyter notebook for making Figure 4
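    The per-variable counts in Frequencies_BPD.csv should be reproducible from the binary trait file as column sums; a counting sketch (the ID column name is an assumption):

      import pandas as pd

      db = pd.read_csv("Policy_mix_DB.csv")
      dummies = db.drop(columns=["policy_ID"], errors="ignore")  # assumed ID column name
      counts = dummies.select_dtypes("number").sum()  # number of policies per dummy variable
      print(counts.sort_values(ascending=False).head(10))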

  14. Sample Dataset for Testing

    • ieee-dataport.org
    Updated Apr 28, 2025
    Cite
    Alex Outman (2025). Sample Dataset for Testing [Dataset]. https://ieee-dataport.org/documents/sample-dataset-testing
    Explore at:
    Dataset updated
    Apr 28, 2025
    Authors
    Alex Outman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    10

  15. Customer Dataset csv

    • kaggle.com
    Updated Mar 22, 2023
    Cite
    Moses Moncy (2023). Customer Dataset csv [Dataset]. https://www.kaggle.com/datasets/mosesmoncy/customer-dataset-csv
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Moses Moncy
    Description

    Dataset

    This dataset was created by Moses Moncy


  16. Fake Reviews Dataset

    • osf.io
    Updated Mar 2, 2025
    Cite
    Joni Salminen; Abhinav Nigam; Manav Parikh; Richa Gupta; verga nathanael; Timur Lazizov; Pablo Gutierrez; Jinjun YU; Yasmeen Abdelmohsen; Dongjun Wei; weig; Naji Shajarisales; Arvind; suronapee phoomvuthisarn; Todor Todorov; Yulia Gartman; Ryan Callaghan; Nguyễn Trung; Nighat Zahra; Asim Abdullah; Abhay Sharma (2025). Fake Reviews Dataset [Dataset]. https://osf.io/tyue9
    Explore at:
    Dataset updated
    Mar 2, 2025
    Dataset provided by
    Center For Open Science
    Authors
    Joni Salminen; Abhinav Nigam; Manav Parikh; Richa Gupta; verga nathanael; Timur Lazizov; Pablo Gutierrez; Jinjun YU; Yasmeen Abdelmohsen; Dongjun Wei; weig; Naji Shajarisales; Arvind; suronapee phoomvuthisarn; Todor Todorov; Yulia Gartman; Ryan Callaghan; Nguyễn Trung; Nighat Zahra; Asim Abdullah; Abhay Sharma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The generated fake reviews dataset, containing 20k fake reviews and 20k real product reviews. OR = original reviews (presumably human-created and authentic); CG = computer-generated fake reviews.

  17. ISOT Fake News Dataset

    • kaggle.com
    Updated Dec 29, 2024
    Cite
    Rahul Goel (2024). ISOT Fake News Dataset [Dataset]. https://www.kaggle.com/datasets/rahulogoel/isot-fake-news-dataset/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 29, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Rahul Goel
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The dataset contains around 45,000 news articles with a mix of real and fake news. It is provided by the University of Victoria.

    [Image: distribution of topics]

    The dataset contains two types of articles: fake and real news. This dataset was collected from real-world sources; the truthful articles were obtained by crawling articles from Reuters.com (a news website). The fake news articles were collected from unreliable websites that were flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The dataset contains different types of articles on different topics; however, the majority of articles focus on political and world news topics.

    The dataset consists of two CSV files. The first file, named "True.csv", contains more than 12,600 articles from Reuters.com. The second file, named "Fake.csv", contains more than 12,600 articles from different fake news outlets. Each article contains the following information: article title, text, type, and the date the article was published. To match the fake news data collected for kaggle.com, the collection focused mostly on articles from 2016 to 2017. The data collected were cleaned and processed; however, the punctuation and mistakes that existed in the fake news were kept in the text.

    The following is a breakdown of the categories and number of articles per category.

    Real-News: 21,417 articles
    • World-News: 10,145
    • Politics-News: 11,272

    Fake-News: 23,481 articles
    • Government-News: 1,570
    • Middle-east: 778
    • US News: 783
    • Left-news: 4,459
    • Politics: 6,841
    • News: 9,050
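    A sketch reproducing this breakdown from the two CSV files (the per-article category column name is an assumption; the description calls the field "type", while some releases use "subject"):

      import pandas as pd

      true_df = pd.read_csv("True.csv")
      fake_df = pd.read_csv("Fake.csv")
      print(len(true_df), len(fake_df))         # expect 21,417 and 23,481
      print(fake_df["subject"].value_counts())  # Government-News, Middle-east, US News, ...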

    Note- To cite this dataset use the information given by original authors:

    1. Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.
    2. Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques.” In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).
  18. Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA

    • figshare.com
    txt
    Updated Jan 22, 2016
    Cite
    Martin Bulla (2016). Example of how to manually extract incubation bouts from interactive plots of raw data - R-CODE and DATA [Dataset]. http://doi.org/10.6084/m9.figshare.2066784.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jan 22, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Bulla
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    General information: The script runs with R (Version 3.1.1; 2014-07-10) and the packages plyr (Version 1.8.1), XLConnect (Version 0.2-9), utilsMPIO (Version 0.0.25), sp (Version 1.0-15), rgdal (Version 0.8-16), tools (Version 3.1.1) and lattice (Version 0.20-29). Questions can be directed to Martin Bulla (bulla.mar@gmail.com). Data collection and how the individual variables were derived are described in: Steiger, S.S., et al., When the sun never sets: diverse activity rhythms under continuous daylight in free-living arctic-breeding birds. Proceedings of the Royal Society B: Biological Sciences, 2013. 280(1764): p. 20131016; and Dale, J., et al., The effects of life history and sexual selection on male and female plumage colouration. Nature, 2015. Data are available as an Rdata file. Missing values are NA. For better readability the subsections of the script can be collapsed.

    Description of the method:

    1. Data are visualized in an interactive actogram with time of day on the x-axis and one panel for each day of data.

    2. A red rectangle indicates the active field; clicking with the mouse in that field on the depicted light signal generates a data point that is automatically saved in the csv file (via a custom-made function). For this data extraction it is recommended to always click on the bottom line of the red rectangle, as data are always available there due to a dummy variable ("lin") that creates continuous data at the bottom of the active panel. The data are captured only if a greenish vertical bar appears and a new line of data appears in the R console.

    3. To extract incubation bouts, the first click in the new plot has to be the start of incubation, the next click the end of incubation, and then a click on the same spot the start of incubation for the other sex. If the end and start of incubation are at different times, the data will still be extracted, but the sex, logger and bird_ID will be wrong; these need to be changed manually in the csv file. Similarly, the first bout for a given plot will always be assigned to the male (if no data are present in the csv file) or based on previous data. Hence, whenever data from a new plot are extracted, it is worth checking at the first mouse click whether the sex, logger and bird_ID information is correct and, if not, adjusting it manually.

    4. When all information from one day (panel) has been extracted, right-click on the plot and choose "stop". This activates the following day (panel) for extraction.

    5. To end extraction before going through all the rectangles, press "escape".

    Annotations of the data files from turnstone_2009_Barrow_nest-t401_transmitter.RData:

    dfr contains raw data on signal strength from the radio tags attached to the rumps of the female and male, and information about when the birds were captured and the incubation stage of the nest:
    1. who: whether the recording refers to female, male, capture or start of hatching
    2. datetime_: date and time of each recording
    3. logger: unique identity of the radio tag
    4. signal_: signal strength of the radio tag
    5. sex: sex of the bird (f = female, m = male)
    6. nest: unique identity of the nest
    7. day: datetime_ variable truncated to year-month-day format
    8. time: time of day in hours
    9. datetime_utc: date and time of each recording, in UTC time
    10. cols: colors assigned to "who"

    m contains metadata for a given nest:
    1. sp: species (RUTU = Ruddy turnstone)
    2. nest: unique identity of the nest
    3. year_: year of observation
    4. IDfemale: unique identity of the female
    5. IDmale: unique identity of the male
    6. lat: latitude coordinate of the nest
    7. lon: longitude coordinate of the nest
    8. hatch_start: date and time when the hatching of the eggs started
    9. scinam: scientific name of the species
    10. breeding_site: unique identity of the breeding site (barr = Barrow, Alaska)
    11. logger: type of device used to record incubation (IT = radio tag)
    12. sampling: mean incubation sampling interval in seconds

    s contains metadata for the incubating parents:
    1. year_: year of capture
    2. species: species (RUTU = Ruddy turnstone)
    3. author: the author who measured the bird
    4. nest: unique identity of the nest
    5. caught_date_time: date and time when the bird was captured
    6. recapture: was the bird captured before? (0 = no, 1 = yes)
    7. sex: sex of the bird (f = female, m = male)
    8. bird_ID: unique identity of the bird
    9. logger: unique identity of the radio tag

  19. Dataset for "ConfSolv: Prediction of solute conformer free energies across a range of solvents"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 25, 2023
    Cite
    Zipei Tan (2023). Dataset for "ConfSolv: Prediction of solute conformer free energies across a range of solvents" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8292519
    Explore at:
    Dataset updated
    Oct 25, 2023
    Dataset provided by
    Volker Settels
    Kevin A. Spiekermann
    Frederik Sandfort
    Florence Vermeire
    Angiras Menon
    Lagnajit Pattanaik
    William H. Green
    Zipei Tan
    Philipp Eiden
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains three archives. The first archive, full_dataset.zip, contains geometries and free energies for nearly 44,000 solute molecules with almost 9 million conformers, in 42 different solvents. The geometries and gas phase free energies are computed using density functional theory (DFT). The solvation free energy for each conformer is computed using COSMO-RS and the solution free energies are computed using the sum of the gas phase free energies and the solvation free energies. The geometries for each solute conformer are provided as ASE_atoms_objects within a pandas DataFrame, found in the compressed file dft coords.pkl.gz within full_dataset.zip. The gas-phase energies, solvation free energies, and solution free energies are also provided as a pandas DataFrame in the compressed file free_energy.pkl.gz within full_dataset.zip. Ten example data splits for both random and scaffold split types are also provided in the ZIP archive for training models. Scaffold split index 0 is used to generate results in the corresponding publication. The second archive, refined_conf_search.zip, contains geometries and free energies for a representative sample of 28 solute molecules from the full dataset that were subject to a refined conformer search and thus had more conformers located. The format of the data is identical to full_dataset.zip. The third archive contains one folder for each solvent for which we have provided free energies in full_dataset.zip. Each folder contains the .cosmo file for every solvent conformer used in the COSMOtherm calculations, a dummy input file for the COSMOtherm calculations, and a CSV file that contains the electronic energy of each solvent conformer that needs to be substituted for "EH_Line" in the dummy input file.
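    A minimal reading sketch for the two compressed DataFrames named above (pandas infers the gzip compression from the file extension; unpickling the geometry DataFrame requires ASE to be installed):

      import pandas as pd

      coords = pd.read_pickle("dft_coords.pkl.gz")     # file name per the description ("dft coords.pkl.gz")
      energies = pd.read_pickle("free_energy.pkl.gz")  # gas-phase, solvation and solution free energies
      print(energies.head())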

  20. Purchase Order Data

    • data.ca.gov
    docx
    Updated Oct 23, 2019
    + more versions
    Cite
    California Department of General Services (2019). Purchase Order Data [Dataset]. https://data.ca.gov/dataset/purchase-order-data
    Explore at:
    docx (available download formats)
    Dataset updated
    Oct 23, 2019
    Dataset authored and provided by
    California Department of General Services
    Description

    The State Contract and Procurement Registration System (SCPRS) was established in 2003 as a centralized database of information on State contracts and purchases over $5,000. eSCPRS represents the data captured in the State's eProcurement (eP) system, Bidsync, as of March 16, 2009. The data provided is an extract from that system for fiscal years 2012-2013, 2013-2014, and 2014-2015.

    Data Limitations:
    Some purchase orders have multiple UNSPSC numbers; however, only the first was used to identify the purchase order. Multiple UNSPSC numbers were included to provide additional data for a DGS special event; however, this affects the formatting of the file. The source system, Bidsync, is being deprecated and these issues will be resolved in the future as state systems transition to Fi$cal.

    Data Collection Methodology:

    The data collection process starts with a data file from eSCPRS that is scrubbed and standardized prior to being uploaded into a SQL Server database. There are four primary tables. The Supplier, Department and United Nations Standard Products and Services Code (UNSPSC) tables are reference tables. The Supplier and Department tables are updated and mapped to the appropriate numbering schema and naming conventions. The UNSPSC table is used to categorize line item information and requires no further manipulation. The Purchase Order table contains raw data that requires conversion to the correct data format and mapping to the corresponding data fields. A stacking method is applied to the table to eliminate blanks where needed. Extraneous characters are removed from fields. The four tables are joined together and queries are executed to update the final Purchase Order Dataset table. Once the scrubbing and standardization process is complete the data is then uploaded into the SQL Server database.
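    An illustrative sketch of the join step described above, done in pandas rather than SQL Server; the file names and key columns are assumptions:

      import pandas as pd

      po = pd.read_csv("purchase_order.csv")       # raw purchase order table
      suppliers = pd.read_csv("supplier.csv")      # reference table
      departments = pd.read_csv("department.csv")  # reference table
      unspsc = pd.read_csv("unspsc.csv")           # reference table

      final = (
          po.merge(suppliers, on="supplier_id", how="left")
            .merge(departments, on="department_id", how="left")
            .merge(unspsc, on="unspsc_code", how="left")
      )
      print(final.shape)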

    Secondary/Related Resources:
