100+ datasets found
  1. GitTables 1M - CSV files

    • zenodo.org
    zip
    Updated Jun 6, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
    Available download formats: zip
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    License
    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains >800K CSV files behind the GitTables 1M corpus.

    For more information about the GitTables corpus, visit:

    - our website for GitTables, or

    - the main GitTables download page on Zenodo.

  2. Adventure Works 2022 CSVs

    • kaggle.com
    zip
    Updated Nov 2, 2022
    Cite
    Algorismus (2022). Adventure Works 2022 CSVs [Dataset]. https://www.kaggle.com/datasets/algorismus/adventure-works-in-excel-tables
    Available download formats: zip (567646 bytes)
    Dataset updated
    Nov 2, 2022
    Authors
    Algorismus
    License
    GNU Lesser General Public License v3.0 (http://www.gnu.org/licenses/lgpl-3.0.html)

    Description

    Adventure Works 2022 dataset

    How was this dataset created?

    On the official website, the dataset is available through a SQL Server (localhost) instance and as CSVs to be used via Power BI Desktop running on a Virtual Lab (virtual machine). The first two steps of importing the data were executed in the virtual lab, and the resulting Power BI tables were then copied into CSVs. Records were added up to the year 2022 as required.

    How might this dataset help you?

    This dataset is helpful if you want to work offline with Adventure Works data in Power BI Desktop in order to follow the lab instructions in the training material on the official website. It is also useful if you want to work through the Power BI Desktop Sales Analysis example from Microsoft's PL-300 learning path.

    How do you use this dataset?

    Download the CSV file(s) and import them into Power BI Desktop as tables. The CSVs are named after the tables created in the first two steps of importing data, as described in the PL-300 Microsoft Power BI Data Analyst exam lab.

  3. The Canada Trademarks Dataset

    • zenodo.org
    pdf, zip
    Updated Jul 19, 2024
    Cite
    Jeremy Sheff (2024). The Canada Trademarks Dataset [Dataset]. http://doi.org/10.5281/zenodo.4999655
    Available download formats: zip, pdf
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jeremy Sheff
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The Canada Trademarks Dataset

    18 Journal of Empirical Legal Studies 908 (2021), prepublication draft available at https://papers.ssrn.com/abstract=3782655, published version available at https://onlinelibrary.wiley.com/share/author/CHG3HC6GTFMMRU8UJFRR?target=10.1111/jels.12303

    Dataset Selection and Arrangement (c) 2021 Jeremy Sheff

    Python and Stata Scripts (c) 2021 Jeremy Sheff

    Contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office.

    This individual-application-level dataset includes records of all applications for registered trademarks in Canada since approximately 1980, and of many preserved applications and registrations dating back to the beginning of Canada’s trademark registry in 1865, totaling over 1.6 million application records. It includes comprehensive bibliographic and lifecycle data; trademark characteristics; goods and services claims; identification of applicants, attorneys, and other interested parties (including address data); detailed prosecution history event data; and data on application, registration, and use claims in countries other than Canada. The dataset has been constructed from public records made available by the Canadian Intellectual Property Office. Both the dataset and the code used to build and analyze it are presented for public use on open-access terms.

    Scripts are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/. Data files are licensed for reuse subject to the Creative Commons Attribution License 4.0 (CC-BY-4.0), https://creativecommons.org/licenses/by/4.0/, and also subject to additional conditions imposed by the Canadian Intellectual Property Office (CIPO) as described below.

    Terms of Use:

    As per the terms of use of CIPO's government data, all users are required to include the above-quoted attribution to CIPO in any reproductions of this dataset. They are further required to cease using any record within the datasets that has been modified by CIPO and for which CIPO has issued a notice on its website in accordance with its Terms and Conditions, and to use the datasets in compliance with applicable laws. These requirements are in addition to the terms of the CC-BY-4.0 license, which require attribution to the author (among other terms). For further information on CIPO’s terms and conditions, see https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html. For further information on the CC-BY-4.0 license, see https://creativecommons.org/licenses/by/4.0/.

    The following attribution statement, if included by users of this dataset, is satisfactory to the author, but the author makes no representations as to whether it may be satisfactory to CIPO:

    The Canada Trademarks Dataset is (c) 2021 by Jeremy Sheff and licensed under a CC-BY-4.0 license, subject to additional terms imposed by the Canadian Intellectual Property Office. It contains data licensed by Her Majesty the Queen in right of Canada, as represented by the Minister of Industry, the minister responsible for the administration of the Canadian Intellectual Property Office. For further information, see https://creativecommons.org/licenses/by/4.0/ and https://www.ic.gc.ca/eic/site/cipointernet-internetopic.nsf/eng/wr01935.html.

    Details of Repository Contents:

    This repository includes a number of .zip archives which expand into folders containing either scripts for construction and analysis of the dataset or data files comprising the dataset itself. These folders are as follows:

    • /csv: contains the .csv versions of the data files
    • /do: contains Stata do-files used to convert the .csv files to .dta format and perform the statistical analyses set forth in the paper reporting this dataset
    • /dta: contains the .dta versions of the data files
    • /py: contains the python scripts used to download CIPO’s historical trademarks data via SFTP and generate the .csv data files

    If users wish to construct rather than download the datafiles, the first script that they should run is /py/sftp_secure.py. This script will prompt the user to enter their IP Horizons SFTP credentials; these can be obtained by registering with CIPO at https://ised-isde.survey-sondage.ca/f/s.aspx?s=59f3b3a4-2fb5-49a4-b064-645a5e3a752d&lang=EN&ds=SFTP. The script will also prompt the user to identify a target directory for the data downloads. Because the data archives are quite large, users are advised to create a target directory in advance and ensure they have at least 70GB of available storage on the media in which the directory is located.

    The sftp_secure.py script will generate a new subfolder in the user’s target directory called /XML_raw. Users should note the full path of this directory, which they will be prompted to provide when running the remaining python scripts. Each of the remaining scripts, the filenames of which begin with “iterparse”, corresponds to one of the data files in the dataset, as indicated in the script’s filename. After running one of these scripts, the user’s target directory should include a /csv subdirectory containing the data file corresponding to the script; after running all the iterparse scripts the user’s /csv directory should be identical to the /csv directory in this repository. Users are invited to modify these scripts as they see fit, subject to the terms of the licenses set forth above.
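    For readers unfamiliar with the streaming approach these scripts take, the following is a minimal, hypothetical sketch of the iterparse pattern using Python's standard xml.etree.ElementTree: it is not the repository's actual code, and the element tags, field names, and file paths are placeholders rather than CIPO's real schema.

    import csv
    import xml.etree.ElementTree as ET

    def xml_to_csv(xml_path: str, csv_path: str) -> None:
        """Stream a large XML file and write selected fields to CSV."""
        with open(csv_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(["application_number", "status"])  # placeholder columns
            # iterparse yields elements as they are parsed, keeping memory use flat
            for event, elem in ET.iterparse(xml_path, events=("end",)):
                if elem.tag.endswith("Trademark"):  # placeholder record element
                    writer.writerow([elem.findtext("ApplicationNumber"),
                                     elem.findtext("Status")])
                    elem.clear()  # free the processed subtree

    xml_to_csv("XML_raw/sample.xml", "csv/sample.csv")  # placeholder paths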

    With respect to the Stata do-files, only one of them is relevant to construction of the dataset itself. This is /do/CA_TM_csv_cleanup.do, which converts the .csv versions of the data files to .dta format, and uses Stata's labeling functionality to reduce the size of the resulting files while preserving information. The other do-files generate the analyses and graphics presented in the paper describing the dataset (Jeremy N. Sheff, The Canada Trademarks Dataset, 18 J. Empirical Leg. Studies (forthcoming 2021), available at https://papers.ssrn.com/abstract=3782655). These do-files are also licensed for reuse subject to the terms of the CC-BY-4.0 license, and users are invited to adapt the scripts to their needs.

    The python and Stata scripts included in this repository are separately maintained and updated on Github at https://github.com/jnsheff/CanadaTM.

    This repository also includes a copy of the current version of CIPO's data dictionary for its historical XML trademarks archive as of the date of construction of this dataset.

  4. UK House Price Index: data downloads November 2023

    • gov.uk
    Updated Jan 17, 2024
    Cite
    HM Land Registry (2024). UK House Price Index: data downloads November 2023 [Dataset]. https://www.gov.uk/government/statistical-data-sets/uk-house-price-index-data-downloads-november-2023
    Dataset updated
    Jan 17, 2024
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    HM Land Registry
    Area covered
    United Kingdom
    Description

    The UK House Price Index is a National Statistic.

    Create your report

    Download the full UK House Price Index data below, or use our tool (https://landregistry.data.gov.uk/app/ukhpi?utm_medium=GOV.UK&utm_source=datadownload&utm_campaign=tool&utm_term=9.30_17_01_24) to create your own bespoke reports.

    Download the data

    Datasets are available as CSV files. Find out about republishing and making use of the data.

    Google Chrome is blocking downloads of our UK HPI data files (Chrome 88 onwards). Please use another internet browser while we resolve this issue. We apologise for any inconvenience caused.

    Full file

    This file includes a derived back series for the new UK HPI. Under the UK HPI, data is available from 1995 for England and Wales, 2004 for Scotland and 2005 for Northern Ireland. A longer back series has been derived by using the historic path of the Office for National Statistics HPI to construct a series back to 1968.

    Download the full UK HPI background file:

    Individual attributes files

    If you are interested in a specific attribute, we have separated them into these CSV files:

  5. Coursera Courses Data

    • kaggle.com
    zip
    Updated Jan 26, 2020
    Cite
    Mihir (2020). Coursera Courses Data [Dataset]. https://www.kaggle.com/datasets/mihirs16/coursera-course-data
    Available download formats: zip (282010 bytes)
    Dataset updated
    Jan 26, 2020
    Authors
    Mihir
    Description

    This started as a project that required course data from Coursera. Due to the lack of such a dataset on common forums, I decided to web-scrape coursera.org and coursera.org/directory for a list of all of Coursera's courses.

    The first file contains a list of course names and URLs. The second CSV file contains details of each course.

    The script to generate/webscrape this dataset directly from coursera can be found at https://github.com/mihirs16/Coursera-Web-Scraper.

    All data scraped is solely owned by Coursera.org and is meant to be used only for educational and experimental purposes.

  6. East Atlantic SWAN Wave Model Significant Wave

    • kaggle.com
    zip
    Updated Jan 19, 2023
    Cite
    The Devastator (2023). East Atlantic SWAN Wave Model Significant Wave [Dataset]. https://www.kaggle.com/datasets/thedevastator/east-atlantic-swan-wave-model-significant-wave-h/data
    Available download formats: zip (6934967 bytes)
    Dataset updated
    Jan 19, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    East Atlantic SWAN Wave Model Significant Wave Height

    6 Day Forecast Using NCEP GFS Wind Forcing

    By data.gov.ie [source]

    About this dataset

    This dataset contains data from the East Atlantic SWAN Wave Model, which is a powerful model developed to predict wave parameters in Irish waters. The output features of the model include Significant Wave Height (m), Mean Wave Direction (degreesTrue) and Mean Wave Period (seconds). These predictions are generated with NCEP GFS wind forcing and FNMOC Wave Watch 3 data as boundaries for the wave generation.

    The accuracy of this model is important for safety-critical applications as well as for research into changes in tides, currents, and sea levels. Users are provided with up-to-date predictions for the previous 30 days and 6 days into the future, with download service options that allow selection by date/time, a single parameter, and output file type.

    Data providers released this dataset under a Creative Commons Attribution 4.0 license on 2017-09-14. It can be used free of charge within certain restrictions set out by its respective author or publisher.


    How to use the dataset

    Introduction:

    Step 1: Acquire the Dataset:
    The first step is getting access to the dataset, which is free of cost. The original source of this data is http://wwave2.marinecstl.org/archive/index?cat=model_height&xsl=download-csv-1. You can also get your hands on this data by downloading it as a CSV file from Kaggle's website (https://www.kaggle.com/marinecstl/east-atlantic-swan-wave-model). This download should contain seven columns of various parameters; time, latitude, longitude, and significant wave height are the most important ones to be familiar with before using this dataset effectively in any project.

    Step 2: Understand Data Columns & Parameters:
    Now that you have downloaded the data, it is time to understand what each column represents and how the columns relate to each other when comparing datasets from two different locations within one country or across two countries. Time represents the timestamp of each observation, taken at the exact location specified by the latitude and longitude parameters, ranging between 0° and +90° (~85 degrees), where higher values indicate locations closer to the North Pole and lower values indicate locations closer to the South Pole. Significant wave height, on the other hand, represents the displacement of the ocean surface due to measurable short-period variations caused by tides or waves, i.e. by weather differences such as wind forcing or, in more extreme conditions, oceanic storms.

    Step 3: Understand Data Limitations & Apply Exclusion Criteria:
    Keep in mind that, since the model runs every day across various geographical regions, some inaccuracy in the predicted values for any given timeslot is inevitable. It is therefore essential that users apply suitable criteria during the analysis phase, taking into consideration natural limitations such as current weather conditions and water depth when working with readings for particular timestamps, whether the information is obtained via the CSV file or via API services. Also remember that these predictions may not be used for safety purposes.

    Research Ideas

    • Visualizing wave heights in the East Atlantic area over time to map oceanic currents.
    • Finding areas of high-wave activity: using this data, researchers can identify unique areas that experience particularly severe waves, which could be essential to know for protecting maritime vessels and informing navigation strategies.
    • Predicting future wave behavior: by analyzing current and past trends in SWAN Wave Model data, scientists can predict how significant wave heights will change over future timescales in the studied area

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: download-csv-1.csv | Column name | Descrip...

  7. Full oral and gene database (csv format)

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    application/gzip
    Updated May 22, 2019
    Cite
    Braden Tierney (2019). Full oral and gene database (csv format) [Dataset]. http://doi.org/10.6084/m9.figshare.8001362.v1
    Available download formats: application/gzip
    Dataset updated
    May 22, 2019
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Braden Tierney
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This is our complete database in CSV format (with gene names, IDs, annotations, lengths, cluster sizes, and taxonomic classifications) that can be queried on our website. The difference is that it does not have the sequences – those can be downloaded in other files on figshare. This file, as well as those, can be parsed and linked by the gene identifier. We recommend downloading this database and parsing it yourself if you attempt to run a query that is too large for our servers to handle.
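    Since the annotation table and the sequence files are linked by the gene identifier, a minimal pandas sketch of that join might look as follows; the filenames and the identifier column name are assumptions, so check them against the actual downloads.

    import pandas as pd

    # Assumed filenames and join column ("gene_id"); check the real headers.
    annotations = pd.read_csv("oral_gene_database.csv.gz", compression="gzip")
    sequences = pd.read_csv("oral_gene_sequences.csv.gz", compression="gzip")

    # Link the annotation table to the separately downloaded sequences
    # on the shared gene identifier column.
    merged = annotations.merge(sequences, on="gene_id", how="left")
    print(merged.head())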

  8. Call Center Data (Current)

    • data.milwaukee.gov
    csv
    Updated Dec 2, 2025
    Cite
    Information Technology and Management Division (2025). Call Center Data (Current) [Dataset]. https://data.milwaukee.gov/dataset/callcenterdatacurrent
    Available download formats: csv (63536529)
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
    Information Technology and Management Division
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Update Frequency: "An automated scan for dataset updates occurs every day at 3:45 a.m."

    For up-to-date information on service requests, please visit https://city.milwaukee.gov/ucc.

    A log of the Unified Call Center's service requests.

    From potholes, abandoned vehicles, high weeds on vacant lots, and curbside trash to faulty traffic signals, the City of Milwaukee's Unified Call Center (UCC) makes it easy to submit service requests to solve problems. The UCC also allows you to track your service requests. Each time you complete a service request online, you will be assigned a tracking number that you can use to see when a City of Milwaukee representative expects to investigate or take care of your request.

    To download XML and JSON files, click the CSV option below and click the down arrow next to the Download button in the upper right on its page.

  9. National Teacher and Principal Survey: Tables Library Data

    • datalumos.org
    delimited
    Updated Jun 27, 2025
    Cite
    United States Department of Education (2025). National Teacher and Principal Survey: Tables Library Data [Dataset]. http://doi.org/10.3886/E234604V1
    Available download formats: delimited
    Dataset updated
    Jun 27, 2025
    Dataset authored and provided by
    United States Department of Education (https://ed.gov/)
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    2015 - 2021
    Area covered
    United States
    Description

    About NTPS

    The National Teacher and Principal Survey (NTPS) is a system of related questionnaires that provide descriptive data on the context of elementary and secondary education while also giving policymakers a variety of statistics on the condition of education in the United States. The NTPS is a redesign of the Schools and Staffing Survey (SASS), which the National Center for Education Statistics (NCES) conducted from 1987 to 2011. The design of the NTPS is a product of three key goals coming out of the SASS program: flexibility, timeliness, and integration with other Department of Education collections. The NTPS collects data on core topics including teacher and principal preparation, classes taught, school characteristics, and demographics of the teacher and principal labor force every two to three years. In addition, each administration of NTPS contains rotating modules on important education topics such as professional development, working conditions, and evaluation. This approach allows policymakers and researchers to assess trends on both stable and dynamic topics.

    Data Organization

    Each table has an associated Excel and Excel SE file, which are grouped together in a folder in the dataset (one folder per table). The folders are named based on the Excel file names, as they were when downloaded from the National Center for Education Statistics (NCES) website. In the NTPS folder, there is a catalog CSV that provides a crosswalk between the folder names and the table titles. The documentation folder contains (1) codebooks for NTPS generated in NCES datalabs, (2) questionnaires for NTPS downloaded from the study website, and (3) reports related to NTPS found in the NCES resource library.

  10. Network traffic for machine learning classification

    • data.mendeley.com
    Updated Feb 12, 2020
    Cite
    Víctor Labayen Guembe (2020). Network traffic for machine learning classification [Dataset]. http://doi.org/10.17632/5pmnkshffm.1
    Dataset updated
    Feb 12, 2020
    Authors
    Víctor Labayen Guembe
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset is a set of network traffic traces in pcap/csv format captured from a single user. The traffic is classified into 5 different activities (Video, Bulk, Idle, Web, and Interactive) and the label is shown in the filename. There is also a file (mapping.csv) with the mapping of the host's IP address, the csv/pcap filename, and the activity label.

    Activities:

    • Interactive: applications that perform real-time interactions in order to provide a suitable user experience, such as editing a file in Google Docs and remote CLI sessions over SSH.
    • Bulk data transfer: applications that perform a transfer of large data volume files over the network. Some examples are SCP/FTP applications and direct downloads of large files from web servers like Mediafire, Dropbox or the university repository, among others.
    • Web browsing: contains all the traffic generated while searching and consuming different web pages. Examples of those pages are several blogs and news sites and the university's Moodle.
    • Video playback: contains traffic from applications that consume video in streaming or pseudo-streaming. The best-known servers used are Twitch and YouTube, but the university online classroom has also been used.
    • Idle behaviour: consists of the background traffic generated by the user's computer when the user is idle. This traffic has been captured with every application closed and with some opened pages like Google Docs, YouTube and several web pages, but always without user interaction.

    The capture is performed in a network probe, attached to the router that forwards the user's network traffic, using a SPAN port. The traffic is stored in pcap format with the full packet payload. In the csv files, every non-TCP/UDP packet is filtered out, as well as every packet with no payload. The fields in the csv files are the following (one line per packet): timestamp, protocol, payload size, source and destination IP address, and source and destination UDP/TCP port. The fields are also included as a header in every csv file.
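    A minimal pandas sketch for reading one of the per-activity csv traces and computing simple per-trace statistics; the filename and the exact header names below are assumptions, so adjust them to the headers actually present in the files.

    import pandas as pd

    # Assumed filename and header names; the real headers are included in each csv.
    df = pd.read_csv("video_trace_01.csv")
    df.columns = ["timestamp", "protocol", "payload_size",
                  "src_ip", "dst_ip", "src_port", "dst_port"]

    # Simple per-trace features that could feed a traffic classifier.
    duration = df["timestamp"].max() - df["timestamp"].min()
    total_bytes = df["payload_size"].sum()
    print(f"packets={len(df)} duration={duration} payload_bytes={total_bytes}")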

    The amount of data is stated as follows:

    • Bulk: 19 traces, 3599 s of total duration, 8704 MBytes of pcap files
    • Video: 23 traces, 4496 s, 1405 MBytes
    • Web: 23 traces, 4203 s, 148 MBytes
    • Interactive: 42 traces, 8934 s, 30.5 MBytes
    • Idle: 52 traces, 6341 s, 0.69 MBytes

  11. Professional Services Contracts

    • catalog.data.gov
    • s.cnmilf.com
    Updated Mar 31, 2025
    Cite
    City of Philadelphia (2025). Professional Services Contracts [Dataset]. https://catalog.data.gov/dataset/professional-services-contracts
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    City of Philadelphia
    Description

    The entire dataset for Professional Services Contracts by fiscal quarter is available for download in a Microsoft Excel® compatible CSV format below. To download: click on the dataset URL, which will open in a new browser tab as a text file, then right-click on the text and choose 'save as' CSV to download it. The City of Philadelphia, through a joint effort among the Finance Department, Chief Integrity Officer and the Office of Innovation and Technology (OIT), launched an online, open data website for City contracts. The site provides data on the City’s professional services contracts in a searchable format and includes a breakdown of contract dollars by vendor, department and service type. It also features a “frequently asked questions” section to help users understand the available data: http://cityofphiladelphia.github.io/contracts/.

  12. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (Repository creator) (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Available download formats: csv
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Julie R. Campos Arias (Repository creator)
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. The purpose of this repository is to store the datasets found that were used in some of the studies that served as research material for this Master's thesis. Also, the datasets used in the experimental part of this work are included.

    Below are the datasets specified, along with the details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2026 tweets. The file consists of 3 columns: id, polarity, and tweet. The three columns denote the unique id, polarity index of the text and the tweet text respectively.

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv

    ----------- Amazon Sales Dataset ----------------

    This dataset contains the ratings and reviews of 1K+ Amazon products, as per their details listed on the official website of Amazon. The data was scraped in the month of January 2023 from the official website of Amazon.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating inference dataset is a sentiment classification dataset, containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews consist of 21 words. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, thus the data should be shuffled before usage.

    This data is collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a txt file and converted into a csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
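    Because the positive and negative samples are stored in two contiguous blocks, a quick shuffle before any train/test split avoids a degenerate ordering. A minimal pandas sketch, assuming the two columns described above (reviews and labels):

    import pandas as pd

    df = pd.read_csv("data_rt.csv")  # columns: reviews, labels (per the description)

    # Shuffle the rows so positive and negative samples are interleaved
    # before any train/test split.
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    print(df["labels"].head(10).tolist())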

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data of Gen3EcoDot (Alexa), scraped entirely from amazon.in.
    Stemmed and lemmatized using nltk.
    Sentiment labels are generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, for learning how to train machines for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product and division (manually added - categorical label generated using ReviewStar score)

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited - used for my research) : AllProductReviews2.csv (contains 9930 reviews)

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review - unix time), reviewTime (time of the review, raw), and division (manually added - categorical label generated using overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited - used for my research) : Musical_instruments_reviews2.csv (contains 7137 reviews)

  13. iNaturalistScraper

    • kaggle.com
    zip
    Updated Jun 1, 2023
    Cite
    Gregory Duke (2023). iNaturalistScraper [Dataset]. https://www.kaggle.com/datasets/gregoryduke/inaturalistscraper
    Available download formats: zip (10420101 bytes)
    Dataset updated
    Jun 1, 2023
    Authors
    Gregory Duke
    Description

    https://github.com/gregorywduke/iNaturalistScraper/releases/tag/v2.0.0

    Desktop application that downloads the list of image URLs contained in an "Export Observations" CSV file from iNaturalist. Written in Python, this scraper uses PySimpleGUI and comes packaged as an executable.

    To Use, Follow these Steps:

    Visit the "Export Observations" page on iNaturalist.org, and select your desired parameters. Before exporting your observations, make sure you unselect ALL options in (Section 3, Choose Columns) EXCEPT "image_url". That should be the only option enabled. Export your observations and download the .csv provided. Check the .csv you downloaded. Make sure there is no text besides the urls present. If there is a first line that is NOT a URL, delete it. Download iNaturalistScraper v1.0.0 from the "Releases" section, and run the .exe. In the application, select the .csv file you downloaded. Select the folder you want your images to be deposited. Click "Proceed" and let the program download your images. Upwards of 500 images may take 5-10 minutes. More images will take longer, though the process is CPU-dependant. Notes: (PLEASE READ)

    Ensure you delete the first line, if it is not a URL. Ensure you ONLY select the "image_url" option in Export Observations, and no other option.
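    For users who prefer a script to the packaged .exe, the following is a minimal, hypothetical Python sketch of the same download step, assuming a CSV that contains only image URLs (one per line) as described above; it is not the application's actual code, and the filenames are placeholders.

    import csv
    import os
    import urllib.request

    csv_path = "observations.csv"  # placeholder: the CSV exported from iNaturalist
    out_dir = "images"             # placeholder: destination folder
    os.makedirs(out_dir, exist_ok=True)

    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if not row or not row[0].startswith("http"):
                continue  # skip a stray header line or blanks, per the notes above
            url = row[0]
            name = os.path.basename(url.split("?")[0]) or f"image_{i}.jpg"
            urllib.request.urlretrieve(url, os.path.join(out_dir, f"{i:06d}_{name}"))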

  14. Repository Analytics and Metrics Portal (RAMP) 2020 data

    • data.niaid.nih.gov
    • search.dataone.org
    zip
    Updated Jul 23, 2021
    Cite
    Jonathan Wheeler; Kenning Arlitsch (2021). Repository Analytics and Metrics Portal (RAMP) 2020 data [Dataset]. http://doi.org/10.5061/dryad.dv41ns1z4
    Available download formats: zip
    Dataset updated
    Jul 23, 2021
    Dataset provided by
    Montana State University
    University of New Mexico
    Authors
    Jonathan Wheeler; Kenning Arlitsch
    License

    CC0-1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Version update: The originally uploaded versions of the CSV files in this dataset included an extra column, "Unnamed: 0," which is not RAMP data and was an artifact of the process used to export the data to CSV format. This column has been removed from the revised dataset. The data are otherwise the same as in the first version.

    The Repository Analytics and Metrics Portal (RAMP) is a web service that aggregates use and performance data of institutional repositories. The data are a subset of data from RAMP, the Repository Analytics and Metrics Portal (http://rampanalytics.org), consisting of data from all participating repositories for the calendar year 2020. For a description of the data collection, processing, and output methods, please see the "methods" section below.

    Methods

    Data Collection

    RAMP data are downloaded for participating IR from Google Search Console (GSC) via the Search Console API. The data consist of aggregated information about IR pages which appeared in search result pages (SERP) within Google properties (including web search and Google Scholar).

    Data are downloaded in two sets per participating IR. The first set includes page level statistics about URLs pointing to IR pages and content files. The following fields are downloaded for each URL, with one row per URL:

    url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    

    Following data processing described below, on ingest into RAMP an additional field, citableContent, is added to the page level data.

    The second set includes similar information, but instead of being aggregated at the page level, the data are grouped based on the country from which the user submitted the corresponding search, and the type of device used. The following fields are downloaded for each combination of country and device, with one row per country/device combination:

    country: The country from which the corresponding search originated.
    device: The device used for the search.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    

    Note that no personally identifiable information is downloaded by RAMP. Google does not make such information available.

    More information about click-through rates, impressions, and position is available from Google's Search Console API documentation: https://developers.google.com/webmaster-tools/search-console-api-original/v3/searchanalytics/query and https://support.google.com/webmasters/answer/7042828?hl=en

    Data Processing

    Upon download from GSC, the page level data described above are processed to identify URLs that point to citable content. Citable content is defined within RAMP as any URL which points to any type of non-HTML content file (PDF, CSV, etc.). As part of the daily download of page level statistics from Google Search Console (GSC), URLs are analyzed to determine whether they point to HTML pages or actual content files. URLs that point to content files are flagged as "citable content." In addition to the fields downloaded from GSC described above, following this brief analysis one more field, citableContent, is added to the page level data which records whether each page/URL in the GSC data points to citable content. Possible values for the citableContent field are "Yes" and "No."

    The data aggregated by the search country of origin and device type do not include URLs. No additional processing is done on these data. Harvested data are passed directly into Elasticsearch.

    Processed data are then saved in a series of Elasticsearch indices. Currently, RAMP stores data in two indices per participating IR. One index includes the page level data, the second index includes the country of origin and device type data.

    About Citable Content Downloads

    Data visualizations and aggregations in RAMP dashboards present information about citable content downloads, or CCD. As a measure of use of institutional repository content, CCD represent click activity on IR content that may correspond to research use.

    CCD information is summary data calculated on the fly within the RAMP web application. As noted above, data provided by GSC include whether and how many times a URL was clicked by users. Within RAMP, a "click" is counted as a potential download, so a CCD is calculated as the sum of clicks on pages/URLs that are determined to point to citable content (as defined above).

    For any specified date range, the steps to calculate CCD are:

    Filter data to only include rows where "citableContent" is set to "Yes."
    Sum the value of the "clicks" field on these rows.
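    A minimal pandas sketch of that calculation against one of the published page-clicks CSV files (the filename follows the naming pattern described below; the date-range values are placeholders):

    import pandas as pd

    # Page-level click data for one month of RAMP data.
    df = pd.read_csv("2020-01_RAMP_all_page-clicks.csv")

    # Citable Content Downloads for a chosen date range:
    # keep rows flagged as citable content, then sum their clicks.
    mask = (df["citableContent"] == "Yes") & df["date"].between("2020-01-01", "2020-01-31")
    ccd = df.loc[mask, "clicks"].sum()
    print(f"CCD for January 2020: {ccd}")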
    

    Output to CSV

    Published RAMP data are exported from the production Elasticsearch instance and converted to CSV format. The CSV data consist of one "row" for each page or URL from a specific IR which appeared in search result pages (SERP) within Google properties as described above. Also as noted above, daily data are downloaded for each IR in two sets which cannot be combined. One dataset includes the URLs of items that appear in SERP. The second dataset is aggregated by combination of the country from which a search was conducted and the device used.

    As a result, two CSV datasets are provided for each month of published data:

    page-clicks:

    The data in these CSV files correspond to the page-level data, and include the following fields:

    url: This is returned as a 'page' by the GSC API, and is the URL of the page which was included in an SERP for a Google property.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    citableContent: Whether or not the URL points to a content file (ending with pdf, csv, etc.) rather than HTML wrapper pages. Possible values are Yes or No.
    index: The Elasticsearch index corresponding to page click data for a single IR.
    repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
    

    Filenames for files containing these data end with “page-clicks”. For example, the file named 2020-01_RAMP_all_page-clicks.csv contains page level click data for all RAMP participating IR for the month of January, 2020.

    country-device-info:

    The data in these CSV files correspond to the data aggregated by country from which a search was conducted and the device used. These include the following fields:

    country: The country from which the corresponding search originated.
    device: The device used for the search.
    impressions: The number of times the URL appears within the SERP.
    clicks: The number of clicks on a URL which took users to a page outside of the SERP.
    clickThrough: Calculated as the number of clicks divided by the number of impressions.
    position: The position of the URL within the SERP.
    date: The date of the search.
    index: The Elasticsearch index corresponding to country and device access information data for a single IR.
    repository_id: This is a human readable alias for the index and identifies the participating repository corresponding to each row. As RAMP has undergone platform and version migrations over time, index names as defined for the previous field have not remained consistent. That is, a single participating repository may have multiple corresponding Elasticsearch index names over time. The repository_id is a canonical identifier that has been added to the data to provide an identifier that can be used to reference a single participating repository across all datasets. Filtering and aggregation for individual repositories or groups of repositories should be done using this field.
    

    Filenames for files containing these data end with “country-device-info”. For example, the file named 2020-01_RAMP_all_country-device-info.csv contains country and device data for all participating IR for the month of January, 2020.

    References

    Google, Inc. (2021). Search Console APIs. Retrieved from https://developers.google.com/webmaster-tools/search-console-api-original.

  15. ECB speeches etc.since 1997 (updated weekly)

    • kaggle.com
    zip
    Updated Nov 3, 2025
    Cite
    Roberto Lofaro (2025). ECB speeches etc.since 1997 (updated weekly) [Dataset]. https://www.kaggle.com/robertolofaro/ecb-speeches-1997-to-20191122-frequencies-dm
    Available download formats: zip (1814115 bytes)
    Dataset updated
    Nov 3, 2025
    Authors
    Roberto Lofaro
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Context

    I am preparing a book on change to add to my publications (https://robertolofaro.com/published), and I was looking into speeches delivered by the ECB, but the search on the website wasn't what I needed.

    I started posting online updates in late 2019; currently, the online web app that allows searching via a tag cloud is updated on a weekly basis, each Monday evening.

    Search by tag: https://robertolofaro.com/ECBSpeech (links also to dataset on kaggle)

    From 2024-03-25, the dataset also contains AI-based audio transcripts of any ECB item collected, whenever the audio file is accessible.

    source: ECB website

    Content

    In late October/early November 2019, the ECB posted on LinkedIn a link to a CSV dataset extending from 1997 up to 2019-10-25 with all the speeches delivered, as per their website.

    The dataset was "flat", and I needed both to search quickly for associations of people to concepts, and to see the relevant speech directly in a human-readable format (as some speeches had pictures, tables, attachments, etc.).

    So, I recycled a concept that I had developed for other purposes and used in an experimental "search by tag cloud on structured content" on https://robertolofaro.com/BFM2013tag

    The result is https://robertolofaro.com/ECBSpeech, which contains information from the CSV file (see website for the link to the source), with the additional information as shown within the "About this file" section.

    The concept behind sharing this dataset on Kaggle, and releasing on my public website the application I use to navigate the data (I have a local XAMPP where I use this and other applications to support the research side of my past business and current publication activities), is shared on http://robertolofaro.com/datademocracy

    This tag cloud contains the most common words 1997-2020 across the dataset


    Acknowledgements

    Thanks to the ECB for saving my time (I was going to copy-and-paste or "scrape" with R from the speeches posted on their website) by releasing the dataset https://www.ecb.europa.eu/press/key/html/downloads.en.html

    Inspiration

    In my cultural and organizational change activities, and within data collection, collation, and processing to support management decision-making (including my own) since the 1980s, I always saw that the more data we collect, the less time there is to retrieve it when needed.

    I usually worked across multiple environments, industries, and cultures, and "collecting" was never good enough if I could not then "retrieve by association".

    In storytelling it is fine just to roughly remember "cameos from the past", but in data storytelling (or when trying to implement a new organization, process, or even just software or data analysis) being able to pinpoint a source that might have been there before is equally important.

    So, I am simply exploring different ways to cross-reference information from different domains, as I am quite confident that within all the open data (including the ECB speeches) there are the results of what niche experts saw on various items.

    Therefore, why should time and resources be wasted on redoing what was already done by others, when you can start from their endpoint, before adapting first, and adopting then (if relevant)?

    Updates

    2020-01-25: added a GitHub repository for versioning and release of additional material, as uploading the new export_datamart.csv wasn't possible; it is now available at: https://github.com/robertolofaro/ecbspeech

    Changes in the dataset:
    1. fixed language codes
    2. added speeches published on the ECB website in January 2020 (up to 2020-01-25 09:00 CET)
    3. added all the items listed under the "interview" section of the ECB website

    current content: 340 interviews, 2374 speeches

    2020-01-29: the same file on GitHub released on 2020-01-25, containing both speeches and interviews, with an additional column to differentiate between the two, is now available on Kaggle

    current content: 340 interviews, 2374 speeches

    2020-02-26: monthly update, with items released on the ECB website up to 2020-02-22

    current content: 2731 items, 345 interviews, 2386 speeches

    2020-03-25: monthly update, with items released on the ECB website up to 2020-03-20

    Since March 2020, the dataset also includes press conferences available on the ECB website.

    current content: 2988 records (2392 speeches, 351 interviews, 245 press conferences)

    2020-06-07: update, with items released on the ECB website up to 2020-06-07

    Since June 2020, the dataset also includes press conferences, blog posts, and podcasts available on the ECB website.

    current content: 3030 records (2399 speeches, 369 interviews, 247 press conferences, 8 blog posts, 7 ECB Podcast). ...

  16. NIH Vaccine Adverse Event Reporting System Data Sets

    • datalumos.org
    Updated Mar 6, 2025
    Cite
    United States Department of Health and Human Services (2025). NIH Vaccine Adverse Event Reporting System Data Sets [Dataset]. http://doi.org/10.3886/E221805V1
    Dataset updated
    Mar 6, 2025
    Dataset authored and provided by
    United States Department of Health and Human Services (http://www.hhs.gov/)
    License
    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    1990 - Jan 2025
    Description

    Includes Vaccine Adverse Event Reporting System data files from 1990 to January 2025. A reproducible Rmd file, website screenshots, and documentation have been included in the Supplementary Information folder.

    From the VAERS Data Sets website: VAERS data CSV and compressed (ZIP) files are available for download in the table below. For information about VAERS data, please view the VAERS Data Use Guide [PDF - 310KB], which contains the following information:

    • Important information about VAERS from the FDA
    • Brief description of VAERS
    • Cautions on interpreting VAERS data
    • Definitions of terms
    • Description of files
    • List of commonly used abbreviations

    Select the desired time interval to download VAERS data. Each data set is available for download as a compressed (ZIP) file or as individual CSV files. Each compressed file contains the three CSV files listed for a specific data set.

    Last updated: February 7, 2025.

  17. Stack Overflow Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    Cite
    Stack Overflow (2019). Stack Overflow Data [Dataset]. https://www.kaggle.com/datasets/stackoverflow/stackoverflow
    Available download formats: zip (0 bytes)
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    Stack Overflow (http://stackoverflow.com/)
    License
    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    Context

    Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers.

    Content

    Updated on a quarterly basis, this BigQuery dataset includes an archive of Stack Overflow content, including posts, votes, tags, and badges. This dataset is updated to mirror the Stack Overflow content on the Internet Archive, and is also available through the Stack Exchange Data Explorer.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    Dataset Source: https://archive.org/download/stackexchange

    https://bigquery.cloud.google.com/dataset/bigquery-public-data:stackoverflow

    https://cloud.google.com/bigquery/public-data/stackoverflow

    Banner photo by Caspar Rubin from Unsplash.

    Inspiration

    What is the percentage of questions that have been answered over the years?

    What is the reputation and badge count of users across different tenures on StackOverflow?

    What are 10 of the “easier” gold badges to earn?

    Which day of the week has most questions answered within an hour?

  18. online review.csv

    • kaggle.com
    zip
    Updated Jun 22, 2024
    Cite
    Farha Kousar (2024). online review.csv [Dataset]. https://www.kaggle.com/datasets/farhakouser/online-review-csv
    Available download formats: zip (1747813 bytes)
    Dataset updated
    Jun 22, 2024
    Authors
    Farha Kousar
    License
    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The /kaggle/input/online-review-csv/online_review.csv file contains customer reviews from Flipkart. It includes the following columns:

    • review_id: Unique identifier for each review.
    • product_id: Unique identifier for each product.
    • user_id: Unique identifier for each user.
    • rating: Star rating (1 to 5) given by the user.
    • title: Summary of the review.
    • review_text: Detailed feedback from the user.
    • review_date: Date the review was submitted.
    • verified_purchase: Indicates if the purchase was verified (true/false).
    • helpful_votes: Number of users who found the review helpful.
    • reviewer_name: Name or alias of the reviewer.

    Uses:

    • Sentiment Analysis: Understand customer sentiments.
    • Product Improvement: Identify areas for product enhancement.
    • Market Research: Analyze customer preferences.
    • Recommendation Systems: Improve recommendation algorithms.

    This dataset is ideal for practicing data analysis and machine learning techniques.
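    As a quick sanity check before modeling, a minimal pandas sketch using the columns documented above (the path is taken from the description; adjust it if you work outside Kaggle):

    import pandas as pd

    df = pd.read_csv("/kaggle/input/online-review-csv/online_review.csv")

    # Star-rating distribution and the share of verified purchases.
    print(df["rating"].value_counts().sort_index())
    print(df["verified_purchase"].value_counts(normalize=True))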

  19. OpenCon Application Data

    • figshare.com
    txt
    Updated Jun 4, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    OpenCon 2015; SPARC; Right to Research Coalition (2023). OpenCon Application Data [Dataset]. http://doi.org/10.6084/m9.figshare.1512496.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figsharehttp://figshare.com/
    Authors
    OpenCon 2015; SPARC; Right to Research Coalition
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OpenCon 2015 Application Open Data

    The purpose of this document is to accompany the public release of data collected from OpenCon 2015 applications.

    Download & Technical Information

    The data can be downloaded in CSV format from GitHub here: https://github.com/RightToResearch/OpenCon-2015-Application-Data

    The file uses UTF8 encoding, comma as field delimiter, quotation marks as text delimiter, and no byte order mark.
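    As a minimal sketch, assuming a local copy of the repository's CSV file (the filename below is a placeholder), the file can be read with exactly those settings using Python's csv module:

      # Sketch: read the OpenCon 2015 application CSV with the settings stated above:
      # UTF-8 encoding, comma field delimiter, double-quote text delimiter, no BOM.
      # The filename is a placeholder for the file in the GitHub repository.
      import csv

      with open("opencon-2015-applications.csv", encoding="utf-8", newline="") as f:
          reader = csv.DictReader(f, delimiter=",", quotechar='"')
          print(reader.fieldnames)   # released column headers
          first = next(reader)       # first application record
          print(first)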

    License and Requests

    This data is released to the public for free and open use under a CC0 1.0 license. We have a couple of requests for anyone who uses the data. First, we’d love it if you would let us know what you are doing with it, and share back anything you develop with the OpenCon community (#opencon / @open_con ). Second, it would also be great if you would include a link to the OpenCon 2015 website (www.opencon2015.org) wherever the data is used. You are not obligated to do any of this, but we’d appreciate it!

    Data Fields

    Unique ID

    This is a unique ID assigned to each applicant. Numbers were assigned using a random number generator.

    Timestamp

    This was the timestamp recorded by Google Forms. Timestamps are in EDT (Eastern U.S. Daylight Time). Note that the application process officially began at 1:00pm EDT on June 1 and ended at 6:00am EDT on June 23. Some applications have timestamps later than this date due to a variety of reasons, including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form.

    Gender

    Mandatory. Choose one from list or fill-in other. Options provided: Male, Female, Other (fill in).

    Country of Nationality

    Mandatory. Choose one option from list.

    Country of Residence

    Mandatory. Choose one option from list.

    What is your primary occupation?

    Mandatory. Choose one from list or fill-in other. Options provided: Undergraduate student; Masters/professional student; PhD candidate; Faculty/teacher; Researcher (non-faculty); Librarian; Publisher; Professional advocate; Civil servant / government employee; Journalist; Doctor / medical professional; Lawyer; Other (fill in).

    Select the option below that best describes your field of study or expertise

    Mandatory. Choose one option from list.

    What is your primary area of interest within OpenCon’s program areas?

    Mandatory. Choose one option from list. Note: for the first approximately 24 hours the options were listed in this order: Open Access, Open Education, Open Data. After that point, we set the form to randomize the order, and noticed an immediate shift in the distribution of responses.

    Are you currently engaged in activities to advance Open Access, Open Education, and/or Open Data?

    Mandatory. Choose one option from list.

    Are you planning to participate in any of the following events this year?

    Optional. Choose all that apply from list. Multiple selections separated by semi-colon.

    Do you have any of the following skills or interests?

    Mandatory. Choose all that apply from list or fill-in other. Multiple selections separated by semi-colon. Options provided: Coding; Website Management / Design; Graphic Design; Video Editing; Community / Grassroots Organizing; Social Media Campaigns; Fundraising; Communications and Media; Blogging; Advocacy and Policy; Event Logistics; Volunteer Management; Research about OpenCon's Issue Areas; Other (fill-in).

    Data Collection & Cleaning

    This data consists of information collected from people who applied to attend OpenCon 2015. In the application form, questions that would be released as Open Data were marked with a caret (^) and applicants were asked to acknowledge before submitting the form that they understood that their responses to these questions would be released as such. The questions we released were selected to avoid any potentially sensitive personal information, and to minimize the chances that any individual applicant can be positively identified.

    Applications were formally collected during a 22 day period beginning on June 1, 2015 at 13:00 EDT and ending on June 23 at 06:00 EDT. Some applications have timestamps later than this date, and this is due to a variety of reasons including exceptions granted for technical difficulties, error corrections (which required re-submitting the form), and applications sent in via email and later entered manually into the form. Applications were collected using a Google Form embedded at http://www.opencon2015.org/attend, and the shortened bit.ly link http://bit.ly/AppsAreOpen was promoted through social media.

    The primary work we did to clean the data focused on identifying and eliminating duplicates. We removed all duplicate applications that had matching e-mail addresses and first and last names. We also identified a handful of other duplicates that used different e-mail addresses but were otherwise identical. In cases where duplicate applications contained any different information, we kept the information from the version with the most recent timestamp. We made a few minor adjustments in the country field for cases where the entry was obviously an error (for example, electing a country listed alphabetically above or below the one indicated elsewhere in the application). We also removed one potentially offensive comment (which did not contain an answer to the question) from the Gender field and replaced it with “Other.”

    About OpenCon

    OpenCon 2015 is the student and early career academic professional conference on Open Access, Open Education, and Open Data and will be held on November 14-16, 2015 in Brussels, Belgium. It is organized by the Right to Research Coalition, SPARC (The Scholarly Publishing and Academic Resources Coalition), and an Organizing Committee of students and early career researchers from around the world. The meeting will convene students and early career academic professionals from around the world and serve as a powerful catalyst for projects led by the next generation to advance OpenCon's three focus areas—Open Access, Open Education, and Open Data. A unique aspect of OpenCon is that attendance at the conference is by application only, and the majority of participants who apply are awarded travel scholarships to attend. This model creates a unique conference environment where the most dedicated and impactful advocates can attend, regardless of where in the world they live or their access to travel funding. The purpose of the application process is to conduct these selections fairly. This year we were overwhelmed by the quantity and quality of applications received, and we hope that by sharing this data, we can better understand the OpenCon community and the state of student and early career participation in the Open Access, Open Education, and Open Data movements.

    Questions

    For inquiries about the OpenCon 2015 Application data, please contact Nicole Allen at nicole@sparc.arl.org.

  20. JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at SNOTEL sites and a Jupyter Notebook to merge/reprocess data

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Feb 11, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Irene Garousi-Nejad; David Tarboton (2022). JavaScript code for retrieval of MODIS Collection 6 NDSI snow cover at SNOTEL sites and a Jupyter Notebook to merge/reprocess data [Dataset]. http://doi.org/10.4211/hs.d287f010b2dd48edb0573415a56d47f8
    Explore at:
    zip(52.2 KB)Available download formats
    Dataset updated
    Feb 11, 2022
    Dataset provided by
    HydroShare
    Authors
    Irene Garousi-Nejad; David Tarboton
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This JavaScript code has been developed to retrieve NDSI_Snow_Cover from MODIS version 6 for SNOTEL sites using the Google Earth Engine platform. To successfully run the code, you should have a Google Earth Engine account. An input file, called NWM_grid_Western_US_polygons_SNOTEL_ID.zip, is required to run the code. This input file includes the 1 km grid cells of the NWM containing SNOTEL sites. You need to upload this input file to the Assets tab in the Google Earth Engine code editor. You also need to import the MOD10A1.006 Terra Snow Cover Daily Global 500m collection into the Google Earth Engine code editor. You may do this by searching for the product name in the search bar of the code editor.

    The JavaScript works for a specified time range. We found that the best period is a month, which is the maximum allowable time range for running the computation for all SNOTEL sites on Google Earth Engine. The script consists of two main loops. The first loop retrieves data from the first day of a month up to day 28 in five periods. The second loop retrieves data from day 28 to the beginning of the next month. The results are shown as graphs on the right-hand side of the Google Earth Engine code editor under the Console tab. To save the results as CSV files, open each time-series by clicking on the button located at each graph's top right corner. From the new web page, you can click on the Download CSV button at the top.

    Here is the link to the script path: https://code.earthengine.google.com/?scriptPath=users%2Figarousi%2Fppr2-modis%3AMODIS-monthly

    Then, run the Jupyter Notebook (merge_downloaded_csv_files.ipynb) to merge the downloaded CSV files, which are stored, for example, in a folder called output/from_GEE, into one single CSV file, merged.csv. The Jupyter Notebook then applies some preprocessing steps, and the final output is NDSI_FSCA_MODIS_C6.csv.
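
    The merge step is simple to reproduce; a minimal sketch of the idea follows (the folder and file names match the example above, but the notebook's preprocessing steps are not reproduced here):

      # Sketch: combine the per-month CSV files downloaded from Google Earth Engine
      # into a single file. This mirrors the idea of merge_downloaded_csv_files.ipynb,
      # not its exact code; folder and file names follow the example above.
      from pathlib import Path
      import pandas as pd

      csv_dir = Path("output/from_GEE")
      frames = [pd.read_csv(p) for p in sorted(csv_dir.glob("*.csv"))]

      merged = pd.concat(frames, ignore_index=True)
      merged.to_csv("merged.csv", index=False)
      print(f"Merged {len(frames)} files into merged.csv ({len(merged)} rows)")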
